Array Processors Enable Flexibility in FFT Designs

Patrick Fuller, picoChip
Dec 29, 2004 (7:00 AM)

Although it has found many uses in signal processing, the Fast Fourier Transform (FFT) has taken on even more importance as a fundamental part of the algorithms used for communications protocols based on orthogonal frequency division multiplexing (OFDM).

In the wireless space, OFDM is used in the newer forms of IEEE 802.11 wireless LAN (WLAN) designs and in the IEEE 802.16-2004 (WiMAX) specification for metropolitan area networking. It has also been proposed as the basis for successors to 3G cellular communications systems (or even as an option for increased data rates for a future version of UMTS, which will cause a strange terminological situation: "the OFDM mode of WCDMA"). In the wired area, OFDM is referred to as discrete multi-tone (DMT) and is the basis for the ADSL standard.

The common strand in all these systems is a demand for high-speed FFT processing. But often time-to-market pressure drives vendors to release products that comply with early versions of a standard, thus locking down an FFT architecture. The problem, however, is that as standard change, FFT architectures can also change. Thus, designers need to implement a flexible FFT architecture in their design in order to account for potential spec changes.

For example, 802.16e has recently shifted from a constant FFT size (256 for OFDM, 2048 for OFDMA) to a scalable physical layer (PHY), with the FFT size shifting for different channel bandwidths with a maximum of either 1024 or 2048 being argued. This demands a solution that can be implemented on programmable silicon.

Many algorithms and architectures exist for implementing the FFT. But, very often, the use of a software-programmable platform may mandate the use of an algorithm that is optimized for software processing even though it may be slower or less power-efficient than a hardware-oriented design. Changing the FFT size for a traditional, block-based implementation also poses significant issues to the overall system performance due to scheduling considerations.

Fortuantely, designers now have an answer to this challenge. By implementing the FFT in a reconfigurable processor, designers can efficiently implement a hardware-based FFT that can be adapted over time to meet changing performance requirements. Let's see how.

Understaning FFTs
The FFT is simply an efficient implementation of the Discrete Fourier Transform (DFT). For an N-point DFT, a direct implementation requires of the order of N² complex multiply-and-add operations. But, as a perfect example of how a clever algorithm can deliver incredible efficiency gains, the classic FFT only requires of the order of N-log₂N operations. Figure 1 presents the (unscaled) definition of the DFT which acts as the starting point for the FFT algorithm.

Click here for Figure 1

Figure 1: FFT with decimation in (a) frequency and (b) decimation in time.

Two approaches exist for reducing the DFT into a series of simpler calculations. One is to perform decimation in frequency and the other is to perform decimation in time. Both approaches require the same number of complex multiplications and additions.

The key difference between the two approaches is that decimation in time takes bit-reversed inputs and generates normal-order outputs, whereas decimation in frequency takes normal-order inputs and generates bit-reversed outputs. The manipulation of inputs and outputs is carried out by so-called butterfly stages. The use of each butterfly stage involves multiplying an input by a complex twiddle factor, e^-j2πn/N.

A software implementation of the FFT has typically involved block-based processing due to the need to share the processor with other tasks. Any change to the implementation may adversely affect the whole system. A hardware implementation of the FFT typically involves dedicated logic in the form of a pipeline with buffering in each of the butterfly stages.

Pipeline Approach
A pipeline FFT is characterized by the real-time, non-stop processing of a sequential input stream. A hardware-orientated approach strives to minimize the cost or area of silicon by reducing the number of complex multipliers needed. This approach allows more elements to be computed in parallel for a given area.

Pipeline FFTs use decimation in frequency in order to avoid reordering the inputs. As a result, the outputs are in bit reversed order. The FFT algorithm involves the temporal separation of data, a task that is performed by the butterfly stage. Because samples will be taken from different points in the input stream, pipeline FFTs need a way of buffering and reordering data. Various architectures exist for this. The pipeline FFT processor is based on the Radix-2 Single-path Delay Feedback (R2SDF) architecture.

In the FFT approach described above, delay between each butterfly stage in a pipeline FFT is dictated by the amount of buffering required for the inputs. The largest delay is in the first stage where N/2 samples have to be buffered before outputs can be generated. The smallest delay is in the last stage where only one sample needs to be buffered.

An optimization of the FFT algorithm can be made with respect to the twiddle factors. This cuts the number of multipliers required in half. The FFT retains the Radix-2 butterfly structure but has the multiplicative complexity of a Radix-4 algorithm. This results in two Radix-2 butterflies for each complex multiplier. The resulting architecture is termed R2²SDF.

Figure 2 shows two, Radix-2 butterfly stages between each complex multiplier. The memory requirements for each butterfly stage are also shown.

Click here for Figure 2

Figure 2: Diagram showing an R2²SDF pipeline FFT.

Processor Details
In order to describe the FFT implementation in detail, an understanding of some key aspects of the reconfigurable processor used in this example is needed. The processor used to implement the FFT contains 300 individual processors, each featuring a 16-bit Harvard architecture and local memory.

A long instruction word of up to 64 bits allows up to three execution units to execute in each processor in a single cycle at 160MHz. Each array element has a number of ports for communicating with other elements within the array using a switch fabric.

The processors are linked by an interconnect fabric. Like an FPGA, this is allocated at compile time, so all communications & processing is predictable and deterministic (unlike a conventional DSP, with RTOS, scheduling and arbitration which are only statistically predictable).

FFT Implementation
By using a reconfigurable processor, designers can assemble parallel pipelined structures that can be used to implement the FFT. Each of the butterfly and multiplier stage shown in Figure 2 is mapped to a separate array element, as shown in Figure 3.

Click here for Figure 3

Figure 3: Diagram showing the implementation of a 256-point pipeline FFT on a reconfigurable processor.

In the pipelined FFT implementation, each array element takes its input from a bus, processes it, and then provides an output to the next array element in the pipeline. As the overall throughput is limited by the slowest array element, each loop on the each array element should, ideally, take the same number of cycles for optimum performance. For example, if each array element processes each sample in eight cycles, then the maximum throughput is 20 MSample/s at 160MHz.

The 256-point FFT shown in Figure 3 employs eight butterfly stages and three multipliers. The very first butterfly stage uses a memory array element as this first stage is where the most buffering is needed. The memory requirement for the second butterfly stage is half that of the first and can be satisfied using the memory contained in one of the standard array elements.

The twiddle factors needed for multiplication also take up memory. But the amount of memory needed for the twiddle factors only exceeds the capacity of the standard array element for the first of the multipliers. The bit-growth that results from each fixed-point multiply is removed by array elements that are dedicated to rounding.

Increasing FFT Length
The length of the FFT can be increased without reducing throughput by adding more array elements. The only effect on the remainder of the system is a small increase in the latency. For example, increasing the 256-point algorithm of Figure 3 to a 1024-point algorithm would call for two butterfly stages, a multiplier stage, a further twiddle store and a rounding stage: a total of three additional memory elements and two standard processing elements.

For slower sample rates, the overall FFT can be made smaller by combining several functions into one array element. For more substantial throughput increases, further parallelism could be exploited in each butterfly stage, multiplier and rounding stage, with each function being spread across multiple array elements.

An alternative approach would be to instantiate two FFTs and use an array element to multiplex between them. This multiplexing approach allows sixteen 256-point FFTs to run on a single reconfigurable processor with an overall throughput of 320 MSamples/s.

Wrap Up
Using a communications scheme designed to efficiently implement hardware-based algorithms, it is possible to use software-programmable architectures to emulate the advantages and flexibility of hardware-oriented tradeoffs. This will allow early entrance to markets that call for algorithms based on the FFT, such as OFDM-based protocols, to get products into the market ahead of competitors and still be able to ensure compliance with standards as they are ratified.

About the Author
Patrick Fuller is a senior application engineer at piocChip. Patrick holds a B.Eng from Edinburgh University and an MSc from RMIT Melbourne. He can be reached at patrickf@picochip.com.

Click here to read more ...

Industry Articles

Array Processors Enable Flexibility in FFT Designs