By J. Greg Nash, Centar LLC
Introduction
One reason that the power-of-two FFT dominates the landscape of high performance real-time signal processing applications is the perception that alternative non-power-of-two (NP2) circuits are difficult to implement. This perception is likely related to a variety of design issues:
- Need for mixed radices
- Arithmetic requirements
- Throughput scalability
- Design complexity
- Twiddle factor compaction
- Word growth/scaling
At the same time there are important disadvantages to using the power-of-two (P2) FFT, in particular limits on the number of reachable transform sizes that make it difficult to match data set sizes with available circuit transform lengths.
Given that there are good examples that speak to the importance NP2 circuits such as
- 3780-point DFT in Chinese digital multimedia TV Broadcasting-Terrestrial [1] based on orthogonal frequency-division multiplexing (OFDM).
- LTE single carrier uplink “SC-FDMA” protocol that requires computation of 35 different non-power-of-two DFT transform sizes (12 to 1296 points)
what is needed are better circuit methodologies for designing NP2 FFTs. Here a new model for FFT computation is described with an architecture that works equally well for NP2 and P2 designs. Unlike most commercially available FFT architectures, this one can produce either type of circuit with the same level of design effort and the same level of implementation efficiency.
Below we look at the design issues listed above in the context of this new FFT architectural model.
Centar FFT Model
Centar's approach is memory-based, but is architecturally very different than traditional memory-based approaches in that it uses the fine grained array structure shown Fig. 1. It is seen to consist of many very small processing elements (PEs), each containing a multiplier/adder and a few registers with control logic. Each PE reads and writes to a typically small memory. Aggregate bandwidth is limited only by the number of PEs and read/write conflicts don’t occur because each PE has its own memory.
Fig. 1. Centar memory-based array FFT architecture
Although not shown in Fig. 1, each PE is locally connected to its neighbors and data moves only between adjacent PEs. Input and output data flow into and out of the array from the top (not shown), using the nearest neighbor interconnection network.
Need for mixed radices
The easiest mathematical way to represent an N-point DFT is via the matrix multiplication expression, X=Cx, where x are the time domain input values, X are the frequency domain outputs, and C is an NxN coefficient matrix containing elements . The DFT isn’t computed this way because it requires O(N^{2}) multiplications. An FFT significantly reduces the number of multiplications by reorganizing this large matrix multiplication into a series of smaller matrix multiplications or small DFTs of the form X_{r}=C_{r}x_{r}, where r is a radix that is a factor in the decomposition of the transform N.
Then if a NP2 transform of size N can be factored into powers of preferably small radices, a,b,c,…, e.g., N=a^{n}b^{m}c^{p}… , where n,m, and p are integers, it follows that the FFT can be generated by small matrix multiplications where ,n=0,1,..,r-1; k=0,1,..r-1. "
Since it is well known that there is no more efficient way to perform matrix multiplications than on a small array of PEs as shown in Fig. 1 using “systolic” computations [2], the architecture of Fig. 1 is an ideal basis for performing FFTs.
And from the perspective of NP2 vs P2 FFTs, the array structure in Fig. 1 is agnostic, e.g., it can perform any size matrix multiplication. For example, Centar’s LTE SC-FDMA circuit architecture (below) consists of two 6x6 PE arrays as shown in Fig. 2. As pictured, the array nominally does a series of 6-point DFTs [3]; however, embedded within this structure are also 5x5, 4x4, 3x3 and 2x2 arrays. So this array is suitable for the LTE SC-FDMA, where transform values N can be factored into various groups of radices from the set {2,3,4,5,6}.
Fig. 2. Array for performing sets of DFTs using matrix multiplication [4]. This array structure is a variation that does a series of 6 6-point DFTs, using the well-known “row/column” method, e.g., the left side of the array does the row DFTs on the 6x6 array of inputs and the right hand side does the corresponding column DFTs, all using matrix multiplication. (PEs are the small boxes and memories are not shown.)
Arithmetic requirements
The arithmetic advantage enjoyed by P2 FFTs is derived from the simplicity of structure of the radix-2/4 coefficient (butterfly) matrix Cr,
Here it can be seen to contain only elements from the set {±1,±i}, so no real multiplication is involved. On the other hand radices likely seen in NP2 FFTs lead to butterfly structures that require real multiplications and consequently inherently more circuit multipliers are necessitated.
Although it is common to try to develop multi-radix “unified” butterfly hardware based on such methods as the Winograd DFT, which lead to reduced number of multipliers, this is not always necessary. For example in an FPGA context vendors Altera and Xilinx try to “overdesign” their offerings to ensure that all designs needing the LUT/register capacity of an FPGA will always have enough embedded structures like multipliers available (FPGAs can have thousands of multipliers). In these contexts multipliers for many design examples are essentially “free”, make minimal contributions to chip power budgets and therefore don’t compromise the design.
Throughput scalability
The design task of increasing the throughput of a single FFT circuit often starts abstractly with a signal flow graph as shown in Fig. 3 for an 8-point DFT. This describes what is basically an array of butterfly operations. So by “projecting” these onto multiple blocks of hardware that operate simultaneously on one or more rows/columns of the signal flow graph, higher throughputs can be obtained than simple, single-butterfly hardware.
Fig. 3 Signal flow graph for 8-point DFT
Because the signal flow graph for a P2 FFT is relatively uniform, the signal flow graph for other transform sizes can be embedded within it, so hardware implementations targeted to large transform sizes can perform smaller P2 transform sizes as well.
Contrastively the signal flow graphs associated with NP2 DFT circuitry are inherently more irregular because of the use of mixed radices and challenges associated with the complex data management when they are projected in the same way onto multiple hardware blocks. Furthermore, there is usually no simple way to embed smaller NP2 transform sizes within a larger NP2 transform size. Consequently, scaling up NP2 FFT throughputs while providing a run-time option for doing many other NP2 transform sizes is very difficult when done in a traditional way based on the signal flow graph.
But with the memory-based array approach shown in Fig. 1, scaling up throughputs can be done very naturally by simply increasing the array size in a variety of ways. For example Centar’s LTE SC-FDMA circuit (below) can process LTE resource blocks at twice the rate of the equivalent Xilinx design [5] and three times the rate of an Altera design [6]; however, it only uses one row of PE hardware in the structure shown in Fig. 2. The other rows are “virtualized” in this one row of PE hardware. Therefore another factor of ~x6 speedup can be obtained directly. Additional scaling options exist as well [3].
Design complexity
Because of the use of multiple radices, complex tradeoffs exist because for optimal designs often each butterfly, delay/commutator, and twiddle factor ROM uses a different circuit design and/or its operation varies from stage to stage. Also, the multipliers do not always work at 100% efficiency and it is difficult to offer run-time transform size options.
Alternatively, with the memory-based array approach as shown in Fig.1, the hardware consists of simple, regular, locally connected PEs. This reduces the design effort and keeps interconnections short, which decreases power dissipation, facilitates higher clock rates and can lower the compiler place-and-route effort.
Also an advantage of the memory-based circuit is that it can be programmable so that the computational hardware can be highly optimized and then reused with different control circuitry to meet different application specifications. For example, the LTE SC-FDMA example described later is programmed by simply entering parameter values in a ROM or RAM for each desired transform size, so that any desired set of transform sizes is possible.
Twiddle factor compaction:
The need for twiddle memory hardware support is often a non-trivial issue and is the reason for use of such FFT approaches as the Prime Factor algorithm, which avoids them. At least for P2 architectures the option exists to minimize storage requirements to N/8 using symmetries in the twiddle matrix. However, with mixed radices this isn’t an option and can lead to twiddle memory storage inefficiency. Additionally, when supporting run-time choice of multiple NP2 transform sizes, the size of twiddle memory can grow prohibitively large.
To avoid these major twiddle circuit issues the memory-array based approach here facilitates use of “on-the-fly” generated twiddle factors. In this way all that is required is a single small seed table to start the generation process. Additional DFT sizes can be accommodated by simply adding new seeds to this table.
Word growth/scaling
Another consequence of using multiple radices for NP2 FFTs, especially in support of multiple transform sizes, is that word growth between computational stages can become irregular, making it difficult to properly scale the final outputs. Alternatively, since the architecture of Fig. 1 uses the same PE structure and dataflow regardless of which small-radix DFTs are is being performed, the word growth in all NP2 and P2 computations is uniform throughout the array. A simple normalization circuit is used to scale words to a nominal length at the end of each small DFT calculation. Scaling is done using an adaptive combined block floating point and floating point scheme leading to ≥4-bits better signal-to-quantization-noise ratios than other fixed scaling and block floating point designs.
Example NP2 Design: LTE SC-FDMA
A good NP2 design example is the LTE single carrier uplink protocol that requires computation of 35 different non-power-of-two DFT transform sizes (12 to 1296 points) [3]. Centar’s circuit uses a single ROM memory to hold parameters that determine the specific factorizations and execution orderings used for loop index ranges in the verilog HDL coded control modules. This FFT circuit was designed with a smaller word length, but has the much more efficient hybrid scaling scheme, providing 64 to 74db SQNR, adequate for wireless applications.
As can be seen in the normalized performance comparisons with other commercial circuits in Table 1 (average over 35 DFT sizes), this circuit is 2 to 3 times faster in computing LTE Resource Blocks and the number of LUT/registers is relatively small (Table 2). (The Centar LTE circuit was tested at 450 MHz on an Altera Stratix III EP3SL150F1152C2 FPGA-65nm technology-using a development board.)
Table 1. Relative times to compute an LTE resource block for single carrier uplink DFTs (lower is better).
Design | Average LTE Resource Block Compute Time |
Centar | 1.0 |
Xilinx [5] | 2.1 |
Altera [6] | 3.0 |
Table 2. FPGA resource fabric usage and performance for uplink DFT circuitry
Design | FPGA | Bits | Scaling | LUT | ALM/LE | RAM 9K eqv | Mult 18-bit | Fmax (MHz) |
Centar | Stratix III | 12 | BFP/FP | 3582 | 2733 | 30 | 60 | 394 |
Xilinx [5] | Virtex-5 | 8 | BFP | 3447 | 2955 | 20 | 16 | 318 |
Xilinx [5] | Virtex-5 | 18 | BFP | 4707 | 3864 | 20 | 16 | 276 |
Altera [6] | Stratix III | 18 | BFP | 2600 | n.a. | 17 | 32 | 260 |
The memory-based array architecture does take advantage of the availability of embedded elements, but their dynamic power dissipation is only 5%/14% for multipliers/memory of total power dissipation.
Conclusion
By looking at how FFT circuit design issues related to mixed radices, radix arithmetic, scalability, design complexity, twiddle factor usage, and word growth, impact design tradeoffs for NP2 FFT circuitry, we have shown that it is possible to perform both NP2 and NP FFTs with equal efficiency except with regard to radix arithmetic. Fundamentally, this is a result of the fact the FFT architecture is based on small matrix multiplication operations, implemented on an array of PEs, and that such a matrix multiplication maintains efficiency regardless of the length of the data inputs.
All that remains is to be able to “package” successive small DFTs to facilitate overall FFT processing and this is done using special algorithms suited for this purpose [4]. Finally, a collateral design feature is that the fine-grained, array based architecture provides design simplicity, regularity and localized wiring, so that it can provide greater speeds and lower power dissipation than pipelined architectures, yet at the same time offer the flexibility/programmability of traditional memory based architectures.
References
[1] Z.-X. Yang, et.al, Design of a 3780-point IFFT processor for TDS-OFDM, IEEE Trans. Broadcast., vol. 48, pp.57–61, Mar. 2002.
[2] S. Y. Kung, VLSI Array Processors, Prentice Hall, 1988.
[3] J. Greg Nash, High-throughput programmable systolic array FFT architecture and FPGA Implementations, presented at the 2014 International Conference on Computing, Networking and Communications (ICNC), CNC Workshop (ICNC'14 - Workshops - CNC), Honolulu, HI, Feb 2014, pp. 878 -884.
[4] J. Greg Nash, Computationally efficient systolic architecture for computing the discrete Fourier transform, IEEE Trans Sig. Process., V. 53, Dec. 2005, pp. 4640-4651.
[5] Xilinx Discrete Fourier Transform v3.1, DS615 Mar. 1, 2011.
[6] Altera DFT/IDFT Reference Design, Application Note 464, May 2007.