

NonPowerofTwo FFT Circuit Designs Do Not Have to be DifficultBy J. Greg Nash, Centar LLC Introduction One reason that the poweroftwo FFT dominates the landscape of high performance realtime signal processing applications is the perception that alternative nonpoweroftwo (NP2) circuits are difficult to implement. This perception is likely related to a variety of design issues:
At the same time there are important disadvantages to using the poweroftwo (P2) FFT, in particular limits on the number of reachable transform sizes that make it difficult to match data set sizes with available circuit transform lengths. Given that there are good examples that speak to the importance NP2 circuits such as
what is needed are better circuit methodologies for designing NP2 FFTs. Here a new model for FFT computation is described with an architecture that works equally well for NP2 and P2 designs. Unlike most commercially available FFT architectures, this one can produce either type of circuit with the same level of design effort and the same level of implementation efficiency. Below we look at the design issues listed above in the context of this new FFT architectural model. Centar FFT Model Centar's approach is memorybased, but is architecturally very different than traditional memorybased approaches in that it uses the fine grained array structure shown Fig. 1. It is seen to consist of many very small processing elements (PEs), each containing a multiplier/adder and a few registers with control logic. Each PE reads and writes to a typically small memory. Aggregate bandwidth is limited only by the number of PEs and read/write conflicts don’t occur because each PE has its own memory. Fig. 1. Centar memorybased array FFT architecture Although not shown in Fig. 1, each PE is locally connected to its neighbors and data moves only between adjacent PEs. Input and output data flow into and out of the array from the top (not shown), using the nearest neighbor interconnection network. Need for mixed radices The easiest mathematical way to represent an Npoint DFT is via the matrix multiplication expression, X=Cx, where x are the time domain input values, X are the frequency domain outputs, and C is an NxN coefficient matrix containing elements . The DFT isn’t computed this way because it requires O(N^{2}) multiplications. An FFT significantly reduces the number of multiplications by reorganizing this large matrix multiplication into a series of smaller matrix multiplications or small DFTs of the form X_{r}=C_{r}x_{r}, where r is a radix that is a factor in the decomposition of the transform N. Then if a NP2 transform of size N can be factored into powers of preferably small radices, a,b,c,…, e.g., N=a^{n}b^{m}c^{p}… , where n,m, and p are integers, it follows that the FFT can be generated by small matrix multiplications where ,n=0,1,..,r1; k=0,1,..r1. " Since it is well known that there is no more efficient way to perform matrix multiplications than on a small array of PEs as shown in Fig. 1 using “systolic” computations [2], the architecture of Fig. 1 is an ideal basis for performing FFTs. And from the perspective of NP2 vs P2 FFTs, the array structure in Fig. 1 is agnostic, e.g., it can perform any size matrix multiplication. For example, Centar’s LTE SCFDMA circuit architecture (below) consists of two 6x6 PE arrays as shown in Fig. 2. As pictured, the array nominally does a series of 6point DFTs [3]; however, embedded within this structure are also 5x5, 4x4, 3x3 and 2x2 arrays. So this array is suitable for the LTE SCFDMA, where transform values N can be factored into various groups of radices from the set {2,3,4,5,6}. Fig. 2. Array for performing sets of DFTs using matrix multiplication [4]. This array structure is a variation that does a series of 6 6point DFTs, using the wellknown “row/column” method, e.g., the left side of the array does the row DFTs on the 6x6 array of inputs and the right hand side does the corresponding column DFTs, all using matrix multiplication. (PEs are the small boxes and memories are not shown.) Arithmetic requirements The arithmetic advantage enjoyed by P2 FFTs is derived from the simplicity of structure of the radix2/4 coefficient (butterfly) matrix Cr, Here it can be seen to contain only elements from the set {±1,±i}, so no real multiplication is involved. On the other hand radices likely seen in NP2 FFTs lead to butterfly structures that require real multiplications and consequently inherently more circuit multipliers are necessitated. Although it is common to try to develop multiradix “unified” butterfly hardware based on such methods as the Winograd DFT, which lead to reduced number of multipliers, this is not always necessary. For example in an FPGA context vendors Altera and Xilinx try to “overdesign” their offerings to ensure that all designs needing the LUT/register capacity of an FPGA will always have enough embedded structures like multipliers available (FPGAs can have thousands of multipliers). In these contexts multipliers for many design examples are essentially “free”, make minimal contributions to chip power budgets and therefore don’t compromise the design. Throughput scalability The design task of increasing the throughput of a single FFT circuit often starts abstractly with a signal flow graph as shown in Fig. 3 for an 8point DFT. This describes what is basically an array of butterfly operations. So by “projecting” these onto multiple blocks of hardware that operate simultaneously on one or more rows/columns of the signal flow graph, higher throughputs can be obtained than simple, singlebutterfly hardware. Fig. 3 Signal flow graph for 8point DFT Because the signal flow graph for a P2 FFT is relatively uniform, the signal flow graph for other transform sizes can be embedded within it, so hardware implementations targeted to large transform sizes can perform smaller P2 transform sizes as well. Contrastively the signal flow graphs associated with NP2 DFT circuitry are inherently more irregular because of the use of mixed radices and challenges associated with the complex data management when they are projected in the same way onto multiple hardware blocks. Furthermore, there is usually no simple way to embed smaller NP2 transform sizes within a larger NP2 transform size. Consequently, scaling up NP2 FFT throughputs while providing a runtime option for doing many other NP2 transform sizes is very difficult when done in a traditional way based on the signal flow graph. But with the memorybased array approach shown in Fig. 1, scaling up throughputs can be done very naturally by simply increasing the array size in a variety of ways. For example Centar’s LTE SCFDMA circuit (below) can process LTE resource blocks at twice the rate of the equivalent Xilinx design [5] and three times the rate of an Altera design [6]; however, it only uses one row of PE hardware in the structure shown in Fig. 2. The other rows are “virtualized” in this one row of PE hardware. Therefore another factor of ~x6 speedup can be obtained directly. Additional scaling options exist as well [3]. Design complexity Because of the use of multiple radices, complex tradeoffs exist because for optimal designs often each butterfly, delay/commutator, and twiddle factor ROM uses a different circuit design and/or its operation varies from stage to stage. Also, the multipliers do not always work at 100% efficiency and it is difficult to offer runtime transform size options. Alternatively, with the memorybased array approach as shown in Fig.1, the hardware consists of simple, regular, locally connected PEs. This reduces the design effort and keeps interconnections short, which decreases power dissipation, facilitates higher clock rates and can lower the compiler placeandroute effort. Also an advantage of the memorybased circuit is that it can be programmable so that the computational hardware can be highly optimized and then reused with different control circuitry to meet different application specifications. For example, the LTE SCFDMA example described later is programmed by simply entering parameter values in a ROM or RAM for each desired transform size, so that any desired set of transform sizes is possible. Twiddle factor compaction: The need for twiddle memory hardware support is often a nontrivial issue and is the reason for use of such FFT approaches as the Prime Factor algorithm, which avoids them. At least for P2 architectures the option exists to minimize storage requirements to N/8 using symmetries in the twiddle matrix. However, with mixed radices this isn’t an option and can lead to twiddle memory storage inefficiency. Additionally, when supporting runtime choice of multiple NP2 transform sizes, the size of twiddle memory can grow prohibitively large. To avoid these major twiddle circuit issues the memoryarray based approach here facilitates use of “onthefly” generated twiddle factors. In this way all that is required is a single small seed table to start the generation process. Additional DFT sizes can be accommodated by simply adding new seeds to this table. Word growth/scaling Another consequence of using multiple radices for NP2 FFTs, especially in support of multiple transform sizes, is that word growth between computational stages can become irregular, making it difficult to properly scale the final outputs. Alternatively, since the architecture of Fig. 1 uses the same PE structure and dataflow regardless of which smallradix DFTs are is being performed, the word growth in all NP2 and P2 computations is uniform throughout the array. A simple normalization circuit is used to scale words to a nominal length at the end of each small DFT calculation. Scaling is done using an adaptive combined block floating point and floating point scheme leading to ≥4bits better signaltoquantizationnoise ratios than other fixed scaling and block floating point designs. Example NP2 Design: LTE SCFDMA A good NP2 design example is the LTE single carrier uplink protocol that requires computation of 35 different nonpoweroftwo DFT transform sizes (12 to 1296 points) [3]. Centar’s circuit uses a single ROM memory to hold parameters that determine the specific factorizations and execution orderings used for loop index ranges in the verilog HDL coded control modules. This FFT circuit was designed with a smaller word length, but has the much more efficient hybrid scaling scheme, providing 64 to 74db SQNR, adequate for wireless applications. As can be seen in the normalized performance comparisons with other commercial circuits in Table 1 (average over 35 DFT sizes), this circuit is 2 to 3 times faster in computing LTE Resource Blocks and the number of LUT/registers is relatively small (Table 2). (The Centar LTE circuit was tested at 450 MHz on an Altera Stratix III EP3SL150F1152C2 FPGA65nm technologyusing a development board.) Table 1. Relative times to compute an LTE resource block for single carrier uplink DFTs (lower is better).
Table 2. FPGA resource fabric usage and performance for uplink DFT circuitry
The memorybased array architecture does take advantage of the availability of embedded elements, but their dynamic power dissipation is only 5%/14% for multipliers/memory of total power dissipation. Conclusion By looking at how FFT circuit design issues related to mixed radices, radix arithmetic, scalability, design complexity, twiddle factor usage, and word growth, impact design tradeoffs for NP2 FFT circuitry, we have shown that it is possible to perform both NP2 and NP FFTs with equal efficiency except with regard to radix arithmetic. Fundamentally, this is a result of the fact the FFT architecture is based on small matrix multiplication operations, implemented on an array of PEs, and that such a matrix multiplication maintains efficiency regardless of the length of the data inputs. All that remains is to be able to “package” successive small DFTs to facilitate overall FFT processing and this is done using special algorithms suited for this purpose [4]. Finally, a collateral design feature is that the finegrained, array based architecture provides design simplicity, regularity and localized wiring, so that it can provide greater speeds and lower power dissipation than pipelined architectures, yet at the same time offer the flexibility/programmability of traditional memory based architectures. References [1] Z.X. Yang, et.al, Design of a 3780point IFFT processor for TDSOFDM, IEEE Trans. Broadcast., vol. 48, pp.57–61, Mar. 2002. [2] S. Y. Kung, VLSI Array Processors, Prentice Hall, 1988. [3] J. Greg Nash, Highthroughput programmable systolic array FFT architecture and FPGA Implementations, presented at the 2014 International Conference on Computing, Networking and Communications (ICNC), CNC Workshop (ICNC'14  Workshops  CNC), Honolulu, HI, Feb 2014, pp. 878 884. [4] J. Greg Nash, Computationally efficient systolic architecture for computing the discrete Fourier transform, IEEE Trans Sig. Process., V. 53, Dec. 2005, pp. 46404651. [5] Xilinx Discrete Fourier Transform v3.1, DS615 Mar. 1, 2011. [6] Altera DFT/IDFT Reference Design, Application Note 464, May 2007.

Home  Feedback  Register  Site Map 
All material on this site Copyright © 2017 Design And Reuse S.A. All rights reserved. 