

LTE Single Carrier DFT: Faster Circuits with Reduced FPGA LUT/Register Usage
By J. Greg Nash, Centar LLC

I. Introduction

Single carrier frequency division multiple access (SC-FDMA) is the part of the LTE protocol used for uplink data transmission. It involves a discrete Fourier transform (DFT) precoding of the transmitted signal, where the DFT can be any one of 35 transform sizes N from 12 points to 1296 points, with N = 2^a * 3^b * 5^c and a, b, c nonnegative integers. Given the prominence of the LTE protocol in wireless devices, it is surprising that there are very few DFT FPGA circuit implementations from which to choose. This is likely due to the complexity of the circuit, which must accommodate a runtime choice among many large non-power-of-two transforms, requiring multiple radices for efficient DFT calculation. Here we provide a summary description and analysis of Centar's "memory-based" DFT FPGA implementation, which is 41% faster than a Xilinx equivalent that uses 66%/32% more registers/LUTs, and about 3x faster than one from Altera.

The rationale for targeting FPGAs is their rapidly growing use in communications applications, e.g., base stations and remote radio heads at the top of cell phone towers. FPGAs as an implementation platform have unique features such as large numbers of embedded multipliers and memories (as many as 3960 multipliers and 137 MB of memory on a single FPGA in the Altera Stratix 10 series), leading to very different design trade-offs compared to ASIC designs. In particular, embedded elements in such quantities make them almost "free" compared to their ASIC implementation costs. Consequently, the design goal for this LTE single carrier circuit was to minimize usage of the "expensive" FPGA lookup-table (LUT) and register fabric rather than embedded element usage.
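As a cross-check on the size count, the 35 valid sizes can be enumerated directly from the constraints above: N = 2^a * 3^b * 5^c, N a multiple of 12 (the resource block granularity discussed later), and 12 ≤ N ≤ 1296. A short Python sketch:

```python
def is_5_smooth(n):
    """True if n has no prime factors other than 2, 3 and 5."""
    for p in (2, 3, 5):
        while n % p == 0:
            n //= p
    return n == 1

# Valid LTE SC-FDMA DFT sizes: 5-smooth multiples of 12 up to 1296.
sizes = [n for n in range(12, 1297, 12) if is_5_smooth(n)]
print(len(sizes))            # 35
print(sizes[0], sizes[-1])   # 12 1296
```

Running this yields exactly the 35 sizes (12, 24, 36, ..., 1200, 1296) that the circuit must support at runtime.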
By making use of extra embedded elements, the FFT architecture can be made more efficient, and the resulting higher throughput compensates for the additional embedded elements used. An additional motivation for this goal is that the LUT/register fabric is also the source of most of the FPGA dynamic power consumption, as opposed to the embedded elements, an important consideration since FPGAs are increasingly being used in mobile devices. It was also important that the circuit be programmable, so that at runtime it could perform any transform size or number of transforms to meet special application requirements or future wireless protocol needs, including power-of-two DFTs.

II. Background

A. FFT computing models

Centar's memory-based FFT model is architecturally very different from traditional memory-based designs, as illustrated in Fig. 1. A traditional high-performance memory-based design (Fig. 1a) contains several large arithmetic units, which typically perform butterfly computations, connected to a similar number of memory units. The goal in such designs is to sequence data to/from the memories in such a way that data I/O rates are maximized. Our model does the same thing, but at a finer level of granularity. As Fig. 1b shows, this architecture consists of many very small "processing elements" (PEs), each containing a multiplier/adder and a few registers. Each PE reads and writes to a small, simple dual-port memory, and aggregate bandwidth is limited only by the number of PEs. The well-known scalability of array structures means that high bandwidths, and thus performance, are achieved simply by increasing the array size. It is far more difficult to do the same for traditional memory-based designs (Fig. 1a).

Fig. 1. (a) Traditional memory-based FFT architecture and (b) Centar's fine-grained equivalent.

Fig. 1b also shows that each PE is locally connected to its (4) neighbors, which keeps interconnections short, resulting in reduced power dissipation and higher clock speeds. This is particularly important as process technologies migrate to 14 nm and below, where routing delays become an increasingly dominant part of the critical path. The simple and regular array structure also reduces circuit complexity and therefore design cost.

B. Commercially Available IP

Both Xilinx [1] and Altera [2] provide users of their FPGAs options to support the LTE SC-FDMA DFT protocol using a memory-based architecture as in Fig. 1a, consisting of a single multiport memory that sends/receives data to/from a single arithmetic unit that performs the required butterfly computations. In the Xilinx design the arithmetic unit can perform radix-2, -3, -4, and -5 operations using parallel and pipelined hardware. It is capable of performing two radix-2 operations per clock cycle, one radix-3 or radix-4 operation per cycle, and one radix-5 operation in two clock cycles. The Altera design uses a simpler butterfly unit, which results in relatively lower performance but also reduced LUT/register usage.

III. Algorithm and Architecture

A. DFT from "base-b" processing

Much of the data flow in the array structure of Fig. 1b uses the base-b algorithm [3] for computing a DFT. The basic equations are

    Y_b = W_M ∘ (C_M1 X_b)                (1)
    Z_b = C_M2 Y_b^t,

where "t" means transpose, "∘" denotes an element-by-element multiply, and "b" is a "base" used to reformat the input transform data vector of size N into a b x N/b matrix X_b, with the DFT output Z_b also a matrix (of size N/b x b). W_M is a b x N/b coefficient matrix used in the element-by-element multiply, and C_M1, C_M2 are arrays of radix-b butterfly matrices. The computation (1) can be carried out structurally as shown in Fig. 2.
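A numerical sketch of (1) can be checked against a reference FFT. Here C_M1 and C_M2 are modeled as full b-point and N/b-point DFT matrices, and the index conventions and twiddle layout of W_M are our reconstruction, not necessarily Centar's exact data layout:

```python
import numpy as np

def dft_matrix(n):
    """n-point DFT matrix: F[j, k] = exp(-2*pi*i*j*k/n)."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

def base_b_dft(x, b):
    """Sketch of (1): Y_b = W_M o (C_M1 X_b), Z_b = C_M2 Y_b^t."""
    N = x.size
    m = N // b
    X_b = x.reshape(b, m)                  # input vector reformatted as b x N/b
    W_M = np.exp(-2j * np.pi * np.outer(np.arange(b), np.arange(m)) / N)
    Y_b = W_M * (dft_matrix(b) @ X_b)      # radix-b butterflies, then twiddles
    Z_b = dft_matrix(m) @ Y_b.T            # N/b-point DFTs on the transpose
    return Z_b                             # Z_b[k2, k1] holds X[k1 + b*k2]

# Verify against NumPy's FFT for the 540-point LTE size with b = 6.
x = np.random.randn(540) + 1j * np.random.randn(540)
Z = base_b_dft(x, 6)
assert np.allclose(Z.reshape(-1), np.fft.fft(x))
```

With this layout the output matrix Z_b, read out row by row, is already in natural DFT order, which is why the final assertion holds.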
Here there are b x b arrays of PEs that do the matrix-matrix computations C_M1 X_b and C_M2 Y_b^t using well-known "systolic" algorithms, with the multiplication by W_M occurring between the two arrays. Each PE nominally contains a complex multiplier, an adder, and a few registers. Two options for the processing are shown in Fig. 2: either X_b, containing the input data, is stored in an input buffer and the C_M1 values are stored in registers in the PE array, or vice versa. The symmetry of the matrix multiplication allows this choice. On the right side there is another input buffer for C_M2, containing all the radix-b butterfly coefficients that will be used (all preloaded before processing begins). During processing, data flows in groups of b, and its movement is pipelined through the arrays of PEs. At the end of processing, the result Z_b is stored in memories associated with those PEs.

Fig. 2. Processing architecture for (1).

B. Array Architecture

Although every supported value of N can be obtained with just radices 2, 3, and 5, as in the Xilinx design, the architecture used here is based on a value of b=6, so that radix-6 is an option. This provides additional flexibility because every LTE DFT size includes a factor of 6: a "resource block", the smallest unit of subcarriers used by the SC-FDMA protocol, has at least 12 subcarriers. The radix-6 option translates to more efficient processing for all 35 sizes. The actual structure of the architecture, shown in Fig. 3, contains two 6x6 arrays, labeled left-hand side (LHS) and right-hand side (RHS) for later reference. The input X is still labeled with a "b" because the array also contains subset b x b arrays with b = 2, 3, 4, and 5, so the array in Fig. 3 can perform radix-2, -3, -4, and -5 processing as well.

Fig. 3. Virtual array architecture to perform the processing shown in Fig. 2.

C. Factorization and DFT computations

For all SC-FDMA transform sizes, computation proceeds using the well-known "row/column" method that decomposes N into two factors, e.g., for the 540-point size, N = 36*15 = N_1*N_2. The DFT is then performed by doing 15 36-point column DFTs, followed by a twiddle-factor multiplication ("Multiplier Array" in Fig. 3), and finally the 36 15-point transforms. Since the implementation here consists of 6x6 PE arrays, it is most efficient to choose the factorization N_1*N_2 = 36*15 = 6^2*(3*5), because using the biggest possible radices (≤6) reduces the number of computation clock cycles.

IV. Implementation

A. Physical Structure

The LHS and RHS virtual arrays in Fig. 3 are not in fact kept physically separate; rather, they are folded on top of one another, since the design calls for an individual PE on the LHS to share a simple dual-ported memory "M" with an equivalent RHS PE, as shown in Fig. 4. The full array structure of Fig. 3 would deliver higher throughput than needed for most SC-FDMA applications. Consequently, the physical array consists of only one of the PE rows in Fig. 3. In other words, each virtual LHS, RHS, and multiplier PE column in the array is projected (collapsed vertically) onto a single PE row. Thus there are 6 RAM memories associated with the 6 combined LHS/RHS PEs. This linear array emulates the operations of the 2-D virtual array of Fig. 3, but completes a DFT computation a factor of b times more slowly.

Fig. 4. Folded LHS and RHS PE arrays (north/south connections not shown).

B. Throughput Scaling

One of the unique features of this architecture, unlike a traditional memory-based version, is that throughput can be scaled upward in several ways: (1) by using all PE rows in Fig. 3, increasing throughput by a factor of b; (2) by increasing the size of the array structure in Fig. 3; or (3) by simply replicating the array. These approaches can also be combined.

C. Programmability

Any transform size that can be factored into relatively small numbers, as in the LTE protocol, is a candidate for implementation with our architecture. The control hardware must know the number of rows/columns used in the LHS and RHS arrays during each stage of the computation, the different coefficient and twiddle matrices required, the read/write memory addresses, the sequencing of data storage and retrieval patterns, etc. All this parameter information is stored essentially as a table in ROM. The programmability arises because, when a particular DFT size is requested, the corresponding table is read from ROM and used to set loop values in the Verilog HDL control hardware. The number of different DFT sizes that can be supported is then limited only by the size of this parameter memory. The benefit of a programmable circuit is that the computational hardware can be highly optimized and then reused with different control circuitry to perform different sets of computations. All that is necessary is that the DFT size be factorable into relatively small numbers. The same architecture can also be programmed to perform power-of-two FFTs. This matters because almost all wireless protocols, including LTE, require power-of-two computations, and having both options available in the same circuit could be useful for future wireless applications.

V. Comparative Analysis

A. Introduction

Here Centar's LTE SC-FDMA circuit is compared to commercial designs also implemented in FPGA technologies. To provide a more relevant metric than raw throughput and latency numbers, the length of time necessary to compute an LTE resource block (RB), shown diagrammatically in Fig. 5, was chosen. The RB is the minimum unit of data processed by the LTE protocol and occupies one time "slot" (0.5 ms), divided into 7 symbols (for the normal cyclic prefix). For example, 1296 subcarriers would imply processing a maximum of 108 resource blocks.
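As a rough illustration of this metric, the 0.5 ms slot and its 7 symbols fix the clock-cycle budget a DFT core has per symbol at a given clock rate. The Fmax figure below is a hypothetical example, not a measured value from the comparisons that follow:

```python
SLOT_S = 0.5e-3        # one LTE time slot, in seconds
SYMBOLS_PER_RB = 7     # normal cyclic prefix

def cycles_per_symbol(fmax_hz):
    """Clock cycles available to finish one symbol's DFT within a slot."""
    return fmax_hz * SLOT_S / SYMBOLS_PER_RB

# e.g., at a hypothetical 300 MHz clock:
print(int(cycles_per_symbol(300e6)))   # 21428
```

A design meets the RB deadline only if its per-transform cycle count (including any latency) fits within this per-symbol budget, which is why both latency and throughput matter below.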
This is a better performance comparison metric in that it requires both low latency and high throughput for good results.

Fig. 5. LTE resource block definition.

B. Xilinx

A Virtex-6 (XC6VLX75T-3) FPGA was used as the target hardware for both the Xilinx circuit and Centar's, and the same software tools (ISE 14.7) were used for synthesis and place-and-route. The Xilinx LogiCORE IP version 3.1 was used to generate a 16-bit version of their DFT, because its SQNR of 60.0 dB (averaged over all 35 transform sizes) is comparable to Centar's 12-bit circuit with an average SQNR of 61.3 dB. (Xilinx LogiCORE includes a bit-accurate C model, callable as a Matlab mex function, that was used to obtain the Xilinx SQNR values.) The resource comparisons in Table I use a block RAM normalized to 18K bits, so that a 36K block RAM is counted as two 18K RAMs. Also, the "RB Average Throughput" column provides the average number of cycles (over all 35 DFT sizes) it takes to compute the DFTs for the 7 symbols defined by an RB, as a function of the transform size N. Finally, the Fmax (maximum clock frequency) value and the number of RB clock cycles are combined to provide a measure of throughput, normalized to a value of "1" for the Centar design (a higher number is better). Table I shows that the Xilinx circuit uses 66% more registers and 32% more LUTs, while the Centar circuit provides 41% higher throughput, a significant combined gain. The Centar design uses more embedded memory and multipliers, but this was less of a consideration, as discussed in Section I. Also, since many wireless applications involve MIMO, and cell towers can have three different sectors operating simultaneously, more than one single carrier DFT core could be required. In theory Centar's 41% higher throughput translates to fewer cores, which considerably reduces the Xilinx advantage in block RAM usage shown in Table I.

C. Altera

Altera does not offer an LTE DFT core as Xilinx does; however, they have published results of an example design running on a Stratix III FPGA with 9K memory blocks, which provides a useful basis for comparison. The Centar design was therefore also targeted to a Stratix III FPGA (EP3SE110F780C2), and the same Altera Quartus tools were used for synthesis and place-and-route. The Altera implementation uses less logic but is far slower, both in terms of lower Fmax values and the increased number of cycles to complete the RB computation. Consequently, the Centar design has roughly 3x higher throughput, while its LUT usage is only about 47% higher. Note as well that the Altera core does not offer a 1296-point transform option and its outputs are not in normal order; adding buffer circuitry to sort the output data would require additional logic and about N additional words of memory (about five 9K RAM blocks) beyond the numbers shown in Table I.

Table I. LTE circuit technology comparisons
To support several sectors with MIMO, several Altera cores would be needed, whereas only one or two Centar cores would be required, so the Altera block RAM advantage shown in Table I would be nullified, as shown in Table II. Here, ratios are shown for each resource, normalized to the design with the smallest requirement for that resource. As can be seen, except for the case of a single sector with no more than two MIMO streams, the Centar design uses significantly fewer LUTs and block RAMs, and slightly fewer multipliers.

Table II. Multicore circuit comparisons
VI. Conclusion

We have shown how a new memory-based model for FFT calculations can be applied to the demanding needs of the LTE SC-FDMA DFT protocol. This new class of FFT architecture combines algorithm efficiency and programmability with new circuit features, leading to better FPGA-based implementations: higher throughput with lower LUT/register usage, straightforward throughput scaling by enlarging or replicating the PE array, and runtime programmability across all 35 LTE transform sizes as well as power-of-two FFTs.
References

All material on this site Copyright © 2017 Design And Reuse S.A. All rights reserved. 