By J. Greg Nash, Centar LLC
I.Introduction
Single carrier frequency division multiple access (SC-FDMA) is a part of the LTE protocol used for up-link data transmission. It involves a discrete Fourier transform (DFT) pre-coding of the transmitted signal, where the DFT can be any one of 35 transform sizes N from 12-points to 1296-points, where N=2^{a}3^{b}5^{c} and a, b, c are positive integers.
Given the prominence of the LTE protocol in wireless devices, it is surprising that there are very few DFT FPGA circuit implementations from which to choose. This is likely due to the complexity of the circuit, which must accommodate run-time choice of many and large non-power-of-two transforms, requiring multiple radices for efficient DFT calculation. Here we provide a summary description and analysis of Centar’s unique “memory-based” DFT FPGA implementation that is 41% faster than a Xilinx equivalent which uses 66%/32% more registers/LUTs and x3 faster than one from Altera. The rationale for targeting FPGAs is due to the rapidly growing FPGA use in communications applications, e.g., base stations and remote radio heads at the top of cell phone towers.
FPGAs as an implementation platform have unique features such as large numbers of embedded multipliers and memories (as many as 3960 multipliers and 137MB of memory on a single FPGA in the Altera Stratix 10 series), leading to very different design tradeoffs compared to ASIC designs. In particular, embedded elements in such quantities make them almost “free”, compared to their ASIC implementation costs. Consequently, the design goal for this LTE single carrier circuit was to minimize the “expensive” FPGA look-up-table (LUT) and register fabric usage rather than embedded element usage. By making use of extra embedded elements, the FFT architecture can be more efficient, leading to much higher throughput which can compensate for use of the additional embedded elements. An additional motivation for this goal is that this fabric is also the source of most of the FPGA dynamic power consumption, as opposed to the embedded elements, an important consideration since FPGAs are increasingly being used in mobile devices.
It was also important that the circuit be programmable so that it could perform any transform size or number of transforms at run-time to meet any special application requirement or future wireless protocol needs, including power-of-two DFTs.
II. Background
A. FFT computing models
Centar’s memory-based FFT model is architecturally very different than traditional memory-based designs as illustrated in Fig. 1. Here a traditional high-performance memory-based design (Fig. 1a) is shown to contain a plurality of large arithmetic units, where typically butterfly computations are performed, connected to a similar plurality of memory units. The goal in such designs is to sequence data to/from the memories in such a way that data I/O rates are maximized. Alternatively, our model does the same thing, but at a finer level of granularity. From Fig. 1b this architecture is seen to consist of many very small “processing elements” (PEs), each containing a multiplier/adder and a few registers. Each PE reads and writes to a small, simple dual-port memory and aggregate bandwidth is limited only by the number of PEs. The well-known scalability of array structures means that high bandwidths, and thus performance, are achieved by simply increasing the array size. It is far more difficult to do the same for traditional memory based designs (Fig. 1a).
Fig. 1. (a) Traditional memory-based FFT architecture and (b) Centar’s fine-grained equivalent.
Fig. 1b also shows that each PE is locally connected to its (4) neighbors which keeps interconnections short, resulting in reduced power dissipation and higher clock speeds. This is particularly important as process technologies migrate to 14nm sizes and lower when routing delays become an increasingly dominant part of the critical path. The simple and regular array structure also reduces circuit complexity and therefore reduces design costs.
B. Comercially Available IP
Both Xilinx [1] and Altera [2] provide users of their FPGAs options to support the LTE SC-FDMA DFT protocol using a memory-based architecture as in Fig.1a, consisting of a single multi-port memory that sends/receives data to/from a single arithmetic unit that performs the required butterfly computations.
In the Xilinx design the arithmetic unit can do radix 2, 3, 4, and 5 operations using parallel and pipelined hardware. It is capable of performing two radix-2 operations per clock cycle, one radix-3 or radix-4 operation per cycle and one radix-5 operation in two clock cycles. The Altera design uses a simpler butterfly unit which results in relatively lower performance, but also reduced LUT/register usage.
III. Algorithm and Architecture
A. DFT from “base-b” processing
Much of the data flow in the array structure of Fig. 1b uses the base-b algorithm [3] for computing a DFT. The basic equations for this are
Y_{b} = W_{M}∙C_{M1}X_{b } (1)
Z_{b}_{ } = C_{M2}Y_{b}^{t},
where “t” means transpose, “b” is a “base” that is used to reformat the input transform data vector of size N, into a matrix X_{b}= b x N/b with the DFT output Z_{b }also a matrix of size b x N/b. is an N/b x N/b coefficient matrix used in an element-by-element multiply and C_{M1, }C_{M2} are arrays of radix-b butterfly matrices.
The computation (1) can be carried out structurally as shown in Fig. 2. Here there are b x b arrays of PEs that do the matrix-matrix computations C_{M1}Xb and C_{M2}Y_{b}^{t} using well know “systolic” algorithms with the multiplication by W_{M} occurring in between the two arrays. Each PE nominally contains a complex multiplier, adder and a few registers.
Two options for the processing are shown in Fig. 2, either X_{b}, containing the input data, is stored in an input buffer and C_{M1} values are stored in registers in the PE array, or vice-versa. The matrix-vector multiplication symmetry allows this option. Then on the right side there is another input buffer for C_{M2}, containing all the radix-b butterfly coefficients that will be used (all preloaded before processing begins). During processing data flows in groups of b, and their movement is pipelined through the arrays of PEs. At the end of processing, the result Z_{b} is stored in memories associated those PEs.
Fig. 2. Processing architecture for (1).
B. Array Architecture
Although the value of N can be obtained with just radices 2,3 and 5, as in the Xilinx design, the architecture used here is based on a value of b=6 so that radix-6 is an option. This provides additional flexibility considering that all DFT sizes include a factor of 6, since a “resource block” or the smallest unit of subcarriers used by the SC-FDMA protocol has at least 12 subcarriers. The radix-6 option translates to more efficient processing for all 35 sizes. The actual structure of the architecture, shown in Fig. 3, contains two 6x6 arrays which are labeled left-hand-side (LHS) and right-hand-side (RHS) for later reference.
The input X is still labeled with a “b” because the array also contains sub-set bxb arrays with b=2,3,4 and 5, so the array in Fig. 3 can also perform radix-2,3,4, and 5 processing as well.
Fig. 3. Virtual array architecture to perform processing shown in Fig. 2.
C. Factorization and DFT computations
For all transform SC-FDMA sizes computation proceeds using the well-known “row/column” method that decomposes N into two factors, e.g., for the 540-point size, N=36*15=N_{1}*N_{2}. Then the DFT is performed by doing 15 36-point column DFTs, followed by a twiddle multiplication (“Multiplier Array” in Fig. 3) using
and finally the 36 15-point transforms. Since the implementation here consists of 6x6 PE arrays, it would be most efficient to choose the factorization N_{1}*N_{2}=36*15=6^{2}*(3*5) because using the biggest possible radices (≤6) reduces the number of computation clock cycles.
IV. implementation
A. Physical Structure
The LHS and RHS virtual arrays in Fig. 3 are in principal not kept physically separate; rather, they are folded on top of one another since the design calls for an individual PE on the LHS to share a simple dual-ported memory “M” with an equivalent RHS PE as shown in Fig. 4.
The array structure shown in Fig. 3 would result in higher throughputs than needed for most SC-FDMA applications. Consequently, the physical array consists of only one of the PE rows in Fig. 3. In other words, each virtual LHS, RHS and multiplier PE columns in the array is projected (collapsed vertically) onto a single PE row. Thus, there are 6 RAM memories associated with the 6 LHS/RHS combination PEs. This linear array emulates the operations of the 2-D virtual array of Fig. 3, but completes a DFT computation a factor of b times slower.
Fig. 4. Folded LHS and RHS PE arrays (north/south connections not shown)
B. Throughput Scaling
One of the unique features of this architecture, unlike a traditional memory-based version, is that throughput can be scaled upward in several ways: (1) by using all PE rows in Fig. 3 to increase throughput by b; (2) by increasing the size of the array structure in Fig. 3; (3) by simply replicating the array. These different approaches can also be combined.
C. Programmability
Any transform size that can be factored into relative small numbers like the LTE protocol, is a candidate for implementation with our architecture.
The control hardware is aware of the number of rows/columns used in the LHS and RHS arrays during each stage of the computation, the different coefficient and twiddle matrices required, the read/write memory addresses, the sequencing in/out of data storage and retrieval patterns, etc. All this parameter information is saved essentially as a table in ROM memory. The programmability arises because when a particular DFT size is requested, the corresponding table is read from ROM and used to set loop values in the verilog HDL control hardware. The number of different DFT sizes that can be supported is then only limited by the size of this parameter memory.
The benefit of a programmable circuit is that the computational hardware can be highly optimized and then reused with different control circuitry to do different sets of computations. All that is necessary is that the DFT size be factorable into relatively small numbers.
It is important to note that the same architecture can be programmed to perform power-of-two FFTs as well. This is important as almost all wireless protocols, including LTE, require power-of-two computations and having both options available in the same circuit could be useful for future wireless applications.
V. Comparitive analysis
A. Introduction
Here Centar’s LTE SC-FDMA circuit is compared to commercial designs also implemented in FPGA technologies. In order to provide a more relevant metric than throughput and latency numbers the length of time necessary to compute an LTE resource block (RB), as shown diagrammatically in Fig.5, was chosen. The RB is the minimum processing unit of data for the LTE protocol and occupies one time “slot” (0.5ms), divided up into 7 symbols for (normal cyclic prefix). For example, 1296 subcarriers would imply processing a maximum of 108 resource blocks. This is a better performance comparison metric in that it requires both low latency and high throughput for good results.
Fig. 5. LTE Resource block definition
B. Xilinx
A Vertix-6 (XC6VLX75T-3) FPGA was used as the target hardware for both the Xilinx and Centar’s circuit. The same software tools (ISE 14.7) for synthesis and place-and-route were also used. The Xilinx LogiCORE IP version 3.1 was used to generate a 16-bit version of their DFT because the SQNR of 60.0 db (average over all 35 transform sizes) was comparable to Centar’s 12-bit circuit with average SQNR=61.3. (Xilinx LogiCORE includes a bit accurate C model, callable as a Matlab mex function, that was used to obtain Xilinx SQNR values.)
The resource comparisons in Table I use a block RAM normalized to 18K bits, so that a 36K block RAM is considered equal to two 18K RAMs. Also, the “RB Average Throughput” column provides the average number of cycles (over all 35 DFT sizes) it takes to compute the DFT for the 7 symbols defined by a RB as a function of the transform size N. Finally, the Fmax (maximum clock frequency) value and the number of RB clock cycles are combined, providing a measure of the throughput, which is normalized to a value of “1” for the Centar design (a higher number is better). So Table I shows the Xilinx circuit uses 66% more registers and 32% LUTs, while Centar circuits provides 41% higher throughput. So the overall combined gain is significant.
The Centar design uses more embedded memory and multipliers, but this was less a consideration as discussed in Section I. Also, since many wireless applications involve use of MIMO and cell towers can have three different sectors operating simultaneously, it is possible that more than one single carrier DFT cores could be required. In theory Centar’s 41% higher throughputs translate to fewer cores, which reduces considerably the Xilinx advantage in block RAM usage shown in Table 1.
C. Altera
Altera does not offer a DFT LTE core as does Xilinx; however, they have published results of an example design running on a Stratix III FPGA with 9K memory blocks that provides a useful basis for comparison.
For comparison the Centar design was also targeted to a Stratix III FPGA (EP3SE110F780C2) and the same Altera Quartus tools for synthesis and place-and-route were used to implement the design. The Altera implementation uses less logic, but is far slower, both in terms of the lower values of Fmax, and the increased number of cycles to complete the RB computation. Consequently, the Centar design has a significant ~3x higher throughput while LUT usage is only ~47% higher.
Note as well the Altera core doesn’t offer a 1296-point transform option and the outputs are not in normal order. Adding buffer circuitry to sort the output data would require additional logic and add ~N additional words of memory (~5 9K RAM blocks) to the numbers shown in Table I.
Table I. LTE Circuit Technology comparisons
Design | FPGA | LUT | Registers | Block RAM (9/18K) | Multipliers (18-bit) | Fmax (MHz) | RB Average Throughput (cycles) | Throughput (Normalized) |
Centar | Virtex-6 | 2915 | 2581 | 19 | 72 | 401 | 16.6N | 1 |
Xilinx [1] | Virtex-6 | 3851 | 4326 | 10 | 16 | 403 | 23.4N | 0.71 |
Centar | Stratix III | 3816 | 3188 | 29 | 60 | 400 | 16.6N | 1 |
Altera [2] | Stratix III | 2600 | N/A. | 17 | 32 | 260 | 32.9N | 0.33 |
To support several sectors with MIMO several Altera cores would need to be used. In this case only one or two Centar cores would be required, so that the Altera block RAM advantage shown in Table I would be nullified, as shown in Table II. Here, the ratios are shown for each of the elements, normalized to the design with the small resource requirement for a particular element. As can be seen, except for the the case of having just one sector and no more than two MIMO streams, the Centar design uses significantly fewer LUTs and block RAMs, and slightly fewer multipliers.
Table II. Multicore circuit comparisons
Number MIMO Streams | Sectors | Total RB | Altera Cores Required | Centar Cores Required | LUT Altera::Centar | Block RAM Altera::Centar | Multipliers Altera::Centar |
1 | 1 | 1 | 1 | 1 | 1.00::1.47 | 1.00::1.32 | 1.00::1.875 |
2 | 1 | 2 | 1 | 1 | 1.00::1.47 | 1.00::1.32 | 1.00::1.875 |
4 | 1 | 4 | 2 | 1 | 1.36::1.00 | 1.52::1.00 | 1.07::1.00 |
1 | 3 | 3 | 2 | 1 | 1.36::1.00 | 1.52::1.00 | 1.07::1.00 |
2 | 3 | 6 | 3 | 1 | 2.04::1.00 | 2.27:1.00 | 1.60::1.00 |
4 | 3 | 12 | 4 | 2 | 1.36::1.00 | 1.52::1.00 | 1.07::1.00 |
VI. Conclusion
We have shown how a new memory-based model for FFT calculations can be applied to the demanding needs of the LTE SC-FDMA DFT protocol. This new class of FFT architecture combines algorithm efficiency and programmability with new circuit features leading to better implementations for FPGA-based hardware. In particular the comparative benefits are:
- Much reduced usage of LUT and registers in the FPGA fabric, while providing higher overall throughputs compared to other commercial designs for the important LTE metric of resource block calculations.
- Programmable by entering values into a parameter ROM. This feature can be useful in adapting the Centar circuit to any unique future wireless system requirements.
- High dynamic range as a result of a combined block floating point and floating point scaling leading to >60db SQNR with only 12-bit inputs.
- Throughput scalability, possible due to the use of array-based algorithms. (Xilinx and Altera circuits compared in Section V are hardwired and thus provide no option for increasing throughputs.)
- High clock speeds, made possible by the inherently localized circuit operation which reduces routing delays.
- Capability to do power-of-two calculations as well as non-power-of-two. (LTE wireless protocol also requires power-of-two FFTs.)
References
- Application Note: Xilinx Discrete Fourier Transform v3.1, DS615 Mar. 1, 2011. (Later versions provide no support for Vertix-6 FPGAs.)
- Application Note: Altera DFT/IDFT Reference Design, 464, May 2007.
- J. G. Nash, “High-throughput programmable systolic array FFT architecture and FPGA implementations”, Int. Conf. on Computing, Networking and Communications (ICNC), Honolulu, HI, Feb. 2014, pp. 878-884.
If you wish to download a copy of this white paper, click here