Radar, navigation and guidance systems process data that is acquired using arrays of sensors. The energy delta from sensor to sensor over time holds the key to information such as targets, position or course. This two-dimensional array of data, often referred to as an "observation matrix," must be solved as a set of linear equations to extract the desired information. Solution methods include matrix inversion, factorization, adaptive filtering and singular value decomposition, and are typically performed using floating-point arithmetic to provide sufficient dynamic range and precision for the input data. Doing so, however, limits the performance of a system.
Today's DSP-oriented FPGAs, such as the Xilinx Virtex 4 and Altera Stratix II, provide far greater performance than a floating-point DSP processor for this class of applications and offer the flexibility to extend the dynamic range of a fixed-point implementation significantly beyond the limitations of a fixed-point DSP processor. Singular value decomposition (SVD) for an 8x8 matrix can run over 50 times faster in fixed-point arithmetic on an FPGA than in a floating-point implementation running on a TI TMS320C67x DSP processor. Achieving this performance requires a hardware architecture that utilizes 261 of the Virtex 4 DSP48 multipliers running in parallel at 200 MHz.
These are challenging applications to design on any hardware platform. Determining an FPGA architecture that effectively utilizes the DSP blocks to achieve a worthwhile performance advantage adds significantly to this design complexity. The type of architectural tradeoff analysis necessary to determine an optimal solution, however, is well-suited to a high-level DSP design methodology.
The Fixed-Point Dynamic Range "Issue"
As stated earlier, this class of applications makes extensive use of matrix inversion, matrix factorization, singular value decomposition and division. These operations require cascaded serial multiply operations on the input data that limit the dynamic range of a system. The example below shows QR decomposition (QRD), a popular method of determining the inverse of a matrix.
MATLAB Example of QRD Matrix Inverse:
    % do QR factorization
    [Q,R] = qr_factor(Xtmp);
    % find inverse of R, working from the last row upward
    Rinv = zeros(M,N);
    for row2 = M+7:-1:1
        if row2 > 7
            Rtmp = R(row2-7,row2-7);
        end
        RDiagInv = 1/Rtmp;
        if row2 < M+1
            Rinv(row2,row2) = RDiagInv;
            % off-diagonal entries of this row by back-substitution
            for col2 = row2+1:N
                accum = 0;
                for t = row2+1:N
                    accum = accum - (R(row2,t) * Rinv(t,col2));
                end
                Rinv(row2,col2) = accum * RDiagInv;
            end
        end
    end
    % inverse of input
    Xi = Rinv * Q';
This algorithm, when implemented with Givens rotations, requires five cascaded multiplies and one divide to be performed on the input data.
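For readers who want to experiment with the data flow outside MATLAB, the following is a minimal floating-point sketch of a QRD matrix inverse in Python. It is not the AccelChip listing: the function names givens_qr, invert_upper and qrd_inverse are hypothetical, the factorization is a textbook Givens rotation sweep, and the triangular inverse reads the diagonal of R directly rather than reproducing the offset indexing of the listing above.

```python
import math

def givens_qr(a):
    """Factor a square matrix into Q (orthogonal) and R (upper triangular)."""
    n = len(a)
    r = [row[:] for row in a]                       # working copy, becomes R
    qt = [[float(i == j) for j in range(n)] for i in range(n)]  # accumulates Q'
    for col in range(n):
        for row in range(n - 1, col, -1):           # zero the column bottom-up
            x, y = r[row - 1][col], r[row][col]
            h = math.hypot(x, y)
            if h == 0.0:
                continue                            # entry is already zero
            c, s = x / h, y / h
            for m in (r, qt):                       # rotate rows row-1 and row
                for k in range(n):
                    u, v = m[row - 1][k], m[row][k]
                    m[row - 1][k] = c * u + s * v
                    m[row][k] = -s * u + c * v
    q = [[qt[j][i] for j in range(n)] for i in range(n)]  # Q is the transpose
    return q, r

def invert_upper(r):
    """Invert an upper triangular matrix by back-substitution."""
    n = len(r)
    rinv = [[0.0] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        d = 1.0 / r[i][i]                           # the single divide per row
        rinv[i][i] = d
        for j in range(i + 1, n):
            acc = 0.0
            for t in range(i + 1, n):
                acc -= r[i][t] * rinv[t][j]
            rinv[i][j] = acc * d
    return rinv

def qrd_inverse(a):
    """Inverse of a via A = QR, so inv(A) = inv(R) * Q'."""
    q, r = givens_qr(a)
    rinv = invert_upper(r)
    n = len(a)
    return [[sum(rinv[i][k] * q[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

xi = qrd_inverse([[4.0, 3.0], [6.0, 3.0]])
# xi is approximately [[-0.5, 0.5], [1.0, -0.667]]
```

Each entry of the result passes through the rotation multiplies, the back-substitution multiplies and one reciprocal, which is the chain of cascaded operations that drives the bit growth discussed next.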
Fixed-point arithmetic dictates that the number of output bits of a multiply operation be equal to the sum of the bit widths of the two input operands if all precision is to be maintained. If left untruncated, bit widths can grow quickly, as shown in Figure 1.
Figure 1 - Fixed-Point Bit Growth of Multiplies
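The growth rule can be tabulated with a few lines of Python. This is an illustrative sketch only: the helper name cascaded_multiply_widths is hypothetical, and it assumes every stage multiplies the running result by a fresh operand of the same width with no truncation, which may not match the exact widths drawn in Figure 1.

```python
def cascaded_multiply_widths(input_bits, stages):
    """Full-precision result width after each stage of a multiplier chain.

    Assumes each stage multiplies the running result by a new operand
    of input_bits bits and that nothing is truncated between stages.
    """
    widths = []
    width = input_bits
    for _ in range(stages):
        width += input_bits          # product width = sum of operand widths
        widths.append(width)
    return widths

print(cascaded_multiply_widths(16, 5))   # [32, 48, 64, 80, 96]
```

Under these assumptions, five cascaded multiplies on 16-bit data would need a 96-bit result if no bits were ever trimmed, which is why a practical design has to choose where to truncate.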
The Flexibility of the FPGA Fabric
FPGA logic is not limited to specified bit widths for internal busses and may grow as needed to meet the demands of the application. This bit growth comes at the expense of added hardware which, if left unbounded, can be significant. Reasonable internal bit growth beyond 16 bits, however, can improve the dynamic range of a fixed-point implementation enough to provide a viable hardware solution for systems using up to 16 bits.
Exploring the bit growth requirements of the QRD matrix inverse shows that quantizing the inputs to 16 bits signed offers an integer dynamic range of -32,768 to +32,767. Figure 2 shows the AccelChip DSP Synthesis tool's "Fixed-Point Report," which lists the quantizations used for the QRD matrix inverse.
In Figure 2 the "Quantizer" column nomenclature is as follows: "fixed" means signed twos-complement, "ufixed" means unsigned binary, "floor" is the rounding mode applied at the LSB and "wrap" is the overflow mode applied at the MSB. The numbers in square brackets represent the word length and the fraction length (the binary point location), respectively. For more information on this nomenclature refer to the MATLAB help for the command "quantizer."
Figure 2 - AccelChip Fixed-Point Report
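To make the "floor" and "wrap" behavior concrete, here is a small Python model of a signed quantizer. It is a sketch written for this discussion, not the MATLAB quantizer implementation or AccelChip code: the function name quantize_fixed is hypothetical, and the [16 8] format in the usage lines is chosen only for illustration.

```python
import math

def quantize_fixed(value, word_length, frac_length):
    """Signed twos-complement quantization with floor rounding and wrap overflow."""
    scale = 1 << frac_length
    scaled = math.floor(value * scale)      # floor: round toward -infinity at the LSB
    modulus = 1 << word_length
    half = modulus // 2
    wrapped = (scaled + half) % modulus - half   # wrap: discard bits above the MSB
    return wrapped / scale

print(quantize_fixed(1.30, 16, 8))     # 1.296875
print(quantize_fixed(-1.30, 16, 8))    # -1.30078125
print(quantize_fixed(130.0, 16, 8))    # -126.0 (overflow wraps around)
```

The last line shows why the number of integer bits matters: with wrap overflow, a value that exceeds the integer range does not clip but reappears with the wrong sign and magnitude.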
The variable "RDiagInv," which is the result of a divide operation, is quantized using 32 total bits with 17 integer bits. Maintaining an adequate number of integer bits here is critical to maintaining an acceptable response of the inverse function. The flexibility offered by the Virtex 4 FPGA allows for the necessary bit growth of the integer bits to occur while some reasonable trimming of the fractional bits may take place.
Multiplying Operands Greater Than 16-Bits
The Xilinx Virtex 4 device includes dedicated hardware multipliers in the DSP48 blocks that support up to 18 input bits with up to 48 bits of accumulation. Generous as these widths are, they do not place a hard limit of 18 bits on the internal busses. Multiplication operations requiring more than 18 bits and accumulations requiring more than 48 bits can be constructed using additional DSP48 blocks while maintaining exceptional performance in excess of 300 MHz.
Figure 3 - 32 bit multiply implemented in Virtex 4 DSP48 Blocks
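One common way to build a wide multiplier from narrow ones is to split each operand into chunks and sum the shifted partial products. The Python sketch below illustrates that idea under stated assumptions: the names CHUNK, split and wide_multiply are hypothetical, each operand is split into a signed upper part and an unsigned 17-bit lower part so that every partial product fits a signed 18x18 multiplier, and the exact block arrangement of Figure 3 is not reproduced.

```python
CHUNK = 17  # low chunk width; a 17-bit unsigned value fits a signed 18-bit input

def split(x):
    """Split a signed integer into a signed high part and an unsigned low part."""
    return x >> CHUNK, x & ((1 << CHUNK) - 1)

def wide_multiply(a, b):
    """Multiply two signed operands of up to 35 bits using four 18x18 products."""
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    for part in (a_hi, a_lo, b_hi, b_lo):
        # every partial-product input must fit a signed 18-bit multiplier port
        assert -(1 << 17) <= part < (1 << 17), "part exceeds a signed 18-bit input"
    high = (a_hi * b_hi) << (2 * CHUNK)
    middle = (a_hi * b_lo + a_lo * b_hi) << CHUNK
    low = a_lo * b_lo
    return high + middle + low

a, b = 123456789, -987654321
print(wide_multiply(a, b) == a * b)   # True
```

In this model a 32-bit by 32-bit signed multiply costs four narrow multiplies plus shifted additions; the 48-bit accumulator width is not modeled here.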
The FPGA Performance Advantage
The real advantage of an FPGA implementation is realized when the hardware is architected to support multiple DSP operations running concurrently on a single device. Figure 4 shows the block diagram for a sensor array processing application that includes pre-filtering, beamforming, adaptive filtering and post processing.
Figure 4 - Sensor Array Processing Block Diagram
Implementing this application in a floating-point processor requires either multiple chips or a significant compromise in performance to allow resource sharing of the limited multiplier resources between the multiple DSP operations. A single FPGA, however, supports the entire operation, providing a performance advantage.
The 500 Multiplier Advantage!
The XC4VSX55 device offers a peak performance capacity that is 512 times greater than that of the TI TMS320C67x floating-point processor for multiplier-dominated designs. The C67x offers 2 floating-point data paths, each containing a single multiplier that can operate at up to 250 MHz, to provide a peak performance of 500 MFLOPS. The XC4VSX55 includes 512 dedicated DSP48 blocks, each containing 1 signed 18x18 multiplier capable of running at 500 MHz, providing the fixed-point equivalent of a peak performance of 256 GFLOPS. Granted, this is a simplistic method for comparing the performance capacity of a floating-point vs. a fixed-point device, but this comparison should provide a sense of the possibilities the FPGA architecture has to offer.
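The arithmetic behind this comparison is simply the multiplier count times the clock rate. The short Python sketch below reproduces it using only the figures quoted above; the helper name peak_multiplies_per_second is hypothetical, and the model counts one multiply per multiplier per clock with no allowance for memory bandwidth or control overhead.

```python
def peak_multiplies_per_second(multipliers, clock_hz):
    """Peak multiply rate, assuming one result per multiplier per clock."""
    return multipliers * clock_hz

dsp = peak_multiplies_per_second(2, 250_000_000)      # C67x: 2 multipliers at 250 MHz
fpga = peak_multiplies_per_second(512, 500_000_000)   # XC4VSX55: 512 DSP48 blocks at 500 MHz

print(dsp / 1e6)     # 500.0 (millions of multiplies per second)
print(fpga / 1e9)    # 256.0 (billions of multiplies per second)
print(fpga / dsp)    # 512.0
```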
Maximizing the performance of this system requires that partial parallelism be implemented in the key areas of the design that will have the greatest impact on overall performance. The added hardware that results from this additional parallelism must not exceed the available resources of the target FPGA. The number of architectural possibilities a designer must evaluate is considerable and grows exponentially with the size of the system, which makes determining an optimal hardware architecture a tedious and time-consuming design task.
AccelChip® provides a high-level design methodology that greatly simplifies this process. Radar, navigation and guidance systems can be described in MATLAB using loops and vector and matrix multiplies. These operations can be automatically "unrolled" during the algorithmic synthesis process, providing designers a rapid way to explore the impact of parallelism on different blocks of the system without modifying their golden source. By using an automated flow, the final solution can be easily tailored to maximize the available resources of the target FPGA. Table 1 provides an example of how design exploration can be used to tailor the performance of a QRD-RLS adaptive filter.
| # Multipliers | Performance (MSPS) |
| 1 | 9.5 |
| 41 | 100 |
Table 1 - QRD-RLS Adaptive Filter Performance vs. Multipliers
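The effect of partially unrolling a multiply-accumulate loop can be modeled in a few lines of Python. This is an idealized sketch, not AccelChip's synthesis behavior: the function name mac_unrolled and its parallel parameter are hypothetical, each pass of the loop stands for one clock cycle in which up to that many multipliers fire together, and pipeline latency and memory access are ignored, so the model does not predict the figures in Table 1.

```python
def mac_unrolled(xs, ws, parallel):
    """Dot product computed `parallel` multiplies at a time.

    Returns (result, cycles), where each cycle models up to `parallel`
    hardware multipliers operating concurrently.
    """
    acc = 0
    cycles = 0
    for start in range(0, len(xs), parallel):
        chunk_x = xs[start:start + parallel]
        chunk_w = ws[start:start + parallel]
        acc += sum(x * w for x, w in zip(chunk_x, chunk_w))
        cycles += 1
    return acc, cycles

xs = list(range(1, 9))      # 8 samples
ws = [2] * 8                # 8 coefficients
print(mac_unrolled(xs, ws, 1))   # (72, 8)
print(mac_unrolled(xs, ws, 4))   # (72, 2)
```

The result is unchanged while the cycle count falls as multipliers are added, which is the tradeoff a designer explores when choosing how far to unroll each block against the resources of the target device.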
AccelChip offers both standalone IP cores and an algorithmic synthesis environment based on MATLAB for designing fixed-point implementations of radar, navigation and guidance systems. The quickest path to hardware is to use an AccelCore® or AccelWare® IP core. AccelCore IP provides users with a synthesizable RTL model, along with documentation and a testbench, that can be incorporated into a larger design through RTL instantiation. The AccelCore library includes matrix inversion, matrix factorization and singular value decomposition functions. The AccelWare IP library includes over 50 synthesizable MATLAB models that can be combined at the MATLAB level with user-defined functionality and synthesized into VHDL or Verilog with AccelChip DSP Synthesis. This form of IP is easy to integrate into larger system-level models defined in MATLAB.
Figure 5 - AccelWare IP Generation Form for QR Inverse
AccelChip® DSP Synthesis provides complete flexibility to define and implement custom architectures for radar, navigation and guidance systems using floating-point MATLAB. AccelChip provides automated floating- to fixed-point conversion to assist in solving the complex quantization issues resulting from the cascaded multiply and divide operations used in matrix inversion and factorization. Once an acceptable fixed-point model is determined, users can rapidly explore performance versus hardware tradeoffs using algorithmic synthesis. Here the number of dedicated hardware multipliers used in the design can be quickly increased to improve performance and take full advantage of the flexibility of the FPGA architecture.
The performance advantages of a Xilinx Virtex 4 or an Altera Stratix II FPGA are now available to radar, navigation and guidance system designers requiring up to 16 bits of input dynamic range. Realizing this performance advantage requires a high-level design methodology, such as the one offered by AccelChip, to craft a hardware architecture that fully utilizes the available FPGA DSP resources in a timely manner. Technical white papers and application notes are available for download at www.accelchip.com or by e-mail request to email@example.com.
TMS320C67x Floating Point DSP Performance
Report on NAG Benchmark Tests for SUN SMPs, The University of Liverpool
Comparing Fixed-and Floating-Point DSPs, Texas Instruments
A BDTI Analysis of the Texas Instruments TMS320C67x, BDTI