How to implement double-precision floating-point on FPGAs
October 03, 2007 -- pldesignline.com
Floating-point arithmetic is used extensively in applications across many market segments. These applications often require a large number of calculations and are prevalent in financial analytics, bioinformatics, molecular dynamics, radar, and seismic imaging, to name a few. Beyond integer and single-precision 32-bit floating-point math, many applications demand higher precision, forcing the use of double-precision 64-bit operations. This article demonstrates the double-precision floating-point performance of FPGAs using two different approaches. First, a theoretical "paper and pencil" calculation is used to demonstrate peak performance. This type of calculation may be useful for raw comparisons between devices, but it is somewhat unrealistic: it assumes data is always available to feed the device, and it does not take into account memory interfaces and latencies, place-and-route constraints, and other aspects of an actual FPGA design. Second, real results from a double-precision matrix multiply core that can easily be extended to a full DGEMM benchmark are presented, and the real-world constraints and challenges of achieving such results are discussed in detail.
Introduction
An increasing number of applications in many vertical market segments, from financial analytics to military radar to various imaging applications, rely on computations with floating-point (FP) numbers. These applications implement various basic functions and methods such as fast Fourier transforms (FFTs), finite impulse response (FIR) filters, synthetic aperture radar (SAR), matrix math, and Monte Carlo simulation. Many of these implementations use single-precision FP, where FPGAs can provide up to ten times the sustained performance of traditional CPUs. Recently, there has been increasing interest in double-precision performance to see how well FPGAs can compete with CPUs, especially for designs with power and cooling constraints.
In a recent article titled FPGA Floating-Point Performance – A Paper and Pencil Evaluation, the author, Dave Strenski, discusses how to estimate the double-precision (64-bit) peak FP performance of an FPGA. In this article, his method is evaluated and, more importantly, extended with "real-world" considerations for estimating the sustained FP performance of an FPGA. These considerations are validated using a matrix multiplication design running in an Altera Stratix II FPGA.
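To give a feel for the flavor of such a paper-and-pencil calculation, the minimal sketch below multiplies the number of FP units that fit in a device by an assumed clock rate. The unit counts and fMAX here are illustrative placeholders, not figures from Strenski's article or from any particular device.

```c
#include <stdio.h>

/* Back-of-the-envelope peak estimate in the spirit of a
 * "paper and pencil" evaluation. All numbers below are
 * illustrative placeholders, not measured device data. */
int main(void)
{
    const int num_multipliers = 32;    /* 64-bit FP multipliers that fit */
    const int num_adders      = 32;    /* 64-bit FP adders that fit      */
    const double fmax_mhz     = 200.0; /* assumed achievable clock       */

    /* Each unit retires one FP operation per cycle. */
    double peak_gflops = (num_multipliers + num_adders) * fmax_mhz / 1000.0;
    printf("Theoretical peak: %.1f GFLOPS\n", peak_gflops);
    return 0;
}
```

With these assumed numbers, the device would peak at 12.8 GFLOPS; the rest of the article is concerned with how much of such a peak survives contact with a real design.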
The double-precision general matrix multiply (DGEMM) routine is referenced here. DGEMM is a common building block for many algorithms and is the most important component of the scientific LINPACK benchmark commonly used on CPUs. The Basic Linear Algebra Subprograms (BLAS) include DGEMM in the Level 3 group. The DGEMM routine calculates the new value of matrix C based on the product of matrix A and matrix B and the previous value of matrix C, using the formula C = αAB + βC (where α and β are scalar coefficients).
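For reference, a plain C version of this operation (a straightforward triple loop, not the tuned BLAS routine or the FPGA core discussed in this article) makes the formula concrete:

```c
#include <stddef.h>

/* Reference (untuned) DGEMM: C = alpha*A*B + beta*C for
 * row-major n x n matrices. Shown only to make the formula
 * concrete; a tuned BLAS library or the FPGA core described
 * here would be used in practice. */
void dgemm_ref(size_t n, double alpha, const double *A,
               const double *B, double beta, double *C)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j]; /* one mul + one add */
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```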
For this analysis, α = β = 1 is used, though any scalar values could be used, since the scaling can be applied during the data transfers in and out. As can be seen, this operation results in a 1:1 ratio of adders to multipliers. This analysis also takes into account the logic required for a microprocessor interface protocol core and adds the following considerations:
- Memory interface module for low latency access to local data
- Data paths from memory interface to FPGA memory
- Data path from FPGA memory to FP cores
- Decrease in FP core fMAX when the FPGA is full
- Unusable FPGA logic due to routing challenges of a full FPGA
The FPGA benchmark focuses on the performance of an implementation of the AB matrix multiplication with data from a locally attached SRAM. Extending this core to include the accumulator that adds in the old value of C is a relatively minor effort.
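To see how the considerations listed above compound, a crude derating model can be applied to the paper-and-pencil peak. Every factor in the following sketch is an assumed placeholder, not a measured Stratix II result; the point is only that several modest-looking deratings multiply together.

```c
#include <stdio.h>

/* Crude sustained-performance model combining the derating
 * factors listed above. Every factor here is an assumed
 * placeholder, not a measured Stratix II result. */
int main(void)
{
    const double peak_gflops    = 12.8; /* from the paper-and-pencil sketch  */
    const double fmax_derate    = 0.85; /* fMAX drop when the device is full */
    const double logic_usable   = 0.90; /* logic lost to routing congestion  */
    const double mem_efficiency = 0.95; /* memory interface stalls/latency   */

    double sustained = peak_gflops * fmax_derate * logic_usable * mem_efficiency;
    printf("Estimated sustained: %.1f GFLOPS\n", sustained);
    return 0;
}
```

Under these assumptions, roughly a quarter of the theoretical peak is lost before any algorithmic inefficiency is considered, which is why the sustained results reported later matter more than the peak numbers.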