FPGA Coprocessors: Hardware IP for Software Engineers
by Robert Cottrell, Altera
High Wycombe, UK
The concept and application of FPGA Coprocessors as a means of delivering hardware IP to software and system engineers is presented. The hardware and software architecture of FPGA coprocessors is described in detail. A Turbo encoder FPGA coprocessor reference design is described as an example.
It is widely recognized that FPGAs are very efficient for the implementation of many computationally complex digital signal processing algorithms. In comparison with programmable DSP processors, they can deliver a lower-cost and lower-power solution for a variety of algorithms. FPGAs, however, do not offer the same flexibility and ease of design as DSP processors. FPGA coprocessors are blocks of hardware IP that can easily be integrated into a processor-based system in order to offload some of the most computationally intensive tasks.
A combination of standardized hardware interfaces, design automation tools to assemble a system, and a standardized software API forms the concept of FPGA coprocessors. The design automation tools and software API make it possible for system and software engineers to make use of hardware IP with a minimum of actual FPGA design. The standardized interfaces provide orthogonality. If an IP designer conforms to the standards, an IP block can be used as a coprocessor with any of the supported processors. In a similar way, once the necessary interface hardware and software drivers have been created, all FPGA coprocessor IP can be used with that processor.
FPGA coprocessors are widely applicable. They can be used with standard DSP processors to offload computationally intensive tasks, or to provide digital signal processing capabilities to a general-purpose microprocessor. They can be used with processors embedded within FPGAs as hard or soft logic, or with off-FPGA processors.
FPGA coprocessors by definition implement computationally complex tasks, mainly in the field of digital signal processing. Candidate functions include FIR filters, FFT processors, and error correction and detection. Altera has already demonstrated the concept with an encoder for Turbo convolutional codes.
Applications of coprocessors are varied, but include software-defined radio as well as error correction and detection in base stations for mobile communications.
Hardware Architecture
In order to be used as an FPGA coprocessor, a hardware IP block needs to use defined standard interfaces. The design automation tools can take this IP block and connect it to ancillary functions such as FIFOs, DMA controllers and bus interfaces to create a system. Altera has chosen to use a defined subset of the Atlantic [1, 2] interface for the data input and output ports of an IP block and a simple Avalon [3] slave interface for control and status.
Two hardware architectures have been defined. The first, known as “Type F” (see Figure 1), uses DMA controllers in the FPGA to move data between the processor’s memory and the coprocessor, and is particularly suited to systems where the processor’s main memory is accessible directly from the FPGA. This is typically true when the processor itself is included in the FPGA.
Figure 1: FPGA Coprocessor Architecture with DMA Controller in FPGA (Type F)
Figure 2: FPGA Coprocessor Architecture with Off-FPGA DMA Controller (Type E)
The second architecture, known as “Type E” (see Figure 2), uses an off-FPGA DMA controller, typically built into a standard processor chip, and is particularly suited to systems where the main memory is not directly accessible from the FPGA. Both architectures make use of the same IP block, but different ancillary functions. These ancillary functions are shared between all IP blocks and so need only be designed once for all coprocessors. The same design automation tool is used to assemble the systems, and both architectures present the same software API to the user.
FIFOs and Data Interfaces
In both architectures, the IP block receives data from a FIFO and sends data to a FIFO. These FIFOs are necessary because the DMA controllers need to transfer blocks of data for efficient operation. The FPGA coprocessor standard could have been written to require each IP block to include these FIFOs, but since these FIFOs are essentially identical in all IP blocks, it is more efficient to keep them separate and use design automation software to connect them together.
The interface between the FIFOs and the IP block is a defined configuration of the Atlantic interface. The IP block itself will include Atlantic master interfaces: a master sink on the input side and a master source on the output side. The FIFOs will include Atlantic slave interfaces: a slave source connected to the master sink on the IP block, and a slave sink connected to the master source. The data width of this Atlantic interface is determined by the IP block, but must be 8, 16, 32, 64 or 128 bits.
The interface between the FIFOs and the DMA controllers in Type F architectures, or the external processor interface logic (EPIF) in Type E architectures, is also Atlantic. The FIFO side is an Atlantic slave, and the DMA or EPIF side is an Atlantic master. The data width of this interface is the data width of the Avalon bus in a Type F system, or is determined by the EPIF in a Type E system. It is typically 32 bits. The threshold on this Atlantic interface needs to be large enough for the DMA controller to transfer reasonably large blocks of data, which is what makes the transfer efficient. This matters most in Type E systems.
The FIFO is responsible, among other things, for data packing and unpacking to handle the different widths of the Atlantic interfaces on the IP block and the DMA controller or EPIF.
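The width conversion described above can be modelled in software. The following is a minimal sketch, assuming an 8-bit IP block interface and a 32-bit DMA or EPIF interface. The function names and the byte order (first symbol in the least significant byte) are assumptions made for illustration and are not taken from the Atlantic specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the FIFO's packing step: gather 8-bit symbols
 * into 32-bit words. Assumed byte order: first symbol in bits 7..0.
 * Returns the number of 32-bit words written to out. */
size_t pack8to32(const uint8_t *in, size_t n, uint32_t *out)
{
    size_t words = (n + 3) / 4;              /* round up to whole words */
    assert(in != NULL && out != NULL);
    for (size_t w = 0; w < words; w++) {
        uint32_t word = 0;
        for (size_t b = 0; b < 4; b++) {
            size_t i = 4 * w + b;
            if (i < n)                       /* final word is zero-padded */
                word |= (uint32_t)in[i] << (8 * b);
        }
        out[w] = word;
    }
    return words;
}

/* Hypothetical model of the unpacking step: split 32-bit words back
 * into n 8-bit symbols, using the same assumed byte order. */
void unpack32to8(const uint32_t *in, size_t n, uint8_t *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)(in[i / 4] >> (8 * (i % 4)));
}
```

In a real system this logic lives in the parameterized FIFO hardware. A software model like this is mainly useful for checking that buffers prepared by the driver match what the IP block expects.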
A key requirement of the FPGA coprocessor concept is the availability of a parameterized Atlantic FIFO that can be configured automatically by design automation software to connect an IP block as an FPGA coprocessor.
External Processor Interfaces
Type E systems require external processor interface (EPIF) hardware to be available for the chosen processor. The same EPIF can be used for any coprocessor. The EPIF needs Atlantic interfaces for connections to the FIFOs, and an Avalon master for connection to the control/status interfaces and any other peripherals that the user wishes to connect. The EPIF must be able to raise an interrupt to the external processor when it has space ready to accept a new packet from the processor and when it has data ready to transfer to the processor.
The number of Atlantic ports on the EPIF must be configurable so that multiple coprocessors can be connected to an off-FPGA processor through one EPIF.
Multiple Clock Domains
It is not necessary for the IP block to be in the same clock domain as the processor or external processor interface. It is common, for example, for dedicated DSP hardware blocks to run with a much faster clock than a general-purpose processor. The inclusion of FIFOs in the architecture makes this relatively simple: these FIFOs can be used to handle the transfer between clock domains. Any control/status interface on the IP block will also need to include logic to handle transfers between the clock domains safely. The design automation software need only add hardware for multiple clock domains if the user requires this function.
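The text does not say how the FIFOs cross clock domains. One widely used technique for dual-clock FIFOs is to exchange the read and write pointers in Gray code, so that only one bit changes per increment and a pointer sampled in the other domain is never wildly wrong. The sketch below is a software model of that technique, given as an assumption about a typical implementation and not as a description of Altera's FIFO. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Binary to Gray code: successive values differ in exactly one bit. */
uint32_t bin_to_gray(uint32_t b)
{
    return b ^ (b >> 1);
}

/* Gray code back to binary by folding the higher bits down with XOR. */
uint32_t gray_to_bin(uint32_t g)
{
    g ^= g >> 16;
    g ^= g >> 8;
    g ^= g >> 4;
    g ^= g >> 2;
    g ^= g >> 1;
    return g;
}

/* Fill level computed from Gray-coded pointers.
 * Assumptions: depth is a power of two, and both pointers count
 * modulo 2 * depth so that "full" and "empty" can be told apart. */
uint32_t fifo_fill_level(uint32_t wr_gray, uint32_t rd_gray, uint32_t depth)
{
    assert(depth != 0 && (depth & (depth - 1)) == 0);
    return (gray_to_bin(wr_gray) - gray_to_bin(rd_gray)) & (2 * depth - 1);
}
```

In hardware the Gray-coded pointer would also pass through synchronizer flip-flops in the receiving clock domain. That step has no software equivalent and is omitted here.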
The API for a coprocessor is responsible for taking a packet of data stored in the processor’s memory, sending it to the coprocessor, receiving the processed packet of data from the coprocessor and storing it in a buffer in the processor’s memory. There are two modes of operation: in-line, and separate input-output. The in-line case is the simplest: the processor sends a packet of data to the coprocessor and waits for the processed data to be returned before continuing. This simple mode of operation is less efficient, however, because the processor and coprocessor cannot be active at the same time.
The separate input-output mode is more efficient, but more complex. The application software queues up packets of data to be sent to the coprocessor and provides a queue of buffers to be filled with return data. The application software also provides call-back routines that will be called by the driver whenever a packet of data is delivered to or received from the coprocessor. Either interrupt or polling mechanisms can be used by the processor to check if it needs to take any action. In this mode, the processor and co-processor can both be active simultaneously.
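The separate input-output mode can be sketched as a small request queue with two call-backs and a polling entry point. This is an illustrative model only: every name here (copro_queue_t, copro_submit, copro_poll and so on) is hypothetical, the queue depth is arbitrary, and the "coprocessor" is a software stand-in that inverts each byte so that the example can run on a host machine.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_DEPTH 4                      /* arbitrary depth for the sketch */

typedef void (*tx_done_fn)(const unsigned char *input, void *handle);
typedef void (*rx_done_fn)(unsigned char *output, void *handle);

typedef struct {
    const unsigned char *input;            /* packet to send */
    unsigned char *output;                 /* buffer for returned data */
    size_t len;
    void *handle;                          /* passed back to the call-backs */
} request_t;

typedef struct {
    request_t slots[QUEUE_DEPTH];
    size_t head, count;
    tx_done_fn on_tx;
    rx_done_fn on_rx;
} copro_queue_t;

void copro_init(copro_queue_t *q, tx_done_fn on_tx, rx_done_fn on_rx)
{
    assert(q != NULL);
    q->head = 0;
    q->count = 0;
    q->on_tx = on_tx;
    q->on_rx = on_rx;
}

/* Queue a packet and its return buffer. Returns 0, or -1 if the queue is full. */
int copro_submit(copro_queue_t *q, const unsigned char *in,
                 unsigned char *out, size_t len, void *handle)
{
    request_t *r;
    if (q->count == QUEUE_DEPTH)
        return -1;
    r = &q->slots[(q->head + q->count) % QUEUE_DEPTH];
    r->input = in;
    r->output = out;
    r->len = len;
    r->handle = handle;
    q->count++;
    return 0;
}

/* Polling entry point: completes at most one queued request per call and
 * returns the number completed. The byte inversion stands in for the
 * real coprocessor. */
int copro_poll(copro_queue_t *q)
{
    request_t r;
    if (q->count == 0)
        return 0;
    r = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    for (size_t i = 0; i < r.len; i++)
        r.output[i] = (unsigned char)~r.input[i];
    q->on_tx(r.input, r.handle);           /* input buffer may be reused */
    q->on_rx(r.output, r.handle);          /* output buffer is now valid */
    return 1;
}

/* Example call-backs: the handle points at counters owned by the application. */
typedef struct { int sent; int received; } app_stats_t;

void app_tx_done(const unsigned char *input, void *handle)
{
    (void)input;
    ((app_stats_t *)handle)->sent++;
}

void app_rx_done(unsigned char *output, void *handle)
{
    (void)output;
    ((app_stats_t *)handle)->received++;
}
```

An application would call copro_submit whenever it has data, carry on with other work, and call copro_poll periodically (or from an interrupt handler) to collect results. The in-line mode is the special case of submitting one packet and polling until it completes.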
There are two parts to the software driver that need to be provided. The high-level part that includes the API routines called by the application software is coprocessor-specific. The low-level part is processor-specific and includes a driver for the DMA controller being used to transfer data between the processor’s memory and the co-processor. A well-defined interface between these two parts, effectively an API for the DMA controller, makes it easy to add support for new processors. This low-level API simply transfers data between memory buffers and a coprocessor with no knowledge of the semantics of the data. Once a low-level driver has been written for a new processor, all coprocessors are automatically supported on that processor. Similarly, a new coprocessor is immediately usable on all supported processors.
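The split between the two driver layers might look like the following sketch. The low-level interface is modelled as a table of function pointers that moves raw bytes, and the high-level routine is written purely in terms of that table. The names dma_ops_t, copro_process_inline and the loopback back end are all hypothetical; the loopback simply returns the data it was sent, standing in for a real DMA controller and coprocessor.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical low-level API: one implementation per processor or DMA
 * controller. It moves bytes and knows nothing about their meaning. */
typedef struct {
    int (*send)(void *ctx, const void *buf, size_t len);  /* memory to coprocessor */
    int (*recv)(void *ctx, void *buf, size_t len);        /* coprocessor to memory */
    void *ctx;                                            /* back-end private state */
} dma_ops_t;

/* Hypothetical high-level routine in in-line mode: send a packet, then
 * wait for the result. It is independent of the processor in use. */
int copro_process_inline(const dma_ops_t *dma, const void *in,
                         void *out, size_t len)
{
    assert(dma != NULL && dma->send != NULL && dma->recv != NULL);
    if (dma->send(dma->ctx, in, len) != 0)
        return -1;
    return dma->recv(dma->ctx, out, len);
}

/* Loopback back end for host testing: stores the packet and hands it back. */
typedef struct {
    unsigned char data[64];
    size_t len;
} loopback_t;

int loopback_send(void *ctx, const void *buf, size_t len)
{
    loopback_t *lb = (loopback_t *)ctx;
    if (len > sizeof lb->data)
        return -1;                         /* packet too large for the model */
    memcpy(lb->data, buf, len);
    lb->len = len;
    return 0;
}

int loopback_recv(void *ctx, void *buf, size_t len)
{
    loopback_t *lb = (loopback_t *)ctx;
    if (len != lb->len)
        return -1;                         /* nothing of that size is pending */
    memcpy(buf, lb->data, len);
    return 0;
}
```

Supporting a new processor then means supplying a new dma_ops_t table. The high-level code, and therefore every coprocessor's API, is unchanged.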
There are a number of situations where it is necessary to change the configuration of a coprocessor between blocks of data. For example, mobile communication systems use different convolutional codes for different channels. A Viterbi decoder coprocessor may be required to decode blocks of data from these different channels. A FIR filter coprocessor may need to work with various sets of coefficients; an FFT coprocessor may be configured to perform forward or inverse transforms and operate on different block sizes.
One possible approach would be for the processor to re-configure the coprocessor through its control/status interface. In this case, the processor would need to wait for the coprocessor to fully process the previous block before applying the new configuration. This is suitable for many applications, but not when frequent re-configuration is required. Another approach is to intersperse control packets with the data packets that are transmitted to the coprocessor. These will be processed in order and enable the coprocessor to be reconfigured between each data packet if necessary.
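The second approach, control packets interleaved with data packets, can be illustrated with a tagged packet stream. In this sketch the coprocessor is a trivial software model that multiplies each sample by a configurable gain, and a control packet replaces the gain. The packet layout, the type tag and all names are assumptions made for illustration; the source does not define a packet format.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { PKT_CONTROL, PKT_DATA } pkt_type_t;

/* Hypothetical packet: a type tag plus a payload of samples or settings. */
typedef struct {
    pkt_type_t type;
    const int *payload;
    size_t len;
} packet_t;

/* Software stand-in for a configurable coprocessor. */
typedef struct {
    int gain;
} copro_model_t;

/* Process packets strictly in arrival order. A control packet updates the
 * configuration; a data packet is processed with whatever configuration
 * is current at that point. Returns the number of samples written to out. */
size_t copro_run(copro_model_t *c, const packet_t *pkts, size_t npkts, int *out)
{
    size_t written = 0;
    assert(c != NULL && pkts != NULL && out != NULL);
    for (size_t p = 0; p < npkts; p++) {
        if (pkts[p].type == PKT_CONTROL) {
            if (pkts[p].len > 0)
                c->gain = pkts[p].payload[0];   /* reconfigure between blocks */
        } else {
            for (size_t i = 0; i < pkts[p].len; i++)
                out[written++] = c->gain * pkts[p].payload[i];
        }
    }
    return written;
}
```

Because ordering is preserved, the processor never has to wait for the previous block to drain before queuing a new configuration, which is the advantage this approach has over writing to the control/status interface.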
When a system always requires two coprocessors to be connected in series, with one coprocessor always taking its input from the output of the preceding one, it is inefficient to transfer the data to and from the processor’s memory in between. This is especially true in Type E systems with an off-FPGA processor. Due to the use of Atlantic interfaces and FIFOs, it is relatively easy to connect coprocessors together in hardware.
The FPGA coprocessor concept makes it very simple for a software engineer or system engineer to make use of hardware IP to offload computationally intensive functions. Once the IP block has been obtained, the entire hardware system can be configured and generated automatically using a design automation tool such as Altera’s SOPC Builder, without the engineer needing to write any HDL code. The tool will assemble the IP block together with the appropriately configured FIFOs, DMA controllers and external processor interfaces. The API provided with the IP can be integrated directly into the user’s application software. The speed at which a system can be assembled makes it easy to explore the design space to achieve the required performance at minimum cost.
Design Example: Turbo Encoder Co-processor Reference Design
Altera has created a Turbo encoder co-processor reference design for high-speed downlink packet access (HSDPA) links in third-generation mobile communications [4]. The reference design incorporates the Turbo Encoder MegaCore function as an IP block in a Type E coprocessor system. The co-processor was implemented during the development of the specification for FPGA co-processors and does not fully conform to the specification, but it is a useful demonstration of the concept.
The reference design is a Type E system, and the external processor interface is a Texas Instruments (TI) 32-bit asynchronous external memory interface (EMIF). In this case, the external processor interface logic includes an Avalon master interface that can be used to connect simple slave peripherals implemented in the FPGA, such as UARTs, to the TI processor.
Software Interface
The API for the Turbo Encoder Co-Processor is illustrated by the following code fragments.
typedef void (* TXCALLBACK)(
    void * handle);

typedef void (* RXCALLBACK)(
    uchar * output,
    void * handle);

... turbo_encode(
    ...
    const uchar * input,
    uchar * output,
    void * handle);
The turbo_encode routine is called to initiate the turbo encoding of a block of data. The user passes in a block of data to be encoded, and a buffer to receive the encoded data. The interface can operate in polled mode or interrupt mode. The user supplies two routines, txcallback and rxcallback, which are called respectively when the input data has been consumed and when the encoded data has been written back to the buffer. In interrupt mode, these routines are called by the interrupt service routine. In polled mode, they are called by the turbo_poll routine, which must be regularly called by the user.
Figure 3: Hardware Architecture of Turbo Encoder Co-Processor Reference Design
The reference design has been demonstrated in hardware using an Altera Stratix EP1S25 DSP Development Board [5], part of the DSP Development Kit, Stratix Edition, connected to a Texas Instruments C6711 DSP Starter Kit (DSK) [6] using the TI processor’s 32-bit External Memory Interface (EMIF).
The FPGA Coprocessor architecture greatly simplifies the process of offloading computationally intensive functions from a programmable processor into dedicated hardware. This is achieved through a combination of standardized hardware and software interfaces, and the use of design automation tools. A suitably packaged block of IP can be implemented in an FPGA with a minimal requirement for hardware design, and can be accessed through a standardized software API.
[1] Atlantic Interface, Functional Spec A-FS-13-3.0, Altera Corporation, June 2002.
[2] Atlantic: a high performance datapath interface for SOPC Designs, Robert Cottrell, Proceedings of International Workshop on IP-Based System-On-Chip Design, Grenoble, October 2002.
[3] Avalon Bus Specification Reference Manual, MNL-AVABUSREF-2.3, Altera Corporation, July 2003.
[4] Turbo Encoder Co-processor Reference Design, Application Note AN-317-1.0, Altera Corporation, September 2003.
[5] Stratix EP1S25 DSP Development Board, Datasheet DS-STXDSPDVBD-1.4, Altera Corporation, May 2003.
[6] C6711 DSP Starter Kit (DSK) TMDS320006711, http://focus.ti.com/docs/tool/toolfolder.jhtml?PartNumber=TMDS320006711