Reconfigurable coprocessors create flexibility in DSP apps
By Eberhard Schueler, Product Management, PACT GmbH, Munich, Germany, email@example.com, EE Times
January 4, 2002 (12:37 p.m. EST)
With the advent of 3G and other computationally intensive market opportunities, there has never been a time better suited to the use of high-performance digital signal processors (DSPs). Unfortunately, even the highest-performance DSPs today cannot deliver the horsepower the design engineer demands. ASICs can deliver the performance, but they are expensive and time-consuming to create. The ideal would be a form of "algorithmic coprocessor core" for the existing standard DSP architectures that would provide the necessary performance without compromising the code, or the knowledge base, that the designer has invested in these architectures. Pact and other companies have developed massively parallel, fully reconfigurable cores that represent the next evolutionary step in these DSP designs.
Typical algorithms used on DSPs can be categorized into two classes: control-flow oriented and data-flow oriented. Control flow is characterized by irregular code, whereas data flow is characterized by the processing of large amounts of data (for example, a digital data stream from a smart antenna) by relatively simple and uniform algorithms, typically involving a large number of multiplications. Most applications need both types of algorithm in their normal course of operation. Traditional DSPs are strong in control-flow applications, but are also designed to support data-flow-oriented tasks. Unfortunately, for high-performance data-flow tasks, DSPs do not structurally provide the necessary processing power and bandwidth for the most demanding applications, such as 3G.
Several technologies have been proposed and implemented in the past to close this performance gap. The common denominator of all these new technologies is parallelism. This is not by accident; most of the critical time-consuming signal-processing algorithms do have an implicit parallelism. Nevertheless, architectures differ significantly in the way this parallelism is implemented.
One strategy is to improve the standard DSP architecture itself. Very long instruction word (VLIW) architectures extend the basic Harvard architecture of DSPs with instruction-level parallelism, through the use of multiple arithmetic logic units (ALUs) that operate in parallel. The problem with this approach is that data to be processed in parallel must not have any dependencies. This makes efficient VLIW programming a difficult task for designers and compilers, and limits the number of usable ALUs to a handful of units at best.
Multi-VLIW-DSP cores provide more processing power than the previous approach. However, they suffer from the well-known problems inherent in this form of parallel processing: bus and memory bottlenecks, together with software-based communication mechanisms. Further, they are based on full-featured DSP cores (including caches, program sequencers, pipelines and branch prediction, all of which are complex and power consuming), which again limits the number of parallel cores on a chip.
With both strategies, the cores can still be programmed with classical sequential methodologies, but neither solves the performance problem with streaming data.
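The effect of data dependencies on a VLIW machine can be illustrated with a small scheduling sketch. The function `schedule`, the operation names and the 4-wide machine are hypothetical and chosen only for illustration; real VLIW compilers use far more elaborate schedulers.

```python
def schedule(ops, deps, width):
    """Greedy list scheduling: pack ops into bundles of at most `width`.

    ops  -- operation names in program order
    deps -- dict mapping an op to the set of ops whose results it needs
    """
    done = set()
    remaining = list(ops)
    bundles = []
    while remaining:
        # An op is ready only if every producer finished in an earlier bundle.
        ready = [op for op in remaining if deps.get(op, set()) <= done]
        if not ready:
            raise ValueError("cyclic dependency")
        bundle = ready[:width]
        bundles.append(bundle)
        done.update(bundle)
        remaining = [op for op in remaining if op not in bundle]
    return bundles


# Four independent multiplies fill a 4-wide machine in a single bundle.
independent = schedule(["m0", "m1", "m2", "m3"], {}, width=4)

# A running sum: each add needs the previous result, so the same machine
# issues one operation per bundle and three ALUs sit idle.
chain = schedule(
    ["a0", "a1", "a2", "a3"],
    {"a1": {"a0"}, "a2": {"a1"}, "a3": {"a2"}},
    width=4,
)

print(len(independent), len(chain))  # 1 4
```

The independent multiplies need one bundle, while the dependent chain needs four, which is why adding more ALUs to a VLIW core quickly stops paying off.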
Designers of DSP systems also had another choice to close the performance gap: additional "hardwired" application-specific standard product coprocessors or "soft-wired" field-programmable gate array (FPGA) coprocessors.
Application-specific integrated circuits used as coprocessors (Viterbi coprocessors for wireless applications, for example) are adapted to the specific demands of the designer. The problem is that each algorithm to be accelerated requires a dedicated solution. With design costs and risks climbing ever higher, this approach is not economical for most engineers. It also sacrifices the flexibility of DSPs: Manufacturers often must guess years in advance which algorithm will be needed for a design, which means they cannot take full advantage of evolving standards. Using this design approach is a costly gamble.
Designers can tailor the functionality of FPGAs to the final system. Due to the fine granularity of these devices, system functions as well as signal-processing algorithms can be implemented. The trade-off is the hardware-oriented design methodology (comparable to ASIC designs, including timing analysis) and the chip sizes. The complexity of million-gate designs should not be underestimated. The chip size must always be large enough so that the whole algorithm fits onto the array, because dynamic reconfiguration is very time consuming and not applicable with FPGAs. This makes it difficult to integrate an FPGA as a coprocessor onto a DSP chip: It is hard to choose the size of an FPGA array when one does not know in advance which algorithms will be implemented. Both solutions narrow the performance gap, but there are severe problems with cost, design cycles, development risks and flexibility.
The wireless industry, for instance, demands a processor architecture that is as flexible as a DSP yet delivers ASIC-like performance. An example: In wireless systems, algorithms such as filters, Fourier transforms and Viterbi decoders process continuous streams of data. However, these are the algorithms of today; it is quite likely that the next wireless standard will specify other algorithms and will arrive very soon. Single DSPs cannot deliver the processing performance for the required bit rates, and developing dedicated coprocessors is extremely time-consuming, with existing solutions proving difficult to adapt to emerging standards.
Reconfigurable processor architectures offer a solution to these problems. This class of processors is quite advanced; the instruction flow is replaced by a configuration flow, and single, basic ALU operations are replaced by complex operations performed on whole data streams. In one cycle, a complete algorithm, composed of many basic machine operations such as addition and multiplication, is calculated. This is how reconfigurable processors deliver superior performance. The key components of a reconfigurable processor are the array of ALUs, with their configurable interconnects, and the configuration managers, which load configurations to the array.
The parallel part of an algorithm is derived from a data and control flow graph and is mapped directly to the array. Nodes of the flow graph correspond directly to the ALU opcodes, and the edges are the connections between ALUs. A data stream travels through this network of ALUs and data is computed. Separate algorithms (or partial algorithms) are configured to the array step by step: filter, then Fourier transform, then Viterbi. A block of data streams through each of the configurations and intermediate results are buffered in local RAMs for the next step. This is comparable to classical processors and DSPs, which are controlled by a flow of instructions defining source and destination addresses and the calculation of single words in the ALUs.
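The mapping of a flow graph onto the array, and the step-by-step sequence of configurations, can be sketched in a few lines of software. Everything named here (`run_config`, the `CONFIG` and `OFFSET` tables, the constants) is a hypothetical illustration rather than Pact's actual configuration format: each table entry stands for one ALU with its opcode, each input name stands for one connection, and a plain list stands in for the local RAM between configurations.

```python
import operator

OPCODES = {"add": operator.add, "mul": operator.mul}

# Hypothetical first configuration: y = 3*a + 2*b for every stream element.
# Each entry is one ALU (graph node); each input name is one edge.
CONFIG = {
    "m0": ("mul", "a", "k0"),
    "m1": ("mul", "b", "k1"),
    "y": ("add", "m0", "m1"),
}
CONSTANTS = {"k0": 3, "k1": 2}

# Hypothetical second configuration loaded onto the same array: z = s + 1.
OFFSET = {"z": ("add", "s", "k2")}


def run_config(config, constants, streams, output):
    """Stream a block of data through one configured ALU network."""
    results = []
    for values in zip(*streams.values()):
        wires = dict(constants)
        wires.update(zip(streams.keys(), values))
        # Nodes are listed in topological order, so inputs are always ready.
        for node, (opcode, lhs, rhs) in config.items():
            wires[node] = OPCODES[opcode](wires[lhs], wires[rhs])
        results.append(wires[output])
    return results


# Step 1: the block streams through the first configuration.
buffer = run_config(CONFIG, CONSTANTS, {"a": [1, 2, 3], "b": [10, 20, 30]}, "y")
# Step 2: the array is reconfigured and the buffered block streams through again.
final = run_config(OFFSET, {"k2": 1}, {"s": buffer}, "z")

print(buffer)  # [23, 46, 69]
print(final)   # [24, 47, 70]
```

In hardware all three ALUs of the first configuration work concurrently on successive stream elements; the inner loop here only emulates that network one element at a time.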
This is the basic principle, but a serious question arises: Is such a processor programmable? The final goal is an architecture that can also be programmed in a higher-level sequential language such as C. To reach this target, many more specific details must be integrated into a reconfigurable processor.
First of all, a reconfigurable processor must hide all implementation details from the programmer. Elements of data streams must be handled as packets, in order to ensure that ALUs generate a result only when all input packets are available. This automatic data synchronization results in timing-free programming without the need for extensive pipelining.
Conditional operations based on the results of calculations must be possible, which allows programming of loops, conditional ALU operations and conditional requests for reconfiguration. That control flow must be completely independent of the data flow.
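The packet rule, in which an ALU produces a result only when every input has a packet waiting, can be modeled as follows. The class `PacketALU` and its methods are hypothetical names for this sketch; in a real device the handshake would be a hardware protocol, not software queues.

```python
from collections import deque


class PacketALU:
    """One ALU that fires only when a packet is waiting on every input port."""

    def __init__(self, op, n_inputs=2):
        self.op = op
        self.inputs = [deque() for _ in range(n_inputs)]
        self.output = deque()

    def push(self, port, packet):
        self.inputs[port].append(packet)
        self._fire()

    def _fire(self):
        # Consume one packet per port, but only while every port holds one.
        while all(self.inputs):
            args = [port.popleft() for port in self.inputs]
            self.output.append(self.op(*args))


adder = PacketALU(lambda a, b: a + b)
adder.push(0, 1)
adder.push(0, 2)            # left operands arrive early: nothing fires yet
early = list(adder.output)  # []
adder.push(1, 10)           # first pair complete
adder.push(1, 20)           # second pair complete
late = list(adder.output)   # [11, 22]
```

Because operands are paired by arrival order rather than by clock cycle, the producer of either input may run faster or slower without the programmer having to balance delays by hand.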
One of the most important challenges is to control the reconfiguration process. In order not to lose clock cycles during reconfiguration, the next configuration should follow the last data packet of the previous configuration synchronously, like a wave. The next configuration then does not need to wait until the previous data pipeline is emptied; it can start directly behind the last packet. This feature keeps the maximum number of ALUs busy.
Further, configuration data should be very limited in size, locally cached and preloaded for the fastest possible reconfiguration. Several independent tasks (configurations) must be allowed to operate in parallel on the array. If one task requests already allocated resources, an automatic allocation scheme must prevent deadlocks and maintain the correct sequence of configurations. For conditional reconfigurations, results of array calculations must be able to control the configuration flow. All reconfiguration handling must be done by hardware protocols and ideally should be transparent to the programmer.
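A rough cycle count shows why the wave matters. The sketch below compares draining the pipeline before each reconfiguration with letting the next configuration trail the last packet. The function name, the linear-pipeline model and the assumption that a preloaded configuration switches a stage in zero extra cycles are simplifications made for illustration, not figures from any particular device.

```python
def cycles_to_finish(blocks, depth, wave):
    """Count clock cycles for a linear pipeline of `depth` ALU stages.

    blocks -- list of packet counts, one entry per configuration
    wave   -- True: the next configuration trails the last packet stage by stage
              False: the pipeline is drained before the next one is loaded
    """
    pipeline = [None] * depth  # configuration id of the packet in each stage
    pending = [cfg for cfg, count in enumerate(blocks) for _ in range(count)]
    cycles = 0
    while pending or any(p is not None for p in pipeline):
        entering = None
        if pending:
            in_flight = {p for p in pipeline if p is not None}
            # Without the wave, a new configuration waits for an empty pipeline.
            if wave or not in_flight or in_flight == {pending[0]}:
                entering = pending.pop(0)
        pipeline = [entering] + pipeline[:-1]  # every packet advances one stage
        cycles += 1
    return cycles


drained = cycles_to_finish([4, 4], depth=3, wave=False)
waved = cycles_to_finish([4, 4], depth=3, wave=True)
print(drained, waved)  # 14 11
```

In this model the drained pipeline pays the fill-and-drain latency once per configuration, while the wave pays it only once for the whole sequence, so the saving grows with pipeline depth and with the number of reconfigurations.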
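One conservative way to meet the allocation requirements is sketched below: a configuration is loaded only when all of its ALUs are free at once, and requests are served strictly in arrival order. The `ConfigManager` class and this particular policy are assumptions made for illustration; the article does not specify the scheme the hardware uses.

```python
from collections import deque


class ConfigManager:
    """Hypothetical allocator: all-or-nothing grants, strict arrival order."""

    def __init__(self, n_alus):
        self.free = set(range(n_alus))
        self.waiting = deque()  # (name, ALUs) requests in arrival order
        self.running = {}       # name -> set of ALUs currently held

    def request(self, name, alus):
        self.waiting.append((name, set(alus)))
        self._dispatch()

    def release(self, name):
        self.free |= self.running.pop(name)
        self._dispatch()

    def _dispatch(self):
        # Only the oldest request is considered, so later requests never
        # overtake it, and a configuration takes all of its ALUs or none.
        while self.waiting and self.waiting[0][1] <= self.free:
            name, alus = self.waiting.popleft()
            self.free -= alus
            self.running[name] = alus


mgr = ConfigManager(4)
mgr.request("filter", {0, 1})
mgr.request("fft", {1, 2})     # waits: ALU 1 is still held by "filter"
mgr.request("viterbi", {3})    # ALU 3 is free, but "fft" must go first
before = list(mgr.running)     # ["filter"]
mgr.release("filter")
after = list(mgr.running)      # ["fft", "viterbi"]
```

Because no configuration ever holds part of its resources while waiting for the rest, the circular wait needed for a deadlock cannot form. The price of strict ordering is that an independent task may wait behind a blocked one, a trade-off a real scheme would likely refine.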
With these hardware preconditions, a high-level compiler is able to find parallel code sections in sequential code and then place and route the algorithm to any location on a given array. One method Pact has developed is the automatic vectorization of algorithms that are written in a high-level sequential language such as C. Code sections to be ported are analyzed automatically and parallel sections are mapped to the array in a sequence of configurations. Such a compiler makes extensive use of ALU status information for control structures such as conditional statements. This method allows programming on a higher abstraction level.
It is a common practice in a coprocessor environment to port only inner time-critical loops to the coprocessor; sequential or irregular code sections and control functions are executed on the DSP. This protects the software investment in DSP code, saves development time and enables DSPs to handle high-speed data streams. The routines to be ported should be optimized for best utilization of the hardware resources. An easy-to-learn descriptive language that opens all processor features to the programmer is the optimal choice to implement truly parallel algorithms directly for the target hardware.
Integration of a reconfigurable processor with a DSP is straightforward: Streaming I/O ports are coupled to the DSP's DMA ports or to shared memory that is actively addressed by the array. Differences in execution speed between the DSP core and the coprocessor are absorbed automatically by the data-flow synchronization mechanisms. The reconfigurable processor core runs self-contained without interrupting the DSP core until calculated results are available on the DMA bus or in shared memory, or until new data is requested from the coprocessor.
In addition to the hardware, development tools must also be seamlessly integrated into the DSP's tool chain. A fast, clock-accurate simulator of the reconfigurable processor and sophisticated visualization and debugging tools must fit into the DSP development environment. Because of their flexibility and programmability, reconfigurable processors are the optimal integrated coprocessors to maintain the flexibility and code base of DSPs, and to close the performance gap for streaming data.
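The speed adaptation between the DSP and the coprocessor behaves much like two bounded queues with back-pressure. The following sketch uses threads and Python's standard `queue.Queue` as stand-ins for the DMA ports; the function names, the squaring operation and the buffer depth of 2 are hypothetical.

```python
import queue
import threading

END = None  # hypothetical end-of-stream marker


def dsp_dma(samples, port):
    """DSP side: stream samples out through a DMA port."""
    for sample in samples:
        port.put(sample)  # blocks while the coprocessor input buffer is full
    port.put(END)


def array(inp, out):
    """Coprocessor side: runs on its own until the stream ends."""
    while (x := inp.get()) is not END:
        out.put(x * x)    # blocks until the DSP has fetched earlier results
    out.put(END)


def run(samples, depth=2):
    to_array = queue.Queue(maxsize=depth)
    from_array = queue.Queue(maxsize=depth)
    threads = [
        threading.Thread(target=dsp_dma, args=(samples, to_array)),
        threading.Thread(target=array, args=(to_array, from_array)),
    ]
    for thread in threads:
        thread.start()
    results = []
    while (y := from_array.get()) is not END:
        results.append(y)
    for thread in threads:
        thread.join()
    return results


print(run([1, 2, 3, 4, 5]))  # [1, 4, 9, 16, 25]
```

Neither side needs to know the other's speed: whichever runs faster simply blocks on a full or empty buffer, which mirrors how the data-flow handshake lets the array run without interrupting the DSP core.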
Though reconfigurable processors could be integrated at board level as separate devices, the most promising solution is integrating them into standard DSPs or SOIC devices. Semiconductor manufacturers will integrate intellectual property (IP) of the reconfigurable processor into their products. To fit the various DSP architectures, the IP model must provide a wide selection of options: word size; the size of the array in both dimensions; the number and size of memories; the final functionality of the ALUs; the number of routing resources; and the types of interfacing to the target DSP.
Within the next few years, reconfigurable processors will enter the semiconductor market as symbiotic partners of classical DSPs or RISC architectures. One driving factor is the exceptional need for performance and flexibility in the fast-growing wireless infrastructure market. These processors will benefit from advances in semiconductor process technology in both the number of elements on a chip and the clock frequency. In the long term, this technology has the potential to replace classical instruction flow-based architectures with configuration-flow architectures.
Copyright © 2003 CMP Media, LLC