DSP performance: Useful work per clock tick
By Nat Seshan, Sanjive Agarwala, DSP Design Manager, Alan Gatherer, Manager, Systems Engineering, Texas Instruments, Dallas, EE Times
November 15, 2001 (4:13 p.m. EST)
The primary performance metric for general-purpose processors is clock speed, so it is tempting to think about all processors in terms of megahertz. In the case of high-performance digital signal processors (DSPs), however, clock speed is only part of the story. A vital determinant of overall performance lies in how the DSP uses each instruction cycle.
For example, Texas Instruments' (TI's) TMS320C6416 is characterized for clock speeds up to 600 MHz. As a result, the C6416 delivers considerably more raw performance than most other DSPs, which typically operate between 180 and 300 MHz. However, regardless of clock speed, it is the DSP's cycle efficiency (the amount of useful work that the DSP performs in each cycle) that, when scaled by the clock rate, determines true performance. With instructions optimized for maximum efficiency, integrated coprocessors that reduce processing overhead, and high-speed I/Os configured for specific applications, a 600-MHz DSP outperforms a general-purpose processor running at 2 GHz on signal-processing functions. It does so while consuming less than one-tenth the power and generating a fraction of the heat.
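The relationship between clock rate and cycle efficiency can be put into a few lines of arithmetic. The sketch below is only an illustration: the clock rates come from the text above, but the function names and the operations-per-cycle figures are hypothetical and are not measurements of any real device.

```python
def useful_ops_per_second(clock_hz: float, useful_ops_per_cycle: float) -> float:
    """Throughput is the clock rate scaled by the useful work done in each cycle."""
    return clock_hz * useful_ops_per_cycle


def break_even_ops_per_cycle(slow_clock_hz: float, fast_clock_hz: float,
                             fast_ops_per_cycle: float) -> float:
    """Useful ops per cycle the slower-clocked part needs to match the faster one."""
    return fast_clock_hz * fast_ops_per_cycle / slow_clock_hz


# Clock rates from the article; the per-cycle figures are made-up placeholders.
dsp_throughput = useful_ops_per_second(600e6, 8)   # hypothetical: 8 useful ops per cycle
gpp_throughput = useful_ops_per_second(2e9, 2)     # hypothetical: 2 useful ops per cycle

# How much more work per cycle a 600-MHz part must do to match a 2-GHz part
# that completes one useful operation per cycle.
needed = break_even_ops_per_cycle(600e6, 2e9, 1.0)
```

Under these assumed figures the lower-clocked part comes out ahead, and the break-even point is a little more than three times the work per cycle.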
A very long instruction word (VLIW) architecture is a critical component of exceptional DSP performance. VLIW is hardly new in the DSP world, but recent improvements and new sets of extensions provide increased parallelism and double or quadruple the functionality, especially in wireless infrastructure applications such as third-generation (3G) basestation transceivers and controllers, as well as in broadband communications.
On the C6416, for example, an advanced VLIW architecture provides eight parallel execution units, a significant advance over earlier DSPs. More important, single instruction multiple data (SIMD) processing allows each execution unit to carry out multiple concurrent arithmetic operations. For example, this DSP includes two multipliers, each capable of twice as many operations as could be performed with the initial VLIW DSP architectures. Each multiplier can execute two 16 x 16 multiplies per cycle, for a total of four, or four 8 x 8 multiplies, for a total of eight.
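To make the SIMD idea concrete, the sketch below emulates a multiplier that treats one 32-bit word as two signed 16-bit lanes and multiplies the lanes pairwise. It is a software model only: the function names are hypothetical, and it does not reproduce the C6416's actual instruction set or result packing.

```python
def pack16(hi: int, lo: int) -> int:
    """Pack two signed 16-bit values into one 32-bit word (hi lane, lo lane)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack16(word: int) -> tuple:
    """Split a 32-bit word into its (hi, lo) signed 16-bit lanes."""
    def to_signed(value: int) -> int:
        return value - 0x10000 if value & 0x8000 else value
    return to_signed((word >> 16) & 0xFFFF), to_signed(word & 0xFFFF)


def dual_mul16(a_word: int, b_word: int) -> tuple:
    """Model one multiplier doing two 16 x 16 multiplies on packed operands."""
    a_hi, a_lo = unpack16(a_word)
    b_hi, b_lo = unpack16(b_word)
    return a_hi * b_hi, a_lo * b_lo


def multiplies_per_cycle(multipliers: int, lanes_per_multiplier: int) -> int:
    """Peak multiplies issued per cycle across all multipliers."""
    return multipliers * lanes_per_multiplier


products = dual_mul16(pack16(3, -4), pack16(-5, 6))
peak_16x16 = multiplies_per_cycle(2, 2)  # two multipliers, two 16-bit lanes each
peak_8x8 = multiplies_per_cycle(2, 4)    # two multipliers, four 8-bit lanes each
```

The lane counts passed to `multiplies_per_cycle` are the ones quoted in the paragraph above; everything else is illustrative.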
Leveraging VLIW processing muscle requires independent instructions for each functional unit. DSPs, even those based on VLIW architectures, frequently rely on one instruction to control several processor functions, such as the multiplier, the accumulator and load/store access to memory. This complex instruction set (CISC) approach can prove too inflexible to support emerging applications: an engineer who wishes to employ those functional units in an innovative way may find that the CISC instructions impose barriers. Some architectures have attempted to add limited flexibility through instruction prefixes that provide additional functionality to adjacent instructions. For most programmers, however, this capability just layers on another set of restrictions and special cases, further complicating programming. It's a lot easier to play Tetris if all the blocks are squares, because things just fit better. Independent instruction control for each functional unit improves flexibility and also reduces the number of clock ticks needed to perform a given amount of work over a wider array of applications.
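The effect of independent per-unit instructions on cycle count can be illustrated with a toy scheduler. The sketch below greedily packs operations into issue packets of up to eight slots, issuing an operation only after the operations it depends on have completed. It is a deliberately simplified model under stated assumptions: every operation takes one cycle, any operation may use any slot (real functional units are typed), and the function name `schedule` is hypothetical.

```python
def schedule(ops, deps, width=8):
    """Greedy list scheduling.

    ops   -- operation names in program order
    deps  -- maps an operation to the set of operations it depends on
    width -- number of parallel issue slots per cycle
    Returns a list of packets; each packet is the list of ops issued that cycle.
    """
    done = set()
    remaining = list(ops)
    packets = []
    while remaining:
        # An op is ready once everything it depends on issued in an earlier cycle.
        ready = [op for op in remaining if deps.get(op, set()) <= done]
        if not ready:
            raise ValueError("cyclic dependency between operations")
        packet = ready[:width]
        packets.append(packet)
        done.update(packet)
        remaining = [op for op in remaining if op not in packet]
    return packets


independent = list("abcdefgh")
wide = schedule(independent, {})             # eight slots available
narrow = schedule(independent, {}, width=1)  # one operation per cycle
chain = schedule(["a", "b", "c"], {"b": {"a"}, "c": {"b"}})
```

In this model, eight independent operations fit in a single packet when eight slots are available, while a dependent chain still takes one cycle per link regardless of width.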
Specialized extensions further enhance VLIW performance capabilities. For example, Reed-Solomon encoding is used in a variety of communications systems for error correction or encryption. In the most advanced DSPs, Reed-Solomon is built in at the instruction level through a Galois field multiply operation to increase performance and simplify development. For video processing, other enhanced instructions perform four eight-bit arithmetic operations in the SIMD format, allowing each functional unit to process more per cycle. Extensions such as the sum-of-absolute-value-of-differences function for motion estimation can likewise boost performance in video and imaging systems.
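Both operations mentioned above are short enough to sketch in software, which also shows why doing them in a single instruction saves cycles. The Galois field multiply below works in GF(2^8) with the reduction polynomial 0x11D, a common choice for Reed-Solomon codes but an assumption here, since the article does not say which field or polynomial the hardware uses. The second function computes a sum of absolute differences over four 8-bit values packed in a 32-bit word. Function names are hypothetical.

```python
def gf256_mul(a: int, b: int, poly: int = 0x11D) -> int:
    """Multiply two GF(2^8) elements by shift-and-XOR with polynomial reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # addition in GF(2^n) is XOR
        a <<= 1
        if a & 0x100:
            a ^= poly        # reduce back into 8 bits
        b >>= 1
    return result


def sad4(a_word: int, b_word: int) -> int:
    """Sum of absolute differences over four unsigned 8-bit lanes of a 32-bit word."""
    total = 0
    for shift in (0, 8, 16, 24):
        a_byte = (a_word >> shift) & 0xFF
        b_byte = (b_word >> shift) & 0xFF
        total += abs(a_byte - b_byte)
    return total


wrapped = gf256_mul(0x02, 0x80)              # product overflows 8 bits and is reduced
pixel_error = sad4(0x10203040, 0x11203F40)   # lanes differ by 1, 0, 15 and 0
```

The software loop needs up to eight iterations per Galois field multiply and four per packed difference, which is the work a dedicated instruction collapses.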
Effectively supporting specialized eight-bit video extensions requires nonaligned access to data. Relatively unusual in today's DSPs, nonaligned access allows loads and stores of data vectors up to 64 bits wide along arbitrary eight-bit boundaries. Thus, users do not have to restructure or recode their video application to include multiple routines for different alignments. Nor, even worse, do they have to add instructions to load multiple 64-bit values, pack them, perform the necessary operation, repack them and perform multiple store operations, all of which wastes CPU performance and power.
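The extra work that an aligned-only machine must do can be seen in a small memory model. In the sketch below, `load64_emulated` builds a 64-bit value at an arbitrary byte address from two aligned loads plus shifts and a merge, while `load64_direct` stands in for a single nonaligned load. Little-endian byte order and all function names are assumptions made for the illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def load64_aligned(mem: bytes, addr: int) -> int:
    """Load 8 bytes from an address that must be a multiple of 8."""
    if addr % 8:
        raise ValueError("address is not 64-bit aligned")
    return int.from_bytes(mem[addr:addr + 8], "little")


def load64_emulated(mem: bytes, addr: int) -> int:
    """Nonaligned load built from two aligned loads, two shifts and a merge."""
    base = addr - addr % 8
    shift = 8 * (addr % 8)
    low = load64_aligned(mem, base)
    if shift == 0:
        return low
    high = load64_aligned(mem, base + 8)
    return ((low >> shift) | (high << (64 - shift))) & MASK64


def load64_direct(mem: bytes, addr: int) -> int:
    """Stand-in for a hardware nonaligned load: one access at any byte address."""
    return int.from_bytes(mem[addr:addr + 8], "little")


memory = bytes(range(32))
same = all(load64_emulated(memory, a) == load64_direct(memory, a) for a in range(17))
```

Both routes return the same value; the difference is that the emulated path spends several operations per access that a nonaligned load makes unnecessary.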
While the implementation and enhancement of a VLIW DSP CPU is necessary to wring maximum performance from each DSP clock tick, chip-level device architecture also plays a vital role. For wireless communications in particular, where the goal is to compress as much data as possible into a single channel, integrating Viterbi and Turbo coprocessors for forward error correction onto the silicon increases system performance dramatically. Without the coprocessors, as much as 90 percent of the DSP's computing capabilities may be expended on error-correction coding functions.
Adding a Viterbi coprocessor to error-encode compressed voice data so that it is protected against transmission errors, and a Turbo coprocessor for higher-rate channels such as wireless Internet, streaming media and the like, allows the DSP to off-load error correction to the coprocessors. The DSP CPU is then free to handle other critical processing in the system, with performance headroom remaining to adapt to new standards or manage additional functions designed into an application.
For designers, the increase in useful DSP capacity afforded by the integration of Viterbi and Turbo coprocessors creates opportunities to differentiate their systems with proprietary functionality, increased data capacity or lower cost. One designer may decide to use the DSP processing power freed up by the coprocessors to achieve a given bandwidth with fewer DSPs. Another may choose to increase density, which boosts system performance. Either way, integrated Viterbi and Turbo coprocessors allow one DSP to do the work of four equivalent devices without the coprocessors.
Of course, this 4x increase in useful capacity will be realized only in applications requiring substantial data compression and concomitant error correction. Wireless transceivers are obvious examples of end equipment that can achieve significant gains in performance as a direct result of Viterbi and Turbo coprocessor integration. Designers working with DSL, wireless local loop or wireless broadband datacom technologies also may realize an advantage from such integration. The Viterbi module can even improve performance in an application as simple as a V.34 or V.90 modem. Turbo coding will become increasingly valuable in 3G wireless applications.
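The link between the share of cycles spent on error correction and the resulting gain in useful capacity can be estimated with a simple model. The sketch below assumes that off-loaded work disappears entirely from the CPU's budget and that nothing else (coprocessor throughput, I/O, memory) becomes the bottleneck; both are simplifications, and the function names are hypothetical.

```python
def capacity_multiplier(offloaded_fraction: float) -> float:
    """Gain in useful CPU capacity when a fraction of the cycles is off-loaded."""
    if not 0.0 <= offloaded_fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - offloaded_fraction)


def fraction_for_multiplier(multiplier: float) -> float:
    """Fraction of cycles that must be off-loaded to reach a given capacity gain."""
    if multiplier < 1.0:
        raise ValueError("multiplier must be at least 1")
    return 1.0 - 1.0 / multiplier


share_for_4x = fraction_for_multiplier(4.0)   # share of cycles behind a 4x gain
upper_bound = capacity_multiplier(0.9)        # CPU-cycle bound at a 90 percent share
```

Under this model, a 4x gain corresponds to error correction consuming three-quarters of the CPU's cycles before off-loading, which sits below the 90 percent worst case cited earlier.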
No matter how much processing power and efficiency are built into a device, at least one other consideration is paramount in evaluating DSP performance: the ability to move data on and off chip, and to interface with other system components, quickly enough to exploit the capabilities built into the chip.
Take symbol rate, for example. Typically, ASICs, not DSPs, execute certain ultrahigh-performance functions, such as correlation. Communication between those ASICs and DSPs requires high bandwidth, such as a 133-MHz synchronous 64-bit interface, in order to move data back and forth while maintaining headroom to meet real-time deadlines for data transfer. It is never acceptable to drop data packets between the ASIC and the DSP.
The same kind of 64-bit interface can be valuable in systems requiring off-chip memory. In a multistandard basestation, for example, the ability to access different types of code and/or data parameters through a wide memory pipe may be essential.
In addition to the 64-bit interface, a 16-bit, 133-MHz port provides high-speed general I/O. Providing both ports to the DSP permits concurrent high-speed memory access and I/O.
Support for specific I/O standards is vital in DSPs used in wireless infrastructure and similar applications. For example, the standard for communication between the transceiver and the transcoder in a wireless basestation is asynchronous transfer mode (ATM). The ATM interface standard is Utopia. Embedding a Utopia port in the DSP allows the device to natively receive and transmit data packets to and from a host. In the same way, adding a PCI port to the DSP provides native control data communications between the DSP and the host. Multichannel buffered serial ports can provide up to 128 channels of individually selectable voice data.
Coordinating all these different data streams can be a problem in some circumstances. To solve that problem, the DSP benefits from an on-board Enhanced DMA coprocessor that permits interleaving data streams on a cycle-by-cycle basis. The coprocessor built into TI's C6416 is 64 bits wide and runs at 300 MHz.
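The widths and clock rates quoted for the two external ports and the DMA coprocessor translate directly into peak transfer rates. The sketch below assumes one transfer per clock with no wait states or protocol overhead, so the results are upper bounds rather than sustained figures; the function name is hypothetical.

```python
def peak_bytes_per_second(width_bits: int, clock_hz: int) -> int:
    """Peak transfer rate of a bus moving one word of width_bits per clock."""
    return width_bits * clock_hz // 8


# Widths and clock rates as quoted in the article.
wide_port = peak_bytes_per_second(64, 133_000_000)     # 64-bit, 133-MHz interface
general_port = peak_bytes_per_second(16, 133_000_000)  # 16-bit, 133-MHz port
dma_engine = peak_bytes_per_second(64, 300_000_000)    # 64-bit, 300-MHz Enhanced DMA

combined_ports = wide_port + general_port
dma_keeps_up = dma_engine >= combined_ports
```

On these peak numbers, the DMA engine's internal rate exceeds the combined rate of both external ports, which is consistent with its role of interleaving several streams.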
Each feature mentioned above (enhanced second-generation VLIW, error-correction coprocessors and sophisticated I/O capabilities) improves efficiency and contributes to overall DSP performance. Together, they add up to a DSP that can run circles around a general-purpose processor in signal processing with a fraction of the clock ticks. And since instruction cycles eat up power and generate heat, lower clock speeds actually offer advantages when performing a given amount of work.
For system developers, clock speed is just one metric to consider in evaluating DSPs. The amount of useful work the DSP can perform with the clock ticks available is the real measure of true high performance.
Copyright © 2003 CMP Media, LLC