In response to the demand for faster, cheaper, smaller, more energy-efficient digital signal processors, vendors have developed processor architectures that depart significantly from those of traditional DSPs. Microcontroller and CPU vendors have also begun to add DSP enhancements to general-purpose architectures, creating hybrids that challenge the performance of dedicated DSPs.
Today's large selection of processors equipped with DSP capabilities is clearly an advantage for system designers, since there are now options for a variety of applications. However, this bewildering array of choices makes it difficult and time-consuming to compare performance and select the best processor for an application.
What distinguishes a DSP from the conventional general-purpose processors (GPPs) typically found in personal computers? In practice, the term "DSP" designates a programmable microprocessor whose architecture is designed specifically for digital signal processing. Most DSP systems perform complicated mathematical operations on real-time signals, so DSPs are optimized for such operations through special architectures that accelerate repetitive, numerically intensive calculations.
For example, virtually all DSPs support fast multiply-accumulate (MAC) operations, which are useful in many signal processing algorithms. Other architectural features often found in DSPs include multiple memories with multiple bus sets that allow the processor to simultaneously load multiple operands (such as a data sample and a filter coefficient) in parallel with an arithmetic operation. DSPs usually include several special memory addressing modes and program-flow control features designed to accelerate the execution of repetitive operations. In addition, most DSPs contain on-chip peripherals and interfaces that allow the processor to efficiently communicate with other system components.
General-purpose processors have lacked such DSP-specific features. However, vendors are adding these features to their processors because the demand for DSP capabilities and features has exploded over the last few years. Examples of DSP features included on GPPs range from a single-cycle multiply-add instruction on the Motorola PowerPC 604e to the single-instruction, multiple-data instructions found on the Intel Pentium with MMX and the PowerPC 7400 with AltiVec.
In fact, nearly all GPP vendors, whether they sell high-performance CPUs or low-cost microcontrollers, offer some form of DSP enhancement for their products. Many microcontroller vendors, hoping to avoid having their chips replaced by DSPs, have added DSP capabilities to their existing microcontroller architectures. For example, ARM Ltd. (Cambridge, England) recently announced a DSP-oriented extension for its ARM9 processor, called the ARM9E. Hitachi Ltd. (Tokyo) augmented its widely used SH-2 and SH-3 microcontrollers with a fixed-point DSP data path to create the SH-DSP. Infineon Technologies AG (Munich, Germany), the former Siemens Semiconductor, entered the DSP/microcontroller arena in 1998 with the introduction of TriCore, a multi-issue processor designed as a hybrid from the ground up rather than as a DSP retrofit of an existing microcontroller.
All three of these hybrid processors use RISC-based instruction sets that are similar to the instruction sets found on microcontrollers but include an array of DSP-oriented instructions. Some high-performance GPPs are capable of performing signal-processing tasks with execution times that are comparable with those of the fastest dedicated DSPs. Assuming that a GPP with sufficient DSP performance is already in place, there are clear advantages to using the existing processor for signal processing rather than adding a separate dedicated DSP to the system.
However, there are disadvantages, too, particularly in the case of high-performance GPPs, which incorporate dynamic features like superscalar execution and branch prediction. These features cause execution times to vary for the same segment of software, making it difficult for the programmer to predict how long a given section of a program will take to execute. Variable execution times can pose a problem in real-time applications because the programmer may have difficulty guaranteeing that hard real-time constraints are met in every instance. In addition, the lack of good DSP application-development tools for general-purpose processors can make the development process frustrating.
With the widening range of DSP-enhanced general-purpose processors, DSP system designers now must decide whether they need a DSP, or if a GPP is a better choice. Comparing the performance of these two categories of processors can be difficult, however, since there is often little information available about the DSP performance of general-purpose processors relative to the performance of DSPs. The methodology Berkeley Design Technology Inc. (BDTI) has developed for evaluating processor performance on DSP algorithms is independent of processor architecture, so that the same benchmarks can be implemented on DSPs, CPUs, microcontrollers and hybrids, allowing apples-to-apples comparisons of DSP performance.
We believe that architecture independence is an important requirement for DSP benchmarks, particularly as the architectures used for DSP have diversified. Performance is often the first criterion system designers apply in their initial screening of candidate processors, and it can be measured in many ways. The most common measurement is raw speed; that is, the length of time a processor requires to perform a given task. Depending on the application, however, other metrics, such as memory usage or power consumption, can be equally important.
The DSP system designer must choose a processor that has sufficient performance to satisfy the requirements of demanding number-crunching applications, such as digital telephony and modems. At the same time, there are usually tight restrictions on the price of the processor, and any performance paid for and not used is wasted.
These demands imply that the best processor is often the least expensive one with acceptable performance. Unfortunately, quantifying the computational requirements of an application and then determining the best processor for the job is not a simple task. The ideal measurement of performance might be a composite metric encompassing execution time, memory and power consumption.
This combination tends to be difficult to analyze and understand and does not lend itself to quick comparisons. We will focus instead on execution time and will consider memory and power consumption as secondary metrics. The most common processor performance metric is Mips (millions of instructions per second). Unfortunately, this metric is misleading because the operations that are performed by a single instruction vary widely between processors. This is especially a concern for DSPs, which often have highly specialized instruction sets that differ widely from one processor to the next.
Unless the processors have similar architectures, comparisons based on Mips are practically useless. Similarly, another common measurement, Mops (millions of operations per second), is also misleading because there is no clear definition of what an operation is or how many operations are needed for a specific task. Other simple performance measurements can be misleading as well. The MAC is a central operation in many DSP applications and some vendors measure performance in MACs/second. Many DSPs perform one MAC per cycle, however, making this measurement equivalent to Mips for these processors.
Also, the definition of a MAC varies from processor to processor. For example, does the MAC include the associated data moves or just the multiply-accumulate operation itself? Furthermore, this measurement provides no information about a processor's relative performance on DSP tasks other than MACs. Mips, Mops and MACs/s also ignore secondary metrics like memory usage and power consumption. This is a problem, since fast execution times are meaningless if a processor requires more memory than is available.
A common benchmarking methodology for computers involves implementation of entire applications or even suites of applications as in the popular SPEC95 benchmarks. This methodology works best when the software is portable; that is, if the applications are written in a high-level language like C. Unfortunately, C compilers for most processors perform poorly on DSP software. For that reason, performance-critical DSP software is usually hand-optimized in assembly language. If a processor is benchmarked with applications written in a high-level language, the benchmark measures both the processor and the compiler and is not likely to yield results similar to what would be obtained by developing the application using assembly language.
In addition, only a few applications have sufficiently well-defined specifications to guarantee a fair comparison. For example, it would be difficult to use the V.34 modem standard as an application benchmark because different V.34 modems use different algorithms and achieve varying error rates. Coding an application-based benchmark entirely in assembly language is not an attractive option either, for several reasons. First, it is practically impossible to ensure optimal or even near-optimal implementation of complex applications, making the benchmark as much a gauge of programmer skill as of processor performance.
Second, the benchmark would measure the performance of the entire system, not just the processor. Note that benchmarks developed for computers are intended to measure the performance of the entire system; in that scenario, the inability to isolate the processor's performance from the system's is not a drawback. Evaluating the DSP performance of processors as standalone units, decoupled from their systems, however, requires a different strategy.
The BDTI Benchmarks comprise a variety of commonly used DSP algorithm kernels, including FIR and IIR filters, a Viterbi decoder and an FFT. With the exception of one control-oriented benchmark, which is optimized for minimum memory usage, all of the algorithm kernels are optimized for speed. On each benchmark, we measure the processor's cycle count, execution time, cost-performance (a combined figure of merit), energy consumption and memory usage. Vendors benchmarking their own processors often achieve results that are faster than those obtained by BDTI.
Apart from possible differences in the requirements of the benchmarks themselves, it is also important to note that the BDTI Benchmarks take into consideration how the algorithm kernels are used in typical applications, meaning that unrealistically aggressive optimizations are prohibited. For example, it may be possible to speed up an algorithm by completely unrolling its inner loop, but this often consumes an unreasonable amount of program memory. For this reason, BDTI prohibits excessive loop unrolling and other impractical optimizations from its benchmark implementations. As a result, the BDTI Benchmarks do not necessarily reflect the fastest possible implementation of each benchmark but rather reveal processor performance that can be expected in a typical application.
The methodology also does not account for further optimizations that become possible when two algorithm kernels cooperate within an application. Nevertheless, a processor's performance in a DSP application can be estimated by combining its benchmark results with profiling results for that application: each benchmark execution time is assigned a weight based on the kernel's relative importance in the application.
THIS ARTICLE IS EXCERPTED FROM A PAPER PREPARED IN CONJUNCTION WITH A CLASS PRESENTED AT DSP WORLD SPRING CONFERENCE 2000, TITLED "INDEPENDENT DSP BENCHMARK AND RESULTS FOR THE LATEST PROCESSORS."