Many of today's digital signal processing (DSP) applications are subject to real-time constraints. And it seems many applications eventually grow to a point where they stress the available CPU and memory resources. It's like trying to fit 10 pounds of algorithms into a five-pound sack. Understanding the architecture of the DSP, as well as the compiler, can speed up applications, sometimes by an order of magnitude.
The fundamental rule in computer design, as well as programming real-time systems, is: "Make the common case fast, and favor the frequent case." This is really just Amdahl's Law, which says that the performance improvement to be gained from some faster mode of execution is limited by how often you use that mode of execution. So don't spend time trying to optimize a piece of code that will hardly ever run. You won't get much out of it, no matter how innovative you are. Instead, if you can eliminate just one cycle from a loop that executes thousands of times, you will see a bigger impact on the bottom line.
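Amdahl's Law can be put into numbers with a few lines of C. The sketch below is illustrative only; the function name amdahl_speedup and its parameters are assumptions made for this example, not part of any DSP toolchain.

```c
#include <assert.h>

/* Overall speedup predicted by Amdahl's Law.
 *   fraction      - share of the original run time spent in the code
 *                   being optimized (0.0 to 1.0)
 *   local_speedup - how many times faster that code runs afterwards (> 0)
 * Both parameter names are hypothetical, chosen for this illustration. */
double amdahl_speedup(double fraction, double local_speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(local_speedup > 0.0);

    /* The untouched part still takes (1 - fraction) of the time;
     * the optimized part shrinks to fraction / local_speedup. */
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup);
}
```

For instance, doubling the speed of code that accounts for 2 percent of the run time (amdahl_speedup(0.02, 2.0)) gives an overall gain of about 1 percent, while doubling the speed of a loop that accounts for 80 percent (amdahl_speedup(0.80, 2.0)) makes the whole application roughly 1.67 times faster.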
Memory can be a severe bottleneck in embedded-system architectures. This problem can be reduced by storing the most-often referenced items in fast, on-chip memory and leaving the rest in slower off-chip memory. The problem is, getting the data from external memory to on-chip memory takes a lot of time. If the CPU is busy moving data, it cannot be performing other, more important tasks.
The fastest (and most expensive) memory is generally the registers on-chip. There never seems to be enough of this valuable resource, and managing it is paramount to improving performance. The next-fastest memory is usually the cache that holds the instructions or data the processor hopes to use in the near future. The slowest memory is generally found off-chip and referred to as external memory. As a real-time programmer, you want to reduce the accesses to off-chip, external memory because the time to access this memory can be long, causing huge processing delays. The CPU pipeline must "stall," or wait, until the data has been loaded from this memory. Use of on-chip memory, therefore, is one of the most effective ways to increase performance. On-chip memory can be thought of as a sort of data cache, with the main difference being that on-chip memory must be managed by the programmer, whereas a cache is managed automatically.
Hardware architecture techniques have been used to enhance the performance of processors using pipelining concepts. To improve performance even more, multiple pipelines can be used. This approach, called "superscalar," exploits further the concept of parallelism. Some of today's high-performance DSPs, such as the Intel i860, have a superscalar design.
One way to control multiple execution units and other resources on the processor is to issue multiple instructions simultaneously. Some of the latest DSPs, such as the Texas Instruments C6200, are called very long instruction word (VLIW) machines. Each instruction in a VLIW machine can control multiple execution units on the processor. For example, each VLIW instruction in the TI C6200 DSP is eight instructions long, one instruction for each of the eight potentially available execution units. Again, the key is parallelism. In practice, however, it is hard to keep all these execution units full all the time because of various data dependencies. Even so, the possible performance improvement using a VLIW processor is excellent, especially for some DSP applications.
A superscalar architecture offers more parallelism than a pipelined processor. But, unless there is an algorithm or function that can exploit this parallelism, the extra pipe can go unused, reducing the amount of parallelism that can be achieved. An algorithm that is written to run fast on a pipelined processor may not run nearly as efficiently on a superscalar processor.
Direct memory access (DMA) is another option for speeding up DSP execution rates. A peripheral device is used to write data directly to and from memory, taking the burden off the CPU. The DMA is just another type of CPU whose only function is moving data around very quickly. The advantage of this is that the CPU can issue a few instructions to the DMA to move data, and then go back to what it was doing. This is just another way of exploiting the parallelism built into the device. The DMA is most useful for copying larger blocks of data. Smaller blocks of data do not have the same payoff because of the setup and overhead time required for the DMA; the CPU can be used for these. But when used smartly, the DMA can save huge amounts of time.
A common use for the DMA is to stage data on- and off-chip. The CPU can access on-chip memory much faster than off-chip, or external, memory. Having as much data as possible on-chip is the best way to improve performance. If the data being processed cannot all fit on-chip at the same time, as with large arrays, then the data can be staged on- and off-chip in blocks using the DMA. All of the data transfers can be happening in the background, while the CPU is actually crunching the data. Smart management and layout of on-chip memory can reduce the number of times data has to be staged on- and off-chip.
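One common way to organize this kind of staging is with two on-chip buffers used in "ping-pong" fashion: the CPU works on one block while the DMA fills the other. The following is a minimal sketch under stated assumptions, not a real driver. The names dma_start_copy, dma_wait, BLOCK_LEN and sum_staged are hypothetical, and memcpy stands in for the DMA hardware so the example runs on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_LEN 64   /* hypothetical block size, in samples */

/* Stand-ins for a DMA driver (hypothetical names). On real hardware
 * dma_start_copy() would return immediately and the transfer would
 * proceed in the background; here memcpy does the copy synchronously. */
static void dma_start_copy(short *dst, const short *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

static void dma_wait(void)
{
    /* On real hardware: block until the pending transfer completes. */
}

/* Sums a large "external" array by staging it, one block at a time,
 * through two "on-chip" buffers. len must be a multiple of BLOCK_LEN. */
long sum_staged(const short *ext, size_t len)
{
    static short onchip[2][BLOCK_LEN];  /* would be placed in on-chip RAM */
    long total = 0;
    size_t nblocks, b, i;
    int cur = 0;

    assert(len % BLOCK_LEN == 0);
    nblocks = len / BLOCK_LEN;
    if (nblocks == 0)
        return 0;

    dma_start_copy(onchip[cur], ext, BLOCK_LEN);      /* prime buffer 0 */

    for (b = 0; b < nblocks; b++) {
        dma_wait();                                   /* block b is on-chip */

        if (b + 1 < nblocks)                          /* fetch block b+1 ... */
            dma_start_copy(onchip[cur ^ 1],
                           ext + (b + 1) * BLOCK_LEN, BLOCK_LEN);

        for (i = 0; i < BLOCK_LEN; i++)               /* ... while crunching b */
            total += onchip[cur][i];

        cur ^= 1;                                     /* swap buffers */
    }
    return total;
}
```

With a real, asynchronous DMA the inner summing loop would overlap with the transfer of the next block, which is where the time savings come from. The extra buffer and bookkeeping also hint at why DMA-instrumented code tends to be larger and more complex.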
It is worth the time and effort to develop a smart plan for how to use the on-chip memory. In general, the rule is to stage the data in and out of on-chip memory using the DMA, and generate the results on-chip. For cost and space reasons, most DSPs do not have a lot of on-chip memory. This requires the programmer to coordinate the algorithms in a way that efficiently uses the available on-chip memory.
Instrumenting code to use the DMA has some cost penalties. Code size will go up, depending on how much of the application uses the DMA. We have seen code size grow by up to 50 percent using fully instrumented DMA. Using the DMA also increases the complexity of the application and the amount of synchronization it needs. The DMA should be used only in areas requiring high throughput. However, smart layout and utilization of on-chip memory and judicious use of the DMA can eliminate most of the penalty associated with accessing off-chip memory.
The standard rule when programming superscalar and VLIW devices is: "Keep the pipelines full." A full pipe means efficient code. In order to determine how full the pipelines are, you need to spend some time inspecting the assembly-language code generated by the compiler. You can usually spot inefficient code, and inefficient use of the pipelines, by an abundance of no-operation instructions (NOPs) in the code.
There are ways to keep the CPU busy while it is waiting for data to arrive: for example, by doing operations that are not dependent on the data for which you are waiting, or by using both sides of the superscalar architecture to help load and store other data values.
Loop unrolling is a technique used to increase the number of instructions executed between executions of the loop branch logic. This reduces the number of times the loop branch logic is executed, and because the loop branch logic is overhead, reducing the number of times it has to execute lowers the overhead and makes the loop body (the important part of the structure) run faster. A loop can be unrolled by replicating the loop body a number of times and then changing the termination logic to account for the multiple iterations of the loop body.
The drawback to loop unrolling is that it uses more on-chip registers. Different registers need to be used for each iteration. Once the available registers are used, the processor starts going to the stack to store required data. Going to the off-chip stack is expensive and may wipe out the gains achieved by unrolling the loop in the first place. Loop unrolling should be used only when the operations in a single iteration of the loop do not use all of the available resources of the processor architecture. Check the assembly-language output if you are not sure of this. Another drawback is the increase in code size. An unrolled loop requires more instructions and, therefore, additional memory.
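As an illustration, here is a dot product written first as a plain loop and then unrolled by hand four times. This is a sketch with hypothetical function names (dot_rolled, dot_unrolled4); many compilers will unroll automatically at higher optimization levels, so the hand-written form is mainly useful for seeing what the transformation does.

```c
#include <assert.h>
#include <stddef.h>

/* Plain loop: one multiply-accumulate per pass through the branch logic. */
long dot_rolled(const short *a, const short *b, size_t n)
{
    long sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += (long)a[i] * b[i];
    return sum;
}

/* Unrolled by 4: four multiply-accumulates per pass through the branch
 * logic. Each copy of the body gets its own accumulator, so the four
 * operations do not depend on one another. */
long dot_unrolled4(const short *a, const short *b, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    assert(a != NULL && b != NULL);

    for (; i + 4 <= n; i += 4) {      /* termination test adjusted for 4x body */
        s0 += (long)a[i]     * b[i];
        s1 += (long)a[i + 1] * b[i + 1];
        s2 += (long)a[i + 2] * b[i + 2];
        s3 += (long)a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)                /* leftover 0 to 3 elements */
        s0 += (long)a[i] * b[i];

    return s0 + s1 + s2 + s3;
}
```

The separate accumulators s0 through s3 are what give the scheduler independent work to run in parallel, and they are also the extra registers that unrolling consumes.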
One of the best optimization strategies is to write code that can be pipelined efficiently by the compiler. Software pipelining is an optimization strategy to schedule loops and functional units efficiently. In the case of the C6200, eight functional units can be used at the same time, if the compiler can figure out how to do it. Sometimes a subtle change in the way the C code is structured makes all the difference. In software pipelining, multiple iterations of a loop are scheduled to execute in parallel. The loop is reorganized in a way that each iteration in the pipelined code is made from instruction sequences selected from different iterations in the original loop.
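The reorganization can be pictured at the C level, although in practice the compiler does it on the generated instructions rather than on the source. The sketch below (hypothetical function name scale_pipelined) hand-schedules a simple scaling loop into two stages, so that each pass through the loop multiplies the sample from iteration i while loading the sample for iteration i+1.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = x[i] * gain, written as a two-stage software pipeline:
 * a prologue that fills the pipe, a kernel that mixes work from two
 * different iterations, and an epilogue that drains the pipe. */
void scale_pipelined(const short *x, int *y, size_t n, int gain)
{
    short loaded;
    int product;
    size_t i;

    if (n == 0)
        return;
    assert(x != NULL && y != NULL);

    loaded = x[0];                    /* prologue: load for iteration 0 */

    for (i = 0; i + 1 < n; i++) {     /* kernel */
        product = loaded * gain;      /* multiply stage of iteration i   */
        loaded  = x[i + 1];           /* load stage of iteration i + 1   */
        y[i]    = product;            /* store stage of iteration i      */
    }

    y[n - 1] = loaded * gain;         /* epilogue: finish last iteration */
}
```

Inside the kernel, the multiply and the load belong to different iterations of the original loop and do not depend on each other, so a machine with several functional units can issue them together.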
Software pipelining does not happen without careful analysis and structuring of the code. For example, loops that do not have enough processing will not be pipelined. On the other hand, loops that have too much processing will not be pipelined because the loop body will exhaust the available registers. Also, loops that contain function calls will not be pipelined. If you want such a loop to be pipelined, replace the function call with an inline expansion of the function.
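A minimal sketch of the inlining advice, assuming a C99 compiler: the helper is declared static inline so the compiler can expand it at the call site instead of emitting a call inside the loop. The names saturate16 and add_saturated are hypothetical, and inline is only a request, so whether a given compiler then pipelines the loop still has to be checked in the assembly output.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp a wide result to the 16-bit range. Declared 'static inline' so
 * the body can be expanded in place rather than called. */
static inline short saturate16(long v)
{
    if (v > 32767)
        return 32767;
    if (v < -32768)
        return -32768;
    return (short)v;
}

/* Element-wise saturating add. Once saturate16() is expanded inline,
 * the loop body contains no function call. */
void add_saturated(const short *a, const short *b, short *out, size_t n)
{
    size_t i;

    assert(a != NULL && b != NULL && out != NULL);

    for (i = 0; i < n; i++)
        out[i] = saturate16((long)a[i] + b[i]);
}
```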
One drawback of pipelining is the disabling of interrupts. An interrupt in the middle of a fully primed pipe destroys the synergy in instruction execution. The compiler will protect a software pipelining operation by disabling interrupts before entering the pipelined section and enabling interrupts on the way out. This means you pay the price for the efficiency in software pipelining in the form of a non-pre-emptible section of code. The programmer must be able to determine the impact of sections of non-pre-emptible code on real-time performance.
Real-time programmers have always had to develop a library of tricks to allow software to run as fast as possible. As processors continue to grow more complicated, this becomes a more difficult endeavor. For superscalar and VLIW processors, managing two separate pipelines and ensuring the highest amount of parallelism requires tools support. Optimizing compilers are helping to overcome many of the obstacles of these powerful new processors, but even the compilers have limitations. Real-time programmers should not trust the compiler to perform all of the necessary optimizations for them.
See related chart