Audio and video software for consumer media products can be quite complex, combining real-time signal processing, network protocols, complex I/O, and sophisticated user interfaces. Code and device optimizations, for example, can be problematic in the context of many embedded consumer applications where the requirements for performance, low power, minimal memory space and low cost are often in conflict.
Audio and video codecs, in particular, are typically the most resource-hungry components of the software in a consumer media product. The combination of high data rates, demanding algorithms, and the need to use inexpensive, and possibly low-power processors, means that careful optimization of codec software is almost always required in order to create a competitive consumer media product.
Even in cases where such optimization may not be strictly necessary, it is often beneficial-it can lower power consumption, free up processor re sources to permit the addition of more features, or facilitate the use of a less expensive processor.
The codec software optimization process can focus on a number of different performance metrics, e.g., execution speed, memory use, or energy consumption. In some cases optimizing for one metric also optimizes another, while in other cases optimizations conflict. Most often, developers optimize primarily for execution speed, memory use, and energy consumption, while maintaining sufficient video and audio quality.
Extensive execution-speed optimizations are usually required to achieve real-time performance goals. The need for such high levels of optimization is primarily driven by the combination of two factors: the high computational demands of compression and decompression algorithms (which execute complex mathematical operations at high sample rates), and the use of low-cost processors with limited performance.
The optimization process an embedded developer should follow is iterative , beginning with profiling, followed by analysis, then implementation of specific optimizations. This cycle is repeated until performance targets are met.
The first step is to profile the codec software at the function level. This yields information about the percentage of the total execution time attributable to each sub-function, or the number of times each sub-function is called. Typically, the results loosely adhere to an 80/20 rule: 20% of the software is responsible for 80% of the execution time, and vice-versa. Of the three general classes of operations -initiialization rate(I), control rate(K) and sample (S) rate - the S-rate operations are the ones that typically dominate processor loading.
In this classification, the frequency of occurrence differs by two to three orders of magnitude between adjacent levels. Thus, execution cost of operations at each level also differs by two to three orders of magnitude. Optimizing I-rate and K-rate operations typically has little impact on overall execution time because these functions occur relatively infrequently; however, S-rate operations, which have high frequencies of occurrence, must be thoroughly optimized for maximum efficiency. Signal processing functions typically have outer loops at the K-rate and inner loops at the S-rate. The 80/20 rule suggests that codec software contains a relatively small number of functions with S-rate loops. This attribute can greatly simplify the optimization process, since only a small fraction of the total functions in the codec typically require careful optimization.
There are many optimization techniques that can be useful for media applications. Two categories include processor-independent software optimizations and processor-specific software optimizations.
Beginning with the most time-consuming function, as identified by profiling, we can analyze the function and begin to develop an optimization strategy. For example, a common strategy is to first consider processor-independent optimization s that maintain portability of the software.
If additional optimization is needed, proceed with processor-specific high-level language optimizations, such as algorithmic transformations. Such high-level language processor-specific optimizations do not eliminate portability, but the optimizations may not yield good performance on another processor. Finally, additional processor-specific optimizations may be implemented at the assembly language level. These optimizations offer the potential to deliver the greatest gains, at the expense of portability.
Most A/V compression algorithms are first implemented in C. In many cases, the first implementations of a codec are not written with maximum efficiency in mind. For example, reference implementations are often intended for clarity of documentation rather than for efficiency. T
Therefore, it is often possible to reduce the processing and memory requirements of a codec significantly by modifying or rewriting the high-level language code. For example, most compilers employ an optimization technique called "strength reduction," in which costly operations are replaced by simple ones where possible.
There are some situations where the compiler isn't smart enough or lacks sufficient information about the algorithm to apply strength reduction, but a good programmer is able to rewrite the code to avoid costly operations. If the costly operations are in S-rate loops, the savings can be substantial. Other processor-independent optimizations include function in-lining and recycling of memory buffers.
Flattening the function call hierarchy by in-lining functions can yield significant speed-ups when low-level, frequently used operations have been placed in their own functions. Recycling memory buffers primarily decreases memory use, but can sometimes also result in speed-ups due to improved memory system performance (e.g., improved cache hit rates).
A number of processor spe cific optimizations may also apply at the high-level language or assembly code level.For example, algorithm modifications involve changing the internal operations of an algorithm in order to better match the available hardware resources. Many algorithm modifications involve trade-offs between memory use and computation.
Many algorithms use large look-up tables. These look-up tables may occupy more memory than is available in the target system or than can be efficiently accommodated by the processor's level-1 data memory (the memory closest to processor core). In such cases, a typical strategy is to replace the look-up table with algorithmic steps that compute the values formerly held in the table. A compromise approach is to reduce the size of the look-up table and to interpolate values that have been omitted from the table. This compromise typically requires less memory than the purely table-based approach and less execution time than the purely algorithmic approach.
Operation c riteria
Another classic example is an FFT implemented with a radix-4 vs. radix-2 butterfly. The total number of mathematical operations (the sum of additions and multiplications) is similar for either implementation. However, the ratio of additions to multiplications differs between these implementations: the radix-4 implementation requires more additions and fewer multiplications than the radix-2 implementation. If the available hardware resources favor addition over multiplication, then the radix-4 implementation may offer better performance.
Similarly, the FIR (finite impulse response) filter algorithm can sometimes be modified to better match the hardware resources. For example, the coefficients and inputs can use a single shared index variable if the coefficients are re-arranged in memory. In some cases branch testing against a zero or negative loop-counter register is directly supported in hardware, so a count-down loop is more efficient than a count-up loop.
Assemb ly language is the tool of choice for the ultimate in processor-specific optimization. After carefully analyzing the most critical sections of the algorithm and the details of the processor architecture, the assembly language programmer can often gain impressive speed-ups relative to the performance of compiler-generated code.
It is often the case, for example, that compilers do not make use of the processor's full instruction set, and almost never make use of specialized processor features such as single-instruction multiple-data (SIMD) instructions, which allow multiple operations to be performed in parallel. By working in assembly language, developers gain access to the full range of the processor's capabilities.
Due to the large amounts of data handled by A/V applications, the processor's on-chip and off-chip memory system is often a critically important determinant of performance in these applications.
For example, many low-cost embedded processors have small on-chip level-1 da ta memories-far too small to hold critical blocks of data and coefficients for audio and video decompression algorithms. In such cases, the level-1 data memory tends to be the performance bottleneck and the processor is unlikely to achieve anything near the level of performance suggested by its peak MIPS rating.
This problem can be compounded by the fact that processing one video frame often requires accessing data from two or three adjacent frames. The instruction memory can become a bottleneck too, since video algorithms in particular can require a large amount of code.
Memory access optimizations can greatly improve overall performance by minimizing memory access overhead. For example, the default unit for processing in video algorithms is a frame. That is, the algorithm performs one operation after another on a frame of video until the entire processing sequence is done, and then begins again with the next frame.
Frames, however, are typically much larger then the processor's le vel-1 memory. Thus, processing an entire frame requires pulling in new portions of the frame and overwriting old portions of the frame at each processing step, over and over again through the entire processing sequence. An optimization to this method is to change the fundamental unit of operation from a frame to a subset of a frame that can fit entirely in the level-1 memory. Now the entire processing sequence is performed on the frame subset that stays resident in the level-1 memory. This type of optimization improves execution time and can reduce power consumption by reducing external memory accesses.
This article was excerpted from ESC paper 266, titled "Developing A/V Software for Consumer Media Products."
See related chart