Digital video technology has generated extraordinary growth in the multimedia industry, encompassing a wide variety of other media, including audio, video, images, computer graphics, speech and data or any combination of those. Current industry support for multimedia appears in three forms: application-specific processors, multimedia extensions to general-purpose processors and multimedia co-processors.
Application-specific processors offer low-cost alternatives for specific applications, and multimedia extensions to general-purpose processors offer some support for media processing at little additional cost. Neither approach, however, achieves the flexibility or computational capability that future multimedia applications with high data rates will require. Additionally, emerging applications like MPEG-4 have less processing regularity and will be difficult, if not impossible, to support with application-specific chips. For example, some aspects of MPEG-4 require XML-like presentation and other middleware functions that demand programmability.
Very long-instruction-word (VLIW) processors used within a system-on-chip (SoC) have instruction words with fixed "slots" for instructions that map to the functional units available. This makes the instruction unit much simpler, but places a much larger burden on the compiler to allocate useful work to every slot of every instruction.
The architectural simplicity of VLIW coupled with a high level of parallelism from single-instruction multiple-data (SIMD) arithmetic functional units give VLIW-based SoCs a cost, performance and flexibility advantage over other processors in implementing digital video algorithms in consumer electronics devices.
This can be readily illustrated with MPEG-4 video decompression (or decoding). VLIW provides bandwidth-efficient delivery of high-quality video to the consumer. For real-time MPEG-4 decoding, the software should be implemented to achieve as high a degree of parallelism as possible, and new data-flow approaches are needed to use the architecture efficiently. In the past, many optimization techniques focused on minimizing the total number of serial operations at the expense of more-complex data flows, such as the butterfly pattern required for fast Fourier transforms. Those optimization techniques are more appropriate for fixed-function and general-purpose processor solutions than for DSP processors.
To illustrate, the MPEG-4 simple-profile algorithm uses compression techniques similar to H.263, with some enhancements for lower-bit-rate coding. First, the input compressed bit stream of coded data is parsed and decoded into discrete-cosine-transform (DCT) coefficients. The decoded information is then processed with pixel-based operations such as inverse quantization, inverse DCT and pixel interpolation to reconstruct the coded picture. Algorithmic enhancements specific to MPEG-4 include the use of four motion vectors per macroblock, advanced intrablock prediction of both dc and ac coefficients, and quarter-pixel interpolation.
The first step in implementing an MPEG-4 decoder is parsing the input bit stream. Parsing an MPEG-4 bit stream into its DCT coefficients requires recognizing variable-length codes in the stream. Making efficient use of the parallel arithmetic units in a VLIW is difficult when performing variable-length decoding (VLD) since the algorithm is an inherently serial process. Some VLIW SoCs have separate on-chip coprocessors that specialize in bit serial algorithms.
Those specialized coprocessors allow the VLD to run in parallel with the VLIW core, which is then free to concentrate on the many parallel pixel-processing routines like inverse quantization and IDCT. On VLIW SoCs without a separate coprocessor, efficiency can be achieved by integrating VLD explicitly with the pixel-processing algorithms. For example, one could devote one arithmetic unit to VLD operations and leave all the other arithmetic units for pixel processing. The disadvantage of such an approach, however, is that the tight coupling between VLD loops and pixel-processing loops leads to complicated algorithms that typically have to be implemented in assembly language, negating much of the VLIW's flexibility.
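Whether it runs on a coprocessor or on a reserved arithmetic unit, the VLD core is the same table-driven loop: peek a fixed number of bits ahead and use them to index a lookup table that yields both the decoded symbol and the code length. A minimal sketch follows; the code set and table here are a toy for illustration, not an actual MPEG-4 VLC table.

```c
#include <stdint.h>

typedef struct { uint8_t symbol; uint8_t length; } vlc_entry;

/* Toy 3-bit lookup table (NOT a real MPEG-4 table): code "1" -> symbol 0,
   "01" -> 1, "001" -> 2, "000" -> 3. The index is the next 3 bits of the
   stream, MSB first; short codes occupy multiple table slots. */
static const vlc_entry toy_table[8] = {
    {3, 3}, {2, 3}, {1, 2}, {1, 2}, {0, 1}, {0, 1}, {0, 1}, {0, 1}
};

/* Decode one symbol starting at bit position *pos; the buffer is assumed
   to be padded with at least one extra byte so the 16-bit peek is safe. */
static int vlc_decode(const uint8_t *buf, int *pos)
{
    int byte = *pos >> 3, shift = *pos & 7;
    unsigned window = ((unsigned)buf[byte] << 8) | buf[byte + 1];
    vlc_entry e = toy_table[(window >> (13 - shift)) & 7];
    *pos += e.length;   /* consume exactly the bits of the matched code */
    return e.symbol;
}
```

The bit-position update depends on the symbol just decoded, which is why the loop is inherently serial and maps poorly onto parallel arithmetic slots.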
After the DCT coefficients are extracted from the input bit stream, they must be dequantized by an inverse-quantization algorithm, which multiplies every element by a unique quantization weight. If the coefficient is nonzero, one is usually added, and the result is then divided by a defined constant with rounding toward zero. To make simple truncation behave as rounding toward zero, the first step takes the absolute value of the quantized DCT coefficients; this can be done several elements at a time with most VLIW instruction sets. When the inverse quantization is complete, the original sign is reapplied.
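A scalar sketch of that sign-magnitude sequence is shown below; the weight and divisor values are illustrative placeholders, not values from a particular MPEG-4 quantization matrix, and a SIMD version would apply the same steps to several coefficients per instruction.

```c
#include <stdlib.h>

/* Sign-magnitude inverse quantization: strip the sign, scale, divide
   with truncation (which rounds toward zero for non-negative values),
   then reapply the sign. */
static int inverse_quant(int coeff, int weight, int divisor)
{
    if (coeff == 0)
        return 0;                       /* zero coefficients stay zero */
    int sign = (coeff < 0) ? -1 : 1;
    int mag = abs(coeff);               /* work on |coeff| so that ...      */
    mag = (mag * weight + 1) / divisor; /* ... truncation rounds toward zero */
    return sign * mag;                  /* reapply the original sign */
}
```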
Many of the fast DCT and IDCT algorithms in the past have focused on reducing the number of multiplications, usually at the expense of increasing the number of additions or subtractions and also at the expense of more-complex and irregular data flow. For fixed-function chips, such algorithms were optimal in the sense that they reduced the number of transistors needed to implement the DCT and IDCT functions.
With a large number of multipliers and wide data paths for transferring multiple bytes in a single cycle, the number of mathematical operations is not the most critical factor for performance on today's VLIW processors. Since the cost of performing a highly complex operation like a vector product is the same as that of a simple operation like a register move, it becomes more important to look at the overall number of operations involved in the DCT and IDCT calculations, including such trivial operations as moving data between registers and memory. For example, if one cycle is saved on the multiplications but two more cycles are needed to move the data between registers, the overall number of execution cycles increases by one.
The matrix multiply algorithm, with its much more regular data flow, can be decomposed into a computationally simpler form by separating the even and odd elements of the input vector. In this method, the multiplication with an 8 x 8 matrix is replaced by two 4 x 4 matrix multiplications. The even elements (X0, X2, X4, X6) are multiplied with a 4 x 4 matrix to produce an intermediate vector. Similarly, the odd elements (X1, X3, X5, X7) are multiplied with another 4 x 4 matrix to produce a second intermediate vector. The output vector is obtained by adding and subtracting the two intermediate vectors. Implemented this way, the IDCT maps well onto inner-product operations.
Motion-compensation algorithms can be similarly broken up into a series of SIMD operations that act on multiple pixel elements at a time. Because most VLIW SoCs can perform partitioned multiply, add, shift and round operations, and can access unaligned data, the actual number of cycles spent on the interpolation and averaging required by motion compensation is small. Instead, a much greater proportion of the execution time is spent deciding which type of interpolation and averaging is to be performed. Such decision trees are serial in nature and are usually accelerated with lookup-table operations.
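The interpolation itself reduces to rounded averages of neighboring pixels. A scalar reference for the half-pel horizontal case is sketched below; on a VLIW, a single partitioned-average instruction would produce 8 or 16 of these results per cycle.

```c
#include <stdint.h>

/* Half-pel horizontal interpolation: each output pixel is the rounded
   average of two horizontal neighbors. Requires n + 1 source pixels
   for n outputs. */
static void interp_half_h(const uint8_t *src, uint8_t *dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (uint8_t)((src[i] + src[i + 1] + 1) >> 1); /* +1 rounds up on .5 */
}
```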
VLIW SoCs that incorporate programmable DMA engines can also significantly reduce the number of cycles spent on motion compensation. The DMA engines can be programmed to pre-load the internal memories or data caches of the VLIW SoC via double buffering so that no time is wasted waiting on data to be input.
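The double-buffering pattern can be sketched as follows. The dma_start()/dma_wait() calls are stand-ins for a vendor-specific DMA API; here they are modeled with a plain memcpy so the sketch is self-contained, but the overlap structure — compute on one buffer while the engine fills the other — is the same on real hardware.

```c
#include <string.h>
#include <stdint.h>

#define BLOCK 64   /* bytes per DMA transfer; illustrative size */

/* Stand-ins for a vendor DMA API, modeled synchronously with memcpy. */
static void dma_start(uint8_t *dst, const uint8_t *src) { memcpy(dst, src, BLOCK); }
static void dma_wait(void) { /* real hardware: poll or take an interrupt */ }

/* Double buffering: while the core processes buffer b & 1, the DMA
   engine fills the other buffer with the next block of reference data. */
static long process_frame(const uint8_t *ext_mem, int nblocks)
{
    static uint8_t buf[2][BLOCK];
    long sum = 0;
    dma_start(buf[0], ext_mem);                   /* prime buffer 0 */
    for (int b = 0; b < nblocks; b++) {
        dma_wait();                               /* current block is ready */
        if (b + 1 < nblocks)                      /* prefetch the next block */
            dma_start(buf[(b + 1) & 1], ext_mem + (long)(b + 1) * BLOCK);
        const uint8_t *cur = buf[b & 1];
        for (int i = 0; i < BLOCK; i++)           /* compute overlaps the DMA */
            sum += cur[i];
    }
    return sum;
}
```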
By utilizing the techniques mentioned above on a modern VLIW SoC, real-time performance can readily be achieved with enough room left over for other tasks. For example, with a 128-bit VLIW processor, the number of cycles to completely decode a 352 x 288 4:2:0 picture (at 30 frames/second) is greatly reduced, leaving more than 80 percent of a 300-MHz processor free for other tasks.
See related chart