MPEG-4 encoder on an embedded parallel DSP
By Sal Tuccitto and Joel Turner, EE Times
August 19, 2003 (3:32 p.m. EST)
MPEG-4 is fast becoming the most common protocol for video compression due to its ability to handle multimedia over varying bandwidth conditions.
Implementing the MPEG-4 Simple Profile compression standard to take advantage of the processing power of ChipWrights' CW4511 digital signal processor (DSP) was quite a challenge. This DSP is a single instruction multiple data-path (SIMD) system on a chip (SOC) containing eight parallel processors. The SOC has its own 128K internal memory with interfaces to external SDRAM, input and output video ports, a mass storage interface, and three DMA channels to manage data movement. This architecture is well suited to MPEG-4 applications, as its parallelism can be used to process multiple blocks of an image simultaneously. This improves system throughput and overall performance as compared to serial processors.
Our implementation of the MPEG-4 encoder required more than just porting existing reference code to our DSP. We had access to a reference code base from the International Organization for Standardization (ISO). The ISO MPEG-4 reference code is written in C++ (for improved code clarity), assumes direct access to a flat memory map, and assumes that performance is not an issue.
These assumptions are inconsistent with the needs of a real-time embedded system. The C++ object model introduces a large amount of instruction overhead, which increases both instruction-cache fill stalls and the cycle count directly. Most DSPs, including ours, have a limited amount of full-speed memory, so data has to be passed back and forth through the DMA channels. Finally, this project required a real-time implementation, so some of the more elaborate multi-pass parts of the reference implementation had to be redesigned or abandoned.
Our implementation in C began by separating the MPEG-4 algorithm into a series of discrete functional operations. We then mapped out the data usage of each operation to determine how the data could be broken up into independent blocks that can be processed in parallel.
From this information we created a code framework that handles data in the same manner as the final implementation on our SOC. The code framework started as a collection of function stubs that we filled in one at a time in C. This step-by-step approach allowed us to verify each newly written function against the ISO reference code.
During the data usage analysis portion of this process, we needed to decide how to partition the incoming image data into independent blocks, or patches. Deciding on the size of these patches and the frequency of their movement involved a tradeoff between memory bandwidth and processor performance. In our case, the patch size we chose was eight macro-blocks. In MPEG-4 Simple Profile a macro-block is a section of the image that is 16 x 16 pixels.
This patch size allowed the SIMD SOC's parallel processors to work on one macro-block each. This is important because all eight parallel processors must execute the same instruction at the same time. The operation on each macro-block is independent from the others, enabling simultaneous processing.
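The patch arithmetic described above can be sketched in C. This is a minimal illustration, not the ChipWrights code: the function names are hypothetical, and the raster-order assignment of macro-blocks to patches is an assumption, since the article does not say how macro-blocks are grouped. Only the 16 x 16 macro-block size and the eight-macro-block patch come from the text.

```c
#include <stddef.h>

#define MB_SIZE        16  /* macro-block edge in pixels (MPEG-4 Simple Profile) */
#define MBS_PER_PATCH   8  /* one macro-block per parallel processor */

/* Number of macro-blocks covering a frame, rounding partial blocks up. */
static size_t macroblocks_in_frame(size_t width, size_t height)
{
    size_t mb_cols = (width  + MB_SIZE - 1) / MB_SIZE;
    size_t mb_rows = (height + MB_SIZE - 1) / MB_SIZE;
    return mb_cols * mb_rows;
}

/* Number of eight-macro-block patches; the last patch may be partly empty. */
static size_t patches_in_frame(size_t width, size_t height)
{
    size_t mbs = macroblocks_in_frame(width, height);
    return (mbs + MBS_PER_PATCH - 1) / MBS_PER_PATCH;
}

/*
 * Pixel origin of the macro-block handled by processor `lane` (0..7) in
 * patch `patch`. ASSUMPTION: macro-blocks are assigned in raster order.
 * Returns 1 and fills *x, *y on success, or 0 when the lane falls past
 * the end of the frame and would sit idle.
 */
static int lane_origin(size_t width, size_t height,
                       size_t patch, size_t lane,
                       size_t *x, size_t *y)
{
    size_t mb_cols = (width + MB_SIZE - 1) / MB_SIZE;
    size_t index   = patch * MBS_PER_PATCH + lane;

    if (lane >= MBS_PER_PATCH || index >= macroblocks_in_frame(width, height))
        return 0;

    *x = (index % mb_cols) * MB_SIZE;
    *y = (index / mb_cols) * MB_SIZE;
    return 1;
}
```

Under these assumptions a Quarter-VGA frame (320 x 240) holds 300 macro-blocks, which is not a multiple of eight, so it needs 38 patches and the final patch keeps only four processors busy. A VGA frame (640 x 480) holds 1,200 macro-blocks and divides evenly into 150 patches.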
Once the model was complete and running in C, we began the optimization phase of the project. Here we took advantage of our chip simulator and its profiler for performance analysis and tuning.
The chip simulator is a PC program, part of the ChipWrights integrated development environment (IDE), that emulates the DSP hardware both functionally and with cycle accuracy. This allows us to write and execute assembly code without needing access to the final boards.
Since the simulator can model the cycle count of the DSP, it has a built-in profiler that generates an HTML report on all of the instructions executed in a given simulation. This enables us to evaluate our original assumptions and identify targets for improvement.
The summary section of the profiler output shows us (among other parameters) which functions the system ran, processor time in each function, and a sorted list of functions that use the most cycles (see accompanying figure).
For each function, the output shows us the number of cycles each instruction took to execute as well as any specific stalls for that instruction. The main stalls we looked at were DMA blocking, I-cache fill, instruction ordering, and loop process timing.
Using this information we optimized our code by writing critical sections in assembly language and making sure we rearranged incoming data in an optimal manner for parallel processing. Functions that lent themselves well to parallelization included the discrete cosine transform (DCT), AC & DC prediction, quantization, zig-zag scan, and motion estimation. Routines not readily parallelizable were run length encoding and construction of the output bitstream.
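To illustrate why a step such as the zig-zag scan maps well onto eight lockstep processors, here is a minimal C sketch. It is not the authors' assembly code: the function name, the flattened buffer layout, and the use of a plain inner loop to stand in for the eight processors are all assumptions made for illustration. The table is the classic 8 x 8 zig-zag order; MPEG-4 also defines alternate scans that are not shown here.

```c
#include <stdint.h>

#define LANES   8   /* one 8x8 block per parallel processor */
#define COEFFS 64   /* coefficients in an 8x8 block */

/* Classic zig-zag order: scan position -> raster index within the block. */
static const uint8_t zigzag[COEFFS] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

/*
 * Reorder eight blocks at once. `in` and `out` each hold LANES blocks of
 * COEFFS values, block after block. The outer loop is the single
 * instruction stream; the inner loop stands in for the eight processors,
 * which all read the same source index and write the same scan position
 * in their own block. No lane depends on another lane's data.
 */
static void zigzag_scan_patch(const int16_t *in, int16_t *out)
{
    for (int pos = 0; pos < COEFFS; ++pos) {
        int src = zigzag[pos];
        for (int lane = 0; lane < LANES; ++lane)
            out[lane * COEFFS + pos] = in[lane * COEFFS + src];
    }
}
```

Every lane performs the same number of identical steps regardless of the coefficient values, which is the property a SIMD machine needs. Run-length encoding, by contrast, produces a different amount of output for each block depending on where its zeros fall, which is one plausible reason it resists this treatment.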
We verified our optimized implementation by bit comparison of its output against that of the C++ MPEG-4 reference code. Our implementation surpasses its performance targets, exceeding 30 frames/second at Quarter-VGA resolution and 12 frames/second at VGA resolution. The chip simulator and its profiler enabled us to analyze the tradeoffs involved in balancing memory bandwidth against instruction size and algorithm complexity and, ultimately, to meet our performance goals.
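A bit comparison of two encoded streams can be as simple as the following C sketch. The function name and the convention of numbering bits from the most significant bit of the first byte are assumptions for illustration; the article does not describe the comparison tool that was actually used.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Compare two bitstreams held in memory.
 * Returns -1 when they are identical, otherwise the index of the first
 * differing bit, where bit 0 is the most significant bit of byte 0.
 * If one stream is a strict prefix of the other, the first bit past the
 * shorter stream is reported.
 */
static long long first_mismatch_bit(const uint8_t *ours, size_t ours_len,
                                    const uint8_t *ref,  size_t ref_len)
{
    size_t common = ours_len < ref_len ? ours_len : ref_len;

    for (size_t i = 0; i < common; ++i) {
        uint8_t diff = (uint8_t)(ours[i] ^ ref[i]);
        if (diff) {
            int bit = 0;
            while (!(diff & 0x80)) {   /* find highest set bit of diff */
                diff = (uint8_t)(diff << 1);
                ++bit;
            }
            return (long long)i * 8 + bit;
        }
    }
    if (ours_len != ref_len)
        return (long long)common * 8;
    return -1;
}
```

Reporting the position of the first mismatch, rather than a simple pass or fail, helps trace a divergence back to the macro-block and the encoding stage that produced it.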
This encoder has been integrated with two embedded applications: a security camera and a full-featured digital camera. The security camera application combines a sensor and an image-processing pipeline with the MPEG-4 encoder to transmit live video over a low bit-rate connection. The digital camera application also has a sensor and image-processing pipeline, but includes high-resolution still capture, voice recording, and MP3 playback in the same low-power device.
See related chart
Sal Tuccitto, Software Compression Architect, and Joel Turner, Director of Software, ChipWrights, Inc., Newton, MA.