by Marco Jacobs and Antoine van Wel from Silicon Hive
This paper discusses two low-cost, low-power video processors for the mobile market. The first core is optimized for motion estimation acceleration. The second targets full video coding applications. The processors were designed using an internal design methodology that enables rapid design of programmable processors and their accompanying programming tools. We present high-level requirements, discuss design trade-offs that were made, and describe the resulting processors. We conclude with application performance and silicon area for the two processors.
Recent years have seen a rapid adoption of digital video and still-imaging products in the marketplace. Video recording and playback devices such as camera-enabled mobile phones, digital still cameras, camcorders and DVD devices have become commonplace. Recent video compression algorithms such as MPEG-4, WM9 and H.264 have been standardized and are now ready for adoption into the next generation of these devices. These new standards fulfill the market's needs for higher picture quality, longer storage times, suitability for streaming and lower bandwidth requirements for transmission to these devices.
The computational requirements of these video-processing algorithms are too high for implementation on industry-standard embedded RISC processors. Special video processors have to be developed and integrated into the ICs that form the engine of the next generation of these video-enabled products. The adoption of the new standards is thus dependent on the availability of high-performance, low-cost, low-power hardware building blocks. This paper describes two processor cores that address this market. Silicon Hive, a fully owned subsidiary of Philips Electronics N.V., licenses these processors, which enable its customers to rapidly bring new video ICs and derived products to market.
The rest of this paper is structured as follows. First, we present Silicon Hive's ULIW reconfigurable processor architecture, development methodology and programming model. Second, we describe the target application requirements, including the required flexibility and performance goals. Last, we describe the two video processing cores, each having a different performance point, application area and price point. For each of these cores, we present the high-level architecture, describe design choices that were made, and discuss performance and silicon area.
Processor Architecture and Programming Model
Silicon Hive has developed a proprietary internal design methodology and toolset to enable automatic instantiation of an architecture template [4,6]. With this methodology, design choices can be evaluated using a hardware/software co-design approach. Cycles of processor generation, application code compilation and evaluation, re-architecting and re-evaluation are very short. The number of issue slots, the size and number of register files, the interconnect between execution units and register files, the instruction set (including SIMD instructions and application-specific operations), and the number, connection and porting of local memories can all be rapidly evaluated and altered at design time. Using this iterative approach, an optimum can be achieved for a particular target application.
A Silicon Hive core contains a number of functional units and distributed register files. The register files are positioned close to the functional units that consume the operands they store, thereby exploiting locality of reference the same way ASICs do. Silicon Hive processors use distributed resources, with a focus on local interconnect, which keeps operand network overhead very low. When a direct interconnect line is not available between two resources, Silicon Hive's C compiler schedules the communication with a minimal number of local hops.
Silicon Hive cores have no pipeline control overhead. All pipeline management and operand-forwarding issues are moved to the compiler, which explicitly schedules all pipeline stages. This does increase code size, but in return pushes the computational efficiency of our cores beyond that of traditional processors and DSPs.
If the application software is stable and no longer needs to be changed, cost can be reduced by deconfiguring the core: the instruction words can be mapped into ROM, which is more cost-effective than RAM; the register file sizes can be optimized; and connectivity can be reduced.
The most important tool for programming Silicon Hive processors is the HiveCC compiler. This compiler reads in programs written in ANSI-C, optionally annotated with compiler directives and application-specific instruction selections, and uses powerful, innovative constraint analysis and scheduling techniques to extract the intrinsic instruction-level parallelism. The constraint analysis module turns the partial connectivity of the functional units with the register files into an advantage, as opposed to a hindrance. It prunes infeasible schedules (due to the lack of full connectivity) from the scheduling space, thereby helping the scheduler arrive at optimal solutions. By additionally using extensive software pipelining, the HiveCC compiler is capable of generating optimal instruction schedules.
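To make the term concrete, the following is a hand-written sketch of what software pipelining does to a simple loop. It is not HiveCC output: the function names and the split into load, multiply and store stages are assumptions chosen for illustration. The pipelined variant starts the load for iteration i+1 while iteration i is still being multiplied and stored, which is the kind of overlap a scheduling compiler can place in parallel issue slots.

```c
#include <stdint.h>

/* Plain loop: each iteration loads, multiplies and stores in sequence. */
static void scale_plain(const int16_t *in, int32_t *out, int n, int32_t k)
{
    for (int i = 0; i < n; i++)
        out[i] = (int32_t)in[i] * k;
}

/* Hand-pipelined loop (illustrative only): the load for iteration i+1
 * is issued inside the body that multiplies and stores iteration i. */
static void scale_pipelined(const int16_t *in, int32_t *out, int n, int32_t k)
{
    if (n <= 0)
        return;

    int32_t loaded = in[0];               /* prologue: load for iteration 0 */
    for (int i = 0; i + 1 < n; i++) {
        int32_t product = loaded * k;     /* multiply stage of iteration i */
        loaded = in[i + 1];               /* load stage of iteration i + 1 */
        out[i] = product;                 /* store stage of iteration i */
    }
    out[n - 1] = loaded * k;              /* epilogue: drain the last iteration */
}
```

Both functions produce the same results; only the order in which the work is expressed differs.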
We combine the benefits of an ASIC solution (low power, low cost, and optimization for a specific application domain) with the benefits of a general-purpose DSP solution (maintainability, high-level software programmability, upgradeability, high flexibility, and shorter time-to-market).
Using the above architecture and design methodology, we designed two video processors that target the mobile handset market. Video recording and playback is quickly becoming a standard feature here. The reasons for this include:
- Required bandwidth for acceptable video quality is reduced due to new video compression standards.
- Next-generation wireless networks provide sufficient bandwidth at a reasonable cost.
- Silicon process technology advances make it possible to add video functionality at consumer price points.
- It presents new revenue opportunities for wireless operators.
The following figure describes a typical video compression/decompression pipeline:
Figure 1 - Video coding pipeline
First, the video is captured by a camera and pre-processed before being fed into the compressor. Typical pre-processing functions include scaling, color conversion, noise reduction, and image stabilization. Compression (encoding) then takes place, after which the resulting bitstream is stored or directly transmitted. The decompression (decoding) stage receives a bitstream and decodes it back to the original frame size. The video frames are then post-processed before being displayed. Post-processing typically includes deblocking and deringing, scaling, color conversion and frame-rate conversion.
The image-enhancing algorithms used during pre-processing and post-processing are proprietary and often implemented using hard-wired technology due to the simplicity of the algorithms used and the high data rates. Implementing this functionality on a programmable video processor platform, however, allows differentiation of products by adding customer-proprietary picture quality-enhancing algorithms on top of the basic video coding applications.
The compression and decompression building blocks need to be very flexible. Backward compatibility with existing standards such as (motion) JPEG, MPEG-2 or H.263 and support for newer standards such as MPEG-4 SP/ASP, WM9 and H.264 are required to ensure interoperability with other (evolving) devices, enhance video quality and be future-proof. In order for hard-wired solutions to support multiple standards, large parts of the video-processing pipeline need to be designed and replicated for each of the supported standards individually. This makes the IC larger than necessary, overly complex and inflexible. Hard-wired solutions cannot fulfill these requirements. Moreover, video standards have grown increasingly complex, making it virtually impossible to implement a video codec without using programmable processor techniques.
Furthermore, video compression standards only describe the bitstream syntax. This gives considerable freedom to improve the encoding algorithms over time. For MPEG-2, for example, bitrates for the same high-quality video decreased from 6 Mbps shortly after standardization to 2 Mbps today. A programmable platform allows for inclusion of such algorithmic upgrades late in the design cycle, or even after fabrication.
Moustique MPP Video Processors
We describe two media and pixel processing (MPP) processors in this paper. The first processor, the Moustique MPP ME, is tailored for motion estimation (ME) acceleration. See the figure below for a typical mobile video SOC that includes the Moustique MPP ME processor. The video application runs on a combination of a host processor and a Moustique MPP ME processor.
Figure 2 - Mobile Video SOC with Moustique MPP ME
The Moustique MPP CIF processor is designed to run complete video encoder or decoder applications. This significantly reduces the load on the host processor. The following figure describes a typical mobile video SOC that includes the Moustique MPP CIF processor. Here, the host processor’s resources are completely free and can be used for other applications.
Figure 3 - Mobile Video SOC with Moustique MPP CIF
Both processors can also be used for additional implementation of pre-processing or post-processing routines. We will focus on the video encoder, since this is the most demanding application.
The following figure provides a very high-level description of a typical video encoder.
Figure 4 - Video encoder diagram
Video processing applications are characterized by:
- High data rates: Some subroutines consist primarily of data transfer without much computation (e.g. motion compensation). The processor should be efficient at moving data around.
- Pixels are processed in groups of e.g. 8x8 pixels: The processor datapath cannot be too wide, as there is only limited room for instruction level parallelism. The number of loop iterations is often small. In addition, moving these blocks around should be efficient.
- In video, 8-bit and 16-bit signed and unsigned fixed-point data types are prevalent: The instruction set should be optimized for these data types.
- Complete video coding applications contain many control code operations: The processor should efficiently execute control code.
The following sections describe the Moustique MPP ME processor and Moustique MPP CIF processor in detail.
Moustique MPP ME
The Moustique MPP ME processor is designed to perform a key portion of the video encoder: motion estimation. The reasons for this choice are:
- ME consumes a large percentage of the video encoder cycles.
- The ME algorithm directly influences the picture quality and compression ratio.
- The ME algorithms are not standardized, so programmability provides an advantage.
The remainder of the encoder can be run on a high-end RISC processor.
The Moustique MPP ME requirements are:
- Optimization for low cost
- A performance level adequate for motion estimation at a 352x288 resolution at 30 frames per second
- Enough headroom left to perform additional video processing routines
See below for a block diagram of the Moustique MPP ME processor:
Figure 5 - Moustique MPP ME video processor
- 4 issue slots
- 7 8x32-bit register files with 1R/1W port
- 1 2x64-bit register file with 1R/1W port
- 1 4x64-bit register file with 1R/1W port
- 4-way application-domain specific instructions
- 16KB local data memory (configurable)
- 16KB instruction memory (configurable)
- Data cache including prefetching support for optimizing transfers over the system bus
- Typical operation at 100MHz (synthesizable to higher clock rates)
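As a rough sanity check, the 100 MHz clock can be combined with the 352x288 at 30 frames per second target to estimate a per-macroblock cycle budget. The sketch below is a back-of-the-envelope calculation, not a figure from the design: the function names are hypothetical, and it assumes the 16x16 macroblock size used by the MPEG-4, H.263 and H.264 families and frame dimensions that are multiples of 16.

```c
#include <stdint.h>

/* Number of 16x16 macroblocks to process per second.
 * Assumes width and height are multiples of 16 (true for 352x288). */
static uint32_t macroblocks_per_second(uint32_t width, uint32_t height,
                                       uint32_t fps)
{
    return (width / 16u) * (height / 16u) * fps;
}

/* Average number of clock cycles available per macroblock (rounded down). */
static uint32_t cycles_per_macroblock(uint32_t clock_hz, uint32_t width,
                                      uint32_t height, uint32_t fps)
{
    return clock_hz / macroblocks_per_second(width, height, fps);
}
```

With the figures quoted in this paper, `macroblocks_per_second(352, 288, 30)` is 11880 and `cycles_per_macroblock(100000000u, 352, 288, 30)` is 8417, so motion estimation and any additional routines have to fit in roughly 8,400 cycles per macroblock.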
The number of issue slots, the functional units available in each slot, the number of register files and their connection to the functional units are determined using an application / architecture co-design methodology.
First, a representative suite of application kernels is selected and ANSI-C source code is written that represents these kernels. For each of the kernels, a performance target is set. The processor designer then goes through cycles of compiling these kernels onto the processor architecture, measuring processor performance and area, and adapting the processor before starting a new cycle.
The instruction set is based on standard RISC-like functional units, such as 32-bit ALUs and multipliers. A high-precision datapath (64-bit) is introduced for specialized operations such as a 4-way SIMD MAC and a 4-way SIMD SAD operation. These significantly increase motion estimation kernel throughput. The number of registers in each register file is chosen such that typical kernels can be mapped with sufficient headroom.
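The effect of a 4-way SAD operation can be modeled in portable C. The sketch below is a behavioral reference only: the name `sad4_u8` and the packing of four unsigned 8-bit pixels into one 32-bit word are assumptions for illustration, since the paper does not specify the operand layout of the actual operation. On the processor the four lanes would be handled by a single operation rather than a loop.

```c
#include <stdint.h>

/* Behavioral model of a hypothetical 4-way SAD operation: each 32-bit
 * operand is treated as four packed unsigned 8-bit pixels, and the
 * result is the sum of the four absolute lane differences. */
static uint32_t sad4_u8(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        int pa = (int)((a >> (8 * lane)) & 0xFFu);
        int pb = (int)((b >> (8 * lane)) & 0xFFu);
        sum += (uint32_t)(pa > pb ? pa - pb : pb - pa);
    }
    return sum;
}
```

For example, `sad4_u8(0x01020304u, 0x04030201u)` compares the lanes (4,1), (3,2), (2,3) and (1,4) and returns 3 + 1 + 1 + 3 = 8.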
The number of issue slots and register files is furthermore determined by the design goal of keeping the instruction memory width within 128 bits, which simplifies the hardware implementation. As a result there are four issue slots with limited connections to the distributed register files.
The Moustique ME core is a ULIW processor with 4 parallel issue slots. Consequently, within one instruction, up to four operations execute in parallel, one operation in each active issue slot. Each issue slot is made up of one or more functional units. The Moustique has 26 functional units located in the four issue slots. Within an issue slot, at most one operation can be started per cycle on one of the functional units in that slot. Each functional unit is a hardware unit that offers a set of related operations. The datapath is 32 bits wide with some functional units also able to operate on 64 bits.
The processor achieves the design goal of performing motion estimation at 352x288 resolution at 30 FPS with low cost and low power. Computing the sum of absolute differences (SAD) of two 8x8 blocks, including half-pel interpolation, takes ~100 cycles. The motion estimation algorithms are software programmable, and proprietary techniques can be used to enhance overall picture quality.
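For readers unfamiliar with the kernel, the following plain-C sketch shows one way to compute such a SAD. It is an illustration under stated assumptions, not the Moustique implementation: the function name is hypothetical, only a horizontal half-pel displacement is handled, and the half-pel sample is taken as the rounded average of two neighboring reference pixels. The reference block must provide nine readable columns per row.

```c
#include <stdint.h>
#include <stdlib.h>

/* SAD between an 8x8 current block and a reference block displaced by
 * half a pixel horizontally. Each half-pel sample is the rounded
 * average of two horizontally adjacent reference pixels, so every
 * reference row must have at least 9 readable pixels. Both blocks use
 * the same row stride (in bytes). */
static unsigned sad8x8_halfpel_h(const uint8_t *cur, const uint8_t *ref,
                                 int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int interp = (ref[x] + ref[x + 1] + 1) >> 1;
            sad += (unsigned)abs(cur[x] - interp);
        }
        cur += stride;
        ref += stride;
    }
    return sad;
}
```

A scalar processor executes the 64 interpolate-subtract-accumulate steps one by one; SIMD operations and parallel issue slots are what bring the kernel down to the order of 100 cycles.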
Moustique MPP CIF
The Moustique MPP CIF processor is designed to perform complete video encoder and decoder functions. The reasons for this choice are:
- Even high-end RISC processors cannot reach the required application performance.
- Its autonomous functioning allows for easy integration into an SOC.
- It can be a programmable drop-in replacement for hard-wired video coding blocks.
The Moustique MPP CIF processor requirements are:
- Target complete video compression applications at a 352x288 resolution at 30FPS
- Generic instruction set and architecture such that any imaging algorithm runs efficiently
- Good C compiler target: a video codec contains many chunks of control code, which should run efficiently
See below for a block diagram of the Moustique MPP CIF processor:
Figure 6 - Moustique MPP CIF video processor
- 3 issue slots
- 3 32x32-bit register files with 3R/2W ports
- 3 4x64-bit register files with 1R/1W ports
- 2 2-way/4-way multipliers
- 3 2-way/4-way ALUs
- 32KB local data memory (configurable)
- 32KB instruction memory cache (configurable)
- 32/64-bit load/store unit, including unaligned load/stores
- Tightly coupled programmable DMA processor
- Video-centric custom operations
- Typical operation at 200MHz
Again, the processor designer selects a representative suite of application kernels, this time larger and including control code, and writes ANSI-C source code that represents these kernels. Performance targets per kernel are derived from the desired target application performance.
A low number of issue slots keeps instruction width within acceptable bounds. Just as in the Moustique MPP ME, the instruction set is based on standard RISC-like functional units such as 32-bit ALUs, multipliers and shifters. This provides a good C compiler target. Next, a complete set of 2-way 16-bit and 4-way 8-bit SIMD instructions is added. For some video kernels, specialized operations are introduced to obtain maximum performance. These include operations to speed up motion estimation, transforms, clipping, and variable-length encoding and decoding. A high-precision datapath is introduced to handle operations such as multiplications, where the data width doubles.
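The behavior of such packed operations can be modeled in portable C. The sketch below is illustrative only: the names `clip_u8`, `add4_u8_sat` and `add2_u16` are hypothetical, and whether the actual SIMD additions saturate or wrap is not stated in this paper, so one variant of each is shown.

```c
#include <stdint.h>

/* Clip an integer to the unsigned 8-bit pixel range [0, 255]. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Model of a 4-way 8-bit add with unsigned saturation: four packed
 * pixels are added lane by lane and each sum is clipped to 255. */
static uint32_t add4_u8_sat(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        int s = (int)((a >> (8 * lane)) & 0xFFu)
              + (int)((b >> (8 * lane)) & 0xFFu);
        out |= (uint32_t)clip_u8(s) << (8 * lane);
    }
    return out;
}

/* Model of a 2-way 16-bit wrapping add: the carry out of the low
 * lane must not propagate into the high lane. */
static uint32_t add2_u16(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;
    return hi | lo;
}
```

The point of the lane-wise semantics is that each pixel is processed independently: a plain 32-bit addition of the same words would let carries leak between neighboring pixels.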
A distributed set of register files, each having a low number of read and write ports, is chosen such that the resulting synthesized RTL yields a low-cost processor, while typical kernels still run efficiently. The number of registers in the register files provides sufficient headroom to reduce compiler-inserted spilling instructions, as well as redundant loads, as much as possible.
The memory architecture comprises two 32-bit memory banks for a total 64-bit load/store bandwidth. This provides the right balance between bandwidth to local memory and computational functional unit throughput.
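The bandwidth this arrangement offers can be estimated with simple arithmetic. The helper functions below are a hypothetical sketch that assumes each bank can serve one access per clock cycle; the paper does not state the access timing, so the results are upper bounds.

```c
#include <stdint.h>

/* Peak local-memory bandwidth in bytes per second, assuming every
 * bank delivers one bank_bits-wide access per cycle. */
static uint64_t peak_bytes_per_second(uint32_t banks, uint32_t bank_bits,
                                      uint64_t clock_hz)
{
    return (uint64_t)banks * (bank_bits / 8u) * clock_hz;
}

/* Minimum number of cycles needed to move num_bytes through the
 * load/store path under the same assumption (rounded up). */
static uint32_t min_transfer_cycles(uint32_t num_bytes, uint32_t banks,
                                    uint32_t bank_bits)
{
    uint32_t bytes_per_cycle = banks * (bank_bits / 8u);
    return (num_bytes + bytes_per_cycle - 1u) / bytes_per_cycle;
}
```

Under this assumption, two 32-bit banks at the typical 200 MHz clock give a peak of 1.6 GB/s, and an 8x8 block of 8-bit pixels (64 bytes) needs at least 8 cycles to load or store.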
A DMA processor independently moves data between external memory and the Moustique’s local memory. This processor is also based on our processor template and designed using the same methodology. The communication between the main processor and the DMA engine processor is based on a combination of shared memory and blocking FIFOs, both standard components in our processor building block library and supported by our development environment. Communication between the two processors is very fast and has very low overhead. Some programming effort is required to program the DMA processor; however, the alternative of using a data-cache approach would mean data movement is nondeterministic and less controllable. This presents problems, especially for full video coding applications.
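A common way to exploit an independent DMA engine is double buffering: while the main processor works on one local buffer, the DMA engine fills the other. The sketch below illustrates the access pattern only. All names are hypothetical, the DMA transfer is replaced by a `memcpy`, and the two steps run sequentially here, whereas on the real system they would run concurrently and be synchronized through the blocking FIFOs.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 64  /* one 8x8 block of 8-bit pixels */

/* Stand-in for a DMA transfer: copies one block from external memory
 * into a local buffer. */
static void dma_fetch(uint8_t *local, const uint8_t *external, int block_index)
{
    memcpy(local, external + (size_t)block_index * BLOCK_BYTES, BLOCK_BYTES);
}

/* Stand-in compute kernel: sums the pixel values of one block. */
static uint32_t process_block(const uint8_t *local)
{
    uint32_t sum = 0;
    for (int i = 0; i < BLOCK_BYTES; i++)
        sum += local[i];
    return sum;
}

/* Ping-pong buffering: block n is processed from one local buffer
 * while block n + 1 is fetched into the other. */
static uint32_t process_stream(const uint8_t *external, int num_blocks)
{
    uint8_t local[2][BLOCK_BYTES];
    uint32_t total = 0;

    if (num_blocks <= 0)
        return 0;

    dma_fetch(local[0], external, 0);              /* prime the first buffer */
    for (int n = 0; n < num_blocks; n++) {
        if (n + 1 < num_blocks)
            dma_fetch(local[(n + 1) & 1], external, n + 1);
        total += process_block(local[n & 1]);
    }
    return total;
}
```

Because the program decides exactly when each block is fetched, the timing of data movement is known in advance, which is the determinism argument made above in favor of DMA over a data cache.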
The resulting Moustique MPP CIF processor is a suitable platform for efficient implementations of a wide variety of image processing applications, in particular video coding and decoding. Targeted performance levels are reached at the desired price point.
Performance and Silicon Area
The following table lists the resulting processor performance and silicon area. These area numbers are based on a 0.13 µm process. Since our cores are synthesizable, actual size depends on the exact process technology used, target frequency and other factors.
We have presented a methodology for generating processor cores and their programming tools. We have applied this methodology during the design of two video processor cores. The resulting cores meet the performance and application targets at the required price points. The processors provide the flexibility and ease of programming necessary to quickly develop video-centric ICs and applications. The cores are readily available for licensing. Silicon Hive is extending its video processor family to include processor designs for higher resolutions, such as VGA and Standard Definition video, as well as for High Definition video.
[1] Lex Augusteijn, The HiveCC Compiler for Massively Parallel ULIW Cores, Embedded Processor Forum, San Jose, May 17-20, 2004
[2] Robert Bleidt, MPEG-4 and the Future of Mobile Video, Software Development Forum Multimodal SIG, March 4, 2004
[3] Tom R. Halfhill, Silicon Hive Breaks Out, Microprocessor Report, www.MPROnline.com, December 1, 2003
[4] J. Leijten, A Massively Parallel Reconfigurable ULIW Core, Microprocessor Forum, San Jose, October 12, 2003
[5] Iain E.G. Richardson, H.264 and MPEG-4 Video Compression, Wiley, 2003
[6] Silicon Hive Technology Primer, www.SiliconHive.com
HIVECC, ULIW, MOUSTIQUE are trademarks owned by Philips Electronics N.V.