Update: Cadence Completes Acquisition of Tensilica (Apr 24, 2013)
By Steve Leibson, Technology Evangelist, Tensilica, Inc.
If bandwidth and storage were infinite and free, there would be no need for video compression. However, bandwidth is always limited and, although the per-byte price of storage decreases over time, storage is never free, so we need video compression. In fact, since its introduction in the early 1990s, video compression has grown increasingly important for the design of modern electronic products because it aids in the quest to deliver high-quality video using limited transmission bandwidth and storage capacity.
Video compression and decompression each consist of many steps. Video encoding starts with a series of still images (frames) captured at a certain frame rate (usually 15, 25, or 30 frames/sec). Video cameras capture images using CCD or CMOS sensors that sense red/green/blue (RGB) light intensity but these RGB images do not directly correspond to the way the human eye works.
The human eye uses rods to sense light intensity (luma) and cones to sense color (chroma) and the eye is much more sensitive to luma than chroma because it contains many more rods than cones. Consequently, most video-compression systems first transform RGB pictures into luma and chroma (YUV) images. Then they downsample the chroma portion, which reduces the number of bits in the video stream even before digital compression occurs. Thus most digital video compression schemes take a series of YUV images and produce compressed data while video decompression streams expand a compressed video stream into a series of YUV still images.
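The color transform and chroma downsampling described above can be sketched in a few lines. This is a minimal illustration, not any particular codec's pipeline; it assumes BT.601 luma weights and simple 2x2 averaging for 4:2:0 subsampling, whereas real encoders follow the conversion matrix and subsampling filter mandated by the relevant standard:

```python
import numpy as np

def rgb_to_yuv420(rgb):
    """Convert an RGB frame (H x W x 3, floats in [0, 1], H and W even)
    to Y'UV with 4:2:0 chroma subsampling (illustrative sketch)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luma, kept at full resolution
    u = 0.492 * (b - y)                     # blue-difference chroma
    v = 0.877 * (r - y)                     # red-difference chroma
    # 4:2:0 subsampling: average each 2x2 block of chroma samples,
    # halving the chroma resolution both horizontally and vertically.
    h, w = y.shape
    u420 = u.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    v420 = v.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y, u420, v420

frame = np.random.rand(8, 8, 3)
y, u, v = rgb_to_yuv420(frame)
# Raw RGB: 8*8*3 = 192 samples; YUV 4:2:0: 64 + 16 + 16 = 96 samples,
# a 2x reduction before any entropy coding has even started.
```

Note how the sample count is halved purely by exploiting the eye's lower chroma sensitivity, before any lossy transform coding is applied.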
Because video streams start with a series of still images, video compression streams can use many of the compression techniques developed to compress still images. Many of these techniques are "lossy" (as opposed to "lossless"). Lossy compression techniques identify and discard portions of an image that cannot be perceived or are nearly invisible. One of the most successful lossy compression schemes used for still images has been the ubiquitous JPEG standard. Video streams are even better suited to lossy compression than still images because any image imperfections that result from the compression appear fleetingly and are often gone in a fraction of a second. The eye is much more forgiving with moving pictures than still images.
Most video-compression algorithms slice pictures into small pixel blocks and then transform these blocks from the spatial domain into a series of coefficients in the frequency domain. The most common transform used to perform this conversion has been the DCT (discrete cosine transform), first widely used for JPEG image compression. Most video-compression schemes prior to the H.264/AVC digital-video standard employ the DCT on 8x8-pixel blocks, but H.264/AVC uses a simpler integer approximation of the DCT on 4x4-pixel blocks, with an additional Hadamard transform applied to the grouped DC coefficients of a 16x16-pixel macroblock.
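A 2-D DCT on an 8x8 block, as used by the pre-H.264 codecs mentioned above, can be sketched as two matrix multiplications with an orthonormal DCT-II basis matrix (this construction is standard; the code is an illustration, not production transform code):

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix: row k holds the k-th cosine basis
    function sampled at the n pixel positions."""
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C *= np.sqrt(2.0 / n)
    C[0] /= np.sqrt(2.0)   # DC row gets the 1/sqrt(n) scale
    return C

def dct2(block):
    """Separable 2-D DCT: transform the rows, then the columns."""
    C = dct_matrix(block.shape[0])
    return C @ block @ C.T

# A flat (constant) block has all its energy in the DC coefficient:
flat = dct2(np.ones((8, 8)))
```

Because the matrix is orthonormal, the inverse transform is simply `C.T @ coeffs @ C`, which is what the decoder runs to get pixels back.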
Because the eye is more sensitive to lower frequencies, the pixel blocks' frequency-domain representations can be passed through a low-pass filter, which reduces the number of bits needed to represent each block. In addition, video-compression algorithms can allocate more bits, and therefore more precision, to the low-frequency coefficients while representing the high-frequency coefficients with fewer bits.
Encoding the frequency-domain coefficients takes two further steps. First, the coefficients are quantized to discrete levels using perceptual weighting to limit the number of bits needed for the coefficients. Quantized coefficients are then coded using a lossless variable-length-coding (VLC) technique that codes frequently occurring values with fewer bits, which again reduces the size of the video bitstream.
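The two steps above can be sketched as follows. The perceptual weighting here is the example 8x8 luminance quantization table from the JPEG standard (Annex K); note the step sizes grow toward the bottom-right, i.e. at higher frequencies. The run-length pass shown afterward only produces the (zero-run, value) symbols; a real codec would then feed those symbols to a Huffman or arithmetic VLC stage, and would scan the block in zigzag order first so that the high-frequency zeros cluster into long runs:

```python
import numpy as np

# Example perceptual weighting: the JPEG baseline luminance quantization
# table. Larger divisors = coarser quantization at higher frequencies.
QTABLE = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
])

def quantize(coeffs, qtable=QTABLE):
    """Step 1: divide each coefficient by its perceptual step size and
    round; most high-frequency coefficients collapse to zero."""
    return np.round(coeffs / qtable).astype(int)

def run_length(flat):
    """Step 2 (first half): turn the quantized coefficients into
    (zero-run, value) symbols for the VLC stage to code."""
    symbols, run = [], 0
    for v in flat:
        if v == 0:
            run += 1
        else:
            symbols.append((run, int(v)))
            run = 0
    symbols.append((run, 0))   # end-of-block marker (illustrative)
    return symbols

# A block whose only energy is DC quantizes to a single nonzero symbol:
coeffs = np.zeros((8, 8))
coeffs[0, 0] = 200.0
q = quantize(coeffs)
symbols = run_length(q.flatten())
```

The key point is that quantization is where the loss happens and where most of the compression is won; the VLC stage that follows is lossless bookkeeping on the resulting symbols.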