Emerging H.264 standard supports broadcast video encoding

By Faouzi Kossentini, President, CEO and Director, Foued Ben Amara, Research and Development Manager, Ali Jerbi, Marketing and Sales Manager, UB Video, Vancouver, British Columbia, EE Times
January 6, 2003 (1:03 p.m. EST)
URL: http://www.eetimes.com/story/OEG20030106S0035

The adoption of digital video in many applications has been fuelled by the development of many video coding standards, which have emerged targeting different application areas. These standards provide the means needed to achieve interoperability between systems designed by different manufacturers for any given application, hence facilitating the growth of the video market.

The International Telecommunication Union, Telecommunication Standardization Sector (ITU-T) is now one of two formal organizations that develop video coding standards - the other being the International Organization for Standardization/International Electrotechnical Commission, Joint Technical Committee 1 (ISO/IEC JTC1). The ITU-T video coding standards are called recommendations, and they are denoted with H.26x (H.261, H.262, H.263 and H.264). The ISO/IEC standards are denoted with MPEG-x (MPEG-1, MPEG-2 and MPEG-4).

The ITU-T recommendations have been designed mostly for real-time video communication applications, such as video conferencing and video telephony. On the other hand, the MPEG standards have been designed mostly to address the needs of video storage (DVD), broadcast video (Cable, DSL, Satellite TV), and video streaming (e.g., video over the Internet, video over wireless) applications.

The main objective of the emerging H.264 standard is to provide a means to achieve substantially higher video quality as compared to what could be achieved using any of the existing video coding standards. Nonetheless, the underlying approach of H.264 is similar to that adopted in previous standards such as H.263 and MPEG-4, and consists of the following four main stages:

Dividing each video frame into blocks of pixels so that processing of the video frame can be conducted at the block level.
Exploiting the spatial redundancies that exist within the video frame by coding some of the original blocks through spatial prediction, transform, quantization and entropy coding (or variable-length coding).
Exploiting the temporal dependencies that exist between blocks in successive frames, so that only changes between successive frames need to be encoded. This is accomplished by using motion estimation and compensation. For any given block, a search is performed in one or more previously coded frames, or in a future frame, to determine the motion vectors that are then used by the encoder and the decoder to predict the subject block.
Exploiting any remaining spatial redundancies that exist within the video frame by coding the residual blocks, that is, the difference between the original blocks and the corresponding predicted blocks, again through transform, quantization and entropy coding.

With H.264, a given video picture is divided into a number of small blocks referred to as macroblocks. For example, a picture with QCIF resolution (176x144) is divided into 99 16x16 macroblocks. A similar macroblock segmentation is used for other frame sizes. The luminance component of the picture is sampled at these frame resolutions, while the chrominance components, Cb and Cr, are down-sampled by two in the horizontal and vertical directions. In addition, a picture may be divided into an integer number of "slices", which are valuable for resynchronization should some data be lost.
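The QCIF arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and rounding up at the picture edges is an assumption made here for frame sizes that are not multiples of 16.

```python
def macroblock_grid(width, height, mb_size=16):
    """Return (columns, rows, total) of macroblocks covering a luma picture."""
    # Assumption: round up so partial blocks at the right/bottom edge are counted.
    cols = (width + mb_size - 1) // mb_size
    rows = (height + mb_size - 1) // mb_size
    return cols, rows, cols * rows


def chroma_size(width, height):
    """Size of each chroma plane (Cb or Cr) when down-sampled by two both ways."""
    return width // 2, height // 2


print(macroblock_grid(176, 144))  # (11, 9, 99)
print(chroma_size(176, 144))      # (88, 72)
```

With chrominance down-sampled by two in each direction, every 16x16 luminance macroblock is accompanied by one 8x8 block of Cb and one of Cr.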

When a frame is coded using only the spatial information it contains, without reference to any other frame, the resulting frame is referred to as an I-picture. I-pictures are typically encoded by directly applying a transform to the different macroblocks in the frame. Consequently, encoded I-pictures are large in size since a large amount of information is usually present in the frame, and no temporal information is used as part of the encoding process. In order to increase the efficiency of the intra coding process in H.264, spatial correlation between adjacent macroblocks in a given frame is exploited. The idea is based on the observation that adjacent macroblocks tend to have similar properties, so a macroblock of interest can be predicted from the surrounding, previously coded macroblocks. The difference between the actual macroblock and its prediction is then coded, which results in fewer bits to represent the macroblock of interest as compared to when applying the transform directly to the macroblock itself.

For regions with less spatial detail (flat regions), H.264 supports 16x16 intra prediction, in which one of four prediction modes (DC, Vertical, Horizontal and Planar) is chosen for the prediction of the entire luminance component of the macroblock. In addition, H.264 supports intra prediction for the 8x8 chrominance blocks also using four prediction modes (DC, vertical, horizontal and planar). Finally, the prediction mode for each block is efficiently coded by assigning shorter symbols to more likely modes, where the probability of each mode is determined based on the modes used for coding of the surrounding blocks.
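As a rough illustration of how such prediction modes work, the sketch below builds DC, vertical and horizontal predictions for an NxN block from the row above it and the column to its left, then picks the mode with the smallest sum of absolute differences (SAD). The planar mode is omitted, and the function names, the SAD-based mode decision and the small 4x4 test block are assumptions for illustration, not the procedure mandated by the standard.

```python
def predict_block(top, left, mode):
    """Build an NxN prediction from the row above (top) and the column to the left (left)."""
    n = len(top)
    if mode == "vertical":      # copy the row above downwards
        return [list(top) for _ in range(n)]
    if mode == "horizontal":    # copy the left column across
        return [[left[r]] * n for r in range(n)]
    if mode == "dc":            # fill with the rounded mean of all 2N neighbours
        dc = (sum(top) + sum(left) + n) // (2 * n)
        return [[dc] * n for _ in range(n)]
    raise ValueError(mode)


def sad(block, pred):
    """Sum of absolute differences between a block and its prediction."""
    return sum(abs(b - p) for rb, rp in zip(block, pred) for b, p in zip(rb, rp))


def best_mode(block, top, left, modes=("dc", "vertical", "horizontal")):
    """Pick the mode whose prediction leaves the smallest residual (hypothetical rule)."""
    return min(modes, key=lambda m: sad(block, predict_block(top, left, m)))


# A block made of vertical stripes is predicted exactly by the vertical mode.
top = [10, 20, 30, 40]
left = [10, 10, 10, 10]
block = [list(top) for _ in range(4)]
print(best_mode(block, top, left))  # vertical
```

Only the residual (block minus prediction) and the chosen mode then need to be coded, which is where the bit savings come from.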

Effective coding

Inter prediction and coding is based on using motion estimation and compensation to take advantage of the temporal redundancies that exist between successive frames, hence providing very efficient coding of video sequences. When a selected reference frame for motion estimation is a previously encoded frame, the frame to be encoded is referred to as a P-picture. When both a previously encoded frame and a future frame are chosen as reference frames, then the frame to be encoded is referred to as a B-picture.
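Motion estimation of this kind is often explained with exhaustive block matching. The following minimal sketch searches a small window in a single reference frame for the offset that minimizes the SAD; the function names, block size and search radius are hypothetical, and practical encoders typically use faster search strategies than this full search.

```python
def block_sad(cur, ref, cx, cy, rx, ry, n):
    """SAD between the n x n block at (cx, cy) in cur and the one at (rx, ry) in ref."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))


def full_search(cur, ref, cx, cy, n=4, radius=2):
    """Return (dx, dy, cost) of the best integer-pixel motion vector in the window."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + n > w or ry + n > h:
                continue  # candidate block falls outside the reference frame
            cost = block_sad(cur, ref, cx, cy, rx, ry, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2], best[0]


# Synthetic test: the current frame is the reference shifted right and down by one pixel.
W = H = 12
ref = [[17 * y + x for x in range(W)] for y in range(H)]
cur = [[ref[(y - 1) % H][(x - 1) % W] for x in range(W)] for y in range(H)]
print(full_search(cur, ref, 4, 4))  # (-1, -1, 0)
```

The motion vector (-1, -1) tells the decoder where in the reference frame to fetch the prediction, so only the vector and any residual need to be transmitted.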

Motion estimation in H.264 supports most of the key features adopted in earlier video standards, but its efficiency is improved through added flexibility and functionality. In addition to supporting P-pictures (with single and multiple reference frames) and B-pictures (with multiple prediction modes), H.264 supports a new inter-stream transitional picture called an SP-picture. The inclusion of SP-pictures in a bit stream enables efficient switching between bit streams with similar content encoded at different bit rates, as well as random access and fast playback modes.

The prediction capability of the motion compensation algorithm in H.264 is further improved by allowing motion vectors to be determined with higher levels of spatial accuracy than in existing standards. Quarter-pixel accurate motion compensation is the lowest-accuracy form of motion compensation in H.264, in contrast with prior standards based primarily on half-pel accuracy, with quarter-pel accuracy only available in the newest version of MPEG-4.
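One simple way to picture quarter-pel accuracy is to store each motion-vector component as an integer count of quarter pixels. The sketch below assumes that representation (the helper names are hypothetical) and splits a component into a whole-pixel offset and a fractional phase; the interpolation filters that produce the fractional samples are not shown.

```python
def split_quarter_pel(mv):
    """Split a motion-vector component given in quarter-pixel units.

    Returns (whole_pixels, quarter_phase) with quarter_phase in 0..3.
    """
    # Floor division keeps the fractional phase non-negative for negative vectors.
    whole, phase = divmod(mv, 4)
    return whole, phase


def to_pixels(mv):
    """Convert a quarter-pixel motion-vector component to pixels."""
    return mv / 4.0


print(split_quarter_pel(9))   # (2, 1)  -> 2 pixels plus 1/4
print(split_quarter_pel(-5))  # (-2, 3) -> -2 pixels plus 3/4, i.e. -1.25
print(to_pixels(-5))          # -1.25
```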

The information contained in a prediction error block resulting from either intra prediction or inter prediction is then re-expressed in the form of transform coefficients. H.264 is unique in that it employs a purely integer spatial transform (a rough approximation of the DCT) which is primarily 4x4 in shape, as opposed to the usual floating-point 8x8 DCT specified with rounding-error tolerances as used in earlier standards. The small shape helps reduce blocking and ringing artifacts, while the precise integer specification eliminates any mismatch issues between the encoder and decoder in the inverse transform.

The quantization step is where a significant portion of data compression takes place. In H.264, the transform coefficients are quantized using scalar quantization with no widened dead-zone. The next step in the encoding process is to arrange the quantized coefficients in an array, starting with the DC coefficient. A single coefficient-scanning pattern is available in H.264 for frame coding, and another one is being added for field coding.
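The two steps just described, quantization followed by scanning, can be sketched as follows. The uniform step size and rounding rule are hypothetical simplifications (an actual codec derives its scaling from a quantization parameter), and the zigzag order shown is the 4x4 frame-scan pattern as commonly documented, included here as an assumption.

```python
# Assumed 4x4 zigzag order as (row, col) pairs, starting at the DC coefficient.
ZIGZAG_4X4 = [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2), (0, 3), (1, 2),
              (2, 1), (3, 0), (3, 1), (2, 2), (1, 3), (2, 3), (3, 2), (3, 3)]


def quantize(coeff, step):
    """Uniform scalar quantizer that rounds to the nearest level (no widened dead zone)."""
    level = (abs(coeff) + step // 2) // step
    return level if coeff >= 0 else -level


def quantize_and_scan(block, step):
    """Quantize a 4x4 coefficient block and return the levels in zigzag order."""
    return [quantize(block[r][c], step) for r, c in ZIGZAG_4X4]


coeffs = [[160, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(quantize_and_scan(coeffs, 16))  # [10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Scanning from low to high frequency tends to group the nonzero levels at the front of the array and the zeros at the end, which suits the entropy coder that follows.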

The last step in the video coding process is entropy coding. Entropy coding is based on assigning shorter codewords to symbols with higher probabilities of occurrence, and longer codewords to symbols with less frequent occurrences. Some of the parameters to be entropy coded include transform coefficients for the residual data, motion vectors and other encoder information. Two types of entropy coding have been adopted. The first method represents a combination of Universal Variable Length Coding (UVLC) and Context-Adaptive Variable-Length Coding (CAVLC). The second method represents Context-Based Adaptive Binary Arithmetic Coding (CABAC).
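The principle of shorter codewords for more likely symbols can be illustrated with Exp-Golomb codes, a family of variable-length codes with a regular structure. Using them as a stand-in for the universal VLC is an assumption made here for illustration; the function names are hypothetical, and neither CAVLC nor CABAC is shown.

```python
def ue(k):
    """Unsigned Exp-Golomb codeword for an integer k >= 0, as a bit string."""
    bits = bin(k + 1)[2:]                 # binary representation of k + 1
    return "0" * (len(bits) - 1) + bits   # prefix of zeros, one fewer than its length


def se(v):
    """Signed Exp-Golomb codeword: map 0, 1, -1, 2, -2, ... to 0, 1, 2, 3, 4, ..."""
    k = 2 * v - 1 if v > 0 else -2 * v
    return ue(k)


for k in range(4):
    print(k, ue(k))
# 0 1
# 1 010
# 2 011
# 3 00100
```

Small values, which are assumed to be the most frequent, receive the shortest codewords, and the prefix of zeros tells the decoder how many bits to read without any lookup table.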

To date, three major profiles have been agreed upon: Baseline, mainly for video conferencing and telephony/mobile applications; Main, primarily for broadcast video applications; and X, mainly for streaming and mobile video applications. The Baseline profile allows the use of Arbitrary Slice Ordering (ASO) to reduce the latency in real-time communication applications, as well as the use of Flexible Macroblock Ordering (FMO) and redundant slices to improve error resilience in the coded bit stream. The Main profile enables an additional reduction in bandwidth over the Baseline profile, mainly through sophisticated bi-directional prediction (B-pictures), CABAC and weighted prediction.
