Inside DSP on Digital Video: H.264: the Video codec to watch

Digital video found its first big consumer market in DVD players, and has moved on from there. Now you can buy digital set-top boxes, camcorders, personal video recorders (PVRs), portable media players, and even digital-video-enabled cell phones. Products that can only handle analog video will soon be extinct; they’ll be relegated to technology museums, sitting next to vinyl records and eight-track tape players.

The mass migration from analog to digital video has been enabled by video compression algorithms, or “codecs” (for COmpression/DECompression). In the March 2004 edition of Inside DSP we introduced the basics of video compression. In this article, we take a closer look at one of the hottest new video codecs, H.264.

H.264 was jointly developed by the Moving Picture Experts Group (MPEG, part of ISO) and the International Telecommunication Union (ITU). H.264, published in 2003, is the standard’s ITU name; it also goes by the somewhat lengthy “MPEG-4 Part 10 Advanced Video Coding (AVC).” This designation distinguishes it from MPEG-4 Part 2 (often referred to as MPEG-4), a successor to MPEG-2 that has had limited success in the market.

H.264 is joining a field of established video codecs. The most popular of these is MPEG-2, which is used in all current DVD players. Windows Media 9 and DivX are widely used in streaming video applications (i.e., applications where compressed video “streams” over the Internet and is played back in real time rather than being stored first).

One key attribute of a video compression application is the bit rate of the compressed video stream. Codecs that target specific applications are designed to stay within the bit rate constraints of these applications, while offering acceptable video quality. For example, DVDs use 6-8 Mbps with MPEG-2; video conferencing applications require 50-300 kbps using H.263 (a video conferencing codec). Streaming video applications typically require 50 to 500 kbps, but can exceed 1 Mbps.

Emerging digital video applications such as HDTV and HD-DVD can easily demand a staggering 20-40 Mbps using MPEG-2. Such high bit rates translate into huge storage requirements for HD-DVDs, and a limited number of channels for HDTV. Thus, a key motivation for developing a new codec is to lower the bit rate while preserving (or even improving) video quality relative to MPEG-2. This was the motivation that led to the development of H.264.
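
To put these bit rates in perspective, the storage a program needs is simply its bit rate multiplied by its duration. The short sketch below works this out for the rates quoted above; the two-hour running time and the constant bit rate are assumptions made for illustration only.

```python
def storage_gigabytes(bit_rate_mbps, duration_hours):
    """Storage in gigabytes (10**9 bytes) for a constant-bit-rate stream."""
    total_bits = bit_rate_mbps * 1_000_000 * duration_hours * 3600
    return total_bits / 8 / 1_000_000_000


# Bit rates quoted in the article; the two-hour duration is an assumption.
for rate_mbps in (6, 8, 20, 40):
    size_gb = storage_gigabytes(rate_mbps, 2)
    print(f"{rate_mbps} Mbps for 2 hours: {size_gb:.1f} GB")
```

Under these assumptions, a two-hour program needs about 5.4 GB at 6 Mbps but about 18 GB at 20 Mbps.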

As an example of the improvement offered by H.264, Figure 1 shows the same video frame encoded using MPEG-2 and H.264 at the same bit rate.

Functional overview
Figure 2 shows a simplified block diagram of the H.264 encoder. The encoder uses either intra-frame prediction or motion estimation and compensation to predict the pixels of each image block. Intra-frame prediction uses the pixels of neighboring blocks to predict the pixels of the current block. Motion estimation finds a block in a previously encoded frame that closely matches the current block, and motion compensation uses the selected block to predict the current block. The difference between the predicted pixels and actual pixels is transformed into the frequency domain, generating a block of frequency coefficients. These coefficients are quantized, and the output bitstream is further compressed using entropy coding.
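
The motion estimation and compensation steps described above can be sketched in a few lines. This is a minimal illustration, not how a production encoder works: it assumes an exhaustive integer-pixel search, uses the sum of absolute differences (SAD) as the matching criterion, and runs on small synthetic frames. The function names and the test data are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a block in cur and a block in ref."""
    return sum(
        abs(cur[cy + y][cx + x] - ref[ry + y][rx + x])
        for y in range(size)
        for x in range(size)
    )


def full_search(cur, ref, cx, cy, size, search_range):
    """Return (best_sad, dx, dy) for the block whose top-left corner is (cx, cy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Synthetic 12x12 frames (values are not clipped to 8 bits): the current
# frame shows the reference content displaced by 2 pixels in x and 1 in y.
W = H = 12
ref = [[x * x + 3 * y * y for x in range(W)] for y in range(H)]
cur = [[ref[min(y + 1, H - 1)][min(x + 2, W - 1)] for x in range(W)]
       for y in range(H)]

# Motion estimation: find the best-matching block in the reference frame.
best_sad, dx, dy = full_search(cur, ref, cx=4, cy=4, size=4, search_range=3)

# Motion compensation: the residual is the current block minus the prediction.
residual = [
    [cur[4 + y][4 + x] - ref[4 + dy + y][4 + dx + x] for x in range(4)]
    for y in range(4)
]
print(best_sad, (dx, dy), residual)
```

Here the search recovers the displacement exactly, so the residual that would be passed on to the transform stage is all zeros.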

For H.264 to be successful, it must overcome a key hurdle: achieving widespread adoption among product designers and consumers. A new video codec is more likely to be widely adopted if it can serve a variety of applications. This is challenging, because different applications require different bit rates and video quality, among other characteristics. To meet these divergent needs, digital video codec standards usually specify multiple variants, called “profiles.” Some profiles are designed for ease of implementation and low processing requirements; others emphasize reduced bit rate or high video quality. Profiles that target streaming video applications are often designed for improved error resilience.

H.264 defines three profiles: Baseline, Main, and Extended. The Baseline profile is the simplest profile; it targets mobile applications with limited processing resources.

The Main profile is intended for digital television broadcasting and next-generation DVD applications, and adds features that improve video quality—at the expense of a significant increase in computational complexity.

The Extended profile targets streaming video, and includes features to improve error resilience and to facilitate switching between different bit streams.

In addition to profiles, video codecs typically define multiple “levels,” each of which specifies a set of constraints for key algorithm parameters, such as the maximum bit rate, frame rate (in terms of frames per second, or fps), resolution, number of macroblocks per frame, motion vector range, etc. For example, in H.264’s Level 1, the maximum resolution is QCIF (144 lines and 176 pixels per line) at a frame rate of 15 fps.
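
The relationship between resolution, macroblock count, and frame rate in the QCIF example can be checked with simple arithmetic. The sketch below assumes the 16x16 macroblock size mentioned elsewhere in this article; the function name is hypothetical.

```python
def macroblocks_per_frame(width, height, mb_size=16):
    """Number of mb_size x mb_size macroblocks needed to cover a frame."""
    mbs_wide = (width + mb_size - 1) // mb_size
    mbs_high = (height + mb_size - 1) // mb_size
    return mbs_wide * mbs_high


qcif_mbs = macroblocks_per_frame(176, 144)  # 11 x 9 macroblocks
qcif_mbs_per_second = qcif_mbs * 15         # at 15 fps
print(qcif_mbs, qcif_mbs_per_second)
```

A QCIF frame therefore contains 99 macroblocks, and at 15 fps the decoder must process 1,485 macroblocks per second.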

Although levels and profiles are independent, in practice using a particular profile implies the use of a particular set of levels, and vice versa. For example, inexpensive products (e.g., those with small screens or modest processor speeds) are likely to use the Baseline profile along with a level that specifies a low resolution, frame rate, and bit rate.

Comparing key features

Table 1 compares some of the key features and characteristics of the three H.264 profiles with those of the most common MPEG-2 profile (Main Profile at Main Level) and the MPEG-4 Advanced Simple Profile.

As shown in Table 1, H.264 has a number of features that differ from those of MPEG-2 and MPEG-4. Not all features are supported by all three H.264 profiles; each profile makes a different tradeoff in terms of video quality, bit rate, and computational complexity, among other characteristics. In this section we discuss a few key features of H.264 and explain how they help it achieve high video quality and low bit rates.

Small Transform Size. MPEG-2, MPEG-4, and H.264 all transform the input video to the frequency domain using a Discrete Cosine Transform (DCT). (This transformation facilitates frequency-based compression techniques.) Unlike MPEG-2 and MPEG-4, however, H.264 uses a 4x4-pixel base transform rather than the more common block size of 8x8 pixels. The smaller block size reduces ringing artifacts, thus improving picture quality.
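
As a rough illustration of a 4x4 transform, the sketch below applies the integer matrix that is commonly cited as the H.264 forward core transform (an integer approximation of the DCT). It is a simplified sketch: the scaling that a real codec folds into the quantization step is omitted, and the helper names are hypothetical.

```python
# Integer matrix commonly cited as the H.264 4x4 forward core transform.
CORE = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def matmul4(a, b):
    """Multiply two 4x4 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_transform_4x4(block):
    """Compute CORE * block * CORE^T using integer arithmetic only."""
    return matmul4(matmul4(CORE, block), transpose(CORE))


flat_block = [[10] * 4 for _ in range(4)]
ramp_block = [[0, 10, 20, 30] for _ in range(4)]
print(forward_transform_4x4(flat_block))
print(forward_transform_4x4(ramp_block))
```

A flat block produces a single nonzero (DC) coefficient, while a horizontal ramp produces nonzero values only in the first row of coefficients. Concentrating the energy in a few coefficients is what makes the subsequent quantization and entropy coding effective.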

Variable Block Size. H.264 supports macroblock partitioning (i.e., partitioning a 16x16 macroblock into several smaller blocks). This partitioning can be done with a number of block sizes: 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4. Larger blocks need fewer motion vectors (and thus fewer bits), but may produce a bigger error between the original block and the motion-compensated block (thus requiring more bits). The encoder attempts to partition the frame in a way that makes an efficient trade-off between the number of bits needed to transmit the motion vectors and the number of bits needed to transmit the DCT coefficients of the residual.
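
The trade-off can be made concrete with a toy cost model: total bits equal the bits for the motion vectors plus the bits for the residual. Everything in the sketch below is an illustrative assumption, including the residual bit counts, the fixed cost per motion vector, and the simplification that one partition size is applied uniformly across the macroblock.

```python
# Number of motion vectors when a 16x16 macroblock is split uniformly.
VECTORS_PER_MODE = {"16x16": 1, "16x8": 2, "8x16": 2, "8x8": 4, "4x4": 16}


def choose_partition(residual_bits, bits_per_vector):
    """Pick the partition mode with the lowest total bit cost."""
    def total_bits(mode):
        return VECTORS_PER_MODE[mode] * bits_per_vector + residual_bits[mode]
    return min(residual_bits, key=total_bits)


# Hypothetical residual costs: smaller blocks predict better, so they
# leave a smaller residual to encode.
residual_bits = {"16x16": 400, "16x8": 360, "8x16": 330, "8x8": 300, "4x4": 250}

print(choose_partition(residual_bits, bits_per_vector=12))
print(choose_partition(residual_bits, bits_per_vector=40))
```

With cheap motion vectors (12 bits each) the 8x8 split wins; when vectors are expensive (40 bits each) the encoder is better off with the coarser 8x16 split, even though its residual is larger.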

Intra-Frame and Inter-Frame Prediction. Unlike MPEG-2, all three H.264 profiles use intra-frame block prediction, which estimates pixel values using previously decoded pixels from the same frame. H.264 also uses inter-frame block prediction, which uses motion estimation and motion compensation to exploit similarities between consecutive frames in a video sequence.

For each block, H.264 uses either inter-frame prediction or intra-frame prediction, selecting the method that yields the most efficient coding. It does this by performing both methods and then choosing the one that results in the smallest error between the original and the predicted block. The use of intra-frame and inter-frame prediction is essential to lowering the bit rate of H.264 while preserving video quality.

Motion Vector Prediction. H.264 tries to predict each motion vector based on motion vectors in surrounding blocks. It then transmits the error between the predicted and actual motion vectors rather than transmitting the motion vector itself. This method is effective in reducing bit rates in cases where a large object is moving and the motion vectors of the blocks it covers are similar. In this case, the error between the predicted and actual vectors is small, and requires fewer bits to transmit than the actual motion vector.
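
A minimal sketch of this idea follows. It assumes the predictor is the component-wise median of three neighboring motion vectors (left, above, and above-right), which is how the prediction is commonly described for most block shapes; the choice of neighbors, the function names, and the sample vectors are assumptions for illustration.

```python
def median_predictor(left, above, above_right):
    """Component-wise median of three neighboring motion vectors."""
    pred_x = sorted((left[0], above[0], above_right[0]))[1]
    pred_y = sorted((left[1], above[1], above_right[1]))[1]
    return (pred_x, pred_y)


def motion_vector_difference(actual, predicted):
    """The value that is actually transmitted in place of the full vector."""
    return (actual[0] - predicted[0], actual[1] - predicted[1])


# Neighboring blocks on a large moving object have similar vectors.
predicted = median_predictor(left=(8, 2), above=(9, 2), above_right=(7, 3))
difference = motion_vector_difference(actual=(9, 2), predicted=predicted)
print(predicted, difference)
```

Here the encoder transmits the small difference (1, 0) instead of the full vector (9, 2), and small values cost fewer bits after entropy coding.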

Quarter-Pixel Motion Vector Resolution. H.264 uses motion vectors with 1/4-pixel resolution, compared to 1/2-pixel resolution in MPEG-2. The finer resolution helps to decrease the magnitude of the residuals and thus reduce the number of bits needed to transmit them.

Achieving sub-pixel resolution requires interpolation between pixels. Interpolating for 1/4-pixel resolution rather than 1/2-pixel resolution is somewhat more computationally demanding, and requires higher memory bandwidth. In addition, using 1/4-pixel resolution means that the encoder has to evaluate more candidate motion vectors for each block—further increasing the computational load.
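
The sketch below illustrates sub-pixel interpolation along a single row of pixels. It assumes the six-tap filter weights (1, -5, 20, 20, -5, 1) that are commonly cited for H.264 half-sample luma positions, with quarter-sample values obtained by averaging; treat the exact weights, the one-dimensional simplification, and the function names as assumptions rather than a statement of the standard.

```python
def clip8(value):
    """Clamp a value to the 8-bit pixel range."""
    return max(0, min(255, value))


def half_pel(row, i):
    """Half-sample value between row[i] and row[i + 1].

    Needs two pixels to the left of i and three to the right.
    """
    taps = (1, -5, 20, 20, -5, 1)
    window = row[i - 2:i + 4]
    acc = sum(t * s for t, s in zip(taps, window))
    return clip8((acc + 16) >> 5)


def quarter_pel(row, i):
    """Quarter-sample value between row[i] and the half-sample to its right."""
    return (row[i] + half_pel(row, i) + 1) >> 1


edge = [0, 0, 0, 255, 255, 255]
print(half_pel(edge, 2), quarter_pel(edge, 2))
```

Each half-sample value reads six neighboring pixels, and each quarter-sample value needs a half-sample result first, which gives a feel for why finer motion vector resolution raises both the arithmetic load and the memory bandwidth.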

Multiple Reference Frames. The Main and Extended profiles, like other MPEG standards, support bi-directional motion prediction, which uses both past and future reference frames to predict the contents of the current block.

However, H.264 differs from other codecs in that the encoder is allowed to use more than two reference frames (i.e., more than one past and one future) for motion estimation. Using multiple past or future frames can improve coding efficiency when encoding video sequences with repetitive motion or brief object occlusion. For example, the H.264 encoder can reference an older frame from the sequence and achieve a better compression ratio than an encoder that always uses the previous frame.

The Main and Extended profiles also use weighted prediction, which blends two motion-compensated blocks from different reference pictures, thus improving compression efficiency in fade-ins and fade-outs. This feature, however, doubles the work performed by the encoder and decoder in motion estimation and compensation.
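
Weighted prediction can be sketched as an integer blend of two motion-compensated blocks. The formula below is a simplified illustration (no per-reference offsets), and the weight scale, function name, and sample values are assumptions.

```python
def weighted_blend(block0, block1, w0, w1, log2_denom=5):
    """Blend two equally sized blocks using integer weights.

    Each weight is in units of 1 / 2**log2_denom, so with the default
    setting a weight of 32 means 1.0. The final shift by log2_denom + 1
    averages the two weighted predictions.
    """
    rounding = 1 << log2_denom
    shift = log2_denom + 1
    return [
        [max(0, min(255, (a * w0 + b * w1 + rounding) >> shift))
         for a, b in zip(row0, row1)]
        for row0, row1 in zip(block0, block1)
    ]


bright = [[200, 200], [200, 200]]  # block from a reference before the fade
dark = [[100, 100], [100, 100]]    # block from a reference later in the fade

print(weighted_blend(bright, dark, 32, 32))  # equal weights: plain average
print(weighted_blend(bright, dark, 16, 48))  # frame closer to the dark reference
```

With equal weights the result is the plain average (150). Shifting the weights toward the darker reference gives 125, which tracks a fade more closely than an unweighted average would. Both reference blocks must be fetched and interpolated, which is the source of the doubled workload.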

In-Loop Deblocking Filter. H.264 uses an adaptive in-loop deblocking filter to smooth block-edge artifacts in the reconstructed frame. Implementing the deblocking filter in-loop (i.e., as part of the encoding and decoding algorithms rather than as a separate post-processing step) can increase image quality, especially at low bit rates, but it increases computational complexity. Neither MPEG-2 nor MPEG-4 uses in-loop deblocking.

Sophisticated Entropy Coding. Both MPEG-2 and MPEG-4 use Huffman coding (a type of entropy coding) to encode the output bit stream. Instead of Huffman coding, the H.264 Baseline and Extended profiles use Context-based Adaptive Variable Length Coding (CAVLC). CAVLC is somewhat more complicated than Huffman coding, but is still fairly simple. It uses an integer number of bits to represent each coded value, and doesn’t require a lot of processing horsepower.

The Main profile uses a more complex entropy coding scheme, called Context based Adaptive Binary Arithmetic Coding (CABAC). Since it is based on arithmetic coding, CABAC can use a fractional number of bits to encode each coded value, which results in better coding efficiency (and a lower bit rate) than CAVLC—at the cost of additional computational complexity.
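
The benefit of fractional bits can be seen from basic information theory: a symbol with probability p ideally costs -log2(p) bits, which is less than one bit whenever p exceeds 0.5. The sketch below uses an assumed two-symbol source to compare that ideal cost with a code that must spend at least one whole bit per symbol; it does not model CABAC or CAVLC themselves.

```python
import math


def ideal_bits(probability):
    """Information content of a symbol, in bits."""
    return -math.log2(probability)


def average_bits_per_symbol(probabilities):
    """Entropy of a source: the best achievable average cost per symbol."""
    return sum(p * ideal_bits(p) for p in probabilities if p > 0)


# Assumed source: one very likely symbol and one unlikely symbol.
probabilities = (0.9, 0.1)
print(f"likely symbol:   {ideal_bits(0.9):.3f} bits")
print(f"unlikely symbol: {ideal_bits(0.1):.3f} bits")
print(f"average:         {average_bits_per_symbol(probabilities):.3f} bits")
```

For this source the ideal average is roughly 0.47 bits per symbol. An arithmetic coder can approach that figure, whereas a code limited to whole bits cannot go below 1 bit per symbol when coding each symbol separately.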

Multiple Intra-Prediction Modes. H.264 defines several intra-prediction modes for predicting blocks. Each mode specifies a method of predicting the pixels in a 16x16 or 4x4-pixel block using the previously decoded pixels above and to the left of the current block. A few interpolation directions (in addition to horizontal and vertical) are supported; there is also a DC mode, which sets all pixels to the average of the reference pixels, and a special plane-fitting mode useful for areas with constant luminance gradient.
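
Three of these modes (vertical, horizontal, and DC) are simple enough to sketch for a 4x4 block. The example below builds each prediction from the pixels above and to the left, then picks the mode with the smallest sum of absolute differences; the selection criterion, function names, and sample pixel values are illustrative assumptions, and the directional and plane modes are left out.

```python
def predict_vertical(above, left):
    """Copy the row of pixels above the block into every row."""
    return [list(above) for _ in range(4)]


def predict_horizontal(above, left):
    """Copy each pixel to the left of the block across its row."""
    return [[left[y]] * 4 for y in range(4)]


def predict_dc(above, left):
    """Fill the block with the rounded average of the eight reference pixels."""
    dc = (sum(above) + sum(left) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]


MODES = {
    "vertical": predict_vertical,
    "horizontal": predict_horizontal,
    "dc": predict_dc,
}


def block_sad(a, b):
    return sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def best_intra_mode(block, above, left):
    """Return the lowest-cost mode name and the cost of every mode."""
    costs = {name: block_sad(block, predict(above, left))
             for name, predict in MODES.items()}
    return min(costs, key=costs.get), costs


above = [10, 20, 30, 40]
left = [12, 12, 12, 12]
block = [[10, 20, 30, 40] for _ in range(4)]  # vertical stripes
print(best_intra_mode(block, above, left))
```

For this vertically striped block the vertical mode predicts every pixel exactly, so it is selected and leaves no residual to encode.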

New Slices. Each frame in H.264 can be partitioned into one or more “slices”; each slice, in turn, can contain a different number of macroblocks. There are several different slice types. All three profiles support “I-slices” that contain only intra-predicted macroblocks, and “P-slices” that contain inter-predicted (motion-compensated) macroblocks. The Main and Extended profiles also support “B-slices,” which implement inter-prediction from two reference frames.

The Extended profile adds “Switching I” (SI) slices and “Switching P” (SP) slices. These slices are used to facilitate features like random accesses and video stream switching, and are useful for streaming video applications. For example, SP slices can be used for seamless switching between video streams carrying the same video content but encoded at different bit rates. SI slices use only intra-frame prediction (not inter-frame prediction), and thus can be used for switching between unrelated video streams.

The Baseline and Extended profiles also support “slice groups,” “redundant slices,” and “arbitrary slice order (ASO).” Slice groups allow the macroblocks that comprise a frame to be transmitted in an order that’s different from the raster display order, which improves error resilience. Redundant slices carry a reduced-resolution version of the primary video sequence. These slices are normally ignored by the decoder, but can be used if the primary video sequence is corrupted. Arbitrary slice order allows the slices to be transmitted in any order.
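
One way to picture slice groups is a checkerboard assignment of macroblocks to two groups. The sketch below is purely illustrative: the checkerboard pattern, the two-group setup, and the function names are assumptions, and the article does not specify which mappings the standard allows.

```python
def checkerboard_slice_groups(mbs_wide, mbs_high):
    """Assign each macroblock to slice group 0 or 1 in a checkerboard pattern."""
    return [[(x + y) % 2 for x in range(mbs_wide)] for y in range(mbs_high)]


def transmission_order(group_map):
    """Raster indices of the macroblocks, listed one slice group at a time."""
    mbs_wide = len(group_map[0])
    groups = sorted({g for row in group_map for g in row})
    order = []
    for group in groups:
        for y, row in enumerate(group_map):
            for x, g in enumerate(row):
                if g == group:
                    order.append(y * mbs_wide + x)
    return order


group_map = checkerboard_slice_groups(4, 2)
print(group_map)
print(transmission_order(group_map))
```

With this mapping, the macroblocks of a 4x2 frame are sent in the order 0, 2, 5, 7, 1, 3, 4, 6 rather than in raster order. If the data for one group is lost, every missing macroblock still has received neighbors on all sides, which a decoder could use to conceal the error.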

Implementation issues
H.264 is more complex and computationally demanding than previous-generation video codecs. The complexity of the codec translates into additional development effort and a longer time to market; both of these can be a significant burden for companies trying to implement H.264 themselves. Reference C code is available, but it is not a good starting point for a well-optimized implementation because it is written to illustrate the specification, not for efficiency. Fortunately, there are companies that specialize in providing optimized H.264 codecs for different platforms. Using such third-party implementations reduces the software development effort, but still requires a significant system design effort. Designers also need to ensure that they have sufficient processing horsepower to run the codec in real time, an important consideration given H.264’s higher processing demands. To read more about the challenges of implementing video software, see “Developing Software for a Digital Video Product.”

Watch that codec
The initial reaction to H.264 from the industry has been extremely positive, which has encouraged many companies to develop H.264-based solutions. (At least one company—Conexant—has already begun sampling H.264 chips.) The codec’s high implementation costs and processing requirements, along with competition from other codecs, may initially slow its rate of adoption—but it’s already clear that H.264 is the codec to watch.

Georgi Beloev, Engineer, contributed to this article.
