White Paper: MPEG overview

With DVD Video, it's not just a case of transferring data to and from disk: MPEG compression and decompression are equally vital.


This document presents an overview of the Moving Picture Experts Group (MPEG) standard implemented by the CL480. The standard is officially known as ISO/IEC 11172, Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s. It is more commonly referred to as the MPEG-1 standard.

MPEG addresses the compression, decompression and synchronisation of video and audio signals. The MPEG video algorithm can compress video signals to an average of about 1/2 to 1 bit per coded pixel. At a compressed data rate of 1.2 Mbits per second, a coded resolution of 352 x 240 at 30 Hz is often used, and the resulting video quality is comparable to VHS. Image quality can be significantly improved by using a higher data rate (for example, 2 Mbits per second), and therefore less compression, without changing the coded resolution.
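As a rough check of these figures, the Python sketch below divides a bit rate by the number of pixels coded per second. It assumes that "coded pixel" means a luminance pixel and ignores audio and system-layer overhead, so the results are approximate; the function name is illustrative.

```python
def bits_per_pixel(bit_rate_bps: float, width: int, height: int, frame_rate: float) -> float:
    """Average compressed bits available per coded (luminance) pixel."""
    pixels_per_second = width * height * frame_rate
    return bit_rate_bps / pixels_per_second

# 352 x 240 at 30 Hz is 2,534,400 pixels per second.
for rate in (1.2e6, 2.0e6):
    print(f"{rate / 1e6:.1f} Mbit/s -> {bits_per_pixel(rate, 352, 240, 30):.2f} bits per pixel")
# 1.2 Mbit/s -> 0.47 bits per pixel
# 2.0 Mbit/s -> 0.79 bits per pixel
```

Both values fall within the 1/2 to 1 bit range quoted above.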

MPEG system stream structure

In its most general form, an MPEG system stream is made up of two layers:

The system layer contains timing and other information needed to demultiplex the audio and video streams and to synchronise audio and video during playback

The compression layer includes the audio and video streams

General decoding process

The system decoder extracts the timing information from the MPEG system stream and sends it to the other system components. The system decoder also demultiplexes the video and audio streams from the system stream and sends each to the appropriate decoder.
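The routing step can be sketched in Python as below. It assumes the packet headers have already been parsed into hypothetical (stream_id, payload) pairs. The stream_id ranges used (0xC0 to 0xDF for audio, 0xE0 to 0xEF for video, 0xBE for padding) are those of the MPEG-1 system layer, while the function names are illustrative only.

```python
from collections import defaultdict

def classify(stream_id: int) -> str:
    """Map an MPEG-1 system-layer stream_id byte to an elementary stream type."""
    if 0xC0 <= stream_id <= 0xDF:
        return "audio"          # audio streams 0-31
    if 0xE0 <= stream_id <= 0xEF:
        return "video"          # video streams 0-15
    if stream_id == 0xBE:
        return "padding"
    return "other"              # private streams and reserved values

def demultiplex(packets):
    """Route (stream_id, payload) packets to per-stream byte buffers.

    Assumes packet headers have already been parsed; a real system decoder
    would also extract SCR and PTS values while doing this.
    """
    streams = defaultdict(bytearray)
    for stream_id, payload in packets:
        kind = classify(stream_id)
        if kind in ("audio", "video"):
            streams[(kind, stream_id)] += payload   # padding and others are dropped
    return dict(streams)

packets = [(0xE0, b"\x00\x00\x01\xb3"), (0xC0, b"\xff\xfd"), (0xBE, b"\xff" * 4), (0xE0, b"\x12")]
for key, data in demultiplex(packets).items():
    print(key, data.hex())
```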

Video stream data hierarchy

The MPEG standard defines a hierarchy of data structures in the video stream. A video sequence begins with a sequence header, includes one or more groups of pictures and ends with an end-of-sequence code. Additional sequence headers may be repeated within the sequence.

A group of pictures consists of a header and a series of one or more pictures, and is intended to allow random access into the sequence.

A picture consists of three rectangular matrices representing luminance (Y) and two chrominance (Cb and Cr) values. The Y matrix has an even number of rows and columns. The Cb and Cr matrices are one-half the size of the Y matrix in each direction (horizontal and vertical).
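The size relationship between the three matrices can be expressed as a small Python helper. The function name is hypothetical; it simply halves the luminance dimensions in each direction, as described above.

```python
def plane_sizes(width: int, height: int):
    """Return (Y, Cb, Cr) matrix dimensions as (columns, rows) for MPEG-1 sampling."""
    if width % 2 or height % 2:
        raise ValueError("the Y matrix must have an even number of rows and columns")
    chroma = (width // 2, height // 2)   # half size horizontally and vertically
    return (width, height), chroma, chroma

y, cb, cr = plane_sizes(352, 240)
print(y, cb, cr)  # (352, 240) (176, 120) (176, 120)
samples = y[0] * y[1] + 2 * cb[0] * cb[1]
print(samples)    # 126720 samples per picture, 1.5 per luminance pixel
```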

A slice is one or more contiguous macroblocks. The order of the macroblocks within a slice is left to right and top to bottom. Slices are important in the handling of errors: if the bitstream contains an error, the decoder can skip to the start of the next slice. Having more slices in the bitstream allows better error concealment, but uses bits that could otherwise be used to improve picture quality.
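The sketch below illustrates raster ordering and slice-based resynchronisation for a 352 x 240 picture, which is 22 macroblocks wide and 15 high. The list of slice start addresses is a hypothetical input; a real decoder finds the next slice by searching the bitstream for a slice start code.

```python
def macroblock_position(address: int, width: int):
    """Convert a raster-order macroblock address to (row, column) in macroblock units."""
    mbs_per_row = width // 16
    return divmod(address, mbs_per_row)

def resync_address(error_address: int, slice_starts):
    """Return the first macroblock address of the next slice after an error.

    slice_starts: sorted first-macroblock addresses of each slice (hypothetical input).
    Returns None if the error is in the last slice of the picture.
    """
    for start in slice_starts:
        if start > error_address:
            return start
    return None

starts = list(range(0, 330, 22))        # one slice per macroblock row
print(macroblock_position(23, 352))     # (1, 1)
print(resync_address(50, starts))       # 66
```

With one slice per macroblock row, an error at macroblock 50 costs the rest of that row (macroblocks 50 to 65); with a single slice per picture, the remainder of the picture would be lost. That is the trade-off against the bits spent on slice headers.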

A macroblock is a 16-pixel by 16-line section of the luminance component and the corresponding 8-pixel by 8-line section of each of the two chrominance components. A macroblock therefore contains four Y blocks, one Cb block and one Cr block. In the data stream the four Y blocks come first (top-left, top-right, bottom-left, bottom-right), followed by the Cb block and then the Cr block.
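The block ordering within a macroblock can be sketched as follows. The input layout (lists of rows) and the function name are assumptions made for illustration.

```python
def split_macroblock(y16, cb8, cr8):
    """Split one macroblock into its six 8x8 blocks in bitstream order.

    y16: 16x16 list of luminance rows; cb8, cr8: 8x8 lists of chrominance rows.
    Order: Y top-left, Y top-right, Y bottom-left, Y bottom-right, Cb, Cr.
    """
    def sub(r0, c0):
        return [row[c0:c0 + 8] for row in y16[r0:r0 + 8]]
    return [sub(0, 0), sub(0, 8), sub(8, 0), sub(8, 8), cb8, cr8]

y16 = [[r * 16 + c for c in range(16)] for r in range(16)]
cb8 = [[0] * 8 for _ in range(8)]
cr8 = [[1] * 8 for _ in range(8)]
blocks = split_macroblock(y16, cb8, cr8)
print([b[0][0] for b in blocks])  # [0, 8, 128, 136, 0, 1]
```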

A block is an 8-pixel by 8-line set of values of a luminance or a chrominance component. Note that a luminance block covers one quarter of the displayed image area covered by a chrominance block.

Audio stream data hierarchy

The MPEG standard also defines a hierarchy of data structures for the audio stream, which the decoder accepts, decodes and turns into digital audio output. The MPEG audio stream, like the MPEG video stream, consists of a series of packets. Each audio packet contains an audio packet header and one or more audio frames.

Each audio packet header contains the following information:

Packet start code: identifies the packet as being an audio packet

Packet length: indicates the number of bytes in the audio packet

An audio frame contains the following information:

Audio frame header: contains synchronisation, ID, bit rate and sampling frequency information

Error-checking code: contains error-checking information

Audio data: contains information used to reconstruct the sampled audio data

Ancillary data: contains user-defined data
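The audio frame header is a fixed 32-bit field beginning with a 12-bit syncword. The Python sketch below decodes the main MPEG-1 audio header fields. It leaves bitrate_index as a raw number because its meaning depends on the layer, and the function and dictionary names are illustrative. When protection_bit is 0, a 16-bit CRC (the error-checking code) follows the header.

```python
SAMPLING_HZ = {0b00: 44100, 0b01: 48000, 0b10: 32000}  # MPEG-1 audio; 0b11 is reserved
LAYER = {0b11: 1, 0b10: 2, 0b01: 3}                     # 0b00 is reserved

def parse_audio_header(data: bytes) -> dict:
    """Decode the 32-bit MPEG-1 audio frame header fields."""
    if len(data) < 4:
        raise ValueError("need at least 4 bytes")
    h = int.from_bytes(data[:4], "big")
    if h >> 20 != 0xFFF:
        raise ValueError("syncword 0xFFF not found")
    return {
        "id": (h >> 19) & 0x1,                  # 1 = ISO 11172-3 (MPEG-1)
        "layer": LAYER.get((h >> 17) & 0x3),
        "crc_present": ((h >> 16) & 0x1) == 0,  # protection_bit 0 -> 16-bit CRC follows
        "bitrate_index": (h >> 12) & 0xF,       # meaning depends on the layer
        "sampling_hz": SAMPLING_HZ.get((h >> 10) & 0x3),
        "padding": (h >> 9) & 0x1,
        "mode": (h >> 6) & 0x3,  # 0 stereo, 1 joint stereo, 2 dual channel, 3 single channel
    }

# Layer II, 44.1 kHz, no CRC, bitrate_index 12
print(parse_audio_header(bytes([0xFF, 0xFD, 0xC0, 0x04])))
```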

Inter-picture coding

Much of the information in a picture within a video sequence is similar to information in a previous or subsequent picture. The MPEG standard takes advantage of this temporal redundancy by representing some pictures in terms of their differences from other (reference) pictures, or what is known as inter-picture coding. This section describes the types of coded pictures and explains the techniques used in this process.

Picture types

The MPEG standard defines three main types of pictures: intra, predicted and bidirectional.

Intra pictures, or I-pictures, are coded using only information present in the picture itself. I-pictures provide potential random access points into the compressed video data. I-pictures use only transform coding and provide moderate compression. I-pictures typically use about two bits per coded pixel.

Predicted pictures, or P-pictures, are coded with respect to the nearest previous I- or P-picture. This technique is called forward prediction.

Like I-pictures, P-pictures serve as a prediction reference for B-pictures and future P-pictures. However, P-pictures use motion compensation to provide more compression than is possible with I-pictures. Unlike I-pictures, P-pictures can propagate coding errors, because they are predicted from previous reference (I- or P-) pictures.

Bidirectional pictures, or B-pictures, are pictures that use both a past and future picture as a reference. This technique is called bidirectional prediction. B-pictures provide the most compression and do not propagate errors because they are never used as a reference. Bidirectional prediction also decreases the effect of noise by averaging two pictures.

Video stream composition

The MPEG algorithm allows the encoder to choose the frequency and location of I-pictures. This choice is based on the application's need for random accessibility and the location of scene cuts in the video sequence. In applications where random access is important, I-pictures are typically used twice a second.

The encoder also chooses the number of B-pictures between any pair of reference (I- or P-) pictures. This choice is based on factors such as the amount of memory in the encoder and the characteristics of the material being coded. For example, for a large class of scenes, placing two B-pictures between successive reference pictures works well.
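A display-order picture pattern for one group of pictures can be generated as below. The helper is hypothetical, since MPEG leaves these choices to the encoder. With 30 pictures per second and I-pictures twice a second, a group holds 15 pictures, and two B-pictures between references gives a reference spacing of 3.

```python
def gop_pattern(n: int, m: int) -> str:
    """Display-order picture types for one group of pictures.

    n: pictures per group (distance between I-pictures)
    m: distance between reference pictures (m - 1 B-pictures between them)
    """
    types = []
    for i in range(n):
        if i == 0:
            types.append("I")
        elif i % m == 0:
            types.append("P")
        else:
            types.append("B")  # the final B-pictures reference the next group's I-picture
    return "".join(types)

print(gop_pattern(15, 3))  # IBBPBBPBBPBBPBB
```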

The MPEG encoder reorders pictures in the video stream to present the pictures to the decoder in the most efficient sequence. In particular, the reference pictures needed to reconstruct B-pictures are sent before the associated B-pictures.
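The reordering can be sketched as a single pass over pictures in display order, holding back each run of B-pictures until the reference picture that follows them has been emitted. Picture labels such as "B1" are a hypothetical notation (type letter plus display index).

```python
def display_to_coded(display):
    """Reorder pictures from display order to coded (transmission) order.

    Each reference picture (I or P) is moved ahead of the B-pictures that
    precede it in display order, because those B-pictures need it for
    backward prediction.
    """
    coded, pending_b = [], []
    for pic in display:
        if pic[0] == "B":
            pending_b.append(pic)
        else:
            coded.append(pic)
            coded.extend(pending_b)
            pending_b.clear()
    # Trailing B-pictures would really follow the next group's I-picture;
    # here they are simply appended.
    coded.extend(pending_b)
    return coded

display = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]
print(display_to_coded(display))  # ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```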

Motion compensation

Motion compensation is a technique for enhancing the compression of P- and B-pictures by eliminating temporal redundancy. Motion compensation typically improves compression by about a factor of three compared with intra-picture coding. Motion compensation algorithms work at the macroblock level.

When a macroblock is compressed by motion compensation, the compressed bitstream contains this information:

The spatial vector between the reference macroblock(s) and the macroblock being coded (motion vectors)

The content differences between the reference macroblock(s) and the macroblock being coded (error terms)
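MPEG defines how motion vectors and error terms are coded, not how an encoder finds them. One common approach is full-search block matching, sketched below in Python under simplifying assumptions: whole-pixel vectors only (MPEG-1 also allows half-pixel vectors), pictures as lists of rows of luminance values, and a sum-of-absolute-differences cost. The function names and search range are illustrative.

```python
def sad(ref, cur, ry, rx, cy, cx, size):
    """Sum of absolute differences between a size x size block in ref and one in cur."""
    return sum(
        abs(ref[ry + i][rx + j] - cur[cy + i][cx + j])
        for i in range(size) for j in range(size)
    )

def motion_search(ref, cur, cy, cx, size=16, search=7):
    """Full-search block matching: return (motion_vector, error_terms).

    motion_vector is (dy, dx) from the current block to the best match in ref;
    error_terms is cur minus the matched reference block, sample by sample.
    """
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = cy + dy, cx + dx
            if 0 <= ry and ry + size <= h and 0 <= rx and rx + size <= w:
                cost = sad(ref, cur, ry, rx, cy, cx, size)
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    _, dy, dx = best
    error = [[cur[cy + i][cx + j] - ref[cy + dy + i][cx + dx + j] for j in range(size)]
             for i in range(size)]
    return (dy, dx), error

ref = [[y * 16 + x for x in range(16)] for y in range(16)]
# Current picture: the same content moved down 1 line and right 2 pixels.
cur = [[ref[y - 1][x - 2] if y >= 1 and x >= 2 else 0 for x in range(16)] for y in range(16)]
mv, err = motion_search(ref, cur, 8, 8, size=4, search=3)
print(mv)  # (-1, -2)
```

Because the block has simply moved, all error terms are zero here; in real material they carry whatever the motion vector cannot explain.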

Not all information in a picture can be predicted from a previous picture. Consider a scene in which a door opens: the visual details of the room behind the door cannot be predicted from a previous frame in which the door was closed. When a case such as this arises, that is, when a macroblock in a P-picture cannot be efficiently represented by motion compensation, it is coded in the same way as a macroblock in an I-picture, using transform coding techniques.

The difference between B- and P-picture motion compensation is that macroblocks in a P-picture use the previous reference (I- or P-picture) only, while macroblocks in a B-picture are coded using any combination of a previous or future reference picture.

Four codings are therefore possible for each macroblock in a B-picture:

Intra coding: no motion compensation

Forward prediction: the previous reference picture is used as a reference

Backward prediction: the next reference picture is used as a reference

Bidirectional prediction: two reference pictures are used; the previous reference picture and the next reference picture

Backward prediction can be used to predict uncovered areas that do not appear in previous pictures.
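The four B-picture modes differ only in how the prediction is formed before the error terms are added. The sketch below assumes the forward and backward blocks have already been fetched from the reference pictures using their motion vectors, and represents blocks as flat lists of samples. Bidirectional prediction averages the two with halves rounded up; the names are hypothetical.

```python
def predict_b(mode, forward=None, backward=None):
    """Form the prediction for a B-picture macroblock.

    forward/backward: motion-compensated blocks already fetched from the
    previous and next reference pictures. 'intra' returns None: no prediction.
    """
    if mode == "intra":
        return None
    if mode == "forward":
        return list(forward)
    if mode == "backward":
        return list(backward)
    if mode == "bidirectional":
        # Average the two predictions, rounding halves up (samples are non-negative).
        return [(f + b + 1) // 2 for f, b in zip(forward, backward)]
    raise ValueError(f"unknown mode: {mode}")

def reconstruct(prediction, error_terms):
    """Add decoded error terms to the prediction and clip to the 8-bit range."""
    return [min(255, max(0, p + e)) for p, e in zip(prediction, error_terms)]

pred = predict_b("bidirectional", forward=[10, 20, 255], backward=[11, 20, 0])
print(pred)                              # [11, 20, 128]
print(reconstruct(pred, [3, -25, 200]))  # [14, 0, 255]
```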

Intra-picture (transform) coding

The MPEG transform coding algorithm includes these steps:

Discrete Cosine Transform (DCT)

Quantisation

Run-length encoding

Both image blocks and prediction-error blocks have high spatial redundancy. To reduce this redundancy, the MPEG algorithm transforms 8 x 8 blocks of pixels or 8 x 8 blocks of error terms from the spatial domain to the frequency domain with the Discrete Cosine Transform (DCT).
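For reference, the 8 x 8 forward DCT used in MPEG can be written as F(u,v) = 1/4 C(u) C(v) sum over x,y of f(x,y) cos((2x+1)u pi/16) cos((2y+1)v pi/16), with C(0) = 1/sqrt(2) and C(k) = 1 otherwise. The direct (slow) Python evaluation below is a sketch for illustration; practical encoders use fast factorised algorithms.

```python
import math

N = 8

def dct_2d(block):
    """Forward 8x8 DCT-II: block[x][y] spatial samples -> F[u][v] coefficients."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    # cos_table[u][x] = cos((2x + 1) * u * pi / 16)
    cos_table = [[math.cos((2 * x + 1) * u * math.pi / 16) for x in range(N)] for u in range(N)]
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += block[x][y] * cos_table[u][x] * cos_table[v][y]
            out[u][v] = 0.25 * c(u) * c(v) * total
    return out

flat = [[100] * N for _ in range(N)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 6))  # 800.0: a flat block has only a DC coefficient
```

A perfectly flat block puts all of its energy into the single DC coefficient; real image blocks concentrate most of their energy in the low-frequency corner, which is what the later steps exploit.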

Next, the algorithm quantises the frequency coefficients. Quantisation is the process of approximating each frequency coefficient as one of a limited number of allowed values. The encoder chooses a quantisation matrix that determines how each frequency coefficient in the 8 x 8 block is quantised. Human perception of quantisation error is lower for high spatial frequencies, so high frequencies are typically quantised more coarsely (i.e., with fewer allowed values) than low frequencies.
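A simplified quantiser is sketched below. The weighting matrix is an illustrative stand-in that grows with frequency, not the standard's default matrix, and the step size (weight x quantiser_scale / 8) is a simplification loosely modelled on MPEG-1 intra reconstruction. In MPEG-1, quantiser_scale takes values from 1 to 31.

```python
N = 8

# Illustrative weighting matrix that grows with spatial frequency.
# This is NOT the MPEG-1 default intra matrix, just a stand-in with the same trend.
WEIGHTS = [[8 + 4 * (u + v) for v in range(N)] for u in range(N)]

def quantise(coeffs, weights, quantiser_scale):
    """Map each DCT coefficient to an integer level; step size = weight * scale / 8."""
    return [[round(coeffs[u][v] * 8 / (weights[u][v] * quantiser_scale)) for v in range(N)]
            for u in range(N)]

def dequantise(levels, weights, quantiser_scale):
    """Approximate reconstruction of the coefficients from the integer levels."""
    return [[levels[u][v] * weights[u][v] * quantiser_scale / 8 for v in range(N)]
            for u in range(N)]

coeffs = [[0.0] * N for _ in range(N)]
coeffs[0][0], coeffs[1][1], coeffs[7][7] = 800.0, 30.0, 10.0
levels = quantise(coeffs, WEIGHTS, quantiser_scale=8)
print(levels[0][0], levels[1][1], levels[7][7])  # 100 2 0
```

The high-frequency coefficient is discarded at quantiser_scale=8; with quantiser_scale=2 it would quantise to 1 instead, so a smaller scale preserves more detail at the cost of more bits.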

The combination of DCT and quantisation results in many of the frequency coefficients being zero, especially the coefficients for high spatial frequencies. To take maximum advantage of this, the coefficients are organised in a zigzag order to produce long runs of zeros. The coefficients are then converted to a series of run-amplitude pairs, each pair indicating a number of zero coefficients and the amplitude of a non-zero coefficient. These run-amplitude pairs are then coded with a variable-length code, which uses shorter codes for commonly occurring pairs and longer codes for less common pairs.
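The zigzag scan and run-amplitude conversion can be sketched in Python as follows. The variable-length coding stage is omitted, and for simplicity the DC coefficient is scanned together with the others, although MPEG codes the DC coefficient of intra blocks separately. The function names are illustrative.

```python
def zigzag_order(n=8):
    """(row, column) positions of an n x n block in zigzag scan order."""
    order = []
    for s in range(2 * n - 1):                     # s = row + column (one anti-diagonal)
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:
            rows = reversed(rows)                  # even diagonals run bottom-left to top-right
        order.extend((r, s - r) for r in rows)
    return order

def run_level_pairs(levels):
    """Scan quantised levels in zigzag order and emit (run, amplitude) pairs.

    run is the number of zero coefficients skipped before each non-zero value;
    trailing zeros are not coded (an end-of-block code would follow).
    """
    pairs, run = [], 0
    for r, c in zigzag_order(len(levels)):
        value = levels[r][c]
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs

block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[2][0] = 50, -3, 2
print(run_level_pairs(block))  # [(0, 50), (0, -3), (1, 2)]
```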

Some blocks of pixels need to be coded more accurately than others. For example, blocks with smooth intensity gradients need accurate coding to avoid visible block boundaries. To deal with this inequality between blocks, the MPEG algorithm allows the amount of quantisation to be modified for each macroblock of pixels. This mechanism can also be used to provide smooth adaptation to a particular bit rate.
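The standard only carries the resulting quantiser_scale for each macroblock; how the encoder chooses it is left open. One possible approach, similar in spirit to the activity-based adaptive quantisation of the MPEG-2 Test Model 5 reference encoder, is sketched below: the smallest 8 x 8 luminance variance measures activity, and smooth macroblocks get a finer scale. base_scale and average_activity are hypothetical inputs that a rate controller would supply.

```python
def variance(block):
    values = [v for row in block for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def macroblock_quantiser_scale(y16, base_scale, average_activity):
    """Choose a per-macroblock quantiser_scale (1..31) from spatial activity.

    Smooth macroblocks (low activity) get a smaller scale, i.e. finer quantisation.
    """
    blocks = [[row[c:c + 8] for row in y16[r:r + 8]] for r in (0, 8) for c in (0, 8)]
    activity = 1 + min(variance(b) for b in blocks)
    normalised = (2 * activity + average_activity) / (activity + 2 * average_activity)
    return max(1, min(31, round(base_scale * normalised)))

flat = [[128] * 16 for _ in range(16)]
busy = [[255 if (x + y) % 2 else 0 for x in range(16)] for y in range(16)]
print(macroblock_quantiser_scale(flat, 10, 100), macroblock_quantiser_scale(busy, 10, 100))  # 5 20
```

The normalised factor stays between 0.5 and 2, so the base scale chosen for the bit rate is only ever halved or doubled per macroblock.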

Audio and video synchronisation

The MPEG standard provides a timing mechanism that ensures synchronisation of audio and video. The standard includes two parameters: the System Clock Reference (SCR) and the presentation timestamp (PTS).

The MPEG-specified system clock runs at 90 kHz. System clock reference and presentation timestamp values are coded in MPEG bitstreams using 33 bits. A 33-bit counter at 90 kHz wraps around only after about 26.5 hours (2^33 / 90,000 seconds), so these values can represent any clock cycle in a 24-hour period.
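The timestamp arithmetic can be sketched as follows; the helper names are hypothetical. Because the counters are 33-bit values, differences are taken modulo 2^33 so that a wraparound between two timestamps is handled.

```python
CLOCK_HZ = 90_000
WRAP = 1 << 33  # timestamps are 33-bit values

def seconds_to_ticks(seconds: float) -> int:
    """Convert a time in seconds to a 33-bit 90 kHz timestamp (wrapping at 2**33)."""
    return round(seconds * CLOCK_HZ) % WRAP

def ticks_between(earlier: int, later: int) -> int:
    """Elapsed ticks from earlier to later, allowing for one 33-bit wraparound."""
    return (later - earlier) % WRAP

wrap_hours = WRAP / CLOCK_HZ / 3600
print(f"33-bit counter wraps after {wrap_hours:.1f} hours")  # 26.5
```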

System Clock References

An SCR is a snapshot of the encoder system clock which is placed into the system layer of the bitstream. During decoding, these values are used to update the system clock counter in the CL480.

Presentation timestamps

Presentation timestamps are samples of the encoder system clock that are associated with video or audio presentation units. A presentation unit is a decoded video picture or a decoded audio time sequence. The PTS represents the time at which the video picture is to be displayed or the starting playback time for the audio time sequence.

The decoder either skips or repeats picture displays to ensure that, when a picture is displayed, its PTS is within one picture period of the system clock as set by the SCR (3,000 ticks of the 90 kHz clock at 30 Hz). If the PTS is earlier (has a smaller value) than the current system clock, the decoder discards the picture. If the PTS is later (has a larger value), the decoder repeats the display of the current picture.
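The skip-or-repeat decision can be sketched as below, assuming a 30 Hz picture rate (one picture period is 3,000 ticks) and ignoring 33-bit wraparound. The function name and the "display", "skip" and "repeat" labels are illustrative, not part of the standard or the CL480 interface; stc stands for the decoder's system clock value.

```python
CLOCK_HZ = 90_000

def sync_action(pts: int, stc: int, frame_rate: float = 30.0) -> str:
    """Decide what to do with the next picture given its PTS and the current clock.

    Ticks are 90 kHz; one picture period at 30 Hz is 3000 ticks.
    """
    period = CLOCK_HZ / frame_rate
    diff = pts - stc                      # ignores 33-bit wraparound for simplicity
    if diff < -period:
        return "skip"                     # picture is late: discard it
    if diff > period:
        return "repeat"                   # picture is early: repeat the current display
    return "display"

for pts in (90_000, 100_000, 102_000, 110_000):
    print(pts, sync_action(pts, stc=100_000))
```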

Compiled by John Sabine

(c) 1999 by C-Cube Microsystems, Inc.
