# Introduction to Video Compression – Part II

October 4, 2006 Leave a comment

**Quantization and Coding**

*Quantization*

As mentioned above, the DCT coefficients for each eight-pixel by eight-pixel block are encoded using more bits for the more perceptually significant low-frequency DCT coefficients and fewer bits for the less significant high-frequency coefficients. This encoding of coefficients is accomplished in two steps: First, quantization is used to discard perceptually insignificant information. Next, statistical methods are used to encode the remaining information using as few bits as possible.

Quantization rounds each DCT coefficient to the nearest of a number of predetermined values. For example, if the DCT coefficient is a real number between –1 and 1, then scaling the coefficient by 20 and rounding to the nearest integer quantizes the coefficient to the nearest of 41 steps, represented by the integers from -20 to +20. Ideally, for each DCT coefficient a scaling factor is chosen so that information contained in the digits to the right of the decimal point of the scaled coefficient may be discarded without introducing perceptible artifacts to the image.

In the image decoder, dequantization performs the inverse of the scaling applied in the encoder. In the example above, the quantized DCT coefficient would be scaled by 1/20, resulting in a dequantized value between -1 and 1. Note that the dequantized coefficients are not equal to the original coefficients, but are close enough so that after the IDCT is applied, the resulting image contains few or no visible artifacts. Dequantization can require anywhere from about 3% up to about 15% of the processor cycles spent in a video decoding application. Like the DCT and IDCT, the memory requirements of quantization and dequantization are typically negligible.

*Coding*

The next step in the compression process is to encode the quantized DCT coefficients in the digital bit stream using as few bits as possible. The number of bits used for encoding the quantized DCT coefficients can be minimized by taking advantage of some statistical properties of the quantized coefficients.

After quantization, many of the DCT coefficients have a value of zero. In fact, this is often true for the vast majority of high-frequency DCT coefficients. A technique called "run-length coding" takes advantage of this fact by grouping consecutive zero-valued coefficients (a "run") and encoding the number of coefficients (the "length") instead of encoding the individual zero-valued coefficients.

To maximize the benefit of run-length coding, low-frequency DCT coefficients are encoded first, followed by higher-frequency coefficients, so that the average number of consecutive zero-valued coefficients is as high as possible. This is accomplished by scanning the eight-by-eight-coefficient matrix in a diagonal zig-zag pattern.

Run-length coding is typically followed by variable-length coding (VLC). In variable-length coding, each possible value of an item of data (i.e., each possible run length or each possible value of a quantized DCT coefficient) is called a symbol. Commonly occurring symbols are represented using code words that contain only a few bits, while less common symbols are represented with longer code words. VLC uses fewer bits for the most common symbols compared to fixed-length codes (e.g., directly encoding the quantized DCT coefficients as binary integers) so that on average, VLC requires fewer bits to encode the entire image. Huffman coding is a variable-length coding scheme that optimizes the number of bits in each code word based on the frequency of occurrence of each symbol.

Note that theoretically, VLC is not the most efficient way to encode a sequence of symbols. A technique called "arithmetic coding" can encode a sequence of symbols using fewer bits than VLC. Arithmetic coding is more efficient because it encodes the entire sequence of symbols together, instead of using individual code words whose lengths must each be an integer number of bits. Arithmetic coding is more computationally demanding than VLC and has only recently begun to make its way into commercially available video compression algorithms. Historically, the combination of run-length coding and VLC has provided sufficient coding efficiency with much lower computational requirements than arithmetic coding, so VLC is the coding method used in the vast majority of video compression algorithms available today.

Variable-length coding is implemented by retrieving code words and their lengths from lookup tables, and appending the code word bits to the output bit stream. The corresponding variable-length decoding process (VLD) is much more computationally intensive. Compared to performing a table lookup per symbol in the encoder, the most straightforward implementation of VLD requires a table lookup and some simple decision making to be applied for each bit. VLD requires an average of about 11 operations per input bit. Thus, the processing requirements of VLD are proportional to the video compression codec’s selected bit rate. Note that for low image resolutions and frame rates, VLD can sometimes consume as much as 25% of the cycles spent in a video decoder implementation.

In a typical video decoder, the straightforward VLD implementation described above requires several kilobytes of lookup table memory. VLD performance can be greatly improved by operating on multiple bits at a time. However, such optimizations require the use of much larger lookup tables.

One drawback of variable-length codes is that a bit error in the middle of an encoded image or video frame can prevent the decoder from correctly reconstructing the portion of the image that is encoded after the corrupted bit. Upon detection of an error, the decoder can no longer determine the start of the next variable-length code word in the bit stream, because the correct length of the corrupted code word is not known. Thus, the decoder cannot continue decoding the image. One technique that video compression algorithms use to mitigate this problem is to intersperse "resynchronization markers" throughout the encoded bit stream. Resynchronization markers occur at predefined points in the bit stream and provide a known bit pattern that the video decoder can detect. In the event of an error, the decoder is able to search for the next resynchronization marker following the error, and continue decoding the portion of the frame that follows the resynchronization marker.

In addition, the MPEG-4 video compression standard employs "reversible" variable-length codes. Reversible variable-length codes use code words carefully chosen so that they can be uniquely decoded both in the normal forward direction, and also backwards. In the event of an error, the use of reversible codes allows the video decoder to find the resynchronization marker following the error and decode the bit stream in the backward direction from the resynchronization marker toward the error. Thus, the decoder can recover more of the image data than would be possible with resynchronization markers alone.

All of the techniques described so far operate on each eight-pixel by eight-pixel block independently from any other block. Since images typically contain features that are much larger than an eight-by-eight block, more efficient compression can be achieved by taking into account the correlation between adjacent blocks in the image.

To take advantage of inter-block correlation, a prediction step is often added prior to quantization of the DCT coefficients. In this step, the encoder attempts to predict the values of some of the DCT coefficients in each block based on the DCT coefficients of the surrounding blocks. Instead of quantizing and encoding the DCT coefficients directly, the encoder computes, quantizes, and encodes the difference between the actual DCT coefficients and the predicted values of those coefficients. Because the difference between the predicted and actual values of the DCT coefficients tends to be small, this technique reduces the number of bits needed for the DCT coefficients. The decoder performs the same prediction as the encoder, and adds the differences encoded in the compressed bit stream to the predicted coefficients in order to reconstruct the actual DCT coefficient values. Note that in predicting the DCT coefficient values of a particular block, the decoder has access only to the DCT coefficients of surrounding blocks that have already been decoded. Therefore, the encoder must predict the DCT coefficients of each block based only on the coefficients of previously encoded surrounding blocks.

In the simplest case, only the first DCT coefficient of each block is predicted. This coefficient, called the "DC coefficient," is the lowest frequency DCT coefficient and equals the average of all the pixels in the block. All other coefficients are called "AC coefficients." The simplest way to predict the DC coefficient of an image block is to simply assume that it is equal to the DC coefficient of the adjacent block to the left of the current block. This adjacent block is typically the previously encoded block. In the simplest case, therefore, taking advantage of some of the correlation between image blocks amounts to encoding the difference between the DC coefficient of the current block and the DC coefficient of the previously encoded block instead of encoding the DC coefficient values directly. This practice is referred to as "differential coding of DC coefficients" and is used in the JPEG image compression algorithm.

More sophisticated prediction schemes attempt to predict the first DCT coefficient in each row and each column of the eight-by-eight block. Such schemes are referred to as "AC-DC prediction" and often use more sophisticated prediction methods compared to the differential coding method described above: First, a simple filter may be used to predict each coefficient value instead of assuming that the coefficient is equal to the corresponding coefficient from an adjacent block. Second, the prediction may consider the coefficient values from more than one adjacent block. The prediction can be based on the combined data from several adjacent blocks. Alternatively, the encoder can evaluate all of the previously encoded adjacent blocks and select the one that yields the best predictions on average. In the latter case, the encoder must specify in the bit stream which adjacent block was selected for prediction so that the decoder can perform the same prediction to correctly reconstruct the DCT coefficients.

AC-DC prediction can take a substantial number of processor cycles when decoding a single image. However, AC-DC prediction cannot be used in conjunction with motion compensation. Therefore, in video compression applications AC-DC prediction is only used a small fraction of the time and usually has a negligible impact on processor load. However, some implementations of AC-DC prediction use large arrays of data. Often it may be possible to overlap these arrays with other memory structures that are not in use during AC-DC prediction to dramatically optimize the video decoder’s memory use.