Notes on video compression and compression standards

This page describes the basic techniques used to transmit digital video over a communications network with special attention paid to techniques used to compress the video stream. It was written for the Kan-ed network within the Kansas Board of Regents.

Much of the material on compression is modeled after Peter Symes' wonderful primer on compression, Video Compression Demystified. In particular, Symes used a two-step didactic approach beginning with an over-simplified example and followed with an example describing techniques used by most compression standards employed today.

This page finishes with a description of several common compression standards making extensive use of the introductory material.

Table of contents

The basic process of image transmission

The basic process is that: and this process is repeated at some "frame rate", possibly fast enough to convey a sense of motion (as in "full-motion video").

Note that he sender and receiver may both keep several frames in memory at all times to improve the compression process. For example, while working on one frame the sender may refer to content in one or more previous or subsequent frames.

All along the way there exist "standards" describing how the information should be encoded, decoded, and otherwise represented, and "protocols" describing how information should be exchanged between components of this process (where a component may be a computer, a computer "card", a set-top appliance, etc.).

Note that these functions are always performed by a collection of components, but that individual functions may be grouped together in different ways. That is, a camera and a computer together may perform the same functions as does a PolyCom set-top box.

"...which is converted to a digital image"

What does it mean to convert a sound or an image to a digital form?

In general, it means that sounds and images are converted to electronic signals as voltages or amperages in amplitudes that are "analogous" to the strength of the signals (hence the term "analog").

For example, sound is composed of sound "waves" that can be "sampled" at regular intervals to convert the average wave amplitude during each interval to a voltage value (analogous to the wave amplitude).

To convert the analog information to digital information, the average amplitude value during each interval is simply converted to an integer, represented as string of binary values, so that a stream of sounds, becomes a stream of integers, say 44,000 per second.

In the video realm, an NTSC (SDTV) camera scans a video image as 525 separate scan lines.

To produce a digitized version of the camera output, each scan line can be divided into a series of small areas, called "pixels", for "picture element," and the brightness and color of each pixel can be determined using one of several schemes.

Then each pixel can be encoded as a set of 3 integers:

or as with CCIR 601:

so that each image becomes THREE 2-dimensional matrices (one matrix for Luminance values, one for Cb values, etc.)....

...which may NOT necessarily be of identical size. Cb and Cr may not be sampled for each pixel (but rather "subsampled"):

Notions of "quality" and "resolution"

In general, such processes provide renditions of the original sounds and images with varying degrees of verisimilitude, which correspond to variations in image or sound "quality."

For example, most humans don't hear much sound composed of waves varying more than 20,000 times per second...or 20KHz.

Consequently, sampling sounds at around 44,000 times a second should capture with full fidelity, highest quality.

In the video realm, quality is affected by the number of pixels into which each image is separated, or the video "resolution," usually specified as a rectangular collection of pixels. For example, an NTSC video image is considered to have a quality roughly equivalent to 680 by 483 pixels, while film used in the movie industry is considered to have a quality better than about 4000x2000 pixels (Super High Definition Television).

However, things are not that simple. For example, the eye is more sensitive to the luminance than to color components of a pixel, so encoding luminance values at full resolution, and color components at a lower resolution can produce a better image than might be expected. (Hence: the use of subsampled YUV, rather than RGB.)

...versus computer and network speeds

Here are rates for raw video signals:

     Raw NTSC        168 Mbps
     Raw PAL         216 Mbps
     Raw HDTV        1 to 1.5 Gbps

which are clearly too high for most networks.

One Common Intermediate Format (CIF) can produce a video stream composed of 352x240 (or 288 for PAL) images, which are considered to be roughly equivalent the quality captured by Video Cassette Recorders, which is less than native NTSC quality.

So if you are only sending small CIF images, what's the problem?

Well, let's say that each pixel is represented by 3 values (Y, Cr, Cb), and that you are using 4:2:2 subsampling.

You must encode luminance values for 352 times 240 = 84,480 pixels, plus two smaller color matrices of 176 times 120 = 21,120 pixels, for a total of 126,720 pixels.

If each value is encoded in a single byte, then each image will require 380,160 bytes.

So, if we were to attempt a frame rate of 30 frames per second, we would need about 11.4 MBytes per second (MBps), or 91 Mbits per second (Mbps) bandwidth....

and, then, there's sound.....

How compression can help

It is frequently possible to change the way information is represented to reduce the number of bytes/bits required to store it or send it. The process of doing so is called "compression." And the process of converting the compressed data back into an image is called "decompression."

A single device called a "codec" may be employed to provide this service.

(Of course you might wonder why the information wasn't just created in a compact form to begin with, and the answer to that is that cards that can convert analog to digital video signals may well have enough to do on their own, and may also wish to support the use of alternative encoding schemes without biasing the choices.)

Note, however, that the original image quality is not always preserved by the compression and decompression schemes employed. In fact, quality may be purposely sacrificed to meet other constraints.

There is a continuous, dynamic trade-off between stream quality, computer/codec speed, and available network bandwidth.

To get the aforementioned CIF image to transmit over a 768 Kbps circuit we need about a "compression ratio" of around 100 to 1....while retaining "adequate image quality".

To display such a stream over a display adapter connected through a 528 MBps AGP bus, or even a 33MBps PCI bus, should be possible provided the CPU clock rate is high enough.

As we will see later, the compression process is CPU-intensive, and both network bandwidth and CPU speed are likely candidates for forcing quality reduction to maintain near real-time communications.

Separate the picture into predictable portions and unpredicted discrepancies

The basic approach to compressing a free-standing image or frame is to separate the picture into predictable portions and a matrix of unpredicted discrepancies. Here is an example illustrating this process, but it is NOT exactly the approach used in production-level work.

The example uses a famous image, often used in discussions of compression (copyright 1972, Playboy International):

Here is a grey level version of the image, similar to the luminance matrix described above:


(E = 7.41)

Now use the top-most row and left-most column as starting values and imagine predicting the value of any other pixel by assuming it will be the same as the value of the pixel to its left + the value of the pixel above it divided by 2. (To do this, you have to start by calculating the value of the second element in the second row, and work out from there.)

Then subtract each predicted value from the corresponding actual value to build a matrix of differences or discrepancies (aka "residuals").

This process will generate both positive and negative values, so to get a visual approximation of the result, we can shift all values into positive territory by adding 128 to each value. Having done that, 0 values will be represented by a medium-dark grey, and previously negative values will appear lighter.

Here is a rendition of the matrix of differences generated by this process (copyright McGraw-Hill, 2001):


(E = 4.63)

This process has separated the original data into a set of predicted values along with a set of difference/discrepancy values.

The difference value information seems to represent object edges, while predicted information seems to better represent surfaces.

Another way to think about this is that the prediction process identifies "correlated" or "autocorrelated" pixels within the image, while the difference matrix better identifies "edges,"

Note that the range of difference values is usually smaller than the range of actual values, AND the distribution of difference values is usually very uneven. (Many are close to 0 in this example).

Streams of such values can usually be sent in fewer bytes when encoded in a suitable fashion.

In fact, the E values suggest the first image could be encoded using 8 bits per pixel, while the second could be encoded using only 5 bits per pixel.

Encode the predicted surface data

The predicted surface data can be transmitted by sending ONLY:

Some protocols offer a predetermined set of prediction algorithms from which the encoder may select the most effective, and identify with a single integer value.

Encode the "unpredicted" or "discrepancy" data

The discrepancy data can be encoded using so-called "variable length encoding (VLE)" or "entropy encoding" schemes.

VLE schemes can yield compression when the distribution of values to be encoded is very uneven. That is, some values occur frequently; some occur infrequently.

For example, look at this table (from Friedman) of the frequencies of occurence of letters in English text (not counting spaces which occur with a 19% frequency among the set of letters and space):

Frequency (%): 13 9 8 8 7 7 7 6 6 4 4 3 3 3 3 2 2 2 1 1 1 - - - - -
Letter:         E T A O N I R S H L D C U P F M W Y B G V K Q X J Z

Samuel Morse used this kind of information to define Morse code. It made the code easier to use and shortened the encoded string of "dots" and "dashes".

"E" is coded as a single "dot", which is to say that the most frequently used letter was encoded using the shortest code value.

"T" is encoded as a single "dash", etc.

The less frequent letters end up being coded using longer sequences:

"Q", "X", "J", and "Z" are all encoded as four-part sequences (c.f., "Q" is "dash, dash, dot, dash").

Encoding picture data values is more complicated, but the idea is the same; examine the picture values to determine which are most frequent, and encode them with the shortest codes.

A modified "Huffman" scheme is usually used.

Progress so far

It turns out that this approach can sometimes achieve a lossless 2:1 compression ratio.

However, 2:1 compression is not good enough for most video applications and other methods were developed to achieve better results.

The take-home message is that compression schemes often attempt to predict data values using some simple scheme, and then transmit the predictions and the discrepancies separately using encoding methods appropriate for each type of information.

Categorizing or "quantizing" the data to reduce stream size

Note that the encoding described above is "lossless". The reconstructed image is (very-nearly) an exact copy of the original.

To reduce the stream size, the difference data values could be placed into some number of categories:

      Values -5 to +5 could be placed in category 1,
      Values 6 to 15 into category 2,
      etc.

and the category for each data value would be sent in place of the actual value along with a table of category mean values.

The resulting stream would be smaller than the stream generated for the uncategorized data, and the reconstructed image would be only "relatively similar" to the original.

The quality of the result will usually decline as the number of categories decreases or the size of the category intervals increases.

This process injects "loss" into the compression process.

Note that the size/number of categories can be adjusted on-the-fly to match the video stream to the available bandwidth during periods of congestion or partial failure.

Common compression standards

There exist numerous compression schemes, many of which have been defined as standards by one or more standards groups (sometimes simultaneously).

These standards typically have different target applications and different performance goals.


AVC is "Advanced Video Coding."

Note that MPEG-7 and MPEG-21 have different functions than the standards shown above. MPEG-7 deals with metadata and 21 with comprehensive archive and access systems.

It is difficult to arrange these standards into some hierarchical order on the basis of coding efficiency or product quality. However....

Individual steps in production-quality compression processes

To compress an image, a codec will usually remove "spatial" and "temporal" redundancies.

To remove spatial redundancy, codecs will:

  1. use inTRA-picture prediction of pixel values to build a "residual" or "difference" matrix.

  2. transform the residuals, using a Discrete Cosine Transform (DCT), or equivalent, which does not compress (and may be done losslessly with adequate precision), but sets the stage for the remaining steps.

  3. quantize the transformed residuals, by breaking the value range into a set of categories and transmitting only category numbers. (Significant (and dynamic) compression can be accomplished here, but with an attendant tradeoff in image quality.)

  4. select the order in which values are transmitted (e.g., zig-zag traversal) so as to maximize "runs" of identical category numbers.

  5. run-length encode the resulting series of values (that is, send the number of identical values in a "run" along with that value, instead of sending every value separately), and, finally

  6. variable-length (entropy) encode the resulting stream.

This activity is (usually) performed on "blocks" of 8x8 pixels, and such blocks are sometimes aggregated into "macroblocks" of 4 blocks and/or "slices" for other purposes.

A spatial redundancy elimination example using DCT

Given a matrix f(x,y) of dimension N by N, the Discrete Cosine Transform will generate another N by N matrix, F(u,v), as follows:

                                      N-1  N-1 
    F(u,v) = ( 2 / N) * C(u) * C(v) * SUM( SUM( f(x,y) * cos( a ) * cos( b ) ) )
                                      x=0  y=0
where
     a = ( ( 2x + 1 ) * u * pi ) / 2N 
     b = ( ( 2y + 1 ) * v * pi ) / 2N

     C(u) = 1 / sqrt( 2 ) if u = 0 else C(u) = 1 
     C(v) = 1 / sqrt( 2 ) if v = 0 else C(v) = 1

Here is a history of the changes experienced by an example block of luminance values taken from Symes:

       70   72   70   70   72   68   68   64
      103  101  103  100   99   97   94   94
      132  132  132  130  129  129  125  121
      157  157  155  154  153  150  148  145
      168  163  164  162  163  161  161  156
      172  170  165  166  163  163  162  158
      174  170  167  167  164  163  164  159
      174  173  170  167  167  166  166  160

This block can be transformed via DCT into the next block:

       86    2   -3    6   -2    2   -2    2
     -247   -4   -5   -3    0   -3    1    1
     -117   -1    1   -1   -1   -1   -1    0
      -40   -2    2    1    2    1   -2    0
       -7   -2   -1    1    0   -1   -1    2
       -6    1    0    0    0    0   -2   -1
       -4   -1   -1    1   -2   -1   -1   -1
       -3   -3   -1    1    0    1    1    1

which has concentrated information from the entire matrix into the upper left-hand corner.

In particular, the upper-left-most DCT value represents an average block value (actually 8 times the mean value).

It is called the "DC coefficient"; the other values are "AC coefficients".

The DC coefficient must be sent with higher precision, so it is not included in the stream of values to be quantized, run-length encoded, and then variable-length encoded.

Instead, the DC coefficients from all blocks may be combined and treated as a separate block to which the DCT transformation and quantizing and encoding steps are applied prior to transmission. (See the Tudor reference below for details on the DCT transform.)

Click here to see some graphs illustrating how well correlations were removed and how unevenly the sample data are distributed before and after DCT processing.

Quantizing and traversing the DCT block

The transformed example block can be quantized into this block using Quantization Tables defined in the standards. Note that these Tables treat each location within the block separately, so as to improve subsequent encoding efficiency:

        5    2    0    0    0    0    0    0
      -21    0    0    0    0    0    0    0
       -8    0    0    0    0    0    0    0
       -3    0    0    0    0    0    0    0
        0    0    0    0    0    0    0    0
        0    0    0    0    0    0    0    0
        0    0    0    0    0    0    0    0
        0    0    0    0    0    0    0    0
 

Using a zig-zag pattern starting from the upper-left corner (and going right and then down-left) will usually give the longest run(s) of terminal zeros (53 in this example?).

The entropies of these 3 blocks are 5.04, 1.42, and .30 (so that these matrices COULD be encoded losslessly in 6, 2, and 1 bits per pixel or less).

The example can be encoded at .69 bits per pixel (rather than the original 8 bits per pixel) using a modified Huffman scheme that results in a compression ratio of 11.6 to 1.

(A rough estimate of progress so far)

In very rough terms we might say,

leaving only about 1.5% of the original 109 Mbps....around 1.6 Mbps. This represents a cummulative compression ratio of somewhat more than 40:1.

To further increase the compression ratio, the quantization could be done using more aggressive quantization tables, and/or....

Eliminating temporal redundancy

In addition to this inTRA-picture encoding, inTER-picture prediction can be used to remove "temporal redundancy".

The basic idea is that up to this point a frame has been thought of as a matrix of 8x8 pixel blocks, encoded using the DCT process. However, it often occurs that a block in a frame being encoded is quite similar to a block in another frame, either before the current frame or after it.

As an example, consider the following sequence of 3 frames. If you want to encode the center, or "Current" frame, you can find all of the visual content in either the "Previous" frame or the "Next" frame.

The front portion of the ball can be found in the "Previous" frame, and the back portion can be found in the subsequent ("Next") frame.

This brings up the ideas of

To implement this approach there exist several frame types:

Typically a Group of Pictures (GOP) will be an I frame, followed by several sets composed of....one P frame followed by several B frames.

Caveat: This process is actually quite a bit more complicated. For example, frames are thought of as a matrix of 16x16 pixel blocks ("macroblocks"), each of which contains the DCT-encoded data for 4 8x8 blocks (along with the smaller chroma blocks) that will be used to create such an image, pointers to similar blocks elsewhere in the data stream, DCT-encoded data using similar blocks elsewhere in the data stream, or some combination of these.

See Symes for details.

How similar blocks are identified

The process of searching frames for similar portions is very complicated; here is an attempt at a simplified model.

The first problem is "How can you tell how similar two visual regions are?" The obvious approach is to develop a formula that compares pixel values at the same relative positions within each block. Two popular approaches are to compute the:

Mean Square Error is calculated by calculating the mean value of...the list of values generated by squaring the difference between corresponding values in each block.

The two most similar regions should have the smallest MSE or MAD.

The next problem is "Which blocks do you compare?" The idea is to start with the block in the same relative location as the block being encoded, and then try examine block-size regions within some number of pixels from the center of that block, say 75, plus or minus and up or down (although horizontal moves are more common, so it is probably better to have a wider than taller search area).

The next diagram shows how an encoder could search for a segment of a subsequent ("Next") frame that is similar to a block to be encoded within the "Current" frame, outlined in green (after Li).

The search begins with the block in the same position as the reference block, as described above, and proceeds in this example by moving the comparison region (shown as a red square) to the right some number of pixels at each step. Here, the step size is several pixels.

There exist alternate strategies for searching the area: full search, two-dimensional logarithmic search, etc. In addition, some compression standards (c.f., MPEG-2, MPEG-4) allow partial-pixel step sizes, or "resolution," during these searches. Since pixels are discrete, this requires interpolation of the pixel values between two known pixels.

Note that you don't necessarily want an exact match. Taking a close match can be more useful than having to recode the whole block.

In fact, it is often more economical to send residuals against an inexact match than to DCT encode the current block.

How well do these techniques work?

Here is a single data point (taken from Ze-Nian Li) presenting a clear-cut comparison:

            Compression technique          Compression ratio

                 I frames only                     7:1
                 I and P frames                   20:1
                 I, P and B frames                50:1

Clearly, the INTER-block similarity searches are effective (keeping in mind that YOUR mileage will vary...A LOT).

Note, however, that these methods are also CPU and possibly memory intensive.

H.261 compression

"H.261 is a video coding standard published by the ITU (International Telecom Union) in 1990. It was designed for datarates which are multiples of 64 Kbit/s.... These datarates suit ISDN lines, for which this video codec was designed."

The datarate of the coding algorithm was designed to be set between 40 Kbits/s and 2 Mbits/s." (Peter Cherriman)

H.261 utilizes the generic compression toolset presented above, though in a limited fashion. For example, it will only perform forward prediction, supports only two resolutions:

and is optimized for images similar to the ones sent during video conferencing.

H.261 was designed for non-interlaced image streams.

H.263

"The coding algorithm of H.263 is similar to that used by H.261, with some improvements and changes to improve performance and error recovery. The differences between the H.261 and H.263 coding algorithms are:

H.263 supports five resolutions:

The support for 4CIF and 16CIF means the codec COULD compete with other higher bitrate video coding standards such as the MPEG standards."

H.263 will process interlaced images?

"The original H.263 standard included four optional coding modes." (Richardson) Codecs described as H.263v2 and H.263v3 (aka + and ++) support some or all of the optional modes.

Comparing codec performance

Codec performance depends on many factors. Here are some comments taken from the web primarily about H.261 and H.263:

MPEG-1

The Moving Picture Experts Group (MPEG) was established under ISO/IEC in 1988 to create standards for the delivery of video and audio, primarily on computers, which is to say, "digitally".

MPEG standards were designed to be easy to decode but rather more difficult to encode. (They are "asymmetric".) For example, one might be willing to spend a lot of energy encoding a movie to be placed on a VCD or DVD to be decoded by millions of decoders.

Such codecs may not be suitable for interactive applications, but then again, chip speeds have increased so quickly over the past 10 years, that MPEG-2 IS, in fact, now commonly used for conferencing applications.

The MPEG-1 standard defines video, audio, and "system" components. MPEG-1 Audio Layer 3 is aka "MP3".

MPEG-1 encoders produce Common Source Intermediate Format (CSIF aka SIF) streams (from either NTSC or PAL) input signals, typically at 352x240 resolution at 30 fps with 4:2:0 subsampling, but can produce an image of up to 4095 by 4095 pixels. (However transmission is limited by the number of "macroblocks".)

Image stream quality encoded to be transmitted at around 1.2 Mbps is considered similar to the quality yielded by VHS recording technology,

MPEG-2

Video standards are fairly tightly "tuned" to the performance characteristics of the electronic devices, transmission media, and expected applications that will manipulate and transmit traffic.

To meet the varying needs of a range of devices, MPEG-2 may be thought of as a collection of substandards, or "profiles" that employ different encoding techniques and operate at different rates and/or efficiencies.

Here is a description of SOME of the MPEG-2 profiles with SOME of their "levels":

     Level\Profile    Simple        Main          SNR           High
     -------------    ------        -----         ---          ------- 
     Low                            4:2:0        4:2:0
                                   352x288      352x288
                                   4 Mbps       4 Mbps
                                    I,P,B        I,P,B
     
     Main              4:2:0        4:2:0        4:2:0        4:2:0,4:2:2
                      720x576      720x576      720x576        720x576
                      15 Mbps      15 Mbps      15 Mbps         20Mbps
                       I,P          I,P,B        I,P,B           I,P,B
     
     High              4:2:0                                  4:2:0,4:2:2 
                     1920x1152                                 1920x1152  
                      80 Mbps                                   100 Mbps  
                      I,P,B                                      I,P,B   

For example, MPEG-1 is a proper subset of MPEG-2 known as "Main Profile Low Level" or "MP@LL" for short.

Note that both MPEG-1 and MPEG-2 support I, P, and B frames.

MPEG-2 extensions include:

Typical transmission rates

H.261 used for video conferencing typically generates 64kbps to 1.5 Mbps traffic.

MPEG-1 streams use up to 1.5 Mbps (your average T1).

MPEG-2 is typically used to move 720 by 480 images, but can be used to move 1920 by 1080 (interlaced as well as progressive) images when fed a High Definition input video signal (2200 horizontal samples by 1125 scan lines).

MPEG-2 is used by cable and satellite systems, generating:

MPEG-2 is also used to store info on DVDs at a rate of around 6 Mbps (according to MY math).

MPEG-4 Part 2

The MPEG-4 standard has multiple "Parts"; be careful generalizing.

MPEG-4 Part 2 (aka Visual) was designed to support multimedia; it adds "object" constructs to the video transmission process. (e.g., layers, sprites, non-rectangular image dimensions, etc.)

MPEG-4 Visual was designed to give DVD level quality at lower bit rates and to provide "scalable" quality to serve a wide range of receiving devices.

For such a range of goals, Visual defines 19 profiles grouped in categories:

MPEG-4 Part 10

MPEG-4 Part 10 (aka MPEG-4/AVC or just AVC) is the up-and-coming compression format for video conferencing, video streaming, HDTV, production DVD, broadcast, and satellite broadcast, etc.

MPEG-4/AVC makes improvements to low bandwidth encoding as well as high-quality encoding. (The "low bandwidths" referred to are pretty much below what Kan-ed is concerned with for backbone-connected equipment, but may impact video transmission over telephones and home connections using QuickTime or other streaming servers.)

There are 3 profiles:

Note that compression efficiency appears to be compared on the basis of how much bandwidth is required to send the same image using different codecs. For example, MPEG-4 may be able to send an image in half the bandwidth required by H.263. This suggests that MPEG-4 could send an image of 1.414 times the height (in pixels) and 1.414 times the width (in pixels) of the original image at the original bandwidth.

MPEG-4/AVC aims to improve MPEG-2 and MPEG-4 Part 2 compression by a factor of 2, via such modifications as:

But remember, this compression improvement comes at a cost. One estimate has it that MPEG-4 encoders require 8 times more processing power and decoders are 2.5 to 4 times more complex. (Malak, 2005)

Fidelity Range Extensions (FRExt) added 4 new profiles that support up to 12 bit color sampling and 4:4:4 color encoding, etc. making the standard useful for high-quality video production work

Transport

So far we have only talked about compression standards.

Compressed video streams must be enclosed within some transport protocol that will allow users to establish connections, transmit data, coordinate data delivery in support of synchronization, etc.

The workhorses of such transport are:

RTP provides an interface to UDP (the User Datagram Protocol), and thence IP, for transmission over the Internet.

RTP does not address resource reservation and does not guarantee quality-of-service for real time services.

Applications in the Access Grid, vic and rat, interact directly with RTP.

MPEG-1 and MPEG-2 System Layers (which determine how MPEG packets are transmitted) also interact directly with RTP.

MPEG-4 Part 8 defines transmission over networks.

H.323 conferencing applications also use RTP, as well as a variety of other protocols to manage communications sessions.

Additional information

Video compression is a complex area of study. Techniques have been explored and implemented by multiple instaurations of multiple committees, a small army of grad students, and engineers working in a variety of industries. Many appear to have made a dent in a large and complex sculpture that is difficult to describe. (....a manifestation of post-modern eclecticism in an age of technological optimism?)

Here are some excellent reference materials that have produced both background and content for this talk and will certainly help the motivated reader continue to appreciate this topic:

For more information please contact Michael Grobe at 785-817-2992 or grobe@ku.edu.