This page describes the basic techniques used to transmit digital video over a communications network with special attention paid to techniques used to compress the video stream. It was written for the Kan-ed network within the Kansas Board of Regents.
Much of the material on compression is modeled after Peter Symes' wonderful primer on compression, Video Compression Demystified. In particular, Symes used a two-step didactic approach beginning with an over-simplified example and followed with an example describing techniques used by most compression standards employed today.
This page finishes with a description of several common compression standards making extensive use of the introductory material.
Note that he sender and receiver may both keep several frames in memory at all times to improve the compression process. For example, while working on one frame the sender may refer to content in one or more previous or subsequent frames.
All along the way there exist "standards" describing how the information should be encoded, decoded, and otherwise represented, and "protocols" describing how information should be exchanged between components of this process (where a component may be a computer, a computer "card", a set-top appliance, etc.).
Note that these functions are always performed by a collection of components, but that individual functions may be grouped together in different ways. That is, a camera and a computer together may perform the same functions as does a PolyCom set-top box.
In general, it means that sounds and images are converted to electronic signals as voltages or amperages in amplitudes that are "analogous" to the strength of the signals (hence the term "analog").
For example, sound is composed of sound "waves" that can be "sampled" at regular intervals to convert the average wave amplitude during each interval to a voltage value (analogous to the wave amplitude).
To convert the analog information to digital information, the average amplitude value during each interval is simply converted to an integer, represented as string of binary values, so that a stream of sounds, becomes a stream of integers, say 44,000 per second.
In the video realm, an NTSC (SDTV) camera scans a video image as 525 separate scan lines.
To produce a digitized version of the camera output, each scan line can be divided into a series of small areas, called "pixels", for "picture element," and the brightness and color of each pixel can be determined using one of several schemes.
Then each pixel can be encoded as a set of 3 integers:
so that each image becomes THREE 2-dimensional matrices (one matrix for Luminance values, one for Cb values, etc.)....
...which may NOT necessarily be of identical size. Cb and Cr may not be sampled for each pixel (but rather "subsampled"):
For example, most humans don't hear much sound composed of waves varying more than 20,000 times per second...or 20KHz.
Consequently, sampling sounds at around 44,000 times a second should capture with full fidelity, highest quality.
In the video realm, quality is affected by the number of pixels into which each image is separated, or the video "resolution," usually specified as a rectangular collection of pixels. For example, an NTSC video image is considered to have a quality roughly equivalent to 680 by 483 pixels, while film used in the movie industry is considered to have a quality better than about 4000x2000 pixels (Super High Definition Television).
However, things are not that simple. For example, the eye is more sensitive to the luminance than to color components of a pixel, so encoding luminance values at full resolution, and color components at a lower resolution can produce a better image than might be expected. (Hence: the use of subsampled YUV, rather than RGB.)
Raw NTSC 168 Mbps
Raw PAL 216 Mbps
Raw HDTV 1 to 1.5 Gbps
which are clearly too high for most networks.
One Common Intermediate Format (CIF) can produce a video stream composed of 352x240 (or 288 for PAL) images, which are considered to be roughly equivalent the quality captured by Video Cassette Recorders, which is less than native NTSC quality.
So if you are only sending small CIF images, what's the problem?
Well, let's say that each pixel is represented by 3 values (Y, Cr, Cb), and that you are using 4:2:2 subsampling.
You must encode luminance values for 352 times 240 = 84,480 pixels, plus two smaller color matrices of 176 times 120 = 21,120 pixels, for a total of 126,720 pixels.
If each value is encoded in a single byte, then each image will require 380,160 bytes.
So, if we were to attempt a frame rate of 30 frames per second, we would need about 11.4 MBytes per second (MBps), or 91 Mbits per second (Mbps) bandwidth....
and, then, there's sound.....
A single device called a "codec" may be employed to provide this service.
(Of course you might wonder why the information wasn't just created in a compact form to begin with, and the answer to that is that cards that can convert analog to digital video signals may well have enough to do on their own, and may also wish to support the use of alternative encoding schemes without biasing the choices.)
Note, however, that the original image quality is not always preserved by the compression and decompression schemes employed. In fact, quality may be purposely sacrificed to meet other constraints.
There is a continuous, dynamic trade-off between stream quality, computer/codec speed, and available network bandwidth.
To get the aforementioned CIF image to transmit over a 768 Kbps circuit we need about a "compression ratio" of around 100 to 1....while retaining "adequate image quality".
To display such a stream over a display adapter connected through a 528 MBps AGP bus, or even a 33MBps PCI bus, should be possible provided the CPU clock rate is high enough.
As we will see later, the compression process is CPU-intensive, and both network bandwidth and CPU speed are likely candidates for forcing quality reduction to maintain near real-time communications.
The example uses a famous image, often used in discussions of compression (copyright 1972, Playboy International):
Here is a grey level version of the image, similar to the luminance matrix described above:
Now use the top-most row and left-most column as starting values and imagine predicting the value of any other pixel by assuming it will be the same as the value of the pixel to its left + the value of the pixel above it divided by 2. (To do this, you have to start by calculating the value of the second element in the second row, and work out from there.)
Then subtract each predicted value from the corresponding actual value to build a matrix of differences or discrepancies (aka "residuals").
This process will generate both positive and negative values, so to get a visual approximation of the result, we can shift all values into positive territory by adding 128 to each value. Having done that, 0 values will be represented by a medium-dark grey, and previously negative values will appear lighter.
Here is a rendition of the matrix of differences generated by this process (copyright McGraw-Hill, 2001):
This process has separated the original data into a set of predicted values along with a set of difference/discrepancy values.
The difference value information seems to represent object edges, while predicted information seems to better represent surfaces.
Another way to think about this is that the prediction process identifies "correlated" or "autocorrelated" pixels within the image, while the difference matrix better identifies "edges,"
Note that the range of difference values is usually smaller than the range of actual values, AND the distribution of difference values is usually very uneven. (Many are close to 0 in this example).
Streams of such values can usually be sent in fewer bytes when encoded in a suitable fashion.
In fact, the E values suggest the first image could be encoded using 8 bits per pixel, while the second could be encoded using only 5 bits per pixel.
Some protocols offer a predetermined set of prediction algorithms from which the encoder may select the most effective, and identify with a single integer value.
VLE schemes can yield compression when the distribution of values to be encoded is very uneven. That is, some values occur frequently; some occur infrequently.
For example, look at this table (from Friedman) of the frequencies of occurence of letters in English text (not counting spaces which occur with a 19% frequency among the set of letters and space):
Frequency (%): 13 9 8 8 7 7 7 6 6 4 4 3 3 3 3 2 2 2 1 1 1 - - - - - Letter: E T A O N I R S H L D C U P F M W Y B G V K Q X J Z
Samuel Morse used this kind of information to define Morse code. It made the code easier to use and shortened the encoded string of "dots" and "dashes".
"E" is coded as a single "dot", which is to say that the most frequently used letter was encoded using the shortest code value.
"T" is encoded as a single "dash", etc.
The less frequent letters end up being coded using longer sequences:
"Q", "X", "J", and "Z" are all encoded as four-part sequences (c.f., "Q" is "dash, dash, dot, dash").
Encoding picture data values is more complicated, but the idea is the same; examine the picture values to determine which are most frequent, and encode them with the shortest codes.
A modified "Huffman" scheme is usually used.
It turns out that this approach can sometimes achieve a lossless 2:1 compression ratio.
However, 2:1 compression is not good enough for most video applications and other methods were developed to achieve better results.
The take-home message is that compression schemes often attempt to predict data values using some simple scheme, and then transmit the predictions and the discrepancies separately using encoding methods appropriate for each type of information.
To reduce the stream size, the difference data values could be placed into some number of categories:
Values -5 to +5 could be placed in category 1,
Values 6 to 15 into category 2,
etc.
and the category for each data value would be sent in place of the actual value along with a table of category mean values.
The resulting stream would be smaller than the stream generated for the uncategorized data, and the reconstructed image would be only "relatively similar" to the original.
The quality of the result will usually decline as the number of categories decreases or the size of the category intervals increases.
This process injects "loss" into the compression process.
Note that the size/number of categories can be adjusted on-the-fly to match the video stream to the available bandwidth during periods of congestion or partial failure.
There exist numerous compression schemes, many of which have been defined as standards by one or more standards groups (sometimes simultaneously).
These standards typically have different target applications and different performance goals.
Note that MPEG-7 and MPEG-21 have different functions than the standards shown above. MPEG-7 deals with metadata and 21 with comprehensive archive and access systems.
It is difficult to arrange these standards into some hierarchical order on the basis of coding efficiency or product quality. However....
To compress an image, a codec will usually remove "spatial" and "temporal" redundancies.
To remove spatial redundancy, codecs will:
This activity is (usually) performed on "blocks" of 8x8 pixels, and such blocks are sometimes aggregated into "macroblocks" of 4 blocks and/or "slices" for other purposes.
Given a matrix f(x,y) of dimension N by N, the Discrete Cosine Transform will generate another N by N matrix, F(u,v), as follows:
N-1 N-1
F(u,v) = ( 2 / N) * C(u) * C(v) * SUM( SUM( f(x,y) * cos( a ) * cos( b ) ) )
x=0 y=0
where a = ( ( 2x + 1 ) * u * pi ) / 2N
b = ( ( 2y + 1 ) * v * pi ) / 2N
C(u) = 1 / sqrt( 2 ) if u = 0 else C(u) = 1
C(v) = 1 / sqrt( 2 ) if v = 0 else C(v) = 1
Here is a history of the changes experienced by an example block of luminance values taken from Symes:
70 72 70 70 72 68 68 64
103 101 103 100 99 97 94 94
132 132 132 130 129 129 125 121
157 157 155 154 153 150 148 145
168 163 164 162 163 161 161 156
172 170 165 166 163 163 162 158
174 170 167 167 164 163 164 159
174 173 170 167 167 166 166 160
This block can be transformed via DCT into the next block:
86 2 -3 6 -2 2 -2 2
-247 -4 -5 -3 0 -3 1 1
-117 -1 1 -1 -1 -1 -1 0
-40 -2 2 1 2 1 -2 0
-7 -2 -1 1 0 -1 -1 2
-6 1 0 0 0 0 -2 -1
-4 -1 -1 1 -2 -1 -1 -1
-3 -3 -1 1 0 1 1 1
which has concentrated information from the entire matrix into the upper left-hand corner.
In particular, the upper-left-most DCT value represents an average block value (actually 8 times the mean value).
It is called the "DC coefficient"; the other values are "AC coefficients".
The DC coefficient must be sent with higher precision, so it is not included in the stream of values to be quantized, run-length encoded, and then variable-length encoded.
Instead, the DC coefficients from all blocks may be combined and treated as a separate block to which the DCT transformation and quantizing and encoding steps are applied prior to transmission. (See the Tudor reference below for details on the DCT transform.)
5 2 0 0 0 0 0 0
-21 0 0 0 0 0 0 0
-8 0 0 0 0 0 0 0
-3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
The entropies of these 3 blocks are 5.04, 1.42, and .30 (so that these matrices COULD be encoded losslessly in 6, 2, and 1 bits per pixel or less).
The example can be encoded at .69 bits per pixel (rather than the original 8 bits per pixel) using a modified Huffman scheme that results in a compression ratio of 11.6 to 1.
To further increase the compression ratio, the quantization could be done using more aggressive quantization tables, and/or....
The basic idea is that up to this point a frame has been thought of as a matrix of 8x8 pixel blocks, encoded using the DCT process. However, it often occurs that a block in a frame being encoded is quite similar to a block in another frame, either before the current frame or after it.
As an example, consider the following sequence of 3 frames. If you want to encode the center, or "Current" frame, you can find all of the visual content in either the "Previous" frame or the "Next" frame.
This brings up the ideas of
To implement this approach there exist several frame types:
Typically a Group of Pictures (GOP) will be an I frame, followed by several sets composed of....one P frame followed by several B frames.
Caveat: This process is actually quite a bit more complicated. For example, frames are thought of as a matrix of 16x16 pixel blocks ("macroblocks"), each of which contains the DCT-encoded data for 4 8x8 blocks (along with the smaller chroma blocks) that will be used to create such an image, pointers to similar blocks elsewhere in the data stream, DCT-encoded data using similar blocks elsewhere in the data stream, or some combination of these.
See Symes for details.
The first problem is "How can you tell how similar two visual regions are?" The obvious approach is to develop a formula that compares pixel values at the same relative positions within each block. Two popular approaches are to compute the:
Mean Square Error is calculated by calculating the mean value of...the list of values generated by squaring the difference between corresponding values in each block.
The two most similar regions should have the smallest MSE or MAD.
The next problem is "Which blocks do you compare?" The idea is to start with the block in the same relative location as the block being encoded, and then try examine block-size regions within some number of pixels from the center of that block, say 75, plus or minus and up or down (although horizontal moves are more common, so it is probably better to have a wider than taller search area).
The next diagram shows how an encoder could search for a segment of a subsequent ("Next") frame that is similar to a block to be encoded within the "Current" frame, outlined in green (after Li).
The search begins with the block in the same position as the reference block, as described above, and proceeds in this example by moving the comparison region (shown as a red square) to the right some number of pixels at each step. Here, the step size is several pixels.
There exist alternate strategies for searching the area: full search, two-dimensional logarithmic search, etc. In addition, some compression standards (c.f., MPEG-2, MPEG-4) allow partial-pixel step sizes, or "resolution," during these searches. Since pixels are discrete, this requires interpolation of the pixel values between two known pixels.
Note that you don't necessarily want an exact match. Taking a close match can be more useful than having to recode the whole block.
In fact, it is often more economical to send residuals against an inexact match than to DCT encode the current block.
Compression technique Compression ratio
I frames only 7:1
I and P frames 20:1
I, P and B frames 50:1
Clearly, the INTER-block similarity searches are effective (keeping in mind that YOUR mileage will vary...A LOT).
Note, however, that these methods are also CPU and possibly memory intensive.
The datarate of the coding algorithm was designed to be set between 40 Kbits/s and 2 Mbits/s." (Peter Cherriman)
H.261 utilizes the generic compression toolset presented above, though in a limited fashion. For example, it will only perform forward prediction, supports only two resolutions:
and is optimized for images similar to the ones sent during video conferencing.
H.261 was designed for non-interlaced image streams.
H.263 supports five resolutions:
The support for 4CIF and 16CIF means the codec COULD compete with other higher bitrate video coding standards such as the MPEG standards."
H.263 will process interlaced images?
"The original H.263 standard included four optional coding modes." (Richardson) Codecs described as H.263v2 and H.263v3 (aka + and ++) support some or all of the optional modes.
It is assumed, that one big egg-shaped object is moving before a still background. This egg mainly varies at the borders and in the lower third. If the real content of the video stream matches this assumption, the compression of H.263 is better than H.261.
In other words: A single person sitting before the cam takes advantage, if the illumination is good enough for avoiding cam noise flickering on the background."
MPEG standards were designed to be easy to decode but rather more difficult to encode. (They are "asymmetric".) For example, one might be willing to spend a lot of energy encoding a movie to be placed on a VCD or DVD to be decoded by millions of decoders.
Such codecs may not be suitable for interactive applications, but then again, chip speeds have increased so quickly over the past 10 years, that MPEG-2 IS, in fact, now commonly used for conferencing applications.
The MPEG-1 standard defines video, audio, and "system" components. MPEG-1 Audio Layer 3 is aka "MP3".
MPEG-1 encoders produce Common Source Intermediate Format (CSIF aka SIF) streams (from either NTSC or PAL) input signals, typically at 352x240 resolution at 30 fps with 4:2:0 subsampling, but can produce an image of up to 4095 by 4095 pixels. (However transmission is limited by the number of "macroblocks".)
Image stream quality encoded to be transmitted at around 1.2 Mbps is considered similar to the quality yielded by VHS recording technology,
To meet the varying needs of a range of devices, MPEG-2 may be thought of as a collection of substandards, or "profiles" that employ different encoding techniques and operate at different rates and/or efficiencies.
Here is a description of SOME of the MPEG-2 profiles with SOME of their "levels":
Level\Profile Simple Main SNR High
------------- ------ ----- --- -------
Low 4:2:0 4:2:0
352x288 352x288
4 Mbps 4 Mbps
I,P,B I,P,B
Main 4:2:0 4:2:0 4:2:0 4:2:0,4:2:2
720x576 720x576 720x576 720x576
15 Mbps 15 Mbps 15 Mbps 20Mbps
I,P I,P,B I,P,B I,P,B
High 4:2:0 4:2:0,4:2:2
1920x1152 1920x1152
80 Mbps 100 Mbps
I,P,B I,P,B
For example, MPEG-1 is a proper subset of MPEG-2 known as "Main Profile Low Level" or "MP@LL" for short.
Note that both MPEG-1 and MPEG-2 support I, P, and B frames.
MPEG-2 extensions include:
MPEG-1 streams use up to 1.5 Mbps (your average T1).
MPEG-2 is typically used to move 720 by 480 images, but can be used to move 1920 by 1080 (interlaced as well as progressive) images when fed a High Definition input video signal (2200 horizontal samples by 1125 scan lines).
MPEG-2 is used by cable and satellite systems, generating:
MPEG-4 Part 2 (aka Visual) was designed to support multimedia; it adds "object" constructs to the video transmission process. (e.g., layers, sprites, non-rectangular image dimensions, etc.)
MPEG-4 Visual was designed to give DVD level quality at lower bit rates and to provide "scalable" quality to serve a wide range of receiving devices.
For such a range of goals, Visual defines 19 profiles grouped in categories:
MPEG-4/AVC makes improvements to low bandwidth encoding as well as high-quality encoding. (The "low bandwidths" referred to are pretty much below what Kan-ed is concerned with for backbone-connected equipment, but may impact video transmission over telephones and home connections using QuickTime or other streaming servers.)
There are 3 profiles:
Note that compression efficiency appears to be compared on the basis of how much bandwidth is required to send the same image using different codecs. For example, MPEG-4 may be able to send an image in half the bandwidth required by H.263. This suggests that MPEG-4 could send an image of 1.414 times the height (in pixels) and 1.414 times the width (in pixels) of the original image at the original bandwidth.
MPEG-4/AVC aims to improve MPEG-2 and MPEG-4 Part 2 compression by a factor of 2, via such modifications as:
But remember, this compression improvement comes at a cost. One estimate has it that MPEG-4 encoders require 8 times more processing power and decoders are 2.5 to 4 times more complex. (Malak, 2005)
Fidelity Range Extensions (FRExt) added 4 new profiles that support up to 12 bit color sampling and 4:4:4 color encoding, etc. making the standard useful for high-quality video production work
Compressed video streams must be enclosed within some transport protocol that will allow users to establish connections, transmit data, coordinate data delivery in support of synchronization, etc.
The workhorses of such transport are:
RTP provides an interface to UDP (the User Datagram Protocol), and thence IP, for transmission over the Internet.
RTP does not address resource reservation and does not guarantee quality-of-service for real time services.
Applications in the Access Grid, vic and rat, interact directly with RTP.
MPEG-1 and MPEG-2 System Layers (which determine how MPEG packets are transmitted) also interact directly with RTP.
MPEG-4 Part 8 defines transmission over networks.
H.323 conferencing applications also use RTP, as well as a variety of other protocols to manage communications sessions.
Here are some excellent reference materials that have
produced both background and content for this talk and will
certainly help the motivated reader continue to appreciate
this topic:
For more information please contact Michael Grobe at
785-817-2992 or grobe@ku.edu.
Additional information
Video compression is a complex
area of study. Techniques have been explored and implemented
by multiple instaurations of multiple committees, a small army
of grad students, and engineers working in a variety of
industries. Many appear to have made a dent in a large and
complex sculpture that is difficult to describe. (....a
manifestation of post-modern eclecticism in an age of
technological optimism?)