"The Reality About Pixels"

A Taxonomy of the 4:2:2
Digital Video Format

When someone says that they have 4:2:2 video, they are describing not so much a data format as a bandwidth specification. Originally, the 4:2:2 designation was used to indicate that the luminance channel had a bandwidth of approximately 4 MHz, and each of the chrominance channels had a bandwidth of approximately 2 MHz. Nowadays, with myriad resolutions of video, 4:2:2 is used to indicate relative bandwidth, i.e. that chrominance has half the bandwidth of luminance.

When we digitize video and serialize it into a byte stream, a number of choices present themselves:

In which direction is each frame/field scanned (left/right, up/down)?
Where are the samples taken spatially (chrominance relative to luminance)?
How many bits are used for each component sample?
What kind of quantization is used?
What numeric representation is used for the samples?
What values are used for black, white, full red, full blue?
How is each sample converted into a byte stream?
In what order should the components appear?

Additionally, we can have transparency (usually sampled at the same rate as luminance, thus called 4:2:2:4), derived from chroma keying or synthetic imagery. This type of video has additional considerations:

Does alpha represent opacity or transparency?
Are the luminance and chrominance premultiplied by alpha or not?
In what order in the byte stream does the alpha sample appear relative to the luminance and chrominance samples?

We would like to give symbols to each of these choices, so that we can have a compact and accurate code that describes the video. For example, one of the most prevalent 4:2:2 formats might be called:

4:2:2 CB0DVO8N

where

the ordering of the components is {Cb,Y,Cr,Y},
chrominance is sampled with 0 relative horizontal phase from luminance,
the image is scanned downward, from top to bottom,
the range is that of standard video, with extra headroom and footroom to accommodate overshoot,
the chrominance is represented in offset-binary fixed point,
the video is quantized to 8 bits per component,
the quantization is linear (a.k.a. uniform).

These choices are enumerated in the table below, with explanations following.

Ordering of Luma and Chroma	luma first (Y)	chroma first (C)
Ordering of Chroma	Cb first (B)	Cr first (R)
Horizontal Sampling	zero phase (0)	half phase (H)
Vertical Scan Direction	top down (D)	bottom up (U)
Range	video (V)	wide (W)	full (F)
Numeric Format	two's complement fixed point (C)	sign+magnitude fixed point (M)	offset binary fixed point (O)	floating point (F)
Quantization Coarseness (bits)	8 (8)	10 (10)	12 (12)	14 (14)	16 (16)	32 (32)
Quantization Distribution	linear (N)	logarithmic (G)	custom (C)
Byte Packing	little-endian (L)	big-endian (B)
Padding	top (T)	bottom (B)

Alpha Polarity	opacity (O)	transparency (T)
Alpha Premultiplication	straight (S)	premultiplied (P)
Alpha Ordering	first (F)	last (L)

Ordering of Luma and Chroma (2 variations, 1 bit)

The 4:2:2 format usually has the luminance and chrominance samples interleaved, as either YCYC or CYCY, where the former has luminance first, and the latter has chrominance first.

Ordering of Chroma

There are two dimensions to the chrominance samples: Cb and Cr, and either Cb appears first or Cr appears first in the stream.
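To make the two ordering choices concrete, here is a minimal sketch that splits an 8-bit interleaved stream into its components. The function name deinterleave_422 and its two boolean parameters are our own illustration, not part of any standard API, and the sketch assumes one byte per sample with no alpha.

```python
def deinterleave_422(data: bytes, luma_first: bool, cb_first: bool):
    """Split an 8-bit interleaved 4:2:2 stream into (Y, Cb, Cr) lists.

    Every group of 4 bytes carries two luma samples and one Cb/Cr pair.
    """
    if len(data) % 4 != 0:
        raise ValueError("stream length must be a multiple of 4 bytes")
    # Luma sits at even offsets for YCYC and at odd offsets for CYCY.
    luma_offset = 0 if luma_first else 1
    chroma_offset = 1 - luma_offset
    y = list(data[luma_offset::2])
    first = list(data[chroma_offset::4])        # first chroma sample of each group
    second = list(data[chroma_offset + 2::4])   # second chroma sample of each group
    cb, cr = (first, second) if cb_first else (second, first)
    return y, cb, cr
```

For a {Cb,Y,Cr,Y} stream, the call would be deinterleave_422(data, luma_first=False, cb_first=True).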

Horizontal Sampling

The chrominance is sampled only half as frequently as the luminance, and is usually sampled coincident with the first luminance sample (zero phase) or halfway between the first two luminance samples, as shown in the following table as "0" and "1/2".

Luminance Samples	Y0		Y1
Chrominance Relative Phase	0	1/2	1	3/2

Luckily, there seem to be no systems that locate the chrominance phase at 1 or 3/2, so we only have two phases to be concerned with in practice.
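As a small illustration of the two phases, the following sketch lists where each chrominance sample falls, measured in units of the luminance sample spacing. The helper chroma_positions is hypothetical and exists only for this example.

```python
def chroma_positions(num_chroma: int, half_phase: bool) -> list:
    """Horizontal position of each chroma sample, in luma-sample units.

    Chroma is sampled half as often as luma, so consecutive chroma
    samples are 2 luma samples apart; the phase shifts all of them.
    """
    phase = 0.5 if half_phase else 0.0
    return [2 * k + phase for k in range(num_chroma)]
```

With zero phase the first three chroma samples sit at 0, 2 and 4; with half phase they sit at 0.5, 2.5 and 4.5.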

Vertical Scan Direction

Most systems scan from the top down, but there are a few that scan from the bottom up.

Range

The standard "video" range includes headroom and footroom to accommodate overshoot. In a computer, the range is sometimes stretched to yield higher resolution, saturating the overshoots in the process. There are two such scales, the "wide" range and the "full" range. The symmetric wide range is preferred over the asymmetric full range.

	Luminance		Chrominance
	Min	Max	Min	Max
Video	16	235	-112	+112
Wide	0	255	-127	+127
Full	0	255	-128	+127
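The stretch from video range to wide range can be sketched as a linear rescale followed by saturation. This is one plausible mapping derived from the limits in the table, not a formula prescribed here; the function name video_to_wide is our own, and the chrominance is assumed to be already expressed as a signed value.

```python
def video_to_wide(y: int, c: int):
    """Map video-range luma (16..235) and signed chroma (-112..+112)
    onto the wide range (0..255 and -127..+127)."""
    y_wide = round((y - 16) * 255 / 219)   # 219 = 235 - 16
    c_wide = round(c * 127 / 112)
    # Overshoots beyond the nominal video limits are saturated.
    y_wide = max(0, min(255, y_wide))
    c_wide = max(-127, min(127, c_wide))
    return y_wide, c_wide
```

Values inside the headroom or footroom, such as a luminance of 240, land outside 0..255 after scaling and are clipped, which is the saturation mentioned above.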

Numeric Format

Numbers are represented either in fixed-point or floating-point.

Fixed-point numbers can be either two's complement, sign and magnitude, or offset binary. Offset binary is the most common in external interfaces. Two's complement is convenient when doing computations. I don't know of any system that uses sign and magnitude.
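The three fixed-point representations differ only in how a signed value maps to a bit pattern. The sketch below encodes a signed chrominance value using the C, M and O symbols from the table; encode_chroma is a hypothetical helper written for this illustration.

```python
def encode_chroma(value: int, fmt: str, bits: int = 8) -> int:
    """Encode a signed sample as an unsigned bit pattern.

    fmt: 'C' two's complement, 'M' sign+magnitude, 'O' offset binary.
    """
    half = 1 << (bits - 1)
    mask = (1 << bits) - 1
    if fmt in ("C", "O"):
        if not -half <= value < half:
            raise ValueError("value out of range")
        # Offset binary adds half the range; two's complement wraps modulo 2**bits.
        return value + half if fmt == "O" else value & mask
    if fmt == "M":
        if abs(value) >= half:
            raise ValueError("value out of range")
        # The top bit is the sign; the remaining bits hold the magnitude.
        return (half | -value) if value < 0 else value
    raise ValueError("unknown numeric format: " + fmt)
```

Note that the offset-binary and two's-complement patterns differ only in the top bit, which is why converting between them is a single exclusive-or with 0x80 for 8-bit samples.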

Although the above subcategorization can be used for floating-point as well, typically all floating-point samples are represented using sign and magnitude, as specified in the ubiquitously implemented IEEE-754 specification.

Quantization Coarseness

Typically, 8, 10, 12, 14, and 16 bit samples are fixed-point, and 32 bit samples are floating-point, though alternatives are not prohibited. In particular, a 16-bit floating-point format has been popularized by OpenEXR.

Quantization Distribution

Typically, the video signals are uniformly quantized, so that the encoding is a linear function of the video signal. Occasionally, though, the encoding is a logarithmic function of the video signal. A custom quantization requires a supplementary table mapping the encoded values to video signal strength.

Byte Packing

When samples are greater than 8 bits (1 byte), there is the question of how the multiple bytes will be arranged in a stream. If they are arranged with the most significant byte first, followed by the next most significant byte, etc., the packing is called big-endian. If they are arranged with the least significant byte first, followed by the next least significant byte, etc., the packing is called little-endian. Other packing schemes are possible when there are more than 2 bytes, but such schemes have virtually disappeared from practice.
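For a sample stored in two bytes, the two packings can be demonstrated with Python's standard struct module. The wrapper pack_sample16 is our own name, and the sketch assumes an unsigned sample that fits in 16 bits.

```python
import struct

def pack_sample16(sample: int, big_endian: bool) -> bytes:
    """Serialize an unsigned 16-bit sample in the requested byte order."""
    fmt = ">H" if big_endian else "<H"   # '>' big-endian, '<' little-endian
    return struct.pack(fmt, sample)
```

The sample 0x0123 is written as the bytes 01 23 in big-endian packing and as 23 01 in little-endian packing.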

Padding

If the size of a video sample is not an integral multiple of 8 bits, then there are unused bits, assuming that only the bits of a single sample are to be found in a particular byte (though this may not remain true). Zeros are usually stuffed into the unused bits, which can appear either at the top end of a byte or the bottom end.
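The two padding choices amount to whether the sample is shifted within its container. The sketch below assumes that each sample occupies the smallest whole number of bytes that holds it (for example, 10 bits in a 16-bit word); pad_sample is a hypothetical helper.

```python
def pad_sample(sample: int, bits: int, pad_top: bool) -> int:
    """Place a `bits`-wide unsigned sample into a whole-byte container.

    pad_top=True  leaves the zero padding in the most significant bits.
    pad_top=False leaves the zero padding in the least significant bits.
    """
    if not 0 <= sample < (1 << bits):
        raise ValueError("sample does not fit in the given bit width")
    container_bits = -(-bits // 8) * 8   # round up to a whole number of bytes
    unused = container_bits - bits
    return sample if pad_top else sample << unused
```

A full-scale 10-bit sample (0x3FF) becomes 0x03FF with top padding and 0xFFC0 with bottom padding.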

Alpha Polarity

Usually alpha is used to specify the opacity of a pixel; however, sometimes the polarity is reversed and it is used to indicate transparency instead.

Alpha Premultiplication

Since alpha is almost always used to scale the color components during composition, it is advantageous to pre-multiply each of the color components by alpha, to avoid multiplication during composition. If the color components are not so scaled, it is called straight alpha, and if the components are scaled, it is called premultiplied alpha. This is misleading terminology, though, because the color components are scaled, not the alpha.
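The saving can be seen by writing out the usual "over" blend both ways. This sketch assumes normalized component values in [0, 1] with black at 0 and alpha representing opacity; the function names are our own.

```python
def premultiply(fg: float, alpha: float) -> float:
    """Scale a straight color component by its alpha."""
    return fg * alpha

def over_straight(fg: float, alpha: float, bg: float) -> float:
    """Composite a straight-alpha foreground over a background."""
    return alpha * fg + (1.0 - alpha) * bg

def over_premultiplied(fg_premult: float, alpha: float, bg: float) -> float:
    """Composite a premultiplied foreground: one multiplication fewer."""
    return fg_premult + (1.0 - alpha) * bg
```

Both paths give the same result, but with premultiplied components the foreground multiplication has already been done once, ahead of time, instead of at every composite.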

Alpha Ordering

Alpha is either placed before the luminance or after it.

4:2:2 Format Encoding Specification

For 4:2:2, we have the choices:
{Y,C} {B,R} {0,H} {D,U} {V,W,F} {C,M,O,F} {8,10,12,14,16,32} {N,G,C} {B,L} {T,B}

For 4:2:2:4, we have three more fields:
{Y,C} {B,R} {0,H} {D,U} {V,W,F} {C,M,O,F} {8,10,12,14,16,32} {N,G,C} {B,L} {T,B} - {O,T} {S,P} {F,L}

We've placed the endian and padding symbols at the end so they can be dropped if not needed.
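A code of this form is easy to parse mechanically. The following is a minimal sketch using a regular expression; the function parse_422_code and the field names are our own. It assumes that the padding symbol may appear only when the endian symbol is present (otherwise a trailing "B" would be ambiguous), and that the alpha suffix appears exactly when the prefix is 4:2:2:4.

```python
import re

_CODE = re.compile(
    r"^(?P<luma_chroma>[YC])(?P<chroma>[BR])(?P<phase>[0H])(?P<scan>[DU])"
    r"(?P<range>[VWF])(?P<numeric>[CMOF])(?P<bits>8|10|12|14|16|32)"
    r"(?P<distribution>[NGC])"
    r"(?:(?P<endian>[BL])(?P<padding>[TB])?)?"      # optional trailing symbols
    r"(?:-(?P<polarity>[OT])(?P<premult>[SP])(?P<alpha_order>[FL]))?$"
)

def parse_422_code(text: str) -> dict:
    """Parse e.g. '4:2:2 CB0DVO8N' or '4:2:2:4 CB0DVO8N-OSF' into fields."""
    prefix, _, code = text.strip().partition(" ")
    if prefix not in ("4:2:2", "4:2:2:4"):
        raise ValueError("unknown prefix: " + prefix)
    match = _CODE.match(code)
    if match is None:
        raise ValueError("malformed format code: " + code)
    fields = match.groupdict()   # omitted optional symbols come back as None
    fields["bits"] = int(fields["bits"])
    has_alpha = fields["polarity"] is not None
    if has_alpha != (prefix == "4:2:2:4"):
        raise ValueError("alpha suffix must accompany 4:2:2:4, and only 4:2:2:4")
    return fields
```

For "4:2:2 CB0DVO8N" the result has luma_chroma "C", chroma "B", phase "0", scan "D", range "V", numeric "O", bits 8 and distribution "N", with the dropped endian and padding fields set to None.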

Examples

Probably the most prevalent format is {Cb,Y,Cr,Y}, 0-hPhase, down-scan, video-range, offset-binary, 8-bit, linear:
4:2:2 CB0DVO8N

If there is alpha, a 3-symbol code can be appended:
4:2:2:4 CB0DVO8N-OSF
