
Systematically Learning H.264 Video Encoding (03): The Data Compression Process and Related Concepts

2024-07-12


Note: This series uses the Golden Circle learning method (why -> what) to study audio and video encoding. The articles focus on building a knowledge system and practical skills for video encoding; on the theory side, they dwell on where audio/video concepts come from and how they connect, so that you know not only the facts but also the reasons behind them.

In this article we study the H.264 data compression process and its related concepts. H.264 compression has exactly one goal: to compress, again and again, reducing the size of the video data while preserving image quality as far as possible. Keep this in mind before diving in: the process involves many concepts and methods, but every one of them serves that single purpose of compression.

1 Interpretation of the H.264 Data Compression Process

The H.264 data compression process can be summarized in the following key steps: macroblock and sub-block division -> frame grouping -> frame prediction -> integer discrete cosine transform (DCT) -> CABAC compression. Each step is explained below.

1.1 Macroblock and Sub-block Division

Macroblock: when a video frame reaches the H.264 encoder's buffer, the encoder divides the picture into macroblocks. H.264 uses a 16x16-pixel area as the default macroblock (H.265 uses a 64x64-pixel area). The effect of dividing one frame of video into macroblocks is shown below:

[Figure: the overall effect after macroblock division]

In the figure above, each 16x16 macroblock can be further divided into smaller sub-blocks of 8x16, 16x8, 8x8, 4x8, 8x4, or 4x4 pixels, which is very flexible. The point of this finer division is to record detailed regions with as little data as possible. Once the macroblocks are divided, all the pictures in the H.264 encoder's buffer can be grouped.

Macroblock and sub-block division lets the encoder analyze and process video content more precisely and compress it more efficiently: frames are divided into 16x16-pixel macroblocks, which are further split into smaller blocks such as 8x8, 8x4, and 4x8 to match the complexity of the image content, as the sketch below illustrates.
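
To make the division concrete, here is a minimal Python sketch. The frame array and function name are illustrative assumptions, not part of any encoder API, and real encoders pad frames whose dimensions are not multiples of 16:

```python
import numpy as np

MB_SIZE = 16  # H.264 macroblock edge; H.265 uses 64x64 units by default

def split_into_macroblocks(frame):
    """Yield (y, x, block) for each 16x16 macroblock of a frame."""
    height, width = frame.shape
    for y in range(0, height, MB_SIZE):
        for x in range(0, width, MB_SIZE):
            yield y, x, frame[y:y + MB_SIZE, x:x + MB_SIZE]

# A 720p luma plane divides evenly: (720 / 16) * (1280 / 16) = 3600 blocks
frame = np.random.randint(0, 256, size=(720, 1280), dtype=np.uint8)
print(sum(1 for _ in split_into_macroblocks(frame)))  # 3600
```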

1.2 Frame Grouping

Frame grouping (the GOP structure) is decided before intra-frame and inter-frame prediction. It is one of the first steps of video coding and determines how video frames are organized and in which order they are encoded. Its main purpose is to reduce data redundancy, which falls into two categories:

  • Temporal redundancy, which is removed by inter-frame prediction, accounts for the larger share. Even if the camera captures 30 frames per second, most of those 30 frames are closely related, and often dozens of consecutive frames are. For a run of closely related frames we only need to store the data of one key frame (the I-frame); the other frames (B-frames and P-frames) can all be predicted from it by fixed rules, so temporal redundancy dominates video data.
  • Spatial redundancy, which is removed by intra-frame prediction, accounts for a comparatively small share.

H.264 frame grouping proceeds in two steps:

  1. Two adjacent frames are taken out at a time and compared macroblock by macroblock to measure their similarity.
  2. Two frames are judged similar when the proportion of differing pixels is within 10%, the brightness difference does not exceed 2%, and the chroma difference is within 1%; such frames can be placed in the same group (see the sketch after this list).
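
A hedged sketch of that similarity test, assuming per-plane numpy arrays. The three thresholds come from the text; everything else (names, normalization by 255) is an illustrative assumption:

```python
import numpy as np

def frames_are_similar(y1, y2, c1, c2):
    """Apply the three thresholds from the text to luma (y) and chroma (c) planes."""
    differing = np.mean(y1 != y2)          # fraction of pixels that changed
    luma_diff = np.mean(np.abs(y1.astype(float) - y2.astype(float))) / 255
    chroma_diff = np.mean(np.abs(c1.astype(float) - c2.astype(float))) / 255
    return differing <= 0.10 and luma_diff <= 0.02 and chroma_diff <= 0.01
```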

Once grouping is complete, frame prediction and compression can begin.

1.3 Frame Prediction Compression

This part covers intra-frame prediction and inter-frame prediction. Intra-frame prediction compresses a frame as a still image: the pixel values of each block are predicted using one of several directional modes (nine modes for 4x4 luma blocks), and the residual against the original image is computed. Inter-frame prediction (motion estimation and motion compensation) applies to P-frames and B-frames: motion estimation finds the motion vectors relative to one or two reference frames, motion compensation then generates a predicted image, and the residual against the current frame is computed.

Note: the order of intra and inter prediction depends on the type of the coded frame and on the coding strategy. In a real video encoder these steps are part of one coding loop and are handled automatically; there is no rule that one must always run before the other.

1.3.1 Intra-frame Prediction

Intra-frame prediction compression solves the problem of spatial redundancy. Spatially redundant data, fine variations in color and brightness, is largely information the human eye is not sensitive to, so encoding can compress it away. Intra prediction builds on the fact that the eye is very sensitive to low-frequency brightness but much less sensitive to high-frequency brightness. (Intuitively: for a 16x16 macroblock, H.264 predicts the remaining pixels from the 16 pixels along the top edge and the 16 along the left edge; information that originally needed 16x16 = 256 pixels can then be represented by 16 + 16 - 1 = 31 pixels plus a small residual.)
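
As a minimal sketch of one such mode, here is vertical prediction for a 16x16 block in Python. The array names and values are illustrative; a real encoder tries several modes and keeps the one with the cheapest residual:

```python
import numpy as np

def predict_vertical(top_row):
    """Copy the 16 reconstructed pixels above the block into all 16 rows."""
    return np.tile(top_row, (16, 1))

top_row = np.full(16, 128, dtype=np.int16)            # neighbors above the block
block = np.random.randint(120, 136, (16, 16)).astype(np.int16)
residual = block - predict_vertical(top_row)          # only this (small) residual is coded
```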

1.3.2 Inter-frame Prediction (Motion Estimation and Motion Compensation)

Inter-frame prediction compression (motion estimation and compensation) solves the problem of temporal redundancy. The frames of a video are arranged linearly in time and are strongly correlated, so a great deal of data between frames can be dropped. After compression the frames fall into three types: I-frames, P-frames, and B-frames. (How motion estimation and compensation work: the H.264 encoder takes two frames in sequence from the head of its buffer and scans them macroblock by macroblock. When it finds an object in one picture, it searches a neighboring region, the search window, of the other picture; if the object is found there, its motion vector can be computed. The picture below shows the position of the billiard ball after such a search.)

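Here is a minimal full-search motion estimation sketch in Python using SAD (sum of absolute differences) as the matching cost. The search range, names, and integer-pixel-only search are simplifying assumptions; real encoders use fast search patterns plus sub-pixel refinement:

```python
import numpy as np

def motion_search(ref, cur_block, y0, x0, search=8):
    """Return the (dy, dx) minimizing SAD within a +/-search window around (y0, x0)."""
    h, w = cur_block.shape
    best_mv, best_sad = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0 or y + h > ref.shape[0] or x + w > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            sad = np.abs(ref[y:y + h, x:x + w].astype(int)
                         - cur_block.astype(int)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad
```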

1.4 Integer Discrete Cosine Transform (DCT)

This step applies a DCT-style transform to the prediction residual, converting spatially correlated samples into largely decorrelated frequency-domain coefficients, which are then quantized. (Note: the DCT gathers the important information of the image into a few coefficients; the unimportant frequency-domain regions and coefficients can then simply be cropped away.)
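
H.264 actually uses an integer approximation of the DCT. Below is a sketch of its 4x4 core transform; the matrix C is the standard's core transform matrix, but the flat quantization step stands in for the real scaling and quantization tables, which are omitted. The smooth gradient residual is an illustrative stand-in for a typical prediction residual:

```python
import numpy as np

# 4x4 core transform matrix of the H.264 integer transform
C = np.array([[1,  1,  1,  1],
              [2,  1, -1, -2],
              [1, -1, -1,  1],
              [1, -2,  2, -1]])

def transform_and_quantize(residual4x4, qstep=8):
    coeffs = C @ residual4x4 @ C.T                 # integer, DCT-like transform
    return np.round(coeffs / qstep).astype(int)    # coarse, illustrative quantization

residual = np.add.outer(np.arange(4), np.arange(4))  # smooth gradient residual
print(transform_and_quantize(residual))  # only a few low-frequency coefficients survive
```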

Summary: frame data is compressed by first performing inter-frame and intra-frame prediction, then applying the DCT transform to the residual data to remove its remaining correlation and compress it further.

1.5 CABAC Compression

CABAC (Context-Adaptive Binary Arithmetic Coding) is a lossless compression technique, a form of entropy coding. During coding it assigns short codes to high-frequency (common) data and long codes to low-frequency (rare) data, which makes it more efficient than VLC methods. The quantized coefficients produced by the previous four steps are entropy-coded with CABAC into the final bitstream. The compressed frames are classified as I-frames, P-frames, and B-frames, and once CABAC's lossless compression has run, the video bitstream is obtained.
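
The principle behind arithmetic coding can be shown in a few lines. The toy encoder below uses a fixed probability instead of CABAC's adaptive context models and floats instead of its bit-exact integer arithmetic, so it illustrates the idea only:

```python
def arithmetic_encode(bits, p_zero=0.9):
    """Toy static binary arithmetic coder: shrink [low, high) per symbol."""
    low, high = 0.0, 1.0
    for b in bits:
        mid = low + (high - low) * p_zero
        if b == 0:
            high = mid   # frequent '0' keeps the wide sub-interval: cheap
        else:
            low = mid    # rare '1' keeps the narrow sub-interval: expensive
    return (low + high) / 2  # any number in the interval identifies the input

# Final interval width 0.9**18 * 0.1**2 ~= 0.0015, so roughly 10 bits
# suffice to pin it down, versus 20 bits for the raw symbols.
print(arithmetic_encode([0] * 18 + [1] * 2))
```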

2 Concepts Related to the Data Compression Process

2.1 Data redundancy in time and space

Spatial data redundancy and temporal data redundancy are two basic concepts in video compression, which describe the repeated information within a video frame and between video frames respectively.

Spatial Redundancy

  • Refers to data duplication caused by similarity or correlation between adjacent pixels within a single video frame. Due to the continuity of natural images, adjacent pixels often have similar brightness or color values.
  • Typical examples of spatial redundancy include large areas of solid color, and areas of gradient or slowly changing textures.
  • Video compression algorithms reduce spatial redundancy through spatial prediction, transform coding (such as the DCT), and related techniques, converting the image from the spatial domain to the frequency domain and concentrating its energy in a few coefficients; the sketch below demonstrates this energy compaction.
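
A quick numerical illustration, assuming an 8x8 smooth gradient block (chosen precisely because it is spatially redundant); the orthonormal DCT-II matrix is built by hand so the snippet needs only numpy:

```python
import numpy as np

N = 8
k = np.arange(N)
D = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))  # DCT-II basis
D[0] *= np.sqrt(1 / N)   # orthonormal scaling
D[1:] *= np.sqrt(2 / N)

block = np.add.outer(np.arange(N), np.arange(N)).astype(float)  # smooth gradient
energy = (D @ block @ D.T) ** 2
print(energy[:2, :2].sum() / energy.sum())  # ~0.998: 4 of 64 coefficients dominate
```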

Temporal Redundancy

  • It refers to the similarity between consecutive frames in a video sequence, that is, the same object or scene in consecutive frames has no significant visual changes.
  • Temporal redundancy usually occurs when the camera is stationary or objects in the scene move slowly, and only a small part of the area in the subsequent frames changes compared to the previous frame.
  • Video compression algorithms reduce temporal redundancy through techniques such as inter-frame prediction, motion estimation, and motion compensation, using the correlation between previous and subsequent frames to predict and encode the differences between frames instead of fully encoding each frame.

2.2 Residual Data and Related Concepts

In H.264 video compression, residual data refers to the difference between the original video frame and the predicted frame. To have a deeper understanding of residual data, you need to understand the following concepts, as shown below:

  1. Predicted Frame: In the video encoding process, intra prediction or inter prediction is used to generate predicted frames. Intra prediction is based on the pixel information of the current frame, while inter prediction is based on the motion compensation information of the previous or subsequent frames.

  2. Original frame: Refers to the original image frames actually captured in the video sequence.

  3. Residual calculation: the residual data is obtained by subtracting the predicted frame from the original frame, and it represents the difference between the two, as the short sketch after this list shows.
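
In code the idea is a single subtraction, exactly reversible on the decoder side; the tiny 2x2 arrays below are illustrative:

```python
import numpy as np

original = np.array([[120, 121], [119, 122]], dtype=np.int16)
predicted = np.array([[120, 120], [120, 120]], dtype=np.int16)

residual = original - predicted        # small values: cheap to transform and code
reconstructed = predicted + residual   # decoder side: exact recovery
assert (reconstructed == original).all()
```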

With the foundation of the above concepts, we can further understand the characteristics of residual data:

  • The residual data usually has high spatial randomness because the prediction frame has removed most of the redundant information. This randomness makes the residual data suitable for further compression through transformation and quantization.

  • The residual data can be significantly reduced in size after the integer discrete cosine transform and quantization. The transform converts the residual's two-dimensional spatial information into frequency information, while quantization reduces the precision of the coefficients and removes details the human eye barely perceives.

The purpose of encoding residual data is to further reduce the bit rate of video data while maintaining image quality as much as possible. At the same time, effective compression of residual data is crucial to H.264 coding efficiency because it directly affects the quality of the encoded video and the required storage or transmission bandwidth.

2.3 Entropy Coding: Related Concepts and Extended Interpretation

Entropy coding is a lossless data compression technique based on the concept of information entropy; it aims to represent data with as few bits as possible. Its core idea is to allocate fewer bits to symbols that appear more often and more bits to symbols that appear rarely. The expected storage space or transmission bandwidth then shrinks, because the average bit rate of the whole data set drops.
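
To see the saving concretely, compare the Shannon entropy of a skewed distribution (the lower bound any entropy coder approaches) with the fixed-length cost; the probabilities below are illustrative:

```python
import math

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
entropy = -sum(p * math.log2(p) for p in probs.values())
print(entropy)                 # 1.75 bits/symbol on average, the coding lower bound
print(math.log2(len(probs)))   # 2.0 bits/symbol with fixed-length codes
```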

Entropy coding is commonly used for text and certain other data types, and it can significantly improve compression efficiency, especially when the data has a markedly non-uniform probability distribution. In video compression, the most closely related entropy coding methods are the following:

  • Huffman Coding: Huffman coding is a basic entropy coding technique used in many compression standards (such as JPEG and the MPEG series). It assigns a variable-length code to each symbol by building a Huffman tree from the frequency of symbol occurrence (a minimal implementation sketch follows this list).
  • Arithmetic Coding: arithmetic coding is used in some image and video compression standards (JPEG 2000, for example). It represents the probability distribution of the input data as a shrinking fractional interval and usually achieves better compression than Huffman coding.
  • Context-Adaptive Binary Arithmetic Coding (CABAC): CABAC is the entropy coding method used in the H.264/AVC and H.265/HEVC video compression standards. It combines arithmetic coding with context adaptation, dynamically adjusting its probability models from context information to code more efficiently.
  • Variable Length Coding (VLC): VLC is a general term for coding methods that assign variable-length codes to symbols, Huffman coding included. In video compression, VLC usually refers to the coding used to represent transform coefficients and similar syntax elements.
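
As a hedged sketch of Huffman coding, the snippet below builds the codebook for the same illustrative distribution used earlier; the resulting code lengths match the entropy exactly here because the probabilities are powers of two:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman codebook {symbol: bitstring} from symbol frequencies."""
    tie = count()  # tie-breaker so equal frequencies never compare the dicts
    heap = [(f, next(tie), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # merge the two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

print(huffman_codes({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'} -- 1.75 bits/symbol on average
```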

In video compression, entropy coding is the final coding step: it encodes the quantized residual data left after intra-frame and inter-frame prediction and the transform stage. Residual data, the difference between the original and predicted data, usually carries little energy and has a strongly non-uniform probability distribution, so entropy coding can further reduce its bit rate and thereby compress the video data.