2024-07-12
Note: This series follows the Golden Circle learning method (why → what) to study audio and video encoding. The articles focus on building a knowledge system and practical methods for video encoding: the theory sections explain where the concepts in audio and video come from and how they connect to one another, so that you know not only the facts but also the reasons behind them.
In this article we mainly study the H.264 data compression process and its related concepts. H.264 compression has a single purpose: to reduce the size of the video data as far as possible while preserving image quality. Before diving in, keep in mind that although the process involves many concepts and methods, every one of them serves that one goal: compression.
The H.264 data compression process can be summarized in the following key steps: macroblock and sub-block division → frame grouping → frame prediction → integer discrete cosine transform (DCT) and quantization → CABAC compression. Each step is explained below.
Macroblock: when a video frame enters the H.264 encoder's buffer, the encoder divides the picture into macroblocks. H.264 uses a 16×16 pixel area as the default macroblock (H.265 uses up to a 64×64 area).
Each 16×16 macroblock can be further divided into smaller sub-blocks of 8×16, 16×8, 8×8, 4×8, 8×4, or 4×4 pixels, which is very flexible. The point of subdividing is to describe complex regions with as little data as possible. Once the macroblocks are divided, the frames in the H.264 encoder's buffer can be grouped.
Macroblock and sub-block division lets the encoder analyze and process video content more precisely and compress it more efficiently: frames are divided into 16×16 macroblocks, which can be subdivided into smaller blocks such as 8×8, 8×4, or 4×8 to adapt to the complexity of the image content.
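As an illustration, the division step can be sketched in a few lines. This is a simplified model that treats a frame as a NumPy array of luma samples; it is not an actual encoder API:

```python
import numpy as np

def split_into_blocks(frame, block_size=16):
    """Divide a frame (H x W luma array) into block_size x block_size blocks.

    Assumes H and W are multiples of block_size; a real encoder pads the
    frame edges when they are not.
    """
    h, w = frame.shape
    return [frame[r:r + block_size, c:c + block_size]
            for r in range(0, h, block_size)
            for c in range(0, w, block_size)]

# A 48x64 frame yields (48/16) * (64/16) = 12 macroblocks of 16x16,
# and each macroblock can be split again, e.g. into four 8x8 sub-blocks.
frame = np.arange(48 * 64).reshape(48, 64)
macroblocks = split_into_blocks(frame)
sub_blocks = split_into_blocks(macroblocks[0], block_size=8)
```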
Frame grouping (the GOP structure) is defined before intra-frame and inter-frame prediction. It is one of the first steps of video coding, determining how frames are organized and in what order they are encoded. Its main purpose is to reduce data redundancy, which falls into two categories: spatial redundancy within a frame and temporal redundancy between frames.
Once grouping is complete, frame prediction compression begins.
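A minimal sketch of how frames might be assigned I/P/B types in fixed-size groups. The parameters are hypothetical; real encoders place GOP boundaries adaptively, for example at scene changes:

```python
def assign_gop_types(num_frames, gop_size=12, b_frames=2):
    """Assign I/P/B frame types in display order for a simple closed GOP.

    gop_size: frames per group (the first frame of each group is an I frame).
    b_frames: number of B frames between consecutive I/P anchor frames.
    """
    types = []
    for i in range(num_frames):
        pos = i % gop_size
        if pos == 0:
            types.append('I')          # group starts with a key frame
        elif pos % (b_frames + 1) == 0:
            types.append('P')          # forward-predicted anchor
        else:
            types.append('B')          # bidirectionally predicted
    return types

# e.g. a 9-frame group with 2 B frames between anchors: I B B P B B P B B
pattern = assign_gop_types(9, gop_size=9, b_frames=2)
```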
This part mainly includes intra-frame prediction and inter-frame prediction. Intra-frame prediction compresses a frame on its own: the pixel values of each block are predicted using one of 9 modes, and the residual against the original image is computed. Inter-frame prediction (motion estimation and motion compensation): for P and B frames, motion estimation finds the motion vector relative to the previous frame (or, for B frames, the preceding and following frames), then motion compensation generates a predicted image and the residual against the current frame is computed.
Note: the order of intra and inter prediction depends on the type of frame being coded and on the coding strategy. In a real encoder implementation these steps are part of the coding loop and are handled automatically; there is no fixed rule that one must be processed first.
Intra-frame prediction compression: it solves the problem of spatial data redundancy. Spatially redundant data is mainly color and brightness information that the human eye is not sensitive to, which encoding compresses away. The principle is that the eye perceives images selectively: it is very sensitive to low-frequency brightness but much less sensitive to high-frequency detail. (Concretely, for a 16×16 macroblock, H.264 can predict the block's contents from the 16 pixels above it and the 16 pixels to its left, so information that originally required 16×16 = 256 pixels can be represented by roughly 16 + 16 − 1 = 31 boundary pixels plus a small residual.)
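To make the idea concrete, here is a simplified sketch of three of the intra prediction modes (vertical, horizontal, DC), predicting a block from the pixel row above it and the pixel column to its left. This is a conceptual model, not the exact H.264 mode definitions:

```python
import numpy as np

def intra_predict(top, left, mode, size=16):
    """Predict a size x size block from its top row and left column neighbors.

    Simplified versions of three H.264 modes:
    0 = vertical, 1 = horizontal, 2 = DC.
    """
    if mode == 0:                       # vertical: copy the row above downwards
        return np.tile(top, (size, 1))
    if mode == 1:                       # horizontal: copy the left column rightwards
        return np.tile(left.reshape(size, 1), (1, size))
    # DC: the average of all neighbor pixels fills the whole block
    dc = int(round((top.sum() + left.sum()) / (2 * size)))
    return np.full((size, size), dc)

# A flat region is predicted perfectly, so its residual is all zeros;
# the encoder keeps whichever mode leaves the smallest residual.
top = np.full(16, 100)
left = np.full(16, 100)
block = np.full((16, 16), 100)
residual = block - intra_predict(top, left, 0)
```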
Inter-frame prediction compression (motion estimation and compensation): it solves the problem of temporal data redundancy. Consecutive frames of a video are arranged linearly in time and are strongly correlated, so much of the data between frames can be removed. After compression, frames fall into three types: I frames, P frames, and B frames. (Motion estimation and compensation work as follows: the H.264 encoder takes two frames of video from the head of its buffer in sequence and scans them macroblock by macroblock. When an object is found in one picture, the encoder searches the neighboring area, the search window, of the other picture; if the object is found there, its motion vector can be calculated, as in the classic example of tracking a billiard ball from one frame to the next.)
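The search-window idea can be sketched as an exhaustive block-matching search that minimizes the sum of absolute differences (SAD). This is a didactic model; production encoders use much faster search patterns (diamond, hexagon) rather than a full search:

```python
import numpy as np

def motion_search(block, ref, top, left, search=4):
    """Full search for the best match of `block` inside a reference frame.

    (top, left) is the block's position in the current frame; `search` is
    the search-window radius. Returns the motion vector (dy, dx) and the
    SAD of the best match.
    """
    h, w = block.shape
    best_sad, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r, c = top + dy, left + dx
            if r < 0 or c < 0 or r + h > ref.shape[0] or c + w > ref.shape[1]:
                continue                 # candidate falls outside the frame
            cand = ref[r:r + h, c:c + w]
            sad = np.abs(block.astype(int) - cand.astype(int)).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad

# A bright 4x4 "ball" moves by (2, 1) between the reference and the
# current frame; the search should recover that displacement exactly.
ref = np.zeros((32, 32), dtype=int)
ref[10:14, 10:14] = 255
cur = np.zeros((32, 32), dtype=int)
cur[12:16, 11:15] = 255
mv, sad = motion_search(cur[12:16, 11:15], ref, 12, 11, search=4)
```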
This step performs a DCT transform on the prediction residual, converting spatially correlated data into decorrelated frequency-domain coefficients, which are then quantized. The DCT is widely used for data and image compression. (Note: the DCT concentrates the important information of the image into a few coefficients, so the unimportant frequency-domain regions and coefficients can simply be discarded.)
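H.264's core transform is a 4×4 integer approximation of the DCT. The sketch below applies the standard forward transform matrix and then a deliberately simplified uniform quantizer (real H.264 folds per-coefficient scaling factors into the quantization step, so the `qstep` here is an illustrative stand-in):

```python
import numpy as np

# H.264's 4x4 forward core transform matrix (integer approximation of the DCT)
CF = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]])

def transform_quantize(residual, qstep=16):
    """Integer-transform a 4x4 residual block, then quantize uniformly.

    coeffs[0, 0] is the DC (lowest-frequency) coefficient; quantization
    rounds small high-frequency coefficients down to zero.
    """
    coeffs = CF @ residual @ CF.T        # frequency-domain coefficients
    return np.round(coeffs / qstep).astype(int)

# A nearly flat residual block: after transform and quantization, almost
# all of its energy collapses into the single DC coefficient.
residual = np.array([[5, 4, 4, 5],
                     [4, 4, 4, 4],
                     [4, 4, 4, 4],
                     [5, 4, 4, 5]])
q = transform_quantize(residual)
```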
Summary: the order of frame data compression is to perform inter-frame and intra-frame prediction first, then apply the DCT to the residual data to remove its correlation and compress it further.
CABAC (Context-Adaptive Binary Arithmetic Coding) compression is a lossless entropy-coding technique. During compression it assigns short codes to high-frequency data and long codes to low-frequency data, which is more efficient than the VLC family of methods. The quantized coefficients are further compressed with CABAC to improve compression efficiency; here, the data produced by the previous four steps is encoded into the final bitstream. The compressed frames are ultimately divided into I frames, P frames, and B frames, and the video bitstream is obtained once this lossless compression step has run.
Spatial data redundancy and temporal data redundancy are two basic concepts in video compression, which describe the repeated information within a video frame and between video frames respectively.
Spatial redundancy: repeated or similar pixel information within a single frame, such as large areas of uniform color or brightness; intra-frame prediction targets this redundancy.
Temporal redundancy: information repeated across consecutive frames, since adjacent frames usually differ only slightly; inter-frame prediction targets this redundancy.
In H.264 video compression, residual data refers to the difference between the original video frame and the predicted frame. To have a deeper understanding of residual data, you need to understand the following concepts, as shown below:
Predicted Frame: In the video encoding process, intra prediction or inter prediction is used to generate predicted frames. Intra prediction is based on the pixel information of the current frame, while inter prediction is based on the motion compensation information of the previous or subsequent frames.
Original frame: Refers to the original image frames actually captured in the video sequence.
Residual calculation: the residual data is obtained by subtracting the predicted frame from the original frame; it represents the difference between the two.
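The residual calculation itself is just an element-wise subtraction, and the decoder reverses it by adding the residual back onto the prediction:

```python
import numpy as np

def residual(original, predicted):
    """Residual = original frame minus predicted frame (signed values)."""
    return original.astype(int) - predicted.astype(int)

def reconstruct(predicted, res):
    """Decoder side: prediction plus the residual restores the frame.

    (In a real codec the residual has passed through transform and
    quantization, so reconstruction is approximate, not exact.)
    """
    return predicted.astype(int) + res

orig = np.array([[100, 102], [101, 99]])
pred = np.array([[100, 100], [100, 100]])
res = residual(orig, pred)
```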
With the foundation of the above concepts, we can further understand the characteristics of residual data:
The residual data usually has high spatial randomness because the prediction frame has removed most of the redundant information. This randomness makes the residual data suitable for further compression through transformation and quantization.
The residual data can be significantly reduced in size by the integer discrete cosine transform and quantization. The transform converts the residual's two-dimensional spatial information into frequency-domain coefficients, while quantization reduces the precision of those coefficients and removes details that the human eye does not easily perceive.
The purpose of encoding residual data is to further reduce the bit rate of video data while maintaining image quality as much as possible. At the same time, effective compression of residual data is crucial to H.264 coding efficiency because it directly affects the quality of the encoded video and the required storage or transmission bandwidth.
Entropy coding is a lossless data compression technique based on the concept of information entropy, which aims to represent data with as few bits as possible. The core idea of entropy coding is to allocate fewer bits to symbols that are more likely to appear, and more bits to symbols that are less likely to appear. In this way, the expected storage space or transmission bandwidth will be reduced because the average bit rate of the entire data set will be reduced.
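CABAC itself is an adaptive arithmetic coder and too involved for a short sketch, but the core idea stated above — likely symbols get fewer bits, unlikely symbols get more — can be demonstrated with Huffman coding, a simpler entropy coder from the variable-length-coding family:

```python
import heapq
from collections import Counter

def huffman_code_lengths(data):
    """Compute Huffman code lengths: frequent symbols get shorter codes.

    Returns {symbol: code_length_in_bits}. A toy illustration of entropy
    coding, much simpler than the CABAC used by H.264.
    """
    freq = Counter(data)
    if len(freq) == 1:                   # degenerate case: one symbol
        return {next(iter(freq)): 1}
    # Heap items: (frequency, tiebreaker, {symbol: depth-so-far}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)   # merge the two rarest subtrees;
        f2, _, b = heapq.heappop(heap)   # every symbol inside them gets
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # one bit deeper
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

# 'a' dominates the data, so it receives the shortest code.
lengths = huffman_code_lengths("aaaaaabbc")
```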
Entropy coding is commonly used for text and some specific types of data compression. It can significantly improve compression efficiency, especially when the data has a markedly non-uniform probability distribution. In video compression coding, the entropy coding methods most closely related to H.264 are CAVLC (Context-Adaptive Variable-Length Coding) and CABAC (Context-Adaptive Binary Arithmetic Coding).
In video compression, entropy coding is the final coding step, which is used to encode the residual data after intra-frame prediction and inter-frame prediction. The residual data is the difference between the original data and the predicted data, and usually has less energy and a more uneven probability distribution. Through entropy coding, the bit rate of these residual data can be further reduced, thereby achieving the purpose of compressing video data.