2024-07-08
Author: Soup Dumplings
I have recently worked on several real-time data development requirements, and in the process of using Flink I inevitably ran into problems such as backpressure caused by data skew, interval joins, and watermark failures after windowing. Thinking through and solving these problems deepened my understanding of Flink's principles and mechanisms, so I would like to share these development experiences in the hope that they help those who need them.
The following introduces three case studies. Each case is divided into three parts: background, cause analysis, and solution.
Data skew occurs in both offline and real-time processing. It is defined as follows: when data is processed in parallel, the data partitioned by certain keys is significantly larger than the rest and the distribution is uneven, so a large amount of data is concentrated on one or a few computing nodes. Processing on those nodes is much slower than the average, which becomes the bottleneck of the entire job and degrades overall computing performance. Data skew has many causes, such as uneven key distribution in group by, too many null values, count distinct, and so on. This article only covers the group by + count distinct case.
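To make the scenario concrete, below is a minimal sketch of a skew-prone aggregation using Flink's Table API. The table name page_views, the fields page_id and user_id, and the datagen connector are hypothetical placeholders, not taken from the cases discussed later; the sketch only illustrates how COUNT(DISTINCT ...) grouped by a hot key routes all of that key's rows to a single subtask.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SkewedCountDistinct {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Hypothetical source: in a real job this would be Kafka, logs, etc.;
        // datagen is used here only so the sketch is self-contained and bounded.
        tEnv.executeSql(
                "CREATE TABLE page_views (" +
                "  page_id STRING," +
                "  user_id STRING" +
                ") WITH (" +
                "  'connector' = 'datagen'," +
                "  'number-of-rows' = '1000'" +
                ")");

        // GROUP BY page_id hashes every row of a key to the same subtask, and
        // COUNT(DISTINCT user_id) keeps per-key distinct-value state on that
        // subtask. If one page_id is a hot key, that single subtask receives
        // most of the traffic and state and becomes the bottleneck (data skew).
        tEnv.executeSql(
                "SELECT page_id, COUNT(DISTINCT user_id) AS uv " +
                "FROM page_views GROUP BY page_id")
            .print();
    }
}
```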