Technology Sharing

Hive partition table

2024-07-12

한어Русский языкEnglishFrançaisIndonesianSanskrit日本語DeutschPortuguêsΕλληνικάespañolItalianoSuomalainenLatina

1. Partition table:

It is a type of table in Hive. By dividing the data in the table into multiple subsets (partitions), each partition corresponds to a specific column value in the table, which can improve query performance and data management efficiency. Each partition of the partition table is stored in a separate directory, and the definition of the partition is based on one or more columns in the table. The main purpose of using a partition table is to reduce the amount of data scanned by the query, thereby improving query efficiency.

If the partitions are too fine, a large number of small files may be generated, affecting the performance of HDFS and the efficiency of MapReduce tasks. Small files need to be merged regularly.

  1. CREATE TABLE customer_data (
  2. customer_id STRING,
  3. name STRING,
  4. age INT,
  5. email STRING
  6. )
  7. PARTITIONED BY (city STRING)
  8. STORED AS ORC;
  9. select *
  10. from customer_data;
  11. -- 插入 New York 的数据
  12. INSERT INTO TABLE customer_data PARTITION (city='New York')
  13. VALUES
  14. ('1', 'John Doe', 30, '[email protected]'),
  15. ('2', 'Jane Smith', 25, '[email protected]'),
  16. ('3', 'Bob Johnson', 40, '[email protected]');
  17. -- 插入 Los Angeles 的数据
  18. INSERT INTO TABLE customer_data PARTITION (city='Los Angeles')
  19. VALUES
  20. ('4', 'Alice Brown', 32, '[email protected]'),
  21. ('5', 'Charlie Davis', 28, '[email protected]');
  22. -- 插入 Chicago 的数据
  23. INSERT INTO TABLE customer_data PARTITION (city='Chicago')
  24. VALUES
  25. ('6', 'Eve White', 45, '[email protected]'),
  26. ('7', 'Frank Black', 37, '[email protected]');

You can see that three directories are created on HDFS, corresponding to the three partitions. When you use the select with where condition to query, you can directly search for data in the corresponding partition directory, thereby reducing the amount of data scanned by the query and improving performance.

SELECT * FROM customer_data WHERE city='New York';

2. Bucket table:

It is another table type in Hive. By hashing the data in the table, query performance can be further improved, especially when performing join and aggregation operations. Bucket tables divide data into a fixed number of buckets, each of which is stored in a separate file.
---------------------------------------------------Characteristics of bucket table---------------------------------------------------
Data partitioning: Distribute data into a fixed number of buckets based on the hash values ​​of one or more columns.
File storage: The data of each bucket is stored in a separate file.
Uniform distribution: Ideally, data is evenly distributed across all buckets, which improves query performance.

  1. CREATE TABLE customer_data2 (
  2. customer_id STRING,
  3. name STRING,
  4. age INT,
  5. email STRING
  6. )
  7. CLUSTERED BY (customer_id) INTO 4 BUCKETS
  8. STORED AS ORC;
  9. -- 插入数据到分桶表
  10. --通过这些步骤,我们创建了一个按 customer_id 列进行分桶的 Hive 表 customer_data,并插入了具体的数据。
  11. INSERT INTO TABLE customer_data2 VALUES
  12. ('1', 'John Doe', 30, '[email protected]'),
  13. ('2', 'Jane Smith', 25, '[email protected]'),
  14. ('3', 'Bob Johnson', 40, '[email protected]'),
  15. ('4', 'Alice Brown', 32, '[email protected]'),
  16. ('5', 'Charlie Davis', 28, '[email protected]'),
  17. ('6', 'Eve White', 45, '[email protected]'),
  18. ('7', 'Frank Black', 37, '[email protected]'),
  19. ('8', 'Grace Green', 22, '[email protected]');
  20. select *
  21. from customer_data2;

By looking at the path on hdfs, we can see that the data will be divided into different buckets according to the hash value of the corresponding column.