Technology Sharing

Introduction to Hive and its architecture

2024-07-12


What is Hive?

  • Hive is a Hadoop-based data warehouse suited to high-latency, offline workloads. It maps structured and semi-structured data files stored in Hadoop onto database tables and provides a SQL-like query language over those tables, the Hive Query Language (HQL), for accessing and analyzing large data sets in HDFS (a minimal HQL sketch follows this list);
  • The core of Hive is converting HQL into MapReduce programs, which it then submits to the Hadoop cluster for execution;
  • Hive itself neither stores nor computes data; it relies entirely on HDFS and MapReduce, and its tables are purely logical.
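To make the "logical table over HDFS files" idea concrete, here is a minimal HQL sketch; the table name, columns, and HDFS path are illustrative, not from the source:

```sql
-- Map delimited text files already sitting in HDFS onto a logical table.
-- With an EXTERNAL table, dropping the table removes only the metadata;
-- the files themselves stay in HDFS.
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  ip  STRING,
  url STRING,
  ts  BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/data/web_logs';   -- hypothetical HDFS directory

-- A SQL-like query over the logical table; Hive compiles it into a
-- MapReduce job rather than executing it itself.
SELECT url, COUNT(*) AS hits
FROM web_logs
GROUP BY url;
```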

What is a Data Warehouse?

  • The data warehouse itself does not "produce" any data; its data comes from different external systems;
  • Likewise, the data warehouse does not "consume" its own data; its results are open to various external applications;
  • This is why it is called a "warehouse" rather than a "factory".

Hive Architecture and Components

(Figure: Hive architecture and its components)

  • Metastore: usually kept in a relational database such as MySQL or Derby. Hive's metadata includes table names, the columns and partitions of each table and their attributes, table-level attributes (such as whether a table is external), the directory where the table data lives, and so on (see the sketch after this list);
  • Driver: comprises the parser, plan compiler, optimizer, and executor. It takes an HQL statement through lexical analysis, syntax analysis, compilation, optimization, and query-plan generation; the generated plan is stored in HDFS and later invoked by the execution engine;
    • SQL Parser: converts the SQL string into an abstract syntax tree (AST) and performs semantic checks on it, such as whether the SQL semantics are correct and whether the referenced tables and fields exist;
    • Compiler: compiles the AST into a logical execution plan, a DAG of stages;
    • Query Optimizer: optimizes the logical execution plan;
    • Execution: converts the logical plan into a runnable physical plan, i.e., a MapReduce or Spark program.
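Both the metastore's contents and the driver's compiled plan can be inspected from HQL itself; the sketch below reuses the illustrative web_logs table from above:

```sql
-- Show the metadata the metastore keeps for a table: columns, partitions,
-- table type (EXTERNAL or MANAGED), the HDFS location, and so on.
DESCRIBE FORMATTED web_logs;

-- Ask Hive to print the stages of the compiled execution plan (the DAG)
-- without actually running the query.
EXPLAIN
SELECT url, COUNT(*) AS hits
FROM web_logs
GROUP BY url;
```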

HQL Execution Process

(Figure: HQL execution flow)

  1. Make the request: the UI sends an execution request to the Driver;
  2. Get the execution plan: the Driver forwards the user's request to the compiler to obtain an execution plan;
  3. Get metadata: the compiler requests the relevant metadata from the Metastore, based on the tables and partitions referenced in the SQL statement;
  4. Return metadata: the Metastore returns the corresponding metadata to the compiler;
  5. Return the execution plan: the compiler parses and optimizes the SQL statement against the table and partition metadata, generates a logical execution plan (a DAG in which each stage corresponds to a MapReduce map or reduce operation), and returns it to the Driver;
  6. Run the execution plan: the Driver sends the plan to the execution engine, which submits it to Hadoop for execution as MapReduce jobs;
  7. Fetch the results: the Driver collects the results of the run and returns them to the UI (a worked example follows this list).
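Tracing a concrete query through these seven steps may help; it reuses the illustrative web_logs table, and the comments map the query onto the flow above:

```sql
-- Submitted from the UI (e.g., Beeline) to the Driver (steps 1-2).
-- The compiler fetches web_logs metadata from the Metastore (steps 3-4)
-- and produces a DAG: map tasks read the file splits and emit (url, 1),
-- reduce tasks sum the counts per url (step 5). The execution engine
-- submits this plan to Hadoop as a MapReduce job (step 6), and the Driver
-- returns the aggregated rows to the UI (step 7).
SELECT url, COUNT(*) AS hits
FROM web_logs
WHERE ts >= 1720742400   -- hypothetical cutoff timestamp
GROUP BY url;
```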