Provide JDBC/ODBC interface for third-party tools to process data with Spark
Provide a higher-level interface to facilitate data processing
Support multiple operation modes: SQL and API programming
API programming: Spark SQL provides a set of operators modeled on SQL, whose names closely follow standard SQL clauses (e.g. select, where, groupBy); both modes are shown in the sketch below.
Supports multiple external data sources such as Parquet, CSV, JSON, RDBMS, Hive, HBase, etc. (Master multiple data reading methods)
Spark SQL core: attach a Schema (i.e. the table structure) to an RDD; to make operations more convenient, an RDD with a Schema is wrapped as a DataFrame (as sketched below).
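A minimal sketch of these points (the file path, field names, and sample values are illustrative assumptions): an external data source is read, an RDD gains a Schema to become a DataFrame, and the same query is expressed in both operation modes.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .appName("SparkSQLBasics")
  .master("local[*]")
  .getOrCreate()

// External data sources: Parquet / CSV / JSON / JDBC, etc. (path is a placeholder)
val peopleJson = spark.read.json("data/people.json")
peopleJson.printSchema()

// Core idea: RDD + Schema => DataFrame
val rowRDD = spark.sparkContext.parallelize(Seq(Row(1, "Alice"), Row(2, "Bob")))
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Operation mode 1: SQL
peopleDF.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE id = 1").show()

// Operation mode 2: DataFrame API, with operators named after SQL clauses
peopleDF.select("name").where("id = 1").show()
```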
Data write-back (re-injection): writes processed and cleaned data back to Hive for subsequent analysis and use.
BI Tools: Mainly used for data presentation.
Spark Application: Developers write their data processing and analysis logic as Spark applications, which can be implemented in different programming languages such as Python, Scala, Java, etc.
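A minimal sketch of such an application, assuming a Hive-enabled SparkSession; the table names ods.user_log and dw.user_stats are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object UserStatsJob {
  def main(args: Array[String]): Unit = {
    // Spark Application: the data processing / analysis logic written by developers
    val spark = SparkSession.builder()
      .appName("UserStatsJob")
      .enableHiveSupport() // read from and write back to Hive
      .getOrCreate()

    // Read raw data from Hive (hypothetical source table)
    val logs = spark.sql("SELECT user_id, event FROM ods.user_log")

    // Process / clean the data
    val stats = logs.groupBy("user_id").count()

    // Data write-back: store the result in Hive for later analysis and BI tools
    stats.write.mode("overwrite").saveAsTable("dw.user_stats")

    spark.stop()
  }
}
```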
2. How Spark SQL Works
The execution flow through the Catalyst optimizer is as follows:
Frontend
Input: Users express their data processing logic through SQL queries or the DataFrame API.
Unresolved Logical Plan: The input SQL query or DataFrame transformation is first converted into an unresolved logical plan. This plan contains all the operations requested by the user, but table and column names have not been resolved yet.
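A small sketch (reusing the assumed people view from the earlier example) showing that both entry points end up as a logical plan, which starts out unresolved:

```scala
// Both entry points produce a logical plan
val fromSql = spark.sql("SELECT name FROM people WHERE id = 1")
val fromApi = peopleDF.select("name").where("id = 1")

// The initial, still-unresolved logical plan: table and column references
// are not yet bound against the Catalog
println(fromSql.queryExecution.logical)
```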
Catalyst Optimizer: The Catalyst optimizer is the core component of Spark SQL. It is responsible for converting the logical plan into a physical execution plan and optimizing it. The Catalyst optimizer includes the following stages:
Analysis: Resolve the table and column names in the unresolved logical plan against concrete metadata. This step relies on the Catalog (the metadata store). The output is a resolved (analyzed) logical plan.
Logical Optimization: Apply various optimizations to the resolved logical plan, such as projection (column) pruning and filter pushdown. The optimized logical plan is more efficient.
Physical Planning: Convert the optimized logical plan into one or more physical execution plans. Each physical plan represents a possible execution method.
Cost Model: Evaluate the execution costs of different physical plans and select the physical plan with the lowest cost as the final physical plan.
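These stages can be inspected on any DataFrame through its QueryExecution. A sketch, again assuming the people view:

```scala
val qe = spark.sql("SELECT name FROM people WHERE id = 1").queryExecution

println(qe.analyzed)      // after Analysis: names resolved against the Catalog
println(qe.optimizedPlan) // after Logical Optimization: pruning, pushdown, etc.
println(qe.sparkPlan)     // a chosen physical plan
println(qe.executedPlan)  // the final physical plan that will actually run

// Or print all stages at once
spark.sql("SELECT name FROM people WHERE id = 1").explain(true)
```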
Backend
Code Generation: Convert the selected physical plan into RDD operations that can be executed on Spark. This step generates the actual execution code.
RDDs: The generated RDD operations are then executed to complete the data processing task requested by the user.
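The back end can be observed directly as well. A sketch, where explain("codegen") assumes Spark 3.x:

```scala
val result = spark.sql("SELECT name FROM people WHERE id = 1")

// Code Generation: print the code generated for the selected physical plan
result.explain("codegen")

// RDDs: the query ultimately runs as RDD operations; .rdd exposes the result as an RDD
println(result.rdd.toDebugString)
```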
For example, consider the optimization process of the following SQL query in Spark SQL:
SELECT name
FROM (SELECT id, name FROM people) p
WHERE p.id = 1
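Running this query with explain(true) makes the optimization visible: Catalyst collapses the subquery, pushes the filter on p.id down toward the scan, and prunes unused columns. A sketch, reusing the assumed people view:

```scala
spark.sql("""
  SELECT name
  FROM (SELECT id, name FROM people) p
  WHERE p.id = 1
""").explain(true)

// The optimized logical plan reads roughly:
//   Project [name]
//   +- Filter (id = 1)
//      +- <scan of people: id, name>
// i.e. the subquery is gone, the filter sits directly above the scan,
// and only the columns that are needed are read.
```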