2024-07-11

Spark SQL

1. Spark SQL Architecture

  • Can directly access existing Hive data

  • Provides JDBC/ODBC interfaces so that third-party tools can use Spark to process their data

  • Provides higher-level interfaces that make data processing more convenient

  • Supports multiple ways of working: SQL and API programming (see the sketch after this list)

    • API programming: Spark SQL provides a set of SQL-like operators whose names closely mirror standard SQL clauses.
  • Supports multiple external data sources, such as Parquet, CSV, JSON, RDBMS, Hive, HBase, etc., so data can be read in many different ways

  • Spark SQL core: an RDD plus a Schema (i.e., the table structure). To make operations more convenient, the RDD together with its Schema is encapsulated as a DataFrame.

  • Data write-back: writes the processed and cleaned data back to Hive for subsequent analysis and use.

  • BI Tools: Mainly used for data presentation.

  • Spark Application: developers write their data processing and analysis logic as Spark applications, which can be implemented in different programming languages such as Python, Scala, and Java.
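The short Scala sketch below ties these pieces together: reading an external data source, operating on it through both the DataFrame API and SQL, wrapping an RDD plus a Schema into a DataFrame, and writing the result back to Hive. It is a minimal illustration rather than the article's own code; the file path, column names, and the table name cleaned_people are assumptions.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SparkSqlArchitectureDemo {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() lets the session read and write existing Hive tables
    val spark = SparkSession.builder()
      .appName("spark-sql-architecture-demo")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // 1. Read an external data source (JSON here; Parquet/CSV/JDBC work the same way)
    val people = spark.read.json("/tmp/people.json")   // assumed path

    // 2a. DataFrame API: operators whose names mirror standard SQL clauses
    people.select($"name").where($"id" === 1).show()

    // 2b. The same logic expressed as plain SQL
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE id = 1").show()

    // 3. RDD + Schema encapsulated as a DataFrame (the "Spark SQL core" idea)
    val rows = spark.sparkContext.parallelize(Seq(Row(1, "Alice"), Row(2, "Bob")))
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true)))
    val fromRdd = spark.createDataFrame(rows, schema)

    // 4. Data write-back: persist the cleaned result to Hive for later analysis
    fromRdd.write.mode("overwrite").saveAsTable("cleaned_people")

    spark.stop()
  }
}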

2. Spark SQL Operation Principle


  • The Catalyst optimizer processes a query as follows (the intermediate plans produced at each stage can be inspected as sketched after this list):
  1. Frontend
    • Input: users express their data processing logic either as SQL queries or through the DataFrame API.
    • Unresolved Logical Plan: the incoming SQL query or DataFrame transformations are first turned into an unresolved logical plan. This plan already contains all the operations the user requested, but the table names and column names have not yet been resolved.
  2. Catalyst Optimizer: the Catalyst optimizer is the core component of Spark SQL. It is responsible for turning the logical plan into a physical execution plan and optimizing it along the way. It comprises the following stages:
    • Analysis: resolve the table and column names in the unresolved logical plan against concrete metadata. This step relies on the Catalog (the metadata store). The output is a resolved (analyzed) logical plan.
    • Logical Optimization: apply optimizations to the resolved logical plan, such as column pruning and predicate (filter) pushdown. The optimized logical plan is more efficient.
    • Physical Planning: translate the optimized logical plan into one or more physical execution plans, each representing a possible way to execute the query.
    • Cost Model: estimate the execution cost of the candidate physical plans and select the cheapest one as the final physical plan.
  3. Backend
    • Code Generation: compile the selected physical plan into RDD operations that can run on Spark; this step generates the actual execution code.
    • RDDs: the generated RDD operations are executed, completing the data processing task the user requested.
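To see these stages concretely, each intermediate plan can be printed from a query's queryExecution, or all at once with explain(true). A minimal sketch, assuming the SparkSession and the people view from the earlier example:

// Prints the parsed logical plan, analyzed logical plan, optimized logical
// plan, and physical plan in one go
val query = spark.sql("SELECT name FROM people WHERE id = 1")
query.explain(true)

// The same plans are also available individually
println(query.queryExecution.logical)        // unresolved logical plan (frontend)
println(query.queryExecution.analyzed)       // after Analysis (resolved via the Catalog)
println(query.queryExecution.optimizedPlan)  // after Logical Optimization
println(query.queryExecution.executedPlan)   // selected physical plan (backend)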
  • The optimization process of a SQL query in Spark SQL; take the following query as an example (see the sketch after it):
SELECT name FROM (
  SELECT id, name FROM people
) p
WHERE p.id = 1
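For this query, Catalyst's optimized plan typically differs from the analyzed plan in two ways (a sketch of the expected behaviour; the exact plan text varies by Spark version): the filter p.id = 1 is pushed down below the inner projection so it is applied directly on the people relation (predicate pushdown), and only the columns actually needed, id for the filter and name for the output, are kept (column pruning). Assuming the people view from the earlier example, the two plans can be compared like this:

val q = spark.sql(
  """SELECT name FROM (
    |  SELECT id, name FROM people
    |) p
    |WHERE p.id = 1""".stripMargin)

println(q.queryExecution.analyzed)       // Filter still sits above the inner Project
println(q.queryExecution.optimizedPlan)  // Filter pushed down, unused columns pruned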