Apache Doris + Apache Hudi Quick Setup Guide

2024-07-12

Author: SelectDB Technical Team

Introduction: The data lakehouse combines the high performance and real-time capabilities of a data warehouse with the low cost and flexibility of a data lake, helping users meet a wide range of data processing and analysis needs more conveniently. Over the past several releases, Apache Doris has deepened its integration with data lakes and has evolved into a mature lakehouse solution. To help users get started quickly, we will introduce, in a series of articles, how to build a lakehouse architecture with Apache Doris and the mainstream data lake formats and storage systems, including Hudi, Iceberg, Paimon, OSS, Delta Lake, Kudu, BigQuery, and others.

As a new open data management architecture, the data lakehouse combines the high performance and real-time capabilities of a data warehouse with the low cost and flexibility of a data lake. It helps users meet diverse data processing needs more conveniently and is increasingly used in enterprise big data systems.

Over the past several releases, Apache Doris has continued to deepen its integration with data lakes and has now evolved into a mature lakehouse solution.

  • Since version 0.15, Apache Doris has supported Hive and Iceberg external tables, an early exploration of combining Doris with the Apache Iceberg data lake.
  • Since version 1.2, Apache Doris has officially provided the Multi-Catalog feature, which automatically maps metadata and accesses data from multiple sources. Together with many performance optimizations for external data reading and query execution, this makes Doris fully capable of building a fast, easy-to-use lakehouse architecture.
  • In version 2.1, the Apache Doris lakehouse architecture was comprehensively strengthened. It improves reading and writing of mainstream data lake formats (Hudi, Iceberg, Paimon, etc.) and introduces compatibility with multiple SQL dialects, so that workloads can be migrated seamlessly from their original systems to Apache Doris. For data science and large-scale data reading scenarios, Doris integrates the Arrow Flight high-speed read interface, which improves data transfer efficiency by a factor of 100.

[Figure: Apache Doris lakehouse]

Apache Doris + Apache Hudi

Apache Hudi is one of the most widely used open data lake formats and a transactional data lake management platform. It supports multiple mainstream query engines, including Apache Doris. Apache Doris has also strengthened its ability to read Apache Hudi tables:

  • Copy on Write Table: Snapshot Query
  • Merge on Read Table: Snapshot Queries, Read Optimized Queries
  • Support for Time Travel
  • Support for Incremental Read

By combining the high-performance query execution of Apache Doris with the real-time data management capabilities of Apache Hudi, users can query and analyze data efficiently, flexibly, and at low cost. The combination of Apache Doris and Apache Hudi has already been verified and adopted in real business scenarios by multiple community users:

  • Real-time analysis and processing: Common scenarios such as transaction analysis in the financial industry, real-time clickstream analysis in advertising, and user behavior analysis in e-commerce all require real-time data updates and query analysis. Hudi can update and manage data in real time while ensuring data consistency and reliability.

  • Data backtracking and auditing: For industries such as finance and healthcare, which place very high demands on data security and accuracy, data backtracking and auditing are important capabilities. Hudi provides Time Travel, which lets users view the historical state of data.

  • Incremental data reading and analysis: Large-scale data analysis often has to deal with huge data volumes and frequent updates. The Incremental Read feature of Apache Doris can make this process more efficient, significantly improving the efficiency of data processing and analysis.

  • Federated queries across data sources: Many enterprises have complex data sources, with data stored in different databases. The Multi-Catalog feature of Doris supports automatic mapping and synchronization of multiple data sources and enables federated queries across them. For enterprises that need to gather and integrate data from multiple sources for analysis, this greatly shortens the data path and improves work efficiency.

This article shows how to quickly set up a test and demo environment for Apache Doris + Apache Hudi under Docker, and demonstrates each feature so that readers can get started quickly.

User guide

All scripts and code used in this article can be obtained from: https://github.com/apache/doris/tree/master/samples/datalake/hudi

01 Environment praeparatio

The example in this article is deployed with Docker Compose.

[Figure: Environment preparation]

02 Environment deployment

  1. Create the Docker network:
sudo docker network create -d bridge hudi-net
  2. Start all components:
sudo ./start-hudi-compose.sh
  3. After startup, use the following scripts to log in to the Spark command line or the Doris command line:
sudo ./login-spark.sh
sudo ./login-doris.sh

03 Data preparation

Next, generate Hudi data through Spark. As shown in the code below, the cluster already contains a Hive table named customer, and you can create Hudi tables from this Hive table:

-- ./login-spark.sh
spark-sql> use default;

-- create a COW table
spark-sql> CREATE TABLE customer_cow
USING hudi
TBLPROPERTIES (
  type = 'cow',
  primaryKey = 'c_custkey',
  preCombineField = 'c_name'
)
PARTITIONED BY (c_nationkey)
AS SELECT * FROM customer;

-- create a MOR table
spark-sql> CREATE TABLE customer_mor
USING hudi
TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'c_custkey',
  preCombineField = 'c_name'
)
PARTITIONED BY (c_nationkey)
AS SELECT * FROM customer;

04 Data query

As shown below, a catalog named hive has already been created in the Doris cluster (it can be viewed with SHOW CATALOGS). The following is the statement that created this catalog:

-- Already created; no need to execute again
CREATE CATALOG `hive` PROPERTIES (
    "type"="hms",
    'hive.metastore.uris' = 'thrift://hive-metastore:9083',
    "s3.access_key" = "minio",
    "s3.secret_key" = "minio123",
    "s3.endpoint" = "http://minio:9000",
    "s3.region" = "us-east-1",
    "use_path_style" = "true"
);
  1. Manually refresh the catalog to synchronize the newly created Hudi tables:
-- ./login-doris.sh
doris> REFRESH CATALOG hive;
  2. Data written to Hudi through Spark is visible in Doris in real time, without refreshing the catalog again. We insert one row into each of the COW and MOR tables through Spark:
spark-sql> insert into customer_cow values (100, "Customer#000000100", "jD2xZzi", "25-430-914-2194", 3471.59, "BUILDING", "cial ideas. final, furious requests", 25);
spark-sql> insert into customer_mor values (100, "Customer#000000100", "jD2xZzi", "25-430-914-2194", 3471.59, "BUILDING", "cial ideas. final, furious requests", 25);
  3. The newly inserted data can be queried directly through Doris:
doris> use hive.default;
doris> select * from customer_cow where c_custkey = 100;
doris> select * from customer_mor where c_custkey = 100;
  4. Next, insert a row with c_custkey=32 through Spark. This key already exists, so the existing row is overwritten:
spark-sql> insert into customer_cow values (32, "Customer#000000032_update", "jD2xZzi", "25-430-914-2194", 3471.59, "BUILDING", "cial ideas. final, furious requests", 15);
spark-sql> insert into customer_mor values (32, "Customer#000000032_update", "jD2xZzi", "25-430-914-2194", 3471.59, "BUILDING", "cial ideas. final, furious requests", 15);
  5. The updated data can be queried through Doris:
doris> select * from customer_cow where c_custkey = 32;
+-----------+---------------------------+-----------+-----------------+-----------+--------------+-------------------------------------+-------------+
| c_custkey | c_name                    | c_address | c_phone         | c_acctbal | c_mktsegment | c_comment                           | c_nationkey |
+-----------+---------------------------+-----------+-----------------+-----------+--------------+-------------------------------------+-------------+
|        32 | Customer#000000032_update | jD2xZzi   | 25-430-914-2194 |   3471.59 | BUILDING     | cial ideas. final, furious requests |          15 |
+-----------+---------------------------+-----------+-----------------+-----------+--------------+-------------------------------------+-------------+
doris> select * from customer_mor where c_custkey = 32;
+-----------+---------------------------+-----------+-----------------+-----------+--------------+-------------------------------------+-------------+
| c_custkey | c_name                    | c_address | c_phone         | c_acctbal | c_mktsegment | c_comment                           | c_nationkey |
+-----------+---------------------------+-----------+-----------------+-----------+--------------+-------------------------------------+-------------+
|        32 | Customer#000000032_update | jD2xZzi   | 25-430-914-2194 |   3471.59 | BUILDING     | cial ideas. final, furious requests |          15 |
+-----------+---------------------------+-----------+-----------------+-----------+--------------+-------------------------------------+-------------+

05 Incremental Read

Incremental Read is one of the features provided by Hudi. With it, Doris can query the changes made after the c_custkey=100 row was inserted. As shown below, the change is the c_custkey=32 row that we inserted:

doris> select * from customer_cow@incr('beginTime'='20240603015018572');
+-----------+---------------------------+-----------+-----------------+-----------+--------------+-------------------------------------+-------------+
| c_custkey | c_name                    | c_address | c_phone         | c_acctbal | c_mktsegment | c_comment                           | c_nationkey |
+-----------+---------------------------+-----------+-----------------+-----------+--------------+-------------------------------------+-------------+
|        32 | Customer#000000032_update | jD2xZzi   | 25-430-914-2194 |   3471.59 | BUILDING     | cial ideas. final, furious requests |          15 |
+-----------+---------------------------+-----------+-----------------+-----------+--------------+-------------------------------------+-------------+
spark-sql> select * from hudi_table_changes('customer_cow', 'latest_state', '20240603015018572');

doris> select * from customer_mor@incr('beginTime'='20240603015058442');
+-----------+---------------------------+-----------+-----------------+-----------+--------------+-------------------------------------+-------------+
| c_custkey | c_name                    | c_address | c_phone         | c_acctbal | c_mktsegment | c_comment                           | c_nationkey |
+-----------+---------------------------+-----------+-----------------+-----------+--------------+-------------------------------------+-------------+
|        32 | Customer#000000032_update | jD2xZzi   | 25-430-914-2194 |   3471.59 | BUILDING     | cial ideas. final, furious requests |          15 |
+-----------+---------------------------+-----------+-----------------+-----------+--------------+-------------------------------------+-------------+
spark-sql> select * from hudi_table_changes('customer_mor', 'latest_state', '20240603015058442');
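To make the meaning of beginTime more concrete, the following Python sketch models an incremental read over a toy commit log. It is only an illustration and not how Doris or Hudi implement the feature: the Commit type and the incremental_read helper are hypothetical, and the sketch assumes that the begin instant is exclusive. That assumption is consistent with the output above, where the row written at instant 20240603015018572 itself is not returned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """One write in a toy commit log (hypothetical type, for illustration only)."""
    instant: str  # instant time such as "20240603015018572"
    key: int      # record key (c_custkey in the article)
    name: str     # payload (c_name in the article)


def incremental_read(commits, begin_time):
    """Return the latest version of each key written strictly after begin_time."""
    latest = {}
    # Fixed-width digit strings sort the same way lexicographically and chronologically.
    for commit in sorted(commits, key=lambda c: c.instant):
        if commit.instant > begin_time:  # exclusive lower bound (assumption)
            latest[commit.key] = commit
    return list(latest.values())


log = [
    Commit("20240603015018572", 100, "Customer#000000100"),
    Commit("20240603015444737", 32, "Customer#000000032_update"),
]

changed = incremental_read(log, "20240603015018572")
print([c.key for c in changed])  # prints [32]
```

Passing an earlier begin instant to the same helper would return both rows, which mirrors how a smaller beginTime widens the incremental window.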

06 Time Travel

Doris supports querying a specified snapshot version of Hudi data, which provides Time Travel over the data. First, you can query the commit history of the two Hudi tables through Spark:

spark-sql> call show_commits(table => 'customer_cow', limit => 10);
20240603033556094        20240603033558249        commit        448833        0        1        1        183        0        0
20240603015444737        20240603015446588        commit        450238        0        1        1        202        1        0
20240603015018572        20240603015020503        commit        436692        1        0        1        1        0        0
20240603013858098        20240603013907467        commit        44902033        100        0        25        18751        0        0

spark-sql> call show_commits(table => 'customer_mor', limit => 10);
20240603033745977        20240603033748021        deltacommit        1240        0        1        1        0        0        0
20240603015451860        20240603015453539        deltacommit        1434        0        1        1        1        1        0
20240603015058442        20240603015100120        deltacommit        436691        1        0        1        1        0        0
20240603013918515        20240603013922961        deltacommit        44904040        100        0        25        18751        0        0
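The first column of each row above is the commit's instant time. Assuming the 17-digit yyyyMMddHHmmssSSS layout that these values appear to follow, a small helper can turn an instant into a readable timestamp. This is a sketch only: parse_instant is a hypothetical name, and the result is a naive datetime because the time zone of the instants is not stated in this article.

```python
from datetime import datetime


def parse_instant(instant: str) -> datetime:
    """Parse a 17-digit instant (assumed yyyyMMddHHmmssSSS) into a naive datetime."""
    if len(instant) != 17 or not instant.isdigit():
        raise ValueError(f"unexpected instant format: {instant!r}")
    # The first 14 digits are date and time down to seconds.
    base = datetime.strptime(instant[:14], "%Y%m%d%H%M%S")
    # The last 3 digits are milliseconds.
    return base.replace(microsecond=int(instant[14:]) * 1000)


insert_100 = parse_instant("20240603015018572")
update_32 = parse_instant("20240603015444737")

print(insert_100.isoformat(timespec="milliseconds"))  # 2024-06-03T01:50:18.572
print(update_32 - insert_100)                         # 0:04:26.165000
```

Reading the instants this way makes it easier to pick the value to pass to a time travel or incremental query.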

Then, through Doris, you can query the data as it was before the c_custkey=32 row was inserted. As you can see below, the c_custkey=32 row has not yet been updated:

Note: Time Travel syntax is not currently supported by the new optimizer. You need to run set enable_nereids_planner=false; first to turn off the new optimizer. This problem will be fixed in subsequent versions.

doris> select * from customer_cow for time as of '20240603015018572' where c_custkey = 32 or c_custkey = 100;
+-----------+--------------------+---------------------------------------+-----------------+-----------+--------------+--------------------------------------------------+-------------+
| c_custkey | c_name             | c_address                             | c_phone         | c_acctbal | c_mktsegment | c_comment                                        | c_nationkey |
+-----------+--------------------+---------------------------------------+-----------------+-----------+--------------+--------------------------------------------------+-------------+
|        32 | Customer#000000032 | jD2xZzi UmId,DCtNBLXKj9q0Tlp2iQ6ZcO3J | 25-430-914-2194 |   3471.53 | BUILDING     | cial ideas. final, furious requests across the e |          15 |
|       100 | Customer#000000100 | jD2xZzi                               | 25-430-914-2194 |   3471.59 | BUILDING     | cial ideas. final, furious requests              |          25 |
+-----------+--------------------+---------------------------------------+-----------------+-----------+--------------+--------------------------------------------------+-------------+
-- compare with spark-sql
spark-sql> select * from customer_cow timestamp as of '20240603015018572' where c_custkey = 32 or c_custkey = 100;

doris> select * from customer_mor for time as of '20240603015058442' where c_custkey = 32 or c_custkey = 100;
+-----------+--------------------+---------------------------------------+-----------------+-----------+--------------+--------------------------------------------------+-------------+
| c_custkey | c_name             | c_address                             | c_phone         | c_acctbal | c_mktsegment | c_comment                                        | c_nationkey |
+-----------+--------------------+---------------------------------------+-----------------+-----------+--------------+--------------------------------------------------+-------------+
|       100 | Customer#000000100 | jD2xZzi                               | 25-430-914-2194 |   3471.59 | BUILDING     | cial ideas. final, furious requests              |          25 |
|        32 | Customer#000000032 | jD2xZzi UmId,DCtNBLXKj9q0Tlp2iQ6ZcO3J | 25-430-914-2194 |   3471.53 | BUILDING     | cial ideas. final, furious requests across the e |          15 |
+-----------+--------------------+---------------------------------------+-----------------+-----------+--------------+--------------------------------------------------+-------------+
spark-sql> select * from customer_mor timestamp as of '20240603015058442' where c_custkey = 32 or c_custkey = 100;
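The results above can be reproduced with a toy model of a snapshot read. The sketch below is an illustration under stated assumptions, not the Doris or Hudi implementation: snapshot_as_of is a hypothetical helper, the three writes are a simplified stand-in for the COW table's history, and the upper bound is assumed to be inclusive, which matches the output above (the row committed at 20240603015018572 is visible as of that instant).

```python
def snapshot_as_of(writes, as_of):
    """Return the latest value per key among writes whose instant is <= as_of."""
    state = {}
    for instant, key, name in sorted(writes):  # oldest write first
        if instant <= as_of:                   # inclusive upper bound (assumption)
            state[key] = name
    return state


# Simplified history: (instant, c_custkey, c_name)
writes = [
    ("20240603013858098", 32, "Customer#000000032"),         # initial load
    ("20240603015018572", 100, "Customer#000000100"),        # insert
    ("20240603015444737", 32, "Customer#000000032_update"),  # update
]

print(snapshot_as_of(writes, "20240603015018572"))
# {32: 'Customer#000000032', 100: 'Customer#000000100'}
```

Moving the as_of instant to 20240603015444737 or later would make the updated name for key 32 visible.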

Query optimization

Data in Apache Hudi can be roughly divided into two categories: baseline data and incremental data. Baseline data is usually a merged Parquet file, while incremental data refers to the data increments produced by INSERT, UPDATE, or DELETE. Baseline data can be read directly, while incremental data must be read through Merge on Read.

For queries on Hudi COW tables or Read Optimized queries on MOR tables, all of the data is baseline data, and the data files can be read directly by the native Parquet reader in Doris, which gives very fast query responses. For incremental data, Doris has to call the Hudi Java SDK through JNI to access it. To achieve the best query performance, Apache Doris splits the data in a query into baseline and incremental parts and reads each part with the corresponding method described above.
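The split between the two read paths can be sketched as follows. This is a simplified model, not the actual Doris planner: FileSlice and plan_reads are hypothetical names, the file names are made up, and the rule used here (a slice with no log files goes to the native reader, a slice with log files goes through JNI) is an assumption based on the description above.

```python
from dataclasses import dataclass, field


@dataclass
class FileSlice:
    """A base Parquet file plus any log files layered on top (hypothetical type)."""
    base_file: str
    log_files: list = field(default_factory=list)


def plan_reads(slices):
    """Partition slices into (native_reader, jni_reader) lists."""
    native = [s for s in slices if not s.log_files]  # baseline data only
    jni = [s for s in slices if s.log_files]         # needs Merge on Read
    return native, jni


# 100 slices with only a base file, and one slice that also carries a log file.
slices = [FileSlice(f"base-{i}.parquet") for i in range(100)]
slices.append(FileSlice("base-100.parquet", ["updates-1.log"]))

native, jni = plan_reads(slices)
print(len(native), len(jni))  # prints 100 1
```

In this model a COW table never has log files, so every slice lands on the native path.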

To verify this optimization, we use the EXPLAIN statement to see how much baseline data and incremental data the queries below involve. For the COW table, all 101 data splits are baseline data (hudiNativeReadSplits=101/101), so the whole COW table can be read directly by the Doris Parquet reader, which gives the best query performance. For the MOR table, most data splits are baseline data (hudiNativeReadSplits=100/101) and only one split is incremental data, so query performance is generally still good.

-- COW table is read natively
doris> explain select * from customer_cow where c_custkey = 32;
|   0:VHUDI_SCAN_NODE(68)                                        |
|      table: customer_cow                                       |
|      predicates: (c_custkey[#5] = 32)                          |
|      inputSplitNum=101, totalFileSize=45338886, scanRanges=101 |
|      partition=26/26                                           |
|      cardinality=1, numNodes=1                                 |
|      pushdown agg=NONE                                         |
|      hudiNativeReadSplits=101/101                              |

-- MOR table: only the file slice that holds the updated `c_custkey = 32` row has a log file, so 100 splits are read natively, while the split with the log file is read through JNI.
doris> explain select * from customer_mor where c_custkey = 32;
|   0:VHUDI_SCAN_NODE(68)                                        |
|      table: customer_mor                                       |
|      predicates: (c_custkey[#5] = 32)                          |
|      inputSplitNum=101, totalFileSize=45340731, scanRanges=101 |
|      partition=26/26                                           |
|      cardinality=1, numNodes=1                                 |
|      pushdown agg=NONE                                         |
|      hudiNativeReadSplits=100/101                              |
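When comparing many EXPLAIN outputs, it can help to extract the hudiNativeReadSplits counter programmatically. The Python sketch below assumes the counter always appears in the native/total form shown above; native_split_ratio is a hypothetical helper and not part of Doris.

```python
import re


def native_split_ratio(explain_text: str):
    """Return (native, jni, total) split counts, or None if the counter is absent."""
    match = re.search(r"hudiNativeReadSplits=(\d+)/(\d+)", explain_text)
    if match is None:
        return None
    native, total = int(match.group(1)), int(match.group(2))
    # Splits that are not read natively are the ones read through JNI.
    return native, total - native, total


mor_plan = """
|      table: customer_mor                                       |
|      inputSplitNum=101, totalFileSize=45340731, scanRanges=101 |
|      hudiNativeReadSplits=100/101                              |
"""

print(native_split_ratio(mor_plan))  # prints (100, 1, 101)
```

A growing JNI count over time suggests that more splits carry log files on top of their base files.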

You can perform some delete operations through Spark to further observe how the Hudi baseline data and incremental data change:

-- Use delete statement to see more differences
spark-sql> delete from customer_cow where c_custkey = 64;
doris> explain select * from customer_cow where c_custkey = 64;

spark-sql> delete from customer_mor where c_custkey = 64;
doris> explain select * from customer_mor where c_custkey = 64;

In addition, partition pruning can be applied through partition predicates to further reduce the amount of data read and speed up queries. In the example below, the partition predicate c_nationkey = 15 prunes partitions so that the query only needs to access the data in one partition (partition=1/26).

-- customer_xxx is partitioned by c_nationkey, we can use the partition column to prune data
doris> explain select * from customer_mor where c_custkey = 64 and c_nationkey = 15;
|   0:VHUDI_SCAN_NODE(68)                                        |
|      table: customer_mor                                       |
|      predicates: (c_custkey[#5] = 64), (c_nationkey[#12] = 15) |
|      inputSplitNum=4, totalFileSize=1798186, scanRanges=4      |
|      partition=1/26                                            |
|      cardinality=1, numNodes=1                                 |
|      pushdown agg=NONE                                         |
|      hudiNativeReadSplits=3/4                                  |
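The effect of the partition predicate can be pictured with a small model. The sketch below is only illustrative: prune_partitions is a hypothetical helper, and the 26 partitions are assumed to carry c_nationkey values 0 through 25, which the article does not state explicitly.

```python
def prune_partitions(partitions, column, value):
    """Keep only the partitions whose partition column equals the predicate value."""
    return [p for p in partitions if p[column] == value]


# Assumed layout: one partition per c_nationkey value from 0 to 25.
partitions = [{"c_nationkey": key} for key in range(26)]

selected = prune_partitions(partitions, "c_nationkey", 15)
print(f"partition={len(selected)}/{len(partitions)}")  # prints partition=1/26
```

Only the files under the surviving partition need to be listed and scanned, which is why inputSplitNum drops from 101 to 4 in the plan above.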

Conclusion

The above is a detailed guide to quickly building a test/demo environment based on Apache Doris and Apache Hudi. In the future, we will publish a series of guides on building a lakehouse architecture with Apache Doris and various mainstream data lake formats and storage systems, including Iceberg, Paimon, OSS, Delta Lake, and others. Stay tuned.