Summary of data types and storage formats in Hive

2024-07-12

1. Data Types

Hive data types fall into two categories: primitive data types and complex data types. The types Hive supports are listed below.

Primitive data types:

1. Numeric types (integer, floating point, and decimal):

                tinyint: 1-byte signed integer
                smallint: 2-byte signed integer
                int: 4-byte signed integer
                bigint: 8-byte signed integer
                float: 4-byte single-precision floating point number
                double: 8-byte double-precision floating point number
                decimal: High-precision numeric type, you can specify the precision and scale, such as decimal(10,2)

Byte: The basic unit of storage in a computer; 1 byte is 8 bits. A 1-byte signed integer (tinyint) can therefore hold values from -128 to 127: the negative range is -128 to -1 and the non-negative range is 0 to 127.
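
As a rough illustration of these sizes, the query below writes the largest value of each signed integer type as a literal. The Y, S, and L suffixes are Hive's literal postfixes for tinyint, smallint, and bigint, and BD marks a decimal literal. The column aliases are made up for this sketch, and a SELECT without a FROM clause assumes a reasonably recent Hive version.

```sql
-- Largest value of an n-byte signed integer: 2^(8n - 1) - 1
SELECT
  127Y                 AS max_tinyint,     -- 1 byte
  32767S               AS max_smallint,    -- 2 bytes
  2147483647           AS max_int,         -- 4 bytes
  9223372036854775807L AS max_bigint,      -- 8 bytes
  12345.67BD           AS decimal_literal; -- decimal literal, fits decimal(10,2)
```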

2. String type:

                string: Variable-length strings
                varchar: A variable-length string with a maximum length limit, such as varchar(255)
                char: Fixed-length string, such as char(10)
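
The difference between the three string types is easiest to see with a cast. This is a minimal sketch; the literals and aliases are illustrative only.

```sql
-- varchar truncates values longer than its declared length;
-- char pads shorter values with spaces up to its fixed length;
-- string has no declared length limit.
SELECT
  CAST('hello world' AS VARCHAR(5)) AS varchar_value, -- keeps only the first 5 characters
  CAST('hi' AS CHAR(10))            AS char_value,    -- fixed length of 10
  length('hello world')             AS string_length; -- length of a plain string
```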

3. Date/Time Type:

                timestamp: A timestamp containing the date and time, accurate to nanoseconds
                date: Contains only the date part, not the time part
                interval: Time interval, used to represent the difference between two dates or times
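
A short sketch of how these types appear in a query. It assumes a Hive version with interval support (interval expressions arrived in the 1.2 line); the literal values and aliases are made up for illustration.

```sql
SELECT
  CAST('2024-07-12' AS DATE)                         AS date_value,      -- date only
  CAST('2024-07-12 10:30:00.123456789' AS TIMESTAMP) AS timestamp_value, -- up to 9 fractional digits
  CAST('2024-07-12' AS DATE) + INTERVAL '1' DAY      AS one_day_later,   -- date plus an interval
  datediff('2024-07-12', '2024-07-01')               AS days_between;    -- difference in days
```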

4. Boolean type:

                boolean: Boolean value, true or false

5. Binary Type:

                binary: byte array of arbitrary length

Complex data types:

1. Array type:

                array<T>: An ordered list of elements that all have the same type, such as array<int>

2. Map type:

                map<K, V>: An unordered collection of key-value pairs. The key must be a primitive type; the value can be of any data type, such as map<string, int>

3. Struct type:

                struct<col1: type1, col2: type2, ...>: A record with multiple named fields, each of which can have a different data type, such as struct<name: string, age: int>

4. Union type:

                uniontype<T1, T2, ...>: A value that holds exactly one of the listed types at a time, such as uniontype<int, string>

The following table definition uses every type listed above:

    CREATE TABLE example_table (
        tinyint_col tinyint,
        smallint_col smallint,
        int_col int,
        bigint_col bigint,
        float_col float,
        double_col double,
        decimal_col decimal(10, 2),
        string_col string,
        varchar_col varchar(255),
        char_col char(10),
        timestamp_col timestamp,
        date_col date,
        boolean_col boolean,
        binary_col binary,
        array_col array<int>,
        map_col map<string, int>,
        struct_col struct<name: string, age: int>,
        union_col uniontype<int, string>
    );
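
Once such a table exists, the complex columns can be read with index, key, and dot notation. The query below is a minimal sketch against example_table; the map key 'math' is a made-up value, not something the table is known to contain.

```sql
SELECT
  array_col[0]    AS first_element,  -- arrays are indexed from 0
  size(array_col) AS element_count,  -- number of elements in the array
  map_col['math'] AS math_value,     -- look up a map value by key
  struct_col.name AS person_name,    -- struct fields use dot notation
  struct_col.age  AS person_age
FROM example_table;
```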

2. Hive file storage formats

Hive storage formats fall into two categories:

Plain text files: textfile. It is uncompressed by default and is also Hive's default storage format.

Binary files:

sequencefile: supports compression; plain text data cannot be loaded into it directly with the load method

orcfile: supports compression; plain text data cannot be loaded into it directly with the load method

parquet: supports compression; plain text data cannot be loaded into it directly with the load method

rcfile: supports compression; plain text data cannot be loaded into it directly with the load method. It is the older, less efficient predecessor of orcfile.

The load method only moves files into the table directory and does not convert them, so tables in these binary formats are normally populated with insert ... select from another table.

textfile and sequencefile are row-based formats, orc and parquet are column-based formats, and rcfile is a hybrid of row and column storage.
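
As an illustration of a column-based, compressed table, the sketch below creates an ORC table and sets its compression codec through a table property. The table and column names are hypothetical; 'orc.compress' is the ORC table property for the codec, and SNAPPY is just one possible choice.

```sql
-- Hypothetical table stored as ORC with Snappy compression
CREATE TABLE IF NOT EXISTS page_views_orc (
  url string,
  ip  string
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```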

When creating a table, you can use stored as parquet to specify the storage format of the table, for example:

    create table if not exists stocks_parquet (
        track_time string,
        url string,
        session_id string,
        referer string,
        ip string,
        end_user_id string,
        city_id string
    )
    stored as parquet;
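
Because a plain text file cannot be loaded straight into a Parquet table, a common pattern is to load it into a textfile staging table first and then copy the rows across. The sketch below assumes a hypothetical staging table stocks_text, a tab-delimited input file, and a made-up local path.

```sql
-- Staging table in plain text format with the same columns as stocks_parquet
CREATE TABLE IF NOT EXISTS stocks_text (
  track_time  string,
  url         string,
  session_id  string,
  referer     string,
  ip          string,
  end_user_id string,
  city_id     string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- LOAD DATA only moves the file; it does not convert the format
LOAD DATA LOCAL INPATH '/tmp/stocks.tsv' INTO TABLE stocks_text;

-- INSERT ... SELECT rewrites the rows in the target table's format (Parquet)
INSERT INTO TABLE stocks_parquet
SELECT * FROM stocks_text;
```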

To change Hive's default storage format, set the following property in hive-site.xml:

    <property>
        <name>hive.default.fileformat</name>
        <value>TextFile</value>
        <description>
            Expects one of [textfile, sequencefile, rcfile, orc].
            Default file format for CREATE TABLE statement. Users can explicitly override it by CREATE TABLE ... STORED AS [FORMAT]
        </description>
    </property>

You can also change it for the current session with set:

    set hive.default.fileformat=TextFile;
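
To see the effect of this setting, a session could switch the default and then inspect a table created without a stored as clause. This is a sketch only; format_demo is a hypothetical table name.

```sql
-- Change the default format for the current session only
SET hive.default.fileformat=ORC;

-- No STORED AS clause, so the session default applies
CREATE TABLE format_demo (id int);

-- The InputFormat and OutputFormat rows show which format was used
DESCRIBE FORMATTED format_demo;

-- Print the current value of the setting
SET hive.default.fileformat;
```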