Apache Hive is a data warehouse built on the top of Hadoop for data analysis, summarization, and querying. Hive provides an SQL-like interface to query data stored in various data sources and file systems.
Use Tez Engine
Apache Tez Engine is an extensible framework for building high-performance batch processing and interactive data processing. It is coordinated by YARN in Hadoop. Tez improved the MapReduce paradigm by increasing the processing speed and maintaining the MapReduce ability to scale to petabytes of data.
Tez engine can be enabled in your environment by setting hive.execution.engine
to tez
:
Use Vectorization
Vectorization improves the performance by fetching 1,024 rows in a single operation instead of fetching single row each time. It improves the performance for operations like filter
, join
, aggregation
, etc.
Vectorization can be enabled in the environment by executing below commands.
Use ORCFile
Optimized Row Columnar format provides highly efficient ways of storing the hive data by reducing the data storage format by 75% of the original. The ORCFile format is better than the Hive files format when it comes to reading, writing, and processing the data. It uses techniques like predicate push-down, compression, and more to improve the performance of the query.
Consider two tables: employee
and employee_details
, tables that are stored in a text file. Let's say we will use join
to fetch details from both tables.
Above query will take a long time, as the table is stored as text. Converting this table into ORCFile format will significantly reduce the query execution time.
ORC supports compressed (ZLIB and Snappy), as well as uncompressed storage.
Use Partitioning
With partitioning, data is stored in separate individual folders on HDFS. Instead of querying the whole dataset, it will query partitioned dataset.
Create Temporary Table and Load Data Into Temporary Table
Create Partitioned Table
Enable Dynamic Hive Partition
Import Data From Temporary Table To Partitioned Table
Use Bucketing
The Hive table is divided into a number of partitions and is called Hive Partition. Hive Partition is further subdivided into clusters or buckets and is called bucketing or clustering.
Cost-Based Query Optimization
Hive optimizes each query's logical and physical execution plan before submitting for final execution. However, this is not based on the cost of the query during the initial version of Hive.
During later versions of Hive, query has been optimized according to the cost of the query (like which types of join to be performed, how to order joins, the degree of parallelism, etc.).
To use cost-based optimization, set the below parameters at the start of the query.
Summary
Apache Hive is a very powerful tool for analyzing data, and it supports batch and interactive data processing. It is one of the most-used techniques by data analysts and data scientists. It is very important that you know how to improve the performance of query when you are processing petabytes of data.
Now you know how to improve the performance of the Hive query!
No comments:
Post a Comment