https://intellipaat.com/interview-question/hive-interview-questions/
Top Answers to Hive Interview Questions
1. Compare Pig and Hive
Criteria | Pig | Hive |
Architecture | Procedural data flow language | SQL type declarative language |
Application | Programming purposes | Report creation |
Operational field | Client side | Server side |
Support for Avro files | Yes | Yes (through the Avro SerDe) |
2. What is the definition of Hive? What is the present version of Hive and explain about ACID transactions in Hive?
Hive is an open source data warehouse system. We can use Hive for querying and analyzing large data sets stored in Hadoop files. Its query language is similar to SQL. At the time of writing, the present version of Hive was 0.13.1. Hive supports ACID transactions: ACID stands for Atomicity, Consistency, Isolation, and Durability. ACID transactions are provided at the row level through the following operations:
- Insert
- Delete
- Update
3. Explain what is a Hive variable. What do we use it for?
A Hive variable is a variable created in the Hive environment that can be referenced in Hive scripts. It lets us pass values to a Hive query when the query starts executing, for example in a script that is run with the source command.
4. What kind of data warehouse application is suitable for Hive? What are the types of tables in Hive?
Hive is not considered a full database. The design constraints of Hadoop and HDFS restrict what Hive can do. Hive is most suitable for data warehouse applications where:
- The data being analyzed is relatively static.
- Fast response times are not required.
- The data does not change rapidly.
Hive doesn't provide the fundamental features required for OLTP (Online Transaction Processing). Hive is suitable for data warehouse applications on large data sets.
There are two types of tables in Hive:
- Managed table.
- External table.
5. Can We Change settings within Hive Session? If Yes, How?
Yes, we can change the settings within a Hive session by using the SET command. It lets us change Hive job settings for a specific query.
Example: The following command ensures that buckets are populated according to the table definition.
hive> SET hive.enforce.bucketing=true;
We can see the current value of any property by using SET with the property name. SET on its own lists all the properties with the values set by Hive.
hive> SET hive.enforce.bucketing;
hive.enforce.bucketing=true
This list does not include the Hadoop defaults. To include them, use:
SET -v
It lists all the properties in the system, including the Hadoop defaults.
6. Is it possible to add 100 nodes when we have 100 nodes already in Hive? How?
Yes, we can add the nodes by following the steps below.
- Take a new system and create a new username and password.
- Install SSH and set up SSH connections with the master node.
- Add the SSH public RSA key to the authorized keys file.
- Add the new data node's host name, IP address and other details to /etc/hosts and the slaves file, for example:
192.168.1.102 slave3.in slave3
- Start the DataNode on the new node.
- Log in to the new node, for example with su hadoop or ssh -X hadoop@192.168.1.103.
- Start HDFS on the newly added slave node by using the following command:
./bin/hadoop-daemon.sh start datanode
- Check the output of the jps command on the new node.
7. Explain the concatenation function in Hive with an example .
The CONCAT function joins the input strings. We can specify any number of strings, separated by commas.
Example:
CONCAT('Intellipaat','-','is','-','a','-','eLearning','-','provider');
Output:
Intellipaat-is-a-eLearning-provider
Here we have to repeat the delimiter '-' between every pair of strings. If the delimiter is the same for all the strings, Hive provides another function, CONCAT_WS, in which the delimiter is specified first.
CONCAT_WS('-','Intellipaat','is','a','eLearning','provider');
Output: Intellipaat-is-a-eLearning-provider
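The difference between the two functions can be modelled in a few lines of Python. This is only an illustrative sketch, not Hive code: the helper names concat and concat_ws are made up here, and Hive's NULL handling is ignored.

```python
def concat(*parts):
    """Model of CONCAT: join the arguments exactly as given."""
    return "".join(parts)


def concat_ws(separator, *parts):
    """Model of CONCAT_WS: the separator comes first and goes between the parts."""
    return separator.join(parts)


words = ["Intellipaat", "is", "a", "eLearning", "provider"]

# With CONCAT the delimiter has to be repeated by hand.
print(concat("Intellipaat", "-", "is", "-", "a", "-", "eLearning", "-", "provider"))

# With CONCAT_WS the delimiter is given once.
print(concat_ws("-", *words))
```

Both calls print the same string, which is the point of CONCAT_WS: the delimiter is stated once instead of between every pair of arguments.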
8. Trim and Reverse function in Hive with examples.
The TRIM function removes the leading and trailing spaces from a string.
Example:
TRIM(' INTELLIPAAT ');
Output:
INTELLIPAAT
To remove the leading spaces:
LTRIM(' INTELLIPAAT');
To remove the trailing spaces:
RTRIM('INTELLIPAAT ');
The REVERSE function reverses the characters in a string.
Example:
REVERSE('INTELLIPAAT');
Output:
TAAPILLETNI
9. How to change the column data type in Hive? Explain RLIKE in Hive.
We can change the column data type by using ALTER TABLE with CHANGE.
The syntax is:
ALTER TABLE table_name CHANGE column_name column_name new_datatype;
Example: To change the data type of the salary column from integer to bigint in the employee table:
ALTER TABLE employee CHANGE salary salary BIGINT;
RLIKE: RLIKE is a special operator in Hive that matches a string against a regular expression. A RLIKE B evaluates to true if any substring of A matches the regular expression B.
Example:
'Intellipaat' RLIKE 'tell' True
'Intellipaat' RLIKE '^I.*' True (this is a regular expression)
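The "matches anywhere in the string" behaviour of RLIKE can be sketched with Python's re module. This is a rough model only: Hive evaluates Java regular expressions, whose syntax differs from Python's in some details, and the helper name rlike is made up for illustration.

```python
import re


def rlike(text, pattern):
    """Model of A RLIKE B: true if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


print(rlike("Intellipaat", "tell"))   # plain substring
print(rlike("Intellipaat", "^I.*"))   # regular expression anchored at the start
print(rlike("Intellipaat", "^tell"))  # anchored, and the string does not start with "tell"
```

The first two calls correspond to the two examples above; the third shows that an anchor restricts where the match may occur.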
10. What are the components used in Hive query processor?
The components of a Hive query processor include
- Logical Plan Generation.
- Physical Plan Generation.
- Execution Engine.
- Operators.
- UDF’s and UDAF’s.
- Optimizer.
- Parser.
- Semantic Analyzer.
- Type Checking
11. What are Buckets in Hive?
The data in a table or partition is further divided into buckets. Rows are assigned to buckets on the basis of the hash of particular table columns.
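The idea of assigning rows to buckets by hash can be sketched in Python. This is a simplified model rather than Hive's implementation: it assumes an integer key hashes to itself, and the function names and sample rows are made up for illustration.

```python
def bucket_for(key, num_buckets):
    """Pick a bucket as (hash of the key) modulo the number of buckets.

    Assumption for this sketch: an integer key hashes to itself.
    """
    return key % num_buckets


def bucketize(rows, key_column, num_buckets):
    """Distribute rows into num_buckets lists based on the bucketing column."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[key_column], num_buckets)].append(row)
    return buckets


# Hypothetical sample rows.
employees = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 5, "name": "c"},
    {"id": 6, "name": "d"},
]

for number, bucket in enumerate(bucketize(employees, "id", 4)):
    print(number, [row["id"] for row in bucket])
```

Rows with the same key always land in the same bucket, and the number of buckets stays fixed however many distinct key values there are.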
12. Explain process to access sub directories recursively in Hive queries.
By using the commands below, we can access subdirectories recursively in Hive:
hive> Set mapred.input.dir.recursive=true;
hive> Set hive.mapred.supports.subdirectories=true;
Hive tables can be pointed to the higher level directory and this is suitable for the directory structure which is like /data/country/state/city/
13. How to skip header rows from a table in Hive?
Header records in log files
System=….
Version=…
Sub-version=….
The three lines above are headers that we do not want to include in our Hive query. To skip header lines from a table in Hive, set a table property that tells Hive how many header lines to skip.
CREATE EXTERNAL TABLE employee (
name STRING,
job STRING,
dob STRING,
id INT,
salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE
LOCATION '/user/data'
TBLPROPERTIES("skip.header.line.count"="3");
14. What is the maximum size of string data type supported by hive? Mention the Hive support binary formats.
The maximum size of string data type supported by hive is 2 GB.
Hive uses the text file format by default, and it also supports the binary formats Sequence files, ORC files, Avro data files and Parquet files.
Sequence files: A general-purpose binary format that is splittable, compressible and row oriented.
ORC files: ORC stands for Optimized Row Columnar. It is a column-oriented storage format that builds on the Record Columnar File. It divides the table into row splits, and within each split it stores the values of the first column for all rows, followed by the values of each subsequent column.
Avro data files: Like sequence files, they are splittable, compressible and row oriented, and in addition they support schema evolution and bindings for multiple languages.
15. What is the precedence order of HIVE configuration?
Hive applies the following precedence hierarchy when setting properties, from highest to lowest:
- The SET command in Hive
- The command line --hiveconf option
- hive-site.xml
- hive-default.xml
- hadoop-site.xml
- hadoop-default.xml
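A layered lookup like this can be modelled with Python's collections.ChainMap, which returns the value from the first layer that defines a key. This is only a sketch of the precedence rule: the property values below are hypothetical, and the two Hadoop layers are left empty.

```python
from collections import ChainMap

# Hypothetical property values, listed from highest to lowest precedence.
set_command = {"hive.exec.parallel": "true"}
hiveconf_option = {"hive.exec.parallel": "false", "hive.cli.print.header": "true"}
hive_site = {"hive.cli.print.header": "false", "hive.metastore.warehouse.dir": "/data/warehouse"}
hive_default = {"hive.metastore.warehouse.dir": "/user/hive/warehouse"}
hadoop_site = {}
hadoop_default = {}

# ChainMap searches the mappings in order and returns the first match.
effective = ChainMap(
    set_command, hiveconf_option, hive_site, hive_default, hadoop_site, hadoop_default
)

for name in ("hive.exec.parallel", "hive.cli.print.header", "hive.metastore.warehouse.dir"):
    print(name, "=", effective[name])
```

In this model the SET value wins over --hiveconf, --hiveconf wins over hive-site.xml, and hive-site.xml wins over hive-default.xml.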
16. If you run a select * query in Hive, Why does it not run MapReduce?
The hive.fetch.task.conversion property reduces latency by avoiding MapReduce overhead. When it is in effect, simple queries such as SELECT *, filters and LIMIT are executed as a fetch task that reads the data directly and skips MapReduce.
17. How Hive can improve performance with ORC format tables?
We can store Hive data in a highly efficient manner in the Optimized Row Columnar (ORC) file format, which overcomes many limitations of the other Hive file formats. Using ORC files improves performance when reading, writing and processing data.
SET hive.compute.query.using.stats=true;
SET hive.stats.dbclass=fs;
CREATE TABLE orc_table (
id int,
name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\:'
LINES TERMINATED BY '\n'
STORED AS ORC;
18. Explain the functionality of Object-Inspector.
It helps to analyze the internal structure of a row object and the individual structure of columns in Hive. It also provides a uniform way to access complex objects that can be stored in multiple formats in memory:
- An instance of a Java class
- A standard Java object
- A lazily initialized object
The ObjectInspector describes the structure of the object and also the ways to access the internal fields inside the object.
19. Whenever we run hive query, new metastore_db is created. Why?
A local metastore is created when we run Hive in embedded mode. Before creating it, Hive checks whether the metastore already exists. This behavior is controlled by a property defined in the configuration file hive-site.xml: "javax.jdo.option.ConnectionURL", with the default value "jdbc:derby:;databaseName=metastore_db;create=true". Because metastore_db is a relative location, a new metastore is created in whichever directory Hive is launched from. To change this behavior, set the location to an absolute path so that the metastore at that location is always used.
20. Differentiate between Hive and HBase
Hive | HBase |
Enables most SQL queries | Doesn't allow SQL queries |
Doesn't support record-level insert, update, and delete operations on a table | Supports record-level insert, update, and delete operations |
It is a data warehouse framework | It is a NoSQL database |
Hive runs on top of MapReduce | HBase runs on top of HDFS |
21. How can we access the sub directories recursively?
By using the commands below, we can access subdirectories recursively in Hive:
hive> Set mapred.input.dir.recursive=true;
hive> Set hive.mapred.supports.subdirectories=true;
Hive tables can be pointed to the higher level directory and this is suitable for the directory structure which is like /data/country/state/city/
22. What are the uses of explode in Hive?
explode takes an array as its input and emits each element of the array as a separate table row. Hive uses explode to convert complex data types into the desired table format.
23. What is available mechanism for connecting from applications, when we run hive as a server?
- Thrift Client: Using Thrift, you can call Hive commands from various programming languages, for example C++, PHP, Java, Python and Ruby.
- JDBC Driver: Hive provides a Type 4 (pure Java) JDBC driver.
- ODBC Driver: ODBC Driver supports the ODBC protocol.
24. How do we write our own custom SerDe?
In most cases, users want to read their own data format rather than write it, so it is usually enough to write a Deserializer instead of a full SerDe.
Example: The RegexDeserializer will deserialize the data using the configuration parameter 'regex' and a list of column names.
If our SerDe supports DDL, we probably want to implement a protocol based on DynamicSerDe. It's non-trivial to write a "thrift DDL" parser.
25. Mention the date data type in Hive. Name the Hive data type collection.
The TIMESTAMP data type stores dates in java.sql.Timestamp format.
Hive has three collection data types:
- ARRAY
- MAP
- STRUCT
26. Can we run UNIX shell commands from Hive? Can Hive queries be executed from script files? How? Give an example.
Yes, we can run UNIX shell commands from Hive by putting the ! mark before the command. For example, !pwd at the hive prompt prints the current directory.
We can execute Hive queries from script files by using the source command.
Example:
hive> source /path/to/file/file_with_query.hql
https://www.dezyre.com/article/hive-interview-questions-and-answers-for-2018/246
Hadoop Hive Interview Questions and Answers
1) What is the difference between Pig and Hive ?
Criteria | Pig | Hive |
Type of Data | Apache Pig is usually used for semi structured data. | Used for Structured Data |
Schema | Schema is optional. | Hive requires a well-defined Schema. |
Language | It is a procedural data flow language. | Follows SQL Dialect and is a declarative language. |
Purpose | Mainly used for programming. | It is mainly used for reporting. |
General Usage | Usually used on the client side of the hadoop cluster. | Usually used on the server side of the hadoop cluster. |
Coding Style | Verbose | More like SQL |
2) What is the difference between HBase and Hive ?
HBase | Hive |
HBase does not allow execution of SQL queries. | Hive allows execution of most SQL queries. |
HBase runs on top of HDFS. | Hive runs on top of Hadoop MapReduce. |
HBase is a NoSQL database. | Hive is a datawarehouse framework. |
Supports record level insert, update and delete operations. | Does not support record level insert, update and delete. |
2) I do not need the index created in the first question anymore. How can I delete the above index named index_bonuspay?
DROP INDEX index_bonuspay ON employee;
3) Can you list few commonly used Hive services?
- Command Line Interface (cli)
- Hive Web Interface (hwi)
- HiveServer (hiveserver)
- rcfilecat (a tool for printing the contents of an RC file)
- Jar
- Metastore
4) Suppose that I want to monitor all the open and aborted transactions in the system along with the transaction id and the transaction state. Can this be achieved using Apache Hive?
Yes. Hive 0.13.0 and later versions support the SHOW TRANSACTIONS command, which helps administrators monitor Hive transactions.
5) What is the use of HCatalog?
HCatalog can be used to share data structures with external systems. HCatalog provides access to the Hive metastore to users of other tools on Hadoop so that they can read and write data to Hive's data warehouse.
6) Write a query to rename a table Student to Student_New.
ALTER TABLE Student RENAME TO Student_New;
7) Where is table data stored in Apache Hive by default?
hdfs://namenode_server/user/hive/warehouse
8) Explain the difference between partitioning and bucketing.
- Partitioning and bucketing of tables are done to improve query performance. Partitioning helps queries execute faster only if the partitioning scheme matches common range filters, e.g. by timestamp ranges, by location, etc. Bucketing is not enforced by default.
- Partitioning helps eliminate data when the partition column is used in the WHERE clause. Bucketing organizes the data inside a partition into multiple files so that the same set of data is always written to the same bucket. Bucketing helps when joining tables on the bucketed columns.
- In partitioning technique, a partition is created for every unique value of the column and there could be a situation where several tiny partitions may have to be created. However, with bucketing, one can limit it to a specific number and the data can then be decomposed in those buckets.
- Basically, a bucket is a file in Hive whereas partition is a directory.
9) Explain about the different types of partitioning in Hive?
Partitioning in Hive helps prune the data when executing the queries to speed up processing. Partitions are created when data is inserted into the table. In static partitions, the name of the partition is hardcoded into the insert statement whereas in a dynamic partition, Hive automatically identifies the partition based on the value of the partition field.
Static or dynamic partitioning can be chosen based on how data is loaded into the table, the requirements for the data and the format in which the data is produced at the source. With dynamic partitions, the complete data in the file is read and is partitioned into the table by a MapReduce job based on a particular field in the file. Dynamic partitions are usually helpful during ETL flows in the data pipeline.
When loading data from huge files, static partitions are preferred over dynamic partitions as they save time in loading data. The partition is added to the table and then the file is moved into the static partition. The partition column value can be obtained from the file name without having to read the complete file.
10) When executing Hive queries in different directories, why is metastore_db created in all places from where Hive is launched?
When running Hive in embedded mode, it creates a local metastore. When you run the query, it first checks whether a metastore already exists or not. The property javax.jdo.option.ConnectionURL defined in hive-site.xml has the default value jdbc:derby:;databaseName=metastore_db;create=true.
The value implies that embedded Derby will be used as the Hive metastore and that the location of the metastore is metastore_db, which will be created only if it does not exist already. The location metastore_db is a relative location, so when you run queries from different directories it gets created in every place from which you launch Hive. This property can be changed in the hive-site.xml file to an absolute path so that the metastore is used from that particular location instead of a metastore_db subdirectory being created in multiple places.
11) How will you read and write HDFS files in Hive?
i) TextInputFormat- This class is used to read data in plain text file format.
ii) HiveIgnoreKeyTextOutputFormat- This class is used to write data in plain text file format.
iii) SequenceFileInputFormat- This class is used to read data in hadoop SequenceFile format.
iv) SequenceFileOutputFormat- This class is used to write data in hadoop SequenceFile format.
12) What are the components of a Hive query processor?
Query processor in Apache Hive converts the SQL to a graph of MapReduce jobs with the execution time framework so that the jobs can be executed in the order of dependencies. The various components of a query processor are-
- Parser
- Semantic Analyser
- Type Checking
- Logical Plan Generation
- Optimizer
- Physical Plan Generation
- Execution Engine
- Operators
- UDF’s and UDAF’s.
13) Differentiate between describe and describe extended.
Describe database/schema- This query displays the name of the database, the root location on the file system and comments if any.
Describe extended database/schema- Gives the details of the database or schema in a detailed manner.
14) Is it possible to overwrite Hadoop MapReduce configuration in Hive?
Yes, Hadoop MapReduce configuration can be overridden by changing the Hive configuration settings file.
15) I want to see the present working directory in UNIX from hive. Is it possible to run this command from hive?
Hive allows execution of UNIX commands with the use of the exclamation mark (!). Just put the ! symbol before the command to be executed at the hive prompt. To see the present working directory in UNIX from Hive, run !pwd at the hive prompt.
16) What is the use of explode in Hive?
Explode in Hive is used to convert complex data types into desired table formats. explode UDTF basically emits all the elements in an array into multiple rows.
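What explode does to an array column can be sketched in Python as follows. This is a model of the behaviour only, not Hive's UDTF code; the function name mirrors the Hive function for readability and the sample rows are hypothetical.

```python
def explode(rows, array_column):
    """Emit one output row per element of the array column."""
    for row in rows:
        for element in row[array_column]:
            yield element


# Hypothetical table with an array-typed column.
employees = [
    {"name": "a", "skills": ["hive", "pig"]},
    {"name": "b", "skills": ["hbase"]},
    {"name": "c", "skills": []},
]

for skill in explode(employees, "skills"):
    print(skill)
```

Two input rows with non-empty arrays become three output rows, and the row with an empty array contributes nothing.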
17) Explain about SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY in Hive.
SORT BY - Data is ordered at each of the N reducers, where the reducers can have overlapping ranges of data.
ORDER BY - This is similar to ORDER BY in SQL, where total ordering of the data takes place by passing it through a single reducer.
DISTRIBUTE BY - It is used to distribute the rows among the reducers. Rows that have the same DISTRIBUTE BY columns go to the same reducer.
CLUSTER BY - It is a combination of DISTRIBUTE BY and SORT BY, where each of the N reducers gets a non-overlapping range of data, which is then sorted at the respective reducer.
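The four clauses can be compared with a small Python simulation in which each reducer is a list. This is a sketch under simplifying assumptions, not how Hive executes queries: keys are integers, rows are assigned to reducers by key modulo the number of reducers, and all function names are made up to mirror the clauses.

```python
def distribute_by(rows, key, num_reducers):
    """Send rows with the same key to the same reducer (hash partitioning)."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[key(row) % num_reducers].append(row)
    return reducers


def sort_by(reducers, key):
    """Sort each reducer's rows independently; there is no global order."""
    return [sorted(rows, key=key) for rows in reducers]


def cluster_by(rows, key, num_reducers):
    """DISTRIBUTE BY followed by SORT BY on the same key."""
    return sort_by(distribute_by(rows, key, num_reducers), key)


def order_by(rows, key):
    """Total order: every row passes through a single reducer."""
    return sorted(rows, key=key)


def identity(value):
    return value


values = [7, 2, 9, 4, 1, 8]

print(order_by(values, identity))                   # one globally sorted list
print(sort_by([[7, 2, 9], [4, 1, 8]], identity))    # sorted per reducer, ranges overlap
print(cluster_by(values, identity, 2))              # same key -> same reducer, then sorted
```

In the SORT BY call the rows were split arbitrarily, so each list is sorted but the two lists overlap. In the CLUSTER BY call no key appears in more than one reducer.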
18) Difference between HBase and Hive.
- HBase is a NoSQL database whereas Hive is a data warehouse framework to process Hadoop jobs.
- HBase runs on top of HDFS whereas Hive runs on top of Hadoop MapReduce.
19) Write a hive query to view all the databases whose name begins with “db”
SHOW DATABASES LIKE 'db.*';
20) How can you prevent a large job from running for a long time?
This can be achieved by setting the MapReduce jobs to execute in strict mode:
set hive.mapred.mode=strict;
Strict mode ensures that queries on partitioned tables cannot execute without a WHERE clause that filters on the partition column.
21) What is a Hive Metastore?
Hive Metastore is a central repository that stores Hive metadata in an external database.
22) Are multiline comments supported in Hive?
No
23) What is ObjectInspector functionality?
ObjectInspector is used to analyse the structure of individual columns and the internal structure of the row objects. ObjectInspector in Hive provides access to complex objects which can be stored in multiple formats.
24) Explain about the different types of join in Hive.
HiveQL has 4 different types of joins:
JOIN - Similar to an inner join in SQL: only the rows that fulfil the join condition in both tables are returned.
FULL OUTER JOIN - Returns all the rows from both the left and right tables, combining the records that fulfil the join condition.
LEFT OUTER JOIN - All the rows from the left table are returned even if there are no matches in the right table.
RIGHT OUTER JOIN - All the rows from the right table are returned even if there are no matches in the left table.
25) How can you configure remote metastore mode in Hive?
To configure a remote metastore in Hive, the hive-site.xml file has to be configured with the property below:
<property>
<name>hive.metastore.uris</name>
<value>thrift://node1 (or IP Address):9083</value>
<description>IP address and port of the metastore host</description>
</property>
26) Is it possible to change the default location of Managed Tables in Hive, if so how?
Yes, we can change the default location of Managed tables using the LOCATION keyword while creating the managed table. The user has to specify the storage path of the managed table as the value to the LOCATION keyword.
27) How data transfer happens from HDFS to Hive?
If the data is already present in HDFS, the user does not need to run LOAD DATA, which would move the files to /user/hive/warehouse/. Instead, the user just has to define the table using the EXTERNAL keyword, which creates the table definition in the Hive metastore and points it at the existing location.
CREATE EXTERNAL TABLE table_name (
id int,
myfields string
)
LOCATION '/my/location/in/hdfs';
28) In case of embedded Hive, can the same metastore be used by multiple users?
No, the embedded metastore cannot be used in sharing mode. It is suggested to use a standalone database such as PostgreSQL or MySQL instead.
29) The partition of hive table has been modified to point to a new directory location. Do I have to move the data to the new location or the data will be moved automatically to the new location?
Changing the location that a partition points to will not move the data to the new location. The data has to be moved manually from the old location to the new one.
30) What will be the output of cast('XYZ' as INT)?
It will return a NULL value.
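The behaviour can be pictured with a small Python model in which None stands in for NULL. This sketch only covers the case in the question, a string that is not a number. It does not reproduce Hive's conversion rules for other inputs such as decimal strings, and the function name cast_as_int is made up for illustration.

```python
def cast_as_int(value):
    """Model of CAST(value AS INT) for strings; None stands in for NULL."""
    try:
        return int(value)
    except ValueError:
        # A failed conversion yields NULL instead of raising an error.
        return None


print(cast_as_int("XYZ"))
print(cast_as_int("42"))
```

The point of the model is that a failed cast does not abort the query; the row simply gets a NULL in that column.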
31) What are the different components of a Hive architecture?
Hive Architecture consists of a –
- User Interface - The UI component of the Hive architecture calls the execute interface of the driver.
- Driver - Creates a session handle for the query and sends the query to the compiler to generate an execution plan for it.
- Metastore - Sends the metadata to the compiler for the execution of the query on receiving the sendMetaData request.
- Compiler - Generates the execution plan, which is a DAG of stages where each stage is either a metadata operation, a map or reduce job, or an operation on HDFS.
- Execution Engine - Responsible for submitting each of these stages to the relevant components by managing the dependencies between the various stages in the execution plan generated by the compiler.
32) What happens on executing the below query? After executing the below query, if you modify the column –how will the changes be tracked?
hive> CREATE INDEX index_bonuspay ON TABLE employee (bonus)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
The query creates an index named index_bonuspay which points to the bonus column in the employee table. Whenever the value of bonus is modified it will be stored using an index value.
33) What is the default database provided by Hive for Metastore ?
Derby is the default database.
34) Is it possible to compress json in Hive external table ?
Yes, you need to gzip your files and put them as is (*.gz) into the table location.
Scenario based or Real-Time Interview Questions on Hadoop Hive
- How will you optimize Hive performance?
There are various ways to run Hive queries faster -
- Using Apache Tez execution engine
- Using vectorization
- Using ORCFILE
- Do cost based query optimization.
- Will the reducer work or not if you use “Limit 1” in any HiveQL query?
- Why you should choose Hive instead of Hadoop MapReduce?
- I create a table which contains transaction details of customers for the year 2018.
CREATE TABLE transaction_details (cust_id INT, amount FLOAT, month STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
I have inserted 60K tuples in this table and now want to know the total revenue that has been generated for each month. However, Hive takes too much time to process this query. List all the steps that you would follow to solve this problem.
- There is a Python application that connects to a Hive database for extracting data, creating sub tables for data processing, dropping temporary tables, etc. 90% of the processing is done through Hive queries which are generated from Python code and are sent to the Hive server for execution. Assume that there are 100K rows. Would it be faster to fetch the 100K rows into Python as a list of tuples, mimic the join or filter operations Hive performs and avoid the execution of the 20-50 queries run against Hive, or should you look into Hive query optimization techniques? Which one is more performance efficient?
Other Interview Questions on Hadoop Hive
- Explain the difference between SQL and Apache Hive.
- Why mapreduce will not run if you run select * from table in hive?
-------------------------------------------------------
Apache Hive Interview Questions
Here is the comprehensive list of the most frequently asked Apache Hive Interview Questions that have been framed after deep research and discussion with the industry experts.
1. What kind of applications is supported by Apache Hive?
Hive supports all those client applications that are written in Java, PHP, Python, C++ or Ruby by exposing its Thrift server.
2. Define the difference between Hive and HBase?
The key differences between Apache Hive and HBase are as follows:
- Hive is a data warehousing infrastructure whereas HBase is a NoSQL database on top of Hadoop.
- Apache Hive queries are executed as MapReduce jobs internally whereas HBase operations run in real time on its database rather than as MapReduce jobs.
3. Where does the data of a Hive table get stored?
By default, the Hive table is stored in an HDFS directory – /user/hive/warehouse. One can change it by specifying the desired directory in hive.metastore.warehouse.dir configuration parameter present in the hive-site.xml.
4. What is a metastore in Hive?
The metastore in Hive stores the metadata information using an RDBMS and an open source ORM (Object-Relational Mapping) layer called DataNucleus, which converts the object representation into a relational schema and vice versa.
5. Why Hive does not store metadata information in HDFS?
Hive stores metadata information in the metastore using RDBMS instead of HDFS. The reason for choosing RDBMS is to achieve low latency as HDFS read/write operations are time consuming processes.
6. What is the difference between local and remote metastore?
Local Metastore:
In local metastore configuration, the metastore service runs in the same JVM in which the Hive service is running and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
Remote Metastore:
In the remote metastore configuration, the metastore service runs on its own separate JVM and not in the Hive service JVM. Other processes communicate with the metastore server using Thrift Network APIs. You can have one or more metastore servers in this case to provide more availability.
7. What is the default database provided by Apache Hive for metastore?
By default, Hive provides an embedded Derby database instance backed by the local disk for the metastore. This is called the embedded metastore configuration.
8. Scenario:
Suppose I have installed Apache Hive on top of my Hadoop cluster using default metastore configuration. Then, what will happen if we have multiple clients trying to access Hive at the same time?
The default metastore configuration allows only one Hive session to be opened at a time for accessing the metastore. Therefore, if multiple clients try to access the metastore at the same time, they will get an error. One has to use a standalone metastore, i.e. Local or remote metastore configuration in Apache Hive for allowing access to multiple clients concurrently.
Following are the steps to configure a MySQL database as the local metastore in Apache Hive:
- One should make the following changes in hive-site.xml:
  - javax.jdo.option.ConnectionURL property should be set to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true.
  - javax.jdo.option.ConnectionDriverName property should be set to com.mysql.jdbc.Driver.
- One should also set the username and password as:
  - javax.jdo.option.ConnectionUserName is set to the desired username.
  - javax.jdo.option.ConnectionPassword is set to the desired password.
- The JDBC driver JAR file for MySQL must be on Hive's classpath, i.e. the JAR file should be copied into Hive's lib directory.
- Now, after restarting the Hive shell, it will automatically connect to the MySQL database which is running as a standalone metastore.
9. What is the difference between external table and managed table?
Here is the key difference between an external table and managed table:
- If one drops a managed table, the metadata information along with the table data is deleted from the Hive warehouse directory.
- On the contrary, in case of an external table, Hive just deletes the metadata information regarding the table and leaves the table data present in HDFS untouched.
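The difference can be sketched in HiveQL as follows. The table names managed_logs and external_logs and the path /data/external_logs are hypothetical and used only for illustration.

```sql
-- Managed table: Hive owns both the metadata and the data files.
CREATE TABLE managed_logs (id INT, msg STRING);

-- External table: Hive owns only the metadata; the files live at the given path.
CREATE EXTERNAL TABLE external_logs (id INT, msg STRING)
LOCATION '/data/external_logs';

-- Removes the metadata and the data stored under the warehouse directory.
DROP TABLE managed_logs;

-- Removes only the metadata; the files under /data/external_logs stay in HDFS.
DROP TABLE external_logs;
```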
10. Is it possible to change the default location of a managed table?
Yes, it is possible to change the default location of a managed table. It can be achieved by adding the clause LOCATION '<hdfs_path>' when creating the table.
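A minimal sketch of the clause in use, assuming a hypothetical table sales_managed and a hypothetical HDFS path:

```sql
-- Hypothetical table and path, for illustration only.
CREATE TABLE sales_managed (id INT, amount FLOAT)
LOCATION '/user/custom/sales_managed';

-- The 'Location' field in the output shows where the table data is stored.
DESCRIBE FORMATTED sales_managed;
```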
11. When should we use SORT BY instead of ORDER BY?
We should use SORT BY instead of ORDER BY when we have to sort huge datasets, because the SORT BY clause sorts the data using multiple reducers whereas ORDER BY sorts all of the data together using a single reducer. Therefore, using ORDER BY against a large number of inputs will take a lot of time to execute. Note that SORT BY only guarantees ordering within each reducer's output, not a total order across the whole result.
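The following sketch contrasts the two clauses. It assumes a hypothetical table orders with the columns cust_id, country and amount.

```sql
-- Total order across the whole result: all rows pass through a single reducer.
SELECT cust_id, amount FROM orders ORDER BY amount DESC;

-- Per-reducer order: each reducer sorts only the rows it receives.
SELECT cust_id, amount FROM orders SORT BY amount DESC;

-- DISTRIBUTE BY sends all rows of one country to the same reducer,
-- and SORT BY then orders them within that reducer.
SELECT cust_id, country, amount
FROM orders
DISTRIBUTE BY country
SORT BY country, amount DESC;
```

Combining DISTRIBUTE BY with SORT BY is a common way to get grouped, locally sorted output without forcing everything through one reducer.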
12. What is a partition in Hive?
Hive organizes tables into partitions to group similar types of data together based on a column or partition key. Each table can have one or more partition keys to identify a particular partition. Physically, a partition is nothing but a sub-directory in the table directory.
13. Why do we perform partitioning in Hive?
Partitioning provides granularity in a Hive table and therefore reduces query latency by scanning only the relevant partitioned data instead of the whole data set.
For example, we can partition a transaction log of an e-commerce website based on month like Jan, Feb, etc. So, any analytics regarding a particular month, say Jan, will have to scan only the Jan partition (sub-directory) instead of the whole table data.
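As an illustration of this example, here is a small sketch. The table transaction_log and its columns are hypothetical.

```sql
-- Hypothetical table: one sub-directory per month value under the table directory.
CREATE TABLE transaction_log (cust_id INT, amount FLOAT)
PARTITIONED BY (month STRING);

-- The filter on the partition column lets Hive read only the month=Jan directory.
SELECT SUM(amount) FROM transaction_log WHERE month = 'Jan';

-- Lists the partitions Hive currently knows about.
SHOW PARTITIONS transaction_log;
```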
14. What is dynamic partitioning and when is it used?
In dynamic partitioning, the values for the partition columns are known only at runtime, i.e. they become known while the data is being loaded into a Hive table.
One may use dynamic partitioning in the following two cases:
- When loading data from an existing non-partitioned table, to improve sampling and thereby decrease the query latency.
- When one does not know all the partition values beforehand, so that finding them manually in a huge data set would be a tedious task.
15. Scenario:
Suppose I create a table that contains details of all the transactions done by customers in the year 2016:
CREATE TABLE transaction_details (cust_id INT, amount FLOAT, month STRING, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Now, after inserting 50,000 tuples into this table, I want to know the total revenue generated for each month. But Hive is taking too much time to process this query. How will you solve this problem? List the steps you will take.
We can reduce the query latency by partitioning the table by month. Then, for each month, Hive scans only that month's partition instead of the whole data set.
As we know, we can't partition an existing non-partitioned table directly. So, we will take the following steps:
1. Create a partitioned table, say partitioned_transaction:
CREATE TABLE partitioned_transaction (cust_id INT, amount FLOAT, country STRING) PARTITIONED BY (month STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
2. Enable dynamic partitioning in Hive:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
3. Transfer the data from the non-partitioned table into the newly created partitioned table:
INSERT OVERWRITE TABLE partitioned_transaction PARTITION (month) SELECT cust_id, amount, country, month FROM transaction_details;
Now, queries that filter on month read only the matching partitions, which decreases the query time.
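The revenue queries against the partitioned table might then look like the sketch below. The partition value 'Jan' is an assumption about how months are stored in the data.

```sql
-- Total revenue per month; this form still reads every partition.
SELECT month, SUM(amount) AS total_revenue
FROM partitioned_transaction
GROUP BY month;

-- Revenue for a single month; Hive can restrict the scan to that partition.
SELECT SUM(amount) AS total_revenue
FROM partitioned_transaction
WHERE month = 'Jan';
```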
16. How can you add a new partition for the month December in the above partitioned table?
To add a new partition to the above table partitioned_transaction, we will issue the command given below:
ALTER TABLE partitioned_transaction ADD PARTITION (month='Dec') LOCATION '/partitioned_transaction';
17. What is the default maximum dynamic partition that can be created by a mapper/reducer? How can you change it?
By default, the maximum number of dynamic partitions that can be created by a mapper or reducer is 100. One can change it by issuing the following command:
SET hive.exec.max.dynamic.partitions.pernode = <value>;
Note: You can set the total number of dynamic partitions that can be created by one statement by using: SET hive.exec.max.dynamic.partitions = <value>;
18. Scenario:
I am inserting data into a table based on partitions dynamically. But, I received an error – FAILED ERROR IN SEMANTIC ANALYSIS: Dynamic partition strict mode requires at least one static partition column. How will you remove this error?
To remove this error, one has to execute the following commands:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Things to Remember:
- By default, the hive.exec.dynamic.partition configuration property is set to false in Hive versions prior to 0.9.0.
- hive.exec.dynamic.partition.mode is set to strict by default. Only in nonstrict mode does Hive allow all partitions to be dynamic.
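For contrast, here is a sketch of an insert that strict mode accepts because one partition column is static. The table transactions_by_region is hypothetical, and the source table is the transaction_details table from the earlier scenario.

```sql
-- Hypothetical table partitioned by two columns.
CREATE TABLE transactions_by_region (cust_id INT, amount FLOAT)
PARTITIONED BY (country STRING, month STRING);

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = strict;

-- Accepted in strict mode: country is static, month is dynamic.
-- The dynamic partition column comes last in the SELECT list.
INSERT OVERWRITE TABLE transactions_by_region PARTITION (country = 'US', month)
SELECT cust_id, amount, month
FROM transaction_details
WHERE country = 'US';
```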
19. Why do we need buckets?
There are two main reasons for performing bucketing on a partition:
- A map-side join requires all the data belonging to a given join key to be present in the same partition. But what if your partition key differs from the join key? In such cases, you can still perform a map-side join by bucketing the table on the join key.
- Bucketing makes the sampling process more efficient and therefore, allows us to decrease the query time.
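A short sketch of a bucketed table and a sampling query over it. The table customers_bucketed and the bucket count of 4 are assumptions chosen for illustration.

```sql
-- Hypothetical table bucketed on the join key cust_id into 4 buckets.
CREATE TABLE customers_bucketed (cust_id INT, name STRING)
CLUSTERED BY (cust_id) INTO 4 BUCKETS;

-- Sampling reads roughly one bucket out of four instead of the whole table.
SELECT * FROM customers_bucketed TABLESAMPLE (BUCKET 1 OUT OF 4 ON cust_id);
```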
20. How does Hive distribute the rows into buckets?
Hive determines the bucket number for a row by using the formula: hash_function (bucketing_column) modulo (num_of_buckets). Here, hash_function depends on the column data type. For integer data type, the hash_function will be:
hash_function (int_type_column)= value of int_type_column
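The formula can be tried out with Hive's built-in hash() and pmod() functions, as in the sketch below. It uses the transaction_details table from the earlier scenario and assumes 4 buckets. Newer Hive versions may use a different hash function for the actual bucket files, so treat this as an illustration of the formula rather than a guarantee of file placement.

```sql
-- For an INT column the formula in the text reduces to: value modulo num_of_buckets.
-- With 4 buckets, cust_id 7 gives 7 modulo 4 = 3 and cust_id 8 gives 8 modulo 4 = 0.
-- pmod() keeps the result non-negative even when the hash value is negative.
SELECT cust_id, pmod(hash(cust_id), 4) AS bucket_number
FROM transaction_details;
```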
21. What will happen in case you have not issued the command: ‘SET hive.enforce.bucketing=true;’ before bucketing a table in Hive in Apache Hive 0.x or 1.x?
The command 'SET hive.enforce.bucketing=true;' makes Hive use the correct number of reducers when the 'CLUSTER BY' clause is used for bucketing a column. If it is not issued, the number of files generated in the table directory may not be equal to the number of buckets. As an alternative, one may also set the number of reducers equal to the number of buckets by using SET mapred.reduce.tasks = num_bucket.
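A minimal sketch of populating a bucketed table with this setting on Hive 0.x or 1.x. The target table transactions_bucketed is hypothetical, and the source is the transaction_details table from the earlier scenario.

```sql
-- Hypothetical bucketed target table.
CREATE TABLE transactions_bucketed (cust_id INT, amount FLOAT)
CLUSTERED BY (cust_id) INTO 4 BUCKETS;

-- Hive 0.x / 1.x only: make the insert use one reducer per bucket.
SET hive.enforce.bucketing = true;

-- With the setting above, this insert should produce 4 files, one per bucket.
INSERT OVERWRITE TABLE transactions_bucketed
SELECT cust_id, amount FROM transaction_details;
```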
22. What is indexing and why do we need it?
Indexing is one of the Hive query optimization methods. A Hive index is used to speed up access to a column or set of columns in a Hive database, because with an index the database system does not need to read all the rows in the table to find the selected data.
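The sketch below shows how an index could be created in Hive versions that support it (index support was removed in Hive 3.0). The index name idx_cust_id is hypothetical, and the table is the transaction_details table from the earlier scenario.

```sql
-- Hypothetical compact index on the cust_id column.
CREATE INDEX idx_cust_id
ON TABLE transaction_details (cust_id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Populate the index; with DEFERRED REBUILD it stays empty until this runs.
ALTER INDEX idx_cust_id ON transaction_details REBUILD;

-- List the indexes defined on the table.
SHOW INDEX ON transaction_details;
```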
23. Scenario:
Suppose, I have a CSV file – ‘sample.csv’ present in ‘/temp’ directory with the following entries:
id,first_name,last_name,email,gender,ip_address
1,Hugh,Jackman,hughjackman@cam.ac.uk,Male,136.90.241.52
2,David,Lawrence,dlawrence1@gmail.com,Male,101.177.15.130
3,Andy,Hall,andyhall2@yahoo.com,Female,114.123.153.64
4,Samuel,Jackson,samjackson231@sun.com,Male,89.60.227.31
5,Emily,Rose,rose.emily4@surveymonkey.com,Female,119.92.21.19
How will you consume this CSV file into the Hive warehouse using a built-in SerDe?
SerDe stands for serializer/deserializer. A SerDe allows us to convert the unstructured bytes into a record that we can process using Hive. SerDes are implemented using Java. Hive comes with several built-in SerDes and many other third-party SerDes are also available.
Hive provides a specific SerDe for working with CSV files. We can use this SerDe for sample.csv by issuing the following commands:
CREATE EXTERNAL TABLE sample
(id int, first_name string,
last_name string, email string,
gender string, ip_address string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE LOCATION '/temp';
Now, we can perform any query on the table 'sample':
SELECT first_name FROM sample WHERE gender = 'Male';
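If the file uses non-default separators or contains a header row, the SerDe can be configured explicitly. The sketch below is a hypothetical variant of the table above (named sample_csv here). All columns are declared as STRING because OpenCSVSerde reads every column as a string.

```sql
-- Hypothetical variant of the table above with explicit CSV settings.
CREATE EXTERNAL TABLE sample_csv
(id STRING, first_name STRING,
 last_name STRING, email STRING,
 gender STRING, ip_address STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\"",
  "escapeChar"    = "\\"
)
STORED AS TEXTFILE
LOCATION '/temp'
TBLPROPERTIES ("skip.header.line.count" = "1");  -- skip the header row
```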
24. Scenario:
Suppose, I have a lot of small CSV files present in /input directory in HDFS and I want to create a single Hive table corresponding to these files. The data in these files are in the format: {id, name, e-mail, country}. Now, as we know, Hadoop performance degrades when we use lots of small files.
So, how will you solve this problem where we want to create a single Hive table for lots of small files without degrading the performance of the system?
One can use the SequenceFile format which will group these small files together to form a single sequence file. The steps that will be followed in doing so are as follows:
- Create a temporary table:
CREATE TABLE temp_table (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
- Load the data into temp_table:
LOAD DATA INPATH '/input' INTO TABLE temp_table;
- Create a table that will store data in SequenceFile format:
CREATE TABLE sample_seqfile (id INT, name STRING, email STRING, country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS SEQUENCEFILE;
- Transfer the data from the temporary table into the sample_seqfile table:
INSERT OVERWRITE TABLE sample_seqfile SELECT * FROM temp_table;
Hence, a single SequenceFile is generated which contains the data present in all of the input files and therefore, the problem of having lots of small files is finally eliminated.
https://www.edureka.co/blog/interview-questions/hive-interview-questions/
-------------------------------
https://www.tutorialspoint.com/cgi-bin/printpage.cgi
There are two types: managed table and external table. In a managed table, both the data and the schema are under the control of Hive, but in an external table only the schema is under the control of Hive.
No. Hive does not provide insert and update at row level, so it is not suitable for an OLTP system.
Alter Table table_name RENAME TO new_name
Using REPLACE column option
ALTER TABLE table_name REPLACE COLUMNS ……
It is a relational database storing the metadata of hive tables, partitions, Hive databases etc
Depending on the nature of the data the user has, the inbuilt SerDe may not satisfy the format of the data. So users need to write their own Java code to satisfy their data format requirements.
Hive is a tool in the Hadoop ecosystem which provides an interface to organize and query data in a database-like fashion and write SQL-like queries. It is suitable for accessing and analyzing data in Hadoop using SQL syntax.
hdfs://namenode_server/user/hive/warehouse
- Local mode
- Distributed mode
- Pseudodistributed mode
Yes. The TIMESTAMP data type stores dates in java.sql.Timestamp format.
There are three collection data types in Hive.
- ARRAY
- MAP
- STRUCT
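A small sketch showing the three collection types in one table definition and how their elements are accessed. The table employees and its columns are hypothetical.

```sql
-- Hypothetical table using all three collection types.
CREATE TABLE employees (
  name       STRING,
  skills     ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address    STRUCT<street:STRING, city:STRING, zip:INT>
);

-- Arrays are indexed from 0, maps by key, and struct fields with dot notation.
SELECT name, skills[0], deductions['tax'], address.city
FROM employees;
```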
Yes, using the ! mark just before the command.
For example !pwd at hive prompt will list the current directory.
A Hive variable is a variable created in the Hive environment that can be referenced by Hive scripts. It is used to pass values to Hive queries when the query starts executing.
Using the source command.
Example −
Hive> source /path/to/file/file_with_query.hql
It is a file containing a list of commands that need to run when the Hive CLI starts, for example setting strict mode to true.
The default record delimiter is − \n
And the field delimiters are − \001,\002,\003
The schema is validated with the data when reading the data and not enforced when writing data.
SHOW DATABASES LIKE 'p.*'
With the use command you fix the database on which all the subsequent hive queries will run.
There is no way you can delete the DBPROPERTY.
It sets the MapReduce jobs to strict mode, in which queries on partitioned tables cannot run without a WHERE clause. This prevents very large jobs from running for a long time.
This can be done with following query
SHOW PARTITIONS table_name PARTITION(partitioned_column='partition_value')
org.apache.hadoop.mapred.TextInputFormat
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
When we issue the command DROP TABLE IF EXISTS table_name, Hive does not throw an error if the table being dropped does not exist in the first place.
The data stays in the old location. It has to be moved manually.
ALTER TABLE table_name
CHANGE COLUMN new_col INT
BEFORE x_col
No. It only reduces the number of files which becomes easier for namenode to manage.
By using the ENABLE OFFLINE clause with the ALTER TABLE statement.
By omitting the LOCAL clause in the LOAD DATA statement.
The new incoming files are just added to the target directory and the existing files are simply overwritten. Other files whose name does not match any of the incoming files will continue to exist.
If you add the OVERWRITE clause then all the existing data in the directory will be deleted before new data is written.
It creates partition on table employees with partition values coming from the columns in the select clause. It is called Dynamic partition insert.
A table generating function is a function which takes a single column as its argument and expands it into multiple columns or rows. Example: explode.
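As an illustration, explode() is typically combined with LATERAL VIEW. The sketch assumes a hypothetical table employees with a name column and a skills column of type ARRAY<STRING>.

```sql
-- Hypothetical table: employees(name STRING, skills ARRAY<STRING>).
-- explode() emits one row per array element; LATERAL VIEW joins each
-- generated row back to the source row it came from.
SELECT name, skill
FROM employees
LATERAL VIEW explode(skills) skill_table AS skill;
```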
If we set the property hive.exec.mode.local.auto to true then hive will avoid mapreduce to fetch query results.
The LIKE operator behaves the same way as the regular SQL operators used in select queries. Example −
street_name LIKE '%Chi'
But the RLIKE operator uses more advanced regular expressions, which are available in Java.
Example − street_name RLIKE '.*(Chi|Oho).*' which will select any value which has either Chi or Oho in it.
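Both operators can be compared in full queries as sketched below, assuming a hypothetical table addresses with a street_name column.

```sql
-- Hypothetical table addresses(street_name STRING).
-- LIKE: % matches any run of characters, so this finds names ending in 'Chi'.
SELECT street_name FROM addresses WHERE street_name LIKE '%Chi';

-- RLIKE: Java regular expression; this finds names containing 'Chi' or 'Oho'.
SELECT street_name FROM addresses WHERE street_name RLIKE '.*(Chi|Oho).*';
```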
No, as this kind of join cannot be implemented in MapReduce.
In a join query, the smallest table should be placed in the first position and the largest table in the last position.
It controls how the map output is reduced among the reducers. It is useful in the case of streaming data.
Select cast
Hive will return NULL
No. The name of a view must be unique when compared to all other tables and views present in the same database.
No. A view cannot be the target of an INSERT or LOAD statement.
Indexes occupy space, and there is a processing cost in arranging the values of the column on which the index is created.
SHOW INDEX ON table_name
This will list all the indexes created on any of the columns in the table table_name.
The values in a column are hashed into a number of buckets which is defined by the user. It is a way to avoid too many partitions or nested partitions while ensuring optimized query output.
It is a query hint that tells Hive which table in a join to stream through the reducers while the other tables are buffered in memory. It is a query optimization technique.
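A brief sketch of the hint in a join, assuming two hypothetical tables small_dim and big_fact that share a key column:

```sql
-- Hypothetical tables small_dim and big_fact joined on key.
-- The hint names big_fact (alias f) as the table to stream;
-- the rows of the other table are buffered.
SELECT /*+ STREAMTABLE(f) */ d.val, f.val
FROM small_dim d JOIN big_fact f ON (d.key = f.key);
```

Without the hint, Hive streams the last table in the join, which is why the largest table is usually placed last.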
Yes. A partition can be archived. The advantage is that it decreases the number of files the namenode has to track, and the archived files can still be queried using Hive. The disadvantage is that queries become less efficient and archiving does not offer any space savings.
It is a UDF which is created using a Java program to serve some specific need not covered by the existing functions in Hive. It can detect the type of the input argument programmatically and provide an appropriate response.
The local inpath should contain a file and not a directory. The $env:HOME is a valid variable available in the hive environment.
The TBLPROPERTIES clause is used to add the creator name while creating a table.
The TBLPROPERTIES is added like −
TBLPROPERTIES('creator'= 'Joan')