Interview Questions for – Sqoop(http://www.bigdatatrunk.com/top-50-sqoop-interview-questions/)
Q1 What is the process to perform an incremental data load in Sqoop?
Answer: The process to perform incremental data load in Sqoop is to synchronize the modified or updated data (often referred as delta data) from RDBMS to Hadoop. The delta data can be facilitated through the incremental load command in Sqoop.
Incremental load can be performed by using Sqoop import command or by loading the data into hive without overwriting it. The different attributes that need to be specified during incremental load in Sqoop are-
1)Mode (incremental) –The mode defines how Sqoop will determine what the new rows are. The mode can have value as Append or Last Modified.
2)Col (Check-column) –This attribute specifies the column that should be examined to find out the rows to be imported.
3)Value (last-value) –This denotes the maximum value of the check column from the previous import operation.
Q2 How Sqoop can be used in a Java program?
Answer: The Sqoop jar in classpath should be included in the java code. After this the method Sqoop.runTool () method must be invoked. The necessary parameters should be created to Sqoop programmatically just like for command line.
Q3 What is the significance of using –compress-codec parameter?
Answer: To get the out file of a sqoop import in formats other than .gz like .bz2 we use the –compress -code parameter.
Q4 How are large objects handled in Sqoop?
Answer: Sqoop provides the capability to store large sized data into a single field based on the type of data. Sqoop supports the ability to store-
1)CLOB ‘s – Character Large Objects
2)BLOB’s –Binary Large Objects
Large objects in Sqoop are handled by importing the large objects into a file referred as “LobFile” i.e. Large Object File. The LobFile has the ability to store records of huge size, thus each record in the LobFile is a large object.
Q5 What is a disadvantage of using –direct parameter for faster data load by sqoop?
Answer: The native utilities used by databases to support faster load do not work for binary data formats like SequenceFile
Q6 Is it possible to do an incremental import using Sqoop?
Answer: Yes, Sqoop supports two types of incremental imports-
1)Append
2)Last Modified
To insert only rows Append should be used in import command and for inserting the rows and also updating Last-Modified should be used in the import command.
Q7 How can you check all the tables present in a single database using Sqoop?
Answer: The command to check the list of all tables present in a single database using Sqoop is as follows-
Sqoop list-tables –connect jdbc: mysql: //localhost/user;
Q8 How can you control the number of mappers used by the sqoop command?
Answer: The Parameter –num-mappers is used to control the number of mappers executed by a sqoop command. We should start with choosing a small number of map tasks and then gradually scale up as choosing high number of mappers initially may slow down the performance on the database side.
Q9 What is the standard location or path for Hadoop Sqoop scripts?
Answer: /usr/bin/Hadoop Sqoop
Q10 How can we import a subset of rows from a table without using the where clause?
Answer: We can run a filtering query on the database and save the result to a temporary table in database.
Then use the sqoop import command without using the –where clause
Q11 When the source data keeps getting updated frequently, what is the approach to keep it in sync with the data in HDFS imported by sqoop?
Answer: qoop can have 2 approaches.
a − To use the –incremental parameter with append option where value of some columns are checked and only in case of modified values the row is imported as a new row.
b − To use the –incremental parameter with lastmodified option where a date column in the source is checked for records which have been updated after the last import.
Q12 What is a sqoop metastore?
Answer: It is a tool using which Sqoop hosts a shared metadata repository. Multiple users and/or remote users can define and execute saved jobs (created with sqoop job) defined in this metastore.
Clients must be configured to connect to the metastore in sqoop-site.xml or with the –meta-connect argument.
Q13 Can free form SQL queries be used with Sqoop import command? If yes, then how can they be used?
Answer: Sqoop allows us to use free form SQL queries with the import command. The import command should be used with the –e and – query options to execute free form SQL queries. When using the –e and –query options with the import command the –target dir value must be specified.
Q14 Tell few import control commands:
Answer: –Append
–Columns
–Where
These command are most frequently used to import RDBMS Data.
Q15 Can free form SQL queries be used with Sqoop import command? If yes, then how can they be used?
Answer: Sqoop allows us to use free form SQL queries with the import command. The import command should be used with the –e and – query options to execute free form SQL queries. When using the –e and –query options with the import command the –target dir value must be specified.
Q16 How can you see the list of stored jobs in sqoop metastore?
Answer: sqoop job –list
Q17 What type of databases Sqoop can support?
Answer: MySQL, Oracle, PostgreSQL, IBM, Netezza and Teradata. Every database connects through jdbc driver.
Q18 What is the purpose of sqoop-merge?
Answer: The merge tool combines two datasets where entries in one dataset should overwrite entries of an older dataset preserving only the newest version of the records between both the data sets.
Q19 HOw sqoop can handle large objects?
Answer: Blog and Clob columns are common large objects. If the object is less than 16MB, it stored inline with the rest of the data. If large objects, temporary stored in_lob subdirectory. Those lobs processes in a streaming fashion. Those data materialized in memory for processing. IT you set LOB to 0, those lobs objects placed in external storage.
Q20 What is the importance of eval tool?
Answer: It allows user to run sample SQL queries against Database and preview the results on the console. It can help to know what data can import? The desired data imported or not?
Q21 What is the default extension of the files produced from a sqoop import using the –compress parameter?
Answer: .gz
Q22 Can we import the data with “Where” condition?
Answer: Yes, Sqoop has a special option to export/import a particular data.
Q23 What are the limitations of importing RDBMS tables into Hcatalog directly?
Answer: There is an option to import RDBMS tables into Hcatalog directly by making use of –hcatalog –database option with the –hcatalog –table but the limitation to it is that there are several arguments like –as-avro file , -direct, -as-sequencefile, -target-dir , -export-dir are not supported.
Q24 what are the majorly used commands in sqoop?
Answer: In Sqoop Majorly Import and export command are used. But below commands are also useful sometimes. codegen, eval, import-all-tables, job, list-database, list-tables, merge, metastore.
Q25 What is the usefulness of the options file in sqoop.
Answer: The options file is used in sqoop to specify the command line values in a file and use it in the sqoop commands.
For example the –connect parameter’s value and –user name value scan be stored in a file and used again and again with different sqoop commands.
Q26 what are the common delimiters and escape character in sqoop?
Answer: The default delimiters are a comma(,) for fields, a newline(\n) for records
Escape characters are \b,\n,\r,\t,\”, \\’,\o etc
Q27 What are the two file formats supported by sqoop for import?
Answer: Delimited text and Sequence Files.
Q28 while loading table from MySQL into HDFS, if we need to copy tables with maximum possible speed, what can you do?
Answer: We need to use -direct argument in import command to use direct import fast path and this -direct can be used only with MySQL and PostGreSQL as of now.
Q29 How can you sync a exported table with HDFS data in which some rows are deleted?
Answer: Truncate the target table and load it again.
Q30 Differentiate between Sqoop and distCP.
Answer: DistCP utility can be used to transfer data between clusters whereas Sqoop can be used to transfer data only between Hadoop and RDBMS.
Q31 How can you import only a subset of rows form a table?
Answer: By using the WHERE clause in the sqoop import statement we can import only a subset of rows.
Q32 How do you clear the data in a staging table before loading it by Sqoop?
Answer: By specifying the –clear-staging-table option we can clear the staging table before it is loaded. This can be done again and again till we get proper data in staging.
Q33 What is Sqoop?
Answer: Sqoop is an open source project that enables data transfer from non-hadoop source to hadoop source. It can be remembered as SQL to Hadoop -> SQOOP. It allows user to specify the source and target location inside the Hadoop.
Q34 Is it possible to do an incremental import using Sqoop?
Answer: Yes, Sqoop supports two types of incremental imports-
1)Append
2)Last Modified
To insert only rows Append should be used in import command and for inserting the rows and also updating Last-Modified should be used in the import command.
Q35 How can you export only a subset of columns to a relational table using sqoop?
Answer: By using the –column parameter in which we mention the required column names as a comma separated list of values.
Q36 Which database the sqoop metastore runs on?
Answer: Running sqoop-metastore launches a shared HSQLDB database instance on the current machine.
Q37 How will you update the rows that are already exported?
Answer: The parameter –update-key can be used to update existing rows. In it a comma-separated list of columns is used which uniquely identifies a row. All of these columns is used in the WHERE clause of the generated UPDATE query. All other table columns will be used in the SET part of the query.
Q38 You have a data in HDFS system, if you want to put some more data to into the same table, will it append the data or overwrite?
Answer: No it can’t overwrite, one way to do is copy the new file in HDFS.
Q39 Where can the metastore database be hosted?
Answer: The metastore database can be hosted anywhere within or outside of the Hadoop cluster.
Q40 Which is used to import data in Sqoop ?
Answer: In SQOOP import command is used to import RDBMS data into HDFS. Using import command we can import a particular table into HDFS. –
Q41 What is the role of JDBC driver in a Sqoop set up?
Answer: To connect to different relational databases sqoop needs a connector. Almost every DB vendor makes this connecter available as a JDBC driver which is specific to that DB. So Sqoop needs the JDBC driver of each of the database it needs to interact with.
Q42 How to import only the updated rows form a table into HDFS using sqoop assuming the source has last update timestamp details for each row?
Answer: By using the lastmodified mode. Rows where the check column holds a timestamp more recent than the timestamp specified with –last-value are imported.
Q43 What is InputSplit in Hadoop?
Answer: When a hadoop job is run, it splits input files into chunks and assign each split to a mapper to process. This is called Input Split
Q44 Hadoop sqoop word came from ?
Answer: Sql + Hadoop = sqoop
Q45 What is the work of Export In Hadoop sqoop ?
Answer: Export the data from HDFS to RDBMS
Q46 Use of Codegen command in Hadoop sqoop ?
Answer: Generate code to interact with database records
Q47 Use of Help command in Hadoop sqoop ?
Answer: List available commands
Q48 How can you schedule a sqoop job using Oozie?
Answer: Oozie has in-built sqoop actions inside which we can mention the sqoop commands to be executed.
Q49 What are the two file formats supported by sqoop for import?
Answer: Delimited text and Sequence Files.
Q50 What is a sqoop metastore?
Answer: It is a tool using which Sqoop hosts a shared metadata repository. Multiple users and/or remote users can define and execute saved jobs (created with sqoop job) defined in this metastore.
Clients must be configured to connect to the metastore in sqoop-site.xml or with the –meta-connect argument.
(
------------------------------------
https://www.dezyre.com/article/sqoop-interview-questions-and-answers-for-2017/274
Top Answers to Sqoop Interview Questions
1. Compare Sqoop and Flume
Criteria
|
Sqoop
|
Flume
|
Application
|
Importing data from RDBMS
|
Moving bulk streaming data into HDFS
|
Architecture
|
Connector – connecting to respective data
|
Agent – fetching of the right data
|
Loading of data
|
Event driven
|
Not event driven
|
2. Name a few import control commands. How can Sqoop handle large objects?
Import control commands are used to import RDBMS data
Append: Append data to an existing dataset in HDFS. –append
Columns: columns to import from the table. –columns
<col,col……> • Where: where clause to use during import. —
<col,col……> • Where: where clause to use during import. —
where The common large objects are Blog and Clob.Suppose the object is less than 16 MB, it is stored inline with the rest of the data. If there are big objects, they are temporarily stored in a subdirectory with the name _lob. Those data are then materialized in memory for processing. If we set lob limit as ZERO (0) then it is stored in external memory.
3. How can we import data from particular row or column? What is the destination types allowed in Sqoop import command?
Sqoop allows to Export and Import the data from the data table based on the where clause. The syntax is
--columns
<col1,col2……> --where
--query
Example:
sqoop import –connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --where “start_date> ’2016-07-20’ ”
sqoopeval --connect jdbc:mysql://db.test.com/corp --query “SELECT * FROM intellipaat_emp LIMIT 20”
sqoop import –connect jdbc:mysql://localhost/database --username root --password aaaaa –columns “name,emp_id,jobtitle”
Sqoop supports data imported into following services:
· HDFS
· Hive
· Hbase
· Hcatalog
· Accumulo
Learn about the complete Hadoop ecosystem in this blog post.
Get Sqoop Certification in just 8 Hours
GET CERTIFIED
4. Role of JDBC driver in sqoop setup? Is the JDBC driver enough to connect the sqoop to the database?
Sqoop needs a connector to connect the different relational databases. Almost all Database vendors make a JDBC connector available specific to that Database, Sqoop needs a JDBC driver of the database for interaction.
No, Sqoop needs JDBC and a connector to connect a database.
No, Sqoop needs JDBC and a connector to connect a database.
Interested in learning Sqoop? Well, we have the comprehensive Training Course to give you a head start in your career.
5. Using Sqoop command how can we control the number of mappers?.
We can control the number of mappers by executing the parameter –num-mapers in sqoop command. The –num-mappers arguments control the number of map tasks, which is the degree of parallelism used. Start with a small number of map tasks, then choose a high number of mappers starting the performance may down on the database side.
Syntax: -m, –num-mappers
6.How will you update the rows that are already exported? Write sqoop command to show all the databases in MySQL server.
By using the parameter – update-key we can update existing rows. Comma-separated list of columns is used which uniquely identifies a row. All of these columns are used in the WHERE clause generated UPDATE query. All other table columns will be used in the SET part of the query.
The command below is used to show all the databases in MySQL server.
The command below is used to show all the databases in MySQL server.
$ sqoop list –databases –connect jdbc:mysql://database.test.com/
7. Define Sqoop metastore? What is the purpose of Sqoop-merge?
Sqoop meta store is a tool for using hosts in a shared metadata repository. Multiple users and remote users can define and execute saved jobs defined in metastore. End users configured to connect the metastore in sqoop-site.xml or with the
–meta-connect argument.
The purpose of sqoop-merge is:
This tool combines 2 datasets where entries in one dataset overwrite entries of an older dataset preserving only the new version of the records between both the data sets.
This tool combines 2 datasets where entries in one dataset overwrite entries of an older dataset preserving only the new version of the records between both the data sets.
8. Explain the saved job process in Sqoop.
Sqoop allows us to define saved jobs which make this process simple. A saved job records the configuration information required to execute a Sqoop command at a later time. sqoop-job tool describes how to create and work with saved jobs. Job descriptions are saved to a private repository stored in $HOME/.sqoop/.
We can configure Sqoop to instead use a shared metastore, which makes saved jobs offered to multiple users across a shared cluster. Starting the metastore is covered by the section on the sqoop-metastore tool.
9. How Sqoop word came ? Sqoop is which type of tool and the main use of sqoop?
Sqoop word came from SQL+HADOOP=SQOOP. And Sqoop is a data transfer tool.
The main use of Sqoop is to import and export the large amount of data from RDBMS to HDFS and vice versa.
The main use of Sqoop is to import and export the large amount of data from RDBMS to HDFS and vice versa.
10. How to enter into Mysql prompt, and explain the command parameter indicates?
The command for entering into Mysql prompt is “mysql –u root –p”
-u indicatesthe user
Root indicates username
-p indicates password.
-u indicatesthe user
Root indicates username
-p indicates password.
11. I am getting connection failure exception during connecting to Mysql through Sqoop, what is the root cause and fix for this error scenario?
This will happen when there is lack of permissions to access our Mysql database over the network. We can try the below command to confirm the connect to Mysql database from aSqoop client machine.
$ mysql –host=MySqlnode> –database=test –user= –password=
We can grant the permissions with below commands.
$ mysql –host=MySqlnode> –database=test –user= –password=
We can grant the permissions with below commands.
mysql> GRANT ALL PRIVILEGES ON *.* TO ‘%’@’localhost’;
mysql> GRANT ALL PRIVILEGES ON *.* TO ‘ ’@’localhost’;
Give your career a big boost by going through our Apache Sqoop Training Course now!
12. I am getting java.lang.IllegalArgumentException: during importing tables from oracle database.what might be the root cause and fix for this error scenario?
Sqoop commands are case- sensitive of table names and user names.
By specifying the above two values in UPPER case, it will resolve the issue.
In case, the source table is created under different user namespace,then table name should be like USERNAME.TABLENAME as shown below
sqoop import
–connect jdbc:oracle:thin:@intellipaat.testing.com/INTELLIPAAT
–username SQOOP
–password sqoop
–table COMPANY.EMPLOYEES
By specifying the above two values in UPPER case, it will resolve the issue.
In case, the source table is created under different user namespace,then table name should be like USERNAME.TABLENAME as shown below
sqoop import
–connect jdbc:oracle:thin:@intellipaat.testing.com/INTELLIPAAT
–username SQOOP
–password sqoop
–table COMPANY.EMPLOYEES
13. How can you list all the columns of a table using Apache sqoop?
There is no straight way to list all the columns of a table in Apache Sqoop like sqoop-list-columns, so first we should retrieve the columns of the particular table and transform to a file containing the column names of particular table.Syntax is:
Sqoop import –m1 –connect ‘jdbc:sqlserver://servername;database=databasename;
Username-DeZyre;password=mypassword’ –query “SELECT column_name,DATA_TYPE FROM INFORMATION_SCHEMA columns WHEREtable_name=’mytableofinterest’ AND \$CONDITIONS” –target-dir ‘mytableofinterest_column_name’.
14. How to create a table in Mysql and how to insert the values into the table ?
To create a table in mysql using the below command
mysql> create table tablename( col1 datatype, col2 datatype,…………);
Example –
mysql> create table INTELLIPAAT(emp_idint,emp_namevarchar(30),emp_salint);
Insert the values into the table
mysql> insert into table name(value1,value2,value3,………);
Example-
mysql> insert into INTELLIPAAT(1234,’aaa’,20000);
mysql> insert into INTELLIPAAT(1235,’bbb’,10000);
mysql> insert into INTELLIPAAT(1236,’ccc’,15000);
15. What are the basic commands in HadoopSqoop and its uses?
The basic commands of HadoopSqoop are
· Codegen, Create-hive-table, Eval, Export, Help, Import, Import-all-tables, List-databases, List-tables,Versions.
· Useof HadoopSqoop basic commands
· Codegen- It helps to generate code to interact with database records.
· Create-hive-table- It helps to Import a table definition into a hive
· Eval- It helps to evaluateSQL statement and display the results
· Export-It helps to export an HDFS directory into a database table
· Help- It helps to list the available commands
· Import- It helps to import a table from a database to HDFS
· Import-all-tables- It helps to import tables from a database to HDFS
· List-databases- It helps to list available databases on a server
· List-tables-It helps to list tables in a database
· Version-It helps to display the version information
16. Is sqoop same as to distcp in hadoop?
No. Because the only distcp import command is same as Sqoop import command and both the commands submit parallel map-only jobs but both command functions are different. Distcp is used to copy any type of files from Local filesystem to HDFS and Sqoop is used for transferring the data records between RDBMS and Hadoop eco- system service.
17. For each sqoop copying into HDFS how many MapReduce jobs and tasks will be submitted?
There are 4 jobs that will be submitted to each Sqoop copying into HDFS and no reduce tasks are scheduled.
18. How can Sqoop be used in Java programs?
In the Java code Sqoop jar is included in the classpath. The required parameters are created to Sqoop programmatically like for CLI (command line interface). Sqoop.runTool() method also invoked in Java code.
19. I am having around 500 tables in a database. I want to import all the tables from the database except the tables named Table498, Table 323, and Table199. How can we do this without having to import the tables one by one?
This can be proficient using the import-all-tables, import command in Sqoop and by specifying the exclude-tables option with it as follows-
sqoop import-all-tables
–connect –username –password –exclude-tables Table498, Table 323, Table 199
sqoop import-all-tables
–connect –username –password –exclude-tables Table498, Table 323, Table 199
20. Explain the significance of using –split-by clause in Apache Sqoop?
split-by is a clause, it is used to specify the columns of the table which are helping to generate splits for data imports during importing the data into the Hadoop cluster. This clause specifies the columns and helps to improve the performance via greater parallelism. And also it helps to specify the column that has an even distribution of data to create splits,that data is imported.
Take charge of your career by going through this professionally designed Apache Hadoop Developer Course.
No comments:
Post a Comment