Friday, October 30, 2020
hive Parquet to textfile
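A minimal sketch of converting a Parquet-backed Hive table to a textfile table; the table names mytable_parquet and mytable_text are placeholders, not from the original post:
hive> -- copy the data into a new table stored as plain text
hive> CREATE TABLE mytable_text STORED AS TEXTFILE AS SELECT * FROM mytable_parquet;
hive> -- or switch the file format of an existing table (only affects data written afterwards)
hive> ALTER TABLE mytable_parquet SET FILEFORMAT TEXTFILE;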
Sunday, October 18, 2020
hive example query
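A minimal example query, assuming a table with name, age and salary columns like the mytable used in the Sqoop examples further down:
hive> SELECT name, age FROM mytable WHERE salary > 25000 ORDER BY age;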
Wednesday, October 7, 2020
sqoop Compression snappy and avro Format
Locate core-site.xml to find which compression codecs are available.
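A quick way to check, assuming a typical Hadoop layout (the /etc/hadoop/conf path may differ on your installation):
$ locate core-site.xml
$ grep -A 1 "io.compression.codecs" /etc/hadoop/conf/core-site.xml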
Example of loading data from MySQL to HDFS (compression: Snappy and Avro format)
$ sqoop import \
  --connect jdbc:mysql://localhost:3306/mybbdd \
  --username=root -P \
  --table=mytable \
  --driver=com.mysql.jdbc.Driver \
  --target-dir=/ej_snappy_avro \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --as-avrodatafile
Example of loading data from MySQL to HDFS (compression: gzip and Avro format)
$ sqoop import \
  --connect jdbc:mysql://localhost/mybbdd \
  --username=root -P \
  --table=mytable \
  --driver=com.mysql.jdbc.Driver \
  --target-dir=/ej_gzip_avro \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.GzipCodec \
  --as-avrodatafile
Example of loading data from MySQL to HDFS (compression: BZIP2 and Sequence format)
$ sqoop import \
  --connect jdbc:mysql://localhost/mybbdd \
  --username=root -P \
  --table=mytable \
  --driver=com.mysql.jdbc.Driver \
  --target-dir=/ej_bzip2_sequence \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.BZip2Codec \
  --as-sequencefile
Example of loading data from MySQL to HDFS (restricting data with columns)
$ sqoop import \
  --connect jdbc:mysql://localhost/mybbdd \
  --username=root -P \
  --table=mytable \
  --driver=com.mysql.jdbc.Driver \
  --target-dir=/ej_2_columns \
  --columns name,age
Example of loading data from MySQL to HDFS (restricting data with WHERE)
$ sqoop import \
  --connect jdbc:mysql://localhost/mybbdd \
  --username=root -P \
  --table=mytable \
  --driver=com.mysql.jdbc.Driver \
  --target-dir=/ej_mayor_age_40 \
  --where "age > 40"
Example of loading data from MySQL to HDFS (incremental load)
In order to perform an incremental load we need to add new data to the table "mytable"; to do this we execute the following statement in MySQL:
mysql> INSERT INTO mytable (name, age, salary) VALUES ("Diego", 24, 21000), ("Rosa", 26, 24000), ("Javier", 28, 25000), ("Lorena", 35, 28000), ("Miriam", 42, 30000), ("Patricia", 43, 25000), ("Natalia", 45, 39000);
Note: the insertion must be done in the database "mybbdd".
Once the insertion is done we can perform the incremental load of the new rows, which start at id 8; we pass --last-value 7 so that Sqoop imports every row whose id is greater than 7.
$ sqoop import \
  --connect jdbc:mysql://localhost/mybbdd \
  --username=root -P \
  --table=mytable \
  --driver=com.mysql.jdbc.Driver \
  --target-dir=/my_table_hdfs \
  --incremental append \
  --check-column id \
  --last-value 7
Example of loading data from MySQL to HDFS, queryable from Hive
In order to insert the table into a Hive database, we must first create the database where it will be inserted, to avoid problems:
hive> CREATE DATABASE mybbddhive;
Once the database is created, you are ready to run the query:
$ sqoop import \
  --connect jdbc:mysql://localhost/mybbdd \
  --username=root -P \
  --table=mytable \
  --driver=com.mysql.jdbc.Driver \
  --target-dir=/ej_hive \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --hive-import \
  --hive-database mybbddhive \
  --create-hive-table \
  --hive-table ej_hive_table
Creating a MySQL database table
A database with a test table is created using the following commands.
Accessing MySQL
$ mysql
You may get "ERROR 1045 (28000): Access denied for user 'root'@'localhost'", which is resolved with: mysql -u root -p
Note: keep in mind that the MySQL password and the system password can be different.
List the databases
mysql> show databases;
Create Database
mysql> create database myddbb;
Use database
mysql> use myddbb;
Create base table
mysql>CREATE TABLE mytable ( id MEDIUMINT NOT NULL AUTO_INCREMENT, name CHAR (30) NOT NULL, age INTEGER (30), salary INTEGER (30), PRIMARY KEY (id));
CREATE TABLE 2
mysql>CREATE TABLE mytable2 ( id MEDIUMINT NOT NULL AUTO_INCREMENT, name CHAR (30) NOT NULL, age INTEGER (30), salary INTEGER (30), PRIMARY KEY (id));
Insert Data
mysql> INSERT INTO mytable (name, age, salary) values
("Peter", 24, 21000), ("Maria", 26, 24000), ("John", 28, 25000), ("Louis", 35, 28000), ("Monica", 42, 30000), ("Rose", 43, 25000), ("Susana", 45, 39000);
Note: If you have permission problems, enter MySQL and grant all permissions:
mysql> grant all privileges on *.* to 'root'@'localhost' IDENTIFIED BY 'MySQL_Key' WITH GRANT OPTION;
Exit MySQL
mysql>exit;
Load MySQL data to HDFS
Example of loading data from the table "mytable" of the database "myddbb" into the HDFS directory named "my_hdfs_table".
$ sqoop import \
  --connect jdbc:mysql://localhost/myddbb \
  --username=root -P \
  --table=mytable \
  --driver=com.mysql.jdbc.Driver \
  --target-dir=/my_hdfs_table \
  --fields-terminated-by ',' \
  --lines-terminated-by '\n'
--target-dir: HDFS directory where the data is stored.
--table: identifies the table to be copied.
--clear-staging-table: indicates that any data present in the staging table can be deleted.
--verbose: prints additional information to facilitate debugging.
--fields-terminated-by: defines the field delimiter.
Loading data from HDFS to MySQL
Example of loading data from the HDFS folder named “my_hdfs_table” to the “mytable2” table in the “myddbb” database.
$ sqoop export \
  --connect jdbc:mysql://localhost/myddbb \
  --username=root -P \
  --table=mytable2 \
  --export-dir=/my_hdfs_table \
  -m 1
Note: If you get the warning "set $ACCUMULO_HOME to the root of your Accumulo installation", it can be avoided with:
$ sudo mkdir /var/lib/accumulo
$ export ACCUMULO_HOME='/var/lib/accumulo'
To practice with Sqoop, consult "Examples of Sqoop", a section that compiles many examples that may be useful to you.
Source: official Sqoop documentation
optimization technique
Friday, October 2, 2020
Spark 2.0 JDBC
Spark-shell does not run into so many problems; this is an sbt dependency issue in IDEA (IntelliJ). The following imports are needed:
- import java.util.Properties
- import org.apache.spark.sql
- import org.apache.spark.sql.types._
- import org.apache.spark.sql.Row
- import org.apache.spark.sql.SparkSession
- import org.apache.spark.SparkConf
- import org.apache.spark.SparkContext
The build.sbt file is as follows:
name := "Simple Project" version := "1.0" scalaVersion := "2.11.8" libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0" libraryDependencies += "org.apache.hbase" % "hbase-client" % "1.1.2" libraryDependencies += "org.apache.hbase" % "hbase-common" % "1.1.2" libraryDependencies += "org.apache.hbase" % "hbase-server" % "1.1.2" libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" libraryDependencies += "mysql" % "mysql-connector-java" % "8.0.15"
The org.apache.spark dependencies above must be version 2.0.0 or higher; otherwise SparkSession cannot be imported.
Another puzzling thing is that a mysql-connector-java-8.0.15.jar added manually, outside the sbt dependencies, does not work: the com.mysql.jdbc driver cannot be found.
So the line libraryDependencies += "mysql" % "mysql-connector-java" % "8.0.15" is what solves the driver problem.
Wait for sbt to finish resolving the dependencies, then run the code; it completes successfully.
--------------------------------
import java.util.Properties
import org.apache.spark.sql
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
//import com.mysql.jdbc
//SparkSession
object ConnectJDBC {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("ConnectJDBC").setMaster("local[*]")
val sc = new SparkContext(conf)
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
// read the information
val jdbcDF = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/spark") // "spark" is the database name
.option("driver", "com.mysql.jdbc.Driver").option("dbtable", "student") // "student" is the table name
.option("user", "root").option("password", "123456").load()
jdbcDF.show()
// Below we create two records representing two students
val studentRDD = spark.sparkContext.parallelize(Array("1 Licheng M 26", "2 Jianghua M 27")).map(_.split(" "))
// The following sets up the schema information
val schema = StructType(List(StructField("id", IntegerType, true), StructField("name", StringType, true), StructField("gender", StringType, true), StructField("age", IntegerType, true)))
// Create a Row object for each record; each Row object is one row in rowRDD
val rowRDD = studentRDD.map(p => Row(p(0).toInt, p(1).trim, p(2).trim, p(3).toInt))
// Associate the Row objects with the schema, i.e. link the data to the schema
val studentDF = spark.createDataFrame(rowRDD, schema)
// Create a prop variable to hold the JDBC connection parameters
val prop = new Properties()
Prop.put("user", "root") // indicates that the username is root
Prop.put("password", "123456") // indicates that the password is hadoop
Prop.put("driver", "com.mysql.jdbc.Driver") // indicates that the driver is com.mysql.jdbc.Driver
// /usr/local/spark/jars/mysql-connector-java-5.1.40/mysql-connector-java-5.1.40-bin.jar
// Connect to the database in append mode, appending the records to the student table of the spark database
studentDF.write.mode("append").jdbc("jdbc:mysql://localhost:3306/spark", "spark.student", prop)
}
}
---------------------------
Result of running the program:
+---+---------+------+---+
| id| name|gender|age|
+---+---------+------+---+
| 1| Licheng| M| 26|
| 2| Jianghua| M| 27|
+---+---------+------+---+