Spark submit with SBT
This recipe assumes that sbt is installed and that you have already gone over the MySQL with Spark recipe.
I am a big fan of Spark Shell. The biggest proof is
Spark Cookbook
, which has all its recipes in the form of collections of single commands on
Spark Shell. That makes it easy to understand each command, run it, and see
exactly what effect it has.
Along similar lines, I am a big fan of Maven. Maven brought two
disruptive changes to the build world, and both are here to stay:
- Declarative dependency management
- Convention over configuration
That being said, to deploy Spark applications on a cluster and do cluster-level optimizations,
spark-shell
is not enough and we have to use
spark-submit
.
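For orientation, the general shape of a spark-submit invocation looks like the line below; the angle-bracketed pieces are placeholders. Note that --class can be omitted when the jar's manifest already names a main class, which is why the commands later in this recipe skip it.
$ spark-submit --class <main-class> --master <master-url> [options] <app-jar> [app-args]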
spark-submit expects the application logic to be bundled in a jar file.
Creating this jar file using Maven is a lot of work, especially for a
super simple project, and this is where the simplicity of sbt comes into
the picture.
In this recipe we’ll build and deploy a simple application using sbt.
Create a directory people with src/main/scala inside it:
$ mkdir -p people/src/main/scala
Create the file Person.scala in src/main/scala with the following content:
import org.apache.spark._
import org.apache.spark.sql._

// When launched via spark-submit, the master URL and app name are supplied
// by the launcher, so a no-arg SparkContext works here.
object Person extends App {
  val sc = new SparkContext
  val sqlContext = new SQLContext(sc)
  // JDBC connection details for the MySQL database from the earlier recipe
  val url = "jdbc:mysql://localhost:3306/hadoopdb"
  val prop = new java.util.Properties
  prop.setProperty("user", "hduser")
  prop.setProperty("password", "vipassana")
  // Load the person table as a DataFrame and print every row
  val people = sqlContext.read.jdbc(url, "person", prop)
  people.collect.foreach(println)
}
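If you prefer SQL over the DataFrame API, the same data can be queried once the DataFrame is registered as a temporary table. A minimal sketch, continuing inside the Person object above (the table alias is arbitrary):
val people = sqlContext.read.jdbc(url, "person", prop)
people.registerTempTable("person")
sqlContext.sql("SELECT * FROM person").show()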
Now cd to the people directory and create the build file build.sbt using the following command:
$ echo "libraryDependencies += \"org.apache.spark\" %% \"spark-sql\" % \"1.4.1\"" >> build.sbt
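The one-liner above is all sbt strictly needs, but a slightly fuller build.sbt makes the project self-describing. A sketch, assuming Scala 2.10 to match the target/scala-2.10 output path used below; the "provided" scope keeps Spark itself out of the dependency graph, since spark-submit supplies it at runtime:

name := "people"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.4.1" % "provided"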
Now package the jar:
$ sbt package
Now run the application using spark-submit:
$ spark-submit --driver-class-path /home/hduser/thirdparty/mysql.jar target/scala-2.10/people*.jar
I hope you appreciate the beauty and simplicity of sbt now.
Now let’s do the same using cluster deploy mode; in this case we have to go to http://spark-master:8080 to see the results.
$ spark-submit --master spark://localhost:7077 --deploy-mode cluster --driver-class-path /home/hduser/thirdparty/mysql.jar --jars /home/hduser/thirdparty/mysql.jar target/scala-2.10/people*.jar
One thing to notice: the documentation suggests that --jars should
put a library on both the driver and the executors. My personal experience
suggests that --driver-class-path puts the library on the driver, while
--jars puts it only on the executors.
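As an alternative to shipping the connector jar by hand, spark-submit can also resolve it from a Maven repository with --packages. A sketch, assuming the connector coordinates below are available in your repository:
$ spark-submit --master spark://localhost:7077 --deploy-mode cluster --packages mysql:mysql-connector-java:5.1.36 target/scala-2.10/people*.jar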