Wednesday, November 23, 2016

Spark submit with SBT

This recipe assumes sbt is installed and that you have already gone through the MySQL with Spark recipe.
I am a big fan of the Spark shell. The biggest proof is the Spark Cookbook, which has all its recipes in the form of collections of single commands on the Spark shell. That format makes it easy to understand and run each command and see exactly what effect it has.
Along similar lines, I am a big fan of Maven. Maven brought two disruptive changes to the build world which are here to stay:
  • Declarative dependency management
  • Convention over configuration
That being said, to deploy Spark applications on a cluster and do cluster-level optimizations, spark-shell is not enough and we have to use spark-submit. spark-submit expects the application logic to be bundled in a jar file. Creating this jar file with Maven is a lot of work, especially for a super simple project, and this is where the simplicity of sbt comes into the picture.
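In its general form, the invocation looks like this (the jar path here is just a placeholder):
$ spark-submit [options] path/to/your-application.jar [application arguments]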
In this recipe we’ll build and deploy a simple application using sbt.
Create a directory people with src/main/scala inside it.
$ mkdir -p people/src/main/scala
Create the file Person.scala in src/main/scala with the following content.
import org.apache.spark._
import org.apache.spark.sql._

object Person extends App {
  // The master URL is supplied by spark-submit
  val sc = new SparkContext
  val sqlContext = new SQLContext(sc)
  // JDBC connection details for the MySQL database from the MySQL with Spark recipe
  val url = "jdbc:mysql://localhost:3306/hadoopdb"
  val prop = new java.util.Properties
  prop.setProperty("user", "hduser")
  prop.setProperty("password", "vipassana")
  // Load the person table as a DataFrame and print every row on the driver
  val people = sqlContext.read.jdbc(url, "person", prop)
  people.collect.foreach(println)
}
Now cd to the people directory and create the build file build.sbt using the following command
$ echo "libraryDependencies += \"org.apache.spark\" %% \"spark-sql\" % \"1.4.1\"" >> build.sbt
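For reference, the resulting build.sbt holds just that one dependency line. If you want to pin things down, a slightly fuller build.sbt could look like the sketch below; the name, version, and Scala version values are only placeholders, not something this recipe requires:
name := "people"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.4.1"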
Now package the jar:
$ sbt clean package
Now run the application using spark-submit:
$ spark-submit --driver-class-path /home/hduser/thirdparty/mysql.jar target/scala-2.10/people*.jar
I hope you appreciate the beauty and simplicity of sbt now.
Now let’s do the same using cluster deploy mode; in this case we have to go to http://spark-master:8080 to see the results.
$ spark-submit --master spark://localhost:7077 --deploy-mode cluster  --driver-class-path /home/hduser/thirdparty/mysql.jar --jars /home/hduser/thirdparty/mysql.jar target/scala-2.10/people*.jar
One thing to notice is that the documentation suggests --jars should put the library on both the driver and the executors. My personal experience suggests that --driver-class-path puts the library on the driver and --jars puts it only on the executors.
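A related option, if you want to be explicit about both sides, is to set the class-path configuration properties directly with --conf. This is only a sketch and assumes the MySQL driver jar is present at the same path on every node of the cluster:
$ spark-submit --master spark://localhost:7077 --deploy-mode cluster \
    --conf spark.driver.extraClassPath=/home/hduser/thirdparty/mysql.jar \
    --conf spark.executor.extraClassPath=/home/hduser/thirdparty/mysql.jar \
    target/scala-2.10/people*.jar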
