Running Apache Spark from a JAR

Reading time ~3 minutes

Apache Spark is an open source cluster computing framework, which is becoming extremely popular these days. By now it has taken over the role of many previously used MapReduce and Machine Learning frameworks. So far there exists plenty of recepies on how to launch a cluster and get the examples and shell running from there. Nevertheless, assume that for an educational purpose or any other odd reason we would like to build a single JAR, with all dependencies included, which then runs some Spark related code on its own. In that case, here is a simple four-step recipe to get started from scratch.

Create a new Maven Java project

The easiest way to do this is from the command line (look here for an explanation):

mvn archetype:generate -DgroupId=com.simonj.demo -DartifactId=spark-fun -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

Edit the POM file

In my example, I first explicitly state the Java version, 1.8. Then, I remove the junit dependency and add dependencies to spark-core_2.10, testng and guava (note the version 16.0 to avoid conflicts with the current version of spark-core). Finally, I use the Maven shade plugin to include the dependencies, with additional filters and transformers to get this stuff working.

Import the project into an IDE and edit the files

In the next step, I import the project into Eclipse and edit App.java and AppTest.java. The code illustrates a simple word counting in Spark, but the important part here is using something like the following (where I launch a new Spark context with a local master):

try (JavaSparkContext context = new JavaSparkContext("local[2]", "Spark fun!")) {
    ...
    context.stop();
}

Build the project and run

In the final step, I first build the project:

mvn clean package

Then create a test file, and run the App.java from the command line (note that here I use the allinone.jar, which is the one with all dependencies included):

java -cp target/spark-fun-1.0-SNAPSHOT-allinone.jar com.simonj.demo.App test.txt

Finally, after a short time the example program spits out something like this:

{took=1, lorem=4, but=1, text=2, is=1, standard=1, been=1, sheets=1, including=1, electronic=1, of=4, not=1, software=1, type=2, survived=1, book=1, only=1, s=1, desktop=1, to=1, passages=1, containing=1, and=3, versions=1, more=1, typesetting=2, essentially=1, recently=1, ipsum=4, a=2, galley=1, aldus=1, 1960s=1, simply=1, when=1, ever=1, dummy=2, with=2, 1500s=1, in=1, publishing=1, like=1, printing=1, five=1, industry=2, letraset=1, pagemaker=1, since=1, was=1, an=1, into=1, the=6, make=1, has=2, it=3, remaining=1, unknown=1, popularised=1, leap=1, unchanged=1, centuries=1, specimen=1, also=1, printer=1, release=1, scrambled=1}

So it works – what a lovely evening and good night folks!

PS. here is the complete project created through these steps.

On designing and running online experiments

Have you wondered how to design and run online experiments? In particular, how to implement an experiment dashboard such as the one enpic...… Continue reading

Software Engineering at Cxense

Published on April 17, 2017

Having fun with Scikit-Learn

Published on December 01, 2016