Apache Spark is an open-source cluster computing framework which is becoming extremely popular these days. By now it has taken over the role of many previously used MapReduce and machine learning frameworks. There are already plenty of recipes on how to launch a cluster and get the examples and the shell running from there. Nevertheless, assume that for educational purposes or some other odd reason we would like to build a single JAR, with all dependencies included, which then runs some Spark-related code on its own. In that case, here is a simple four-step recipe to get started from scratch.
Create a new Maven Java project
The easiest way to do this is from the command line (look here for an explanation):
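For example, using the quickstart archetype (the groupId and artifactId below are just placeholders of my own choosing):

    mvn archetype:generate \
        -DgroupId=com.example.spark \
        -DartifactId=spark-example \
        -DarchetypeArtifactId=maven-archetype-quickstart \
        -DinteractiveMode=false

This generates a minimal project layout with a pom.xml, an App.java and an AppTest.java, which is all we need for the following steps.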
Edit the POM file
In my example, I first explicitly set the Java version to 1.8. Then I remove the junit dependency and add dependencies on spark-core_2.10, testng and guava (note the Guava version 16.0, chosen to avoid conflicts with the current version of spark-core). Finally, I use the Maven Shade plugin to bundle the dependencies, with additional filters and transformers needed to get everything working.
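A rough sketch of the relevant POM fragments could look like the following; the exact version numbers and the "allinone" final name are placeholders of my own rather than fixed choices, and will drift over time:

    <properties>
      <maven.compiler.source>1.8</maven.compiler.source>
      <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.3.1</version>
      </dependency>
      <dependency>
        <groupId>org.testng</groupId>
        <artifactId>testng</artifactId>
        <version>6.8.8</version>
        <scope>test</scope>
      </dependency>
      <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>16.0</version>
      </dependency>
    </dependencies>

    <build>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <version>2.3</version>
          <executions>
            <execution>
              <phase>package</phase>
              <goals>
                <goal>shade</goal>
              </goals>
              <configuration>
                <finalName>${project.artifactId}-${project.version}-allinone</finalName>
                <filters>
                  <!-- strip signature files that would otherwise break the merged JAR -->
                  <filter>
                    <artifact>*:*</artifact>
                    <excludes>
                      <exclude>META-INF/*.SF</exclude>
                      <exclude>META-INF/*.DSA</exclude>
                      <exclude>META-INF/*.RSA</exclude>
                    </excludes>
                  </filter>
                </filters>
                <transformers>
                  <!-- merge META-INF/services entries and Akka's reference.conf -->
                  <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                  <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                    <resource>reference.conf</resource>
                  </transformer>
                </transformers>
              </configuration>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </build>

The filters and transformers are the part that usually needs tweaking: without them the shaded JAR tends to fail at runtime on duplicate signature files or missing service and configuration entries.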
Import the project into an IDE and edit the files
In the next step, I import the project into Eclipse and edit App.java and AppTest.java. The code illustrates a simple word count in Spark, but the important part is something like the following, where I launch a new Spark context with a local master:
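For illustration, a sketch of what App.java could look like; the package name follows the placeholder groupId from the first step, and the word-count details are only one possible way to write it:

    package com.example.spark;

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class App {
        public static void main(String[] args) {
            // The important part: a Spark context with a local master,
            // so the JAR runs on its own without any cluster.
            SparkConf conf = new SparkConf()
                    .setAppName("wordcount")
                    .setMaster("local[2]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // A simple word count over the file given as the first argument.
            JavaRDD<String> lines = sc.textFile(args[0]);
            JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split("\\s+")));
            JavaPairRDD<String, Integer> counts = words
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey((a, b) -> a + b);

            // Print each word together with its count.
            for (Tuple2<String, Integer> t : counts.collect()) {
                System.out.println(t._1() + ": " + t._2());
            }

            sc.stop();
        }
    }

The setMaster("local[2]") call is what makes the JAR self-contained: Spark runs inside the same JVM with two worker threads instead of contacting a cluster manager.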
Build the project and run
In the final step, I first build the project:
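For example:

    mvn clean package

With the Shade plugin bound to the package phase as sketched above, this produces both the plain JAR and the shaded allinone JAR under target/.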
Then I create a test file and run App.java from the command line (note that here I use the allinone.jar, which is the one with all dependencies included):
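With the placeholder names from the earlier steps, this could look like:

    echo "to be or not to be" > input.txt
    java -cp target/spark-example-1.0-SNAPSHOT-allinone.jar com.example.spark.App input.txt

Since everything is bundled into the single JAR, plain java is enough here; there is no need for spark-submit or a running cluster.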
Finally, after a short time the example program spits out something like this:
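Assuming the App.java sketch and the test file from the previous step, the interesting lines (with Spark's own log output omitted, and in no particular order) would be roughly:

    or: 1
    not: 1
    to: 2
    be: 2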
So it works – what a lovely evening and good night folks!
PS. Here is the complete project created through these steps.