Data Exploration & Machine Learning, Hands-on


Recommended free walkthrough, check it out and boost your career:

Boost your data science career


Supercharge R with Spark: Getting Apache's SparkR Up and Running on Amazon Web Services (AWS)

Practical walkthroughs on machine learning, data exploration and finding insight.

Spark Logo


Spark doesn’t need an introduction, but just in case, it extends Hadoop’s distributed / parallel computing paradigm by using both live memory in addition to disk storage. This makes for a much faster tool. As of version 1.4, SparkR is included in Apache’s Spark build. We’ll be using version 1.5 in this walk-through.

This will be a two-part series, here we’ll install SparkR on EC2 and fire up a few clusters. In the second part, will do some distributed modeling.

Cluster Launcher

AWS Instance

In order to approach this from the same vantage point, we’ll use a small EC2 instance to launch our Spark clusters. You will need an amazon AWS account and the ability to Secure Shell Tunneling (SSH) into AWS (more on this later). Keep in mind that AWS instances cost money and the more clusters you need, the more money Amazon will charge. So, always remember to Stop or Terminate you instances when not needed!

First, sign into the AWS Console:

logging on AWS


Under header Networking, select VPC:


A virtual private cloud (VPC) will determine who and what gets to access the site. We will use the wizard and content ourselves with only on VPC. In an enterprise-level application, you will want at least 4, 2 to be private and run your database, and two to be public and hold your web-serving application. By duplicating the private and public VPCs you can benefit from fail-over and load balancing tools. For our purposes, we’ll just go with one public VPC:

Start the wizard:

VPC Wizard

Start the wizard and select VPC with a Single Public Subnet:

VPC Configuration

Most of the defaults are fine except you should add a name under VPC name:

Single Subnet

Hi there, this is Manuel Amunategui- if you're enjoying the content, find more at


VPC is done, let’s now create our EC2 instance - this is going to be our cluster-launching machine. Click on the orange cube in the upper left corner of the page. From the ensuing menu, choose the first option, EC2:

EC2 Icon


In Create Instance, select Launch Instance:

Create Instance

Select the first option Amazon Linux AMI:

Create Instance

In Step 2, continue with the preselected machine and click Next: Configure Instance Details:

Create Instance

In Step 3, keep all defaults but change the Auto-assign Public IP to Enable:


Select Review and Launch:

Review Launch

In Step 7, SSH port 22 is opened by default so there is nothing for us to do but to click Launch. It will open a pop-up box where you will need to create a new key pair:

Key Pair

Key-pair is a security file that will live on your machine and is required to SSH into the instance. I tend to create them and leave them in my downloads. What ever you decided to do, make sure you know where it is as you’ll need to pass a path to it every time you want to connect to it. Create a new one, check the acknowledgment check-box.

Finally, click Launch Instances. That’s it for our instance. Hit the View Instances on the next page as the machine is being initialized:

EC2 Instance

We need to wait till the Initializing is done. Meanwhile we can get the security credentials that will be needed to launch the clusters. Go to your account name drop down in the top right corner and click Security Credentials:

Create Instance

Choose Access Keys

Create Instance

And Create Access Key

Create Instance

This will download two strings:


Keep that download handy as we’ll need the values shortly.

Back at the instance window, check that our new instance is running, you can click on the check box to access the assigned dynamic IP address:

Create Instance

Connecting to EC2

Use the Actions button to get the exact SSH string to reach the instance:

Create Instance

And select Connect:

Connect to Instance

Read the instruction on the pop-up box. The last line states your connection string: ssh -i "demo.pem" ec2-user@ To use it on the Mac, open your terminal and navigate to your Downloads folder (or wherever you moved your pem key-pair file). As per instructions, apply chmod 400 demo.pem and copy the example string.


Paste it into the terminal and follow the instructions to log into your EC2 instance:

ssh -i "demo.pem" ec2-user@

Your terminal should confirm the EC2:

Termninal Window

Installing Spark

This instance will be used to launch our Master and Dependent clusters. Let’s get the download URL path for Spark (from the downloads page for the latest version):

Spark Downloads

We’ll use wget to download Spark directly into our instance, paste the following commands into the terminal:

# download spark
sudo wget

# unpack spark tgz file
sudo tar zxvf  spark-1.5.0-bin-hadoop2.6.tgz

Now we need to download our PEM key-pair file over to the instance as well. Exit out of the EC2 session and use the scp command (Secure Copy):

# get out of current session and back to local machine

# upload pem file in same directory on the EC2 instance (swap the dashed IP with your instance's) 
scp -i  demo.pem demo.pem

Now we go back to the EC2 box:

ssh -i "demo.pem" ec2-user@

# navigate to the ec2 folder:
cd spark-1.5.0-bin-hadoop2.6/ec2

# change file permissions
sudo chmod 400 demo.pem

We also need to export our access keys:

Set access keys:
export AWS_ACCESS_KEY_ID=...yours here... 
export AWS_SECRET_ACCESS_KEY=...yours here... 

Creating Clusters

Now, let’s get our clusters up and running. This will call the script to launch 3 clusters, 1 master and 2 dependents in the us-west zone. -k is your PEM name, -i is the path to the PEM file, -s is the number of instances needed and -t is the server type:

# launch clusters
./spark-ec2 -k demo -i demo.pem -r us-west-2 -s 2 -t m1.small launch --copy-aws-credentials my-spark-cluster

Building the clusters takes about 15 minutes. The terminal window may spit out some waiting, disconnected, or connection refused messages but as long as it doesn’t return the prompt, let it do its thing as it needs to set up each instance and install a lot of software on all of them.

Spark Downloads

Connecting to the Master Cluster

One of the last lines in the terminal window will be the master IP address - but you can also get it from the instance viewer on the AWS website.

The master cluster is where we need to run RStudio. It is automatically installed for us (nice!). To check that our clusters are up and running, connect to the Spark Master window (swap the master IP address with yours):

Spark Master

We confirm that we have two dependent clusters with status Alive. In order to log into RStudio, we need to first SSH into the instance and create a new user.

# log into master instance
ssh -i "demo.pem"

# add new user (and you don't have to use my name) and set a passwd
sudo useradd manuel
sudo passwd manuel

Working with RStudio

The credentials we just created are what we’ll use to log into RStudio Server:

# log into r studio

Spark Master

Let’s get some data through these clusters

Just a quick example from the 1.5 Spark documentation to show that this works. Paste the following code into RStudio running on your master cluster:

# from blog:
.libPaths(c(.libPaths(), '/root/spark/R/lib'))
Sys.setenv(SPARK_HOME = '/root/spark')
Sys.setenv(PATH = paste(Sys.getenv(c('PATH')), '/root/spark/bin', sep=':'))
# initialize the Spark Context
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# create a data frame using the createDataFrame object
df <- createDataFrame(sqlContext, faithful) 
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55
# try simple generalized linear model 
model <- glm(waiting ~ eruptions, data = df, family = "gaussian")
## $coefficients
##             Estimate
## (Intercept) 33.47440
## eruptions   10.72964
# see how well it predicts
predictions <- predict(model, newData = df)
head(select(predictions, "waiting", "prediction"))

##   waiting prediction
## 1      79   72.10111
## 2      54   52.78775
## 3      74   69.23629
## 4      62   57.97017
## 5      85   82.11186
## 6      55   64.40795

And to prove that all clusters went to work, check out the Spark Master:

Spark Master

K- we’re done for now. Don’t forget to terminate all your EC2 instances you created!

# handy command to kill all Spark clusters part of my-spark-cluster 
./spark-ec2 -k demo -i demo.pem -r us-west-2  destroy my-spark-cluster

In the next walk-through, we’ll dig deeper by using distributed data frames along with regression and classification modeling examples.


Manuel Amunategui - Follow me on Twitter: @amunategui