Introduction to SparkR

Shivaram Venkataraman - UC Berkeley; Hossein Falaki - Databricks

Post-tutorial notes

The materials used in the tutorial (slides, notebook, and instructions) are available here. The SparkR tutorial slides are at here and here.

Tutorial Description

Apache Spark is a popular cluster computing framework used for large-scale data analysis. This tutorial will introduce cluster computing using SparkR, the R language API for Spark. SparkR provides a distributed data frame API that enables structured data processing with a syntax familiar to R users. In this tutorial we will provide example workflows for ingesting data, performing data analysis and running interactive queries using distributed data frames. Finally, participants will be able to try SparkR on real-world datasets using Databricks R notebooks to get hands-on experience.

Goals

Introduce cluster computing concepts in Apache Spark and the SparkR data frame API.
Provide example workflows for ingesting data from various sources, performing analysis and visualizing the outputs.
Offer hands-on experience applying SparkR to real-world datasets using R notebooks in the cloud.

Tutorial Outline

This tutorial will introduce R users to cluster computing using Apache Spark. It will cover a number of key components of SparkR, the R API for Spark. These include:

Distributed Data Frames: The central component of SparkR is a distributed data frame implemented on top of Spark. SparkR’s distributed data frames have an API similar to dplyr, and this high-level API simplifies expressing complex data manipulation tasks on data frames. This tutorial will introduce the Spark computation model and will enable R users to become productive with the core functionality offered by SparkR data frames. We will also describe how the data frame operations scale to large datasets using Spark’s relational query optimizer.
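As a rough illustration of the dplyr-like data frame API described above, here is a minimal sketch. It assumes a local Spark 1.6 installation with the SparkR package available, uses the 1.6-style sparkR.init() and sparkRSQL.init() entry points (later Spark releases replaced these with sparkR.session()), and borrows R's built-in faithful dataset purely as sample data. The variable names are illustrative and not taken from the tutorial materials.

```r
library(SparkR)

# Assumption: Spark 1.6-style initialization against a local master.
sc <- sparkR.init(master = "local[2]", appName = "sparkr-dataframes")
sqlContext <- sparkRSQL.init(sc)

# Convert a local R data frame into a distributed SparkR DataFrame.
df <- createDataFrame(sqlContext, faithful)

# Filter rows and select columns, much as one would with dplyr verbs.
short_waits <- filter(df, df$waiting < 50)
head(select(short_waits, "eruptions", "waiting"))

# Grouped aggregation: count how often each waiting time occurs,
# then sort so the most common waiting times come first.
waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(arrange(waiting_counts, desc(waiting_counts$count)))
```

Only head() brings results back into the R session; the filter, select and aggregation steps themselves are executed by Spark.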

Data Sources: Spark SQL's data sources API provides support for reading input from a variety of systems including HDFS, HBase and Cassandra, and from a number of formats such as JSON and Parquet. Integrating with the data sources API enables R users to directly process datasets from any of these data sources. We will demonstrate how to use built-in data sources and access external data sources that are available as Spark packages.
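A small sketch of reading and writing through the built-in JSON and Parquet data sources might look like the following. It again assumes a local Spark 1.6 setup with the 1.6-style sqlContext argument to read.df(); the sample records, file paths and variable names are made up for illustration.

```r
library(SparkR)

# Assumption: Spark 1.6-style initialization against a local master.
sc <- sparkR.init(master = "local[2]", appName = "sparkr-datasources")
sqlContext <- sparkRSQL.init(sc)

# Write a tiny JSON-lines file so the example is self-contained.
# The records are hypothetical sample data.
json_path <- tempfile(fileext = ".json")
writeLines(c('{"name": "Ada", "age": 36}',
             '{"name": "Grace", "age": 45}'), json_path)

# Read it through the built-in JSON data source; the schema is inferred.
people <- read.df(sqlContext, json_path, source = "json")
printSchema(people)

# Save the same data as Parquet, then read it back via the Parquet source.
parquet_path <- tempfile(fileext = ".parquet")
write.df(people, path = parquet_path, source = "parquet", mode = "overwrite")
people_parquet <- read.df(sqlContext, parquet_path, source = "parquet")
head(people_parquet)
```

External data sources distributed as Spark packages are used through the same read.df() call with a different source name, provided the package has been made available to Spark when it is launched.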

Large Scale Machine Learning: To support large-scale distributed machine learning, SparkR integrates with MLlib, the distributed machine learning library in Spark. Spark 1.6 supports a distributed “glm” method to fit Gaussian and binomial GLMs. SparkR supports a subset of the R formula operators including the “+” (inclusion), “-” (exclusion), “:” (interactions) and intercept operators. The tutorial will include examples of preprocessing data using SQL and then fitting a machine learning model using glm on SparkR.
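The workflow described above (SQL preprocessing followed by a distributed glm fit) could be sketched roughly as follows. This assumes a local Spark 1.6 setup and uses R's built-in iris dataset as stand-in data; the temporary table name, the filter condition and the chosen formula are illustrative only.

```r
library(SparkR)

# Assumption: Spark 1.6-style initialization against a local master.
sc <- sparkR.init(master = "local[2]", appName = "sparkr-glm")
sqlContext <- sparkRSQL.init(sc)

# SparkR replaces the dots in iris column names with underscores
# (Sepal.Length becomes Sepal_Length).
df <- createDataFrame(sqlContext, iris)

# Preprocess with SQL: register the data frame as a table and query it.
registerTempTable(df, "iris")
training <- sql(sqlContext,
                "SELECT Sepal_Length, Sepal_Width, Species FROM iris WHERE Sepal_Width > 2.5")

# Fit a Gaussian GLM using the "+" formula operator.
model <- glm(Sepal_Length ~ Sepal_Width + Species,
             data = training, family = "gaussian")
summary(model)

# Score the training data; predictions are added as a "prediction" column.
predictions <- predict(model, newData = training)
head(select(predictions, "Sepal_Length", "prediction"))
```

Passing family = "binomial" instead would fit a logistic regression, which requires a binary response column.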

Databricks R Notebooks: Databricks R Notebooks allow anyone familiar with R to take advantage of the power of Spark through simple Spark cluster management, rich one-click visualizations, and instant deployment to production jobs. We will be using Databricks R notebooks during the tutorial for hands-on exercises.

Background Knowledge

Attendees will be required to have basic R knowledge and familiarity with R data frames. Experience with packages like plyr or dplyr would be useful but is not required.

Additional Notes

To help presenters better organize the content and material, please consider answering this short survey.
Please consider signing up in advance for the following free online service, which will be used for the hands-on exercises: https://databricks.com/ce.

Instructor Biography

Shivaram Venkataraman is a PhD student at the University of California, Berkeley and works with Mike Franklin and Ion Stoica at the AMP Lab. He is a committer on the Apache Spark project and works on the R API for Spark. Before coming to Berkeley, he was also involved in the initial design and implementation of the distributedR project at HP Labs.

Hossein Falaki is a software engineer at Databricks working on the next big thing. Prior to that he was a data scientist working on Apple’s personal assistant, Siri. He graduated with a Ph.D. in Computer Science from UCLA, where he was a member of the Center for Embedded Networked Sensing (CENS).

Links to previous presentations:
https://sparksummit.org/2015/events/sparkrthepastthepresentandthefuture/
https://sparksummit.org/eu2015/events/enablingexploratorydatasciencewithsparkandr/