Apache Spark is a popular cluster computing framework for large-scale data analysis. This tutorial will introduce cluster computing using SparkR, the R language API for Spark. SparkR provides a distributed data frame API that enables structured data processing with a syntax familiar to R users. In this tutorial we will provide example workflows for ingesting data, performing data analysis, and running interactive queries using distributed data frames. Finally, participants will be able to try SparkR on real-world datasets using Databricks R notebooks to get hands-on experience.
This tutorial will introduce R users to cluster computing using Apache Spark. It will cover a number of key components of SparkR, the R API for Spark. These include:
Distributed Data Frames: The central component of SparkR is a distributed data frame implemented on top of Spark. SparkR’s distributed data frames have an API similar to dplyr, and this high-level API simplifies expressing complex data manipulation tasks on data frames. This tutorial will introduce the Spark computation model and will enable R users to become productive with the core functionality offered by SparkR data frames. We will also describe how data frame operations scale to large datasets using Spark’s relational query optimizer.
Data Sources: Spark SQL's data sources API supports reading input from a variety of systems, including HDFS, HBase, and Cassandra, and from a number of formats such as JSON and Parquet. Integrating with the data sources API enables R users to directly process datasets from any of these sources. We will demonstrate how to use built-in data sources and how to access external data sources that are available as Spark packages.
Large-Scale Machine Learning: To support large-scale distributed machine learning, SparkR integrates with MLlib, the distributed machine learning library in Spark. Spark 1.6 supports a distributed “glm” method to fit Gaussian and binomial GLMs. SparkR supports a subset of the R formula operators, including the “+” (inclusion), “-” (exclusion), “:” (interactions), and intercept operators. The tutorial will include examples of preprocessing data using SQL and then fitting a machine learning model using glm on SparkR.
Databricks R Notebooks: Databricks R Notebooks allow anyone familiar with R to take advantage of the power of Spark through simple Spark cluster management, rich one-click visualizations, and instant deployment to production jobs. We will be using Databricks R notebooks during the tutorial for hands-on exercises.
Attendees will be required to have basic R knowledge and familiarity with R data frames. Experience with packages like plyr or dplyr would be useful but is not required.
Shivaram Venkataraman is a PhD student at the University of California, Berkeley and works with Mike Franklin and Ion Stoica at the AMP Lab. He is a committer on the Apache Spark project and works on the R API for Spark. Before coming to Berkeley, he was also involved in the initial design and implementation of the distributedR project at HP Labs.
Hossein Falaki is a software engineer at Databricks working on the next big thing. Prior to that he was a data scientist on Siri, Apple’s personal assistant. He graduated with a Ph.D. in Computer Science from UCLA, where he was a member of the Center for Embedded Networked Sensing (CENS).