The R User Conference 2016

June 27 - June 30 2016
Stanford University, Stanford, California



Using R with Jupyter Notebooks for Reproducible Research

Andrie de Vries and Micheleen Harris - Microsoft

Post-tutorial notes

The materials used in the tutorial are available here.

Tutorial Description

This hands-on tutorial introduces the Jupyter notebook project (which grew out of IPython) in the context of programming in R for reproducible research and knowledge transfer. After this tutorial, attendees will understand why and how to use Jupyter notebooks for research, training or instruction.

Notebooks 1) contain “live” code that is interactive and modifiable; 2) allow for easy collaboration on a notebook system; and 3) serve as a scratchpad for both code and notes. It can therefore be argued that Jupyter notebooks fill an important gap between the resources associated with technical projects and a way to communicate and share those resources.

Tutorial Outline

The Jupyter project traces its origins back over 15 years. We’ll provide some context behind the project, then and now, and how it has evolved into its current form: a browser-based framework for live coding, a scratchpad, a shareable document, a fun way to teach R (or 50 other languages), and more.

We’ll guide attendees through a hands-on lab involving some simple image processing with k-means clustering. The lab will take attendees through what training looks like in a notebook system.

Following the lab, we’ll discuss when to use notebooks with R and when to prefer R Markdown. For example, reproducible research revolves around being able to recreate an “experiment” and customize it for collaboration or new applications.

A longer hands-on lab will build upon the first lab, using k-means clustering for image analysis. Attendees will be encouraged to break out into groups, be creative and come up with ideas for clustering or other forms of image analysis on real-world use cases, and then implement their ideas in notebooks.

We’ll briefly cover both the tools provided by Microsoft for the labs and resources for custom local or cloud setups that serve similar purposes. We’ll also present good practices for using notebooks in reproducible research and training, along with a demo on creating slideshows and rendering notebooks into impactful static documents.

All components will be provided, so no local installs will be necessary. Additional notebooks building on what was learned in the tutorial will be provided, along with introductory material for using the R kernel in a Jupyter notebook. All notebooks created by the instructors are released under the MIT License.

Goals

By the end of the tutorial, attendees will have practical knowledge of how and when to use notebooks and how to create quality notebooks, and will be able to begin using them immediately for any computational project, training event or publication.

Background Knowledge

This tutorial requires no background knowledge of Jupyter or Python. Knowledge of R for data processing and analysis is highly recommended.

We will provide an online, hosted Jupyter system for this tutorial. If users wish to install their own system, we’ll provide detailed download and installation instructions on the GitHub site.

Instructor Biography

Andrie de Vries and Micheleen Harris will jointly teach this tutorial.

Andrie de Vries is a programme manager at Microsoft, responsible for the development of Microsoft R Open and connectivity between R, the Azure cloud and other Microsoft products, e.g. Excel.

At UseR!2015, Andrie taught a tutorial session on RHadoop that drew more than 100 attendees.

Andrie is also a regular speaker at R and industry events, including:

  • UseR!2013: Using survival analysis for marketing attribution
  • UseR!2015: The network structure of CRAN
  • EARL London, 2014: Taking a mini CRAN into your organization (agenda)
  • EARL London, 2015: Reproducible Data Analysis with Revolution R Open (agenda)
  • Apache Big Data conference, Budapest 2015: R as a language for big data analytics
  • JSM 2015: The network structure of CRAN

Micheleen Harris is a Data Scientist at Microsoft, where she shares her Python, R and Jupyter notebook experience internally as well as with clients and partners.

Recently she delivered a course utilizing Microsoft Azure to provide extensive training to a large partner, setting up a Jupyter notebook server for this purpose.

During her time as a software engineer at the Institute for Systems Biology, a non-profit academic research center, she regularly delivered presentations on her research within her group as well as company-wide.

Her recent papers include:
  • Pan-transcriptomic analysis identifies coordinated and orthologous functional modules in the diatoms Thalassiosira pseudonana and Phaeodactylum tricornutum, Marine Genomics (Nov 2015)
  • State of the human proteome in 2013 as viewed through PeptideAtlas: comparing the kidney, urine, and plasma proteomes for the biology- and disease-driven Human Proteome Project, Journal of Proteome Research (Jan 2014), a proteomics effort contributing to the Human Proteome Project.

She has contributed to several other academic papers and efforts over the years.

Recently she developed a “Python for the Data Scientist” course. She is also interested in polyglot notebooks, for example R and Python code together in the same notebook, sharing data.

Micheleen won several awards for teaching excellence, including:
  • Faraday Fellowship for Teaching Excellence, University of Austin, 2003
  • Welch Teaching Excellence Award, University of Austin, 2002

