The materials used in the tutorial are available here.
The goal of this tutorial is to provide participants with a deep understanding of four widely used algorithms in machine learning: Generalized Linear Model (GLM), Gradient Boosting Machine (GBM), Random Forest and Deep Neural Nets. This includes a deep dive into the algorithms in the abstract sense, and a review of the implementations of these algorithms available within the R ecosystem.
Due to their popularity, each of these algorithms have several implementations available in R. Each package author takes a unique approach to implementing the algorithm, and each package provides an overlapping, but not identical, set of model parameters available to the user. The tutorial will provide an in-depth analysis of how each of these algorithms were implemented in a handful of R packages for each algorithm.
After completing this tutorial, participants will have a understanding of how each of these algorithms work, and knowledge of the available R implementations and how they differ. The participants will understand, for example, why the xgboost package has, in less than a year, become one of the most popular GBM packages in R, even though the gbm R package has been around for years and has been widely used -- what are the implementation tricks used in xgboost that are not (yet) used in the gbm package? Or, why do some practioners in certain domains prefer the one implementation over another? We will answer these questions and more!
The "Machine Learning Algorithmic Deep Dive" tutorial will be a voyage into the depths of four of the most popular machine learning algorithms: Generalized Linear Model (GLM), Gradient Boosting Machine (GBM), Random Forest and Deep Neural Nets.
The tutorial will be broken up into four main sections -- one section for each algorithm, in addition to a brief introduction and follow-up discussion. Each algorithm-specific section will provide a high-level overview of the algorithm for participants who may not have a familiarity with the algorithm. After an overview of how the algorithm works, we will focus on a subset of important model parameters and discuss in detail what each of these parameters is used for and what relationships exist between pairs or groups of model parameters. As an example of releated model parameters, when you increase the value of the mtry parameter in a Random Forest, that reduces the randomness in the forest, and therefore, you may want to re-introduce additional variance into the model by reducing the sample_rate parameter.
The discussion around the different R packages is not intended to be a "benchmark" of different algorithm implementations, but rather a survey of different techniques that the authors have chosen to use in their implementations. We will evaluate the effects that the addition of new parameters or computational tricks have on the performance of the algorithms. We will highlight the ways in which you can take advantage of the unique features of each of the implementations.
We will also spend some time talking about default values for model parameters and the thought process involved in determining default model parameters for a new algorithm implementation. When the default values of identical parameters differ between packages, we will attempt to explain why these different defaults were chosen in different cases.
Although the tutorial will focus on the various R implementations, we will also make note of any algorithm functionality that exists only in non-R packages, such as Python's scikit-learn module.
Generalized Linear Models are one of the oldest and still one of the most popular algorithms in the field of statistics / machine learning. The GLM available in base R is a "standard" single-threaded, GLM implementation with no regularization. The lack of regularization is not ideal for high-dimensional datasets and also can lead to issues that arise from collinearity among the columns. However, there are several other GLM packages that offer regularization via Lasso, Ridge and Elastic Net penalties.
Among the linear modeling packages, there are several different training techniques used; we will discuss where each of these packages overlap and where they differ. For example, many GLM implementations use the L-BFGS optimization algorithm to fit a model, where others use stochastic gradient descent or iterativelyreweighted least squares. We will explain why understanding a little bit about different solvers is very useful in practice and provide "rules of thumb" about when to use each of the optimization algorithm (e.g. sparse data, wide data, correlated features, etc).
The Gradient Boosting Machine is a very popular all-purpose algorithm. It consistently peforms better than many other machine learning algorithms across a wide variety of datasets, in part due to the flexibility induced by it's ability to optimize a user-defined loss function. GBMs allow the user to specify their own loss function, where as other algorithms have the loss function, or optimization problem, "hard-coded" into the algorithm. Some implementations make liberal use of this flexibility and provide a variety of pre-defined, useful loss functions, usually corresponding to different response distributions (e.g. Poisson, Gaussian). For this reason, certain industries or scientific domains with very specific use-cases and related loss functions, may prefer certain packages over others.
Other variability among implementations comes from various row and column sampling options that allow the user more flexibility and finer control over the randomness in the algorithm. We will discuss why the addition of something as simple as an extra column-sampling parameter to a GBM can increase model performance. We will also discuss paralleization techniques used to create distributed implemenations of GBM, an iterative algorithm that, unlike Random Forest, is non-trivial to parallelize.
Boosting methods are susceptible to overfitting (if trained for too many iterations), so early-stopping techniques are particularly useful in practice and will be discussed.
The Random Forest algorithm, along with the Gradient Boosting Machine, is one of the most popular and robust machine learning algorithms used in practice. The Random Forest algorithm is interesting because there are several variants of the original Random Forest algorithm, as well as unique implementations for each of those variants. There are also implementations that target very specific use-cases or data types. For example, the ranger R package was designed specifically for high-dimensional datasets such as Genome-Wide Association Study (GWAS) data.
Another interesting aspect to the Random Forest algorithm is that computational tricks, such as shortcuts among different sorting techniques, and massive parallelization across compute nodes, play a huge role in performance. Lastly, some counter-intuitive algorithmic techniques such as using random values rather than computing optimimal values for feature splits can lead to improvements in speed and accuracy, as in the Extremely Randomized Trees variant of Random Forest, implemented in the extraTrees package.
Deep Neural Nets
The deep learning ecosystem in R is still quite young and there are not many implementations available. Much of the focus in software development in the deep learning community has occured within the C++ and Python communities, often making use of GPU hardware, rather than traditional CPUs. However, there are a handful of impementations available in R that will be discussed in this section.
Much of the obscurity around deep neural nets comes from the fact that, compared to other machine learning algorithms, there are a large number of model parameters. Some practitioners may be overwhelmed with the number of knobs available, which can make it difficult to achieve good performance if you are not an deep learning expert -- it's hard to know where to start. As part of this section, we will review the most important model parameters available to the user and provide a fully-featured "How To" guide to successfully training deep neural nets in R.
A brief follow-up discussion about the multi-algorithm machine learning interface packages, caret and mlr, will follow the algorithm-specific talks. Although these packages don't provide their own, unique implementations of machine learning algorithms (they are wrappers for existing algorithm implementations), they offer a clean interface and some advanced features (e.g. cross-validation, bootstrapping, etc) that are worthy of mentioning as part of this tutorial.
Some familiarity with the main concepts of machine learning (regression, classification) and common machine learning algorithms would be helpful, but not required.
The repository for this tutorial is available at https://github.com/ledell/useR-machine-learning-tutorial.
Erin is a Statistician and Machine Learning Scientist at H2O.ai, where she works on open source machine learning software. She is the author of a handful of machine learning related R packages and a contributor to the rOpenHealth project. Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from UC Berkeley. Her dissertation focused on scalable ensemble learning and was awareded the 2015 Erich L Lehmann Citation by the UC Berkeley Department of Statistics. As a result of working on multi-algorithm ensemble learning and software development in R, she has become familiar with many machine learning algorithms and their implementations.
A list of past presentations is available on her website. She has given dozens of talks and lead a number of workshops over the past decade on machine learning, R, Python, H2O and various topics from Statistics. This list includes talks in the "Machine Learning track" at the 2014 and 2015 useR! conferences. She is also the founder of the Bay Area Women in Machine Learning & Data Science meetup group which regularly organizes talks and hands-on coding workshops in the Bay Area.
A few examples of recent tutorials: