The R User Conference 2016

June 27 - June 30 2016
Stanford University, Stanford, California



Never Tell Me the Odds! Machine Learning with Class Imbalances

Max Kuhn - Pfizer

Post-tutorial notes

The materials used in the tutorial are available here.

Tutorial Description

This tutorial will provide an overview of using R to create effective predictive models in cases where at least one class has a low event frequency. These types of problems are often found in applications such as click-through rate prediction, disease prediction, chemical quantitative structure-activity modeling, network intrusion detection, and quantitative marketing. The session will step through the process of building, optimizing, testing, and comparing models that are focused on prediction. A case study is used to illustrate functionality.

Goals

The goal of the tutorial is to provide a thorough workflow in R that can be used with many different modeling techniques. The participants will also gain an understanding of specific types of models that can be used to combat class imbalances.

Tutorial Outline

  • review of classification problems
  • the class imbalance problem
  • methods for measuring performance
  • model-independent sampling methods
  • cost-sensitive approaches

Background Knowledge

Participants should have a good knowledge of basic R elements (e.g. data structures and functions). Appendix B of Applied Predictive Modeling provides a summary of these topics.

This tutorial will focus on building classification models, so some previous exposure to these models is advantageous. However, a short review will be provided. Familiarity with resampling procedures (e.g. cross-validation) is also helpful.

Instructor Biography

Max Kuhn, Ph.D., is a Senior Director in research and development at Pfizer in Groton, CT, where he has worked as a nonclinical statistician for more than a decade. His focus is on early drug discovery in areas such as computational biology and chemistry, assay development, compound optimization, and computational sciences.

Prior to Pfizer, Max worked at BD Diagnostic Systems, where he helped develop in vitro assays and diagnostic algorithms for infectious diseases. He also led a group of statisticians, database analysts, and programmers concentrated on the analysis of clinical trial data and regulatory submissions. He received his B.S. in Mathematics and Ph.D. in Biostatistics from Virginia Commonwealth University.

Max is the author of eight R packages for techniques in machine learning and reproducible research and is an Associate Editor for the Journal of Statistical Software. He and Kjell Johnson wrote the book Applied Predictive Modeling, which won the Ziegel Award for the best book reviewed in Technometrics in 2015.

