Genome-Wide Association Analysis and Post-Analytic Interrogation with R

Andrea S. Foulkes - Mount Holyoke College

Post-tutorial notes

The materials used in the tutorial are available here and here.

Tutorial Description

For complex traits, such as cardiometabolic disease, we increasingly recognize that the intergenic space between protein-coding genes (PCGs) contains highly ordered regulatory elements that control the expression and function of PCGs and can themselves be actively transcribed. Indeed, over 50% of genome-wide association studies (GWAS) of complex traits identify single nucleotide polymorphisms (SNPs) that fall in intergenic regions, and it is only recently becoming apparent that these regions are highly organized to perform specific functions. A next step in advancing precision medicine is careful and rigorous interrogation of the role of these regulatory elements, and their interplay with known PCGs and environmental factors, in the heritability of complex disease phenotypes. This tutorial focuses on analytic techniques and R tools designed to uncover these complex and largely uncharacterized relationships.

Goals

Upon completion of this tutorial, students will:

Understand the core analytic components of a standard GWAS and be familiar with code to complete a “start-to-finish” GWAS analysis using R; and
Understand and be able to apply approaches for more comprehensive interrogation of regional and higher-order associations to discern the relative contributions of genes and regulatory elements.

Tutorial Outline

All methods and examples will be presented and made available using R/Jupyter notebooks.

Part 1: Core analytic components of a standard GWAS.
1. Data preprocessing – SNP and sample level filtering;
2. Data generation – principal components for population substructure and SNP imputation using 1000 Genomes data;
3. Association analysis – for typed and imputed SNPs using parallel processing; and
4. Post-analytic visualization and interrogation using external resources, including those available through the UCSC Genome Browser.
Part 2: Interrogation of regional and higher-order associations.
1. General overview of class-level testing strategies, rare-variant analysis, gene-environment interaction analysis, and gene-set enrichment and biological pathway analysis. Specific methods covered will include: GATES, VEGAS, QT, GSEA, SKAT and MAGENTA.
2. Translation of methods to alternative genomic taxonomies, including regulatory elements such as long intergenic non-coding RNAs, enhancer elements and splicing regions.
3. Correcting for multiple comparisons and controlling the family-wise error rate.
4. Computational efficiencies and versatility of each method with respect to alternative genomic taxonomies.
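
To give a flavor of the Part 1 steps listed above, the following is a minimal base-R sketch of SNP-level filtering, single-SNP association testing, and a Bonferroni correction for the family-wise error rate. It is not taken from the tutorial notebooks: the genotype matrix and trait are simulated, the function name snp_qc is hypothetical, and the thresholds (call rate of at least 0.95, minor allele frequency of at least 0.01) are assumed values that would need to be chosen for a real study.

```r
# Illustrative sketch only: simulated data, hypothetical names, assumed thresholds.
set.seed(1)
n_samples <- 200
n_snps <- 50

# Toy genotype matrix: rows = samples, columns = SNPs,
# entries = allele counts (0, 1, 2), NA = missing call.
true_freq <- runif(n_snps, min = 0.001, max = 0.5)
geno <- sapply(true_freq, function(p) rbinom(n_samples, size = 2, prob = p))
colnames(geno) <- paste0("snp", seq_len(n_snps))

# Introduce about 3% missing calls at random, and make snp1 clearly fail QC.
geno[matrix(runif(n_samples * n_snps) < 0.03, nrow = n_samples)] <- NA
geno[1:40, "snp1"] <- NA

# SNP-level filtering on call rate and minor allele frequency (MAF).
snp_qc <- function(geno, min_call_rate = 0.95, min_maf = 0.01) {
  call_rate <- colMeans(!is.na(geno))
  allele_freq <- colMeans(geno, na.rm = TRUE) / 2
  maf <- pmin(allele_freq, 1 - allele_freq)
  keep <- call_rate >= min_call_rate & !is.na(maf) & maf >= min_maf
  data.frame(snp = colnames(geno), call_rate = call_rate, maf = maf,
             keep = keep, row.names = NULL)
}

qc <- snp_qc(geno)
kept <- qc$snp[qc$keep]

# Simulated quantitative trait with no true genetic effect.
trait <- rnorm(n_samples)

# Single-SNP linear regression under an additive model;
# extract the p-value for the genotype coefficient.
p_values <- sapply(kept, function(s) {
  fit <- lm(trait ~ geno[, s])
  coef(summary(fit))[2, 4]
})

# Bonferroni adjustment controls the family-wise error rate across tested SNPs.
p_adj <- p.adjust(p_values, method = "bonferroni")
significant <- names(p_adj)[p_adj < 0.05]
```

A real analysis would typically also apply sample-level filters, adjust for principal components of population substructure, and distribute the per-SNP models across cores, as the outline indicates; those steps are omitted here to keep the sketch short.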

Download Jupyter Slides

You may view or download the Jupyter slides for this tutorial here (part 1) and here (part 2).

Background Knowledge

Participants are expected to be familiar with basic statistical concepts and techniques at the level of an intermediate statistics course.

Instructor Biography

Dr. Foulkes is a Professor of Statistics at Mount Holyoke College. She received her ScD in Biostatistics from Harvard School of Public Health in 2000, and has since made substantial contributions in the field of statistical genetics, including methodological and substantive research articles (over 50 publications) as well as open-source educational materials (http://www.stat-gen.org/ and http://www.statsteachr.org/). She has led tutorials at the useR! conferences in Dortmund, Rennes and Warwick based on her widely acclaimed graduate-level textbook, Applied Statistical Genetics with R (2009), Springer, New York (http://www.springer.com/us/book/9780387895536). The content of the proposed tutorial draws on a second edition of this book, currently underway, and on Dr. Foulkes’ recent tutorial, A guide to genome-wide association analysis and post-analytic interrogation (2015), Statistics in Medicine, 34:3769–3792 (http://onlinelibrary.wiley.com/doi/10.1002/sim.6605/full).