The genomic era has provided an opportunity to address a fundamental question in genetics: which genetic loci are responsible for complex diseases and physiological differences we observe among individuals? Genome-wide association studies (GWAS), which identify significant correlations between genetic markers and phenotypes, have become the first step in answering this question. This is an exciting time for GWAS as next-generation sequencing technologies will soon provide a relatively complete picture of the total genetic variation present in individual genomes that can be mined for phenotype associations.

The statistical challenge in GWAS is to find the few markers, among millions, that are truly associated with a disease or a complex phenotype. Members of our group are developing scalable computational methods for tackling this problem, using penalized regression techniques to analyze sets of markers simultaneously while assuming that many loci contribute to the phenotype. Recent projects include: application of well-justified penalties for optimally selecting informative marker sets in the sparse parameter space of a GWAS, introduction of a variational Bayes technique that provides a versatile system for incorporating mixture penalties, and development of hidden factor techniques to account for variation due to non-genetic factors in a GWAS (see our recent publications).
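To make the idea of analyzing many markers at once concrete, here is a minimal sketch of penalized regression fitted by coordinate descent. It uses the lasso (L1) penalty as a simple stand-in; it is not the group's variational Bayes or mixture-penalty method. The function names, the simulated data, the causal marker indices, and the penalty weight `lam` are all hypothetical choices made for illustration.

```python
import random

def soft_threshold(z, lam):
    """Closed-form lasso update for one coefficient of a standardized predictor."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def standardize(col):
    """Center a marker column and scale it to unit variance (left centered if constant)."""
    n = len(col)
    mean = sum(col) / n
    centered = [v - mean for v in col]
    sd = (sum(v * v for v in centered) / n) ** 0.5
    return [v / sd for v in centered] if sd > 0 else centered

def lasso_coordinate_descent(columns, y, lam, n_iter=50):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1, assuming standardized columns."""
    n, p = len(y), len(columns)
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, kept up to date as coefficients change
    for _ in range(n_iter):
        for j in range(p):
            xj = columns[j]
            # Correlation of marker j with the partial residual (marker j's own fit added back).
            rho = sum(xj[i] * resid[i] for i in range(n)) / n + beta[j]
            new = soft_threshold(rho, lam)
            if new != beta[j]:
                delta = new - beta[j]
                for i in range(n):
                    resid[i] -= delta * xj[i]
                beta[j] = new
    return beta

def simulate(n=200, p=50, maf=0.3, effects=None, noise_sd=0.5, seed=1):
    """Hypothetical toy data: genotypes coded 0/1/2 and a handful of causal markers."""
    rng = random.Random(seed)
    effects = effects or {3: 1.0, 17: -0.8}  # assumed causal markers and effect sizes
    columns = [[(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
               for _ in range(p)]
    y = [sum(b * columns[j][i] for j, b in effects.items()) + rng.gauss(0.0, noise_sd)
         for i in range(n)]
    mean_y = sum(y) / n
    y = [v - mean_y for v in y]
    return [standardize(c) for c in columns], y

columns, y = simulate()
beta = lasso_coordinate_descent(columns, y, lam=0.2)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("markers with nonzero coefficients:", selected)
```

Because every marker is in the model at once, a marker that is merely correlated with a causal one tends to have its coefficient shrunk to zero once the causal marker is included. A real GWAS has millions of markers and would need a far more efficient implementation than this pure-Python loop.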



Quantile-quantile plot of the results of a single-marker analysis of simulated GWA data including over a million markers. Each blue point indicates the log10 P-value associated with a single marker. The loci with phenotype associations are indicated by black squares. The loci identified with our simultaneous-marker, multiple-locus analysis technique “V-Bay” are indicated in red. V-Bay is able to detect true associations that are undetectable with a single-marker analysis. The inset plot shows one of the hits from V-Bay that does not lie exactly on the marker in tightest linkage disequilibrium with the associated locus but is six SNPs away. From Logsdon et al. 2010.
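For readers unfamiliar with how such a plot is built, the following sketch computes the coordinates of a quantile-quantile plot from a list of single-marker P-values. It assumes the common convention of plotting -log10 values and of using (rank - 0.5)/m as the expected quantile under the null hypothesis of no association; the function name `qq_points` is hypothetical, and the figure above may have been produced with a different convention.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) -log10 P-value pairs, smallest P-value first.

    Assumes every P-value lies in (0, 1]; under the null hypothesis the
    P-values are uniform, so the points fall along the diagonal.
    """
    m = len(p_values)
    points = []
    for rank, p in enumerate(sorted(p_values), start=1):
        expected = (rank - 0.5) / m  # expected quantile of a uniform sample
        points.append((-math.log10(expected), -math.log10(p)))
    return points

# Usage with a few made-up P-values:
for expected, observed in qq_points([0.5, 0.001, 0.2, 0.04]):
    print(round(expected, 3), round(observed, 3))
```

Markers whose observed value rises well above the diagonal are the candidates for true association.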


Shown are the results of applying our multiple-locus methods to a GWAS for Type II Diabetes (T2D) using datasets from the Wellcome Trust Case Control Consortium (WTCCC). The plot zooms in on the LEPR gene region. Individual marker analysis p-values are shown for a genotypic test (gray circles), a trend test (gray triangles), and logistic regression (black circles). The results of our multiple-locus method are shown for the MCP penalty (orange), and an individual marker analysis hit from an independent GWAS dataset is also shown (pink square). Genes in this region are indicated by orange lines, with thick lines indicating exons. Numbers indicate physical position in Mb; genetic distances are given in cM from the best 'hit', which is indicated by a '*'. (Hoffman et al. in prep).
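As background on the MCP (minimax concave penalty) mentioned in the caption, the sketch below gives the standard textbook form of the penalty and its single-coefficient solution for a standardized predictor. This is not the code used for the analysis shown; the function names and the values of `lam` and `gamma` are hypothetical and chosen only for illustration.

```python
def mcp_penalty(beta, lam, gamma):
    """MCP value for one coefficient (gamma > 1): lasso-like near zero, flat for large |beta|."""
    a = abs(beta)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam

def mcp_threshold(z, lam, gamma):
    """Minimizer of 0.5*(z - b)**2 + mcp_penalty(b, lam, gamma) over b, for gamma > 1."""
    if abs(z) > gamma * lam:
        return z  # large signals are left unshrunk
    if abs(z) <= lam:
        return 0.0  # small signals are set exactly to zero
    sign = 1.0 if z > 0 else -1.0
    return sign * (abs(z) - lam) / (1.0 - 1.0 / gamma)

# Usage: compare how small, moderate, and large signals are treated.
for z in (0.1, 0.4, 1.0):
    print(z, mcp_threshold(z, lam=0.2, gamma=3.0))
```

Unlike the lasso, which shrinks every nonzero coefficient by the same amount, MCP relaxes the shrinkage as the signal grows, so strong associations are estimated with less bias while weak ones are still removed.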