SupportMix is a machine learning algorithm for determining the ancestral origin of genomic segments when analyzing individuals from a population with a recent or ancient history of admixture. The method scales efficiently to the analysis of dense genetic data while simultaneously considering 50-100 putative ancestral populations. The details of the methodology are described by Omberg et al. (2012). SupportMix is still under development, so please check back for updates. For performance or other issues, contact the author, Dr. Larsson Omberg (lom at larssono.com). Please follow this link to download SupportMix (ver. Jul. 18, 2012). Update July 18, 2012: A bug related to plot labeling was found in SupportMix; please download the updated version.


Puma (Penalized Unified Multiple-locus Association) comprises a family of statistical methods designed to identify weak associations in genome-wide association studies that are not detectable by conventional analytical methods. Puma uses regularized multiple regression in a penalized maximum likelihood framework, based on a generalized linear model, to simultaneously consider tens to hundreds of thousands of genetic markers in a single statistical model. These methods can handle both case/control and continuous phenotypes and are optimized to efficiently handle very large datasets. Puma is currently under active development, and new extensions and modifications will be posted in the near future. Publications describing Puma are forthcoming. Please follow this link to download Puma.
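To give a feel for what "regularized multiple regression" means in practice, here is a minimal sketch of one common member of that family: L1-penalized (lasso) linear regression fitted by cyclic coordinate descent. This is not the Puma implementation. The choice of an L1 penalty, the squared-error loss (a continuous phenotype only), and the function names `soft_threshold` and `lasso_coordinate_descent` are all assumptions made for illustration.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma (the proximal operator of the L1 penalty)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of n rows, each a list of p marker values; y is a list of n
    phenotype values. No intercept is fitted, so center y and the columns of X
    beforehand. Illustrative sketch only, not the Puma code.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta, since beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # monomorphic marker carries no information
            # Covariance of marker j with the residual that excludes j's own fit
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


# Toy data: three centered, mutually orthogonal markers for four individuals.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
# Phenotype driven strongly by marker 0 and only weakly by marker 1.
y = [2.0 * row[0] + 0.2 * row[1] for row in X]
beta = lasso_coordinate_descent(X, y, lam=0.5)
print(beta)
```

In this toy example the penalty shrinks the coefficient of the strong marker and sets the weak and null markers to exactly zero, which is the behavior that lets all markers be considered in a single model.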


Hidden Expression Factor analysis (HEFT) is a combined multivariate regression and factor analysis method that identifies individual and pleiotropic effects of eQTL in the presence of unmeasured covariates / hidden factors. HEFT is a likelihood approach that learns the structure of hidden factors from multivariate gene expression data and makes use of a ridge estimator for simultaneous factor learning and detection of eQTL. HEFT requires no pre-estimation of hidden factor effects, provides p-values, and is fast enough to complete an eQTL analysis of thousands of expression variables against hundreds of thousands of SNPs on a standard desktop in under 24 hours. The paper describing HEFT is currently in review. Please follow this link to download HEFT.


Genard corrects for genetic confounding due to population structure and kinship in genome-wide association studies. Using a data-adaptive low-rank linear mixed model, Genard learns the dimensionality of the correction from the data. The software is compatible with PLINK files and can analyze 650,000 markers for ~2000 individuals in ~15 minutes or ~6000 individuals in 2 hours. The paper describing Genard is currently in review. Please follow this link to download Genard.


Vbay is a statistical method designed to identify weak associations in genome-wide association studies that are not detectable by conventional statistical methods. Vbay uses a regularized multiple regression approach in a Bayesian framework in order to simultaneously consider tens to hundreds of thousands of genetic markers in a single statistical model. The method is able to consider both case/control and continuous phenotypes and is optimized to efficiently handle very large datasets. The details of the methodology are described by Logsdon et al. (2010) and publications describing further details are forthcoming. Please follow this link to download Vbay.


ELMM (Empirical Light Mutual Min) is designed to infer large regulatory networks composed of dense and hub-like structures from gene expression or related genome-wide data. The ELMM algorithm is a constraint-based learning method for recovering undirected conditional independence graphs and includes three innovations aimed at improving scaling and accuracy: (1) ranking edges by a joint evaluation of conditional independence tests at both sides of the edges, (2) use of an empirical Bayes approach for estimating independence testing parameters, and (3) an adaptive relaxation of independence constraints in dense regions of a graph to avoid multiple testing problems. The publication describing ELMM has been submitted. Please follow this link to download the ELMM software.
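For readers unfamiliar with conditional independence graphs, the sketch below shows the basic idea behind removing an edge: two expression variables can be correlated marginally yet become uncorrelated once a third variable is conditioned on. It uses a first-order partial correlation, which is a generic textbook statistic and an assumption here; the tests, edge ranking, and empirical Bayes estimation actually used by ELMM are not reproduced. The function names `pearson` and `partial_correlation` are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


# Toy data: gene_x and gene_y are both driven by a shared regulator plus
# their own independent noise, so they have no direct link.
regulator = [1.0, 1.0, -1.0, -1.0]
noise_x = [1.0, -1.0, 1.0, -1.0]
noise_y = [1.0, -1.0, -1.0, 1.0]
gene_x = [r + e for r, e in zip(regulator, noise_x)]
gene_y = [r + e for r, e in zip(regulator, noise_y)]

marginal = pearson(gene_x, gene_y)
conditional = partial_correlation(gene_x, gene_y, regulator)
print(marginal, conditional)
```

Here the marginal correlation between the two genes is clearly nonzero, while the partial correlation given the regulator is essentially zero, so a constraint-based method would keep the regulator edges and drop the direct gene-to-gene edge.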


Expanding on the original Vbay model described in Logsdon et al. (2010), VbayNetwork applies sparse regression modeling to the problem of generating networks among gene expression phenotypes and genotypes, while allowing for correction for possible confounding factors. Specifically, it is designed to reconstruct sparse statistical networks among ultra-high dimensional phenotypes and genotypes with very low Type I error rates for individual edges within the network. The details of the methodology are provided in Logsdon et al. (2012). Please follow this link to download VbayNetwork.


The previously posted versions of our software LOCate and EXPLoRE have been deprecated. Please email Dr. Mezey at jgm45 at cornell dot edu to access the old versions of this software. Updated versions will be released soon.


This VCF file includes genotypes for 100 Qatari exomes at 132,303 sites, including SNPs and short indels. For more information on the genotyping and quality filtering protocol, see Rodriguez-Flores et al. (Human Mutation, in review). Please follow this link to download the VCF file and code for calculating Fst in R.
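As a rough illustration of what an Fst calculation on such genotypes involves, the sketch below computes a per-site Wright's Fst for two populations from VCF-style GT strings. It is not the R code linked above, and the estimator that code uses is not stated here; the simple ratio-of-heterozygosities form with equally weighted populations, the Python language, and the function names `alt_allele_frequency` and `fst_biallelic` are all assumptions.

```python
def alt_allele_frequency(genotypes):
    """Alternate-allele frequency from VCF GT strings such as '0/1' or '1|1'.

    Missing alleles ('.') are skipped. Any non-reference allele counts as
    alternate, so multi-allelic sites are collapsed to two classes.
    """
    alt = total = 0
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue
            total += 1
            if allele != "0":
                alt += 1
    return alt / total if total else 0.0


def fst_biallelic(p1, p2):
    """Wright's Fst at one site from two population allele frequencies (equal weights)."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0.0:
        return 0.0  # monomorphic in the pooled sample; Fst is undefined, reported as 0
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


# Hypothetical genotype calls at a single site for two populations.
pop_a = ["0/0", "0/1", "0/0", "0/0", "0/1"]
pop_b = ["1/1", "1/1", "0/1", "1|1", "./."]
p_a = alt_allele_frequency(pop_a)
p_b = alt_allele_frequency(pop_b)
print(p_a, p_b, fst_biallelic(p_a, p_b))
```

A genome-wide estimate is usually built by combining sites (for example, a ratio of summed numerators to summed denominators) rather than averaging per-site values, and sample-size corrections matter for small samples; neither refinement is shown in this sketch.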


The Drosophila Genetic Reference Panel (DGRP) is a set of inbred lines with genomes that have been sequenced at high coverage using next-generation sequencing technology. We are developing software and other resources to enable genome-wide association study (GWAS) analysis of these lines. For those at Cornell interested in available resources and links, including files with the official and alternative SNP calls for these lines in PLINK format and UCSC Browser tracks available through Dr. Adam Siepel's mirror, please visit our software Wiki.