Quantitative Genomics and Genetics

Computer Lab 11

– 13 November 2014

– Author: Jin Hyun Ju (jj328@cornell.edu)

Epistasis and the Expectation Maximization (EM) algorithm


Today we are going to run an analysis to investigate epistasis from a toy dataset. You can think of it as a simple extension to our all-to-familiar linear regression framework. Here are the things that are different from the linear regression (logistic regression) form that we have been dealing with.

  • We are considering multiple (in this case 2) genotypes at the same time in a single model. (Xa.1,Xd.1 and Xa.2 Xd.2)

  • Additional to the simple additive effect we have parameters for interactions (shown as Xa.1 * Xa.2 etc… )

  • We might want to check our hypothesis against a different null hypothesis. Previously, we were interested if a certain genotype had and additive or a dominant effect on the phenotype so we only had to check if \(\beta_a\) and \(\beta_d\) gave us a significantly better model fit compared to a model with only an intercept. In this case, we might want to compare if the addition of an interaction term made our model fit better compared to a model with the non-interaction terms included. (comparing the epistasis model to a null hypothesis that looks like this: \(Y = \beta_{\mu} + X_{a1}\beta_{a1} + X_{a2}\beta_{a2} + X_{d1}\beta_{d1} + X_{d2}\beta_{d2} + \epsilon\))

In order to compare our model to a null hypothesis that is different from the default setting of lm() - the null with only an intercept - we can use a function called anova(). You could of course calculate the likelihoods for each model by writing the equations out, but since we are passed that stage let’s take the short cut and use lm() to make our lifes easier. The code to do this will look something like this:

L0 <- lm(y ~ Xa1 + Xd1 + Xa2 + Xd2 )
L1 <- lm(y ~ Xa1 + Xd1 + Xa2 + Xd2 + Xa1*Xa2 + Xa1*Xd2 + Xa2*Xd1 + Xd1*Xd2)



I simulated a toy dataset for this analysis which can be downloaded from the class website. (QG14_Lab11_Epistasis_files.tar.gz) You will find a genotype file and a phenotype file which contain 2 genotypes and 2 phenotypes each. The genotypes have been converted to the Xa dummy coding for you so you only have to generate a Xd matrix from them.

  • Try to run lm() with the following model on both phenotypes and see if it turns out to be significant or not : lm(y ~ Xa1 + Xd1 + Xa2 + Xd2 + Xa1Xa2 + Xa1Xd1 + Xa2Xd2 + Xd1Xd2)

  • Check if the lm() results are identical if you use L0 <- lm(y ~ X.mu) and run anova(L0,L1).

  • Using the approach shown above (different L0), run the analysis on both phenotypes and see if the significance for the model changes.

  • Can you tell me which phenotype seems to have an epistasis effect?

  • What would be your best guess about how the model was generated?

Extra: EM algorithm

I am including a mini exercise about the EM algorithm for those of you who are interested. In class we learned about mixed models and the EM algorithm to estimate the parameter values for the mixed model. The process of the EM algorithm is outlined below. Basically, what happens is that it fixes the parameters (betas, and sigmas) for the first step to calculate the best estimate for a and Va and in the next step it uses the values calculated for a and Va to update the betas and the sigmas until the values don’t change much (reach convergence).

The actual R code to run an EM algorithm on the dataset that is posted on the website (QG14_Lab11_EM_files.tar.gz) is shown below. Here are some mini tasks regarding this exercise:

  • Take a look at the given data and inspect the dimensions of each.

  • For the given code, include a line that prints out the log likelihood for each iteration (Ex. iteration = 1, loglikelihood = 500.123). Also plot the log likelihood for each iteration in a line. What does it look like?

  • Compare the results of the EM algorithm to the results of a simple linear fixed effect model and see what is different.

X = as.matrix(read.table("QG14_Lab11_EM_X.txt"))
Y = as.matrix(read.table("QG14_Lab11_EM_Y.txt"))
A = as.matrix(read.table("QG14_Lab11_EM_A.txt"))

EM_algorithm = function(Y, X_j, A){

  # Calculate the inverse of A once since it is used repeatedly in the algorithm
    # This method is faster than solve(A)
    solve_A = chol2inv(chol(A))

    n = length(Y)

    I = diag(1, n)

    log_L = c()
    # set starting values
    sigma_sq_a = 70
    sigma_sq_e = 10
    beta = as.vector(rep(0, ncol(X_j)))

    C = A * sigma_sq_a + I * sigma_sq_e
    log_L[1] = -1/2 * determinant(C)$modulus - 1/2 * t(Y - X_j %*% beta) %*% chol2inv(chol(C)) %*% (Y - X_j %*% beta)
    iter = 2

        S = chol2inv(chol(I + solve_A * sigma_sq_e / sigma_sq_a))
        alpha = S %*% (Y - X_j %*% beta)

        V = S  * sigma_sq_e

        beta = chol2inv(chol(t(X_j) %*% X_j)) %*% t(X_j) %*% (Y - alpha)

        # use as.numeric() to make sure value is saved as a scalar
        sigma_sq_a = as.numeric(1/n * (t(alpha) %*% solve_A %*% alpha + sum(diag(solve_A %*% V))))

        # use as.numeric() to make sure value is saved as a scalar
        sigma_sq_e = as.numeric( 1/n * ( t(Y - X_j %*% beta - alpha) %*% (Y - X_j %*% beta - alpha) + sum(diag(V))))

        C = A * sigma_sq_a + I * sigma_sq_e
        log_L[iter] = -1/2 * determinant(C)$modulus - 1/2 * t(Y - X_j %*% beta) %*% chol2inv(chol(C)) %*% (Y - X_j %*% beta)

        if(log_L[iter] - log_L[iter-1] < 1e-5){

        iter = iter + 1

    return(list(beta = beta, 
            sigma_sq_a = sigma_sq_a, 
            sigma_sq_e = sigma_sq_e, 
            log_L = log_L[iter-1]))


# Mixed model #

n_indivs = length(Y)

# Null model
One = as.matrix(rep(1, n_indivs))
log_L_null = EM_algorithm(Y, One, A)$log_L

p_values_EM = c()

# Full model
for(j in 1:ncol(X)){

        X_j = cbind(1, X[,j])

        fit = EM_algorithm(Y, X_j, A)

        p_values_EM[j] = pchisq(-2 * (log_L_null - fit$log_L), 1, lower.tail=FALSE)