Quantitative Genomics and Genetics

Computer Lab 8

– 23 October 2014

– Author: Jin Hyun Ju (jj328@cornell.edu)


Logistic Regression


Today we are going to implement our first “algorithm” for logistic regression. Unlike linear regression, where we modeled a given phenotype value Y directly, logistic regression models the probability of Y belonging to a certain category. In most cases two categories are used, coded by the dummy variables 0 and 1. For example, in genome-wide association studies a control phenotype (ex. healthy, normal) would be 0 and a disease phenotype (ex. diabetes, Alzheimer’s, etc.) would be 1. The goal in such studies is to identify genomic variations that increase the probability of belonging to the disease category.

If you take a look at the form of logistic regression, you can see that the transformation confines all possible outcomes to the interval between 0 and 1.

\[ P(Y = 1 \mid X) = \gamma^{-1}(\beta_{\mu} + X_a\beta_a + X_d\beta_d) = \frac{e^{\beta_{\mu} + X_a\beta_a + X_d\beta_d}}{1 + e^{\beta_{\mu} + X_a\beta_a + X_d\beta_d}} \]

A simple visualization in R might give us a better idea of how the curve behaves:

x <- seq(from = -10, to = 10, by = 0.1)
y <- 1 / (1 + exp(-x))

plot(x, y, main = "logistic regression curve", type = "b")


Let’s project the settings of our problem onto the above equation and clarify our goal. We have a given phenotype that takes a value of either 1 or 0, and two genotype matrices: Xa with (-1, 0, 1) coding and Xd with (-1, 1) coding. Just as in linear regression, the goal is to find the values of \(\beta_{\mu}\), \(\beta_a\) and \(\beta_d\) for each genotype that best explain the relationship between genotype and phenotype. If the relationship were error free and the genotype directly predicted the phenotype (for example, if A2A2 indicated phenotype = 1 with 100% certainty), we would not need logistic regression. However, that is rarely the case in real-world genetics/genomics, so we have to model the relationship between genotypes and phenotypes “softly” using probabilities (in other words, A2A2 has a higher chance of having phenotype = 1 than phenotype = 0), and that is exactly what the transformation in the above equation does.

So why do we need an algorithm to solve this problem if it takes almost the same form as linear regression? The reason is that there is no “closed-form” solution for the maximum likelihood estimators in logistic regression. In linear regression we had the closed-form solution \(MLE(\hat{\beta}) = (X^TX)^{-1}X^TY\), but the transformation used in logistic regression rules out such a simple solution. Instead we take an “iterative” approach: the algorithm starts at a given point and keeps looking for a better solution in subsequent steps, until the better solution is almost identical to the solution from the previous step.

Imagine that you are on a mountain in complete darkness, and the only information you have is your current altitude, which you can check every 5 minutes. Your goal is to get to the highest point you can reach and shoot up a flare to call for help. The optimal strategy would likely be to pick a direction based on the slope of the ground you are standing on, walk that way for 5 minutes, and then check your altitude to confirm that you actually went uphill and not downhill. When you are close to (or on) the top, the altitude will barely change after 5 minutes of walking, and that is probably your best spot for shooting the flare. This is essentially what is going on in the algorithm we are implementing.
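Concretely, the template below implements iteratively re-weighted least squares (IRLS). As a sketch of the update (double-check it against the class “useful equations”), each iteration moves from \(\beta^{[t]}\) to

\[ \beta^{[t+1]} = \beta^{[t]} + \left(X^TWX\right)^{-1}X^T\left(Y - \gamma^{-1}\left(X\beta^{[t]}\right)\right) \]

where \(\gamma^{-1}(\cdot)\) is the logistic transformation from the equation above and \(W\) is a diagonal matrix with \(W_{jj} = \gamma^{-1}(X_j\beta^{[t]})\left(1 - \gamma^{-1}(X_j\beta^{[t]})\right)\). The “almost identical” stopping rule is implemented by checking the change in deviance between iterations.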

Exercise

  1. Download the phenotype and genotype files from the class website and read them in (see the sketch after this list). You should have 292 genotypes and 1 phenotype for 107 samples.

  2. Note that the genotypes are already in Xa coding, so you only have to create the Xd matrix from it.

  3. Use the template given below and try to fill in the code to make it a functional algorithm.

  4. Plot a Manhattan plot for the phenotype and look for significant peaks.

  5. Your output should look something like this:

(Example output: Manhattan plot of -log10 p-values with the significance cutoff line.)

  6. You can also visualize the individual genotype effect by using the package ggplot2 with the option position = position_jitter(w = 0.1, h = 0.1).

(Example output: jittered scatterplots of Xa vs. Phenotype and Xd vs. Phenotype.)
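A minimal sketch for steps 1 and 2, assuming the downloaded files are CSVs named genotypes.csv and phenotypes.csv (hypothetical names; substitute the actual file names from the class website):

# hypothetical file names -- replace with the actual files from the class website
Xa <- as.matrix(read.csv("genotypes.csv", header = TRUE))  # already in (-1, 0, 1) coding
Y <- as.matrix(read.csv("phenotypes.csv", header = TRUE))  # 0/1 case-control phenotype

# Xd coding: heterozygotes (Xa == 0) become 1, homozygotes (Xa == -1 or 1) become -1
Xd <- 1 - 2 * abs(Xa)

dim(Xa)   # should be 107 x 292
length(Y) # should be 107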

logistic.IRLS <- function(Xa, Xd, Y, beta.initial.vec = c(0,0,0), d.stop.th = 1e-6, it.max = 100) {

    # Create the X matrix
    X.mx <- # just like in linear regression

    N <- # get the sample size

    # initialize the beta.t parameters for t = 0
    beta.t <- beta.initial.vec

    # initialize the deviance at t = 0
    dt <- 0

    # initialize gamma
    # K is the part that goes into the exponent
    K <- # calculate the part that goes into the exponent

    # calculate gamma_inv
    gamma_inv <- # exp() / (1 + exp())

    for(i in 1:it.max) {
        # store the previous deviance
        dpt1 <- dt

        # create an empty matrix W
        W <- matrix(0, N, N)

        # fill in the diagonal of W with the appropriate values
        for(j in 1:N){
            W[j,j] <- # look up the useful equations to figure out what goes in here
        }

        beta.t <- # calculate the updated value of beta.t

        # update gamma since it is a function of beta
        K <- # new K
        gamma_inv <- # new gamma_inv

        # calculate the new deviance
        dt <- # find the equation in the useful equations
        # Note that one part of the deviance, Y * log(Y / gamma_inv) or
        # (1 - Y) * log((1 - Y) / (1 - gamma_inv)), will take the form -Inf
        # for some Y and might cause an error. Specify which Y goes into which
        # part of the calculation to avoid this.

        absD <- # absolute difference in deviance

        if(absD < d.stop.th) {
            cat("Convergence at iteration:", i, "at threshold:", d.stop.th, "\n")
            logl <- # log likelihood goes here
            return(list(beta.t, logl)) # return a list containing beta.t and logl
        }
    }

    # in case the algorithm did not converge
    cat("Convergence not reached after iteration:", i, "at threshold:", d.stop.th, "\n")
    return(list(beta.t = c(NA,NA,NA), logl = NA)) # return NA values
}
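If you want to check your work, here is one way the completed function might look. Treat it as a sketch rather than the official solution: it assumes the IRLS update sketched above, and it avoids the -Inf problem by splitting the deviance sum over the Y = 1 and Y = 0 cases. It is named logistic.IRLS.example so that it does not overwrite your own version.

logistic.IRLS.example <- function(Xa, Xd, Y, beta.initial.vec = c(0,0,0), d.stop.th = 1e-6, it.max = 100) {

    # X matrix: a column of 1s for the intercept plus the Xa and Xd codings
    X.mx <- cbind(rep(1, length(Y)), Xa, Xd)
    N <- length(Y)

    beta.t <- beta.initial.vec
    dt <- 0

    # logistic transformation: gamma_inv = exp(K) / (1 + exp(K)) with K = X %*% beta
    K <- X.mx %*% beta.t
    gamma_inv <- exp(K) / (1 + exp(K))

    # deviance, with the sum split over Y = 1 and Y = 0 to avoid -Inf terms
    calc.dev <- function(Y, gamma_inv) {
        2 * (sum(log(1 / gamma_inv[Y == 1])) + sum(log(1 / (1 - gamma_inv[Y == 0]))))
    }

    for(i in 1:it.max) {
        dpt1 <- dt

        # diagonal weight matrix: W[j,j] = gamma_inv[j] * (1 - gamma_inv[j]);
        # diag() fills the diagonal in one call, equivalent to the loop in the template
        W <- diag(as.vector(gamma_inv * (1 - gamma_inv)))

        # IRLS update for beta
        beta.t <- beta.t + solve(t(X.mx) %*% W %*% X.mx) %*% t(X.mx) %*% (Y - gamma_inv)

        # update gamma_inv since it is a function of beta
        K <- X.mx %*% beta.t
        gamma_inv <- exp(K) / (1 + exp(K))

        # new deviance and the absolute change from the previous iteration
        dt <- calc.dev(Y, gamma_inv)
        absD <- abs(dt - dpt1)

        if(absD < d.stop.th) {
            cat("Convergence at iteration:", i, "at threshold:", d.stop.th, "\n")
            logl <- sum(Y * log(gamma_inv) + (1 - Y) * log(1 - gamma_inv))
            return(list(beta.t = beta.t, logl = logl))
        }
    }

    cat("Convergence not reached after iteration:", i, "at threshold:", d.stop.th, "\n")
    return(list(beta.t = c(NA,NA,NA), logl = NA))
}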


G <- dim(Xa)[2]
logl <- vector(length = G)

for(j in 1:G){

    result.list <- # call our function on the j-th genotype
    logl[j] <- # how do we extract an element from a list? You might want to use [[ ]]

}
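Filled in, the loop might look like the following (using the example function above; the per-genotype log likelihoods are what we need for the likelihood ratio test below):

G <- dim(Xa)[2]
logl <- vector(length = G)

for(j in 1:G){
    result.list <- logistic.IRLS.example(Xa[, j], Xd[, j], Y)
    logl[j] <- result.list$logl # equivalently, result.list[[2]]
}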




beta.mu.null <- # what is beta.mu under the null hypothesis?

# get the gamma_inv for the null hypothesis
gamma_inv_null <- # calculate the gamma_inv for the null

# calculate the log likelihood under the null
logl_null <- # same log likelihood formula, but with gamma_inv_null

# calculate the likelihood ratio test statistic
LRT <- # compare the log likelihoods of the null and alternative models

pval <- pchisq(LRT, 2, lower.tail = FALSE) # chi-squared test with 2 degrees of freedom

# plot the Manhattan plot with a cutoff line
plot(-log10(pval))
abline(h = -log10(0.05/292), col = "red") # Bonferroni cutoff for 292 tests
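One way to fill in the test, assuming an intercept-only null model (whose maximum likelihood fit assigns every sample the probability mean(Y), so beta.mu.null is the corresponding log odds):

# null model: intercept only, so the fitted probability is mean(Y) for every sample
beta.mu.null <- log(mean(Y) / (1 - mean(Y)))
gamma_inv_null <- exp(beta.mu.null) / (1 + exp(beta.mu.null)) # equals mean(Y)

# log likelihood under the null (same formula as in the IRLS function)
logl_null <- sum(Y * log(gamma_inv_null) + (1 - Y) * log(1 - gamma_inv_null))

# likelihood ratio test statistic: 2 df because the alternative adds beta_a and beta_d
LRT <- 2 * (logl - logl_null)
pval <- pchisq(LRT, 2, lower.tail = FALSE)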



# Visualizing the effect of a single genotype
library(ggplot2)

i <- which.min(pval) # pick the most significant genotype to plot
plot.df <- data.frame("Phenotype" = Y, "Xa" = Xa[, i], "Xd" = Xd[, i])
jitter.plot1 <- ggplot(plot.df, aes(x = Xa, y = Phenotype)) + geom_point(position = position_jitter(w = 0.1, h = 0.1)) + ggtitle("Xa vs Phenotype")
jitter.plot2 <- ggplot(plot.df, aes(x = Xd, y = Phenotype)) + geom_point(position = position_jitter(w = 0.1, h = 0.1)) + ggtitle("Xd vs Phenotype")
print(jitter.plot1)
print(jitter.plot2)