## Quantitative Genomics and Genetics

### Computer Lab 9

– 30 October 2014

– Author: Jin Hyun Ju (jj328@cornell.edu)

## QQ plots and lm()

### lm() the easy way of getting linear regression done

Now that we have coded up linear regressions from scratch multiple times, we can move on to simpler ways. The most common way to do linear regressions in R is through the function lm(), which stands for linear model. Let’s use the mini Hapmap dataset that we used in lab 7.

phenotypes <- read.table("./HapMap_phenotypes.tsv", header = T)

Xa.all <- as.matrix(genotypes)
Xd.all <- 1- 2*abs(Xa.all)

There are multiple ways to call lm() with a specific model and we will look at two ways of doing that. The first way is to generate a dataframe and using the structure of the dataframe to specify the model. In our case, we are interested in regressing the phenotypes against the dummy variables Xa and Xd and the expression inside lm is going to be something like this:

lm(Y ~ Xa + Xd)

This is saying that Y should be the dependent variable, and Xa and Xd should be the indepenedent variables. So to call this for a single genotype and single phenotype we would call lm() like this:

regression.df <- data.frame("Y" = phenotypes[,1], "Xa" = Xa.all[,1], "Xd" = Xd.all[,1])

lm(Y ~ Xa + Xd, data = regression.df)

Call:
lm(formula = Y ~ Xa + Xd, data = regression.df)

Coefficients:
(Intercept)           Xa           Xd
6.9474      -0.0178       0.0144  

The data = regression.df part is telling lm() to look for Y, Xa and Xd values in the specified dataframe. Another way of using lm is to use the objects in the workspace as they are (which is a less favorable way since the code gets a bit messier).

lm(phenotypes[,1] ~ Xa.all[,1] + Xd.all[,1])

Call:
lm(formula = phenotypes[, 1] ~ Xa.all[, 1] + Xd.all[, 1])

Coefficients:
(Intercept)  Xa.all[, 1]  Xd.all[, 1]
6.9474      -0.0178       0.0144  

You can see that the outputs are cleaner in the former call, so let’s just stick with the first way of calling lm(). Anyways, that was easy, but it is only showing us the estimated values for each parameter (beta_mu = intercept, beta_a = Xa, beta_d = Xd). How can we get to the p-value of the whole model? Before we get into that let us do a sanity check and see if the values are identical from those which we will get by calculating them using the equations that we leanred in class.

library(MASS)

X.mx <- cbind(1, Xa.all[,1],Xd.all[,1])
Y <- as.matrix(phenotypes[,1])

MLE_beta <- ginv(t(X.mx) %*% X.mx) %*% t(X.mx) %*% Y
MLE_beta
         [,1]
[1,]  6.94741
[2,] -0.01778
[3,]  0.01442

It looks like they are almost identical (probably so because the lm function prints out the values rounded up to the 5th decimal point) ! Now that we have checked that it is indeed doing the same thing, let’s try to get all the other values like the f statistics and p-values as well. You could do it by saving the lm() call into a variable and use the values from that object.

linear.fit <- lm(Y ~ Xa + Xd, data = regression.df)

#str(linear.fit)

You can see that the result object has a lot more values saved than it actually prints out. A simple way of getting to the values that we need is to use the summary() function on the linear.fit object.

summary(linear.fit)

Call:
lm(formula = Y ~ Xa + Xd, data = regression.df)

Residuals:
Min       1Q   Median       3Q      Max
-0.16871 -0.08484  0.00025  0.07644  0.20903

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.9474     0.0137  506.41   <2e-16 ***
Xa           -0.0178     0.0230   -0.77     0.44
Xd            0.0144     0.0137    1.05     0.30
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0986 on 104 degrees of freedom
Multiple R-squared:  0.0518,    Adjusted R-squared:  0.0336
F-statistic: 2.84 on 2 and 104 DF,  p-value: 0.0627

This is giving us way more information than the lm output. The call is showing us which model we have used, the residuals field is showing us the summary of the residuals, the coefficients field is showing us the parameter estimates and p values for each of the parameters, and at the end we can see the f-statistic degrees of freedoms and the p-value for the whole model. Let see how we can access these values. If you save the summary output to a variable you can see that it also has a lot of information stored inside. The values that we are interested can be found in the fstatistic list and coefficients list.

lm.summary <- summary(linear.fit)

#str(lm.summary)

# for parameter estimates and their individual p-values
lm.summary$coefficients   Estimate Std. Error t value Pr(>|t|) (Intercept) 6.94741 0.01372 506.4131 3.166e-178 Xa -0.01778 0.02295 -0.7745 4.404e-01 Xd 0.01442 0.01372 1.0510 2.957e-01 # for f-statistics and degrees of freedom lm.summary$fstatistic
  value   numdf   dendf
2.844   2.000 104.000 

For some unknown reason (which is still a mystery to me) summary() will show you all the f-statistics and the model p-value, but it only saves the f-statistics in the object so you still have to calculate the p-value yourself if you want to use it. However, we have been doing this for a couple of times so we already know how to do it :) To get the p-value for the model we can simply do this:

fstat <- lm.summary\$fstatistic

pvalue <- pf(fstat[1],fstat[2],fstat[3], lower.tail = FALSE)

### QQ plots

Despite the not-so-intuitive name, QQ-plots are really useful in inspecting the quality of the results from a large scale study. QQ-plots are generally created for each phenotype, so in a genome-wide association study setting where you have P phenotypes and G genotypes, you would have P QQ-plots. So with this dataset we will have 10 QQ plots with 400 point each. For example, with a p-value vector for phenotype 1 we can simply generate a QQ plot like this :

# you would have to generate the p-value vector !
expected.pvals <- -log10(seq(from=0,to=1, length.out = length(pval.vec)))
observed.pvals <- -log10(pval.vec)

expected.sorted <- sort(expected.pvals)
observed.sorted <- sort(observed.pvals)

plot(expected.sorted, observed.sorted, main = "QQ plot for phenotype 2")
abline(a = 0,b = 1, col = "red", lwd = 1)

Now that we learned how to use lm() to get p-values for a linear model let’s implement it !

## Exercise

• Use the dataset from computer lab 7 to generate a 10 x 400 p-value matrix for each phenotype genotype pair using lm.

• Plot 10 QQ-plots for each phenotype

• Do you think the hit is real? or do you think the quality of the hit is questionable?