– 10 March 2016

– Author: Jin Hyun Ju (jj328@cornell.edu)

We began to cover genome-wide association analyses (GWAS) in class, and this lab covers the basic data handling practices for GWAS.

The two major types of data you are going to be dealing with in GWAS are genotypes (sequence variants) and phenotypes (qualitative or quantitative).

Let’s begin with some genotype data.

**Exercise 1**

- Read in the data saved in the genotype data file.

```
geno_import <- read.csv("./genotype_data.csv",
                        header = TRUE,
                        stringsAsFactors = FALSE,
                        row.names = 1)
```

- This file contains SNP information for N individuals and G positions. Each pair of columns holds the two alleles at a given position (one for each chromosome). For example, if the first row has “A” in column 1 and “T” in column 2, this individual has a genotype of “AT” for the first position.

Can you tell me how many samples and how many genotype positions there are?

Can you identify a problem with the data, and guess why this happens?
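One way to approach both questions is to look at the dimensions and the column classes of the imported data frame. The sketch below uses a small made-up CSV string (`toy_csv`, not the lab data) so that it runs on its own, and it assumes the layout described above: row names in the first column and two allele columns per position. Whether the same behavior shows up in the real file is something to check with `str(geno_import)`.

```r
# Hypothetical stand-in for genotype_data.csv: 3 individuals, 2 positions
toy_csv <- "id,p1_a,p1_b,p2_a,p2_b
s1,A,T,T,T
s2,A,A,T,T
s3,T,A,T,T"

toy_geno <- read.csv(text = toy_csv, header = TRUE,
                     stringsAsFactors = FALSE, row.names = 1)

n_samples   <- nrow(toy_geno)      # one row per individual
n_positions <- ncol(toy_geno) / 2  # two allele columns per position

# Inspect the column classes: a column that contains only "T"
# is guessed to be logical (TRUE) by read.csv
str(toy_geno)

# One possible workaround: force every column to be read as character
toy_fixed <- read.csv(text = toy_csv, header = TRUE,
                      row.names = 1, colClasses = "character")
```

Forcing `colClasses = "character"` is only one option; converting the affected columns after import would work as well.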

**Exercise 2**

Now that we have loaded the data into R, we would like to prepare it for analysis.

The genotypes are currently stored as characters, which we cannot use directly in an analysis, so we first want to convert them into numbers.

In class we learned two methods for creating genotype dummy variables, the additive version (Xa) and the dominance version (Xd).

- How would you convert the character matrix of individual SNPs into dummy variables?
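A minimal sketch of one possible conversion is shown here. It assumes the additive coding Xa = -1, 0, 1 for zero, one, or two copies of the minor allele, and it assumes the column layout from Exercise 1 (two allele columns per position). The function name `geno_to_xa` and the toy matrix are made up for illustration; check the coding against your class notes before relying on it.

```r
# Convert a character allele matrix (2 columns per position) into Xa,
# assuming Xa = (number of minor alleles) - 1, i.e. -1, 0, or 1
geno_to_xa <- function(geno) {
  geno <- as.matrix(geno)
  n_pos <- ncol(geno) / 2
  xa <- matrix(NA_real_, nrow = nrow(geno), ncol = n_pos)
  for (j in seq_len(n_pos)) {
    pair <- geno[, c(2 * j - 1, 2 * j), drop = FALSE]
    allele_counts <- table(pair)
    # Least frequent allele at this position; for a position with a
    # single allele this is that allele, so every Xa value becomes 1
    minor <- names(allele_counts)[which.min(allele_counts)]
    xa[, j] <- rowSums(pair == minor) - 1
  }
  xa
}

# Toy example: 4 individuals, 1 position, "T" is the minor allele
toy_geno <- matrix(c("A", "A",
                     "A", "T",
                     "T", "T",
                     "A", "A"), ncol = 2, byrow = TRUE)
toy_xa <- geno_to_xa(toy_geno)
```

If the two alleles are equally frequent, `which.min()` picks the first one in alphabetical order, which is an arbitrary but consistent choice.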

**Exercise 3**

Calculate the Xd values for the genotypes.

If we need both Xa and Xd every time, how would you generate the Xd values?
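If Xa is already available, Xd can be derived from it without going back to the character data. The sketch below assumes the dominance coding Xd = -1 for both homozygotes and Xd = 1 for heterozygotes, with Xa coded as -1, 0, 1; the toy matrix `toy_xa` is made up for illustration.

```r
# Toy Xa matrix: 3 individuals (rows), 2 positions (columns)
toy_xa <- matrix(c(-1, 0, 1,
                    0, 0, -1), ncol = 2)

# Heterozygotes (Xa = 0) map to 1, homozygotes (Xa = -1 or 1) map to -1
toy_xd <- 1 - 2 * abs(toy_xa)
```

Because the operation is vectorized, the same line works on the full Xa matrix.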

**Exercise 4**

Now that we have Xa and Xd codings for genotypes we can test their association with phenotypes.

In this example we are going to use two continuous phenotypes.

```
sim_pheno_mx <- read.csv("./phenotype_data.csv",
                         header = TRUE, row.names = 1)
```

Before we jump into the analysis, it is a good idea to filter out genotypes with a very low minor allele frequency (MAF). Filter out any genotypes with MAF lower than 0.1 (this will include genotypes with a single value, i.e. positions where every individual has the same genotype).
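The filtering step could be sketched as follows, assuming Xa is coded as -1, 0, 1 so that `Xa + 1` counts copies of one allele per individual. The toy matrix and its column names are made up for illustration.

```r
# Toy Xa matrix: 10 individuals, 3 positions
toy_xa <- cbind(
  common      = c(-1, -1, -1, 0, 0, 0, 0, 1, 1, 1),
  monomorphic = rep(-1, 10),
  rare        = c(0, rep(-1, 9))
)

# Frequency of the counted allele: copies / (2 * number of individuals)
allele_freq <- colMeans(toy_xa + 1) / 2

# MAF is the smaller of the two allele frequencies
maf <- pmin(allele_freq, 1 - allele_freq)

# Keep positions with MAF of at least 0.1
keep <- maf >= 0.1
xa_filtered <- toy_xa[, keep, drop = FALSE]
```

The same `keep` vector should be applied to the Xd matrix so that the two codings stay aligned.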

Using the pseudocode below, write a function that returns a p-value for the test where the null hypothesis is that the betas for Xa and Xd are both 0, and the alternative hypothesis is that one or both of the betas are not zero.

You may use lm() only to check that the results you get are correct, not for the actual calculations.

Use the function ginv() from the “MASS” package instead of solve() to calculate the inverse of a matrix.

```
# The pseudo code for this task
MLE.beta <- calculate MLE.beta
y.hat <- calculate the estimated values of y given the MLE.beta values
SSM <- calculate SSM
SSE <- calculate SSE
df.M <- ?
df.E <- ?
MSM <- calculate MSM
MSE <- calculate MSE
Fstatistic <- MSM / MSE
pf(Fstatistic, df.M, df.E, lower.tail = FALSE)
```
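One way the pseudocode might be filled in is sketched below. It assumes a model with an intercept plus Xa and Xd for a single genotype, so the model has 2 degrees of freedom and the error has n - 3. The function name `pval_calculator` and the simulated toy data are made up for illustration, and lm() is used only to check the result.

```r
library(MASS)  # for ginv()

# F-test p-value for H0: beta_a = beta_d = 0 at one genotype
pval_calculator <- function(y, xa, xd) {
  n <- length(y)
  X <- cbind(1, xa, xd)                            # intercept, Xa, Xd
  MLE.beta <- ginv(t(X) %*% X) %*% t(X) %*% y      # (X'X)^-1 X'y
  y.hat <- X %*% MLE.beta
  SSM <- sum((y.hat - mean(y))^2)                  # model sum of squares
  SSE <- sum((y - y.hat)^2)                        # error sum of squares
  df.M <- 2                                        # two tested betas
  df.E <- n - 3                                    # n minus three parameters
  MSM <- SSM / df.M
  MSE <- SSE / df.E
  Fstatistic <- MSM / MSE
  pf(Fstatistic, df.M, df.E, lower.tail = FALSE)
}

# Simulated toy data (not the lab data)
set.seed(1)
toy_xa <- sample(c(-1, 0, 1), 50, replace = TRUE)
toy_xd <- 1 - 2 * abs(toy_xa)
toy_y  <- 0.5 * toy_xa + rnorm(50)

p_manual <- pval_calculator(toy_y, toy_xa, toy_xd)

# Check against lm(), which is allowed for verification only
fs   <- summary(lm(toy_y ~ toy_xa + toy_xd))$fstatistic
p_lm <- unname(pf(fs[1], fs[2], fs[3], lower.tail = FALSE))
```

To scan all positions, the function could be applied column by column to the filtered Xa and Xd matrices, once per phenotype.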