– 7 April 2016

– Author: Jin Hyun Ju (jj328@cornell.edu)

In logistic regression, the dependent variable y (in our case phenotype) is categorical instead of continuous. Specifically, we are going to use logistic regression to deal with binary phenotypes coded as 0 and 1. For example, in genome-wide association studies (GWAS) a healthy or normal control phenotype would be 0 and a disease phenotype (ex. diabetes, alzheimers, etc …) would be 1, and the goal is to identify genomic variations that increase the probability of belonging to the disease category.

You might be wondering why we need this in the first place. So let’s try to use linear regression for a binary phenotype and see what happens.

What is the predicted value of Y for Xa = -1 in this case?

If you plot out the residuals, how would it look like?

It becomes quite clear that we need a better alternative to linear regression for binary phenotypes. Simply put, logistic regression transforms the linear model that we use to “fit” the binary phenotypes. The logistic function takes the form as follows:

If we substitute t with the linear function that depends on x and beta values, the logistic function that we use becomes

A simple visualization in R might give us a better idea of how the data is transformed. Basically, logistic regression confines the original dependent values within the range of 0 and 1.

```
x <- seq(-1,1,by = 0.1)
y_linear <- x * 4
y_logistic <- 1 / ( 1 + exp(-y_linear))
par(mfrow = c(1,2))
plot(x, y_linear, main = "Linear function", type = "l")
abline(h = 0, col = "red", lty = 2)
abline(h = 1, col = "red", lty = 2)
plot(x,y_logistic, main = "logistic regression curve", type ="l", ylim = c(-0.5,1.5))
abline(h = 0, col = "red", lty = 2)
abline(h = 1, col = "red", lty = 2)
```

Let’s project the settings in our problem onto the above equation and clarify our goal. We have a given phenotype that takes either a value of 1 or 0, and two matrices for genotypes in the form of Xa(-1,0,1) coding and Xd(-1,1) coding. Just like in linear regression, the goal is to find the values for \(\beta_{\mu}\), \(\beta_a\) and \(\beta_d\) for each genotype that best explain the relationship between the genotype and phenotype.

If the relationship was error free and the genotype value directly predicts the phenotype, we would not need logistic regression (For example, if A2A2 indicates phenotype = 1 with 100% certainty). However, that is more than often not the case in real world genetics/genomics so we would have to “soft” model the relationship between genotypes and phenotypes by using probabilities, (In other words, A2A2 has a higher chance of having phenotype=1 than phenotype=0) and that is what the transformation given in the above equation is doing.

However, unlike the simple matrix calculation in linear regression for the MLE(beta) values, we don’t have that closed form estimate here.The solution to this problem is an “iterative” approach where the algorithm starts at a given point and keeps looking for a better solution in following steps until the better solution is almost identical to the solution from the previous step.

Algorithms similar to this have a general outline as follows.

An objective (or cost) function. This is a function which represents how well the model fits. For example, in linear regression a lower sum of squared errors (deviation of the predicted phenotypes from the actual phenotypes) represents a better model fit. So the goal of these algorithms would be to minimize (or maximize depending on the situation) the given objective function.

Optimization function. The core of the algorithm which finds the parameter values that minimize the objective function. Most of the algorithms will use methods based on gradients (derivatives) of the objective function to find the direction to update the parameters.

Imagine that you are on a mountain in complete darkness and the you only know the current altitue (objective function) which you can check every 5 minutes and the angle of the ground. The goal for you is to get to the highest point (find the maximum) that you can reach and shoot up a flare to call for help. The optimal strategy for you will likely be to pick a direction to walk for 5 minutes (step size) based on the angle of the ground (derivative of objective function) you are standing on, and check your altitude after 5 minutes to confirm that you actually went uphill not downhill. When you are close to (or on) the top the altitude might not change much after walking for 5 minutes and that might be your best place for shotting a flare. This is kind of what is going on in the algorithm that we are implementing.