Quantitative Genomics and Genetics 2016

Computer Lab 2

– 11 February 2016

– Author: Jin Hyun Ju (jj328@cornell.edu)

[Review] Don’t forget to set your working directory

getwd() # check current working directory 

setwd("./") # set working directory

dir()    # check what is in your directory

dir.create("./test_dir") # In case you forgot to create a directory

[Review] Reading and writing data

QG16.lab.2 <- read.table("../Lab1/QG16-lab1-data.csv", sep = ",", header = TRUE) # Reading data from a csv file

QG16.lab.2.only.a <- subset(QG16.lab.2, factor1 == "a") # subsetting a data frame
write.table(QG16.lab.2.only.a, file = "./QG16_subset_only_a.csv", sep = ",", quote= FALSE, row.names=FALSE)
# the quote options remove the "" of the entries. try it with quote = TRUE and see how it is different.
# row.names = FALSE eliminates the numbers in front of each row

1. Concept of functions

  • “Everything that exists is an object. Eveything that happens is a function call.” - John Chambers

  • We learned that a function is something that takes in an input and gives you an output.

  • The input for an R function is written between “( )” of a function, and the output(s) are the result(s) returned by the function.

  • R has many built in functions for commonly used methods in statistics.

# Examples of built in functions 
example.vector1 <- c(5,2,3,7,1,1,2,9,9)

# a function that calculates the mean
mean(example.vector1) 
[1] 4.333333
# a function that summarizes the counts
table(QG16.lab.2$factor1) 

 a  b 
50 50 
  • We can also build custom functions.
# the syntax for declaring functions, note the {} after function()

log10_add <- function(input1,input2){   # all the inputs are specified within the ( )
  cat("This is a custom function \n")   
  cat("The inputs are = ",input1,input2,"\n")  # showing you the inputs 
  output <- log10(input1) + log10(input2)       # creating an output within the function
  cat("The output is = ",output,"\n")          # print the output
  return(output)                               # return specifies the output
}

# Now we can call our custom functions like this
log10_add(100,1000)
This is a custom function 
The inputs are =  100 1000 
The output is =  5 
[1] 5
# Note that the variable output is not created in our workspace 
ls()
[1] "example.vector1"   "log10_add"         "QG16.lab.2"       
[4] "QG16.lab.2.only.a"
# in order to save the result of a function to a variable we have to assign it to a variable
test.output <- log10_add(100,1000)
This is a custom function 
The inputs are =  100 1000 
The output is =  5 
test.output
[1] 5

Question 1

  • Can you guess what is going to happen?
x <- 11

test_function <- function(y){
  output <- x + y
  return(output)
}

test_function(2) 
test_function2 <- function(x,y,z){
  output1 <- x + y + z 
  output2 <- x * y * z
  output3 <- x ^ y ^ z 
  
}

x <- test_function2(1,2,3) 

print(x)
x <- 10 

test_function3 <- function(){
  x <- 20
}

test_function3() 

print(x)

Installing and Loading packages

  • We can also use functions by installing published packages if somebody else did the hard work for us.

  • Let’s see what happens if we try to run a function called mutate().

mutate()
  • The mutate function is part of the “dplyr” package, so we need to install it first to be able to use the function.
# We can install packages that are published on CRAN by using this function

install.packages("dplyr") # in this case we are installing a package called "dplyr"

# you will need an internet connection to install a package using this function
  • So, does it work now since we have it installed?
mutate(QG16.lab.2,mean = (data1+data2+data3)/3)
  • Having the package installed does not mean you can use the function. You have to load the package into your current R session in order to use it.
# Once the installation is finished you can load the functions into the workspace 
library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
# or by 
require(dplyr)

# Now we can use the functions from dplyr
QG16.lab.2 <- mutate(QG16.lab.2,mean = (data1+data2+data3)/3)
head(QG16.lab.2)
  genename       data1        data2      data3 factor1 factor2        mean
1    gene1  1.42866369 -0.157846249  1.3136225       a   info1  0.86147998
2    gene2 -0.58165021  0.593995373 -0.2319272       b   info2 -0.07319403
3    gene3 -1.03955628  1.083860155  0.7050685       a   info3  0.24979079
4    gene4  0.58381526 -0.125866753  1.2627935       b   info4  0.57358068
5    gene5  0.04377365 -0.002239948  0.1445219       a   info5  0.06201853
6    gene6  0.26732986 -1.756292916  0.5459083       b   info6 -0.31435160
# Let's try another function from dplyr
filter(QG16.lab.2,factor1 == "b" & factor2 =="info8") # Note the & operator which is an AND operator
   genename      data1      data2      data3 factor1 factor2        mean
1     gene8  0.5110170  1.3689777  0.9548534       b   info8  0.94494937
2    gene18  1.0239683  0.1330547  0.9028215       b   info8  0.68661481
3    gene28  0.8516973 -1.0032102 -1.2986571       b   info8 -0.48338999
4    gene38  0.2830858 -0.6795029  1.8192317       b   info8  0.47427152
5    gene48 -1.1049222  0.5409523  0.3202559       b   info8 -0.08123801
6    gene58  0.7904040  0.5096778 -1.6116118       b   info8 -0.10384333
7    gene68 -1.4591909  1.5071098  0.1016278       b   info8  0.04984887
8    gene78  1.1132743  0.7142994 -0.2152775       b   info8  0.53743207
9    gene88 -0.9999763  1.2684340 -0.9384624       b   info8 -0.22333490
10   gene98  1.4704641  0.8852491  1.2723985       b   info8  1.20937057
  • At this point you might be wondering why we need to install all those packages and load them? Shouldn’t it be less complicated if we just have all these functions by default, or automatically loaded onto the workspace after installing?

  • The reason we use the library() or require() function to load specific packages is because it will slow down R if we loaded everything we have installed at the beginning as we install more packages.

  • You can think of your R session as your workbench, and the packages as specific tools in your toolbox. You wouldn’t empty out the toolbox onto your workbench for a simple task like sanding a surface. The only thing you need is some sandpaper, and any additional tool just occupies additional space and contributes only to the aesthetics of your workspace (or acts against it).

2. for loops

  • For loops are mainly used in cases where you want to apply a function iteratively over a vector, matrix or data frame (or just simply do a task multiple times)

  • Structure of the code = for (counter in vector) { do stuff }

  • Syntax of a simple for loop (Printing out the process helps a lot)

N <- 3
for ( i in 1:N ){
    cat("Processing loop number = ",i,"\n")
}
Processing loop number =  1 
Processing loop number =  2 
Processing loop number =  3 
  • You can use any legal variable name for the counter
for ( anything_goes in c(1,3,5) ){
    cat("Processing loop number = ",anything_goes,"\n")
}
Processing loop number =  1 
Processing loop number =  3 
Processing loop number =  5 
  • You can also create a loop within a loop

  • For this example, let’s say that you want to create multiple samples with different sample sizes

for ( outer in 1:3 ){
    cat("Processing Outer Loop #", outer, "\n")
  for ( inner in 1:2 ){
      cat("Processing |_ Inner Loop #", inner, "\n")
    }
}
Processing Outer Loop # 1 
Processing |_ Inner Loop # 1 
Processing |_ Inner Loop # 2 
Processing Outer Loop # 2 
Processing |_ Inner Loop # 1 
Processing |_ Inner Loop # 2 
Processing Outer Loop # 3 
Processing |_ Inner Loop # 1 
Processing |_ Inner Loop # 2 

Question2

  • What is the final value of N ?
N <- 3

for( i in 1:N){
  
  cat("Processing loop = ", i, "\n")
  N <- N + 1
  
}

print(N)

3. Vector and Matrix calculations

  • If you want to modify each element of a vector by a scalar value you can use the math operations that we have learned last week.
example.vector1
[1] 5 2 3 7 1 1 2 9 9
2 * example.vector1
[1] 10  4  6 14  2  2  4 18 18
1 + example.vector1
[1]  6  3  4  8  2  2  3 10 10
example.vector1 ^2
[1] 25  4  9 49  1  1  4 81 81
  • If you are interested in the dot product of two vectors you have to use a special operator
example.vector1 %*% example.vector1
     [,1]
[1,]  255
  • The same applies for matrices
example.matrix1 <- matrix(c(1,1,1,2,2,2), nrow = 2, ncol = 3, byrow= TRUE)

2 * example.matrix1
     [,1] [,2] [,3]
[1,]    2    2    2
[2,]    4    4    4
example.matrix1 ^ 2
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    4    4    4
example.matrix1 - 1
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    1    1    1
  • Here is how you can do matrix calculations
# t() is transposing the matrix
example.matrix1 %*% t(example.matrix1)
     [,1] [,2]
[1,]    3    6
[2,]    6   12
# Note the dimensions 2 x 3 %*% 3 x 2  = 2 x 2 
  • Here are some useful functions that can be used in matrix calculations
# creating a diagonal matrix with the first input as values on the diagonal
diag(2,nrow = 3)
     [,1] [,2] [,3]
[1,]    2    0    0
[2,]    0    2    0
[3,]    0    0    2
diag(example.vector1)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
 [1,]    5    0    0    0    0    0    0    0    0
 [2,]    0    2    0    0    0    0    0    0    0
 [3,]    0    0    3    0    0    0    0    0    0
 [4,]    0    0    0    7    0    0    0    0    0
 [5,]    0    0    0    0    1    0    0    0    0
 [6,]    0    0    0    0    0    1    0    0    0
 [7,]    0    0    0    0    0    0    2    0    0
 [8,]    0    0    0    0    0    0    0    9    0
 [9,]    0    0    0    0    0    0    0    0    9
# calculating the inverse of a matrix
A <- matrix(c(2,-3,1,0.5),nrow = 2)
solve(A)
      [,1]  [,2]
[1,] 0.125 -0.25
[2,] 0.750  0.50
# we can check this by 
A %*% solve(A) # which results in an identity matrix 
     [,1] [,2]
[1,]    1    0
[2,]    0    1

Advanced Challenge

  • Can you generate a function that takes a vector with student names with “Weill Cornell” grades as input, and creates a list as an output that has the following entries?

  • “LetterGrade” should have the converted letter grades of each student in it.

  • “Overview” should have the number of students for each Letter grades

  • There should be an entry for each letter grade with name of students that received the corresponding letter grade.

scores <- sample(c("NP","P","LP","HP","H"), 20, replace = TRUE)
names(scores) <- paste0("student", 1:length(scores))

standard_letter_grades <- c("H" = "A", "HP" = "B", "LP" = "C", "P" = "D", "NP" = "F")
  • Your code should also work when the conversion table changes (no hard coding of letter grades are allowed!)

  • Your function should also be verbose, meaning that it should tell the user how many students are in the input in total, and the breakdown for each grade as well.

Hints

  • Getting names of vectors
names(example.vector1)
  • Checking if value is equal to something
value == 1
value == "A"
  • Example output
Total number of students =  20 
2 students received  A 
4 students received  B 
6 students received  C 
3 students received  D 
5 students received  F 
$Overview
student_letter_grade
A B C D F 
2 4 6 3 5 

$LetterGrade
 student1  student2  student3  student4  student5  student6  student7 
      "A"       "C"       "F"       "A"       "F"       "F"       "C" 
 student8  student9 student10 student11 student12 student13 student14 
      "B"       "B"       "F"       "F"       "D"       "C"       "B" 
student15 student16 student17 student18 student19 student20 
      "B"       "C"       "C"       "C"       "D"       "D" 

$A
[1] "student1" "student4"

$B
[1] "student8"  "student9"  "student14" "student15"

$C
[1] "student2"  "student7"  "student13" "student16" "student17" "student18"

$D
[1] "student12" "student19" "student20"

$F
[1] "student3"  "student5"  "student6"  "student10" "student11"