Quantitative Genomics and Genetics 2016

Computer Lab 2

– 11 February 2016

– Author: Jin Hyun Ju (jj328@cornell.edu)

[Review] Don’t forget to set your working directory

getwd() # check current working directory

setwd("./") # set working directory

dir()    # check what is in your directory

dir.create("./test_dir") # In case you forgot to create a directory

QG16.lab.2 <- read.table("../Lab1/QG16-lab1-data.csv", sep = ",", header = TRUE) # Reading data from a csv file

QG16.lab.2.only.a <- subset(QG16.lab.2, factor1 == "a") # subsetting a data frame
write.table(QG16.lab.2.only.a, file = "./QG16_subset_only_a.csv", sep = ",", quote= FALSE, row.names=FALSE)
# the quote options remove the "" of the entries. try it with quote = TRUE and see how it is different.
# row.names = FALSE eliminates the numbers in front of each row

1. Concept of functions

• “Everything that exists is an object. Eveything that happens is a function call.” - John Chambers

• We learned that a function is something that takes in an input and gives you an output.

• The input for an R function is written between “( )” of a function, and the output(s) are the result(s) returned by the function.

• R has many built in functions for commonly used methods in statistics.

# Examples of built in functions
example.vector1 <- c(5,2,3,7,1,1,2,9,9)

# a function that calculates the mean
mean(example.vector1) 
[1] 4.333333
# a function that summarizes the counts
table(QG16.lab.2$factor1)   a b 50 50  • We can also build custom functions. # the syntax for declaring functions, note the {} after function() log10_add <- function(input1,input2){ # all the inputs are specified within the ( ) cat("This is a custom function \n") cat("The inputs are = ",input1,input2,"\n") # showing you the inputs output <- log10(input1) + log10(input2) # creating an output within the function cat("The output is = ",output,"\n") # print the output return(output) # return specifies the output } # Now we can call our custom functions like this log10_add(100,1000) This is a custom function The inputs are = 100 1000 The output is = 5  [1] 5 # Note that the variable output is not created in our workspace ls() [1] "example.vector1" "log10_add" "QG16.lab.2" [4] "QG16.lab.2.only.a" # in order to save the result of a function to a variable we have to assign it to a variable test.output <- log10_add(100,1000) This is a custom function The inputs are = 100 1000 The output is = 5  test.output [1] 5 Question 1 • Can you guess what is going to happen? x <- 11 test_function <- function(y){ output <- x + y return(output) } test_function(2)  test_function2 <- function(x,y,z){ output1 <- x + y + z output2 <- x * y * z output3 <- x ^ y ^ z } x <- test_function2(1,2,3) print(x) x <- 10 test_function3 <- function(){ x <- 20 } test_function3() print(x) Installing and Loading packages • We can also use functions by installing published packages if somebody else did the hard work for us. • Let’s see what happens if we try to run a function called mutate(). mutate() • The mutate function is part of the “dplyr” package, so we need to install it first to be able to use the function. # We can install packages that are published on CRAN by using this function install.packages("dplyr") # in this case we are installing a package called "dplyr" # you will need an internet connection to install a package using this function • So, does it work now since we have it installed? mutate(QG16.lab.2,mean = (data1+data2+data3)/3) • Having the package installed does not mean you can use the function. You have to load the package into your current R session in order to use it. # Once the installation is finished you can load the functions into the workspace library(dplyr)  Attaching package: 'dplyr' The following objects are masked from 'package:stats': filter, lag The following objects are masked from 'package:base': intersect, setdiff, setequal, union # or by require(dplyr) # Now we can use the functions from dplyr QG16.lab.2 <- mutate(QG16.lab.2,mean = (data1+data2+data3)/3) head(QG16.lab.2)  genename data1 data2 data3 factor1 factor2 mean 1 gene1 1.42866369 -0.157846249 1.3136225 a info1 0.86147998 2 gene2 -0.58165021 0.593995373 -0.2319272 b info2 -0.07319403 3 gene3 -1.03955628 1.083860155 0.7050685 a info3 0.24979079 4 gene4 0.58381526 -0.125866753 1.2627935 b info4 0.57358068 5 gene5 0.04377365 -0.002239948 0.1445219 a info5 0.06201853 6 gene6 0.26732986 -1.756292916 0.5459083 b info6 -0.31435160 # Let's try another function from dplyr filter(QG16.lab.2,factor1 == "b" & factor2 =="info8") # Note the & operator which is an AND operator  genename data1 data2 data3 factor1 factor2 mean 1 gene8 0.5110170 1.3689777 0.9548534 b info8 0.94494937 2 gene18 1.0239683 0.1330547 0.9028215 b info8 0.68661481 3 gene28 0.8516973 -1.0032102 -1.2986571 b info8 -0.48338999 4 gene38 0.2830858 -0.6795029 1.8192317 b info8 0.47427152 5 gene48 -1.1049222 0.5409523 0.3202559 b info8 -0.08123801 6 gene58 0.7904040 0.5096778 -1.6116118 b info8 -0.10384333 7 gene68 -1.4591909 1.5071098 0.1016278 b info8 0.04984887 8 gene78 1.1132743 0.7142994 -0.2152775 b info8 0.53743207 9 gene88 -0.9999763 1.2684340 -0.9384624 b info8 -0.22333490 10 gene98 1.4704641 0.8852491 1.2723985 b info8 1.20937057 • At this point you might be wondering why we need to install all those packages and load them? Shouldn’t it be less complicated if we just have all these functions by default, or automatically loaded onto the workspace after installing? • The reason we use the library() or require() function to load specific packages is because it will slow down R if we loaded everything we have installed at the beginning as we install more packages. • You can think of your R session as your workbench, and the packages as specific tools in your toolbox. You wouldn’t empty out the toolbox onto your workbench for a simple task like sanding a surface. The only thing you need is some sandpaper, and any additional tool just occupies additional space and contributes only to the aesthetics of your workspace (or acts against it). 2. for loops • For loops are mainly used in cases where you want to apply a function iteratively over a vector, matrix or data frame (or just simply do a task multiple times) • Structure of the code = for (counter in vector) { do stuff } • Syntax of a simple for loop (Printing out the process helps a lot) N <- 3 for ( i in 1:N ){ cat("Processing loop number = ",i,"\n") } Processing loop number = 1 Processing loop number = 2 Processing loop number = 3  • You can use any legal variable name for the counter for ( anything_goes in c(1,3,5) ){ cat("Processing loop number = ",anything_goes,"\n") } Processing loop number = 1 Processing loop number = 3 Processing loop number = 5  • You can also create a loop within a loop • For this example, let’s say that you want to create multiple samples with different sample sizes for ( outer in 1:3 ){ cat("Processing Outer Loop #", outer, "\n") for ( inner in 1:2 ){ cat("Processing |_ Inner Loop #", inner, "\n") } } Processing Outer Loop # 1 Processing |_ Inner Loop # 1 Processing |_ Inner Loop # 2 Processing Outer Loop # 2 Processing |_ Inner Loop # 1 Processing |_ Inner Loop # 2 Processing Outer Loop # 3 Processing |_ Inner Loop # 1 Processing |_ Inner Loop # 2  Question2 • What is the final value of N ? N <- 3 for( i in 1:N){ cat("Processing loop = ", i, "\n") N <- N + 1 } print(N) 3. Vector and Matrix calculations • If you want to modify each element of a vector by a scalar value you can use the math operations that we have learned last week. example.vector1 [1] 5 2 3 7 1 1 2 9 9 2 * example.vector1 [1] 10 4 6 14 2 2 4 18 18 1 + example.vector1 [1] 6 3 4 8 2 2 3 10 10 example.vector1 ^2 [1] 25 4 9 49 1 1 4 81 81 • If you are interested in the dot product of two vectors you have to use a special operator example.vector1 %*% example.vector1  [,1] [1,] 255 • The same applies for matrices example.matrix1 <- matrix(c(1,1,1,2,2,2), nrow = 2, ncol = 3, byrow= TRUE) 2 * example.matrix1  [,1] [,2] [,3] [1,] 2 2 2 [2,] 4 4 4 example.matrix1 ^ 2  [,1] [,2] [,3] [1,] 1 1 1 [2,] 4 4 4 example.matrix1 - 1  [,1] [,2] [,3] [1,] 0 0 0 [2,] 1 1 1 • Here is how you can do matrix calculations # t() is transposing the matrix example.matrix1 %*% t(example.matrix1)  [,1] [,2] [1,] 3 6 [2,] 6 12 # Note the dimensions 2 x 3 %*% 3 x 2 = 2 x 2  • Here are some useful functions that can be used in matrix calculations # creating a diagonal matrix with the first input as values on the diagonal diag(2,nrow = 3)  [,1] [,2] [,3] [1,] 2 0 0 [2,] 0 2 0 [3,] 0 0 2 diag(example.vector1)  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [1,] 5 0 0 0 0 0 0 0 0 [2,] 0 2 0 0 0 0 0 0 0 [3,] 0 0 3 0 0 0 0 0 0 [4,] 0 0 0 7 0 0 0 0 0 [5,] 0 0 0 0 1 0 0 0 0 [6,] 0 0 0 0 0 1 0 0 0 [7,] 0 0 0 0 0 0 2 0 0 [8,] 0 0 0 0 0 0 0 9 0 [9,] 0 0 0 0 0 0 0 0 9 # calculating the inverse of a matrix A <- matrix(c(2,-3,1,0.5),nrow = 2) solve(A)  [,1] [,2] [1,] 0.125 -0.25 [2,] 0.750 0.50 # we can check this by A %*% solve(A) # which results in an identity matrix   [,1] [,2] [1,] 1 0 [2,] 0 1 Advanced Challenge • Can you generate a function that takes a vector with student names with “Weill Cornell” grades as input, and creates a list as an output that has the following entries? • “LetterGrade” should have the converted letter grades of each student in it. • “Overview” should have the number of students for each Letter grades • There should be an entry for each letter grade with name of students that received the corresponding letter grade. scores <- sample(c("NP","P","LP","HP","H"), 20, replace = TRUE) names(scores) <- paste0("student", 1:length(scores)) standard_letter_grades <- c("H" = "A", "HP" = "B", "LP" = "C", "P" = "D", "NP" = "F") • Your code should also work when the conversion table changes (no hard coding of letter grades are allowed!) • Your function should also be verbose, meaning that it should tell the user how many students are in the input in total, and the breakdown for each grade as well. Hints • Getting names of vectors names(example.vector1) • Checking if value is equal to something value == 1 value == "A" • Example output Total number of students = 20 2 students received A 4 students received B 6 students received C 3 students received D 5 students received F  $Overview
A B C D F
2 4 6 3 5

$LetterGrade student1 student2 student3 student4 student5 student6 student7 "A" "C" "F" "A" "F" "F" "C" student8 student9 student10 student11 student12 student13 student14 "B" "B" "F" "F" "D" "C" "B" student15 student16 student17 student18 student19 student20 "B" "C" "C" "C" "D" "D"$A
[1] "student1" "student4"

$B [1] "student8" "student9" "student14" "student15"$C
[1] "student2"  "student7"  "student13" "student16" "student17" "student18"

$D [1] "student12" "student19" "student20"$F
[1] "student3"  "student5"  "student6"  "student10" "student11"