Quantitative Genomics and Genetics 2016

Computer Lab 1

– 4 February 2016

– Author: Jin Hyun Ju (jj328@cornell.edu)

Best way to download computer lab material

  • Mac and Linux users

Open a terminal, change directories to your preferred location, and type the following command.

git clone https://github.com/jinhyunju/QG16_computerlab.git

After the initial “cloning” of the repository, you can use this command (while you are in the directory) to download any future updates.

git pull
  • Windows users

Download git

https://git-scm.com/downloads

Use either the git bash functionality and type the same command shown above, or use the GUI version and copy and paste the URL to clone the repository.

Installing R

The Comprehensive R Archive Network (CRAN)

http://cran.r-project.org/

Download the R distribution for your operating system (Mac, Windows, Linux) and follow the installation instructions.

Installing RStudio

http://www.rstudio.com/

What is R?

  • R is a programming language: A set of commands in a syntax that can be interpreted by a computer to make use of the processors to do computations. A way to tell the computer what to do.

  • R is a “scripting” type language: The syntax is interpreted directly to do computations, and many of the complex tasks one would like to accomplish are implemented automatically.

  • R is specifically developed for doing statistical tasks and computations.

  • History: R is an open-source (free) implementation of the S language, which was originally developed at Bell Labs in the 1970s; S+ (S-PLUS) is a commercial implementation of the same language. The core of the R system is written mostly in C.

  • Relationship to other languages: R is distinct from compiled languages (such as C and C++) and from other dynamic scripting languages (such as Perl), has similar functionality to SAS, and is (effectively) a free version of S.

What is RStudio?

  • User friendly interface for R. Makes your life a lot easier.

  • Syntax highlighting

  • Code completion

  • Smart indentation

  • Tools to manage plots

  • Browse files and directories

  • Visualizing object structures

  • When you run RStudio, it automatically opens an R session

Why R?

  • It’s free. (1 year SPSS licences range from $100 (student pricing) to $14,000+)

  • R is supported by a large community (constantly getting updated, help communities, new features)

  • R is user friendly and easy to learn and provides a foundation for learning more complex languages

  • R is particularly good for accomplishing statistical applications

  • R has nice graphical outputs (for example, with the ggplot2 package)

  • R is particularly useful in bioinformatics (the Bioconductor community has many packages available for commonly used analyses)

Why are we interested in using R?

  • We can analyze large datasets in a reproducible way. (You will never, or hardly ever, use Excel again)

  • We can implement algorithms or statistical analyses that are not available as published functions

  • We can automate a complex analysis by putting together multiple steps

  • We can simulate data

A few remarks on programming

  • It is easier than you think once you get used to it.

  • It is about winning small battles. Do not attempt to write up the entire process at once.

  • When in doubt, HIT TAB.

  • Typos are your worst enemy and you will see many in the lecture notes as well…

  • Computers are super fast stubborn idiots.

  • Your program or script will almost certainly never work on the first run. If it does work, it is probably not doing the right thing.

1. Location, Location, Location …


The working directory

  • The “working directory” is the folder where R looks for your data files and saves any outputs by default, that is, whenever you give a file name without a full path.

  • You can use R interactively by writing commands directly to the “prompt”, which looks like this “>”.

  • You can also use the script window to directly run commands. (we will get there soon)

  • Let’s check the directory that you are in first

getwd()
  • Now let’s set your working directory to another place

  • If you don’t know what to type as your directory, try hitting the Tab key inside the quote marks: setwd("-hit tab-")

setwd("Path_to_directory")
  • This command changed our working directory and we can check it by using getwd() again
getwd()
  • To see what is in the working directory
dir() 
[1] "QG16_computerlab1_page_ver.html"  "QG16_computerlab1_page_ver.Rmd"  
[3] "QG16_computerlab1_slide_ver.html" "QG16_computerlab1_slide_ver.Rmd" 
[5] "QG16-lab1-data.csv"               "QG16_subset_only_a.csv"          
list.files()  # does the same thing
[1] "QG16_computerlab1_page_ver.html"  "QG16_computerlab1_page_ver.Rmd"  
[3] "QG16_computerlab1_slide_ver.html" "QG16_computerlab1_slide_ver.Rmd" 
[5] "QG16-lab1-data.csv"               "QG16_subset_only_a.csv"          
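A small sketch of changing directories and coming back again. It makes no assumption about your folder layout: R's session temporary folder, tempdir(), stands in for "Path_to_directory", and the variable name original.directory is only an illustration.

```r
# Remember where we started so we can return later
original.directory <- getwd()

# tempdir() is a folder R creates for this session; it stands in
# for a real project path here
setwd(tempdir())
getwd()

# List only the csv files in the current working directory
list.files(pattern = "\\.csv$")

# Return to the starting directory
setwd(original.directory)
```

Storing the starting directory first is a simple habit that keeps a script from leaving you somewhere unexpected.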
  • Now that we know where we are let’s see what R can really do.

2. R as a calculator


101+127
[1] 228
2 * 4
[1] 8
6 / 3
[1] 2
  • And in a more sophisticated manner
( (6 / 3) + ( 9 / (1+2) ) - 2.3 )^2
[1] 7.29

- Built in math functions

Side note: Everything that is written after a # is considered as a comment and will not be executed.

log(10)     # log with base e
[1] 2.302585
log2(8)     # log with base 2
[1] 3
log10(1000) # log with base 10
[1] 3
exp(4)      # exponentials
[1] 54.59815
sqrt(36)    # square roots
[1] 6
abs( 10 - 15 ) # absolute values
[1] 5
  • In case you cannot remember what a function does, try typing the name of the function with a ? in front of it.
?log
?log10
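The help page opened by ?log also describes a base argument, so the specialised functions above can be reproduced with log() itself. A brief sketch:

```r
# log() accepts a base argument; the default base is e
log(8, base = 2)       # same quantity as log2(8)
log(1000, base = 10)   # same quantity as log10(1000)

# exp() undoes the natural logarithm
exp(log(10))

# Decimal results are best compared with all.equal() rather than ==
all.equal(log(1000, base = 10), 3)
```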

3. Data structures in R


1. Vectors

  • R can store values in variables that can be declared and used as follows:
number.of.students <- 50
male.ratio <- 0.4

number.of.males <- number.of.students * male.ratio
number.of.females <- number.of.students * (1 - male.ratio)
  • Let’s check the results saved in the variables
number.of.males
[1] 20
number.of.females
[1] 30
  • It is very important to use variable names that make sense to you. DO NOT CHOOSE VARIABLE NAMES BASED ON HOW LONG IT TAKES TO TYPE THEM. For example, do not use variable names like a, asdf, nm, nf, var1, etc. Otherwise you will spend much more time interpreting your own code when you come back to it after a while.

  • Variable names are case sensitive. number.of.males and Number.of.males are two different variables.

  • Word separation is usually done with periods. (you can also use _)

  • Even a single value is stored as a vector in R. A vector is an ordered sequence of values of the same type, such as numbers.

  • There are many, many ways to create a vector from scratch. Here are a few examples

example.vector <- c(1.1,3,5.3,7,9.0)
example.vector
[1] 1.1 3.0 5.3 7.0 9.0
example.vector2 <- 1:10
example.vector2
 [1]  1  2  3  4  5  6  7  8  9 10
example.vector3 <- seq(from=1, to = 12, by = 3)
example.vector3
[1]  1  4  7 10
  • You can access individual values of a vector by using square brackets [ ].
example.vector[3]
[1] 5.3
# access multiple values
example.vector2[c(1,2)]
[1] 1 2
# access every value except position 2
example.vector2[-2]
[1]  1  3  4  5  6  7  8  9 10
  • You can also use functions on vectors
# get the maximum value
max(example.vector3)
[1] 10
# get the minimum value
min(example.vector3)
[1] 1
# get the mean
mean(example.vector3)
[1] 5.5
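Functions and arithmetic operate on every element of a vector at once. The sketch below rebuilds example.vector3 so that it runs on its own; doubled.vector and manual.mean are names chosen only for this illustration.

```r
example.vector3 <- seq(from = 1, to = 12, by = 3)   # 1 4 7 10

# Arithmetic is applied to every element at once
doubled.vector <- example.vector3 * 2
doubled.vector

# length() counts the elements; sum() adds them up
length(example.vector3)
sum(example.vector3)

# The mean is the sum divided by the number of elements
manual.mean <- sum(example.vector3) / length(example.vector3)
manual.mean
```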
  • Vectors can also save characters
character.vector1 <- c("R", "is", "easy","free","great")
character.vector1
[1] "R"     "is"    "easy"  "free"  "great"
character.vector1[c(1,2,3)]
[1] "R"    "is"   "easy"
character.vector1[c(1,2,5)]
[1] "R"     "is"    "great"
character.vector1[-c(3,5)]
[1] "R"    "is"   "free"
  • To check the type of information that a vector has saved you can use the class() function
class(character.vector1)
[1] "character"
class(example.vector)
[1] "numeric"
class(example.vector2)
[1] "integer"
  • Now, what class would this vector be?
question1 <- "1"
  • Anything written between quotation marks ("") is treated as a character string, even if it looks like a number.
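One way to check this yourself, and to turn such a value back into a number when needed, is sketched here. The name converted.number is just an illustration.

```r
question1 <- "1"
class(question1)                 # "character"

# Arithmetic on a character value fails, so convert it first
converted.number <- as.numeric(question1)
class(converted.number)          # "numeric"
converted.number + 1             # 2
```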

2. Matrices

  • With matrices we can start working in 2 dimensions.

  • We can create a matrix like this:

example.matrix1 <- matrix(1:6, nrow = 2)
example.matrix2 <- matrix(1:6, ncol = 2)
example.matrix3 <- matrix(1:6, nrow = 2, ncol = 3)
example.matrix4 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

example.matrix1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
example.matrix2
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
example.matrix3
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
example.matrix4
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
  • Checking the dimensions of a matrix
dim(example.matrix4)
[1] 2 3
# The result is given as rows x columns
  • You can access values the same way as vectors, but with an additional position

  • Entire columns and rows can be selected by leaving the other entry empty

example.matrix4[1,2]
[1] 2
example.matrix4[1:2,1]
[1] 1 4
# selecting a row 
example.matrix4[1,]
[1] 1 2 3
# selecting a column
example.matrix4[,2]
[1] 2 5
  • Sometimes you will need to transpose a matrix
t(example.matrix4)
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
  • Sometimes you will need to name the rows and columns of your matrix
rownames(example.matrix4) <- c("row1","row2")
colnames(example.matrix4) <- c("column1","column2","column3")
example.matrix4
     column1 column2 column3
row1       1       2       3
row2       4       5       6
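Once rows and columns have names, the names can be used inside the square brackets in place of positions. A short sketch that rebuilds example.matrix4 so it runs on its own:

```r
example.matrix4 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
rownames(example.matrix4) <- c("row1", "row2")
colnames(example.matrix4) <- c("column1", "column2", "column3")

# Select a single value by row name and column name
example.matrix4["row2", "column3"]     # 6

# Select a whole column by name; the row names are kept as labels
example.matrix4[, "column2"]
```

Selecting by name tends to be more robust than selecting by position, since it keeps working if the column order changes.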
  • Just like vectors, matrices can also hold characters

  • An important feature of vectors and matrices is that they can only hold one type of data. So a numerical matrix will only be allowed to hold numerical values, and a character matrix will only hold characters. When R complains that it cannot compute on a matrix, check the type!

character.matrix <- matrix(c("a","b","c","d","e","f"), nrow = 2, ncol = 3, byrow=TRUE)
character.matrix
     [,1] [,2] [,3]
[1,] "a"  "b"  "c" 
[2,] "d"  "e"  "f" 
mode(example.matrix4)
[1] "numeric"
mode(character.matrix)
[1] "character"
  • Can you tell me what the results for the following code will be?
question2 <- matrix(c(1,2,3), nrow = 2, ncol = 3, byrow = TRUE)

question3 <- matrix(c(1,2,3,4), nrow = 2, ncol = 3, byrow = TRUE)
  • These examples show you how R recycles values if the dimensions differ. This feature comes in handy sometimes, but it mostly backfires since it happens automatically even in cases where we would rather see an error.
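The recycling behaviour can be inspected directly. In this sketch the second call is wrapped in suppressWarnings() because R normally signals a warning when the values do not fit the matrix evenly; the last line is an extra illustration showing that the same rule applies to vector arithmetic.

```r
# Three values fill six cells, so the vector is used exactly twice
question2 <- matrix(c(1, 2, 3), nrow = 2, ncol = 3, byrow = TRUE)
question2

# Four values do not divide evenly into six cells. R still recycles,
# restarting from the first value, and signals a warning, which is
# silenced here with suppressWarnings()
question3 <- suppressWarnings(
  matrix(c(1, 2, 3, 4), nrow = 2, ncol = 3, byrow = TRUE)
)
question3

# The same recycling rule applies to vector arithmetic
c(10, 20, 30, 40) + c(1, 2)
```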

  • Now what will happen in this case?

question4 <- matrix(c(1,2,Three), nrow = 2, ncol = 3, byrow = TRUE)

question5 <- matrix(c(1,2,"Three"), nrow = 2, ncol = 3, byrow = TRUE)
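To check your answers, the two cases can be tried on plain vectors first. This sketch assumes that no variable called Three exists in your session; mixed.vector and unquoted.attempt are names made up for the illustration.

```r
# Mixing numbers and text: c() converts everything to character
mixed.vector <- c(1, 2, "Three")
mixed.vector
class(mixed.vector)          # "character"

# Without quotes, Three is treated as a variable name. tryCatch()
# captures the resulting error message instead of stopping the script
unquoted.attempt <- tryCatch(
  c(1, 2, Three),
  error = function(e) conditionMessage(e)
)
unquoted.attempt
```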

3. Dataframes

  • To deal with multiple types of data in a single object we use data frames.
numbers <- c(1:4)
characters <- c("a","b","c","d")

# The stringsAsFactors option prevents converting strings to factors
example.data.frame <- data.frame(numbers,characters, stringsAsFactors = FALSE)
example.data.frame
  numbers characters
1       1          a
2       2          b
3       3          c
4       4          d
class(example.data.frame[,1])
[1] "integer"
class(example.data.frame[,2])
[1] "character"
  • You can also convert a matrix directly to a data.frame
converted.data.frame <- as.data.frame(example.matrix4)
converted.data.frame
     column1 column2 column3
row1       1       2       3
row2       4       5       6
class(converted.data.frame)
[1] "data.frame"
  • Data frames can be accessed just like matrices

  • If you know the column names you can also use $ to access specific columns

example.data.frame[1,1]
[1] 1
example.data.frame[2,]
  numbers characters
2       2          b
example.data.frame$numbers
[1] 1 2 3 4
  • How will this affect the column types in the data frame?
example.data.frame[1,] <- c(5,5)

example.data.frame[1,] <- c("e", "e")
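A quick way to see the effect is to look at the class of every column before and after the assignment. The sketch below rebuilds the data frame so it runs on its own and tries the second assignment only.

```r
numbers <- c(1:4)
characters <- c("a", "b", "c", "d")
example.data.frame <- data.frame(numbers, characters, stringsAsFactors = FALSE)

# Column types before any change
sapply(example.data.frame, class)

# Writing text into the first row forces the numbers column to
# become character, because a column can only hold one type
example.data.frame[1, ] <- c("e", "e")
sapply(example.data.frame, class)

# The remaining numbers are now stored as text
example.data.frame$numbers
```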

4. Lists

  • You probably won’t need lists while you are still getting familiar with R, but once you get used to them they can make your code even more efficient!

  • Lists are basically bundles of objects, holding vectors, matrices and data frames in a single object.

example.list <- list() # declaring a list

example.list$place1 <- example.vector3
example.list$place2 <- example.matrix4
example.list$place3 <- example.data.frame
  • Now the list has a vector, a matrix and a data frame all stored in one object. To reveal the structure of a list we can use the str() function.
str(example.list)
List of 3
 $ place1: num [1:4] 1 4 7 10
 $ place2: int [1:2, 1:3] 1 4 2 5 3 6
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "row1" "row2"
  .. ..$ : chr [1:3] "column1" "column2" "column3"
 $ place3:'data.frame': 4 obs. of  2 variables:
  ..$ numbers   : int [1:4] 1 2 3 4
  ..$ characters: chr [1:4] "a" "b" "c" "d"
  • Each item can be accessed by using either a double square bracket [[]] or a dollar sign $ if you know the name of the entry.
example.list[[1]]
[1]  1  4  7 10
example.list[[2]]
     column1 column2 column3
row1       1       2       3
row2       4       5       6
example.list$place3
  numbers characters
1       1          a
2       2          b
3       3          c
4       4          d
  • If you want to access a specific value of a vector/matrix/data.frame within a list, you can do so by adding a single square bracket at the end.
example.list[[1]][2]
[1] 4
example.list[[2]][,1]
row1 row2 
   1    4 
example.list$place3[,1]
[1] 1 2 3 4
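Single and double square brackets behave differently on a list, which is a common source of confusion. A minimal sketch with a small list built only for this illustration:

```r
example.list <- list(place1 = c(1, 4, 7, 10),
                     place2 = matrix(1:6, nrow = 2))

# Names and number of entries
names(example.list)
length(example.list)

# Double brackets return the stored object itself
class(example.list[["place1"]])   # "numeric"

# Single brackets return a smaller list that still wraps the object
class(example.list["place1"])     # "list"
```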

Quick Note

  • In case you cannot remember which variables you declared, use the ls() function
ls()
 [1] "character.matrix"     "characters"           "character.vector1"   
 [4] "converted.data.frame" "example.data.frame"   "example.list"        
 [7] "example.matrix1"      "example.matrix2"      "example.matrix3"     
[10] "example.matrix4"      "example.vector"       "example.vector2"     
[13] "example.vector3"      "male.ratio"           "number.of.females"   
[16] "number.of.males"      "number.of.students"   "numbers"             
  • To remove variables that are no longer used, use the rm() function
rm(example.list)
ls()
 [1] "character.matrix"     "characters"           "character.vector1"   
 [4] "converted.data.frame" "example.data.frame"   "example.matrix1"     
 [7] "example.matrix2"      "example.matrix3"      "example.matrix4"     
[10] "example.vector"       "example.vector2"      "example.vector3"     
[13] "male.ratio"           "number.of.females"    "number.of.males"     
[16] "number.of.students"   "numbers"             
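To test for one particular variable instead of scanning the whole ls() listing, exists() can be used. A small sketch with a throwaway variable named temporary.value:

```r
temporary.value <- 42
exists("temporary.value")    # TRUE

rm(temporary.value)
exists("temporary.value")    # FALSE
```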

4. Importing and Exporting Data


  • To actually analyze data you most certainly will have to import data from a file.

  • The most frequently used functions for importing data are read.table() or read.csv()

  • You are probably used to Excel spreadsheets with the extension .xlsx. However, tab-separated text files (.tsv or .txt) or comma-separated text files (.csv) are more commonly used with large datasets. Some of these will be automatically converted by Excel if you open them with it.

  • Let’s read in a comma separated example file.

  • First, open the file using your text editor (not Excel!) and look at what’s inside.

QG16.lab.1 <- read.table("./QG16-lab1-data.csv", sep = ",", header = TRUE)
QG16.lab.2 <- read.csv("./QG16-lab1-data.csv", header = TRUE)
  • The two functions here are essentially doing the same thing. You can think of read.csv() as the specialized version of read.table() for csv format files. By default, read.table() splits fields on any whitespace (spaces or tabs), but it can read in csv files as well if you specify the separating character with sep = ",".
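The comparison can be tried without the course data. This sketch writes a tiny made-up csv file to a temporary location (the file contents and variable names are invented for the illustration) and reads it back both ways.

```r
# Write a tiny comma-separated file to a temporary location
example.file <- tempfile(fileext = ".csv")
writeLines(c("genename,data1,factor1",
             "gene1,1.5,a",
             "gene2,-0.5,b"),
           example.file)

# Read it both ways
from.read.table <- read.table(example.file, sep = ",", header = TRUE)
from.read.csv   <- read.csv(example.file)   # header = TRUE is the default

# For a simple file like this, the two results should match
all.equal(from.read.table, from.read.csv)

unlink(example.file)   # delete the temporary file
```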

  • The first thing to do after reading in a data file is to check the dimensions and inspect the data.

dim(QG16.lab.1)
[1] 100   6
head(QG16.lab.1)     # displays the first 6 rows of the data (the default)
  genename       data1        data2      data3 factor1 factor2
1    gene1  1.42866369 -0.157846249  1.3136225       a   info1
2    gene2 -0.58165021  0.593995373 -0.2319272       b   info2
3    gene3 -1.03955628  1.083860155  0.7050685       a   info3
4    gene4  0.58381526 -0.125866753  1.2627935       b   info4
5    gene5  0.04377365 -0.002239948  0.1445219       a   info5
6    gene6  0.26732986 -1.756292916  0.5459083       b   info6
tail(QG16.lab.1)
    genename       data1       data2       data3 factor1 factor2
95    gene95 -0.03194166  1.10455567 -0.36060870       a   info5
96    gene96 -0.55310621 -0.85785917 -0.72968106       b   info6
97    gene97  1.06344683  1.13577427  0.69769121       a   info7
98    gene98  1.47046411  0.88524913  1.27239848       b   info8
99    gene99 -0.76495643  0.56850354  1.56419354       a   info9
100  gene100  0.29042428 -0.09689405 -0.07043415       b  info10
rownames(QG16.lab.1)
  [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11" 
 [12] "12"  "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22" 
 [23] "23"  "24"  "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33" 
 [34] "34"  "35"  "36"  "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44" 
 [45] "45"  "46"  "47"  "48"  "49"  "50"  "51"  "52"  "53"  "54"  "55" 
 [56] "56"  "57"  "58"  "59"  "60"  "61"  "62"  "63"  "64"  "65"  "66" 
 [67] "67"  "68"  "69"  "70"  "71"  "72"  "73"  "74"  "75"  "76"  "77" 
 [78] "78"  "79"  "80"  "81"  "82"  "83"  "84"  "85"  "86"  "87"  "88" 
 [89] "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96"  "97"  "98"  "99" 
[100] "100"
colnames(QG16.lab.1) # checking the column names of the data
[1] "genename" "data1"    "data2"    "data3"    "factor1"  "factor2" 
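A few more inspection functions are often useful at this stage. The sketch uses a small made-up data frame, toy.data, so that it runs without the course file; the column names only mimic the real data.

```r
toy.data <- data.frame(genename = paste0("gene", 1:10),
                       data1 = seq(from = -1, to = 1, length.out = 10),
                       factor1 = rep(c("a", "b"), times = 5),
                       stringsAsFactors = FALSE)

head(toy.data)          # first 6 rows by default
head(toy.data, n = 3)   # first 3 rows only
nrow(toy.data)          # number of rows
ncol(toy.data)          # number of columns
str(toy.data)           # type of each column
summary(toy.data$data1) # basic statistics for a numeric column
```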
  • This data set has a column named factor1 which has two levels, “a” and “b”. Let’s say we are only interested in the entries that have an “a” for factor1. We can subset the data by using the subset() function.
QG16.lab.1.only.a <- subset(QG16.lab.1, factor1 == "a")
QG16.lab.1.only.a
   genename       data1        data2       data3 factor1 factor2
1     gene1  1.42866369 -0.157846249  1.31362251       a   info1
3     gene3 -1.03955628  1.083860155  0.70506850       a   info3
5     gene5  0.04377365 -0.002239948  0.14452188       a   info5
7     gene7  1.14015888 -1.210404779  0.53583554       a   info7
9     gene9 -1.58741791 -1.826013995 -0.01475591       a   info9
11   gene11 -0.37906431  0.080624034  0.54954169       a   info1
13   gene13 -0.21880717 -1.655372839 -1.56231250       a   info3
15   gene15 -0.09163097 -1.672460591 -0.63767185       a   info5
17   gene17  0.22308594  1.797869068  1.06193324       a   info7
19   gene19 -0.86786284 -1.158188003 -1.47626217       a   info9
21   gene21 -0.46071712 -0.644525918 -0.58340951       a   info1
23   gene23  0.25705678 -0.072345654  0.53205336       a   info3
25   gene25 -1.50902465  0.344797863 -0.14394204       a   info5
27   gene27 -0.20672985 -0.676338563  1.13029509       a   info7
29   gene29 -2.45942241  0.130780012 -0.76748855       a   info9
31   gene31 -1.07707397  1.762098305  0.51017183       a   info1
33   gene33  0.27597700  0.515601401 -0.40089964       a   info3
35   gene35 -0.10882859  1.595258970 -1.08500501       a   info5
37   gene37 -0.33385086  0.322075807 -0.06296120       a   info7
39   gene39  1.44402109 -0.836743507  0.43612306       a   info9
41   gene41  0.31641313  0.824878272  0.43425213       a   info1
43   gene43  0.87754289 -0.862256536  1.33606779       a   info3
45   gene45 -0.90987279  0.494856914  0.45633788       a   info5
47   gene47  0.71583266  0.182193120 -0.47308293       a   info7
49   gene49  0.79019425  0.589584586 -1.38647862       a   info9
51   gene51 -0.35685005 -1.284558849 -0.80726455       a   info1
53   gene53  0.26359084  0.487573945  0.60362495       a   info3
55   gene55  0.52295338  0.938662893 -0.07414981       a   info5
57   gene57  0.62241088  0.553639599 -0.18705946       a   info7
59   gene59  0.37481716  1.605322706  1.04146807       a   info9
61   gene61 -0.55942370 -1.334397563 -0.48293221       a   info1
63   gene63  0.05690851 -1.403833433 -1.78641329       a   info3
65   gene65  0.74694784 -0.105988745  0.91208138       a   info5
67   gene67  0.61620134 -0.420777426  0.49940145       a   info7
69   gene69 -1.05644453  0.192946105 -0.63908835       a   info9
71   gene71  0.27959770 -0.535279890  0.25301446       a   info1
73   gene73  0.85670738 -0.243127167 -0.71446633       a   info3
75   gene75  1.38036787  0.714319197  0.77274329       a   info5
77   gene77  1.16121170 -1.794267588  0.23173301       a   info7
79   gene79  0.27810298  0.877454710  0.48630381       a   info9
81   gene81  1.36800485 -0.450046404  1.17619932       a   info1
83   gene83 -1.19010613 -0.821803905 -0.01427030       a   info3
85   gene85 -0.86937570 -2.146207327  1.39258618       a   info5
87   gene87  0.16804174  0.067837359 -0.39151645       a   info7
89   gene89  0.51082185 -0.096180075  1.01450680       a   info9
91   gene91  0.70244217  1.217476610 -1.42666826       a   info1
93   gene93 -0.69170856  0.168757786  1.34433147       a   info3
95   gene95 -0.03194166  1.104555667 -0.36060870       a   info5
97   gene97  1.06344683  1.135774268  0.69769121       a   info7
99   gene99 -0.76495643  0.568503541  1.56419354       a   info9
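The same kind of selection can also be written with square brackets and a logical condition. This sketch uses a small made-up data frame, toy.data, in place of the course file and also shows how two conditions can be combined.

```r
toy.data <- data.frame(genename = paste0("gene", 1:6),
                       data1 = c(0.5, -1.2, 0.3, 2.1, -0.7, 1.4),
                       factor1 = rep(c("a", "b"), times = 3),
                       stringsAsFactors = FALSE)

# subset() keeps the rows where the condition is TRUE
only.a <- subset(toy.data, factor1 == "a")

# The same rows selected with square brackets and a logical vector
only.a.brackets <- toy.data[toy.data$factor1 == "a", ]

# Conditions can be combined with & (and) or | (or)
a.and.positive <- subset(toy.data, factor1 == "a" & data1 > 0)

only.a
a.and.positive
```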
  • Let’s save the subset into a csv file by using the function write.table().
write.table(QG16.lab.1.only.a, file = "./QG16_subset_only_a.csv", sep = ",", quote= FALSE, row.names=FALSE)
# quote = FALSE writes the entries without surrounding "" marks. Try it with quote = TRUE and see how the file differs.
# row.names = FALSE eliminates the numbers in front of each row
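A useful habit after exporting is to read the file back and confirm that nothing was lost. The sketch below does this with a made-up two-row data frame and a temporary file, so it does not touch the course data; all names in it are illustrative.

```r
toy.data <- data.frame(genename = c("gene1", "gene2"),
                       data1 = c(1.5, -0.5),
                       stringsAsFactors = FALSE)

output.file <- tempfile(fileext = ".csv")
write.table(toy.data, file = output.file, sep = ",",
            quote = FALSE, row.names = FALSE)

# Look at the raw text that was written
readLines(output.file)

# Read the file back and confirm the contents survived the round trip
read.back <- read.csv(output.file, stringsAsFactors = FALSE)
all.equal(toy.data, read.back)

unlink(output.file)   # delete the temporary file
```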