Quantitative Genomics and Genetics

Computer Lab 1

– 28 August 2014

– Author: Jin Hyun Ju (jj328@cornell.edu)

Installing R

The Comprehensive R Archive Network (CRAN)

http://cran.r-project.org/

Download the R distribution for your operating system (macOS, Windows, or Linux) and follow the installation instructions.

Installing RStudio

http://www.rstudio.com/

What is R?

  • R is a programming language: a set of commands in a syntax that can be converted into a form the computer can interpret and run on its processor to do computations. In short, a way to tell the computer what to do.

  • R is a “scripting” type language: the syntax is interpreted directly to do computations, and many of the complex tasks one would like to accomplish are handled for you automatically.

  • R is specifically developed for doing statistical tasks and computations.

  • History: R is an open-source (free) implementation of the S language, which was originally developed at Bell Labs in the 1970s (S+ is the commercial version of S). The core of the R system is written largely in C.

  • Relationship to other languages: R is distinct from compiled languages (such as C and C++) and from other dynamic scripting languages (such as Perl). It has similar functionality to SAS and is (effectively) a free version of S.

What is RStudio?

  • User friendly interface for R. Makes your life a lot easier.

  • Syntax highlighting

  • Code completion

  • Smart indentation

  • Tools to manage plots

  • Browse files and directories

  • Visualizing object structures

  • When you run RStudio, it automatically opens an R session

Why R?

  • It’s free. (1 year SPSS licences range from $100 (student pricing) to $14,000+)

  • R is supported by a large community (constantly getting updated, help communities, new features)

  • R is user friendly and easy to learn and provides a foundation for learning more complex languages

  • R is particularly good for accomplishing statistical applications

  • R has nice graphical outputs (for example, with the ggplot2 package)

  • R is particularly useful in bioinformatics (the Bioconductor community has many packages available for specific analyses)

Why are we interested in using R?

  • We can analyze large datasets in a reproducible way. (You will never, or hardly ever, use Excel again)

  • We can implement an algorithm or a statistical model that is not available as a standard analysis

  • We can automate a complex analysis by putting together multiple steps

  • We can simulate data

A few remarks on programming

  • It is easier than you think once you get used to it.

  • It is about winning the small battles. Do not attempt to write up the entire process at once.

  • Typos are your worst enemy and you will see many in the lecture notes as well…

  • Computers are super fast stubborn idiots. Always keep that in mind.

  • Your program or script will almost certainly never work on the first run. If it does work, it is probably not doing the right thing.

1. Figuring out where the computer is doing the work: Working Directories

  • The “working directory” is where R automatically looks for your data files and saves the outputs.

  • You can use R interactively by typing commands at the “command line”, also called the “prompt”, which looks like this: “>”.

  • You can also use the script window. (we will get there soon)

  • Let’s check the directory that you are in first

getwd()
[1] "/Users/Jin/Dropbox/Quantitative_Genomics_2014/Computer_lab_1"
  • Now let’s set your working directory to another place

  • If you don’t know what to type as your directory, try hitting the Tab key inside the quote marks: setwd("hit tab and see what happens")

setwd("~/Dropbox/Quantitative_Genomics_2014/Computer_lab_1/")
  • This command changed our working directory and we can check it by using getwd() again
getwd()
[1] "/Users/Jin/Dropbox/Quantitative_Genomics_2014/Computer_lab_1"
  • To see what is in the working directory
dir() 
[1] "Computer_Lab_1.html"     "Computer_Lab_1.pdf"     
[3] "Computer_Lab_1.Rmd"      "QG13-lab1.pdf"          
[5] "QG14_Computer_Lab_1.Rmd" "QG14_subset_only_a.csv" 
[7] "QG14-lab1-data.csv"     
list.files()  # does the same thing
[1] "Computer_Lab_1.html"     "Computer_Lab_1.pdf"     
[3] "Computer_Lab_1.Rmd"      "QG13-lab1.pdf"          
[5] "QG14_Computer_Lab_1.Rmd" "QG14_subset_only_a.csv" 
[7] "QG14-lab1-data.csv"     
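  • A minimal sketch of how these commands fit together, using only base R. The folder returned by tempdir() stands in for your own lab folder, and old.directory and lab.directory are hypothetical names chosen for this illustration.

```r
# remember where we started so we can come back later
old.directory <- getwd()

# tempdir() is a scratch folder that R creates for each session;
# it is used here only as a stand-in for your own lab folder
lab.directory <- tempdir()
setwd(lab.directory)
getwd()

# file.exists() looks for a file relative to the working directory
file.exists("QG14-lab1-data.csv")

# go back to the original working directory
setwd(old.directory)
```

  • Saving the old directory first is a small habit that makes it easy to undo a setwd() call.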
  • Now that we know where we are let’s see what R can really do.

2. R can calculate

101+127
[1] 228
2 * 4
[1] 8
6 / 3
[1] 2
  • And in a more sophisticated manner
( (6 / 3) + ( 9 / (1+2) ) - 2.3 )^2
[1] 7.29
  • Built in math functions

  • Everything that is written after a # is considered as a comment and will not be executed.

log(10)     # log with base e
[1] 2.303
log2(8)     # log with base 2
[1] 3
log10(1000) # log with base 10
[1] 3
exp(4)      # exponentials
[1] 54.6
sqrt(36)    # square roots
[1] 6
abs( 10 - 15 ) # absolute values
[1] 5
  • In case you cannot remember what a function does, try typing the name of the function with a ? in front of it.
?log
?log10
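  • A few more calculator features that are easy to miss, shown as a short sketch using only base R operators and functions.

```r
# integer division and remainder
17 %/% 5    # how many times 5 fits into 17
17 %% 5     # what is left over

# log() takes an optional base argument
log(8, base = 2)    # same value as log2(8)

# round() controls how many decimals are kept
round(exp(4), digits = 1)

# exponentiation is evaluated before the minus sign
-2^2        # read as -(2^2)
(-2)^2
```

  • When in doubt about the order of operations, add parentheses as in the last line.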

3. R can do much more than a calculator: Data structures in R

  1. Vectors
  • R can store values in variables that can be declared and used as follows:
number.of.students <- 50
male.ratio <- 0.4

number.of.males <- number.of.students * male.ratio
number.of.females <- number.of.students * (1 - male.ratio)
  • Let’s check the results saved in the variables
number.of.males
[1] 20
number.of.females
[1] 30
  • It is very important to use variable names that make sense to you. DO NOT CHOOSE VARIABLE NAMES BASED ON HOW LONG IT TAKES TO TYPE THEM. For example, do not use variable names like a, asdf, nm, nf, var1, etc. Otherwise you will spend much more time interpreting your own code when you come back to it after a while.

  • Variable names are case sensitive. number.of.males and Number.of.males are two different variables.

  • Word separation in variable names is often done with periods, as in these notes (underscores are another common choice).

  • By default R stores values as vectors, which are ordered sequences of values of the same type (for example, numbers). Even a single number is a vector of length one.

  • There are many, many ways to create a vector from scratch. Here are a few examples

example.vector <- c(1.1,3,5.3,7,9.0)
example.vector
[1] 1.1 3.0 5.3 7.0 9.0
example.vector2 <- 1:10
example.vector2
 [1]  1  2  3  4  5  6  7  8  9 10
example.vector3 <- seq(from=1, to = 12, by = 3)
example.vector3
[1]  1  4  7 10
  • You can access individual values of a vector by using square brackets [ ].
example.vector[3]
[1] 5.3
# access multiple values
example.vector2[c(1,2)]
[1] 1 2
# access every value except position 2
example.vector2[-2]
[1]  1  3  4  5  6  7  8  9 10
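  • Besides positions, the square brackets also accept TRUE/FALSE values, which lets you pick elements by a condition. A short sketch using example.vector2 from above (redefined here so the block runs on its own):

```r
example.vector2 <- 1:10

# a comparison returns one TRUE/FALSE per element
example.vector2 > 7

# a logical vector inside [ ] keeps the positions that are TRUE
example.vector2[example.vector2 > 7]

# which() gives the positions instead of the values
which(example.vector2 %% 2 == 0)

# length() tells you how many values a vector holds
length(example.vector2)
```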
  • You can also use functions on vectors
# get the maximum value
max(example.vector3)
[1] 10
# get the minimum value
min(example.vector3)
[1] 1
# get the mean
mean(example.vector3)
[1] 5.5
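  • Arithmetic also works on whole vectors at once, element by element. A short sketch with example.vector3 (redefined here so the block runs on its own):

```r
example.vector3 <- seq(from = 1, to = 12, by = 3)   # 1 4 7 10

# arithmetic is applied to every element at once
example.vector3 * 2
example.vector3 + example.vector3

# a few more summary functions
sum(example.vector3)
length(example.vector3)
sd(example.vector3)     # sample standard deviation

# the mean written out by hand gives the same answer as mean()
sum(example.vector3) / length(example.vector3)
```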
  • Vectors can also store character strings
character.vector1 <- c("R", "is", "easy","free","awesome")
character.vector1
[1] "R"       "is"      "easy"    "free"    "awesome"
character.vector1[c(1,2,3)]
[1] "R"    "is"   "easy"
character.vector1[c(1,2,5)]
[1] "R"       "is"      "awesome"
character.vector1[-c(3,5)]
[1] "R"    "is"   "free"
  • To check what type of data a vector holds you can use the class() function
class(character.vector1)
[1] "character"
class(example.vector)
[1] "numeric"
class(example.vector2)
[1] "integer"
  • Now, what class would this vector be?
question1 <- "1"
  • Anything written between “” will be considered as characters.
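  • You can check the answer yourself, and convert between types when needed. In this sketch converted.number and mixed.vector are hypothetical names used only for illustration.

```r
question1 <- "1"
class(question1)          # "character", because of the quotes

# a character "1" cannot be used in arithmetic directly;
# as.numeric() converts it to a number first
converted.number <- as.numeric(question1)
class(converted.number)
converted.number + 1

# mixing numbers and characters in c() turns everything into characters
mixed.vector <- c(1, "a", 3)
class(mixed.vector)
```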
  2. Matrices
  • With matrices we can start working in 2 dimensions.

  • We can create a matrix like this:

example.matrix1 <- matrix(1:6,nrow=2)
example.matrix2 <- matrix(1:6,ncol = 2)
example.matrix3 <- matrix(1:6, nrow=2,ncol=3)
example.matrix4 <- matrix(1:6, nrow=2,ncol=3, byrow=TRUE)

example.matrix1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
example.matrix2
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
example.matrix3
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
example.matrix4
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
  • Checking the dimensions of a matrix
dim(example.matrix4)
[1] 2 3
# The result is given as rows x columns
  • You can access values the same way as vectors, but with an additional position

  • Entire columns and rows can be selected by leaving the other entry empty

example.matrix4[1,2]
[1] 2
example.matrix4[1:2,1]
[1] 1 4
# selecting a row 
example.matrix4[1,]
[1] 1 2 3
# selecting a column
example.matrix4[,2]
[1] 2 5
  • Sometimes you will need to transpose a matrix
t(example.matrix4)
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
  • Sometimes you will need to name the rows and columns of your matrix
rownames(example.matrix4) <- c("row1","row2")
colnames(example.matrix4) <- c("column1","column2","column3")
example.matrix4
     column1 column2 column3
row1       1       2       3
row2       4       5       6
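  • Once a matrix has names, they can be used in place of positions. The sketch below also shows a few base R functions that work on whole matrices; example.matrix4 is rebuilt first so the block runs on its own.

```r
example.matrix4 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
rownames(example.matrix4) <- c("row1", "row2")
colnames(example.matrix4) <- c("column1", "column2", "column3")

# names can be used in place of positions
example.matrix4["row2", "column3"]
example.matrix4[, "column2"]

# sums and means across rows or columns
rowSums(example.matrix4)
colMeans(example.matrix4)

# matrix multiplication uses %*% ; the inner dimensions must match (2x3 times 3x2)
example.matrix4 %*% t(example.matrix4)
```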
  • Just like vectors, matrices can also hold characters

  • An important feature of vectors and matrices is that they can only hold one type of data. So a numerical matrix will only be allowed to hold numerical values, and a character matrix will only hold characters.

character.matrix <- matrix(c("a","b","c","d","e","f"), nrow = 2, ncol = 3, byrow=TRUE)
character.matrix
     [,1] [,2] [,3]
[1,] "a"  "b"  "c" 
[2,] "d"  "e"  "f" 
mode(example.matrix4)
[1] "numeric"
mode(character.matrix)
[1] "character"
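  • To see the “one type only” rule in action, try putting a character into a numeric matrix. In this sketch numeric.matrix is a hypothetical name used only for illustration.

```r
numeric.matrix <- matrix(1:4, nrow = 2)
mode(numeric.matrix)

# assigning a character into one cell converts the whole matrix
numeric.matrix[1, 1] <- "a"
mode(numeric.matrix)
numeric.matrix
```

  • R does not raise an error here; it silently converts every entry to a character, so it is worth checking mode() or class() when results look odd.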
  3. Data frames
  • To deal with multiple types of data in a single object we use data frames.
numbers <- c(1:4)
characters <- c("a","b","c","d")

example.data.frame <- data.frame(numbers,characters)
example.data.frame
  numbers characters
1       1          a
2       2          b
3       3          c
4       4          d
class(example.data.frame[,1])
[1] "integer"
class(example.data.frame[,2])
[1] "factor"
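  • The "factor" result above depends on the R version. Starting with R 4.0.0, data.frame() no longer converts character columns to factors by default, so the same code returns "character". Setting the stringsAsFactors argument explicitly gives the same result in any version. In this sketch factor.data.frame and character.data.frame are hypothetical names.

```r
numbers <- 1:4
characters <- c("a", "b", "c", "d")

# ask for factors explicitly
factor.data.frame <- data.frame(numbers, characters, stringsAsFactors = TRUE)
class(factor.data.frame$characters)
levels(factor.data.frame$characters)

# ask for plain character strings explicitly
character.data.frame <- data.frame(numbers, characters, stringsAsFactors = FALSE)
class(character.data.frame$characters)
```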
  • You can also convert a matrix directly to a data.frame
converted.data.frame <- as.data.frame(example.matrix4)
converted.data.frame
     column1 column2 column3
row1       1       2       3
row2       4       5       6
class(converted.data.frame)
[1] "data.frame"
  • Data frames can be accessed just like matrices

  • If you know the column names you can also use $ to access specific columns

example.data.frame[1,1]
[1] 1
example.data.frame[2,]
  numbers characters
2       2          b
example.data.frame$numbers
[1] 1 2 3 4
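  • A few more everyday operations on data frames, as a short sketch. The column name doubled is hypothetical, and example.data.frame is rebuilt first so the block runs on its own.

```r
example.data.frame <- data.frame(numbers = 1:4,
                                 characters = c("a", "b", "c", "d"))

nrow(example.data.frame)
ncol(example.data.frame)

# add a new column computed from an existing one
example.data.frame$doubled <- example.data.frame$numbers * 2

# keep only the rows where a condition is TRUE
example.data.frame[example.data.frame$numbers > 2, ]

# str() gives a compact overview of every column
str(example.data.frame)
```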
  4. Lists
  • You probably won’t need lists while you are still getting familiar with R, but once you know how to use them they make many things a lot easier.

  • Lists are basically bundles of objects, which can hold vectors, matrices and data frames in a single place.

example.list <- list() # declaring a list

example.list$place1 <- example.vector3
example.list$place2 <- example.matrix4
example.list$place3 <- example.data.frame
  • Now the list has a vector, a matrix and a data frame all stored in one object. To reveal the structure of a list we can use the str() function.
str(example.list)
List of 3
 $ place1: num [1:4] 1 4 7 10
 $ place2: int [1:2, 1:3] 1 4 2 5 3 6
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "row1" "row2"
  .. ..$ : chr [1:3] "column1" "column2" "column3"
 $ place3:'data.frame': 4 obs. of  2 variables:
  ..$ numbers   : int [1:4] 1 2 3 4
  ..$ characters: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
  • Each item can be accessed by using either a double square bracket [[]] or a dollar sign $ if you know the name of the entry.
example.list[[1]]
[1]  1  4  7 10
example.list[[2]]
     column1 column2 column3
row1       1       2       3
row2       4       5       6
example.list$place3
  numbers characters
1       1          a
2       2          b
3       3          c
4       4          d
  • If you want to access a specific value of a vector/matrix/data.frame within a list, you can do so by adding a single square bracket.
example.list[[1]][2]
[1] 4
example.list[[2]][,1]
row1 row2 
   1    4 
example.list$place3[,1]
[1] 1 2 3 4
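  • A list can also be built in one step, and it helps to know the difference between single and double brackets. In this sketch example.list2 is a hypothetical name used only for illustration.

```r
# build a list in one step with named entries
example.list2 <- list(place1 = c(1, 4, 7, 10),
                      place2 = matrix(1:6, nrow = 2),
                      place3 = "some text")

names(example.list2)
length(example.list2)

# single brackets return a smaller list, double brackets return the entry itself
class(example.list2["place1"])
class(example.list2[["place1"]])

# sapply() applies a function to every entry of the list
sapply(example.list2, length)
```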
  • In case you cannot remember which variables you declared, use the ls() function
ls()
 [1] "character.matrix"     "character.vector1"    "characters"          
 [4] "converted.data.frame" "example.data.frame"   "example.list"        
 [7] "example.matrix1"      "example.matrix2"      "example.matrix3"     
[10] "example.matrix4"      "example.vector"       "example.vector2"     
[13] "example.vector3"      "male.ratio"           "metadata"            
[16] "number.of.females"    "number.of.males"      "number.of.students"  
[19] "numbers"              "question1"           
  • To remove variables that are no longer used, use the rm() function
rm(example.list)
ls()
 [1] "character.matrix"     "character.vector1"    "characters"          
 [4] "converted.data.frame" "example.data.frame"   "example.matrix1"     
 [7] "example.matrix2"      "example.matrix3"      "example.matrix4"     
[10] "example.vector"       "example.vector2"      "example.vector3"     
[13] "male.ratio"           "metadata"             "number.of.females"   
[16] "number.of.males"      "number.of.students"   "numbers"             
[19] "question1"           

4. Importing and Exporting Data

  • To analyze actual data you will almost certainly have to import your data set from an existing file.

  • The most frequently used functions for reading in data are read.table() or read.csv()

  • You are probably used to seeing spreadsheets in the .xlsx file format, which is used by Excel. However, tab-separated text files (.tsv or .txt) or comma-separated text files (.csv) are more common when the data get bigger. Excel can usually open these text files and will convert them automatically when it does.

  • Let’s read in a comma separated example file.

QG14.lab.1 <- read.table("~/Dropbox/Quantitative_Genomics_2014/Computer_lab_1/QG14-lab1-data.csv", sep = ",", header = TRUE)
QG14.lab.2 <- read.csv("~/Dropbox/Quantitative_Genomics_2014/Computer_lab_1/QG14-lab1-data.csv", header = TRUE)
  • The two functions here are essentially doing the same thing. You can think of read.csv() as the specialized version of read.table() for csv files. read.table() can read csv files as well, as long as you specify the separating character with sep = ",".
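  • You can convince yourself of this with a tiny file. The sketch below writes three made-up lines to a temporary file (the contents and the names example.file, from.read.table and from.read.csv are hypothetical) and reads them back both ways.

```r
# write a tiny comma-separated file to a temporary location
example.file <- tempfile(fileext = ".csv")
writeLines(c("genename,data1,factor1",
             "gene1,1.5,a",
             "gene2,-0.5,b"),
           example.file)

from.read.table <- read.table(example.file, sep = ",", header = TRUE)
from.read.csv   <- read.csv(example.file)   # header = TRUE and sep = "," are its defaults

# for a simple file like this, both calls should give the same data frame
all.equal(from.read.table, from.read.csv)
```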

  • The first thing to do after reading in a data file is to check the dimensions and inspect the data.

dim(QG14.lab.1)
[1] 100   6
head(QG14.lab.1)     # displays the first 6 rows of the data (the default)
  genename    data1    data2   data3 factor1 factor2
1    gene1  1.42866 -0.15785  1.3136       a   info1
2    gene2 -0.58165  0.59400 -0.2319       b   info2
3    gene3 -1.03956  1.08386  0.7051       a   info3
4    gene4  0.58382 -0.12587  1.2628       b   info4
5    gene5  0.04377 -0.00224  0.1445       a   info5
6    gene6  0.26733 -1.75629  0.5459       b   info6
colnames(QG14.lab.1) # checking the column names of the data
[1] "genename" "data1"    "data2"    "data3"    "factor1"  "factor2" 
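  • A few more base R functions that are handy for a first look at a data set. Since the lab file may not be on your computer, this sketch uses a small made-up data frame, toy.data (a hypothetical name with invented values), as a stand-in for QG14.lab.1.

```r
# a small made-up data frame standing in for QG14.lab.1
toy.data <- data.frame(genename = paste0("gene", 1:20),
                       data1    = (1:20) / 10,
                       factor1  = rep(c("a", "b"), times = 10))

head(toy.data)           # first 6 rows (the default)
head(toy.data, n = 3)    # first 3 rows
tail(toy.data, n = 3)    # last 3 rows
str(toy.data)            # type of each column
summary(toy.data$data1)  # minimum, quartiles, mean, maximum
table(toy.data$factor1)  # how many rows there are for each level
```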
  • This data set has a column named factor1 which has two levels, “a” and “b”. Let’s say we are only interested in the entries that have an “a” for factor1. We can subset the data by using the subset() function.
QG14.lab.1.only.a <- subset(QG14.lab.1, factor1 == "a")
QG14.lab.1.only.a
   genename    data1    data2    data3 factor1 factor2
1     gene1  1.42866 -0.15785  1.31362       a   info1
3     gene3 -1.03956  1.08386  0.70507       a   info3
5     gene5  0.04377 -0.00224  0.14452       a   info5
7     gene7  1.14016 -1.21040  0.53584       a   info7
9     gene9 -1.58742 -1.82601 -0.01476       a   info9
11   gene11 -0.37906  0.08062  0.54954       a   info1
13   gene13 -0.21881 -1.65537 -1.56231       a   info3
15   gene15 -0.09163 -1.67246 -0.63767       a   info5
17   gene17  0.22309  1.79787  1.06193       a   info7
19   gene19 -0.86786 -1.15819 -1.47626       a   info9
21   gene21 -0.46072 -0.64453 -0.58341       a   info1
23   gene23  0.25706 -0.07235  0.53205       a   info3
25   gene25 -1.50902  0.34480 -0.14394       a   info5
27   gene27 -0.20673 -0.67634  1.13030       a   info7
29   gene29 -2.45942  0.13078 -0.76749       a   info9
31   gene31 -1.07707  1.76210  0.51017       a   info1
33   gene33  0.27598  0.51560 -0.40090       a   info3
35   gene35 -0.10883  1.59526 -1.08501       a   info5
37   gene37 -0.33385  0.32208 -0.06296       a   info7
39   gene39  1.44402 -0.83674  0.43612       a   info9
41   gene41  0.31641  0.82488  0.43425       a   info1
43   gene43  0.87754 -0.86226  1.33607       a   info3
45   gene45 -0.90987  0.49486  0.45634       a   info5
47   gene47  0.71583  0.18219 -0.47308       a   info7
49   gene49  0.79019  0.58958 -1.38648       a   info9
51   gene51 -0.35685 -1.28456 -0.80726       a   info1
53   gene53  0.26359  0.48757  0.60362       a   info3
55   gene55  0.52295  0.93866 -0.07415       a   info5
57   gene57  0.62241  0.55364 -0.18706       a   info7
59   gene59  0.37482  1.60532  1.04147       a   info9
61   gene61 -0.55942 -1.33440 -0.48293       a   info1
63   gene63  0.05691 -1.40383 -1.78641       a   info3
65   gene65  0.74695 -0.10599  0.91208       a   info5
67   gene67  0.61620 -0.42078  0.49940       a   info7
69   gene69 -1.05644  0.19295 -0.63909       a   info9
71   gene71  0.27960 -0.53528  0.25301       a   info1
73   gene73  0.85671 -0.24313 -0.71447       a   info3
75   gene75  1.38037  0.71432  0.77274       a   info5
77   gene77  1.16121 -1.79427  0.23173       a   info7
79   gene79  0.27810  0.87745  0.48630       a   info9
81   gene81  1.36800 -0.45005  1.17620       a   info1
83   gene83 -1.19011 -0.82180 -0.01427       a   info3
85   gene85 -0.86938 -2.14621  1.39259       a   info5
87   gene87  0.16804  0.06784 -0.39152       a   info7
89   gene89  0.51082 -0.09618  1.01451       a   info9
91   gene91  0.70244  1.21748 -1.42667       a   info1
93   gene93 -0.69171  0.16876  1.34433       a   info3
95   gene95 -0.03194  1.10456 -0.36061       a   info5
97   gene97  1.06345  1.13577  0.69769       a   info7
99   gene99 -0.76496  0.56850  1.56419       a   info9
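  • The same subset can be taken with square brackets, and subset() accepts more than one condition. The sketch uses a small made-up data frame, toy.data (hypothetical name and values), so it runs without the lab file.

```r
toy.data <- data.frame(genename = paste0("gene", 1:6),
                       data1    = c(1.4, -0.6, -1.0, 0.6, 0.0, 0.3),
                       factor1  = rep(c("a", "b"), times = 3))

only.a.subset   <- subset(toy.data, factor1 == "a")
only.a.brackets <- toy.data[toy.data$factor1 == "a", ]

# both approaches keep the same rows
all.equal(only.a.subset, only.a.brackets)

# conditions can be combined with & (and) or | (or)
subset(toy.data, factor1 == "a" & data1 > 0)

# the select argument keeps only some of the columns
subset(toy.data, factor1 == "a", select = c(genename, data1))
```

  • For large results like the one above, wrapping the subset in head() or dim() is usually easier to read than printing every row.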
  • Let’s save the subset into a csv file by using the function write.table().
write.table(QG14.lab.1.only.a, file = "./QG14_subset_only_a.csv", sep = ",", quote= FALSE, row.names=FALSE)
# quote = FALSE leaves out the "" around character entries. Try it with quote = TRUE and see how the file differs.
# row.names = FALSE eliminates the numbers in front of each row
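  • A good habit after writing a file is to look at it and read it back in. The sketch below does this with a temporary file and a small made-up data frame; toy.subset, output.file and read.back are hypothetical names.

```r
toy.subset <- data.frame(genename = c("gene1", "gene3"),
                         data1    = c(1.4, -1.0),
                         factor1  = c("a", "a"))

output.file <- tempfile(fileext = ".csv")
write.table(toy.subset, file = output.file, sep = ",",
            quote = FALSE, row.names = FALSE)

# look at the raw text that was written
readLines(output.file)

# read the file back in and check that the dimensions are what we expect
read.back <- read.csv(output.file)
dim(read.back)
colnames(read.back)
```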