Quantitative Genomics and Genetics 2016

Computer Lab 5

– 3 March 2016

– Author: Jin Hyun Ju (jj328@cornell.edu)

1. Pseudo Random Numbers

  • Generating random numbers may sound like a fairly simple task, but if you think about it and actually try to implement a method for it you will quickly notice that it is a quite difficult problem.

  • The basic mechanism is that the functions starts at a certain number (seed), and generates a sequence of numbers that look “random enough”.

  • We have already seen a random number generator in action with the rnorm() function which generates random numbers drawn from a normal distribution.

  • Today we will take a look at the more general function sample(). This function allows you to generate random numbers from a sequence with assigned probabilities.

  • The following function call will generate 3 numbers drawn from (1,3,5,7,9) without replacement.

sample(x = c(1,3,5,7,9),
       size= 3, 
       replace = FALSE)
[1] 1 9 3
  • When you set replace = TRUE, the same number can appear multiple times in the sample.
for( index in 1:5){
  sample.temp <- sample(x= c(1,3,5,7,9),
                        size= 3, 
                        replace = TRUE)
  cat("Sample #",index,"=",sample.temp, "\n")
}
Sample # 1 = 9 5 1 
Sample # 2 = 1 9 3 
Sample # 3 = 5 9 1 
Sample # 4 = 5 3 9 
Sample # 5 = 7 9 9 
  • You can also set specific probabilities for the numbers if you are interested in running a casino.
for( index in 1:5){
  sample.temp <- sample(x= c(1,3,5,7,9),
                        size= 3, 
                        replace = TRUE, 
                        prob = c(0.1,0.1,0.1,0.1,0.6))
  cat("Sample #",index,"=",sample.temp, "\n")
}
Sample # 1 = 5 9 9 
Sample # 2 = 9 5 9 
Sample # 3 = 9 9 9 
Sample # 4 = 9 3 1 
Sample # 5 = 5 9 9 
  • Note that the sample function also works for strings as well. This will come in handy when we have to simulate genotype values later in the course.
sample(x = c("A","T","G","C"),
       size= 10, 
       replace = TRUE,
       prob = c(0.3,0.3,0.2,0.2))
 [1] "G" "A" "A" "A" "G" "A" "G" "A" "A" "T"
  • It is sometimes very useful to generate the exact same results from processes that involve random numbers. You can check if your code is doing the right thing, and if not you can probably find out where it is wrong. We can set a seed, which is used as a reference point to generate random numbers, to generate identical not-so-random-anymore numbers.
set.seed(1987) 
sample(x = 1:100,size = 10, replace = FALSE)
 [1] 10 80 60 38 96 95 36 22 55 32
set.seed(1987)
sample(x = 1:100,size = 10, replace = FALSE)
 [1] 10 80 60 38 96 95 36 22 55 32
sample(x = 1:100,size = 10, replace = FALSE)
 [1] 61 74 71 18 29 45 13 62 76 35
  • The same is true for the random sampling functions from specific distributions.
set.seed(2002)
rnorm(10)
 [1] -0.10659101  1.61444509  1.38826285  0.70277776  2.25750025
 [6] -0.19649959  1.69685022 -2.35790439 -0.08141469 -0.68264759
set.seed(2002)
rnorm(10)
 [1] -0.10659101  1.61444509  1.38826285  0.70277776  2.25750025
 [6] -0.19649959  1.69685022 -2.35790439 -0.08141469 -0.68264759

2. paste() function

  • paste() is a very useful function when you have to combine constant and changing information.

  • For example, if you want to generate multiple files with the same prefix you can use the paste function like this:

for( i in 1:5){
  file.name <- paste("QG16","Lab5","file",i,sep = "_")
  # you can add as many elements as you want within the parentheses, the sep option specifies the separation character
  file.name <- paste(file.name, "txt", sep = ".")
  cat(file.name, "\n")    
}
QG16_Lab5_file_1.txt 
QG16_Lab5_file_2.txt 
QG16_Lab5_file_3.txt 
QG16_Lab5_file_4.txt 
QG16_Lab5_file_5.txt 

Exercise - The Sampling Distribution of the Mean and P values

Part 1

  • Create 100 samples from a normal distribution with a sample size of 50, mean = 0, and sd = 1.

  • For each sample calculate the mean and save it to a vector.

  • Plot the density for the sample means (there should be 100 sample means) and calculate the mean and standard deviation of this distribution.

Part 2

  • Use pnorm() to calculate p-values for each sample mean using the sampling distribution of the mean, which should have slightly different parameters compared to the original distribution.

  • Plot a histogram for the p-value distribution, and show the number of times where we reject the null hypothesis ( mean = 0 ) with a threshold of alpha = 0.05 and print it out. Keep in mind that in this case we are testing if the mean is different from 0, not larger or smaller.

Part 3

  • Generate a function based on the work you have done, so that you can change the number of samples, sample size, and the mean and standard deviation of the population that we are sampling from.

  • Test your function with a seed set at 777 and compare your results with the results shown here.

  • The output of the function should look something like this.

sampling_function(sample_size = 100, 
                  number_of_samples = 1000, 
                  population_mean = 0,
                  population_sd = 1)
## Null Hypothesis is rejected in  45  out of  1000  cases ( 4.5 %)

Part 4

  • Test your function with different input parameters.

  • Does the percentage of cases where we reject the null hypothesis change depending on your sample size?

  • Which parameters affect the sampling distribution?