
Next, we are going to learn another way to get R to repeat the same task for different datasets: using for loops.

Suppose we want to print each word in a sentence. One way is to use six print statements:

best_practice <- c("Let", "the", "computer", "do", "the", "work")
print_words <- function(sentence) {
  print(sentence[1])
  print(sentence[2])
  print(sentence[3])
  print(sentence[4])
  print(sentence[5])
  print(sentence[6])
}

print_words(best_practice)
## [1] "Let"
## [1] "the"
## [1] "computer"
## [1] "do"
## [1] "the"
## [1] "work"

but that’s a bad approach for two reasons:

  1. It doesn’t scale: if we want to print the elements in a vector that’s hundreds long, we’d be better off just typing them in.

  2. It’s fragile: if we give it a longer vector, it only prints part of the data, and if we give it a shorter input, it returns NA values because we’re asking for elements that don’t exist!

best_practice[-6]
## [1] "Let"      "the"      "computer" "do"       "the"
print_words(best_practice[-6])
## [1] "Let"
## [1] "the"
## [1] "computer"
## [1] "do"
## [1] "the"
## [1] NA

Not Available

R has a special variable, NA, for designating missing values that are Not Available in a data set. See ?NA and An Introduction to R for more details.
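As a small illustration of the NA behaviour described above, the sketch below indexes past the end of the shortened vector and tests the result with is.na. The variable name short is introduced here only for the example.

```r
best_practice <- c("Let", "the", "computer", "do", "the", "work")
short <- best_practice[-6]   # drop the sixth word

length(short)
## [1] 5
short[6]                     # asking for an element that does not exist
## [1] NA
is.na(short[6])              # is.na tests whether a value is missing
## [1] TRUE
```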

Here’s a better approach:

print_words <- function(sentence) {
  for (word in sentence) {
    print(word)
  }
}

print_words(best_practice)
## [1] "Let"
## [1] "the"
## [1] "computer"
## [1] "do"
## [1] "the"
## [1] "work"

This is shorter (certainly shorter than a function that needs one print statement for each word of a hundred-word sentence) and more robust as well:

print_words(best_practice[-6])
## [1] "Let"
## [1] "the"
## [1] "computer"
## [1] "do"
## [1] "the"

The improved version of print_words uses a for loop to repeat an operation—in this case, printing—once for each thing in a collection. The general form of a loop is:

for (variable in collection) {
  do things with variable
}

We can name the loop variable anything we like (with a few restrictions, e.g. the name of the variable cannot start with a digit). The keyword in is part of the for syntax. Note that the body of the loop is enclosed in curly braces { }. For a single-line loop body, as here, the braces aren’t needed, but it is good practice to include them as we did.
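To make the point about braces concrete, the sketch below writes the same loop with and without them and checks that both print the same lines. The names without_braces and with_braces are introduced here for illustration; capture.output is used only to collect what print produces so the two versions can be compared.

```r
best_practice <- c("Let", "the", "computer", "do", "the", "work")

# Single-statement body without braces: legal, but easy to break
# when a second statement is added later
without_braces <- capture.output(
  for (word in best_practice) print(word)
)

# Same loop with braces, as recommended above
with_braces <- capture.output(
  for (word in best_practice) {
    print(word)
  }
)

identical(without_braces, with_braces)
## [1] TRUE
```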

Here’s another loop that repeatedly updates a variable:

len <- 0
vowels <- c("a", "e", "i", "o", "u")
for (v in vowels) {
  len <- len + 1
}
# Number of vowels
len
## [1] 5

It’s worth tracing the execution of this little program step by step. Since there are five elements in the vector vowels, the statement inside the loop will be executed five times. The first time around, len is zero (the value assigned to it on line 1) and v is "a". The statement adds 1 to the old value of len, producing 1, and updates len to refer to that new value. The next time around, v is "e" and len is 1, so len is updated to be 2. After three more updates, len is 5; since there is nothing left in the vector vowels for R to process, the loop finishes.
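One way to follow this trace on screen is to print the loop variable and the counter on every pass. The sketch below is the same loop with one extra cat call added purely for inspection.

```r
len <- 0
vowels <- c("a", "e", "i", "o", "u")
for (v in vowels) {
  len <- len + 1
  # Show the current vowel and the updated count on each iteration
  cat("v =", v, "-> len =", len, "\n")
}
## v = a -> len = 1
## v = e -> len = 2
## v = i -> len = 3
## v = o -> len = 4
## v = u -> len = 5
```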

Note that a loop variable is just a variable that’s being used to record progress in a loop. It still exists after the loop is over, and we can re-use variables previously defined as loop variables as well:

letter <- "z"
for (letter in c("a", "b", "c")) {
  print(letter)
}
## [1] "a"
## [1] "b"
## [1] "c"
# after the loop, letter is
letter
## [1] "c"

Note also that finding the length of a vector is such a common operation that R actually has a built-in function to do it called length:

length(vowels)
## [1] 5

length is much faster than any R function we could write ourselves, and much easier to read than a two-line loop; it will also give us the length of many other things that we haven’t met yet, so we should always use it when we can.
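As a rough illustration of the "many other things" that length can measure, the sketch below applies it to a vector, a list, and a data frame. The example values are made up. It also shows one thing length does not do: count the characters in a string, for which R provides nchar.

```r
vowels <- c("a", "e", "i", "o", "u")

length(vowels)                         # elements in a vector
## [1] 5
length(list(1, "two", TRUE))           # elements in a list
## [1] 3
length(data.frame(x = 1:4, y = 5:8))   # columns in a data frame
## [1] 2
length("computer")                     # a single string is a vector of length 1
## [1] 1
nchar("computer")                      # nchar counts the characters instead
## [1] 8
```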

Processing Multiple Files

Recall that we have multiple data files to process. As these are all kept within the same folder, we can use list.files to retrieve their filenames. If we run the function without any arguments, list.files(), it returns every file in the current working directory. We can understand this result by reading the help file (?list.files). The first argument, path, is the path to the directory to be searched, and it has the default value of "." (on the Unix Shell, "." is shorthand for the current working directory). The second argument, pattern, is the pattern being searched, and it has the default value of NULL. Since no pattern is specified to filter the files, all files are returned.
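The effect of the two defaults can be checked directly. The sketch below works in a throwaway folder inside tempdir() so that it does not depend on the course data; the folder name and file names are hypothetical.

```r
# Create a scratch folder holding two empty files
demo_dir <- file.path(tempdir(), "defaults_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("inflammation-demo.csv", "notes.txt"))))

old_wd <- setwd(demo_dir)   # temporarily make it the working directory
no_args <- list.files()
with_defaults <- list.files(path = ".", pattern = NULL)
setwd(old_wd)               # restore the original working directory

# Calling list.files() with no arguments is the same as spelling out the defaults
identical(no_args, with_defaults)
## [1] TRUE
```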

So to list all the csv files, or only the inflammation files, we could run the following:

list.files(path = "data", pattern = "csv")
##  [1] "DemoData.csv"        "inflammation-01.csv" "inflammation-02.csv"
##  [4] "inflammation-03.csv" "inflammation-04.csv" "inflammation-05.csv"
##  [7] "inflammation-06.csv" "inflammation-07.csv" "inflammation-08.csv"
## [10] "inflammation-09.csv" "inflammation-10.csv" "inflammation-11.csv"
## [13] "inflammation-12.csv"
list.files(path = "data", pattern = "inflammation")
##  [1] "inflammation-01.csv" "inflammation-02.csv" "inflammation-03.csv"
##  [4] "inflammation-04.csv" "inflammation-05.csv" "inflammation-06.csv"
##  [7] "inflammation-07.csv" "inflammation-08.csv" "inflammation-09.csv"
## [10] "inflammation-10.csv" "inflammation-11.csv" "inflammation-12.csv"
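The pattern argument is interpreted as a regular expression, so pattern = "csv" matches any filename that contains those three letters anywhere, not just files ending in .csv. The sketch below shows the difference on a scratch folder with hypothetical file names; the stricter pattern "\\.csv$" is one possible way to anchor the match to the extension.

```r
# Scratch folder with one real csv file and two look-alikes
demo_dir <- file.path(tempdir(), "pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(
  demo_dir, c("inflammation-01.csv", "csv-notes.txt", "old.csv.bak")
)))

# Matches "csv" anywhere in the name: all three files
loose <- list.files(path = demo_dir, pattern = "csv")
length(loose)
## [1] 3

# Matches only names that end in ".csv"
strict <- list.files(path = demo_dir, pattern = "\\.csv$")
strict
## [1] "inflammation-01.csv"
```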

Organizing Larger Projects

For larger projects, it is recommended to organize separate parts of the analysis into multiple subdirectories, e.g. one subdirectory for the raw data, one for the code, and one for the results like figures. We have done that here to some extent, putting all of our data files into the subdirectory “data”. For more advice on this topic, you can read A quick guide to organizing computational biology projects by William Stafford Noble.

As these examples show, the result of list.files is a vector of strings, which means we can loop over it to do something with each filename in turn.

Because we have put our data in a separate subdirectory, if we want to access these files using the output of list.files we also need to include the “path” portion of the file name. We can do that by using the argument full.names = TRUE.

list.files(path = "data", pattern = "csv", full.names = TRUE)
##  [1] "data/DemoData.csv"        "data/inflammation-01.csv"
##  [3] "data/inflammation-02.csv" "data/inflammation-03.csv"
##  [5] "data/inflammation-04.csv" "data/inflammation-05.csv"
##  [7] "data/inflammation-06.csv" "data/inflammation-07.csv"
##  [9] "data/inflammation-08.csv" "data/inflammation-09.csv"
## [11] "data/inflammation-10.csv" "data/inflammation-11.csv"
## [13] "data/inflammation-12.csv"
list.files(path = "data", pattern = "inflammation", full.names = TRUE)
##  [1] "data/inflammation-01.csv" "data/inflammation-02.csv"
##  [3] "data/inflammation-03.csv" "data/inflammation-04.csv"
##  [5] "data/inflammation-05.csv" "data/inflammation-06.csv"
##  [7] "data/inflammation-07.csv" "data/inflammation-08.csv"
##  [9] "data/inflammation-09.csv" "data/inflammation-10.csv"
## [11] "data/inflammation-11.csv" "data/inflammation-12.csv"
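To see why the path portion matters, the sketch below compares the two forms on a scratch folder. The folder and file names are hypothetical. A bare filename is looked up relative to the working directory, so R generally cannot find it, while the full name points at the actual file.

```r
demo_dir <- file.path(tempdir(), "full_names_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, "inflammation-demo.csv")))

bare <- list.files(path = demo_dir, pattern = "csv")
full <- list.files(path = demo_dir, pattern = "csv", full.names = TRUE)

bare
## [1] "inflammation-demo.csv"
basename(full) == bare   # the full name is the same file with its path prepended
## [1] TRUE
file.exists(full)        # the full name can be passed straight to read.csv
## [1] TRUE
```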

Let’s start by using a for loop to calculate and plot the mean inflammation over time. For simplicity let’s only run over the first 3 files.

filenames <- list.files(path = "data", pattern = "inflammation.*csv", full.names = TRUE)
filenames <- filenames[1:3]
for (f in filenames) {
  dat <- read.csv(file = f, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
}
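The call apply(dat, 2, mean) in the loop computes the mean of each column of dat: the second argument selects the dimension, with 2 meaning columns and 1 meaning rows. The sketch below shows this on a tiny made-up data frame (the name tiny and its values are assumptions for illustration, not part of the inflammation data).

```r
# Two rows, three columns, with the default column names read.csv would give
tiny <- data.frame(V1 = c(0, 2), V2 = c(1, 3), V3 = c(5, 7))

col_means <- apply(tiny, 2, mean)   # one value per column: 1, 2, 6
row_means <- apply(tiny, 1, mean)   # one value per row: 2, 4

unname(col_means)
## [1] 1 2 6
unname(row_means)
## [1] 2 4
```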

Let’s expand our for loop to also plot the min and max inflammation over time.

filenames <- list.files(path = "data", pattern = "inflammation.*csv", full.names = TRUE)
filenames <- filenames[1:3]
for (f in filenames) {
  dat <- read.csv(file = f, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
}

Saving Plots to a File

So far, we have used a for loop to plot summary statistics of the inflammation data.

While these plots are useful in an interactive R session, what if we want to send our results to our collaborators? Since we currently have 12 data sets, running the loop over all of them creates 36 plots. Saving each of these individually would be tedious and error-prone. And in the likely situation that we want to change how the data is processed or the look of the plots, we would have to once again save all 36 before sharing the updated results with our collaborators.

Here’s how we can save all three plots of the first inflammation data set in a pdf file:

pdf("inflammation-01.pdf")
dat <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
avg_day_inflammation <- apply(dat, 2, mean)
plot(avg_day_inflammation)
min_day_inflammation <- apply(dat, 2, min)
plot(min_day_inflammation)
max_day_inflammation <- apply(dat, 2, max)
plot(max_day_inflammation)
dev.off()

The function pdf redirects all the plots generated by R into a pdf file, which in this case we have named “inflammation-01.pdf”. After we are done generating the plots to be saved in the pdf file, we stop R from redirecting plots with the function dev.off.
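Combining pdf and dev.off with the for loop from earlier gives one pdf per data set. The following is a minimal sketch rather than a definitive implementation: it writes two small made-up csv files to a temporary folder so that it runs on its own, and it derives each pdf name from the csv name with sub, which is one possible naming choice. With the course data, filenames would come from the "data" folder instead.

```r
# Hypothetical stand-in data: two small csv files in a scratch folder
data_dir <- file.path(tempdir(), "pdf_loop_demo")
dir.create(data_dir, showWarnings = FALSE)
set.seed(1)
for (i in 1:2) {
  m <- matrix(rpois(40, lambda = 5), nrow = 4)
  write.table(m, file.path(data_dir, sprintf("inflammation-%02d.csv", i)),
              sep = ",", row.names = FALSE, col.names = FALSE)
}

filenames <- list.files(path = data_dir, pattern = "inflammation.*csv",
                        full.names = TRUE)
for (f in filenames) {
  pdf_name <- sub("csv$", "pdf", f)   # inflammation-01.csv -> inflammation-01.pdf
  pdf(pdf_name)                       # start redirecting plots to this file
  dat <- read.csv(file = f, header = FALSE)
  plot(apply(dat, 2, mean))
  plot(apply(dat, 2, min))
  plot(apply(dat, 2, max))
  dev.off()                           # close the file before moving to the next one
}
```

Each pdf contains three pages, one per plot. Calling dev.off inside the loop body ensures every file is closed before the next one is opened.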

Overwriting Plots

If you run pdf multiple times without running dev.off, you will save plots to the most recently opened file. However, you won’t be able to open the previous pdf files because the connections were not closed. In order to get out of this situation, you’ll need to run dev.off until all the pdf connections are closed. You can check your current status using the function dev.cur. If it says “pdf”, all your plots are being saved in the last pdf specified. If it says “null device” or “RStudioGD”, the plots will be visualized normally.
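The situation described above can be reproduced and cleaned up as follows. This sketch uses graphics.off, a base R function not mentioned in these notes, as a shortcut for calling dev.off repeatedly; note that it closes every open graphics device, including an interactive plot window. The pdf file names are hypothetical and are written to a temporary folder.

```r
pdf(file.path(tempdir(), "first.pdf"))
pdf(file.path(tempdir(), "second.pdf"))   # opened without closing the first

names(dev.cur())      # plots currently go to the most recently opened pdf
## [1] "pdf"
length(dev.list())    # number of open devices (2 if none were open beforehand)

graphics.off()        # close all open graphics devices at once

names(dev.cur())      # nothing is being redirected any more
## [1] "null device"
```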
