
We will now look at how R handles categorical data and how to summarise it.

We will start by loading a new dataset for this section and then recap some of the functions from earlier to look at its structure.

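# stringsAsFactors = TRUE reads text columns as factors (this was the default before R 4.0.0)
dat <- read.csv(file = "data/DemoData.csv", header = TRUE, stringsAsFactors = TRUE)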
class(dat)
## [1] "data.frame"
str(dat)
## 'data.frame':    30 obs. of  13 variables:
##  $ PatientID  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Collection : Factor w/ 3 levels "Bristol","Exeter",..: 2 1 3 2 1 3 3 2 1 2 ...
##  $ Age        : int  20 21 30 23 21 27 26 23 24 29 ...
##  $ Sex        : Factor w/ 2 levels "F","M": 2 2 1 2 1 1 1 1 2 2 ...
##  $ CaseControl: Factor w/ 2 levels "Case","Control": 1 2 1 2 1 2 1 2 1 2 ...
##  $ Exposure   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Test1      : int  5 5 4 9 5 1 8 3 3 8 ...
##  $ Test2      : int  28 53 94 56 47 97 71 43 51 110 ...
##  $ Test3      : int  52 10 33 14 43 13 43 37 52 28 ...
##  $ Test4      : int  24 25 29 17 26 8 23 12 11 16 ...
##  $ Test5      : int  71 5 82 66 8 51 97 31 84 66 ...
##  $ Test6      : int  23 12 23 10 8 23 18 11 15 15 ...
##  $ Test7      : int  69 34 63 66 11 60 80 34 83 57 ...
head(dat)
##   PatientID  Collection Age Sex CaseControl Exposure Test1 Test2 Test3
## 1         1      Exeter  20   M        Case        A     5    28    52
## 2         2     Bristol  21   M     Control        A     5    53    10
## 3         3 Southampton  30   F        Case        A     4    94    33
## 4         4      Exeter  23   M     Control        A     9    56    14
## 5         5     Bristol  21   F        Case        A     5    47    43
## 6         6 Southampton  27   F     Control        A     1    97    13
##   Test4 Test5 Test6 Test7
## 1    24    71    23    69
## 2    25     5    12    34
## 3    29    82    23    63
## 4    17    66    10    66
## 5    26     8     8    11
## 6     8    51    23    60
dim(dat)
## [1] 30 13

We have a data.frame with data on 30 participants (one per row), including demographic information, case-control status, exposure status and scores from 7 cognitive tests.

Understanding Basic Data Types in R

To make the best of the R language, you’ll need an understanding of the basic data types and data structures and how to operate on these. These are the objects you will manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most common sources of frustration for beginners.

Everything in R is an object.

R has 6 atomic vector types: character, numeric (double), integer, logical, complex and raw. We will not discuss the raw type in this workshop.

By atomic, we mean the vector only holds data of a single type.
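One consequence of this rule is that R silently converts values to a common type when you combine different types in one vector. The short sketch below illustrates this; the vectors mixed and nums are made up for the example and are not part of the course dataset.

```r
# Combining values of different types forces them all into one type
mixed <- c(1, "a", TRUE)
class(mixed)   # "character": the number and the logical became strings

# With no strings present, integer and logical values become doubles
nums <- c(1L, 2.5, TRUE)
class(nums)    # "numeric"
nums           # 1.0 2.5 1.0
```

The conversion follows a fixed order (logical, then integer, then double, then character), so the result is always the most general type present.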

R provides many functions to examine features of vectors and other objects, for example class(), typeof(), length() and attributes().
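As a rough illustration, the sketch below applies a few of these inspection functions to a small made-up data frame. The object toy and its values are hypothetical; the column names only mimic the course dataset.

```r
# A small hypothetical data frame with three columns of different types
toy <- data.frame(
  PatientID = 1:3,
  Sex       = factor(c("M", "M", "F")),
  Score     = c(5.5, 4.0, 9.2)
)

class(toy$PatientID)   # "integer"
typeof(toy$Score)      # "double"
length(toy$Sex)        # 3
attributes(toy$Sex)    # the levels and class that make this a factor

# Apply class() to every column at once
col_classes <- sapply(toy, class)
col_classes
```

Checking the class of every column like this is a quick way to spot variables that were read in as the wrong type.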

The dataset we have just loaded contains not only numeric variables but also categorical variables such as sex, case-control status and exposure status. Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.

Factors are stored as integers, with a label associated with each unique integer. While factors look (and often behave) like character vectors, they are actually integers under the hood, so you need to be careful when treating them like strings.
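The sketch below shows what "integers under the hood" means in practice, and one common pitfall when a factor's labels look like numbers. The vectors sex and scores are made up for this example.

```r
# A small hypothetical factor
sex <- factor(c("M", "M", "F", "M", "F"))
as.integer(sex)     # 2 2 1 2 1 : the internal integer codes
as.character(sex)   # "M" "M" "F" "M" "F" : the labels

# Pitfall: labels that look like numbers
scores <- factor(c("10", "5", "20"))
as.integer(scores)                 # 1 3 2 : codes, not the values
as.numeric(as.character(scores))   # 10 5 20 : convert via character first
```

The codes for scores are 1 3 2 because the levels are sorted as text ("10", "20", "5"), which is why going through as.character() is the safer route.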

Once created, a factor can only contain a pre-defined set of values, known as levels. You can use the function levels() to see what those values are. By default, R sorts levels in alphabetical order. For instance, for a factor with 2 levels:

levels(dat$Exposure)
## [1] "A" "B"
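If alphabetical order is not what you want, for example when the reference group should come first, the order can be set explicitly. This is a minimal sketch using a made-up vector, status, rather than the course data.

```r
# A small hypothetical factor; levels default to alphabetical order
status <- factor(c("Case", "Control", "Case", "Control"))
levels(status)      # "Case" "Control"

# Set the level order explicitly with the levels argument
status2 <- factor(status, levels = c("Control", "Case"))
levels(status2)     # "Control" "Case"

# relevel() moves a single level to the front
status3 <- relevel(status, ref = "Control")
levels(status3)     # "Control" "Case"
```

Only the ordering of the levels changes; the values stored in the vector stay the same.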

Categorical variables are typically summarized by generating a frequency table. Here we will create a few tables recapping the different ways we can select a column.

table(dat[,5]) ## take 5th column
## 
##    Case Control 
##      16      14
table(dat[,"CaseControl"]) ## take column titled CaseControl
## 
##    Case Control 
##      16      14
table(dat$CaseControl) ## take column titled CaseControl
## 
##    Case Control 
##      16      14
table(dat$Sex)
## 
##  F  M 
## 15 15
table(dat$Exposure)
## 
##  A  B 
## 15 15
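Counts are often easier to interpret as proportions. The sketch below uses prop.table() for this; the vector cc is rebuilt from the case-control counts printed above (16 and 14) so that the example runs without the data file, and it is not the actual column.

```r
# Rebuild a vector with the same counts as the CaseControl table above
cc <- factor(rep(c("Case", "Control"), times = c(16, 14)))

tab <- table(cc)       # frequency table
props <- prop.table(tab)   # proportions that sum to 1
props

# Express as percentages, rounded to one decimal place
pct <- round(100 * props, 1)
pct                    # Case 53.3, Control 46.7
```

On the course data the equivalent call would be prop.table(table(dat$CaseControl)).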

The table() function can be extended to 2 or 3 dimensions as demonstrated here.

table(dat$Exposure, dat$CaseControl)
##    
##     Case Control
##   A    8       7
##   B    8       7
table(dat$Exposure, dat$CaseControl, dat$Collection)
## , ,  = Bristol
## 
##    
##     Case Control
##   A    4       2
##   B    3       3
## 
## , ,  = Exeter
## 
##    
##     Case Control
##   A    2       3
##   B    2       2
## 
## , ,  = Southampton
## 
##    
##     Case Control
##   A    2       2
##   B    3       2
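In the output above the dimensions have no names, which is why the headers read ", ,  = Bristol". One way to label them is the dnn argument of table(), and addmargins() can append row and column totals. The sketch below rebuilds vectors matching the 2 x 2 counts shown earlier; exposure and status are stand-ins for the example, not the real columns.

```r
# Rebuild vectors matching the Exposure by CaseControl counts above
exposure <- rep(c("A", "B"), each = 15)
status   <- rep(rep(c("Case", "Control"), times = c(8, 7)), times = 2)

# dnn supplies a name for each dimension of the table
tab <- table(exposure, status, dnn = c("Exposure", "CaseControl"))
tab

# Add row and column totals
tab_totals <- addmargins(tab)
tab_totals
```

The same dnn argument works for three-way tables, where it takes one name per dimension.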
