Categorical data

We will start by loading a new dataset for this section and then recap some of the functions from earlier to look at the structure of this dataset.

We have a data.frame with data on 30 participants (one per row), including demographic information, case-control status, exposure status and scores from 7 cognitive tests.

Understanding Basic Data Types in R

To make the best use of the R language, you’ll need an understanding of the basic data types and data structures and how to operate on them. These are the objects you will manipulate on a day-to-day basis in R. Converting objects from one type to another is one of the most common sources of frustration for beginners.

Everything in R is an object.

R has 6 atomic vector types, although we will not discuss the raw type in this workshop. The five we cover are:

character
numeric (real or decimal)
integer
logical
complex

By atomic, we mean the vector only holds data of a single type.

character: "a", "swc"
numeric: 2, 15.5
integer: 2L (the L tells R to store this as an integer)
logical: TRUE, FALSE
complex: 1+4i (complex numbers with real and imaginary parts)
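Because an atomic vector holds a single type, R quietly converts mixed input to the most flexible type present. The sketch below illustrates this with two made-up vectors (the names mixed_chr and mixed_num are only for illustration):

```r
# Mixing types in c() coerces every element to one common type.
mixed_chr <- c(1, "a", TRUE)   # a string is present, so all become character
mixed_num <- c(1.5, TRUE, 2L)  # no strings, so logical and integer become numeric

class(mixed_chr)   # "character"
class(mixed_num)   # "numeric"
mixed_num          # TRUE has been stored as 1
```

This silent coercion is worth checking for whenever a column you expected to be numeric turns out to be character.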

R provides many functions to examine features of vectors and other objects, for example

class() - what kind of object is it (high-level)?
typeof() - what is the object’s data type (low-level)?
length() - how long is it? What about two-dimensional objects?
attributes() - does it have any metadata?
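As a small illustration of these functions, here they are applied to a toy named vector and a toy data.frame. Both objects (scores and toy) are invented for this example and are not part of the workshop dataset.

```r
# A small named vector; the names are attached only so that
# attributes() has some metadata to report.
scores <- c(alice = 12.5, bob = 9, carol = 15)

class(scores)       # "numeric"
typeof(scores)      # "double"
length(scores)      # 3
attributes(scores)  # a list containing the names

# For a data.frame, length() counts the columns;
# dim(), nrow() and ncol() describe both dimensions.
toy <- data.frame(id = 1:4, score = c(10, 12, 9, 14))
length(toy)   # 2
dim(toy)      # 4 2
```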

In the dataset we have just loaded we have not only numeric but also categorical variables such as gender, case control status or exposure status. Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.

Factors are stored as integers, with a label associated with each unique integer. While factors look (and often behave) like character vectors, they are actually integers under the hood, so you need to be careful when treating them like strings.
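A common place where this bites is a factor whose labels look like numbers. The sketch below uses a hypothetical factor called dose to show the difference between the underlying integer codes and the labels:

```r
# Hypothetical factor whose labels happen to look like numbers.
dose <- factor(c("10", "20", "10", "50"))

levels(dose)                     # "10" "20" "50"
as.integer(dose)                 # 1 2 1 3: the integer codes, not the doses
as.numeric(as.character(dose))   # 10 20 10 50: values recovered via the labels
```

Converting to character first and then to numeric is the safe route when you need the original numbers back.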

Once created, factors can only contain a pre-defined set of values, known as levels. You can use the function levels() to find out what the available options are. By default, R sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:

levels(dat$Exposure)

## [1] "A" "B"
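Alphabetical order is only the default. If the categories have a natural order, you can pass it through the levels argument of factor(). The example below uses a made-up severity vector, not a column from the workshop dataset:

```r
# Made-up categorical values with a natural low < medium < high order.
severity <- c("low", "high", "medium", "low")

default_order <- factor(severity)
levels(default_order)   # alphabetical: "high" "low" "medium"

chosen_order <- factor(severity, levels = c("low", "medium", "high"))
levels(chosen_order)    # "low" "medium" "high"
table(chosen_order)     # counts are reported in the chosen order
```

The level order controls how categories are arranged in tables and plots, so it is often worth setting explicitly.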

Categorical variables are typically summarized by generating a frequency table. Here we will create a few tables recapping the different ways we can select a column.

table(dat[,5]) ## take 5th column

## 
##    Case Control 
##      16      14

table(dat[,"CaseControl"]) ## take column titled CaseControl

## 
##    Case Control 
##      16      14

table(dat$CaseControl) ## take column titled CaseControl

## 
##    Case Control 
##      16      14
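To confirm that the three ways of selecting a column return the same thing, you can compare them with identical(). This is a minimal sketch on an invented data.frame called toy, assuming CaseControl is its second column:

```r
# Invented data.frame; CaseControl is deliberately the 2nd column.
toy <- data.frame(
  ID = 1:4,
  CaseControl = c("Case", "Control", "Case", "Case")
)

by_position <- toy[, 2]              # select by column number
by_name     <- toy[, "CaseControl"]  # select by column name
by_dollar   <- toy$CaseControl       # select with the $ operator

identical(by_position, by_name)   # TRUE
identical(by_name, by_dollar)     # TRUE
```

Selecting by name is generally more robust than selecting by position, because it still works if the column order changes.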

table(dat$Sex)

## 
##  F  M 
## 15 15

table(dat$Exposure)

## 
##  A  B 
## 15 15

The table() function can be extended to 2 or 3 dimensions as demonstrated here.

table(dat$Exposure, dat$CaseControl)

##    
##     Case Control
##   A    8       7
##   B    8       7

table(dat$Exposure, dat$CaseControl, dat$Collection)

## , ,  = Bristol
## 
##    
##     Case Control
##   A    4       2
##   B    3       3
## 
## , ,  = Exeter
## 
##    
##     Case Control
##   A    2       3
##   B    2       2
## 
## , ,  = Southampton
## 
##    
##     Case Control
##   A    2       2
##   B    3       2
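In the output above the dimensions are unlabelled (note the blank before "= Bristol"). One way to label them is to pass named arguments to table(). The sketch below uses a small invented data.frame called toy rather than the workshop dataset, so the counts differ from those shown above:

```r
# Invented data; not the workshop dataset.
toy <- data.frame(
  Exposure    = c("A", "A", "B", "B", "A", "B"),
  CaseControl = c("Case", "Control", "Case", "Case", "Case", "Control")
)

# Argument names become the dimension names of the table.
tab <- table(Exposure = toy$Exposure, CaseControl = toy$CaseControl)
tab
names(dimnames(tab))   # "Exposure" "CaseControl"
```

The same approach works for a third dimension, where each slice is then printed with its variable name.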