
Correlations

We may be interested in quantifying the relationship between two continuous variables by calculating a correlation statistic. This can be done using the cor() function. We will use the method = "pearson" argument to specify that we want Pearson’s correlation coefficient (i.e. parametric). If there were any missing values or NAs in either of the variables the cor() function would return NA; to avoid this, include the argument use = "pairwise.complete.obs".

cor(dat$Age, dat$Test1, method = "pearson", use = "pairwise.complete.obs")
## [1] 0.3380636

To calculate a rank-based correlation statistic instead, change method to "spearman":

cor(dat$Age, dat$Test1, method = "spearman", use = "pairwise.complete.obs")
## [1] 0.3167596

If we are interested in the p-value for the correlation statistic we need to use the cor.test() function instead. P-values can be extracted in the same way as for the T-test or Mann-Whitney test.

cor.test(dat$Age, dat$Test1, method = "pearson", use = "pairwise.complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  dat$Age and dat$Test1
## t = 1.9008, df = 28, p-value = 0.06767
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02528526  0.62251433
## sample estimates:
##       cor 
## 0.3380636
cor.test(dat$Age, dat$Test1, method = "pearson", use = "pairwise.complete.obs")$p.value
## [1] 0.06767106

We may also want to visualize this relationship with a scatterplot. Again we use the plot() function, but this time we provide one vector for the x-axis co-ordinates and one for the y-axis co-ordinates. If two numeric vectors are provided to plot() it takes the first as the x co-ordinates and the second as the y co-ordinates. If only one numeric vector is provided it takes these as the y co-ordinates and the x co-ordinates are taken as the index, i.e. 1, 2, 3, …
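As a quick aside, a minimal sketch of the single-vector behaviour:

plot(dat$Test1) # values on the y axis, their index (1, 2, 3, ...) on the x axis

For our scatterplot we supply both vectors: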

plot(dat$Age, dat$Test1, xlab = "Age", ylab = "Score from test 1", col = c("red", "blue", "forestgreen")[dat$Collection], pch = c(1,2)[dat$CaseControl])
legend("topleft", c(levels(dat$Collection)), col = c("red", "blue", "forestgreen"), pch = 1)
legend("bottomright", c(levels(dat$CaseControl)), pch = c(1,2))

The plot() command here included arguments to specify the colour of the points, col = c("red", "blue", "forestgreen")[dat$Collection], and the style of the points, pch = c(1,2)[dat$CaseControl]. Both of these parameters can take either a single value, which keeps the formatting the same for all points, or a vector of options which is used sequentially for each point. If vectors are provided for these arguments they do not have to be the same length as the number of data points; instead the vector is recycled to match the length of the x and y vectors. In the example above we are taking advantage of the underlying integer representation of a factor to select the appropriate colour and point type. There are 3 levels in the factor dat$Collection (Bristol, Exeter, Southampton), which are represented by 1, 2, 3 and can be used to index another vector. For example, Bristol has the value 1, so wherever an entry of dat$Collection equals Bristol the first element of c("red", "blue", "forestgreen") is taken and all Bristol entries are set to red; Exeter is 2 so all its occurrences take the colour blue, and Southampton is replaced with forestgreen.
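A quick sketch of what this indexing is doing (run these lines to see the intermediate values):

as.integer(dat$Collection) # the underlying integer codes of the factor
c("red", "blue", "forestgreen")[dat$Collection] # one colour per observation, selected by those codes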

We have also added legends to inform the user what each point represents. The legend() function requires a position (which can be provided as text or as x, y co-ordinates), the content/labels, and formatting arguments if required.

If we provide a matrix or data.frame of variables to the cor() function it will calculate a correlation matrix.

cor(dat[,7:13])
##             Test1      Test2        Test3        Test4     Test5
## Test1  1.00000000  0.2971236 -0.048156416  0.307011252 0.2678782
## Test2  0.29712357  1.0000000 -0.234920868  0.123075386 0.3378438
## Test3 -0.04815642 -0.2349209  1.000000000  0.006522879 0.1286353
## Test4  0.30701125  0.1230754  0.006522879  1.000000000 0.1503753
## Test5  0.26787820  0.3378438  0.128635336  0.150375337 1.0000000
## Test6  0.27037404  0.1072052  0.201067829  0.030150901 0.1209450
## Test7 -0.07476427  0.2737324 -0.113964736 -0.261089924 0.4082685
##            Test6       Test7
## Test1 0.27037404 -0.07476427
## Test2 0.10720524  0.27373243
## Test3 0.20106783 -0.11396474
## Test4 0.03015090 -0.26108992
## Test5 0.12094499  0.40826851
## Test6 1.00000000  0.06426426
## Test7 0.06426426  1.00000000

One way to visualize this data is to use a heatmap.

corMat<-cor(dat[,7:13])
heatmap(corMat)

Linear regression

If we want a more complex analysis where we can include confounders or covariates, we want to use a regression framework. To start with we will focus on linear regression.

Let’s start by fitting a model with Test1 as the dependent variable and case control status, age and sex as the independent variables. Here we will use the formula style, which we used in the T-test and Mann-Whitney test function calls, to specify the model we want to fit. There are (at least) two ways to code the same model:

lm(dat$Test1 ~ dat$CaseControl + dat$Age + dat$Sex)
## 
## Call:
## lm(formula = dat$Test1 ~ dat$CaseControl + dat$Age + dat$Sex)
## 
## Coefficients:
##            (Intercept)  dat$CaseControlControl                 dat$Age  
##                -2.0179                  0.9188                  0.2465  
##               dat$SexM  
##                 1.2344
lm(Test1 ~ CaseControl + Age + Sex, data = dat)
## 
## Call:
## lm(formula = Test1 ~ CaseControl + Age + Sex, data = dat)
## 
## Coefficients:
##        (Intercept)  CaseControlControl                 Age  
##            -2.0179              0.9188              0.2465  
##               SexM  
##             1.2344

These two commands are equivalent; the second form can be used if all your variables are in the same data.frame. Note we did not need to specify an intercept as one is added automatically. We can tell R not to include an intercept by adding - 1 or 0 + to the formula.

lm(Test1 ~ CaseControl + Age + Sex - 1, data = dat)
## 
## Call:
## lm(formula = Test1 ~ CaseControl + Age + Sex - 1, data = dat)
## 
## Coefficients:
##    CaseControlCase  CaseControlControl                 Age  
##            -2.0179             -1.0991              0.2465  
##               SexM  
##             1.2344
lm(Test1 ~ 0 + CaseControl + Age + Sex, data = dat)
## 
## Call:
## lm(formula = Test1 ~ 0 + CaseControl + Age + Sex, data = dat)
## 
## Coefficients:
##    CaseControlCase  CaseControlControl                 Age  
##            -2.0179             -1.0991              0.2465  
##               SexM  
##             1.2344

If we execute just the lm() function it only prints a subset of the possible output:
* the formula we called
* the estimates of the coefficients.

In order to get additional information such as p-values or confidence intervals we can use the summary() function to extract the additional statistics. We could chain these commands together, i.e. summary(lm(...)), or we could save the output of lm() in a variable.

model<-lm(Test1 ~ CaseControl + Age + Sex, data = dat)
summary(model)
## 
## Call:
## lm(formula = Test1 ~ CaseControl + Age + Sex, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5577 -1.6233 -0.3051  1.8090  5.0990 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)         -2.0179     3.1643  -0.638   0.5292  
## CaseControlControl   0.9188     1.0106   0.909   0.3717  
## Age                  0.2465     0.1213   2.033   0.0524 .
## SexM                 1.2344     1.0167   1.214   0.2356  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.648 on 26 degrees of freedom
## Multiple R-squared:  0.2103, Adjusted R-squared:  0.1192 
## F-statistic: 2.308 on 3 and 26 DF,  p-value: 0.09994

We can see that the summary() function expands on the information we got from just running lm(). From top to bottom, we get the model formula, a summary of the residuals, a table of coefficients and some model fit statistics. This is still only a subset of the information we may be interested in. We are probably most interested in the coefficients table, which we can extract with this command:

summary(model)$coefficients
##                      Estimate Std. Error    t value   Pr(>|t|)
## (Intercept)        -2.0178512  3.1642603 -0.6377008 0.52924504
## CaseControlControl  0.9187541  1.0106254  0.9090945 0.37165074
## Age                 0.2465469  0.1212937  2.0326434 0.05241894
## SexM                1.2344187  1.0166822  1.2141637 0.23559878
class(summary(model)$coefficients)
## [1] "matrix"
mode(summary(model)$coefficients)
## [1] "numeric"

This produces a numeric matrix with the intercept in the top row, followed by one row for each variable, and 4 columns. A matrix is another data type in R; it is like a data.frame except it can only contain one data type, and here it contains exclusively numeric values. To get the p-value of a specific variable from this matrix, we can use the commands we learnt earlier. To get the p-value for the age variable we could use:

summary(model)$coefficients[3,4]
## [1] 0.05241894

Previously, we saw that you could reference columns by their column name; this is also possible for rows, i.e. we could pull out the p-value for the age variable using this command:

summary(model)$coefficients["Age",4]
## [1] 0.05241894

It is recommended that you use the variable names to extract statistics from the coefficients table, because if R cannot fit all the terms included in the model (for example because one is a duplicate of another) it automatically drops one but does not produce a warning or an error. Let’s work through an example.

First, we create a new variable that is essentially a duplicate of the sex co-variate:

gender<-dat$Sex
table(gender, dat$Sex)
##       
## gender  F  M
##      F 15  0
##      M  0 15

Now, let’s fit a linear model with our two identical variables for sex and extract the p-value for the Age co-variate which we expect to be in the 4th row:

model<-lm(Test1 ~ gender + Sex + Age + CaseControl, data = dat)
summary(model)$coefficients[4,4]
## [1] 0.3716507

Let’s double check we have got the correct statistic:

summary(model)$coefficients
##                      Estimate Std. Error    t value   Pr(>|t|)
## (Intercept)        -2.0178512  3.1642603 -0.6377008 0.52924504
## genderM             1.2344187  1.0166822  1.2141637 0.23559878
## Age                 0.2465469  0.1212937  2.0326434 0.05241894
## CaseControlControl  0.9187541  1.0106254  0.9090945 0.37165074

It appears we have not. The Age coefficients are not in the 4th row as expected and instead we have extracted the p-value for CaseControl. As R could not estimate regression coefficients for all the terms in this model, it dropped the “troublesome” one, and in the coefficients table this co-variate has been removed so all variables listed after it have moved up a row. Therefore, using the row index we pulled out the wrong row. Had we used the variable name to identify the row we wanted, we would not have had a problem.

summary(model)$coefficients["Age", 4]
## [1] 0.05241894

We can see a list of all the components we can extract from the output of lm() by running names() on the lm object. All of these can be extracted with the $.

names(model)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"
model$call ## example of how to extract any of the components listed by the previous command.
## lm(formula = Test1 ~ gender + Sex + Age + CaseControl, data = dat)

A note about including factors in linear models: R will create a series of dummy or binary variables to represent a factor in a linear model, with the first category (listed alphabetically) set as the baseline or reference. We can visualise this with the contrasts() function.

contrasts(dat$Sex)
##   M
## F 0
## M 1

In the coefficients table, the group that each dummy variable identifies is indicated by the category or level appended to the end of the variable name. In this example, the sex variable is converted to a dummy variable indicating males (SexM, or genderM above) and the coefficient therefore represents the change in males compared to females. Similarly, the CaseControl variable is converted to a dummy variable identifying control samples (CaseControlControl) and its coefficient represents the change in the controls compared to the cases.
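We can inspect the dummy coding for the CaseControl variable in the same way:

contrasts(dat$CaseControl)
levels(dat$CaseControl) # the first level is the baseline/reference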

There is a range of functions which can be applied to an lm() object; some examples are shown here:

fitted.values(model)
##        1        2        3        4        5        6        7        8 
## 4.147505 5.312806 5.378555 5.805900 3.159633 5.557668 4.392367 4.571481 
##        9       10       11       12       13       14       15       16 
## 5.133692 7.285181 3.900958 6.790403 7.352614 4.885461 6.298993 2.913086 
##       17       18       19       20       21       22       23       24 
## 5.805900 6.119880 7.285181 3.899274 4.078387 4.640599 6.545540 3.406180 
##       25       26       27       28       29       30 
## 6.297309 5.871649 4.638914 4.819712 3.159633 6.545540
residuals(model)
##          1          2          3          4          5          6 
##  0.8524951 -0.3128058 -1.3785549  3.1941005  1.8403670 -4.5576683 
##          7          8          9         10         11         12 
##  3.6076326 -1.5714808 -2.1336924  0.7148192  5.0990420  2.2095973 
##         13         14         15         16         17         18 
## -1.3526142 -0.8854611 -3.2989933 -1.9130862  1.1941005 -4.1198798 
##         19         20         21         22         23         24 
##  1.7148192  4.1007263  0.9216129 -1.6405986  1.4544598 -2.4061799 
##         25         26         27         28         29         30 
## -0.2973089  2.1283514 -0.6389143 -3.8197121 -1.1596330  2.4544598

You may have noticed that there are two ways to extract these values: with an extractor function, or directly from the lm object with the $.
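For example, both of the following return the same residuals from the fitted model:

head(residuals(model)) # extractor function
head(model$residuals) # direct access with $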

One function we might be particularly interested in is the predict() function, which we can use with new observations of each covariate in our model to calculate what the outcome would be based on the model we have estimated.
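As a minimal sketch, here we refit the simpler model from earlier so that all covariates come from dat; the values in newObs are made up purely for illustration:

fit<-lm(Test1 ~ CaseControl + Age + Sex, data = dat)
newObs<-data.frame(CaseControl = "Control", Age = 30, Sex = "F") # hypothetical new individual
predict(fit, newdata = newObs) # predicted Test1 score for this individual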

We can visually inspect some of the assumptions of linear regression by applying the plot() function to the output of an lm() call. This blog post discusses what you should be looking for in these plots: http://data.library.virginia.edu/diagnostic-plots/

plot(model)
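By default, plot() on an lm object produces four diagnostic plots one after another; one option is to set the plotting layout first so that they appear on a single page:

par(mfrow = c(2, 2)) # arrange the four diagnostic plots in a 2 x 2 grid
plot(model)
par(mfrow = c(1, 1)) # reset the layout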

Sometimes we are interested in identifying the best combination of variables to include in our linear model to predict our outcome. We can do this manually, by adding or removing terms in our model formula. However, some of you may be used to SPSS doing this automatically. There is a function in R, step(), which performs a similar automation.

model1<-lm(dat$Test1 ~ dat$Sex + dat$Age + dat$Collection + dat$Exposure + dat$CaseControl)
step(model1, direction="backward")
## Start:  AIC=67.71
## dat$Test1 ~ dat$Sex + dat$Age + dat$Collection + dat$Exposure + 
##     dat$CaseControl
## 
##                   Df Sum of Sq    RSS    AIC
## - dat$Collection   2    1.9524 181.71 64.036
## - dat$Exposure     1    0.5581 180.31 65.805
## - dat$CaseControl  1    5.2007 184.95 66.567
## - dat$Sex          1    7.3936 187.15 66.921
## <none>                         179.75 67.712
## - dat$Age          1   23.5395 203.29 69.403
## 
## Step:  AIC=64.04
## dat$Test1 ~ dat$Sex + dat$Age + dat$Exposure + dat$CaseControl
## 
##                   Df Sum of Sq    RSS    AIC
## - dat$Exposure     1    0.5563 182.26 62.127
## - dat$CaseControl  1    5.8931 187.60 62.993
## - dat$Sex          1    9.8552 191.56 63.620
## <none>                         181.71 64.036
## - dat$Age          1   27.8310 209.54 66.311
## 
## Step:  AIC=62.13
## dat$Test1 ~ dat$Sex + dat$Age + dat$CaseControl
## 
##                   Df Sum of Sq    RSS    AIC
## - dat$CaseControl  1    5.7935 188.06 61.066
## - dat$Sex          1   10.3342 192.60 61.782
## <none>                         182.26 62.127
## - dat$Age          1   28.9631 211.22 64.552
## 
## Step:  AIC=61.07
## dat$Test1 ~ dat$Sex + dat$Age
## 
##           Df Sum of Sq    RSS    AIC
## <none>                 188.06 61.066
## - dat$Sex  1    16.367 204.42 61.570
## - dat$Age  1    31.945 220.00 63.773
## 
## Call:
## lm(formula = dat$Test1 ~ dat$Sex + dat$Age)
## 
## Coefficients:
## (Intercept)     dat$SexM      dat$Age  
##     -1.9950       1.4920       0.2576

Here we have set direction = "backward", so the model starts with all the terms included and iteratively removes them. This procedure does not make decisions based on p-values but on a statistic called AIC (Akaike’s Information Criterion): at each step it removes the term whose removal gives the model with the lowest AIC, stopping when no further removal decreases the AIC (step not shown).

In our model formula we can add interactions between variables using the following coding:
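For example, a sketch using two of the variables from dat (the choice of Sex and Age is just for illustration):

lm(Test1 ~ Sex + Age + Sex:Age, data = dat) # ':' adds just the interaction term
lm(Test1 ~ Sex * Age, data = dat) # '*' is shorthand for Sex + Age + Sex:Age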

A note on non-linear models: we can fit these by creating variables to capture the non-linear behaviour. For example, if we wanted a quadratic term we could do the following:

x<-dat$Age
x2<-x^2
model<-lm(dat$Test1 ~ x2 + x)
summary(model)$coefficients
##                 Estimate  Std. Error     t value  Pr(>|t|)
## (Intercept)  3.646603522 21.07534321  0.17302701 0.8639208
## x2           0.006639383  0.03256255  0.20389631 0.8399636
## x           -0.108493275  1.67349862 -0.06483021 0.9487868
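An equivalent way to include the quadratic term, without creating separate variables, is to wrap the transformation in I() inside the formula; a sketch of the same model:

model<-lm(Test1 ~ I(Age^2) + Age, data = dat) # I() makes ^ act arithmetically inside the formula
summary(model)$coefficients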
