If we have a factor with multiple categories, we might want to use an analysis of variance (ANOVA). This can be coded in two different ways:
summary(aov(dat$Test1 ~ dat$Collection))
## Df Sum Sq Mean Sq F value Pr(>F)
## dat$Collection 2 9.24 4.622 0.563 0.576
## Residuals 27 221.56 8.206
anova(lm(dat$Test1 ~ dat$Collection))
## Analysis of Variance Table
##
## Response: dat$Test1
## Df Sum Sq Mean Sq F value Pr(>F)
## dat$Collection 2 9.244 4.6222 0.5633 0.5759
## Residuals 27 221.556 8.2058
Although these commands perform the same statistical test and reach the same answer (compare the p-values), the output looks slightly different.
Notice how in the second option we have reused the lm() function. An ANOVA only tells us whether there is a significant difference somewhere between the groups; it does not give us an effect size to interpret what that difference is. Often post-hoc tests (such as Tukey's HSD, sketched after the model output below) would be used to work out which group(s) differ from the rest. However, using the regression model framework this information is already available. By using the summary() function we can look at the p-values and coefficients of the dummy variables used in the analysis. In this example we have three categories, which R converts into two dummy variables. We can use contrasts() to examine what R is doing.
contrasts(dat$Collection)
## Exeter Southampton
## Bristol 0 0
## Exeter 1 0
## Southampton 0 1
The two columns indicate R has created two dummy variables, which take the values in the table for the category in each row. Bristol, the first level alphabetically, is the reference category: both dummy variables are zero for Bristol, so each coefficient estimates the difference from Bristol. If a different baseline is more natural, the reference level can be changed, as sketched below.
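A minimal sketch using base R's relevel(); it is not run here, so Bristol remains the reference in all of the output that follows.
dat$Collection <- relevel(dat$Collection, ref = "Exeter") ## make Exeter the reference level
When we fit our model and get the coefficients table, we can see both dummy variables are listed.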
model <- lm(dat$Test1 ~ dat$Collection)
anova(model)
## Analysis of Variance Table
##
## Response: dat$Test1
## Df Sum Sq Mean Sq F value Pr(>F)
## dat$Collection 2 9.244 4.6222 0.5633 0.5759
## Residuals 27 221.556 8.2058
summary(model)
##
## Call:
## lm(formula = dat$Test1 ~ dat$Collection)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1111 -2.5000 -0.0556 2.8889 4.3333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6667 0.8269 5.643 5.44e-06 ***
## dat$CollectionExeter 1.3333 1.2632 1.056 0.301
## dat$CollectionSouthampton 0.4444 1.2632 0.352 0.728
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.865 on 27 degrees of freedom
## Multiple R-squared: 0.04005, Adjusted R-squared: -0.03105
## F-statistic: 0.5633 on 2 and 27 DF, p-value: 0.5759
In the output of the linear model we can see the two dummy variables R has coded; their coefficients and p-values can be used to interpret a significant result.
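If a formal post-hoc test is preferred, base R's TukeyHSD() computes Tukey's honest significant differences from an aov() fit; a minimal sketch (output omitted):
posthoc <- aov(dat$Test1 ~ dat$Collection) ## TukeyHSD() needs an aov fit rather than an lm fit
TukeyHSD(posthoc) ## all pairwise differences between collection sites, with adjusted p-values and confidence intervals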
In this framework it is also straightforward to see how additional covariates can be added. If we are interested in just the effect of the categorical variable, we need to specify a null model that contains all the covariates included in the full model, and run an ANOVA comparing the full model with the null model. This tells us whether the additional variable(s) in the full model make a significant difference. Again we can use the output of lm() to identify which category (if any) is significantly different from the other categories (a one-line sketch follows the model comparison below).
model <- lm(dat$Test1 ~ dat$Collection + dat$Age) ## variable(s) of interest and covariates
null <- lm(dat$Test1 ~ dat$Age) ## just covariates
anova(model, null) ## compares full model with the null model
## Analysis of Variance Table
##
## Model 1: dat$Test1 ~ dat$Collection + dat$Age
## Model 2: dat$Test1 ~ dat$Age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 26 197.96
## 2 28 204.42 -2 -6.4588 0.4241 0.6588
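As before, the coefficients of the full model identify which category (if any) differs from the reference once age is accounted for; a one-line check (output omitted):
summary(model) ## dummy variable coefficients for dat$Collection, now adjusted for dat$Age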