This workshop is modeled after the CSCAR Mixed Models with R workshop conducted by Michael Clark on February 7 (https://m-clark.github.io/mixed-models-with-R)
The data and much of the content from the workshop comes from http://http://environmentalcomputing.net/mixed-models/
We want to estimate the effect of water pollution on the abundance of subtidal marine invertebrates.
We collect data from 7 estuaries, 4 pristine and 3 modified.
We collect data from up to 4 sites at each estuary.
The file Estuaries.csv contains the results from this data collection.
estuary = read.csv("Estuaries.csv",header=TRUE)
estuary$Site=factor(estuary$Site) #Specifies Site as categorical
head(estuary)
## Observation Modification Estuary Site Hydroid Total Schizoporella.errata
## 1 1 Modified JAK 1 0 44 15
## 2 2 Modified JAK 1 0 42 8
## 3 3 Modified JAK 2 0 32 9
## 4 4 Modified JAK 2 0 44 14
## 5 5 Modified JAK 3 1 42 6
## 6 6 Modified JAK 3 1 48 12
Initial investigation of the data suggests that there is a difference in invertebrate abundance between modified and pristine sites.
boxplot(Total~Modification, data=estuary)
We can more formally determine if there is a difference by using a T-test, or equivalently by fitting a linear regression model with Modification as the predictor variable.
model1 = lm(Total~Modification, data=estuary)
summary(model1)
##
## Call:
## lm(formula = Total ~ Modification, data = estuary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.50 -7.25 2.50 7.50 24.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.500 2.400 17.289 < 2e-16 ***
## ModificationPristine -15.000 3.118 -4.811 1.33e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.26 on 52 degrees of freedom
## Multiple R-squared: 0.308, Adjusted R-squared: 0.2947
## F-statistic: 23.14 on 1 and 52 DF, p-value: 1.332e-05
This confirms our suspicions based on the box plot, but are we leaving something out?
library(ggplot2)
ggplot(data=estuary, aes(x=Estuary,y=Total))+geom_point()
How do we account for the effect of Estuary? One way is to add Estuary to our regression model. Is this a satisfactory solution?
model2 = lm(Total~Modification+Estuary, data=estuary)
summary(model2)
##
## Call:
## lm(formula = Total ~ Modification + Estuary, data = estuary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.625 -6.292 1.500 5.250 18.875
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.500 3.280 14.483 < 2e-16 ***
## ModificationPristine -33.125 4.638 -7.142 4.99e-09 ***
## EstuaryCLY 14.500 4.638 3.126 0.00304 **
## EstuaryHAK 12.750 4.638 2.749 0.00846 **
## EstuaryJAK -6.125 4.638 -1.321 0.19306
## EstuaryJER 21.250 4.638 4.581 3.41e-05 ***
## EstuaryKEM -13.833 5.010 -2.761 0.00819 **
## EstuaryWAG NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.277 on 47 degrees of freedom
## Multiple R-squared: 0.5754, Adjusted R-squared: 0.5211
## F-statistic: 10.61 on 6 and 47 DF, p-value: 1.992e-07
We want to account for the effect of Estuary but without estimating the effect of each Estuary.
Estuary is a nuisance parameter.
This only gives us information about the estuaries in the sample.
We can also incorporate Estuary by representing Estuary as a random effect. We can model the total \(T_i\) for estuary \(i\) as:
\(T_i = a_0 + a_1*I(i\ pristine) + \sigma_i + error\), where
\(a_0\) represents the mean total for an “average” modified estuary. This is a fixed effect (constant). The model explicitly estimates fixed effects.
The expression \(I(i\ pristine)\) has value \(1\) if estuary \(i\) is pristine and \(0\) otherwise. This is sometimes called an indicator variable.
\(a_1\) represents the difference in mean totals for an average pristine estuary and an average modified estuary. This is also a fixed effect.
\(\sigma_i\) represents the difference in mean totals between estuary \(i\) and an average estuary with the same level of modification. The model assumes that the random effects \(\sigma_i\) are normally distributed with mean \(0\) and variance \(\sigma\). The model only explicitly estimates \(\sigma\). The random variable \(\sigma_i\) is sometimes called a variance parameter.
\(error\) represents additional variation that is not explained by \(\sigma_i\) The model assumes that \(error\) is normally distributed with mean \(0\) and standard deviation \(e\). This is also a variance parameter.
The model assumes that \(\sigma_i\) and \(error\) are independent.
The model above is called a linear mixed effects model (linear mixed model) because it incorporates both fixed and random effects.
The R package lme4 fits a wide variety of linear mixed effects models and the package lmerTest computes p-values for the coefficients in the model. Let us now model the effect of modification using lme4.
library(lme4)
## Loading required package: Matrix
library(lmerTest)
##
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
##
## lmer
## The following object is masked from 'package:stats':
##
## step
model3 = lmer(Total ~ 1 + Modification + (1|Estuary), data=estuary)
The expression \(1\) in the R formula corresponds to \(a_0\), the variable \(Modification\) to \(a_1\), and the expression \((1|Estuary)\) to \(\sigma_i\) in the formula above. The value in \((1|Estuary)\) indicates that the random effect is a random intercept because the intercept is allowed to vary from estuary to estuary.
summary(model3)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Total ~ 1 + Modification + (1 | Estuary)
## Data: estuary
##
## REML criterion at convergence: 394.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.3859 -0.7142 0.2766 0.5240 2.0456
##
## Random effects:
## Groups Name Variance Std.Dev.
## Estuary (Intercept) 55.12 7.424
## Residual 86.07 9.277
## Number of obs: 54, groups: Estuary, 7
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 40.973 4.727 5.090 8.668 0.000309 ***
## ModificationPristine -14.473 6.230 5.018 -2.323 0.067611 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## MdfctnPrstn -0.759
The estimated difference between pristine and modified estuaries as shrunk slightly, to \(-14.5\) from the original estimate of \(-15\).
The estimated difference is not statistically significant at the \(0.05\) level, unlike with the first model. The first model did not account for the influence of the estuary, however.
The model assumes that there are two sources of variance, the estuary and residual (unexplained) variance. The total variance is \(7.4+9.3=16.7\), and \(7.4/16.7=44\%\) of the total variance is explained by Estuary.
The code below gives \(95\%\) confidence intervals for the parameters in the model.
confint(model3,level=0.95) #0.95 is the default
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## .sig01 2.718166 12.538348
## .sigma 7.676352 11.522837
## (Intercept) 31.918235 49.981321
## ModificationPristine -26.360731 -2.538241
The .sig01 term refers to the standard deviation of the random intercept for Estuary and the .sigma refers to the residual standard deviation. Note that the confidence intervals are asymmetric about the estimates, with the right side of the interval being longer than the left side. This is typical for variance parameters.
We can also get estimated values for the random effect for each estuary, which are sometimes called EBLUPS (estimated best linear unbiased predictors). If we add the fixed effects to the random effect for each estuary, then we get an estimate for the mean abundance for each estuary. The red dots in the graph below give these estimates. The model assumes that the unexplained variance is the same for each estuary.
ranef(model3)
## $Estuary
## (Intercept)
## BOT 5.4611641
## CLY 1.9871555
## HAK 0.5229357
## JAK 0.3363947
## JER 7.6348605
## KEM -5.7975588
## WAG -10.1449516
ggplot(data=estuary, aes(x=Estuary,y=Total))+geom_point()+ geom_point(aes(y=predict(model3),color="red",size=2))+scale_color_hue(c=200)+theme(legend.position="none")
We could in principle also allow the the effect of Modification (slope) to differ from estuary to estuary. R code to create such models is below. Note that we cannot fit this type of model to the data we have, however. Why not?
#Assumes that the random intercept and slope are correlated
model4 = lmer(Total ~ 1 + Modification + (1+Modification|Estuary), data=estuary)
#Assumes that the random intercept and slope are independent
model5 = lmer(Total ~ 1 + Modification + (1|Estuary) + (Modification|Estuary), data=estuary)
The analysis so far has ignored the effect of site.
ggplot(data=estuary, aes(x=Estuary,y=Total,color=Site,shape=Site))+geom_point()+scale_colour_manual(values=c("red","blue","green","black"))+scale_shape_manual(values=c(3,16,17,18))
Do you notice clustering by Site? Is there anything strange about how the data codes Site?
We want to consider adding an additional random effect for site.
Two random effects are nested if they are in a hierarchy, with each value for the child random effect having a unique value for the parent random effect.
Two random effects are completely crossed if there are observations having every combination of values of the two random effects. The random effects are in some sense independent.
Two random effects are (partially) crossed if they are neither nested nor completely crossed.
Are Estuary and Site nested, crossed, or partially crossed?
#Nested random effects
model6 = lmer(Total ~ 1 + Modification + (1|Estuary/Site), data=estuary) #(1|Estuary/Site) is shorthand for (1|Estuary) + (1|Estuary:Site), : means interaction (combination)
summary(model6)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Total ~ 1 + Modification + (1 | Estuary/Site)
## Data: estuary
##
## REML criterion at convergence: 386.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8686 -0.6687 0.1504 0.6505 1.9816
##
## Random effects:
## Groups Name Variance Std.Dev.
## Site:Estuary (Intercept) 49.85 7.061
## Estuary (Intercept) 47.59 6.899
## Residual 43.65 6.607
## Number of obs: 54, groups: Site:Estuary, 27; Estuary, 7
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 41.053 4.739 5.143 8.662 0.000294 ***
## ModificationPristine -14.553 6.232 5.028 -2.335 0.066496 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## MdfctnPrstn -0.760
#Completely crossed random effects
model7 = lmer(Total ~ 1 + Modification + (1|Estuary) + (1|Site), data=estuary)
summary(model7)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Total ~ 1 + Modification + (1 | Estuary) + (1 | Site)
## Data: estuary
##
## REML criterion at convergence: 393.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.3312 -0.7631 0.1040 0.5766 1.8202
##
## Random effects:
## Groups Name Variance Std.Dev.
## Estuary (Intercept) 53.613 7.322
## Site (Intercept) 7.686 2.772
## Residual 80.045 8.947
## Number of obs: 54, groups: Estuary, 7; Site, 4
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 41.257 4.849 5.869 8.508 0.000162 ***
## ModificationPristine -14.757 6.124 5.021 -2.410 0.060664 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## MdfctnPrstn -0.727
#If the coding reflects the nesting, then you can specify the model either way
estuary$Estuary.Site = estuary$Estuary:estuary$Site
model8 = lmer(Total ~ 1 + Modification + (1|Estuary) + (1|Estuary.Site), data=estuary)
summary(model8)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Total ~ 1 + Modification + (1 | Estuary) + (1 | Estuary.Site)
## Data: estuary
##
## REML criterion at convergence: 386.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8686 -0.6687 0.1504 0.6505 1.9816
##
## Random effects:
## Groups Name Variance Std.Dev.
## Estuary.Site (Intercept) 49.85 7.061
## Estuary (Intercept) 47.59 6.899
## Residual 43.65 6.607
## Number of obs: 54, groups: Estuary.Site, 27; Estuary, 7
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 41.053 4.739 5.143 8.662 0.000294 ***
## ModificationPristine -14.553 6.232 5.028 -2.335 0.066496 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## MdfctnPrstn -0.760
#Partially crossed random effects
model9 = lmer(Total ~ 1 + Modification + (1|Estuary:Site), data=estuary)
summary(model9)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Total ~ 1 + Modification + (1 | Estuary:Site)
## Data: estuary
##
## REML criterion at convergence: 390.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.4697 -0.6069 0.1077 0.4924 2.0769
##
## Random effects:
## Groups Name Variance Std.Dev.
## Estuary:Site (Intercept) 86.44 9.297
## Residual 43.65 6.607
## Number of obs: 54, groups: Estuary:Site, 27
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 41.500 3.137 25.000 13.228 8.65e-13 ***
## ModificationPristine -15.000 4.075 25.000 -3.681 0.00112 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## MdfctnPrstn -0.770
Two models are nested if the fixed + random effects of one model are a subset of the fixed + random effects of the other model
Model3 and model8 are nested models.
We can use the likelihood ratio test for nested models to compare the models and determine if the random effect for Site improves the model.
When we run the likelihood ratio test, we need to fit the models using maximum likelihood instead of restricted maximum likelihood (REML=FALSE).
lmer fits the models using restricted maximum likelihood by default instead of maximum likelihood because maximum likelihood results in biased estimates for random effect variances.
#Refit models using maximum likelihood
model3 = lmer(Total ~ 1 + Modification + (1|Estuary), REML=FALSE,data=estuary)
model8 = lmer(Total ~ 1 + Modification + (1|Estuary) + (1|Estuary.Site), REML=FALSE, data=estuary)
#Likelihood ratio test
anova(model3,model8)
## Data: estuary
## Models:
## model3: Total ~ 1 + Modification + (1 | Estuary)
## model8: Total ~ 1 + Modification + (1 | Estuary) + (1 | Estuary.Site)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## model3 4 411.92 419.87 -201.96 403.92
## model8 5 405.78 415.73 -197.89 395.78 8.1322 1 0.004348 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is clear in this case that we should include a random effect for Site in the model \((p = 0.0043)\).
The p-value from this test is conservative (fails to reject the null hypothesis when it should be rejected), however, because it is testing \(\sigma = 0\) vs \(\sigma > 0\) instead of \(\sigma = 0\) vs \(\sigma \not = 0\).
One rule of thumb would be to take the p-value from this test and divide by \(2\).
We can also use the likelihood ratio test to test fixed effects.
We do not need to divide the p-value by \(2\) here.
#Refit models using maximum likelihood
model3 = lmer(Total ~ 1 + Modification + (1|Estuary), REML=FALSE,data=estuary)
model3b = lmer(Total ~ 1 + (1|Estuary), REML=FALSE, data=estuary)
#Likelihood ratio test
anova(model3,model3b)
## Data: estuary
## Models:
## model3b: Total ~ 1 + (1 | Estuary)
## model3: Total ~ 1 + Modification + (1 | Estuary)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## model3b 3 415.02 420.99 -204.51 409.02
## model3 4 411.92 419.87 -201.96 403.92 5.1055 1 0.02385 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1