Acknowledgements

This workshop is modeled after the CSCAR Mixed Models with R workshop conducted by Michael Clark on February 7 (https://m-clark.github.io/mixed-models-with-R)

The data and much of the content from the workshop comes from http://http://environmentalcomputing.net/mixed-models/

Data

We want to estimate the effect of water pollution on the abundance of subtidal marine invertebrates.

We collect data from 7 estuaries, 4 pristine and 3 modified.

We collect data from up to 4 sites at each estuary.

The file Estuaries.csv contains the results from this data collection.

  estuary = read.csv("Estuaries.csv",header=TRUE)
  estuary$Site=factor(estuary$Site)    #Specifies Site as categorical
  head(estuary)
##   Observation Modification Estuary Site Hydroid Total Schizoporella.errata
## 1           1     Modified     JAK    1       0    44                   15
## 2           2     Modified     JAK    1       0    42                    8
## 3           3     Modified     JAK    2       0    32                    9
## 4           4     Modified     JAK    2       0    44                   14
## 5           5     Modified     JAK    3       1    42                    6
## 6           6     Modified     JAK    3       1    48                   12

Initial investigation

Initial investigation of the data suggests that there is a difference in invertebrate abundance between modified and pristine sites.

  boxplot(Total~Modification, data=estuary)

We can more formally determine if there is a difference by using a T-test, or equivalently by fitting a linear regression model with Modification as the predictor variable.

  model1 = lm(Total~Modification, data=estuary)
  summary(model1)
## 
## Call:
## lm(formula = Total ~ Modification, data = estuary)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24.50  -7.25   2.50   7.50  24.50 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            41.500      2.400  17.289  < 2e-16 ***
## ModificationPristine  -15.000      3.118  -4.811 1.33e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.26 on 52 degrees of freedom
## Multiple R-squared:  0.308,  Adjusted R-squared:  0.2947 
## F-statistic: 23.14 on 1 and 52 DF,  p-value: 1.332e-05

This confirms our suspicions based on the box plot, but are we leaving something out?

  library(ggplot2)
  ggplot(data=estuary, aes(x=Estuary,y=Total))+geom_point()

Incorporating Estuary

How do we account for the effect of Estuary? One way is to add Estuary to our regression model. Is this a satisfactory solution?

  model2 = lm(Total~Modification+Estuary, data=estuary)
  summary(model2)
## 
## Call:
## lm(formula = Total ~ Modification + Estuary, data = estuary)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.625  -6.292   1.500   5.250  18.875 
## 
## Coefficients: (1 not defined because of singularities)
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            47.500      3.280  14.483  < 2e-16 ***
## ModificationPristine  -33.125      4.638  -7.142 4.99e-09 ***
## EstuaryCLY             14.500      4.638   3.126  0.00304 ** 
## EstuaryHAK             12.750      4.638   2.749  0.00846 ** 
## EstuaryJAK             -6.125      4.638  -1.321  0.19306    
## EstuaryJER             21.250      4.638   4.581 3.41e-05 ***
## EstuaryKEM            -13.833      5.010  -2.761  0.00819 ** 
## EstuaryWAG                 NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.277 on 47 degrees of freedom
## Multiple R-squared:  0.5754, Adjusted R-squared:  0.5211 
## F-statistic: 10.61 on 6 and 47 DF,  p-value: 1.992e-07

We want to account for the effect of Estuary but without estimating the effect of each Estuary.

Estuary is a nuisance parameter.

This only gives us information about the estuaries in the sample.

Incorporating Estuary using a mixed model

We can also incorporate Estuary by representing Estuary as a random effect. We can model the total \(T_i\) for estuary \(i\) as:

\(T_i = a_0 + a_1*I(i\ pristine) + \sigma_i + error\), where

  1. \(a_0\) represents the mean total for an “average” modified estuary. This is a fixed effect (constant). The model explicitly estimates fixed effects.

  2. The expression \(I(i\ pristine)\) has value \(1\) if estuary \(i\) is pristine and \(0\) otherwise. This is sometimes called an indicator variable.

  3. \(a_1\) represents the difference in mean totals for an average pristine estuary and an average modified estuary. This is also a fixed effect.

  4. \(\sigma_i\) represents the difference in mean totals between estuary \(i\) and an average estuary with the same level of modification. The model assumes that the random effects \(\sigma_i\) are normally distributed with mean \(0\) and variance \(\sigma\). The model only explicitly estimates \(\sigma\). The random variable \(\sigma_i\) is sometimes called a variance parameter.

  5. \(error\) represents additional variation that is not explained by \(\sigma_i\) The model assumes that \(error\) is normally distributed with mean \(0\) and standard deviation \(e\). This is also a variance parameter.

  6. The model assumes that \(\sigma_i\) and \(error\) are independent.

The model above is called a linear mixed effects model (linear mixed model) because it incorporates both fixed and random effects.

Implementing and interpreting the mixed model in R

The R package lme4 fits a wide variety of linear mixed effects models and the package lmerTest computes p-values for the coefficients in the model. Let us now model the effect of modification using lme4.

  library(lme4)
## Loading required package: Matrix
  library(lmerTest)
## 
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
## 
##     lmer
## The following object is masked from 'package:stats':
## 
##     step
  model3 = lmer(Total ~ 1 + Modification +  (1|Estuary), data=estuary)

The expression \(1\) in the R formula corresponds to \(a_0\), the variable \(Modification\) to \(a_1\), and the expression \((1|Estuary)\) to \(\sigma_i\) in the formula above. The value in \((1|Estuary)\) indicates that the random effect is a random intercept because the intercept is allowed to vary from estuary to estuary.

  summary(model3)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Total ~ 1 + Modification + (1 | Estuary)
##    Data: estuary
## 
## REML criterion at convergence: 394.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.3859 -0.7142  0.2766  0.5240  2.0456 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Estuary  (Intercept) 55.12    7.424   
##  Residual             86.07    9.277   
## Number of obs: 54, groups:  Estuary, 7
## 
## Fixed effects:
##                      Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)            40.973      4.727   5.090   8.668 0.000309 ***
## ModificationPristine  -14.473      6.230   5.018  -2.323 0.067611 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## MdfctnPrstn -0.759

The estimated difference between pristine and modified estuaries as shrunk slightly, to \(-14.5\) from the original estimate of \(-15\).

The estimated difference is not statistically significant at the \(0.05\) level, unlike with the first model. The first model did not account for the influence of the estuary, however.

The model assumes that there are two sources of variance, the estuary and residual (unexplained) variance. The total variance is \(7.4+9.3=16.7\), and \(7.4/16.7=44\%\) of the total variance is explained by Estuary.

The code below gives \(95\%\) confidence intervals for the parameters in the model.

  confint(model3,level=0.95)    #0.95 is the default
## Computing profile confidence intervals ...
##                           2.5 %    97.5 %
## .sig01                 2.718166 12.538348
## .sigma                 7.676352 11.522837
## (Intercept)           31.918235 49.981321
## ModificationPristine -26.360731 -2.538241

The .sig01 term refers to the standard deviation of the random intercept for Estuary and the .sigma refers to the residual standard deviation. Note that the confidence intervals are asymmetric about the estimates, with the right side of the interval being longer than the left side. This is typical for variance parameters.

The random intercept for each estuary

We can also get estimated values for the random effect for each estuary, which are sometimes called EBLUPS (estimated best linear unbiased predictors). If we add the fixed effects to the random effect for each estuary, then we get an estimate for the mean abundance for each estuary. The red dots in the graph below give these estimates. The model assumes that the unexplained variance is the same for each estuary.

  ranef(model3)
## $Estuary
##     (Intercept)
## BOT   5.4611641
## CLY   1.9871555
## HAK   0.5229357
## JAK   0.3363947
## JER   7.6348605
## KEM  -5.7975588
## WAG -10.1449516
  ggplot(data=estuary, aes(x=Estuary,y=Total))+geom_point()+ geom_point(aes(y=predict(model3),color="red",size=2))+scale_color_hue(c=200)+theme(legend.position="none")

A random slope?

We could in principle also allow the the effect of Modification (slope) to differ from estuary to estuary. R code to create such models is below. Note that we cannot fit this type of model to the data we have, however. Why not?

  #Assumes that the random intercept and slope are correlated
  model4 = lmer(Total ~ 1 + Modification +  (1+Modification|Estuary), data=estuary)
  #Assumes that the random intercept and slope are independent
  model5 = lmer(Total ~ 1 + Modification +  (1|Estuary) + (Modification|Estuary), data=estuary)

Nested random effects

The analysis so far has ignored the effect of site.

  ggplot(data=estuary, aes(x=Estuary,y=Total,color=Site,shape=Site))+geom_point()+scale_colour_manual(values=c("red","blue","green","black"))+scale_shape_manual(values=c(3,16,17,18))

Do you notice clustering by Site? Is there anything strange about how the data codes Site?

We want to consider adding an additional random effect for site.

Two random effects are nested if they are in a hierarchy, with each value for the child random effect having a unique value for the parent random effect.

Two random effects are completely crossed if there are observations having every combination of values of the two random effects. The random effects are in some sense independent.

Two random effects are (partially) crossed if they are neither nested nor completely crossed.

Are Estuary and Site nested, crossed, or partially crossed?

R code for nested, crossed, and partially crossed random effects

  #Nested random effects
  model6 = lmer(Total ~ 1 + Modification + (1|Estuary/Site), data=estuary)  #(1|Estuary/Site) is shorthand for (1|Estuary) + (1|Estuary:Site), : means interaction (combination)
  summary(model6)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Total ~ 1 + Modification + (1 | Estuary/Site)
##    Data: estuary
## 
## REML criterion at convergence: 386.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8686 -0.6687  0.1504  0.6505  1.9816 
## 
## Random effects:
##  Groups       Name        Variance Std.Dev.
##  Site:Estuary (Intercept) 49.85    7.061   
##  Estuary      (Intercept) 47.59    6.899   
##  Residual                 43.65    6.607   
## Number of obs: 54, groups:  Site:Estuary, 27; Estuary, 7
## 
## Fixed effects:
##                      Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)            41.053      4.739   5.143   8.662 0.000294 ***
## ModificationPristine  -14.553      6.232   5.028  -2.335 0.066496 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## MdfctnPrstn -0.760
  #Completely crossed random effects
  model7 = lmer(Total ~ 1 + Modification + (1|Estuary) + (1|Site), data=estuary)
  summary(model7)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Total ~ 1 + Modification + (1 | Estuary) + (1 | Site)
##    Data: estuary
## 
## REML criterion at convergence: 393.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.3312 -0.7631  0.1040  0.5766  1.8202 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Estuary  (Intercept) 53.613   7.322   
##  Site     (Intercept)  7.686   2.772   
##  Residual             80.045   8.947   
## Number of obs: 54, groups:  Estuary, 7; Site, 4
## 
## Fixed effects:
##                      Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)            41.257      4.849   5.869   8.508 0.000162 ***
## ModificationPristine  -14.757      6.124   5.021  -2.410 0.060664 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## MdfctnPrstn -0.727
  #If the coding reflects the nesting, then you can specify the model either way
  estuary$Estuary.Site = estuary$Estuary:estuary$Site
  model8 = lmer(Total ~ 1 + Modification + (1|Estuary) + (1|Estuary.Site), data=estuary)
  summary(model8)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Total ~ 1 + Modification + (1 | Estuary) + (1 | Estuary.Site)
##    Data: estuary
## 
## REML criterion at convergence: 386.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8686 -0.6687  0.1504  0.6505  1.9816 
## 
## Random effects:
##  Groups       Name        Variance Std.Dev.
##  Estuary.Site (Intercept) 49.85    7.061   
##  Estuary      (Intercept) 47.59    6.899   
##  Residual                 43.65    6.607   
## Number of obs: 54, groups:  Estuary.Site, 27; Estuary, 7
## 
## Fixed effects:
##                      Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)            41.053      4.739   5.143   8.662 0.000294 ***
## ModificationPristine  -14.553      6.232   5.028  -2.335 0.066496 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## MdfctnPrstn -0.760
  #Partially crossed random effects
  model9 = lmer(Total ~ 1 + Modification + (1|Estuary:Site), data=estuary)
  summary(model9)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Total ~ 1 + Modification + (1 | Estuary:Site)
##    Data: estuary
## 
## REML criterion at convergence: 390.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.4697 -0.6069  0.1077  0.4924  2.0769 
## 
## Random effects:
##  Groups       Name        Variance Std.Dev.
##  Estuary:Site (Intercept) 86.44    9.297   
##  Residual                 43.65    6.607   
## Number of obs: 54, groups:  Estuary:Site, 27
## 
## Fixed effects:
##                      Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)            41.500      3.137  25.000  13.228 8.65e-13 ***
## ModificationPristine  -15.000      4.075  25.000  -3.681  0.00112 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## MdfctnPrstn -0.770

Is the Site random effect necessary?

Two models are nested if the fixed + random effects of one model are a subset of the fixed + random effects of the other model

Model3 and model8 are nested models.

We can use the likelihood ratio test for nested models to compare the models and determine if the random effect for Site improves the model.

When we run the likelihood ratio test, we need to fit the models using maximum likelihood instead of restricted maximum likelihood (REML=FALSE).

lmer fits the models using restricted maximum likelihood by default instead of maximum likelihood because maximum likelihood results in biased estimates for random effect variances.

  #Refit models using maximum likelihood
  model3 = lmer(Total ~ 1 + Modification +  (1|Estuary), REML=FALSE,data=estuary)
  model8 = lmer(Total ~ 1 + Modification + (1|Estuary) + (1|Estuary.Site), REML=FALSE, data=estuary)
  #Likelihood ratio test
  anova(model3,model8)
## Data: estuary
## Models:
## model3: Total ~ 1 + Modification + (1 | Estuary)
## model8: Total ~ 1 + Modification + (1 | Estuary) + (1 | Estuary.Site)
##        Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)   
## model3  4 411.92 419.87 -201.96   403.92                            
## model8  5 405.78 415.73 -197.89   395.78 8.1322      1   0.004348 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It is clear in this case that we should include a random effect for Site in the model \((p = 0.0043)\).

The p-value from this test is conservative (fails to reject the null hypothesis when it should be rejected), however, because it is testing \(\sigma = 0\) vs \(\sigma > 0\) instead of \(\sigma = 0\) vs \(\sigma \not = 0\).

One rule of thumb would be to take the p-value from this test and divide by \(2\).

Is the Modification fixed effect necessary?

We can also use the likelihood ratio test to test fixed effects.

We do not need to divide the p-value by \(2\) here.

  #Refit models using maximum likelihood
  model3 = lmer(Total ~ 1 + Modification +  (1|Estuary), REML=FALSE,data=estuary)
  model3b = lmer(Total ~ 1 + (1|Estuary), REML=FALSE, data=estuary)
  #Likelihood ratio test
  anova(model3,model3b)
## Data: estuary
## Models:
## model3b: Total ~ 1 + (1 | Estuary)
## model3: Total ~ 1 + Modification + (1 | Estuary)
##         Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## model3b  3 415.02 420.99 -204.51   409.02                           
## model3   4 411.92 419.87 -201.96   403.92 5.1055      1    0.02385 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1