The Linear Model 4: Categorical predictors, model fit and comparison

Tutorial 11

Milan Valášek (University of Sussex)
05-26-2021

In tutorial 10, we extended our model of penguin flipper length by including bill length as an additional predictor and learnt how to centre, scale, and standardise variables in the linear model to change the interpretation of the individual \(b\) coefficients. This last tutorial of the module will guide you through adding a categorical predictor and comparing models using the F-test with R’s anova() function. Before you start this tutorial, make sure you are comfortable with the concepts discussed in lecture 10.

Warm-up quiz

As usual, here’s a little quiz to get you in the mood. Consider this linear model when answering the questions below.


Call:
lm(formula = live ~ love + laugh, data = yolo)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.822 -13.691  -0.455  12.850  61.269 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   49.113      1.794  27.380  < 2e-16 ***
love          10.914      1.261   8.656 3.41e-16 ***
laugh         -5.178      2.431  -2.130    0.034 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.72 on 291 degrees of freedom
Multiple R-squared:  0.2054,    Adjusted R-squared:    0.2 
F-statistic: 37.62 on 2 and 291 DF,  p-value: 2.949e-15

QUESTION 1

How many predictors are there in this model?

2


QUESTION 2

Assuming laugh is a binary categorical variable, what is the difference in live between the two groups?
(Give answer to 2 decimal places)

-5.18 | 5.18


QUESTION 3

Which of the two groups has a higher expected value of live?

laugh = 0

QUESTION 4

Assuming love is a mean-centred continuous variable, what is the expected value of live for someone with an average level of love and 0 laugh?
(Give answer to 2 decimal places)

49.11


QUESTION 5

What model does the F-test reported in the summary compare this model to?

The null (intercept-only) model

QUESTION 6

Given standardised B coefficients of 0.46 and −0.11 for love and laugh respectively, which variable has the larger effect on the outcome?

Love

Setting Up

As usual, all you need is this tutorial and RStudio. Remember that you can easily switch between windows with the Alt + ↹ Tab (Windows) and ⌘ Command + ⇥ Tab (Mac OS) shortcuts.

Task 1

Create a new week_10 project and open it in RStudio. Then, create two new folders in the new week_10 folder: r_docs and data. Finally, open a new R Markdown file and save it in the r_docs folder. Get into the habit of creating a new code chunk in the file for each of the tasks below.

Remember, you can add new code chunks by:

  1. Using the RStudio toolbar: Click Code > Insert Chunk
  2. Using a keyboard shortcut: the default is Ctrl + Alt + I (Windows) or ⌘ Command + ⌥ Option + I (MacOS), but you can change this under Tools > Modify Keyboard Shortcuts…
  3. Typing it out: ```{r}, press Enter, then ``` again.
  4. Copying and pasting a code chunk you already have (but be careful of duplicated chunk names!)

 

Task 2

In this tutorial, just like in the previous two, we’ll continue using the penguins data set from the palmerpenguins package. You will also need the packages tidyverse, broom, and QuantPsyc (install it if you don’t have it yet; watch out for spelling and upper/lower case letters!). Put the code that loads all of these in the setup chunk of your Rmd file and run it.
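
If you’re not sure what that should look like, a minimal setup chunk could simply load the three packages by name (assuming they are already installed):

library(tidyverse)
library(broom)
library(QuantPsyc)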

Data

Task 3

Just like the last couple of times, save the penguins data set in your environment as peng_data to avoid having to type out palmerpenguins::penguins every time you want to use it.

peng_data <- palmerpenguins::penguins

Task 4

It’s always good to refresh your memory about the dataset by looking at a rough summary.

summary(peng_data)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 

Adding a categorical predictor

Let’s start by rebuilding one of the models we fitted last time.

Task 5

Build a linear model predicting flipper length by body mass in kilogrammes and bill length in millimetres, with both predictors mean-centred.

peng_data <- peng_data %>%
  mutate(
    mass_kg = body_mass_g / 1000,
    mass_kg_cntr = mass_kg - mean(mass_kg, na.rm = T), # using mean
    bill_length_cntr = scale(bill_length_mm, scale = FALSE)) # using scale

m_massKg_bill_cntr <- lm(flipper_length_mm ~ mass_kg_cntr + bill_length_cntr, peng_data)

Task 6

Create a new model m_mass_bill_sex that adds the sex variable in addition to the two predictors in the previous model.

# using update()
m_mass_bill_sex <- update(m_massKg_bill_cntr, ~ . + sex)
# or build it from scratch
m_mass_bill_sex <- lm(flipper_length_mm ~ mass_kg_cntr + bill_length_cntr + sex, peng_data)

Evaluating results

Task 7

Look at the summary of this model and answer the following questions.

QUESTION 7

Which category of the sex variable is represented by 0?

Female

QUESTION 8

What is the difference between the mean flipper length of a male and female penguin?
(Give answer in mm to 2 decimal places, such as 0.42)

-4.71 | 4.71


QUESTION 9

Which sex of penguin has, on average, longer flippers?

Female

QUESTION 10

Is the difference statistically significant at α = .001?

Yes

QUESTION 11

What is the expected flipper length of a male penguin of average size with a bill of average length?
(Give answer in mm to 2 decimal places, such as 200.15)

198.51


QUESTION 12

What percentage of the total variance in the outcome is accounted for by the combined effect of the predictors in the data?
(Give answer as % to 1 decimal place)

81.4


QUESTION 13

Does the model include any unhelpful predictors?

No

summary(m_mass_bill_sex)

Call:
lm(formula = flipper_length_mm ~ mass_kg_cntr + bill_length_cntr + 
    sex, data = peng_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.725  -3.917   0.330   4.199  14.205 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      203.22727    0.49903 407.248  < 2e-16 ***
mass_kg_cntr      14.01511    0.53535  26.179  < 2e-16 ***
bill_length_cntr   0.60591    0.07599   7.973 2.57e-14 ***
sexmale           -4.71265    0.74065  -6.363 6.64e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.067 on 329 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8143,    Adjusted R-squared:  0.8126 
F-statistic:   481 on 3 and 329 DF,  p-value: < 2.2e-16

The last row of the coefficient table says sexmale. This coefficient is the change in the outcome when the sex variable “changes” to male, after the effects of the other predictors have been taken into account. In other words, it must be the female category that serves as the baseline (sex = 0) category.
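
If you ever want to double-check which level R treats as the baseline, you can inspect the factor itself; a quick sketch (the level listed first is the reference, and contrasts() shows the 0/1 dummy coding that lm() uses):

levels(peng_data$sex)    # the level listed first ("female") is the baseline
contrasts(peng_data$sex) # female coded 0, male coded 1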

The difference is the \(b\) coefficient associated with the sex variable: −4.71 mm or 4.71 mm, depending on which way around you’re doing the comparison.

Given that the \(b\) coefficient for sex is negative and that female is the reference category, male penguins have, on average, shorter flippers.

The p-value associated with the \(b\) coefficient is 6.64 × 10⁻¹⁰, which is far smaller than .001, and so the difference is indeed significant.

Because both body mass and bill length have been centred, and female is the baseline category, the intercept \(b_0\) represents the expected flipper length of a female penguin of average size and bill length. To get the expected flipper length of its male counterpart, all you need to do is add the \(b\) coefficient for sex: \[b_0 + b_3 = 203.23 - 4.71 = 198.51\]
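
If you’d rather let R do this arithmetic for you, you can pull the relevant coefficients out of the fitted model; a minimal sketch:

b <- coef(m_mass_bill_sex)
unname(b["(Intercept)"] + b["sexmale"]) # approximately 198.51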

The metric for the proportion of variance accounted for in the data is multiple R2, so the answer to the penultimate question is 81.4%.

Finally, there is essentially no difference between multiple and adjusted R2 (81.4% and 81.3%), which is a clear indicator that there are no redundant predictors in the model.

Task 8

What is the unique proportion of variance in the flipper length variable that’s attributable to sex? In other words, how much variance in the outcome is accounted for by sex?

You’ll need the difference between R2 from two models.

We already have a model that contains the two predictors other than sex and we also have one that includes all three. So, because sex is the only difference between the two models, the difference in their respective R2 must be due to this variable.

rsq_m1 <- m_massKg_bill_cntr %>% broom::glance() %>% dplyr::pull(r.squared)
rsq_m2 <- m_mass_bill_sex %>% broom::glance() %>% dplyr::pull(r.squared)

# difference in % rounded to 1 dp
round((rsq_m2 - rsq_m1) * 100, 1)
[1] 2.6

Granted, 2.6% of the total variance is not exactly earth-shattering but it’s something!

Model comparison

Now that we know all about the effect of sex on flipper length in penguins (this is going to win you a pub quiz one day!), let’s look at a formal comparison between a model that includes the variable and one that does not.

Task 9

Let’s formally compare m_massKg_bill_cntr and m_mass_bill_sex.

Task 9.1

Enter the objects containing the two linear models as arguments to the anova() function.

anova(m_massKg_bill_cntr, m_mass_bill_sex)

 

Well, that didn’t work, did it?

The error message we get from R for once actually tells us what the problem is in plain English:

Error in anova.lmlist(object, ...) : models were not all fitted to the same size of dataset

So the problem is that the two models we are trying to compare are based on datasets of different sizes.

Task 9.2

Pause to think why this might be the case…

The reason, as is often the case with these things, is missing data. A model can only use those rows of a dataset that have non-missing observations in each column that is used in the model specification.1 If, for instance, there are some penguins in the dataset whose bills, flippers, and body masses have been measured, but whose sex has not, their data will be used to estimate the coefficients of the m_massKg_bill_cntr model, but not the m_mass_bill_sex model. This discrepancy makes the models incomparable.
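
You can verify this by counting the rows that are complete for each model’s variables; a rough sketch using complete.cases() on the original (uncentred) columns, whose NAs sit in the same rows as those of the derived predictors. The counts should match your answers to the questions below.

# rows usable by m_massKg_bill_cntr (flipper length, body mass, bill length)
sum(complete.cases(peng_data[, c("flipper_length_mm", "body_mass_g", "bill_length_mm")]))
# rows usable by m_mass_bill_sex (the same columns plus sex)
sum(complete.cases(peng_data[, c("flipper_length_mm", "body_mass_g", "bill_length_mm", "sex")]))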

Know your models

While we haven’t explicitly focused on data exploration in the past few tutorials, we stressed the importance of knowing your data earlier in the module. We also asked you to familiarise yourself with your dataset, which entails noticing things like unusual, impossible, or missing values in variables.

A familiarity with your data is really important for understanding situations such as the one that we just encountered. You should always be able to answer questions like:

QUESTION 14

What is the sample size used to fit the m_massKg_bill_cntr model?

342


QUESTION 15

What is the sample size used to fit the m_mass_bill_sex model?

333


QUESTION 16

How many missing values are there in the sex variable?

11


QUESTION 17

How many rows have missing data in the other variables entered in the m_mass_bill_sex model but not in the sex variable?

0


There are several ways of finding out how many observations were used in a particular model. A quick one is to check the number of residual df (second number of df for the F-test). This number is always equal to N − p − 1, where p is the number of predictors in the model.

To check the number of missing values in a variable, you can just look at the summary() of the data, use the is.na() and sum() functions, or some other code. Check out the R script to practical 3 for inspiration.
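
For example, a few one-liners (a sketch using base R’s nobs(), df.residual(), and is.na()) give you all of these numbers at once:

nobs(m_massKg_bill_cntr)     # observations used by the first model: 342
nobs(m_mass_bill_sex)        # observations used by the second model: 333
df.residual(m_mass_bill_sex) # residual df: 329 = 333 - 3 - 1
sum(is.na(peng_data$sex))    # missing values in sex: 11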

So now, to compare our two models, we need to refit m_massKg_bill_cntr using the same subset of peng_data that was used to fit m_mass_bill_sex.

Task 9.3

Create an object in the environment called peng_subset that will allow you to refit m_massKg_bill_cntr such that it is comparable with the other model.

Think about which variable (or NAs in which variable) caused the discrepancy in sample size and how you can get around the issue.

All you need to do is filter the data to only include those rows of peng_data which do not have NAs in the sex column.

peng_subset <- peng_data %>%
  filter(!is.na(sex))

Task 9.4

Create a model called m_massKg_bill_cntr_compare that is the same as m_massKg_bill_cntr but, this time round, fitted to peng_subset.

m_massKg_bill_cntr_compare <- lm(flipper_length_mm ~ mass_kg_cntr + bill_length_cntr, peng_subset)

Task 9.5

Have a look at the coefficients of this new model and compare them to the ones from the original model.

coef(m_massKg_bill_cntr_compare)
     (Intercept)     mass_kg_cntr bill_length_cntr 
     200.8593884       13.0173337        0.5440353 
coef(m_massKg_bill_cntr)
     (Intercept)     mass_kg_cntr bill_length_cntr 
     200.9152047       13.0507787        0.5492244 

As you can see, the coefficients are ever so slightly different. This is not that surprising because the data these two versions of the model use are themselves different. What is reassuring is that the discrepancies are very small.

 

Now that the first model is fitted to the same data as the second one, we can finally compare them.

Task 9.6

Compare the m_massKg_bill_cntr_compare and m_mass_bill_sex models using the anova() function.

anova(m_massKg_bill_cntr_compare, m_mass_bill_sex)
Analysis of Variance Table

Model 1: flipper_length_mm ~ mass_kg_cntr + bill_length_cntr
Model 2: flipper_length_mm ~ mass_kg_cntr + bill_length_cntr + sex
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    330 13598                                  
2    329 12108  1      1490 40.486 6.635e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

OK, that should have worked now. Let’s break down the output of the anova() function.

First of all, and least importantly, we have the heading Analysis of Variance Table. This is slightly confusing terminology, as the term analysis of variance (or ANOVA) is often reserved for a different context. Let’s just agree to call this the F-test and never mind the heading (or the name of the function, for that matter).

Next, we have a re-statement of the models we’re comparing along with labels Model 1 and Model 2. These labels correspond to the rows of the table below:

  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    330 13598                                  
2    329 12108  1      1490 40.486 6.635e-10 ***

The first row of this table gives us information about Model 1 (in our case m_massKg_bill_cntr_compare), while the second row contains info about Model 2 (m_mass_bill_sex), plus the results of the F-test itself. Remember from lecture 10 that the F-test compares the amount of variance in the outcome variable explained by a model to the variance not explained by the model. This table contains the value of the F-statistic (40.49) and its associated p-value (6.64 × 10⁻¹⁰, which is < .001). On top of that, it also contains all the information needed to calculate F and p ourselves. Let’s go over this calculation together.

In the lecture, you learnt that the formula for F is

\[F = \frac{\frac{SS_M}{df_M}}{\frac{SS_R}{df_R}}\] where the model sum of squares (\(SS_M\), Sum of Sq in the table above) is just the total SS minus the residual SS (the latter labelled RSS in the table above). You also learnt that the total sum of squares for a model is just the residual SS of the model that we are comparing it to. So, in our case, \(SS_T\) of our model m_mass_bill_sex is the same as \(SS_R\) of m_massKg_bill_cntr_compare. Reading the table, it all checks out:

     RSS Sum of Sq
1  13598
2  12108      1490

Sum of Sq(row 2) = RSS(1) − RSS(2).

The difference between our two models is just one extra coefficient (for sex), and so the model degrees of freedom is 1. The residual degrees of freedom, as mentioned earlier, is \(N - p - 1\), where \(p\) is the total number of predictors in the model: 333 − 3 − 1 = 329.

Now that we have all the numbers we need, we can just plug them into the formula:

\[F = \frac{\frac{SS_M}{df_M}}{\frac{SS_R}{df_R}} = \frac{\frac{1,490}{1}}{\frac{12,108}{329}} \approx \frac{1,490}{36.8} \approx 40.5\]
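
If you want to double-check this arithmetic in R, you can plug the numbers from the anova() table straight in; a quick sketch:

ss_m <- 13598 - 12108      # Sum of Sq = RSS of model 1 minus RSS of model 2
(ss_m / 1) / (12108 / 329) # approximately 40.49, matching the F in the table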

Nice!

So what does the F-test tell us? As explained in the lecture, the F-test tests the null hypothesis that the new model is NOT a significant improvement on the old model. If the probability of getting the F-value we got from the test, or an even larger one, assuming the null hypothesis is true, is small enough – smaller than our chosen \(\alpha\) level – we reject the null hypothesis.

In our case, the p-value is far less than any conventional \(\alpha\) so we can say that:

Sex was significantly related to flipper length, b = −4.71, 95% CI [−6.17, −3.26], t(329) = −6.36, p < .001, and accounted for an additional 2.6% of the variance in the outcome. The inclusion of this predictor in the model represented a significant improvement in model fit over the comparison model, F(1, 329) = 40.49, p < .001.
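
The 95% confidence interval quoted in that write-up can be pulled out of the model with confint(); a quick sketch:

confint(m_mass_bill_sex)["sexmale", ] # approximately -6.17 to -3.26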

Task 10

For your last task, perform an F-test comparing m_massKg_bill_cntr_compare to the null model.

First, we need to fit the null model to the data. Since m_massKg_bill_cntr_compare was built using peng_subset, we need to fit the null model to the same dataset!

m_null <- lm(flipper_length_mm ~ 1, peng_subset)

Next, we can use anova() to compare the models:

# You can also compare all 3 models
anova(m_null, m_massKg_bill_cntr_compare)
Analysis of Variance Table

Model 1: flipper_length_mm ~ 1
Model 2: flipper_length_mm ~ mass_kg_cntr + bill_length_cntr
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    332 65219                                  
2    330 13598  2     51620 626.35 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-value you’re after is the one in the row corresponding to Model 2.

 

Well done finishing your last Analysing Data tutorial!

 

EXTRA: Mismatch

In case you’re wondering why, when you compare all three models in a single anova() call (as the comment in the code above suggests), the F-value in the 2nd row does not match the F-statistic at the bottom of the summary() output:

summary(m_massKg_bill_cntr_compare)

Call:
lm(formula = flipper_length_mm ~ mass_kg_cntr + bill_length_cntr, 
    data = peng_subset)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.9869  -4.5868   0.4285   4.8784  14.0977 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      200.85939    0.35180 570.940  < 2e-16 ***
mass_kg_cntr      13.01733    0.54163  24.034  < 2e-16 ***
bill_length_cntr   0.54404    0.07975   6.822 4.31e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.419 on 330 degrees of freedom
Multiple R-squared:  0.7915,    Adjusted R-squared:  0.7902 
F-statistic: 626.3 on 2 and 330 DF,  p-value: < 2.2e-16

…the answer has to do with the fact that there are several ways of cutting up variance. When anova() is given a sequence of models, it partitions the variance sequentially and uses the residual variance of the largest model in the comparison as the denominator of every F-test, whereas the F-statistic at the bottom of summary() compares that model to the intercept-only model using its own residual variance. We briefly talked about different ways of partitioning sums of squares (e.g. Type I vs Type III) in the 7th episode of #statsChats.
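
If you want to see the mismatch for yourself, compare all three models in a single call (output not shown here):

# with all three models in one anova() call, the denominator of every F-test is
# the residual variance of the largest model (m_mass_bill_sex), so the F in the
# 2nd row no longer matches the 626.3 reported by summary() above
anova(m_null, m_massKg_bill_cntr_compare, m_mass_bill_sex)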


  1. OK, this is strictly speaking not true. There are various “data imputation” techniques that allow us to fill in missing data with plausible values, but this is an advanced topic that you do not need to worry about at this stage.↩︎