Tutorial 11
In tutorial 10, we extended our model of penguin flipper length by including bill length as an additional predictor and learnt how to centre, scale, and standardise variables in the linear model to change the interpretation of the individual \(b\) coefficients. The last tutorial of this module will guide you through model comparison using the F-test with R's anova() function. Before you start this tutorial, make sure you are comfortable with the concepts discussed in lecture 10.
As usual, here’s a little quiz to get you in the mood. Consider this linear model when answering the questions below.
Call:
lm(formula = live ~ love + laugh, data = yolo)
Residuals:
Min 1Q Median 3Q Max
-49.822 -13.691 -0.455 12.850 61.269
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.113 1.794 27.380 < 2e-16 ***
love 10.914 1.261 8.656 3.41e-16 ***
laugh -5.178 2.431 -2.130 0.034 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.72 on 291 degrees of freedom
Multiple R-squared: 0.2054, Adjusted R-squared: 0.2
F-statistic: 37.62 on 2 and 291 DF, p-value: 2.949e-15
How many predictors are there in this model?
2
Assuming laugh is a binary categorical variable, what is the difference in live between the two groups?
(Give answer to 2 decimal places)
-5.18 | 5.18
Which of the two groups has a higher expected value of live?
laugh = 0
Assuming love is a mean-centred continuous variable, what is the expected value of live for someone with an average level of love and 0 laugh?
(Give answer to 2 decimal places)
49.11
What model does the F-test reported in the summary compare this model to?
The null (intercept-only) model
Given standardised B coefficients of 0.46 and −0.11 for love and laugh respectively, which variable has the larger effect on the outcome?
Love
As usual, all you need is this tutorial and RStudio. Remember that you can easily switch between windows with the Alt + ↹ Tab (Windows) and ⌘ Command + ⇥ Tab (Mac OS) shortcuts.
Create a new week_10 project and open it in RStudio. Then, create two new folders in the new week_10 folder: r_docs and data. Finally, open a new R Markdown file and save it in the r_docs folder. Get into the habit of creating a new code chunk in the file for each of the tasks below.
Remember, you can add new code chunks by typing ```{r}, pressing Enter, and then typing ``` again.
In this tutorial, just like in the previous two, we’ll continue using the penguins data set from the palmerpenguins package. You will also need the packages tidyverse, broom, and QuantPsyc (install it if you don’t have it yet; watch out for spelling and upper/lower case letters!). Put the code that loads all of these in the setup chunk of your Rmd file and run it.
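If you’re not sure what that should look like, your setup chunk could contain something along these lines (assuming all three packages are already installed):
# load the packages used in this tutorial
library(tidyverse)
library(broom)
library(QuantPsyc)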
Just like the last couple of times, save the penguins data set in your environment as peng_data to avoid having to type out palmerpenguins::penguins every time you want to use it.
peng_data <- palmerpenguins::penguins
It’s always good to refresh your memory about the dataset by looking at a rough summary.
summary(peng_data)
species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
Let’s start by rebuilding one of the models we fitted last time.
Build a linear model predicting flipper length by body mass in kilogrammes and bill length in millimetres, with both predictors mean-centred.
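If you can’t quite remember how, here is one way you might do it. The names mass_kg_cntr, bill_length_cntr, and m_massKg_bill_cntr are the ones used in the output later in this tutorial, but the dplyr::mutate() step is just one of several ways of centring the variables:
# convert body mass to kg and mean-centre both predictors
peng_data <- peng_data %>%
  dplyr::mutate(
    mass_kg_cntr = body_mass_g / 1000 - mean(body_mass_g / 1000, na.rm = TRUE),
    bill_length_cntr = bill_length_mm - mean(bill_length_mm, na.rm = TRUE)
  )
# flipper length predicted by centred body mass (kg) and centred bill length
m_massKg_bill_cntr <- lm(flipper_length_mm ~ mass_kg_cntr + bill_length_cntr, data = peng_data)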
Create a new model m_mass_bill_sex that adds the sex variable to the two predictors in the previous model.
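One way to fit it, matching the call shown in the summary below:
# same two centred predictors as before, plus sex
m_mass_bill_sex <- lm(flipper_length_mm ~ mass_kg_cntr + bill_length_cntr + sex, data = peng_data)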
Look at the summary of this model and answer the following questions.
Which category of the sex variable is represented by 0?
Female
What is the difference between the mean flipper length of a male and female penguin?
(Give answer in mm to 2 decimal places, such as 0.42)
-4.71 | 4.71
Which sex of penguin has, on average, longer flippers?
Female
Is the difference statistically significant at α = .001?
Yes
What is the expected flipper length of a male penguin of average size with a bill of average length?
(Give answer in mm to 2 decimal places, such as 200.15)
198.51
What percentage of the total variance in the outcome is accounted for by the combined effect of the predictors in the data?
(Give answer as % to 1 decimal place)
81.4
Does the model include any unhelpful predictors?
No
summary(m_mass_bill_sex)
Call:
lm(formula = flipper_length_mm ~ mass_kg_cntr + bill_length_cntr +
sex, data = peng_data)
Residuals:
Min 1Q Median 3Q Max
-23.725 -3.917 0.330 4.199 14.205
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 203.22727 0.49903 407.248 < 2e-16 ***
mass_kg_cntr 14.01511 0.53535 26.179 < 2e-16 ***
bill_length_cntr 0.60591 0.07599 7.973 2.57e-14 ***
sexmale -4.71265 0.74065 -6.363 6.64e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.067 on 329 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8143, Adjusted R-squared: 0.8126
F-statistic: 481 on 3 and 329 DF, p-value: < 2.2e-16
The last row of the coefficient table says sexmale. This is the change in the outcome when the sex variable “changes” to male, after the effect of the other predictors has been taken into account. In other words, it must be the female category that serves as the baseline (sex = 0) category.
The difference is the \(b\) coefficient associated with the sex variable: −4.71 mm or 4.71 mm, depending on which way around you’re doing the comparison.
Given that the \(b\) coefficient for sex is negative and that female is the reference category, male penguins have, on average, shorter flippers.
The p-value associated with the \(b\) coefficient is \(6.64 \times 10^{-10}\), which is far smaller than .001, and so the difference is indeed significant.
Because both body mass and bill length have been centred, and female is the baseline category, the intercept \(b_0\) represents the expected flipper length of a female penguin of average size and bill length. To get the expected flipper length of its male counterpart, all you need to do is add the \(b\) coefficient for sex: \[b_0 + b_3 = 203.23 + (-4.71) = 198.51\]
The metric for the proportion of variance accounted for in the data is multiple \(R^2\), so the answer to the penultimate question is 81.4%.
Finally, there is essentially no difference between multiple and adjusted \(R^2\) (81.4% and 81.3%), which is a clear indicator that there are no redundant predictors in the model.
What is the unique proportion of variance in the flipper length variable that’s attributable to sex? In other words, how much variance in the outcome is accounted for by sex over and above the other two predictors?
You’ll need the difference between the \(R^2\) values from two models.
We already have a model that contains the two predictors other than sex, and we also have one that includes all three. So, because sex is the only difference between the two models, the difference in their respective \(R^2\) must be due to this variable.
rsq_m1 <- m_massKg_bill_cntr %>% broom::glance() %>% dplyr::pull(r.squared)
rsq_m2 <- m_mass_bill_sex %>% broom::glance() %>% dplyr::pull(r.squared)
# difference in % rounded to 1 dp
round((rsq_m2 - rsq_m1) * 100, 1)
[1] 2.6
Granted, 2.6% of the total variance is not exactly earth-shattering but it’s something!
Now that we know all about the effect of sex on flipper length in penguins (this is going to win you a pub quiz one day!), let’s look at a formal comparison between a model that includes the variable and one that does not.
Let’s formally compare m_massKg_bill_cntr and m_mass_bill_sex.
Enter the objects containing the two linear models as arguments to the anova() function.
anova(m_massKg_bill_cntr, m_mass_bill_sex)
Well, that didn’t work, did it?
The error message we get from R for once actually tells us what the problem is in plain English:
Error in anova.lmlist(object, ...) : models were not all fitted to the same size of dataset
So the problem is that the two models we are trying to compare are based on datasets of different sizes.
Pause to think why this might be the case…
The reason, as is often the case with these things, is missing data. A model can only use those rows of a dataset that have non-missing observations in each column that is used in the model specification.1 If, for instance, there are some penguins in the dataset whose bills, flippers, and body masses have been measured, but whose sex has not, their data will be used to estimate the coefficients of the m_massKg_bill_cntr model, but not the m_mass_bill_sex model. This discrepancy makes the models incomparable.
While we haven’t explicitly focused on data exploration in the past few tutorials, we stressed the importance of knowing your data earlier in the module. We also asked you to familiarise yourself with your dataset, which entails noticing things like unusual, impossible, or missing values in variables.
A familiarity with your data is really important for understanding situations such as the one that we just encountered. You should always be able to answer questions like:
What is the sample size used to fit the m_massKg_bill_cntr model?
342
What is the sample size used to fit the m_mass_bill_sex model?
333
How many missing values are there in the sex variable?
11
How many rows have missing data in the other variables entered in the m_mass_bill_sex model but not in the sex variable?
0
There are several ways of finding out how many observations were used in a particular model. A quick one is to check the number of residual degrees of freedom (the second df number reported for the F-test). This number is always equal to N − p − 1, where p is the number of predictors in the model, so N = residual df + p + 1.
To check the number of missing values in a variable, you can just look at the summary() of the data, use the is.na() and sum() functions, or some other code. Check out the R script to practical 3 for inspiration.
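For instance, a couple of lines like these (df.residual(), is.na(), and sum() are all base R functions) will give you the numbers asked about above:
# N used by each model = residual df + number of predictors + 1
df.residual(m_massKg_bill_cntr) + 2 + 1   # 342
df.residual(m_mass_bill_sex) + 3 + 1      # 333
# number of missing values in the sex variable
sum(is.na(peng_data$sex))                 # 11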
So now, to compare our two models, we need to refit m_massKg_bill_cntr using the same subset of peng_data that was used to fit m_mass_bill_sex.
Create an object in the environment called peng_subset that will allow you to refit m_massKg_bill_cntr such that it is comparable with the other model.
Think about which variable (or NAs in which variable) caused the discrepancy in sample size and how you can get around the issue.
All you need to do is filter the data to only include those rows of peng_data which do not have NAs in the sex column.
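For example, using dplyr:
# keep only those rows with a non-missing value of sex
peng_subset <- peng_data %>%
  dplyr::filter(!is.na(sex))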
Create a model called m_massKg_bill_cntr_compare that is the same as m_massKg_bill_cntr but, this time round, fitted to peng_subset.
m_massKg_bill_cntr_compare <- lm(flipper_length_mm ~ mass_kg_cntr + bill_length_cntr, peng_subset)
Have a look at the coefficients of this new model and compare them to the ones from the original model.
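One quick way to see both sets of estimates together is to bind the coefficient vectors into a small table (coef() simply extracts the \(b\) estimates from a fitted model):
rbind(original = coef(m_massKg_bill_cntr),
      refitted = coef(m_massKg_bill_cntr_compare))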
As you can see, the coefficients are ever so slightly different. This is not that surprising because the data these two versions of the model use are themselves different. What is reassuring is that the discrepancies are very small.
Now that the first model is fitted to the same data as the second one, we can finally compare them.
Compare the m_massKg_bill_cntr_compare and m_mass_bill_sex models using the anova() function.
anova(m_massKg_bill_cntr_compare, m_mass_bill_sex)
Analysis of Variance Table
Model 1: flipper_length_mm ~ mass_kg_cntr + bill_length_cntr
Model 2: flipper_length_mm ~ mass_kg_cntr + bill_length_cntr + sex
Res.Df RSS Df Sum of Sq F Pr(>F)
1 330 13598
2 329 12108 1 1490 40.486 6.635e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
OK, that should have worked now. Let’s break down the output of the anova() function.
First of all, and least importantly, we have the heading Analysis of Variance Table. This is slightly confusing terminology, as the term analysis of variance (or ANOVA) is often reserved for a different context. Let’s just agree to call this the F-test and never mind the heading (or the name of the function, for that matter).
Next, we have a re-statement of the models we’re comparing along with the labels Model 1 and Model 2. These labels correspond to the rows of the table below:
Res.Df RSS Df Sum of Sq F Pr(>F)
1 330 13598
2 329 12108 1 1490 40.486 6.635e-10 ***
The first row of this table gives us information about Model 1 (in our case m_massKg_bill_cntr_compare), while the second row contains info about Model 2 (m_mass_bill_sex), plus the results of the F-test itself. Remember from lecture 10 that the F-test compares the amount of variance in the outcome variable explained by a model to the variance not explained by the model. This table contains the value of the F-statistic (40.49) and its associated p-value (\(6.64 \times 10^{-10}\), which is < .001). On top of that, it also contains all the information needed to actually calculate F and p. Let’s go over this calculation together.
In the lecture, you learnt that the formula for F is
\[F = \frac{\frac{SS_M}{df_M}}{\frac{SS_R}{df_R}}\]
where the model sum of squares (\(SS_M\), Sum of Sq in the table above) is just the total SS minus the residual SS (the latter labelled RSS in the table above). You also learnt that the total sum of squares for a model is just the residual SS of the model that we are comparing it to. So, in our case, \(SS_T\) of our model m_mass_bill_sex is the same as \(SS_R\) of m_massKg_bill_cntr_compare. Reading the table, it all checks out:
    RSS  Sum of Sq
1 13598
2 12108       1490
Sum of Sq (row 2) = RSS (row 1) − RSS (row 2): 13,598 − 12,108 = 1,490.
The difference between our two models is just one extra coefficient (for sex), and so the model degrees of freedom \(df_M\) is 1. The residual degrees of freedom, as mentioned earlier, is \(N - p - 1\), where \(p\) is the total number of predictors in the larger model: 333 − 3 − 1 = 329.
Now that we have all the numbers we need, we can just plug them into the formula:
\[F = \frac{\frac{SS_M}{df_M}}{\frac{SS_R}{df_R}} = \frac{\frac{1,490}{1}}{\frac{12,108}{329}} \approx \frac{1,490}{36.8} \approx 40.49\]
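If you want to double-check this in R, you can reproduce the F-value and its p-value from the numbers in the anova() table (pf() gives tail probabilities of the F distribution):
ss_m  <- 13598 - 12108                    # Sum of Sq = RSS(1) - RSS(2)
f_val <- (ss_m / 1) / (12108 / 329)       # F = (SS_M/df_M) / (SS_R/df_R)
f_val                                     # roughly 40.49
pf(f_val, df1 = 1, df2 = 329, lower.tail = FALSE)  # roughly 6.6e-10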
Nice!
So what does the F-test tell us? As explained in the lecture, the F-test tests the null hypothesis that the new model is NOT a significant improvement on the old model. If the probability of obtaining an F-value at least as large as the one we got, assuming the null hypothesis is true, is small enough – smaller than our chosen \(\alpha\) level – we reject the null hypothesis.
In our case, the p-value is far smaller than any conventional \(\alpha\), so we can say that:
Sex was significantly related to flipper length, b = −4.71, 95% CI [−6.17, −3.26], t(329) = −6.36, p < .001, and accounted for an additional 2.6% of the variance in the outcome. The inclusion of this predictor in the model represented a significant improvement in model fit over the comparison model, F(1, 329) = 40.49, p < .001.
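In case you’re wondering where the 95% CI in the write-up comes from, confint() will give it to you; the row for sexmale should reproduce the [−6.17, −3.26] interval reported above:
# 95% confidence interval for the sex coefficient
confint(m_mass_bill_sex, parm = "sexmale")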
For your last task, perform an F-test comparing m_massKg_bill_cntr_compare to the null model.
First, we need to fit the null model to the data. Since we’re comparing it with m_massKg_bill_cntr_compare, which was built using peng_subset, we need to fit the null model to the same dataset!
m_null <- lm(flipper_length_mm ~ 1, peng_subset)
Next, we can use anova() to compare the models:
# You can also compare all 3 models
anova(m_null, m_massKg_bill_cntr_compare)
Analysis of Variance Table
Model 1: flipper_length_mm ~ 1
Model 2: flipper_length_mm ~ mass_kg_cntr + bill_length_cntr
Res.Df RSS Df Sum of Sq F Pr(>F)
1 332 65219
2 330 13598 2 51620 626.35 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-value you’re after is the one in the row corresponding to Model 2.
Well done finishing your last Analysing Data tutorial!
In case you compared all three models in a single anova() call (as the comment in the code above suggests) and are wondering why the F-value in the 2nd row of that output does not match the F-statistic at the bottom of the summary() output:
summary(m_massKg_bill_cntr_compare)
Call:
lm(formula = flipper_length_mm ~ mass_kg_cntr + bill_length_cntr,
data = peng_subset)
Residuals:
Min 1Q Median 3Q Max
-20.9869 -4.5868 0.4285 4.8784 14.0977
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 200.85939 0.35180 570.940 < 2e-16 ***
mass_kg_cntr 13.01733 0.54163 24.034 < 2e-16 ***
bill_length_cntr 0.54404 0.07975 6.822 4.31e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.419 on 330 degrees of freedom
Multiple R-squared: 0.7915, Adjusted R-squared: 0.7902
F-statistic: 626.3 on 2 and 330 DF, p-value: < 2.2e-16
…the answer has to do with the fact that there are several ways of cutting up variance. When several models are compared in one anova() call, the residual variance of the largest model is used as the denominator for all of the F-tests, whereas the F-statistic in summary() uses only the residual variance of that model itself. This is related to the difference between Type I and Type III Sums of Squares, which we briefly talked about in the 7th episode of #statsChats.
OK, this is strictly speaking not true. There are various “data imputation” techniques that allow us to fill in missing data with a likely value, but this is an advanced topic that you do not need to worry about at this stage.