$$y_i = b_0 + b_1 \times x_{1i} + 0 \times x_{2i} + 0 \times x_{3i} + \dots + 0 \times x_{ni} + e_i$$
Every predictor adds a dimension (axis)
Every b coefficient (except for b0) is a slope of the plane of best fit with respect to the dimension of the predictor
Predictors can be continuous or categorical
Outcome must be continuous
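In lm() syntax, each predictor is just another term on the right-hand side of the formula. A minimal sketch (the data frame and variable names here are hypothetical placeholders):

# each term after ~ adds one predictor (one dimension) to the model
# my_data, outcome, pred1, pred2, pred3 are placeholder names
m <- lm(outcome ~ pred1 + pred2 + pred3, data = my_data)
summary(m)  # one b (slope) per predictor, plus the intercept b0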
# centre predictors
bweight <- bweight %>% 
  mutate(gest_cntrd = Gestation - mean(Gestation, na.rm = TRUE),
         age_cntrd = mage - mean(mage, na.rm = TRUE))
m_age_gest <- lm(Birthweight ~ age_cntrd + gest_cntrd, bweight)
summary(m_age_gest)
## 
## Call:
## lm(formula = Birthweight ~ age_cntrd + gest_cntrd, data = bweight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77485 -0.35861 -0.00236  0.26948  0.96943 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.3128571  0.0674405  49.123  < 2e-16 ***
## age_cntrd   -0.0007953  0.0120469  -0.066    0.948    
## gest_cntrd   0.1618369  0.0258242   6.267 2.21e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4371 on 39 degrees of freedom
## Multiple R-squared:  0.5017, Adjusted R-squared:  0.4762 
## F-statistic: 19.64 on 2 and 39 DF,  p-value: 1.26e-06
Let's see whether the mother being a smoker has an influence on the baby's birth weight
What do you think?
Smoking is a binary categorical predictor, but we can add it to the model just like any other variable!
bweight$smoker
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 1 1
m_smoke <- lm(Birthweight ~ smoker, bweight)
summary(m_smoke)
## 
## Call:
## lm(formula = Birthweight ~ smoker, data = bweight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.21409 -0.39159  0.02591  0.41935  1.43591 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.5095     0.1298  27.040   <2e-16 ***
## smoker       -0.3754     0.1793  -2.093   0.0427 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5804 on 40 degrees of freedom
## Multiple R-squared:  0.09874, Adjusted R-squared:  0.07621 
## F-statistic: 4.382 on 1 and 40 DF,  p-value: 0.0427
$$\text{birthweight}_i = b_0 - 0.38 \times \text{smoker}_i + e_i$$
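Because smoker is coded 0/1, the intercept is the mean birth weight of the non-smokers and the b for smoker is the difference between the two group means. A quick sanity check against the output above:

# intercept: mean birth weight when smoker == 0 (~3.51)
mean(bweight$Birthweight[bweight$smoker == 0])
# slope: difference between the group means (~ -0.38)
mean(bweight$Birthweight[bweight$smoker == 1]) -
  mean(bweight$Birthweight[bweight$smoker == 0])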
m_age_gest_smoke <- update(m_age_gest, ~ . + smoker)
# that's the same as
m_age_gest_smoke <- lm(Birthweight ~ age_cntrd + gest_cntrd + smoker, bweight)
summary(m_age_gest_smoke)
## 
## Call:
## lm(formula = Birthweight ~ age_cntrd + gest_cntrd + smoker, data = bweight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.60724 -0.31521  0.00952  0.24838  1.08946 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.475375   0.093843  37.034  < 2e-16 ***
## age_cntrd    0.005115   0.011668   0.438   0.6636    
## gest_cntrd   0.156079   0.024552   6.357 1.84e-07 ***
## smoker      -0.310261   0.131381  -2.362   0.0234 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4135 on 38 degrees of freedom
## Multiple R-squared:  0.5655, Adjusted R-squared:  0.5312 
## F-statistic: 16.49 on 3 and 38 DF,  p-value: 5.083e-07
# multiple R-squared from model
m_age_gest_smoke %>% broom::glance() %>% pull(r.squared)
## [1] 0.5655139
# same as manually calculated using the formula
1 - var(resid(m_age_gest_smoke)) / var(bweight$Birthweight)
## [1] 0.5655139
predicted <- fitted(m_age_gest_smoke)
correlation <- cor(predicted, bweight$Birthweight)
correlation^2
## [1] 0.5655139
# add 3 random predictor variables
bweight <- bweight %>% 
  mutate(x4 = rnorm(nrow(bweight)),
         x5 = rnorm(nrow(bweight)),
         x6 = rnorm(nrow(bweight)))
m_big <- update(m_age_gest_smoke, ~ . + x4 + x5 + x6)
m_big %>% broom::tidy()
## # A tibble: 7 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  3.48       0.0936    37.2   1.02e-29
## 2 age_cntrd    0.00284    0.0116     0.244 8.09e- 1
## 3 gest_cntrd   0.145      0.0260     5.57  2.83e- 6
## 4 smoker      -0.280      0.135     -2.08  4.53e- 2
## 5 x4           0.100      0.0745     1.35  1.86e- 1
## 6 x5          -0.0368     0.0754    -0.488 6.28e- 1
## 7 x6          -0.0784     0.0560    -1.40  1.71e- 1
# original 3-predictor model
m_age_gest_smoke %>% broom::glance() %>% pull(r.squared)
## [1] 0.5655139
# model that includes rubbish predictors
m_big %>% broom::glance() %>% pull(r.squared)
## [1] 0.6135306
The new model explains an additional ~5% of the variance in birth weight, even though the three new predictors are just randomly generated
We want a metric of explained variance that doesn't get inflated by adding noise to the model
$$R^2_{adj} = 1 - \frac{\frac{SS_R}{n-p-1}}{\frac{SS_T}{n-1}}$$
How does the penalty work?
The larger the number of predictors ($p$), the larger the numerator $\frac{SS_R}{n-p-1}$ gets
The larger the numerator, the larger the entire fraction (just like 3/4 is more than 2/4)
The larger the fraction, the smaller the adjusted R²
Penalty per predictor is larger in small samples
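To see the penalty in action, we can rearrange the formula above into the equivalent form $R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}$ and plug in the numbers from our 3-predictor model:

r2 <- 0.5655139  # multiple R-squared of m_age_gest_smoke
n <- 42          # number of observations
p <- 3           # number of predictors
1 - (1 - r2) * (n - 1) / (n - p - 1)

## [1] 0.5312124

This matches the adjusted R-squared reported by summary(m_age_gest_smoke).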
# original 3-predictor model
m_age_gest_smoke %>% broom::glance() %>% dplyr::select(r.squared, adj.r.squared)
## # A tibble: 1 x 2
##   r.squared adj.r.squared
##       <dbl>         <dbl>
## 1     0.566         0.531
# model that includes rubbish predictors
m_big %>% broom::glance() %>% dplyr::select(r.squared, adj.r.squared)
## # A tibble: 1 x 2
##   r.squared adj.r.squared
##       <dbl>         <dbl>
## 1     0.614         0.547
Adjusted R² is an attempt at estimating the proportion of variance explained by the model in the population
If adjusted R² is always better, why even look at multiple R²?
A large difference between the values of multiple R² and adjusted R² indicates that there are some unnecessary predictors in the model and that it should be simplified
The larger the difference between the two, the less likely it is that the model will generalise beyond the current sample
We can take a model as a whole and assess whether or not it is a good fit to the data
To do this, we calculate the ratio of the variance explained by the model (AKA model variance) to the variance not explained by the model (AKA residual variance)
This is the F-ratio
$$F = \frac{\text{explained variance}}{\text{unexplained variance}} = \frac{\text{model variance}}{\text{residual variance}} = \frac{SS_M / df_M}{SS_R / df_R} = \frac{(SS_T - SS_R) / df_M}{SS_R / df_R}$$
## 
## Call:
## lm(formula = Birthweight ~ age_cntrd + gest_cntrd + smoker, data = bweight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.60724 -0.31521  0.00952  0.24838  1.08946 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.475375   0.093843  37.034  < 2e-16 ***
## age_cntrd    0.005115   0.011668   0.438   0.6636    
## gest_cntrd   0.156079   0.024552   6.357 1.84e-07 ***
## smoker      -0.310261   0.131381  -2.362   0.0234 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4135 on 38 degrees of freedom
## Multiple R-squared:  0.5655, Adjusted R-squared:  0.5312 
## F-statistic: 16.49 on 3 and 38 DF,  p-value: 5.083e-07
"Good" fit is a relative term - a model is a good fit in comparison to some other model
wHAt noW??
Yes, we're always comparing two models, even in the example above!
The total variance of a model* is just the residual variance of the model we're comparing it to!
The F-test from summary() compares the model to the null (intercept-only) model
* The denominator of the ratio $F = \frac{\text{model variance}}{\text{total variance}}$
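We can verify this claim in R: the total sum of squares of the outcome is exactly the residual sum of squares of the intercept-only model. A minimal check (m_null is the same null model fitted again below):

m_null <- lm(Birthweight ~ 1, bweight)
# total SS of the outcome variable...
sum((bweight$Birthweight - mean(bweight$Birthweight))^2)
# ...equals the residual SS of the intercept-only model (~14.95, see anova() below)
deviance(m_null)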
My model brings all the x to the y
And F like "it's better than yours!"
Damn right, it's better than yours!
I can teach you, but you must come to the lectures, do the practicals and tutorials, and revise
The null hypothesis of the F-test is that the model is no better than the comparison model
A significant F-test tells us that our model is a statistically significantly better fit to the data than the comparison model
In other words, our model is an improvement on the previous model
It explains significantly* more variance in the outcome variable than the other model
*statistically significantly!!!
The anova() function in R allows us to compare the models explicitly

m_null <- lm(Birthweight ~ 1, bweight)
anova(m_null, m_age_gest_smoke)
## Analysis of Variance Table
## 
## Model 1: Birthweight ~ 1
## Model 2: Birthweight ~ age_cntrd + gest_cntrd + smoker
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     41 14.9523                                  
## 2     38  6.4965  3    8.4557 16.486 5.083e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
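Using the sums of squares from this table ($SS_T$ is the RSS of the null model), we can reproduce the F-ratio by hand with the formula above:

ss_t <- 14.9523  # RSS of m_null = total sum of squares
ss_r <- 6.4965   # RSS of m_age_gest_smoke
df_m <- 3        # model df (number of predictors)
df_r <- 38       # residual df (42 - 3 - 1)
((ss_t - ss_r) / df_m) / (ss_r / df_r)

## [1] 16.48685

The tiny discrepancy from the 16.486 in the table is just rounding in the printed SS values.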
This is the same F-test that summary() (or broom::glance()) reports:

| statistic | df | df.residual | p.value |
|---|---|---|---|
| 16.48655 | 3 | 38 | 5.083237e-07 |
You can compare as many models as you like
They all have to be modelling the same outcome variable
They must all be nested
You must be able to sort them from the simplest to the most complex
A simpler model mustn't contain any predictors that a more complex model doesn't also contain
They must all be fitted to the same data (their respective Ns must be the same)
birthweight ~ gestation vs birthweight ~ gestation + mage ✓ (nested)
birthweight ~ gestation vs birthweight ~ smoker + mage ✗ (not nested: the second model doesn't contain gestation)
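For example, m_null, m_age_gest, and m_age_gest_smoke are nested and fitted to the same data, so they can all go into a single anova() call; each row of the resulting table then tests the improvement over the model on the row above:

# compare a whole sequence of nested models, simplest first
anova(m_null, m_age_gest, m_age_gest_smoke)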
Categorical predictors can be entered in the linear model just like continuous predictors
For binary variables, the b represents the difference between the two categories
For more than 2 categories, it's a little more complicated (next term)
Multiple R² can be interpreted as the proportion of variance in the outcome explained by the model, or as the square of the correlation between what the model predicts and what we observe
Multiple R² gets inflated by adding more and more predictors
Adjusted R² penalises for each added predictor to counteract this
We can compare nested models using the F-test
The F-ratio is the ratio of the variance explained by the model to the variance not explained by the model
We know that the sampling distribution of the F-ratio (AKA F-value, F-statistic) is the F-distribution with $df_M$ and $df_R$ degrees of freedom (see the sketch after this list)
The null hypothesis of the F-test is that the model is no better than the comparison model
F-test is always a comparison of two models, albeit sometimes implicitly
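To connect the last few points: the p-value of the F-test is the area under this F-distribution above the observed F-ratio. Checking with the values from our model (F = 16.48655 on 3 and 38 degrees of freedom):

# p-value of the F-test: area above the observed F in the F(3, 38) distribution
# ~5.083e-07, the p-value reported by summary() and broom::glance()
pf(16.48655, df1 = 3, df2 = 38, lower.tail = FALSE)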
Next week is our final Analysing Data lecture 😢
Revision session to recap all we talked about in all of this term's lectures
Don't forget to do the last tutorial that will let you practice comparing models
For the final practical, Jennifer and I are planning something pretty special so do come!
It's still a practical, and no, it won't be delivered through the medium of interpretative dance
We won't be singing either!