$$y_i = b_0 + b_1 \times x_{1i} + 0 \times x_{2i} + 0 \times x_{3i} + \dots + 0 \times x_{ni} + e_i$$
Every predictor adds a dimension (axis)
Every b coefficient (except for b0) is a slope of the plane of best fit with respect to the dimension of the predictor
Predictors can be continuous or categorical
Outcome must be continuous
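In lm() syntax, each predictor is just another term on the right-hand side of the formula. A minimal sketch (the data frame and variable names here are hypothetical placeholders):

# each term after ~ adds one predictor (one dimension) to the model
# my_data, outcome, pred1, pred2, pred3 are placeholder names
m <- lm(outcome ~ pred1 + pred2 + pred3, data = my_data)
summary(m)  # one b (slope) per predictor, plus the intercept b0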
# centre predictors
bweight <- bweight %>% 
  mutate(gest_cntrd = Gestation - mean(Gestation, na.rm = TRUE),
         age_cntrd = mage - mean(mage, na.rm = TRUE))
m_age_gest <- lm(Birthweight ~ age_cntrd + gest_cntrd, bweight)
summary(m_age_gest)
## 
## Call:
## lm(formula = Birthweight ~ age_cntrd + gest_cntrd, data = bweight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77485 -0.35861 -0.00236  0.26948  0.96943 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.3128571  0.0674405  49.123  < 2e-16 ***
## age_cntrd   -0.0007953  0.0120469  -0.066    0.948    
## gest_cntrd   0.1618369  0.0258242   6.267 2.21e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4371 on 39 degrees of freedom
## Multiple R-squared:  0.5017, Adjusted R-squared:  0.4762 
## F-statistic: 19.64 on 2 and 39 DF,  p-value: 1.26e-06
Let's see whether the mother being a smoker has an influence on the baby's birth weight
What do you think?
Smoking is a binary categorical predictor, but we can add it to the model just like any other variable!
bweight$smoker
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 1 1
m_smoke <- lm(Birthweight ~ smoker, bweight)
summary(m_smoke)
## 
## Call:
## lm(formula = Birthweight ~ smoker, data = bweight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.21409 -0.39159  0.02591  0.41935  1.43591 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.5095     0.1298  27.040   <2e-16 ***
## smoker       -0.3754     0.1793  -2.093   0.0427 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5804 on 40 degrees of freedom
## Multiple R-squared:  0.09874, Adjusted R-squared:  0.07621 
## F-statistic: 4.382 on 1 and 40 DF,  p-value: 0.0427
$$\text{birthweight}_i = b_0 - 0.38 \times \text{smoker}_i + e_i$$
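Because smoker is coded 0/1, the intercept is the mean birth weight of the non-smokers and the b for smoker is the difference between the two group means. A quick sanity check against the output above:

# intercept: mean birth weight when smoker == 0 (~3.51)
mean(bweight$Birthweight[bweight$smoker == 0])
# slope: difference between the group means (~ -0.38)
mean(bweight$Birthweight[bweight$smoker == 1]) -
  mean(bweight$Birthweight[bweight$smoker == 0])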
m_age_gest_smoke <- update(m_age_gest, ~ . + smoker)
# that's the same as
m_age_gest_smoke <- lm(Birthweight ~ age_cntrd + gest_cntrd + smoker, bweight)
summary(m_age_gest_smoke)
## 
## Call:
## lm(formula = Birthweight ~ age_cntrd + gest_cntrd + smoker, data = bweight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.60724 -0.31521  0.00952  0.24838  1.08946 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.475375   0.093843  37.034  < 2e-16 ***
## age_cntrd    0.005115   0.011668   0.438   0.6636    
## gest_cntrd   0.156079   0.024552   6.357 1.84e-07 ***
## smoker      -0.310261   0.131381  -2.362   0.0234 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4135 on 38 degrees of freedom
## Multiple R-squared:  0.5655, Adjusted R-squared:  0.5312 
## F-statistic: 16.49 on 3 and 38 DF,  p-value: 5.083e-07
# multiple R-squared from model
m_age_gest_smoke %>% broom::glance() %>% pull(r.squared)
## [1] 0.5655139
# same as manually calculated using the formula
1 - var(resid(m_age_gest_smoke)) / var(bweight$Birthweight)
## [1] 0.5655139
predicted <- fitted(m_age_gest_smoke)
correlation <- cor(predicted, bweight$Birthweight)
correlation^2
## [1] 0.5655139
# add 3 random predictor variables
bweight <- bweight %>% 
  mutate(x4 = rnorm(nrow(bweight)),
         x5 = rnorm(nrow(bweight)),
         x6 = rnorm(nrow(bweight)))
m_big <- update(m_age_gest_smoke, ~ . + x4 + x5 + x6)
m_big %>% broom::tidy()
## # A tibble: 7 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  3.48       0.0936    37.2   1.02e-29
## 2 age_cntrd    0.00284    0.0116     0.244 8.09e- 1
## 3 gest_cntrd   0.145      0.0260     5.57  2.83e- 6
## 4 smoker      -0.280      0.135     -2.08  4.53e- 2
## 5 x4           0.100      0.0745     1.35  1.86e- 1
## 6 x5          -0.0368     0.0754    -0.488 6.28e- 1
## 7 x6          -0.0784     0.0560    -1.40  1.71e- 1
# original 3-predictor model
m_age_gest_smoke %>% broom::glance() %>% pull(r.squared)
## [1] 0.5655139
# model that includes rubbish predictors
m_big %>% broom::glance() %>% pull(r.squared)
## [1] 0.6135306
The new model explains an additional ~5% of the variance in birth weight, even though the three new predictors are just randomly generated
We want a metric of explained variance that doesn't get inflated by adding noise to the model
$$R^2_{adj} = 1 - \frac{\frac{SS_R}{n-p-1}}{\frac{SS_T}{n-1}}$$
How does the penalty work?
The larger the number of predictors ($p$), the larger the numerator $\frac{SS_R}{n-p-1}$ gets
The larger the numerator, the larger the entire fraction (just like 3/4 is more than 2/4)
The larger the fraction, the smaller the adjusted R²
Penalty per predictor is larger in small samples
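To see the penalty in action, we can rearrange the formula above into the equivalent form $R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}$ and plug in the numbers from our 3-predictor model:

r2 <- 0.5655139  # multiple R-squared of m_age_gest_smoke
n <- 42          # number of observations
p <- 3           # number of predictors
1 - (1 - r2) * (n - 1) / (n - p - 1)

## [1] 0.5312124

This matches the adjusted R-squared reported by summary(m_age_gest_smoke).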
# original 3-predictor model
m_age_gest_smoke %>% broom::glance() %>% dplyr::select(r.squared, adj.r.squared)
## # A tibble: 1 x 2
##   r.squared adj.r.squared
##       <dbl>         <dbl>
## 1     0.566         0.531
# model that includes rubbish predictors
m_big %>% broom::glance() %>% dplyr::select(r.squared, adj.r.squared)
## # A tibble: 1 x 2
##   r.squared adj.r.squared
##       <dbl>         <dbl>
## 1     0.614         0.547
Adjusted R² is an attempt at estimating the proportion of variance explained by the model in the population
If adjusted R² is always better, why even look at multiple R²?
A large difference between the values of multiple R² and adjusted R² indicates that there are some unnecessary predictors in the model and that it should be simplified
The larger the difference between the two, the less likely it is that the model will generalise beyond the current sample
We can take a model as a whole and assess whether or not it is a good fit to the data
To do this, we calculate the ratio of the variance explained by the model (AKA model variance) to the variance not explained by the model (AKA residual variance)
This is the F-ratio
$$F = \frac{\text{explained variance}}{\text{unexplained variance}} = \frac{\text{model variance}}{\text{residual variance}} = \frac{SS_M / df_M}{SS_R / df_R} = \frac{(SS_T - SS_R) / df_M}{SS_R / df_R}$$
## 
## Call:
## lm(formula = Birthweight ~ age_cntrd + gest_cntrd + smoker, data = bweight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.60724 -0.31521  0.00952  0.24838  1.08946 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.475375   0.093843  37.034  < 2e-16 ***
## age_cntrd    0.005115   0.011668   0.438   0.6636    
## gest_cntrd   0.156079   0.024552   6.357 1.84e-07 ***
## smoker      -0.310261   0.131381  -2.362   0.0234 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4135 on 38 degrees of freedom
## Multiple R-squared:  0.5655, Adjusted R-squared:  0.5312 
## F-statistic: 16.49 on 3 and 38 DF,  p-value: 5.083e-07
"Good" fit is a relative term - a model is a good fit in comparison to some other model
wHAt noW??
Yes, we're always comparing two models, even in the example above!
The total variance of a model* is just the residual variance of the model we're comparing it to!
The F-test from summary() compares the model to the null (intercept-only) model
* The denominator of the ratio $F = \frac{\text{model variance}}{\text{total variance}}$
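We can verify this claim in R: the total sum of squares of the outcome is exactly the residual sum of squares of the intercept-only model. A minimal check (m_null is the same null model fitted again below):

m_null <- lm(Birthweight ~ 1, bweight)
# total SS of the outcome variable...
sum((bweight$Birthweight - mean(bweight$Birthweight))^2)
# ...equals the residual SS of the intercept-only model (~14.95, see anova() below)
deviance(m_null)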
My model brings all the x to the y
And F like "it's better than yours!"
Damn right, it's better than yours!
I can teach you, but you must come to the lectures, do the practicals and tutorials, and revise
The null hypothesis of the F-test is that the model is no better than the comparison model
A significant F-test tells us that our model is a statistically significantly better fit to the data than the comparison model
In other words, our model is an improvement on the previous model
It explains significantly* more variance in the outcome variable than the other model
*statistically significantly!!!
The anova() function in R allows us to compare the models explicitly

m_null <- lm(Birthweight ~ 1, bweight)
anova(m_null, m_age_gest_smoke)
## Analysis of Variance Table
## 
## Model 1: Birthweight ~ 1
## Model 2: Birthweight ~ age_cntrd + gest_cntrd + smoker
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     41 14.9523                                  
## 2     38  6.4965  3    8.4557 16.486 5.083e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
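Using the sums of squares from this table ($SS_T$ is the RSS of the null model), we can reproduce the F-ratio by hand with the formula above:

ss_t <- 14.9523  # RSS of m_null = total sum of squares
ss_r <- 6.4965   # RSS of m_age_gest_smoke
df_m <- 3        # model df (number of predictors)
df_r <- 38       # residual df (42 - 3 - 1)
((ss_t - ss_r) / df_m) / (ss_r / df_r)

## [1] 16.48685

The tiny discrepancy from the 16.486 in the table is just rounding in the printed SS values.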
This is the same F-test that summary() (or broom::glance()) reports:

| statistic | df | df.residual | p.value |
|---|---|---|---|
| 16.48655 | 3 | 38 | 5.083237e-07 |
You can compare as many models as you like
They all have to be modelling the same outcome variable
They must all be nested
You must be able to sort them from the simplest to the most complex
A simpler model mustn't contain any predictors that a more complex model doesn't also contain
They must all be fitted to the same data (their respective Ns must be the same)
birthweight ~ gestation vs birthweight ~ gestation + mage ✓ (nested)
birthweight ~ gestation vs birthweight ~ smoker + mage ✗ (not nested: the second model doesn't contain gestation)
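For example, m_null, m_age_gest, and m_age_gest_smoke are nested and fitted to the same data, so they can all go into a single anova() call; each row of the resulting table then tests the improvement over the model on the row above:

# compare a whole sequence of nested models, simplest first
anova(m_null, m_age_gest, m_age_gest_smoke)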
Categorical predictors can be entered in the linear model just like continuous predictors
For binary variables, the b represents the difference between the two categories
For more than 2 categories, it's a little more complicated (next term)
Multiple R² can be interpreted as the proportion of variance in the outcome explained by the model, or as the square of the correlation between what the model predicts and what we observe
Multiple R² gets inflated by adding more and more predictors
Adjusted R² penalises for each added predictor to counteract this
We can compare nested models using the F-test
The F-ratio is the ratio of the variance explained by the model to the variance not explained by the model
We know that the sampling distribution of the F-ratio (AKA F-value, F-statistic) is the F-distribution with $df_M$ and $df_R$ degrees of freedom (see the sketch after this list)
The null hypothesis of the F-test is that the model is no better than the comparison model
F-test is always a comparison of two models, albeit sometimes implicitly
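To connect the last few points: the p-value of the F-test is the area under this F-distribution above the observed F-ratio. Checking with the values from our model (F = 16.48655 on 3 and 38 degrees of freedom):

# p-value of the F-test: area above the observed F in the F(3, 38) distribution
# ~5.083e-07, the p-value reported by summary() and broom::glance()
pf(16.48655, df1 = 3, df2 = 38, lower.tail = FALSE)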
Next week is our final Analysing Data lecture 😢
Revision session to recap all we talked about in all of this term's lectures
Don't forget to do the last tutorial that will let you practice comparing models
For the final practical, Jennifer and I are planning something pretty special so do come!
It's still a practical, and no, it won't be delivered through the medium of interpretative dance
We won't be singing either!