class: center, middle, inverse, title-slide # End of the Line: Comparing models ## Lecture 10 ### Dr Milan Valášek ### 23 April 2021 --- exclude: ![:live] class: center starwars .fade[ ] .crawl[ # Stats wars .cntr-li[ - **LM1:** A New Line - **LM2:** <i>R</i><sup>2</sup> Strikes Back - **LM3:** Return of the <i>y<sub>i</sub></i> - **LM4:** .secondary[<i>F</i> Awakens] ] ] --- exclude: ![:notLive] ## Stats wars - **LM1:** A New Equation - **LM2:** <i>R</i><sup>2</sup> strikes back - **LM3:** Return of the <i>y<sub>i</sub></i> - **LM4:** .secondary[<i>F</i> awakens] --- <script type="text/javascript"> /* message to display at the bottom of slides when ?live=true */ slideMessage = "<span>Ask questions at </span><a href = 'pollev.com/milanvalasek890'>pollev.com/milanvalasek890</a>" plotsToPanels() live() </script> ## Today - Categorical predictors again - Multiple <i>R</i><sup>2</sup> and adjusted <i>R</i><sup>2</sup> - Comparing linear models with the *F*-test --- ## What you already know about the linear model `$$y_i = b_0 + b_1 \times x_{1_i} + 0 \times x_{2_i} + 0 \times x_{3_i} + \dots + 0 \times x_{n_i} + e_i$$` - Every predictor *adds a dimension* (axis) - Every `\(b\)` coefficient (except for `\(b_0\)`) is a *slope* of the plane of best fit with respect to the dimension of the predictor - Predictors can be continuous or categorical - **Outcome must be continuous** --- ## Last time: birth weight model .smol[ ```r # centre predictors bweight <- bweight %>% mutate(gest_cntrd = Gestation - mean(Gestation, na.rm=TRUE), age_cntrd = mage - mean(mage, na.rm=TRUE)) m_age_gest <- lm(Birthweight ~ age_cntrd + gest_cntrd, bweight) summary(m_age_gest) ``` ``` ## ## Call: ## lm(formula = Birthweight ~ age_cntrd + gest_cntrd, data = bweight) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.77485 -0.35861 -0.00236 0.26948 0.96943 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.3128571 0.0674405 49.123 < 2e-16 *** ## age_cntrd -0.0007953 0.0120469 -0.066 0.948 ## gest_cntrd 0.1618369 0.0258242 6.267 2.21e-07 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.4371 on 39 degrees of freedom ## Multiple R-squared: 0.5017, Adjusted R-squared: 0.4762 ## F-statistic: 19.64 on 2 and 39 DF, p-value: 1.26e-06 ``` ] --- ## Categorical predictor - Let's see if whether or not the mother is a smoker has an influence on the baby's birth weight - What do you think? -- - Smoking is a binary *categorical* predictor but we can add it to the model just like any other variable! .smol[ ```r bweight$smoker ``` ``` ## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ## [39] 1 1 1 1 ``` ] --- ## Effect of mother's smoking .smol[ ```r m_smoke <- lm(Birthweight ~ smoker, bweight) summary(m_smoke) ``` ``` ## ## Call: ## lm(formula = Birthweight ~ smoker, data = bweight) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.21409 -0.39159 0.02591 0.41935 1.43591 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.5095 0.1298 27.040 <2e-16 *** ## smoker -0.3754 0.1793 -2.093 0.0427 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.5804 on 40 degrees of freedom ## Multiple R-squared: 0.09874, Adjusted R-squared: 0.07621 ## F-statistic: 4.382 on 1 and 40 DF, p-value: 0.0427 ``` ] -- `$$\text{birthweight}_i = b_0 -0.38 \times \text{smoker}_i + e_i$$` - Babies whose mothers smoke are, on average, 0.38 lbs lighter at birth than those whose mother do not smoke. --- ## Continuous AND categorical predictors - Of course, you can include categorical predictors even in a larger model .smol[ ```r m_age_gest_smoke <- update(m_age_gest, ~ . + smoker) # that's the same as m_age_gest_smoke <- lm(Birthweight ~ age_cntrd + gest_cntrd + smoker, bweight) summary(m_age_gest_smoke) ``` ``` ## ## Call: ## lm(formula = Birthweight ~ age_cntrd + gest_cntrd + smoker, data = bweight) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.60724 -0.31521 0.00952 0.24838 1.08946 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.475375 0.093843 37.034 < 2e-16 *** ## age_cntrd 0.005115 0.011668 0.438 0.6636 ## gest_cntrd 0.156079 0.024552 6.357 1.84e-07 *** ## smoker -0.310261 0.131381 -2.362 0.0234 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.4135 on 38 degrees of freedom ## Multiple R-squared: 0.5655, Adjusted R-squared: 0.5312 ## F-statistic: 16.49 on 3 and 38 DF, p-value: 5.083e-07 ``` ] --- exclude: ![:live] .pollEv[ <iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe> ] --- ## Multiple <i>R</i><sup>2</sup> - We talked about what <i>R</i><sup>2</sup> means in terms of explained and unexplained (residual) variance <iframe id="r-sq-viz" class="viz app" src="https://and.netlify.app/viz/r_sq" data-external="1" width="100%" height="600"></iframe> --- ## Multiple <i>R</i><sup>2</sup> - It's the same in multiple-predictor models: ```r # multplie R-squared from model m_age_gest_smoke %>% broom::glance() %>% pull(r.squared) ``` ``` ## [1] 0.5655139 ``` ```r # same as manually-calculated using formula 1 - var(resid(m_age_gest_smoke))/var(bweight$Birthweight) ``` ``` ## [1] 0.5655139 ``` - It can be also interpreted as the **square of the correlation** between *predicted* and *observed* values of outcome ```r predicted <- fitted(m_age_gest_smoke) correlation <- cor(predicted, bweight$Birthweight) correlation^2 ``` ``` ## [1] 0.5655139 ``` --- ## The thing with multiple <i>R</i><sup>2</sup> - We can also add more predictors to our models - Sometimes, they are good, sometimes, they are bad... ```r bweight <- bweight %>% mutate(x4 = rnorm(nrow(bweight)), x5 = rnorm(nrow(bweight)), x6 = rnorm(nrow(bweight))) # add 3 random predictor variables m_big <- update(m_age_gest_smoke, ~ . + x4 + x5 + x6) m_big %>% broom::tidy() ``` ``` ## # A tibble: 7 x 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 3.48 0.0936 37.2 1.02e-29 ## 2 age_cntrd 0.00284 0.0116 0.244 8.09e- 1 ## 3 gest_cntrd 0.145 0.0260 5.57 2.83e- 6 ## 4 smoker -0.280 0.135 -2.08 4.53e- 2 ## 5 x4 0.100 0.0745 1.35 1.86e- 1 ## 6 x5 -0.0368 0.0754 -0.488 6.28e- 1 ## 7 x6 -0.0784 0.0560 -1.40 1.71e- 1 ``` --- ## The thing with multiple <i>R</i><sup>2</sup> - Multiple <i>R</i><sup>2</sup> gets higher as we add more and more predictors, even if there is no relationship between them and the outcome ```r # original 3-predictor model m_age_gest_smoke %>% broom::glance() %>% pull(r.squared) ``` ``` ## [1] 0.5655139 ``` ```r # model that includes rubbish predictors m_big %>% broom::glance() %>% pull(r.squared) ``` ``` ## [1] 0.6135306 ``` - The new model explains additional ~5% of the variance in birth weight even though the predictors are just randomly generated - We want a metric of explained variance that doesn't get inflated by adding noise to the model --- ## Adjusted <i>R</i><sup>2</sup> - Adjusted <i>R</i><sup>2</sup> ( `\(R^2_{\text{adj}}\)`, or sometimes `\(\bar{R}^2\)`) introduces a *penalty* for each predictor added to the model `$$R^2_{\text{adj}} = 1 - \frac{\frac{SS_R}{n-p-1}}{\frac{SS_T}{n-1}}$$` - How does the penalty work? - The larger the number of predictors ( `\(p\)` ), the larger the numerator `\(\frac{SS_R}{n-p-1}\)` gets - The larger the numerator, the larger the entire fraction (just like 3/4 is more than 2/4) - The larger the fraction, the *smaller* the adjusted <i>R</i><sup>2</sup> - Penalty per predictor is *larger in small samples* --- ## Adjusted <i>R</i><sup>2</sup> - Compared to multiple <i>R</i><sup>2</sup>, adjusted <i>R</i><sup>2</sup> gets **far less inflated** by including pointless predictors in the model ```r # original 3-predictor model m_age_gest_smoke %>% broom::glance() %>% dplyr::select(r.squared, adj.r.squared) ``` ``` ## # A tibble: 1 x 2 ## r.squared adj.r.squared ## <dbl> <dbl> ## 1 0.566 0.531 ``` ```r # model that includes rubbish predictors m_big %>% broom::glance() %>% dplyr::select(r.squared, adj.r.squared) ``` ``` ## # A tibble: 1 x 2 ## r.squared adj.r.squared ## <dbl> <dbl> ## 1 0.614 0.547 ``` --- ## Adjusted <i>R</i><sup>2</sup> - Adjusted <i>R</i><sup>2</sup> is an attempt at estimating proportion of variance explained by the model *in the population* - If adjusted <i>R</i><sup>2</sup> is always better, why even look at multiple <i>R</i><sup>2</sup>? -- - Large difference between the values of multiple <i>R</i><sup>2</sup> and adjusted <i>R</i><sup>2</sup> indicates that there are some unnecessary predictors in the model and it should be simplified -- - The larger the difference between the two, the less likely it is that the model will generalise beyond the current sample --- exclude: ![:live] .pollEv[ <iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe> ] --- ## Model fit - We can take a model as a whole and assess whether it is a good fit to the data or not a good one - To do this we calculate the ratio of variance *explained* by the model (AKA **model variance**) and the variance *not explained* by the model (AKA **residual variance**) - This is the *F*-ratio `$$\begin{align}F&=\frac{\text{explained variance}}{\text{unexplained variance}}\\\\&=\frac{\text{model variance}}{\text{residual variance}}\\\\&=\frac{SS_M/df_M}{SS_R/df_R}\\\\&=\frac{(SS_T - SS_R)/df_M}{SS_R/df_R}\end{align}$$` --- ## *F*-distribution - *F*-ratio is a ratio of two sums of squares (model and residual, respectively), scaled by their respective degrees of freedom - We know that the sampling distribution of this kind of a metric is the *F*-distribution (hence the name) <iframe id="fdist-viz" class="viz app" src="https://and.netlify.app/viz/fdist" data-external="1" style="height:641px;transform:scale(.7) translateY(-150px);"></iframe> --- ## *F*-test - Because we know the sampling distribution of our *F*-ratio, we can calculate the *p*-value associated with the given value of *F* under the null hypothesis that *our model is no better than the null model* .smol[ ``` ## ## Call: ## lm(formula = Birthweight ~ age_cntrd + gest_cntrd + smoker, data = bweight) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.60724 -0.31521 0.00952 0.24838 1.08946 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3.475375 0.093843 37.034 < 2e-16 *** ## age_cntrd 0.005115 0.011668 0.438 0.6636 ## gest_cntrd 0.156079 0.024552 6.357 1.84e-07 *** ## smoker -0.310261 0.131381 -2.362 0.0234 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.4135 on 38 degrees of freedom ## Multiple R-squared: 0.5655, Adjusted R-squared: 0.5312 ## F-statistic: 16.49 on 3 and 38 DF, p-value: 5.083e-07 ``` ] --- ## Comparing models - "Good" fit is a relative term - a model is a good fit *in comparison to some other model* -- - **wHAt noW??** -- - Yes, we're *always* comparing two models, even in the example above! - The total variance of a model_\*_ *is just the **residual variance** of the model we're comparing it to*! - *F*-test from `summary()` compares the model to the **null** (intercept-only) **model** _\*_ The denominator of the ratio `\(F = \frac{\text{model variance}}{\text{total variance}}\)` --- ## Comparing models .right.small[ *My model brings all the x to the y<br>And F like "it's better than yours!"<br>Damn right, it's better than yours!<br>I can teach you but must come to the lectures, do the practicals and tutorials, and revise* [~ G W Snedecor](https://www.youtube.com/watch?v=6AwXKJoKJz4) ] - The null hypothesis of the *F*-test is that *the model is no better than the comparison model* - A significant *F*-test tells us that our model is a *statistically significantly* better fit to the data than the comparison model - In other words, our model is an improvement on the previous model - It explains significantly*\** more variance in the outcome variable than the other model _\*_**statistically** significantly!!! --- ## Comparing models - The `anova()` function in `R` allows us to compare the models *explicitly* .smol[ ```r m_null <- lm(Birthweight ~ 1, bweight) anova(m_null, m_age_gest_smoke) ``` ``` ## Analysis of Variance Table ## ## Model 1: Birthweight ~ 1 ## Model 2: Birthweight ~ age_cntrd + gest_cntrd + smoker ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 41 14.9523 ## 2 38 6.4965 3 8.4557 16.486 5.083e-07 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ] - These are the same numbers as we got from `summary()` (or `broom::glance()`) .smol[ ] <div data-pagedtable="false"> <script data-pagedtable-source type="application/json"> {"columns":[{"label":["statistic"],"name":[1],"type":["dbl"],"align":["right"]},{"label":["df"],"name":[2],"type":["dbl"],"align":["right"]},{"label":["df.residual"],"name":[3],"type":["int"],"align":["right"]},{"label":["p.value"],"name":[4],"type":["dbl"],"align":["right"]}],"data":[{"1":"16.48655","2":"3","3":"38","4":"5.083237e-07","_row":"value"}],"options":{"columns":{"min":{},"max":[10]},"rows":{"min":[10],"max":[10]},"pages":{}}} </script> </div> --- ## Comparing models - You can compare as many models as you like - They all have to be modelling *the same outcome* variable - They must all be *nested* - You must be able to sort them from the simplest to the most complex - A simpler model *mustn't* contain any predictors that a more complex model doesn't also contain - They must both be fitted to *the same data* (their respective *N*s must be the same) .pull-left[ <img src="https://upload.wikimedia.org/wikipedia/commons/7/73/Flat_tick_icon.svg" width="50px" height="50px"> `birthwieght ~ gestation` .center[vs] `birthwieght ~ gestation + mage` ] .pull-right[ <img src="https://upload.wikimedia.org/wikipedia/commons/8/8f/Flat_cross_icon.svg" width="50px" height="50px"> `birthwieght ~ gestation` .center[vs] `birthwieght ~ smoker + mage` ] --- exclude: ![:live] .pollEv[ <iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe> ] --- ## Take-home message - Categorical predictors can be entered in the linear model just like continuous predictors - For binary variables, the `\(b\)` represents the difference between the two categories - For more than 2 categories, it's a little more complicated (next term) - Multiple <i>R</i><sup>2</sup> can be interpreted as the proportion of variance in outcome explained by the model or the *correlation between what the model predicts and what we observe* - Multiple <i>R</i><sup>2</sup> gets *inflated by adding more and more predictors* - Adjusted <i>R</i><sup>2</sup> *penalises* for each added predictor to counteract this - Difference between the two is *indicative of useless predictors* in the model and a *low generalisability* of the model --- ## Take-home message - We can compare **nested** models using the *F*-test - *F*-ratio is the ratio of variance *explained* by the model vs variance *not explained* by the model - We know that the sampling distribution of the *F*-ratio (AKA *F*-value, *F*-statistic) is the *F*-distribution with `\(df_M\)` and `\(df_R\)` degrees of freedom - The null hypothesis of the *F*-test is that *the model is no better than the comparison model* - *F*-test is **always** a comparison of two models, albeit sometimes *implicitly* - Total variance for a given model is *just the residual variance of the comparison model!* --- ## Next time - Next week is our **final** Analysing Data lecture 😢 - Revision session to recap all we talked about in all of this term's lectures - Don't forget to do the last tutorial that will let you practice comparing models - For final practical, Jennifer and I are planning something *pretty special* so do come! -- - It's still a practical, and no, it won't be delivered through the medium of interpretative dance -- - We won't be singing either! --- class: last-slide weekend background-image: url("assets/end.jpg") background-size: cover # Have a lovely weekend :)