class: center, middle, inverse, title-slide # The Linear Model 2: Evaluating the Model ## Lecture 08 ### Dr Jennifer Mankin ### 19 March 2021 --- <script type="text/javascript"> <!-- message to display at the bottom of slides when ?live=true --> slideMessage = "<span>Ask questions at </span><a href = 'pollev.com/milanvalasek890'>pollev.com/milanvalasek890</a>" plotsToPanels() live() </script> ## Looking Ahead (and Behind) - Week 4: Correlation - Week 5: Chi-Square ( `\(\chi^2\)` ) - Week 6: *t*-test -- - Last week: The Linear Model - Equation of a Line - This week: The Linear Model - Evaluating the Model --- ## Announcements - Lab Report Help - [Lab Report Drop-ins](https://canvas.sussex.ac.uk/courses/12727/pages/r-help-desk-drop-ins) - Results: review this week's practical on writing up lab report results - Discussion: review Academic Advising session - Next week's practicals: The Linear Model (both lectures) - Awards: [SavioR nominations](https://canvas.sussex.ac.uk/courses/12727/quizzes/19897) and [Education Awards](http://www.sussex.ac.uk/staff/the-education-awards/) - Feedback via the [Suggestions Box](https://canvas.sussex.ac.uk/courses/12727/quizzes/20246) --- ## Objectives After this lecture you will understand: - The equation for a linear model with one predictor - *b*<sub>0</sub> (the intercept) - *b*<sub>1</sub> (the slope) - The logic of NHST for *b*-values - Interpreting *p* and CIs - How to assess model fit with `\(R^2\)` --- ## General Model Equation <br><br> <center><font size = "18"> `\(outcome = model + error\)` </font></center><p> <br><br> * We can use models to **predict** the outcome for a particular case * This is always subject to some degree of **error** --- ## The Linear Model - The linear model predicts the outcome `\(y\)` based on a predictor `\(x\)` - General form: `\(y_{i} = b_{0} + b_{1}x_{1i} + e_{i}\)` - *b*<sub>0</sub>: the **intercept**, or value of `\(y\)` when `\(x\)` is 0 - *b*<sub>1</sub>: the **slope**, or change in `\(y\)` for every unit change in `\(x\)` - The slope *b*<sub>1</sub> represents the relationship between the predictor and the outcome --- ## Today's Example - "Does Hugging Provide Stress-Buffering Social Support? A Study of Susceptibility to Upper Respiratory Infection and Illness" ([Cohen et al., 2015](https://journals.sagepub.com/doi/full/10.1177/0956797614559284)) - Participants completed questionnaires and phone interviews over 14 days - Including whether they had been hugged each day -- - Then exposed to a cold virus! 🤒 - Measures of infection: amount of mucus, nasal clearing time -- - Does receipt of hugs have a relationship with infection? - What kind of relationship might we predict? 
🤔 ![GIF of Olaf from Frozen saying, "And I like warm hugs."](assets/warm_hugs.gif) --- ## Operationalisation - Predictor: Percentage of days in which participants were hugged - Higher percentage = more hugs - Outcome: Nasal clearing time - A measure of congestion - Longer time = more congestion (= worse cold) -- - Model: `\(Congestion_{i} = b_{0} + b_{1}\times{Hugs_{1i}} + e_{i}\)` - Use the `lm()` function to estimate *b*<sub>0</sub> and *b*<sub>1</sub> --- ## Having a Look .codePanel[ ```r cold_hugs %>% mutate(pct_hugs = pct_hugs*100) %>% ggplot(aes(x = pct_hugs, y = post_nasal_clear_log)) + geom_point(position = "jitter", alpha = .4) + scale_x_continuous(name = "Percentage of Days with Hugs") + scale_y_continuous("Congestion (Log)") + cowplot::theme_cowplot() ``` ![](index_files/figure-html/unnamed-chunk-2-1.png)<!-- -->] --- ## Having a Look .codePanel[ ```r cold_hugs %>% mutate(pct_hugs = pct_hugs*100) %>% ggplot(aes(x = pct_hugs, y = post_nasal_clear_log)) + geom_point(position = "jitter", alpha = .4) + scale_x_continuous(name = "Percentage of Days with Hugs") + scale_y_continuous("Congestion (Log)") + geom_smooth(method = "lm") + cowplot::theme_cowplot() ``` ![](index_files/figure-html/unnamed-chunk-3-1.png)<!-- -->] -- - Very slight *negative* relationship - How can we interpret this relationship? --- ## Creating the Model ``` ## ## Call: ## lm(formula = post_nasal_clear_log ~ pct_hugs, data = cold_hugs) ## ## Coefficients: ## (Intercept) pct_hugs ## 0.5952 -0.1077 ``` - For every *unit increase* in hugs, congestion changes by -0.11 - Here, "unit increase" = 1% - So, congestion goes down by 0.11 for every 1% increase in hugs Model: `\(Congestion_i = 0.60 - 0.11\times{Hugs_i}\)` --- ## The Story So Far - Investigating whether hugs protect against colds - Linear model shows that more hugs are associated with less congestion (infection) - `\(Congestion_i = 0.60 - 0.11\times{Hugs_i}\)` - Is this model any good? What do we mean by "good"? -- - Captures a relationship that may in fact exist: significance and CIs of *b*<sub>1</sub> -- - Explains the variance in the outcome: `\(R^2\)` for the model --- ## NHST for LM - *b*<sub>1</sub> quantifies the relationship between the predictor and the outcome - The *effect* of interest and the key part of the linear model! - Our estimate of the true relationship in the population (the model parameter) - So...is this value **significant**? --- ## NHST for LM - Our recipe for significance testing is: - Data - A test statistic - The distribution of that test statistic under the null hypothesis - The probability *p* of finding a test statistic as large as the one we have (or larger) if the null hypothesis is true - First, we need to sort out the null hypothesis of *b*<sub>1</sub> --- ## Null Hypothesis of *b*<sub>1</sub> - *b*<sub>1</sub> captures the relationship between variables - How much the outcome *y* changes for each unit change in *x* - Null hypothesis: the outcome *y* **does not change** when *x* changes - What would this look like in terms of the linear model? 
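In `lm()` terms, we could write down both the full model and a model in which the predictor does nothing. Below is a minimal sketch, assuming the `cold_hugs` data used above; `hugs_lm` matches the model name used on the later `summary()` slide, while `hugs_null` is a name made up here purely for illustration.

```r
# Full model: congestion predicted from the percentage of days with hugs
hugs_lm <- lm(post_nasal_clear_log ~ pct_hugs, data = cold_hugs)

# Intercept-only (null) model: b1 is fixed at 0, so every participant's
# predicted congestion is just the mean of the outcome
hugs_null <- lm(post_nasal_clear_log ~ 1, data = cold_hugs)

coef(hugs_lm)   # b0 (intercept) and b1 (slope)
coef(hugs_null) # b0 only: the mean of the non-missing outcome values
```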
🤔 --- ## Null Hypothesis of *b*<sub>1</sub> - `\(Congestion_{i} = b_{0} + 0 \times Hugs_{1i} + e_{i}\)` - This is the **null** or **intercept-only model** .codePanel[ ```r cold_hugs %>% mutate(pct_hugs = pct_hugs*100) %>% ggplot(aes(x = pct_hugs, y = post_nasal_clear_log)) + geom_point(position = "jitter", alpha = .4) + scale_x_continuous(name = "Percentage of Days with Hugs") + scale_y_continuous("Congestion (Log)") + geom_smooth(method = "lm") + geom_hline(yintercept = mean(cold_hugs$post_nasal_clear_log, na.rm = T), linetype = "dashed", colour = "purple3")+ cowplot::theme_cowplot() ``` ![](index_files/figure-html/unnamed-chunk-5-1.png)<!-- -->] --- ## Significance of *b*<sub>1</sub> - *b*<sub>1</sub> = 0 represents the null hypothesis - So, the alternative hypothesis is *b*<sub>1</sub> `\(\ne\)` 0 - For our model, does *b*<sub>1</sub> = 0? -- - No! Here *b*<sub>1</sub> = -0.11 - Is our estimate of *b*<sub>1</sub> different **enough** from 0 to believe that it may actually not be 0 in the population? - Compare the estimate of *b*<sub>1</sub> to the variation in estimates of *b*<sub>1</sub> --- ## Significance of *b*<sub>1</sub> - Signal-to-noise ratio - Signal: the estimate of *b*<sub>1</sub> - Noise: the standard error of *b*<sub>1</sub> - Scale *b*<sub>1</sub> by its standard error: `\(\frac{b_{1}}{SE_{b_1}}\)` - What do you get when you divide a normally distributed value by its standard error...???? 🤔 --- ## That's the *t*! ![GIF meme of Kermit the Frog drinking tea](assets/kermit.gif) --- ## Significance of *b*<sub>1</sub> - `\(\frac{b_{1}}{SE_{b_1}} = t\)` - Compare our value of *t* to the *t*-distribution to get *p*, just as we've seen before - If *p* is smaller than our chosen alpha level, our predictor is considered to be significant -- <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Term </th> <th style="text-align:right;"> <em>b</em> </th> <th style="text-align:right;"> <em>SE</em><sub>b</sub> </th> <th style="text-align:right;"> <em>t</em> </th> <th style="text-align:left;"> <em>p</em> </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Intercept </td> <td style="text-align:right;"> 0.60 </td> <td style="text-align:right;"> 0.04 </td> <td style="text-align:right;"> 16.04 </td> <td style="text-align:left;"> < .001 </td> </tr> <tr> <td style="text-align:left;"> Percentage of Days with Hugs </td> <td style="text-align:right;"> -0.11 </td> <td style="text-align:right;"> 0.05 </td> <td style="text-align:right;"> -2.05 </td> <td style="text-align:left;"> .041 </td> </tr> </tbody> </table> --- ## Confidence Intervals for *b*<sub>1</sub> - Give us the range of likely sample estimates of `\(\beta_1\)` from other samples - Only if our interval is one of the 95% of intervals that does in fact contain the population value! - Review [Lecture 2](https://and.netlify.app/lectures/#lecture-2) for more on CIs -- - Key info: does the confidence interval cross or include 0? - If yes, it's likely that we could have gathered a sample where *b*<sub>1</sub> was 0 --- ## Confidence Intervals for *b*<sub>1</sub> - What can we conclude from these confidence intervals? 
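The intervals in the table below could be obtained in R with `confint()`, or alongside the *b*, *SE*, *t*, and *p* values with `broom::tidy()`. A minimal sketch, assuming the `hugs_lm` model fitted earlier:

```r
# 95% confidence intervals for b0 and b1 (level = .95 is the default)
confint(hugs_lm)

# Estimates, SEs, t, p, and CIs in one tidy table
# (assumes the broom package is installed)
broom::tidy(hugs_lm, conf.int = TRUE)
```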
🤔 <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Term </th> <th style="text-align:right;"> <em>b</em> </th> <th style="text-align:right;"> <em>SE</em><sub>b</sub> </th> <th style="text-align:right;"> <em>t</em> </th> <th style="text-align:left;"> <em>p</em> </th> <th style="text-align:right;"> CI<sub>lower</sub> </th> <th style="text-align:right;"> CI<sub>upper</sub> </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Intercept </td> <td style="text-align:right;"> 0.60 </td> <td style="text-align:right;"> 0.04 </td> <td style="text-align:right;"> 16.04 </td> <td style="text-align:left;"> < .001 </td> <td style="text-align:right;"> 0.522 </td> <td style="text-align:right;"> 0.668 </td> </tr> <tr> <td style="text-align:left;"> Percentage of Days with Hugs </td> <td style="text-align:right;"> -0.11 </td> <td style="text-align:right;"> 0.05 </td> <td style="text-align:right;"> -2.05 </td> <td style="text-align:left;"> .041 </td> <td style="text-align:right;"> -0.211 </td> <td style="text-align:right;"> -0.005 </td> </tr> </tbody> </table> --- ## Interim Summary - The key element of the linear model is *b*<sub>1</sub> - Quantifies the relationship between the predictor and the outcome - Null hypothesis: *b*<sub>1</sub> = 0 - Alternative hypothesis: *b*<sub>1</sub> `\(\ne\)` 0 - Is *b*<sub>1</sub> different from 0? - Significance via *t* - Confidence intervals --- exclude: ![:live] .pollEv[ <iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe> ] --- ## A Good Model - Captures a relationship that does in fact exist - Isn't just noise (random variation) - Quantified with significance/CIs -- - Is useful for understanding the outcome variable - Explains variance in the outcome - Quantified with `\(R^2\)` --- ## Explaining Variance - We want to explain variance, particularly in the outcome - Goodness of Fit: How well does the model fit the data? - Better fit = model is better able to explain the outcome - So, how do we quantify model fit? --- ## Goodness of Fit with `\(R^2\)` <iframe id="r-sq-viz" class="viz app" src="https://and.netlify.app/viz/r_sq" data-external="1" width="100%" height="600"></iframe> --- ## Goodness of Fit with `\(R^2\)` - `\(R^2 = \frac{Variance\ explained\ by\ the\ model}{Total\ variance}\)` - Interpret as a percentage of variance explained - Applies to our sample only - Larger value means better fit - Adjusted `\(R^2\)`: estimate of `\(R^2\)` **in the population** -- - How does this value look? 🤔 <br><br> <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> R<sup>2</sup> </th> <th style="text-align:right;"> Adjusted R<sup>2</sup> </th> <th style="text-align:right;"> <em>F</em> </th> <th style="text-align:left;"> <em>p</em> </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.01 </td> <td style="text-align:right;"> 0.01 </td> <td style="text-align:right;"> 4.22 </td> <td style="text-align:left;"> .041 </td> </tr> </tbody> </table> --- ## Putting It All Together ```r hugs_lm %>% summary() ``` ``` ## ## Call: ## lm(formula = post_nasal_clear_log ~ pct_hugs, data = cold_hugs) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.59522 -0.27880 0.01766 0.25219 0.79262 ## ## Coefficients: ## Estimate Std. 
Error t value Pr(>|t|) ## (Intercept) 0.59522 0.03710 16.043 <0.0000000000000002 *** ## pct_hugs -0.10773 0.05243 -2.055 0.0405 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.3572 on 403 degrees of freedom ## (1 observation deleted due to missingness) ## Multiple R-squared: 0.01037, Adjusted R-squared: 0.007912 ## F-statistic: 4.222 on 1 and 403 DF, p-value: 0.04055 ``` --- ## Summary - The linear model (LM) expresses the relationship between at least one predictor, `\(x\)`, and an outcome, `\(\hat{y}\)` - Linear model equation: `\(y_{i} = b_{0} + b_{1}x_{1i} + e_{i}\)` - Most important result is the parameter *b*<sub>1</sub>, which expresses the change in `\(y\)` for each unit change in `\(x\)` - Evaluating the model - Is *b*<sub>1</sub> different from 0? Significance tests and CIs - How well does the model fit the data? `\(R^2\)` and adjusted `\(R^2\)` --- exclude: ![:live] .pollEv[ <iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe> ] --- class: last-slide <br><br><br><br><br> # Have a lovely weekend!