class: center, middle, inverse, title-slide # The Linear Model 1: Equation of a Line ## Lecture 07 ### Dr Jennifer Mankin ### 12 March 2021 --- <script type="text/javascript"> <!-- message to display at the bottom of slides when ?live=true --> slideMessage = "<span>Ask questions at </span><a href = 'pollev.com/milanvalasek890'>pollev.com/milanvalasek890</a>" plotsToPanels() live() </script> ## Looking Ahead (and Behind) - Week 4: Correlation - Week 5: Chi-Square ( `\(\chi^2\)` ) - Last week: *t*-test -- - This week: The Linear Model - Equation of a Line - Next week: The Linear Model - Significance Testing --- ## Announcements - [Lab Report Drop-ins](https://canvas.sussex.ac.uk/courses/12727/pages/r-help-desk-drop-ins) starting next week - Next week's practicals: Writing up lab report results - Week 9 practicals will start on the Linear Model - You still have a tutorial and quiz this week! - Next week: Academic Advising session on Discussion writing - Designed for this assessment - please attend - Awards: [SavioR nominations](https://canvas.sussex.ac.uk/courses/12727/quizzes/19897) and [Education Awards](http://www.sussex.ac.uk/staff/the-education-awards/) - Feedback via the [Suggestions Box](https://canvas.sussex.ac.uk/courses/12727/quizzes/20246) --- ## Objectives After this lecture you will understand: - What statistical models are and why they are useful - The equation for a linear model with one predictor - *b*<sub>0</sub> (the intercept) - *b*<sub>1</sub> (the slope) - Both categorical and continuous predictors - Using the equation to predict an outcome - How to read scatterplots and lines of best fit --- ## The Linear Model - Extremely common and fundamental testing paradigm - Predict the outcome *y* from one or more predictors (*x*s) - Our first (explicit) contact with **statistical modelling** -- - A **statistical model** is a mathematical expression that captures the relationship between variables - All of our test statistics (*r*, *t*, etc.)
are actually models! --- ## Maps as Models - A map is a simplified depiction of the world - Captures the important elements (roads, cities, oceans, mountains) - *Doesn't* capture individual detail (where your gran lives) -- - Depicts **relationships** between locations and geographical features - Helps you **predict** what you will encounter in the world - E.g. if you keep walking south eventually you'll fall in the sea! --- ## Statistical Models - A model is a simplified depiction of some relationship + We want to **predict** what will happen in the world + But the world is complex and full of noise (randomness) - We can build a model to try to capture the important elements + Change/adjust the model to see what might happen with different **parameters** --- ## Statistical Models - **Why** might it be useful to create a model like this? - Can you think of any recent examples of such models? --- ## Statistical Models: COVID-19 <iframe src="https://covid19.healthdata.org/global?view=total-deaths&tab=trend" width="800px" height="400px" style="margin-top:20px;"></iframe> --- ## Predictors and Outcomes * Now we start assigning our variables roles to play * The **outcome** is the variable we want to explain - Also called the dependent variable, or DV -- * The **predictors** are variables that may have a relationship with the outcome - Also called the independent variable(s), or IV(s) -- + We measure or manipulate the predictors, then quantify the systematic change in the outcome + NB: **YOU** (the researcher) assign these roles! --- ## General Model Equation <br><br> <center><font size = "18"> `\(outcome = model + error\)` </font></center><p> <br><br> * We can use models to **predict** the outcome for a particular case * This is always subject to some degree of **error** --- ## Making Predictions - Last week, we looked at the Language and Word Forms subscale of the SCSQ - If I wanted to **predict** someone's Language score... 
- What would be the most sensible *estimate*? -- .codePanel[ ```r syn_data %>% ggplot(aes(x = Language)) + geom_histogram(breaks = syn_data %>% pull(Language) %>% unique()) + scale_x_continuous(name = "Language and Word Forms Score", limits = c(1, 5)) + scale_y_continuous(name = "Count") + scale_fill_discrete(name = "Synaesthesia") + geom_vline(aes(xintercept = mean(Language)), colour = "purple3", linetype = "dashed") + annotate("text", x = mean(syn_data$Language) + .1, y = 122, label = paste0("Mean: ", mean(syn_data$Language) %>% round(2)), hjust=0, colour = "purple4") ``` ![](index_files/figure-html/mean-hist-1.png)<!-- -->] --- ## Making Predictions - Without any other information, the best estimate is the **mean of the outcome** - But we *do* have more information! -- - Last week: Grapheme-colour synaesthetes score higher than non-synaesthetes on the Language subscale on average - We could make a **better** prediction if we knew whether that person was a synaesthete - Use the mean score in the synaesthete vs non-synaesthete groups -- - Let's write an equation that we can use to predict someone's score based on whether they are a synaesthete or not! 
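As a minimal sketch of this idea in R — using a small, made-up `toy_data` frame standing in for the real `syn_data` — the group means that will serve as our predictions can be computed like this:

```r
library(dplyr)

# Hypothetical toy data standing in for syn_data (values invented for illustration)
toy_data <- data.frame(
  GraphCol = c("No", "No", "No", "Yes", "Yes"),
  Language = c(3.2, 3.6, 3.8, 4.1, 4.5)
)

# The mean Language score in each group is the model's prediction for that group
group_means <- toy_data %>%
  group_by(GraphCol) %>%
  summarise(prediction = mean(Language))

group_means
```

With the real data, the same pipeline gives the 3.55 (non-synaesthete) and 4.29 (synaesthete) means used on the following slides.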
--- ## Making Predictions - First: the non-synaesthete (baseline) group - If someone is a non-synaesthete, predicted Language score = 3.55 - `\(\hat{Language_{non-syn}} = 3.55\)` .codePanel[ ```r syn_data %>% dplyr::group_by(GraphCol) %>% dplyr::summarise( n = dplyr::n(), mean_lang = mean(Language), se_lang = sd(Language)/sqrt(n) ) %>% ggplot2::ggplot(aes(x = GraphCol, y = mean_lang)) + geom_errorbar(aes(ymin = mean_lang - 2*se_lang, ymax = mean_lang + 2*se_lang), width = .1) + geom_point(colour = "black", fill = "orange", pch = 23) + scale_y_continuous(name = "Language Score", limits = c(3, 5)) + labs(x = "Grapheme-Colour Synaesthete") + geom_label(stat = 'summary', fun.y=mean, aes(label = paste0("Mean: ", round(..y.., 2))), nudge_x = 0.1, hjust = 0) + cowplot::theme_cowplot() ``` ![](index_files/figure-html/means-plot-1.png)<!-- -->] --- ## Making Predictions - Next: the synaesthete group - If someone is a synaesthete, predicted Language score = 4.29 - `\(\hat{Language_{syn}} = 3.55 + 0.74\)` .codePanel[ ```r bracket <- syn_data %>% group_by(GraphCol) %>% summarise(y = mean(Language)) %>% mutate(x = rep(2.7, 2), y = round(y, 2)) syn_data %>% ggplot(aes(x = GraphCol, y = Language)) + #geom_line(stat="summary", fun = mean, aes(group = NA)) + geom_errorbar(stat = "summary", fun.data = mean_cl_boot, width = .25) + geom_point(stat = "summary", fun = mean, shape = 23, fill = "orange") + geom_text(aes(label=round(..y..,2)), stat = "summary", fun = mean, size = 4, nudge_x = c(-.3, .3)) + geom_line(data = bracket, mapping = aes(x, y)) + geom_segment(data = bracket, mapping = aes(x, y, xend = x - .05, yend = y)) + annotate("text", x = bracket$x + .1, y = min(bracket$y) + diff(bracket$y)/2, label = paste0("Difference:\n", bracket$y[2], " - ", bracket$y[1], "\n= ", diff(bracket$y)), hjust=0) + coord_cartesian(xlim = c(0.5, 3.5), ylim = c(3,5)) + scale_y_continuous(name = "Language Score") + labs(x = "Grapheme-Colour Synaesthete") + cowplot::theme_cowplot() ``` 
![](index_files/figure-html/diff-plot-1.png)<!-- -->] --- ## Making Predictions - We want our equation to give a **different** prediction depending on whether someone is a synaesthete or not - `\(\hat{Language} = 3.55\)` -- - `\(\hat{Language_{i}} = 3.55 + 0.74\times{GraphCol_{i}}\)` -- - When someone is a non-synaesthete (GraphCol = 0)... - `\(\hat{Language_{i}} = 3.55 + 0.74\times{0}\)` - `\(\hat{Language_{i}} = 3.55\)` -- - When someone is a synaesthete (GraphCol = 1)... - `\(\hat{Language_{i}} = 3.55 + 0.74\times{1}\)` - `\(\hat{Language_{i}} = 3.55 + 0.74\)` - `\(\hat{Language_{i}} = 4.29\)` --- ## Drawing Lines - This equation represents a **linear model** (a line!) between the means .codePanel[ ```r bracket <- syn_data %>% group_by(GraphCol) %>% summarise(y = mean(Language)) %>% mutate(x = rep(2.7, 2), y = round(y, 2)) syn_data %>% ggplot(aes(x = GraphCol, y = Language)) + geom_line(stat="summary", fun = mean, aes(group = NA)) + geom_errorbar(stat = "summary", fun.data = mean_cl_boot, width = .25) + geom_point(stat = "summary", fun = mean, shape = 23, fill = "orange") + geom_text(aes(label=round(..y..,2)), stat = "summary", fun = mean, size = 4, nudge_x = c(-.3, .3)) + geom_line(data = bracket, mapping = aes(x, y)) + geom_segment(data = bracket, mapping = aes(x, y, xend = x - .05, yend = y)) + annotate("text", x = bracket$x + .1, y = min(bracket$y) + diff(bracket$y)/2, label = paste0("Difference:\n", bracket$y[2], " - ", bracket$y[1], "\n= ", diff(bracket$y)), hjust=0) + coord_cartesian(xlim = c(0.5, 3.5), ylim = c(3,5)) + scale_y_continuous(name = "Language Score", limits = c(1, 5)) + labs(x = "Grapheme-Colour Synaesthete") + cowplot::theme_cowplot() ``` ![](index_files/figure-html/unnamed-chunk-3-1.png)<!-- -->] --- ## Drawing Lines * The line starts from the mean of the non-synaesthete group + This is the **intercept**, which we will call *b*<sub>0</sub> + The predicted value of the outcome `\(\hat{y}\)` when the predictor `\(x\)` is 0 -- * Changing to
the synaesthete group, predicted Language score changes by 0.74 + This is the **slope** of the line, which we will call *b*<sub>1</sub> + The change in the outcome for every **unit change** in the predictor -- - This prediction will always have some amount of error, *e* -- - In general, then, the linear model has the form: .center[ .large[ `\(y_{i} = b_{0} + b_{1}x_{1i} + e_{i}\)` ] ] --- ## Drawing Lines ``` ## ## Call: ## lm(formula = Language ~ GraphCol, data = syn_data) ## ## Coefficients: ## (Intercept) GraphColYes ## 3.5549 0.7313 ``` .codePanel[ ```r bracket <- syn_data %>% group_by(GraphCol) %>% summarise(y = mean(Language)) %>% mutate(x = rep(2.7, 2), y = round(y, 2)) syn_data %>% ggplot(aes(x = GraphCol, y = Language)) + geom_line(stat="summary", fun = mean, aes(group = NA), colour = "red") + geom_errorbar(stat = "summary", fun.data = mean_cl_boot, width = .25) + geom_point(stat = "summary", fun = mean, shape = 23, fill = "orange") + geom_text(aes(label=round(..y..,2)), stat = "summary", fun = mean, size = 4, nudge_x = c(-.3, .3)) + geom_line(data = bracket, mapping = aes(x, y)) + geom_segment(data = bracket, mapping = aes(x, y, xend = x - .05, yend = y)) + annotate("text", x = bracket$x + .1, y = min(bracket$y) + diff(bracket$y)/2, label = paste0("Difference:\n", bracket$y[2], " - ", bracket$y[1], "\n= ", diff(bracket$y)), hjust=0) + coord_cartesian(xlim = c(0.5, 3.5), ylim = c(3,5)) + scale_y_continuous(name = "Language Score", limits = c(1, 5)) + labs(x = "Grapheme-Colour Synaesthete") + cowplot::theme_cowplot() ``` ![](index_files/figure-html/unnamed-chunk-5-1.png)<!-- -->] --- ## Welcome to `lm()`! * Today's new function is a very important one + We will use it for the rest of the term and... + It will be very important next year as well! * Creates a **l**inear **m**odel -> `lm()` + A statistical model that looks like a line --- ## Basic Anatomy of `lm()` - `lm(outcome ~ predictor, data = data)` - Should look familiar: almost identical to `t.test()`! 
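A hedged sketch of the call in action — again using invented toy data rather than the real `syn_data` — showing how the fitted coefficients map onto *b*<sub>0</sub> and *b*<sub>1</sub>:

```r
# Toy data invented for illustration (not the real syn_data)
toy_data <- data.frame(
  GraphCol = factor(c("No", "No", "No", "Yes", "Yes", "Yes")),
  Language = c(3.2, 3.6, 3.8, 4.0, 4.3, 4.6)
)

toy_lm <- lm(Language ~ GraphCol, data = toy_data)

# (Intercept) is b0: the mean of the baseline ("No") group
# GraphColYes is b1: the difference between the two group means
coef(toy_lm)

# Predictions are b0 for "No" and b0 + b1 for "Yes"
predict(toy_lm, newdata = data.frame(GraphCol = c("No", "Yes")))
```

On the real data, `coef()` returns the `(Intercept)` and `GraphColYes` values shown in the `lm()` output a few slides back.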
-- ### `t.test` ```r t.test(Language ~ GraphCol, data = syn_data, alternative = "two.sided", var.equal = T) ``` ### `lm` ```r lm(Language ~ GraphCol, data = syn_data) ``` --- ## Making Connections - Remember from last week: the "signal" of interest was the difference in means - That same value (0.74) is also the value of *b*<sub>1</sub>! (ignoring rounding error...) - The change in Language between synaesthetes and non-synaesthetes - Quantifies the relationship between the predictor and the outcome - This is the **key element** of the linear model! --- ## Have a Go! <iframe class="app" src="https://and.netlify.app/viz/reg_line" data-external="1" width="100%" height="600"></iframe> --- ## Interim Summary - The linear model predicts the outcome `\(y\)` based on a predictor `\(x\)` - General form: `\(y_{i} = b_{0} + b_{1}x_{1i} + e_{i}\)` - *b*<sub>0</sub>: the **intercept**, or value of `\(y\)` when `\(x\)` is 0 - *b*<sub>1</sub>: the **slope**, or change in `\(y\)` for every unit change in `\(x\)` - The slope *b*<sub>1</sub> represents the relationship between the predictor and the outcome - Up next: continuous predictors --- exclude: ![:live] .pollEv[ <iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe> ] --- ## Modelling Gender * Week 4: **correlation** between femininity and masculinity + Remember: *r* expresses degree and direction of the relationship * Today: a linear model using the same variables + Use this model to make **predictions** --- ## Visualising the Line * Ratings of femininity vs masculinity .codePanel[ ```r gensex %>% mutate(Gender = fct_explicit_na(Gender)) %>% ggplot(aes(x = Gender_fem_1, y = Gender_masc_1)) + geom_point(position = "jitter", size = 2, alpha = .4) + scale_x_continuous(name = "Femininity", breaks = c(0:9)) + scale_y_continuous(name = "Masculinity", breaks = c(0:9)) + cowplot::theme_cowplot() ``` 
![](index_files/figure-html/fem_masc_plot-1.png)<!-- -->] --- ## Visualising the Line * Add a *line of best fit* for the relationship between femininity and masculinity - Fits the data with the smallest possible squared error (but that's for next year!) - What do you think the line will look like? --- ## Visualising the Line .codePanel[ ```r gensex %>% mutate(Gender = fct_explicit_na(Gender)) %>% ggplot(aes(x = Gender_fem_1, y = Gender_masc_1)) + geom_point(position = "jitter", size = 2, alpha = .4) + geom_smooth(method = "lm", formula = y ~ x) + scale_x_continuous(name = "Femininity", breaks = c(0:9)) + scale_y_continuous(name = "Masculinity", breaks = c(0:9)) + cowplot::theme_cowplot() ``` ![](index_files/figure-html/fem_masc_plot-line-1.png)<!-- -->] --- ## Modelling Gender .center[ `\(y_{i} = b_{0} + b_{1}x_{1i} + e_{i}\)` ] <br><br> * Outcome: Masculinity * Predictor: Femininity * *b*<sub>0</sub>: value of masculinity when femininity is 0 (the intercept) * *b*<sub>1</sub>: change in masculinity associated with a unit change in femininity (the slope) -- <br><br> .center[ `\(Masculinity_{i} = b_{0} + b_{1}Femininity_{i} + e_{i}\)` ] --- ## Modelling Gender .center[ `\(Masculinity_{i} = b_{0} + b_{1}Femininity_{i} + e_{i}\)` ] -- ``` ## ## Call: ## lm(formula = Gender_masc_1 ~ Gender_fem_1, data = gensex) ## ## Coefficients: ## (Intercept) Gender_fem_1 ## 8.8246 -0.7976 ``` -- .center[ `\(\hat{Masculinity_{i}} = 8.82 - 0.80\times{Femininity_{i}}\)` ] --- ## Predicting Gender * We can now use this model to **predict** someone's rating of masculinity, if we know their rating of femininity - e.g., someone who is not very feminine (rating: 3) - What would the model predict for this person's masculinity rating?
* `\(\hat{Masculinity_{i}} = 8.82 - 0.80\times{Femininity_{i}}\)` --- ## Predicting Gender * `\(\hat{Masculinity_{i}} = 8.82 - 0.80\times{Femininity_{i}}\)` + `\(\hat{Masculinity_{i}} = 8.82 - 0.80\times{3}\)` + `\(\hat{Masculinity_{i}} = 6.42\)` * So, someone with a femininity of 3 will have a predicted masculinity rating of 6.42 + This is subject to some (unknowable!) degree of error --- ## Comparing to the Model - Someone with a femininity of 3 will have a predicted masculinity rating of 6.42 .codePanel[ ```r gensex %>% mutate(Gender = fct_explicit_na(Gender)) %>% ggplot(aes(x = Gender_fem_1, y = Gender_masc_1)) + geom_point(position = "jitter", size = 2, alpha = .4) + geom_smooth(method = "lm", formula = y ~ x) + geom_vline(xintercept = 3, linetype = "dashed") + geom_hline(yintercept = 6.42, linetype = "dashed") + scale_x_continuous(name = "Femininity", breaks = c(0:9)) + scale_y_continuous(name = "Masculinity", breaks = c(0:9)) + cowplot::theme_cowplot() ``` ![](index_files/figure-html/fem_masc_plot-line-horiz-1.png)<!-- -->] --- ## Summary - The linear model (LM) expresses the relationship between at least one predictor, `\(x\)`, and an outcome, `\(\hat{y}\)` - Linear model equation: `\(y_{i} = b_{0} + b_{1}x_{1i} + e_{i}\)` - Predictors can be categorical or continuous! - Most important result is the parameter *b*<sub>1</sub>, which expresses the change in `\(y\)` for each unit change in `\(x\)` - Used to **predict** the outcome for a given value of the predictor - Next week: significance and model fit --- exclude: ![:live] .pollEv[ <iframe src="https://embed.polleverywhere.com/discourses/IXL29dEI4tgTYG0zH1Sef?controls=none&short_poll=true" width="800px" height="600px"></iframe> ] --- class: last-slide <br><br><br><br><br> # Have a lovely weekend!