
The Linear Model 1: Equation of a Line

Lecture 07

Dr Jennifer Mankin

12 March 2021

1 / 35

Looking Ahead (and Behind)

  • Week 4: Correlation

  • Week 5: Chi-Square ( χ2 )

  • Last week: t-test

  • This week: The Linear Model - Equation of a Line

  • Next week: The Linear Model - Significance Testing

2 / 35

Announcements

  • Lab Report Drop-ins starting next week

  • Next week's practicals: Writing up lab report results

    • Week 9 practicals will start on the Linear Model

    • You still have a tutorial and quiz this week!

  • Next week: Academic Advising session on Discussion writing

    • Designed for this assessment - please attend

  • Awards: SavioR nominations and Education Awards

  • Feedback via the Suggestions Box

3 / 35

Objectives

After this lecture you will understand:

  • What a statistical model is and why they are useful

  • The equation for a linear model with one predictor

    • b_0 (the intercept)

    • b_1 (the slope)

    • Both categorical and continuous predictors

  • Using the equation to predict an outcome

  • How to read scatterplots and lines of best fit

4 / 35

The Linear Model

  • Extremely common and fundamental testing paradigm

    • Predict the outcome y from one or more predictors (xs)

    • Our first (explicit) contact with statistical modeling

  • A statistical model is a mathematical expression that captures the relationship between variables

    • All of our test statistics (r, t, etc.) are actually models!

5 / 35

Maps as Models

  • A map is a simplified depiction of the world

    • Captures the important elements (roads, cities, oceans, mountains)

    • Doesn't capture individual detail (where your gran lives)

  • Depicts relationships between locations and geographical features

    • Helps you predict what you will encounter in the world

    • E.g. if you keep walking south eventually you'll fall in the sea!

6 / 35

Statistical Models

  • A model is a simplified depiction of some relationship

    • We want to predict what will happen in the world

    • But the world is complex and full of noise (randomness)

  • We can build a model to try to capture the important elements

    • Change/adjust the model to see what might happen with different parameters

7 / 35

Statistical Models

  • Why might it be useful to create a model like this?

  • Can you think of any recent examples of such models?

8 / 35

Statistical Models: COVID-19

9 / 35

Predictors and Outcomes

  • Now we start assigning our variables roles to play

  • The outcome is the variable we want to explain

    • Also called the dependent variable, or DV

  • The predictors are variables that may have a relationship with the outcome

    • Also called the independent variable(s), or IV(s)

  • We measure or manipulate the predictors, then quantify the systematic change in the outcome

    • NB: YOU (the researcher) assign these roles!

10 / 35

General Model Equation



outcome = model + error



  • We can use models to predict the outcome for a particular case

  • This is always subject to some degree of error

11 / 35
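In lm() terms, "model + error" corresponds to fitted values plus residuals. A minimal sketch using R's built-in cars data (not the lecture's dataset):

```r
# outcome = model + error: every observed value is the model's
# prediction (fitted value) plus that case's error (residual)
fit <- lm(dist ~ speed, data = cars)
all.equal(cars$dist, unname(fitted(fit) + resid(fit)))  # TRUE
```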

Making Predictions

  • Last week, we looked at the Language and Word Forms subscale of the SCSQ

  • If I wanted to predict someone's Language score...

    • What would be the most sensible estimate?

syn_data %>%
  ggplot(aes(x = Language)) +
  geom_histogram(breaks = syn_data %>% pull(Language) %>% unique()) +
  scale_x_continuous(name = "Language and Word Forms Score",
                     limits = c(1, 5)) +
  scale_y_continuous(name = "Count") +
  scale_fill_discrete(name = "Synaesthesia") +
  geom_vline(aes(xintercept = mean(Language)),
             colour = "purple3",
             linetype = "dashed") +
  annotate("text", x = mean(syn_data$Language) + .1, y = 122,
           label = paste0("Mean: ", mean(syn_data$Language) %>% round(2)),
           hjust = 0, colour = "purple4")

12 / 35

Making Predictions

  • Without any other information, the best estimate is the mean of the outcome

    • But we do have more information!

  • Last week: Grapheme-colour synaesthetes score higher than non-synaesthetes on the Language subscale on average

    • We could make a better prediction if we knew whether that person was a synaesthete

    • Use the mean score in the synaesthete vs non-synaesthete groups

  • Let's write an equation that we can use to predict someone's score based on whether they are a synaesthete or not!

13 / 35

Making Predictions

  • First: the non-synaesthete (baseline) group

    • If someone is a non-synaesthete, predicted Language score = 3.55

    • \hat{Language}_{non-syn} = 3.55

syn_data %>%
  dplyr::group_by(GraphCol) %>%
  dplyr::summarise(
    n = dplyr::n(),
    mean_lang = mean(Language),
    se_lang = sd(Language)/sqrt(n)
  ) %>%
  ggplot2::ggplot(aes(x = GraphCol, y = mean_lang)) +
  geom_errorbar(aes(ymin = mean_lang - 2*se_lang, ymax = mean_lang + 2*se_lang), width = .1) +
  geom_point(colour = "black", fill = "orange", pch = 23) +
  scale_y_continuous(name = "Language Score",
                     limits = c(3, 5)) +
  labs(x = "Grapheme-Colour Synaesthete") +
  geom_label(stat = 'summary', fun.y = mean, aes(label = paste0("Mean: ", round(..y.., 2))),
             nudge_x = 0.1, hjust = 0) +
  cowplot::theme_cowplot()

14 / 35

Making Predictions

  • Next: the synaesthete group

    • If someone is a synaesthete, predicted Language score = 4.29

    • \hat{Language}_{syn} = 3.55 + 0.74

bracket <- syn_data %>%
  group_by(GraphCol) %>%
  summarise(y = mean(Language)) %>%
  mutate(x = rep(2.7, 2),
         y = round(y, 2))

syn_data %>%
  ggplot(aes(x = GraphCol, y = Language)) +
  # geom_line(stat = "summary", fun = mean, aes(group = NA)) +
  geom_errorbar(stat = "summary", fun.data = mean_cl_boot, width = .25) +
  geom_point(stat = "summary", fun = mean, shape = 23, fill = "orange") +
  geom_text(aes(label = round(..y.., 2)), stat = "summary", fun = mean, size = 4,
            nudge_x = c(-.3, .3)) +
  geom_line(data = bracket, mapping = aes(x, y)) +
  geom_segment(data = bracket, mapping = aes(x, y, xend = x - .05, yend = y)) +
  annotate("text", x = bracket$x + .1, y = min(bracket$y) + diff(bracket$y)/2,
           label = paste0("Difference:\n", bracket$y[2], " - ", bracket$y[1], "\n= ", diff(bracket$y)),
           hjust = 0) +
  coord_cartesian(xlim = c(0.5, 3.5), ylim = c(3, 5)) +
  scale_y_continuous(name = "Language Score") +
  labs(x = "Grapheme-Colour Synaesthete") +
  cowplot::theme_cowplot()

15 / 35

Making Predictions

  • We want our equation to give a different prediction depending on whether someone is a synaesthete or not

    • \hat{Language} = 3.55

    • \hat{Language}_i = 3.55 + 0.74 × GraphCol_i

  • When someone is a non-synaesthete (GraphCol = 0)...

    • \hat{Language}_i = 3.55 + 0.74 × 0

    • \hat{Language}_i = 3.55

  • When someone is a synaesthete (GraphCol = 1)...

    • \hat{Language}_i = 3.55 + 0.74 × 1

    • \hat{Language}_i = 3.55 + 0.74

    • \hat{Language}_i = 4.29

16 / 35
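The substitution above can be checked directly in R; a sketch using the rounded coefficients from the slide:

```r
# Prediction equation from the slide: Language-hat = 3.55 + 0.74 * GraphCol
b0 <- 3.55            # intercept: non-synaesthete group mean
b1 <- 0.74            # slope: difference between the group means
graphcol <- c(0, 1)   # 0 = non-synaesthete, 1 = synaesthete
b0 + b1 * graphcol    # predicted Language scores: 3.55 and 4.29
```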

Drawing Lines

  • This equation represents a linear model (a line!) between the means

bracket <- syn_data %>%
  group_by(GraphCol) %>%
  summarise(y = mean(Language)) %>%
  mutate(x = rep(2.7, 2),
         y = round(y, 2))

syn_data %>%
  ggplot(aes(x = GraphCol, y = Language)) +
  geom_line(stat = "summary", fun = mean, aes(group = NA)) +
  geom_errorbar(stat = "summary", fun.data = mean_cl_boot, width = .25) +
  geom_point(stat = "summary", fun = mean, shape = 23, fill = "orange") +
  geom_text(aes(label = round(..y.., 2)), stat = "summary", fun = mean, size = 4,
            nudge_x = c(-.3, .3)) +
  geom_line(data = bracket, mapping = aes(x, y)) +
  geom_segment(data = bracket, mapping = aes(x, y, xend = x - .05, yend = y)) +
  annotate("text", x = bracket$x + .1, y = min(bracket$y) + diff(bracket$y)/2,
           label = paste0("Difference:\n", bracket$y[2], " - ", bracket$y[1], "\n= ", diff(bracket$y)),
           hjust = 0) +
  coord_cartesian(xlim = c(0.5, 3.5), ylim = c(3, 5)) +
  scale_y_continuous(name = "Language Score",
                     limits = c(1, 5)) +
  labs(x = "Grapheme-Colour Synaesthete") +
  cowplot::theme_cowplot()

17 / 35

Drawing Lines

  • The line starts from the mean of the non-synaesthete group

    • This is the intercept, which we will call b_0

    • The predicted value of the outcome \hat{y} when the predictor x is 0

  • Changing to the synaesthete group, predicted Language score changes by 0.74

    • This is the slope of the line, which we will call b_1

    • The change in the outcome for every unit change in the predictor

  • This prediction will always have some amount of error, e

  • In general, then, the linear model has the form:

y_i = b_0 + b_1x_{1i} + e_i

18 / 35

Drawing Lines

##
## Call:
## lm(formula = Language ~ GraphCol, data = syn_data)
##
## Coefficients:
## (Intercept)  GraphColYes
##      3.5549       0.7313

bracket <- syn_data %>%
  group_by(GraphCol) %>%
  summarise(y = mean(Language)) %>%
  mutate(x = rep(2.7, 2),
         y = round(y, 2))

syn_data %>%
  ggplot(aes(x = GraphCol, y = Language)) +
  geom_line(stat = "summary", fun = mean, aes(group = NA), colour = "red") +
  geom_errorbar(stat = "summary", fun.data = mean_cl_boot, width = .25) +
  geom_point(stat = "summary", fun = mean, shape = 23, fill = "orange") +
  geom_text(aes(label = round(..y.., 2)), stat = "summary", fun = mean, size = 4,
            nudge_x = c(-.3, .3)) +
  geom_line(data = bracket, mapping = aes(x, y)) +
  geom_segment(data = bracket, mapping = aes(x, y, xend = x - .05, yend = y)) +
  annotate("text", x = bracket$x + .1, y = min(bracket$y) + diff(bracket$y)/2,
           label = paste0("Difference:\n", bracket$y[2], " - ", bracket$y[1], "\n= ", diff(bracket$y)),
           hjust = 0) +
  coord_cartesian(xlim = c(0.5, 3.5), ylim = c(3, 5)) +
  scale_y_continuous(name = "Language Score",
                     limits = c(1, 5)) +
  labs(x = "Grapheme-Colour Synaesthete") +
  cowplot::theme_cowplot()

19 / 35

Welcome to lm()!

  • Today's new function is a very important one

    • We will use it for the rest of the term and...

    • It will be very important next year as well!

  • Creates a linear model -> lm()

    • A statistical model that looks like a line

20 / 35

Basic Anatomy of lm()

  • lm(outcome ~ predictor, data = data)

  • Should look familiar: almost identical to t.test()!

t.test

t.test(Language ~ GraphCol, data = syn_data,
       alternative = "two.sided", var.equal = TRUE)

lm

lm(Language ~ GraphCol, data = syn_data)

21 / 35

Making Connections

  • Remember from last week: the "signal" of interest was the difference in means

  • That same value (0.74) is also the value of b_1! (ignoring rounding error...)

    • The change in Language between synaesthetes and non-synaesthetes

    • Quantifies the relationship between the predictor and the outcome

    • This is the key element of the linear model!

22 / 35
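The claim that b_1 equals the difference in group means can be demonstrated with simulated data (a sketch; syn_data itself is not reproduced here):

```r
# With a 0/1 predictor, the slope b1 is exactly the difference in group means
set.seed(1)
sim <- data.frame(
  GraphCol = rep(c(0, 1), each = 50),
  Language = c(rnorm(50, mean = 3.55, sd = 0.5),    # non-synaesthetes
               rnorm(50, mean = 4.29, sd = 0.5)))   # synaesthetes
group_means <- tapply(sim$Language, sim$GraphCol, mean)
diff(group_means)                                   # difference in means...
coef(lm(Language ~ GraphCol, data = sim))[["GraphCol"]]  # ...equals b1
```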

Have a Go!

23 / 35

Interim Summary

  • The linear model predicts the outcome y based on a predictor x

    • General form: y_i = b_0 + b_1x_{1i} + e_i

    • b_0: the intercept, or value of y when x is 0

    • b_1: the slope, or change in y for every unit change in x

  • The slope b_1 represents the relationship between the predictor and the outcome

  • Up next: continuous predictors

24 / 35

Modelling Gender

  • Week 4: correlation between femininity and masculinity

    • Remember: r expresses degree and direction of the relationship

  • Today: a linear model using the same variables

    • Use this model to make predictions

25 / 35

Visualising the Line

  • Ratings of femininity vs masculinity

gensex %>%
  mutate(Gender = fct_explicit_na(Gender)) %>%
  ggplot(aes(x = Gender_fem_1, y = Gender_masc_1)) +
  geom_point(position = "jitter", size = 2, alpha = .4) +
  scale_x_continuous(name = "Femininity", breaks = c(0:9)) +
  scale_y_continuous(name = "Masculinity", breaks = c(0:9)) +
  cowplot::theme_cowplot()

26 / 35

Visualising the Line

  • Add a line of best fit for the relationship between femininity and masculinity

    • Fits the data with the smallest total squared error (but that's for next year!)

  • What do you think the line will look like?

27 / 35

Visualising the Line

gensex %>%
  mutate(Gender = fct_explicit_na(Gender)) %>%
  ggplot(aes(x = Gender_fem_1, y = Gender_masc_1)) +
  geom_point(position = "jitter", size = 2, alpha = .4) +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_x_continuous(name = "Femininity", breaks = c(0:9)) +
  scale_y_continuous(name = "Masculinity", breaks = c(0:9)) +
  cowplot::theme_cowplot()

28 / 35

Modelling Gender

y_i = b_0 + b_1x_{1i} + e_i

  • Outcome: Masculinity

  • Predictor: Femininity

  • b_0: value of masculinity when femininity is 0 (the intercept)

  • b_1: change in masculinity associated with a unit change in femininity (the slope)

Masculinity_i = b_0 + b_1 × Femininity_i + e_i

29 / 35

Modelling Gender

Masculinity_i = b_0 + b_1 × Femininity_i + e_i

##
## Call:
## lm(formula = Gender_masc_1 ~ Gender_fem_1, data = gensex)
##
## Coefficients:
##  (Intercept)  Gender_fem_1
##       8.8246       -0.7976

\hat{Masculinity}_i = 8.82 - 0.80 × Femininity_i

30 / 35

Predicting Gender

  • We can now use this model to predict someone's rating of masculinity, if we know their rating of femininity

    • e.g., someone who is not very feminine (rating: 3)

    • What would the model predict for this person's masculinity rating?

  • \hat{Masculinity}_i = 8.82 - 0.80 × Femininity_i

31 / 35

Predicting Gender

  • \hat{Masculinity}_i = 8.82 - 0.80 × Femininity_i

    • \hat{Masculinity}_i = 8.82 - 0.80 × 3

    • \hat{Masculinity}_i = 6.42

  • So, someone with a femininity of 3 will have a predicted masculinity rating of 6.42

    • This is subject to some (unknowable!) degree of error

32 / 35
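The same prediction can be obtained from the fitted model with predict(); this sketch assumes the gensex data frame from the lecture is loaded:

```r
# Fit the model, then predict masculinity for someone with femininity = 3
gender_lm <- lm(Gender_masc_1 ~ Gender_fem_1, data = gensex)
predict(gender_lm, newdata = data.frame(Gender_fem_1 = 3))

# Or by hand, with the rounded coefficients from the slide:
8.82 - 0.80 * 3   # 6.42
```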

Comparing to the Model

  • Someone with a femininity of 3 will have a predicted masculinity rating of 6.42

gensex %>%
  mutate(Gender = fct_explicit_na(Gender)) %>%
  ggplot(aes(x = Gender_fem_1, y = Gender_masc_1)) +
  geom_point(position = "jitter", size = 2, alpha = .4) +
  geom_smooth(method = "lm", formula = y ~ x) +
  geom_vline(xintercept = 3, linetype = "dashed") +
  geom_hline(yintercept = 6.42, linetype = "dashed") +
  scale_x_continuous(name = "Femininity", breaks = c(0:9)) +
  scale_y_continuous(name = "Masculinity", breaks = c(0:9)) +
  cowplot::theme_cowplot()

33 / 35

Summary

  • The linear model (LM) expresses the relationship between at least one predictor, x, and an outcome, \hat{y}

    • Linear model equation: y_i = b_0 + b_1x_{1i} + e_i

    • Predictors can be categorical or continuous!

    • Most important result is the parameter b_1, which expresses the change in y for each unit change in x

  • Used to predict the outcome for a given value of the predictor

  • Next week: significance and model fit

34 / 35






Have a lovely weekend!

35 / 35
