Tutorial 10
In the previous tutorial, we practised fitting a linear model with a single predictor to see if we can say anything about the length of a penguin’s flipper if we know something about its body mass. This time, we will build more sophisticated models with more than one predictor. To do that, we’ll make use of the understanding of multiple-predictor linear models that you gained in lecture 9.
Before we start, let’s warm up by answering a few quiz questions!
As you know, the general equation for the linear model with n predictors is
\[y_i = b_0 + b_1\times x_{1_i} + b_2\times x_{2_i} + b_3\times x_{3_i} + \dots + b_{n-1}\times x_{n-1_i} + b_n\times x_{n_i} + e_i\]
How many slopes will there be in a model with 4 predictors?
4
Correct!That’s not right…
How many intercepts will there be in a model with 3 predictors?
1
Correct!That’s not right…
In general terms, how many b coefficients will there be in a model with n predictors? (do not include white spaces)
n + 1
Correct!That’s not right…
What is the geometrical representation of the linear model with 2 predictors?
2D plane in a 3D space
Correct!That’s not right…
What does each slope represent?
Relationship between a given predictor and the outcome after accounting for all other predictors
Correct!That’s not right…
As usual, all you need is this tutorial and RStudio. Remember that you can easily switch between windows with the Alt + ↹ Tab (Windows) and ⌘ Command + ⇥ Tab (Mac OS) shortcuts.
Create a new week_10
project and open it in RStudio. Then, create two new folders in the new week_10
folder: r_docs
and data
. Finally, open a new R Markdown file and save it in the r_docs
folder. Get into the habit of creating new code chunk in the file for each of the tasks below.
Remember, you can add new code chunks by:
```
{r}
, press Enter, then ```
again.
In this tutorial, just like in the previous one, we’ll continue using the penguins data set from the palmerpenguins
package. You will also need the packages tidyverse
, broom
, and QuantPsyc
(install it if you don’t have it yet; watch out for spelling and upper/lower case letters!). Put the code that loads all of these in the setup
chunk of your Rmd file and run it.
Just like last time, save the penguins
data set in your environment as peng_data
to avoid having to type out palmerpenguins::penguins
every time you want to use it.
peng_data <- palmerpenguins::penguins
Remind yourself of what’s in the data by looking at a rough summary.
summary(peng_data)
species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
Last time we looked at how we can use information about a penguin’s body mass to make a more accurate prediction about the length of its flippers than simply guessing the mean flipper length for any given specimen.
Let’s take a look at this model again.
Fit the model to the data using the lm()
, saving it’s output as m_mass
Use broom::tidy()
to get a neat tibble of model summary statistics for this model. Save it as m_mass_summary
m_mass_summary <- m_mass %>%
broom::tidy()
Inspect the results and answer the following questions.
What kind of relationship between body mass and flipper length does the model describe
Positive
Correct!That’s not right…
By how many millimetres is flipper length predicted to increase for every 10 gramme increase in a penguin’s body mass? (give answer to 2 decimal places)
0.15
Correct!That’s not right…
Is this a statistically significant relationship (assuming an α = .05)?
Yes
Correct!That’s not right…
Use broom::glance()
to get the model fit statistics for m_mass
and answer the quiz question.
What proportion of total variance in flipper length is accounted for by body mass? (give answer as % to 1 decimal place, such as 77.2)
75.9
Correct!That’s not right…
First, let’s have a look at the output of broom::glance()
.
m_mass %>% broom::glance()
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.759 0.758 6.91 1071. 4.37e-107 1 -1146. 2297.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
The value we are looking for is in the first column, r.squared
. All we need to do is to convert it to % and round to one decimal place:
Now that you’ve refreshed your memory of the single-predictor model we built last time, let’s extend it by adding another predictor.
Imagine we’re also interested in whether or not bill length is related to flipper length. We can extend out m_mass
model to explore this research question.
Before we do that, let’s look at what a simple model of flipper length as predicted by bill length tells us.
Fit this model, saving it as m_bill
, inspect the results and answer the following questions.
What kind of relationship between bill length and flipper length does the model describe
Positive
Correct!That’s not right…
By how many millimetres is flipper length predicted to increase for every 1 mm increase in a penguin’s bill length? (give answer to 2 decimal places)
1.69
Correct!That’s not right…
Is this a statistically significant relationship (assuming an α = .05)?
Yes
Correct!That’s not right…
What proportion of total variance in flipper length is accounted for by bill length? (give answer as % to 1 decimal place, such as 77.2)
43.1
Correct!That’s not right…
First, let’s fit the model and save the summary using broom::tidy()
Let’s have a look:
m_bill_smry
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 127. 4.67 27.2 3.65e-87
2 bill_length_mm 1.69 0.105 16.0 1.74e-43
The coefficient for bill_length_mm
, in other words, the slope for the variable (\(b_1\)), is positive. Given that both our predictor and outcome represent length in millimetres, and so a larger number represents more of the quantity that’s being measured, this is a positive relationship.
Both variables in the model are measured in mm and so the coefficient is interpreted as an increase in flipper length of 1.69 mm per every 1 mm increase in bill length.
The p-value is tiny (less than 2 × 10−16), and so \(b_1\) is statistically significantly different from zero, given the \(\alpha\) = .05 significance level.
Finally, the proportion of variance explained is:
OK, so we know that bill length is a significant predictor or flipper length. However, it could well be the case, that both of these variables are just reflections of body mass and, if we take body mass into account, the relationship between bill and flipper length disappears. In other words, it’s possible that larger penguins have both longer bills and longer flippers but once you only consider birds of a certain body mass, the ones with longer bills do not have longer flippers. Putting both of the predictors in the model allows us to see whether or not this is actually the case.
A careful inspection of the R2 values of our two models reveals that the relationship is not going to be completely straightforward. The first model accounts for 75.9% of the variance in the outcome, while the second model accounts for 43.1%. That means, that the model that combines the two predictors cannot simply be the same as the two models added together. The reason for this is that the total amount of variance in flipper length is, well, 100% and not 75.9% + 43.1% = 119%.
In other words, some of the variance accounted for by the two predictors will be shared between them.
Let’s do it then!
Let’s fit and interpret a two-predictor model of flipper length one step at a time.
First of all, let’s fit the model using lm()
and save it as m_mass_bill
.
Check out lecture 9 if you can’t remember how to add multiple predictor into the formula.
m_mass_bill <- lm(flipper_length_mm ~ body_mass_g + bill_length_mm, peng_data)
Get the summary of the model and have a look at it. Are the model coefficients the same as the ones from m_mass
and m_bill
? If not, how are they different?
m_mass_bill_smry <- m_mass_bill %>% broom::tidy()
m_mass_bill_smry
# A tibble: 3 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 122. 2.86 42.7 1.70e-138
2 body_mass_g 0.0131 0.000545 23.9 7.56e- 75
3 bill_length_mm 0.549 0.0801 6.86 3.31e- 11
As you can see, both “effects”1 have reduced in size. They are still positive and statistically significant but the effect of body mass has shrunk from 0.015 mm per each gramme to 0.013 mm/g, while the effect of bill length has gone down from 1.69 mm increase in flipper length per each additional mm of bill to just 0.55 mm per mm. This has to do with the above-mentioned fact that some of the explained variance in the outcome is shared by the two predictors.
Remember that the interpretation of the slopes in a multi-predictor model is different from that in a single-predictor model.
In our particular model, the \(b\) coefficient for body mass represents the increase in flipper length given a certain fixed bill length. Conversely, the \(b\) coefficient for body bill length represents the increase in flipper length given a certain fixed body mass.
Given the coefficients of our model – 121.956, 0.013, and 0.549 – give the predicted value of flipper length for the following values of body mass and bill length. (2 decimal places)
Do you dare use R
to do the numbers for you? 😛
Mass: 4,200 g
Bill length: 45 mm
201.48 | 201.26
Correct!That’s not right…
Mass: 4,200 g
Bill length: 30 mm
193.25 | 193.03
Correct!That’s not right…
Mass: 3,800 g
Bill length: 45 mm
196.26 | 196.06
Correct!That’s not right…
Mass: 3,800 g
Bill length: 30 mm
188.03 | 187.83
Correct!That’s not right…
All you need to do to answer these questions is to plug the right numbers into the formula and do the maths. For instance, the answer to the first question is:
\[\begin{align}\hat{y}&=b_0 + b_1\times x_1 + b_2\times x_2\\&=b_0 + b_1\times\text{body_mass} + b_2\times\text{bill_length}\\&\approx121.956 + 0.013\times\text{body_mass} + 0.549\times\text{bill_length}\\&\approx121.956 + 0.013\times4,200 + 0.549\times45\\&\approx201.26\end{align}\]
A good thing to notice is the relationships between the answers to each of the questions and the \(b\) coefficients.
The values of body mass in questions 14 and 15 are the same and the values of bill length differ by 15 mm. That means that the difference between the results is going to be 15 × 0.549 = 8.24. That’s going to be the same as the difference between answers to questions 16 and 17 because the difference in the respective values of bill length in these two questions are also 15 mm, while the values of body mass are, again, the same.
Questions 14 and 16 on the one hand and 15 and 17 on the other have the same values of bill length but, in both cases, they differ in body mass by 400 g. That means that the differences between both pairs of results is going to be 400 × 0.013 = 5.2.
Finally, the differences between questions 14 and 17 and 15 and 16, respectively, is going to be the sum of the previous two differences: 400 × 0.013 + 15 × 0.549 = 13.44.
R
Of course, we can tell the computer to calculate the predicted values for us!
There are multiple ways of doing this and there are even functions written exactly for this purpose, such as predict()
. But, for now, let’s use what we already know.
If you look at the output of our model using broom::tidy()
(which we saved as m_mass_bill_smry
), you’ll find the coefficients there, in the estimate
column:
m_mass_bill_smry
# A tibble: 3 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 122. 2.86 42.7 1.70e-138
2 body_mass_g 0.0131 0.000545 23.9 7.56e- 75
3 bill_length_mm 0.549 0.0801 6.86 3.31e- 11
We can use this column to calculate the predicted value for any set of predictor values. All we need is a column of values to multiply the estimate
column by and then add the products together:
b | x | b * x | |
---|---|---|---|
121.956 | 1 | 121.956 | |
0.013 | 4200 | 54.813 | |
0.549 | 45 | 24.715 | |
sum | 201.484 |
To do this in R
you only need to come up with an algorithm. An OK one could look something like this:
x
, with the values of predictors. Since we’re multiplying two columns, the predictor column also needs a 1 to multiply the intercept by.estimate
and x
columns. Let’s call it bx
.bx
column from the tibble.All that’s left is to translate these steps into R
using basic data wrangling functions:
m_mass_bill_smry %>%
dplyr::mutate(x = c(1, 4200, 45), # step 1
bx = estimate * x) %>% # step 2
dplyr::pull(bx) %>% # step 3
sum() %>% # step 4
round(2) # step 5
[1] 201.48
And there’s your answer… It differs from the result of the calculation in the Solution box (201.26), because here we are not rounding our coefficients to 3 decimal places. Not doing so makes the answer more precise.
With a little thinking, you could even write code that calculates the answer to all four questions in one go!
Why don’t you give it a try as an extra task? Using the predict()
function will make it very easy.
Look at the model fit statistic, R2 in particular.
What proportion of variance in the outcome does the model account for? (Give answer as % to 1 decimal place)
78.8
Correct!That’s not right…
m_mass_bill %>% broom::glance()
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.788 0.787 6.49 631. 4.88e-115 2 -1123. 2255.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
Notice that R2 of this model is far smaller than the R2 of the two one-predictor models added together. In fact, it’s not that much larger than that of the body mass only model: While that model accounted for 75.9% of the total variance in flipper length, this one accounts for 78.8%. Once again, this is because there is a substantial overlap (correlation) between body mass and bill length.
When we were reporting our body mass model in the last tutorial, we changed the scale of the predictor from grammes to kilogrammes. This is by no means a problematic thing to do. After all, given the mean and SD of the body_mass_g
variable (4201.75 and 801.95, respectively), it is not really reasonable to expect every 1 g increase in body mass to have a substantial effect on flipper length and so working with body mass in kg is fine.
Rescale the body mass variable, refit the model and inspect the results to see how the coefficients change.
Creating a mass_kg
variable in the peng_data
shouldn’t be difficult as it’s just a conversion of grammes to kilogrammes…
# first, create the re-scaled variable
peng_data <- peng_data %>%
dplyr::mutate(mass_kg = body_mass_g / 1000)
# re-fti the model
m_massKg_bill <- lm(flipper_length_mm ~ mass_kg + bill_length_mm, peng_data)
# see resutls
m_massKg_bill_smry <- m_massKg_bill %>% broom::tidy()
m_massKg_bill_smry
# A tibble: 3 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 122. 2.86 42.7 1.70e-138
2 mass_kg 13.1 0.545 23.9 7.56e- 75
3 bill_length_mm 0.549 0.0801 6.86 3.31e- 11
You should be able to see that both the intercept and the slope for bill length stayed the same as in the original model but the slope for mass (kg) scaled up by 1,000, compared to the one in the original model. Other than that, it’s the same number.
So as you can see, scaling the predictors scales the coefficients accordingly. If we wanted, we could scale flipper or bill length to some other units (such as centimetres) but, in this case, millimetres work well.
What doesn’t work as well, however, is the intercept. Not that there’s something wrong with it but, in the current model, it’s not very meaningful. After all, the flipper length of a hypothetical massless bill-less penguin is not that interesting. If may be more interesting to know what the expected flipper length is for a penguin of average size and average bill length.
Re-fit the model with the predictors centred around their respective means.
To mean-centre a variable, you can either subtract its mean from it, or use the scale()
function with the right combination of its arguments. Check out the documentation for the function (?scale
) to find out what arguments the function accepts.
peng_data <- peng_data %>%
mutate(
mass_kg_cntr = mass_kg - mean(mass_kg, na.rm = T), # using mean
bill_length_cntr = scale(bill_length_mm, scale = FALSE)) # using scale
m_massKg_bill_cntr <- lm(flipper_length_mm ~ mass_kg_cntr + bill_length_cntr, peng_data)
# see resutls
m_massKg_bill_cntr_smry <- m_massKg_bill_cntr %>% broom::tidy()
m_massKg_bill_cntr_smry
# A tibble: 3 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 201. 0.351 573. 0.
2 mass_kg_cntr 13.1 0.545 23.9 7.56e-75
3 bill_length_cntr 0.549 0.0801 6.86 3.31e-11
So… What is the expected flipper length (in mm) for a penguin of average size and average bill length? (2 decimal places)
200.92
Correct!That’s not right…
The centred model we just made is pretty informative. However, it’s not immediately obvious which of our two predictors has a stronger effect on the outcome. That’s because the two predictors are measured on incomparable scales (kg vs mm). This is where standardised coefficients (Bs) come in handy: because they are both on the scale of standard deviations, they allow us to compare effects of very different variables.
Use QuantPscy::lm.beta()
to get B coefficients for the predictors in our model.
m_massKg_bill_cntr %>% QuantPsyc::lm.beta()
mass_kg_cntr bill_length_cntr
0.7442998 0.2132412
This of the predictors has the strongest effect on the outcome?
Body mass
Correct!That’s not right…
Since one SD increase in body mass is associated with a 0.74 SD increase in flipper length, while one SD increase in bill length is associated with only a 0.21 SD increase in flipper length, body mass has the larger effect on the outcome.
Report the findings of the model. At the very least, you should report R2 and the individual model coefficients along with their test of significance.
Check out tutorial 9 for inspiration.
The model accounted for a significant proportion of variance in flipper length, \(R^2 = .79\), 90% CI \([0.75\), \(0.82]\), \(F(2, 339) = 631.39\), \(p < .001\). Body mass was positively related to flipper length, with flipper length increasing by 13.05 millimetres with each one kilogramme increase in body mass, 95% CI \([11.98\), \(14.12]\), \(t(339) = 23.94\), \(p < .001\). Bill length was also significantly related to the outcome, \(b = 0.55\), 95% CI \([0.39\), \(0.71]\), \(t(339) = 6.86\), \(p < .001\).
That’s all for today. Good job!
Here, effect is a technical term and does not imply a causal link between the predictor and the outcome!↩︎