Practical 10
Today we’re continuing our work on the Linear Model, which will be our main focus for the rest of the term and throughout second year. This week we’ll start from the single-predictor model we made last week and expand it by adding more predictors.
In order to complete this practical successfully, you need to have completed the Week 9 practical first. If you haven’t done that yet, stop here and work through that practical before you carry on.
Set up as usual:
library(tidyverse)
We’re continuing this week with the pokemon dataset that we started working with in the last practical. More importantly, we’ll also be building on the same model and results that we looked at last time. If you didn’t complete that practical, or if it’s just been a hot minute since you had to think about the linear model, let’s look back at what we did in that practical before we move on.
Read in the pokemon dataset at the following link, which is a copy of this Pokemon dataset, and store it in a new object called pokemon.
Link: https://and.netlify.app/docs/pokemon.csv
pokemon <- readr::read_csv("https://and.netlify.app/docs/pokemon.csv")
As future Pokemon Masters, we want to work out which Pokemon is the strongest. So, we’ll keep using attack as our outcome variable, which quantifies a Pokemon’s offensive capabilities in battle. Last week we created a model using one predictor; I used hp, a measure of the Pokemon’s health, but you might have chosen something different.
Recreate your model with one predictor from the last practical and save it in a new object called poke_lm. Then, have a look at a summary of the results.
Call:
lm(formula = attack ~ hp, data = pokemon)
Residuals:
Min 1Q Median 3Q Max
-162.812 -18.469 -3.344 16.656 108.091
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.59387 2.88447 15.11 <2e-16 ***
hp 0.49687 0.03903 12.73 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 29.34 on 799 degrees of freedom
Multiple R-squared: 0.1686, Adjusted R-squared: 0.1676
F-statistic: 162 on 1 and 799 DF, p-value: < 2.2e-16
Use the summary you’ve produced to answer the following questions for your model:
If you’re stuck here, make sure you review the Week 9 practical in depth before you continue.
For my model with hp as the only predictor:

- The intercept is b0, and the slope of the predictor (hp) is b1. From the output we can see that b0 is 43.59 and b1 is 0.5.
- The relationship between hp and attack is positive - Pokemon with more health also tend to have a stronger attack. More specifically, for each additional point of health a Pokemon has (a unit increase in hp), its attack increases by 0.5.
- The p-value can be found in the column labelled Pr(>|t|). We want the value in this column on the row labelled with the name of the predictor. For my model, this value is very small, and definitely smaller than .05, so the positive relationship between hp and attack is significant. The *** code after the p-value in the output tells me that this result is significant at p < .001, which is also how I would report this result.

We now have a good idea of the relationship between our one predictor and attack. Next, let’s expand our model to include more predictors.
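If you want to pull these estimates straight out of the model object rather than reading them off the summary, base R’s coef() function returns them as a named vector (a small aside, not required for the task):

```r
# Returns a named vector containing the intercept (b0) and the hp slope (b1),
# i.e. about 43.59 and 0.50 for this model
coef(poke_lm)
```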
Choose another predictor from one of the following:

- defense: How good the Pokemon’s defense is
- height_m: The Pokemon’s height in meters
- weight_kg: The Pokemon’s weight in kilograms
- speed: How fast the Pokemon is
- hp: How many hit points (a measure of health) the Pokemon has

Again, you should choose a different predictor than the one in the guided part of the practical. For the purposes of the solutions in this worksheet, I’ll use defense.
In your RMarkdown file, write down your prediction about the relationship between your new chosen predictor and the outcome, attack.
As we’ve done before, any prediction is fine that makes sense to you. The key is that you make a prediction and think about what that relationship might look like in the data.
Next up, we should have a look at our data. We’re first going to do this for our new predictor only, then make a more complex figure with both predictors.
Create a scatterplot of your new predictor and attack as the outcome.
pokemon %>%
ggplot(aes(x = defense, y = attack)) +
geom_point()
If you like, label the axes better and clean up the plot with a theme.
pokemon %>%
ggplot(aes(x = defense, y = attack)) +
geom_point() +
scale_x_continuous(name = "Defense Stat",
# seq() creates a vector of breaks at increments of 25
breaks = seq(0, max(pokemon$defense), by = 25)) +
scale_y_continuous(name = "Attack Stat",
breaks = seq(0, max(pokemon$attack), by = 25)) +
cowplot::theme_cowplot()
Stop and have a look at your plot. How would you draw the line of best fit? Is this the direction of relationship you expected?
Add a line of best fit to your plot. Is this what you expected?
The answer to this question depends on what you predicted originally, and the line of best fit you suggested before you added it to your plot!
pokemon %>%
ggplot(aes(x = defense, y = attack)) +
geom_point() +
scale_x_continuous(name = "Defense Stat",
# seq() creates a vector of breaks at increments of 25
breaks = seq(0, max(pokemon$defense), by = 25)) +
scale_y_continuous(name = "Attack Stat",
breaks = seq(0, max(pokemon$attack), by = 25)) +
geom_smooth(method = "lm") +
cowplot::theme_cowplot()
It’s always a good idea to visualise your data before modeling, for example with the scatterplots we’ve made above. However, with more than one predictor the plots get hard to read, and with more than two predictors it isn’t even possible to make a plot! Typically we’re more interested in the b estimates than in elaborate plots anyway. In my case, both hp and defense have positive relationships with attack. This is enough for our purposes, but if you’d like more plotting, see the optional box below.
If you’re really keen to make a plot with both predictors, you can get a basic one really easily with R.
I learned how to do this by Googling “3d plots in R” and following this tutorial. For just about anything you want to do in R, just search for “[thing you want to do] in R” and there are tons of resources available!
First, install and load the scatterplot3d package.
install.packages("scatterplot3d")
library(scatterplot3d)
Next, create a basic plot with your two predictors on the x- and y-axes, with the outcome on the z-axis. If you get a bit confused, use the tutorial linked above for help!
scatterplot3d(x = pokemon$hp, y = pokemon$defense, z = pokemon$attack)
…Yuck. That’s not too helpful, is it?
You can make the plot better if you like by trying out the options in the tutorial linked above. However, this might give you an idea why plots with three variables are a bit tricky to use/interpret! For now, we’re going to move on to the numbers, which are much easier to get on with.
Now that we have some idea of what to expect, we can create our linear model. Since we already have a poke_lm object containing our model with one predictor, we need a new object to contain our second model.

To do this, we need to add the second predictor into the formula - literally, by using +!
Use the lm() function to run the analysis with both predictors and save the model in a new object called poke_lm2.
# remember to use whichever predictor you've chosen!
poke_lm2 <- lm(attack ~ hp + defense, data = pokemon)
Call this poke_lm2 object in the Console to view it, then write out the linear model from the output.
First, to see what’s in our model, we can just call it by typing its name in the console and pressing Enter.
poke_lm2
Call:
lm(formula = attack ~ hp + defense, data = pokemon)
Coefficients:
(Intercept) hp defense
21.5852 0.3818 0.4102
This simple version of the output now gives us three numbers:

- (Intercept): the intercept, here 21.59
- hp: the slope for this predictor, here 0.38
- defense: the slope for the second predictor, here 0.41

Keep in mind that the slope will always be named with the predictor that it’s associated with!
So, we can write our linear model:
Attack\(_i\) = 21.59 + 0.38 \(\times\) HP\(_i\) + 0.41 \(\times\) Defense\(_i\)
Choose two values, one for HP and one for defense. You can pick them at random if you want. Using your equation, what attack value would you predict a Pokemon with those stats to have?

If you’re using the hp and defense predictors, use these values: 69 HP and 73 defense.
Start with our equation: Attack\(_i\) = 21.59 + 0.38 \(\times\) HP\(_i\) + 0.41 \(\times\) Defense\(_i\)
All we need to do is replace the names of the variables, HP and Defense, with the specific values for those predictors that we’re interested in:
attack = 21.59 + 0.38 \(\times\) 69 + 0.41 \(\times\) 73
attack = 21.59 + 26.22 + 29.93
attack = 77.74
So, a Pokemon with 69 HP and 73 defense should have an attack rating of about 78.
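You can also get R to do this calculation for you with predict(). A minimal sketch, assuming your two-predictor model is stored in poke_lm2 with hp and defense as the predictors:

```r
# Predict attack for a (hypothetical) Pokemon with 69 HP and 73 defense
predict(poke_lm2, newdata = data.frame(hp = 69, defense = 73))
# Should come out at roughly 78, matching the hand calculation above
```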
Which predictor(s) are significant? How do you know? Report the results in your Markdown.
Get a summary of your model, or maybe tidy it up a bit, to get p-values.
First, we need output that gives us p-values. Either summary() or broom::tidy() will work, but I’m using the latter so I can get confidence intervals as well.
poke_lm2 %>% broom::tidy(conf.int = TRUE)
# A tibble: 3 x 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 21.6 3.12 6.91 9.78e-12 15.5 27.7
2 hp 0.382 0.0366 10.4 5.45e-24 0.310 0.454
3 defense 0.410 0.0316 13.0 4.23e-35 0.348 0.472
First, we can essentially ignore the intercept. It’s useful for constructing the equation as we did above, but with multiple predictors it becomes hard to interpret usefully. As we’ve said before, we’re primarily interested in the bs for the predictors, which capture the relationship between each of them and the outcome.
We can report the statistical results for the hp predictor as: \(b = 0.38\), 95% CI \([0.31\), \(0.45]\), \(t(798) = 10.44\), \(p < .001\). As p is smaller than .05, this is a significant result.

We can report the statistical results for the defense predictor as: \(b = 0.41\), 95% CI \([0.35\), \(0.47]\), \(t(798) = 12.98\), \(p < .001\). As p is again smaller than .05, this is also a significant result.
How can you interpret the values of the bs in this model? Write down your thoughts in your RMarkdown.
For the hp predictor, the value of b1 is 0.38. This is the predicted change in attack for each unit change in hp, when defense is 0. In other words, for a Pokemon with 0 defense, for each additional hit point a Pokemon has, it’s predicted to have a 0.38 increase in attack stat - a positive relationship.

Conversely, for the defense predictor, the value of b2 is 0.41. This is the predicted change in attack for each unit change in defense, when hp is 0. In other words, for a Pokemon with 0 HP, for each additional defense point a Pokemon has, it’s predicted to have a 0.41 increase in attack stat - also a positive relationship.
Notice that interpretation of the bs as they are isn’t a very sensible way to present this information. There are no Pokemon with 0 defense, and a Pokemon with 0 HP would be dead! This isn’t wrong per se, but we can create a model in which the value of 0 is more sensible to interpret.
Rerun the model with mean-centred predictors.
First, create two new centred versions of hp and defense.
We need to create new predictor variables, centred around their respective means. Note that there are two ways to do this, and either one is fine.
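The centring code itself isn’t reproduced in this worksheet, so here is a sketch of both approaches, assuming the new columns are named hp_cntr and defense_cntr (the names used in the model further down):

```r
# Option 1: subtract each variable's mean with mutate()
pokemon <- pokemon %>%
  mutate(hp_cntr = hp - mean(hp),
         defense_cntr = defense - mean(defense))

# Option 2 (equivalent): use scale(), with scale = FALSE so that the
# variables are only centred, not standardised
pokemon <- pokemon %>%
  mutate(hp_cntr = as.numeric(scale(hp, scale = FALSE)),
         defense_cntr = as.numeric(scale(defense, scale = FALSE)))
```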
We can confirm this worked by comparing the two variables. First, I’ve calculated the mean of the uncentred defense variable, and then the mean of the centred one. (The latter doesn’t look like 0 at first glance, but notice the scientific notation!) So, the two variables have different means, and the centred one has a mean of 0.

Then, I ask for the correlation coefficient r between the two. The result is 1: a perfect correlation. So, all we’ve done is shift the defense variable along the x-axis so its mean aligns with 0. This effectively changes the meaning of 0 in the interpretation of our linear model.
Now, rerun your linear model using the two centred predictors. How is it similar to the previous model? How is it different?
First, construct the model. Then, let’s pull up the summaries of the previous model and the centred model to compare.
poke_lm2_cntr <- pokemon %>%
lm(attack ~ hp_cntr + defense_cntr, data = .)
poke_lm2_cntr %>% summary()
Call:
lm(formula = attack ~ hp_cntr + defense_cntr, data = .)
Residuals:
Min 1Q Median 3Q Max
-114.079 -16.849 -2.578 15.697 88.219
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 77.85768 0.94260 82.60 <2e-16 ***
hp_cntr 0.38177 0.03658 10.44 <2e-16 ***
defense_cntr 0.41017 0.03160 12.98 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 26.68 on 798 degrees of freedom
Multiple R-squared: 0.3136, Adjusted R-squared: 0.3119
F-statistic: 182.3 on 2 and 798 DF, p-value: < 2.2e-16
poke_lm2 %>% summary()
Call:
lm(formula = attack ~ hp + defense, data = pokemon)
Residuals:
Min 1Q Median 3Q Max
-114.079 -16.849 -2.578 15.697 88.219
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.58517 3.12285 6.912 9.78e-12 ***
hp 0.38177 0.03658 10.436 < 2e-16 ***
defense 0.41017 0.03160 12.982 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 26.68 on 798 degrees of freedom
Multiple R-squared: 0.3136, Adjusted R-squared: 0.3119
F-statistic: 182.3 on 2 and 798 DF, p-value: < 2.2e-16
The first thing to notice is the Estimates (the b-values). The bs for hp and defense are identical in both models - the relationship between each predictor and the outcome is the same. However, the intercept has changed.
Remember that the intercept represents the predicted value of the outcome when all the predictors are 0. This is still the case, but for the centred predictors, 0 represents the mean value. So, in the centred model, the intercept represents the predicted value of the outcome when all predictors are at their mean. Based on this, we can say that a Pokemon with the mean value of HP and the mean defense stat is predicted to have an attack stat of about 78.
Sound familiar? It should! Calculate the means of hp and defense, then have a quick look back at Task 7.2.
You might notice that the values I gave you for that calculation were the mean values for those two variables. The answer you got there should be the same as the intercept here, because it’s the same calculation.
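You can verify this with a quick calculation (assuming the pokemon data are loaded):

```r
# These should come out at (approximately) 69 and 73 - the values
# used in the prediction task earlier
mean(pokemon$hp)
mean(pokemon$defense)
```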
Before we wrap up, let’s do one more thing - check our \(R^2\) values.
Obtain the \(R^2\) value for the two-predictor model (doesn’t matter which one!). What can you conclude from this?
All you need is a quick glance().
poke_lm2 %>% broom::glance()
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.314 0.312 26.7 182. 6.33e-66 2 -3765. 7539.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
Multiplying by 100 to get a percentage, we can say that 31.36% of the variance in the outcome is accounted for by this model.
If we want to be a bit fancy, we could report \(R^2\) formally: \(R^2 = .31\), 90% CI \([0.27\), \(0.36]\)
Compare the \(R^2\) for this model with two predictors to the \(R^2\) for our original model. How much more variance have we been able to explain by adding in a second predictor?
First, let’s get the \(R^2\) for our first model again:
poke_lm %>% broom::glance()
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.169 0.168 29.3 162. 6.27e-34 1 -3842. 7690.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
So, our first model explained 16.86% of the variance in the outcome, and as we saw above, the second model explained 31.36%. So, by adding in a second predictor, we’ve been able to explain 14.5% more variance.
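If you’d rather not do the subtraction by hand, you can pull both R² values out of glance() and let R compute the difference (a sketch using the two model objects from this practical):

```r
# R-squared for the one- and two-predictor models
r2_one <- broom::glance(poke_lm)$r.squared
r2_two <- broom::glance(poke_lm2)$r.squared

# Difference expressed as a percentage of variance explained - roughly 14.5
(r2_two - r2_one) * 100
```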
Is this enough of an improvement to believe the second model is better than the first? Come to Lecture 10 to find out!
Well done today - that was a lot to get through. If you’re following along with this, you’re doing great! If you’re still feeling unsure, we’ll have more time to practice next week before the exam. It’s really important that you have a working knowledge of the linear model, though, so don’t give up!
And finally, if it’s stuck in your head like it is in mine, here’s one of my favourite covers to sing along with.