Practical 10
Today we’re continuing our work on the Linear Model, which will be our main focus for the rest of the term and throughout second year. This week we’ll start from the single-predictor model we made last week and expand it by adding more predictors.
In order to complete this practical successfully, you need to have completed the Week 9 practical first. If you haven’t done that yet, stop here and work through that practical before you carry on.
Set up as usual:
library(tidyverse)
We’re continuing this week with the pokemon dataset that we started working with in the last practical. More importantly, we’ll also be building on the same model and results that we looked at last time. If you didn’t complete that practical, or if it’s just been a hot minute since you had to think about the linear model, let’s look back at what we did in that practical before we move on.
Read in the pokemon dataset at the following link, which is a copy of this Pokemon dataset, and store it in a new object called pokemon.
Link: https://and.netlify.app/docs/pokemon.csv
pokemon <- readr::read_csv("https://and.netlify.app/docs/pokemon.csv")
As future Pokemon Masters, we want to work out which Pokemon is the strongest. So, we’ll keep using attack as our outcome variable, which quantifies a Pokemon’s offensive capabilities in battle. Last week we created a model using one predictor; I used hp, a measure of the Pokemon’s health, but you might have chosen something different.
Recreate your model with one predictor from the last practical and save it in a new object called poke_lm. Then, have a look at a summary of the results.
Call:
lm(formula = attack ~ hp, data = pokemon)
Residuals:
Min 1Q Median 3Q Max
-162.812 -18.469 -3.344 16.656 108.091
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.59387 2.88447 15.11 <2e-16 ***
hp 0.49687 0.03903 12.73 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 29.34 on 799 degrees of freedom
Multiple R-squared: 0.1686, Adjusted R-squared: 0.1676
F-statistic: 162 on 1 and 799 DF, p-value: < 2.2e-16
Use the summary you’ve produced to answer the following questions for your model:
If you’re stuck here, make sure you review the Week 9 practical in depth before you continue.
For my model with hp as the only predictor:

- The intercept is b0, and the slope of the predictor (hp) is b1. From the output we can see that b0 is 43.59 and b1 is 0.5.
- The relationship between hp and attack is positive - Pokemon with more health also tend to have a stronger attack. More specifically, for each additional point of health a Pokemon has (a unit increase in hp), its attack increases by 0.5.
- The p-value can be found in the column labelled Pr(>|t|). We want the value in this column on the row labelled with the name of the predictor. For my model, this value is very small, and definitely smaller than .05, so the positive relationship between hp and attack is significant. The *** code after the p-value in the output tells me that this result is significant at p < .001, which is also how I would report this result.

We now have a good idea of the relationship between our one predictor and attack. Next, let’s expand our model to include more predictors.
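If you want to pull these estimates straight out of the model object rather than reading them off the summary, base R’s coef() function returns them as a named vector (a small aside, not required for the task):

```r
# Returns a named vector containing the intercept (b0) and the hp slope (b1),
# i.e. about 43.59 and 0.50 for this model
coef(poke_lm)
```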
Choose another predictor from one of the following:

- defense: How good the Pokemon’s defense is
- height_m: The Pokemon’s height in meters
- weight_kg: The Pokemon’s weight in kilograms
- speed: How fast the Pokemon is
- hp: How many hit points (a measure of health) the Pokemon has

Again, you should choose a different predictor than the one in the guided part of the practical. For the purposes of the solutions in this worksheet, I’ll use defense.
In your RMarkdown file, write down your prediction about the relationship between your new chosen predictor and the outcome, attack.
As we’ve done before, any prediction is fine that makes sense to you. The key is that you make a prediction and think about what that relationship might look like in the data.
Next up, we should have a look at our data. We’re first going to do this for our new predictor only, then make a more complex figure with both predictors.
Create a scatterplot of your new predictor and attack as the outcome.
pokemon %>%
ggplot(aes(x = defense, y = attack)) +
geom_point()
If you like, label the axes better and clean up the plot with a theme.
pokemon %>%
ggplot(aes(x = defense, y = attack)) +
geom_point() +
scale_x_continuous(name = "Defense Stat",
# seq() creates a vector of breaks at increments of 25
breaks = seq(0, max(pokemon$defense), by = 25)) +
scale_y_continuous(name = "Attack Stat",
breaks = seq(0, max(pokemon$attack), by = 25)) +
cowplot::theme_cowplot()
Stop and have a look at your plot. How would you draw the line of best fit? Is this the direction of relationship you expected?
Add a line of best fit to your plot. Is this what you expected?
The answer to this question depends on what you predicted originally, and the line of best fit you suggested before you added it to your plot!
pokemon %>%
ggplot(aes(x = defense, y = attack)) +
geom_point() +
scale_x_continuous(name = "Defense Stat",
# seq() creates a vector of breaks at increments of 25
breaks = seq(0, max(pokemon$defense), by = 25)) +
scale_y_continuous(name = "Attack Stat",
breaks = seq(0, max(pokemon$attack), by = 25)) +
geom_smooth(method = "lm") +
cowplot::theme_cowplot()
It’s always a good idea to visualise your data before modeling, for example with the scatterplots we’ve made above. However, with more than one predictor the plots get hard to read, and with more than two predictors it isn’t even possible to make a plot! Typically we’re more interested in the b estimates than in elaborate plots anyway. In my case, both hp and defense have positive relationships with attack. This is enough for our purposes, but if you’d like more plotting, see the optional box below.
If you’re really keen to make a plot with both predictors, you can get a basic one really easily with R.
I learned how to do this by Googling “3d plots in R” and following this tutorial. For just about anything you want to do in R, just search for “[thing you want to do] in R” and there are tons of resources available!
First, install and load the scatterplot3d package.
install.packages("scatterplot3d")
library(scatterplot3d)
Next, create a basic plot with your two predictors on the x- and y-axes, with the outcome on the z-axis. If you get a bit confused, use the tutorial linked above for help!
scatterplot3d(x = pokemon$hp, y = pokemon$defense, z = pokemon$attack)
…Yuck. That’s not too helpful, is it?
You can make the plot better if you like by trying out the options in the tutorial linked above. However, this might give you an idea why plots with three variables are a bit tricky to use/interpret! For now, we’re going to move on to the numbers, which are much easier to get on with.
Now that we have some idea of what to expect, we can create our linear model. Since we already have a poke_lm object containing our model with one predictor, we need a new object to contain our second model.

To do this, we need to add the second predictor into the formula - literally, by using +!
Use the lm() function to run the analysis with both predictors and save the model in a new object called poke_lm2.
# remember to use whichever predictor you've chosen!
poke_lm2 <- lm(attack ~ hp + defense, data = pokemon)
Call this poke_lm2 object in the Console to view it, then write out the linear model from the output.
First, to see what’s in our model, we can just call it by typing its name in the console and pressing Enter.
poke_lm2
Call:
lm(formula = attack ~ hp + defense, data = pokemon)
Coefficients:
(Intercept) hp defense
21.5852 0.3818 0.4102
This simple version of the output now gives us three numbers:

- (Intercept): the intercept, here 21.59
- hp: the slope for this predictor, here 0.38
- defense: the slope for the second predictor, here 0.41

Keep in mind that the slope will always be named with the predictor that it’s associated with!
So, we can write our linear model:
Attack\(_i\) = 21.59 + 0.38 \(\times\) HP\(_i\) + 0.41 \(\times\) Defense\(_i\)
Choose two values, one for HP and one for defense. You can pick them at random if you want. Using your equation, what attack value would you predict a Pokemon with those stats to have?

If you’re using the hp and defense predictors, use these values: 69 HP and 73 defense.
Start with our equation: Attack\(_i\) = 21.59 + 0.38 \(\times\) HP\(_i\) + 0.41 \(\times\) Defense\(_i\)
All we need to do is replace the names of the variables, HP and Defense, with the specific values for those predictors that we’re interested in:
attack = 21.59 + 0.38 \(\times\) 69 + 0.41 \(\times\) 73
attack = 21.59 + 26.22 + 29.93
attack = 77.74
So, a Pokemon with 69 HP and 73 defense should have an attack rating of about 78.
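You can also get R to do this calculation for you with predict(). A minimal sketch, assuming your two-predictor model is stored in poke_lm2 with hp and defense as the predictors:

```r
# Predict attack for a (hypothetical) Pokemon with 69 HP and 73 defense
predict(poke_lm2, newdata = data.frame(hp = 69, defense = 73))
# Should come out at roughly 78, matching the hand calculation above
```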
Which predictor(s) are significant? How do you know? Report the results in your Markdown.
Get a summary of your model, or maybe tidy it up a bit, to get p-values.
First, we need output that gives us p-values. Either summary() or broom::tidy() will work, but I’m using the latter so I can get confidence intervals as well.
poke_lm2 %>% broom::tidy(conf.int = TRUE)
# A tibble: 3 x 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 21.6 3.12 6.91 9.78e-12 15.5 27.7
2 hp 0.382 0.0366 10.4 5.45e-24 0.310 0.454
3 defense 0.410 0.0316 13.0 4.23e-35 0.348 0.472
First, we can essentially ignore the intercept. It’s useful for constructing the equation as we did above, but with multiple predictors it becomes hard to interpret usefully. As we’ve said before, we’re primarily interested in the bs for the predictors, which capture the relationship between each of them and the outcome.
We can report the statistical results for the hp predictor as: \(b = 0.38\), 95% CI \([0.31\), \(0.45]\), \(t(798) = 10.44\), \(p < .001\). As p is smaller than .05, this is a significant result.

We can report the statistical results for the defense predictor as: \(b = 0.41\), 95% CI \([0.35\), \(0.47]\), \(t(798) = 12.98\), \(p < .001\). As p is again smaller than .05, this is also a significant result.
How can you interpret the values of the bs in this model? Write down your thoughts in your RMarkdown.
For the hp predictor, the value of b1 is 0.38. This is the predicted change in attack for each unit change in hp, when defense is 0. In other words, for a Pokemon with 0 defense, for each additional hit point a Pokemon has, it’s predicted to have a 0.38 increase in attack stat - a positive relationship.

Conversely, for the defense predictor, the value of b2 is 0.41. This is the predicted change in attack for each unit change in defense, when hp is 0. In other words, for a Pokemon with 0 HP, for each additional defense point a Pokemon has, it’s predicted to have a 0.41 increase in attack stat - also a positive relationship.
Notice that interpretation of the bs as they are isn’t a very sensible way to present this information. There are no Pokemon with 0 defense, and a Pokemon with 0 HP would be dead! This isn’t wrong per se, but we can create a model in which the value of 0 is more sensible to interpret.
Rerun the model with mean-centred predictors.
First, create two new centred versions of hp and defense.
We need to create new predictor variables, centred around their respective means. Note that there are two ways to do this, and either one is fine.
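The centring code itself isn’t reproduced in this worksheet, so here is a sketch of both approaches, assuming the new columns are named hp_cntr and defense_cntr (the names used in the model further down):

```r
# Option 1: subtract each variable's mean with mutate()
pokemon <- pokemon %>%
  mutate(hp_cntr = hp - mean(hp),
         defense_cntr = defense - mean(defense))

# Option 2 (equivalent): use scale(), with scale = FALSE so that the
# variables are only centred, not standardised
pokemon <- pokemon %>%
  mutate(hp_cntr = as.numeric(scale(hp, scale = FALSE)),
         defense_cntr = as.numeric(scale(defense, scale = FALSE)))
```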
We can confirm this worked by comparing the two variables. First, I’ve calculated the mean of the uncentred defense variable, and then the mean of the centred one. (The latter doesn’t look like 0 at first glance, but notice the scientific notation!) So, the two variables have different means, and the centred one has a mean of 0.

Then, I ask for the correlation coefficient r between the two. The result is 1: a perfect correlation. So, all we’ve done is shift the defense variable along the x-axis so its mean aligns with 0. This effectively changes the meaning of 0 in the interpretation of our linear model.
Now, rerun your linear model using the two centred predictors. How is it similar to the previous model? How is it different?
First, construct the model. Then, let’s pull up the summaries of the previous model and the centred model to compare.
poke_lm2_cntr <- pokemon %>%
lm(attack ~ hp_cntr + defense_cntr, data = .)
poke_lm2_cntr %>% summary()
Call:
lm(formula = attack ~ hp_cntr + defense_cntr, data = .)
Residuals:
Min 1Q Median 3Q Max
-114.079 -16.849 -2.578 15.697 88.219
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 77.85768 0.94260 82.60 <2e-16 ***
hp_cntr 0.38177 0.03658 10.44 <2e-16 ***
defense_cntr 0.41017 0.03160 12.98 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 26.68 on 798 degrees of freedom
Multiple R-squared: 0.3136, Adjusted R-squared: 0.3119
F-statistic: 182.3 on 2 and 798 DF, p-value: < 2.2e-16
poke_lm2 %>% summary()
Call:
lm(formula = attack ~ hp + defense, data = pokemon)
Residuals:
Min 1Q Median 3Q Max
-114.079 -16.849 -2.578 15.697 88.219
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.58517 3.12285 6.912 9.78e-12 ***
hp 0.38177 0.03658 10.436 < 2e-16 ***
defense 0.41017 0.03160 12.982 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 26.68 on 798 degrees of freedom
Multiple R-squared: 0.3136, Adjusted R-squared: 0.3119
F-statistic: 182.3 on 2 and 798 DF, p-value: < 2.2e-16
The first thing to notice is the Estimates (the b-values). The bs for hp and defense are identical in both models - the relationship between each predictor and the outcome is the same. However, the intercept has changed.
Remember that the intercept represents the predicted value of the outcome when all the predictors are 0. This is still the case, but for the centred predictors, 0 represents the mean value. So, in the centred model, the intercept represents the predicted value of the outcome when all predictors are at their mean. Based on this, we can say that a Pokemon with the mean value of HP and the mean defense stat is predicted to have an attack stat of about 78.
Sound familiar? It should! Calculate the means of hp and defense, then have a quick look back at Task 7.2.
You might notice that the values I gave you for that calculation were the mean values for those two variables. The answer you got there should be the same as the intercept here, because it’s the same calculation.
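You can verify this with a quick calculation (assuming the pokemon data are loaded):

```r
# These should come out at (approximately) 69 and 73 - the values
# used in the prediction task earlier
mean(pokemon$hp)
mean(pokemon$defense)
```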
Before we wrap up, let’s do one more thing - check our \(R^2\) values.
Obtain the \(R^2\) value for the two-predictor model (doesn’t matter which one!). What can you conclude from this?
All you need is a quick glance().
poke_lm2 %>% broom::glance()
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.314 0.312 26.7 182. 6.33e-66 2 -3765. 7539.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
Multiplying by 100 to get a percentage, we can say that 31.36% of the variance in the outcome is accounted for by this model.
If we want to be a bit fancy, we could report \(R^2\) formally: \(R^2 = .31\), 90% CI \([0.27\), \(0.36]\)
Compare the \(R^2\) for this model with two predictors to the \(R^2\) for our original model. How much more variance have we been able to explain by adding in a second predictor?
First, let’s get the \(R^2\) for our first model again:
poke_lm %>% broom::glance()
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.169 0.168 29.3 162. 6.27e-34 1 -3842. 7690.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
So, our first model explained 16.86% of the variance in the outcome, and as we saw above, the second model explained 31.36%. So, by adding in a second predictor, we’ve been able to explain 14.5% more variance.
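If you’d rather not do the subtraction by hand, you can pull both R² values out of glance() and let R compute the difference (a sketch using the two model objects from this practical):

```r
# R-squared for the one- and two-predictor models
r2_one <- broom::glance(poke_lm)$r.squared
r2_two <- broom::glance(poke_lm2)$r.squared

# Difference expressed as a percentage of variance explained - roughly 14.5
(r2_two - r2_one) * 100
```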
Is this enough of an improvement to believe the second model is better than the first? Come to Lecture 10 to find out!
Well done today - that was a lot to get through. If you’re following along with this, you’re doing great! If you’re still feeling unsure, we’ll have more time to practice next week before the exam. It’s really important that you have a working knowledge of the linear model, though, so don’t give up!
And finally, if it’s stuck in your head like it is in mine, here’s one of my favourite covers to sing along with.