Practical 09
This week, and for the rest of this term (and next year!), we'll be working primarily with the Linear Model. Today we'll put together what we've learned in the lectures and practicals over the last two weeks and start building linear models ourselves.
Set up as usual:
# load the tidyverse packages
library(tidyverse)
The theme for today's task is inspired by a weird Internet thing that happened in the spring of 2014, called Twitch Plays Pokemon. A programmer in Australia hooked up the classic Pokemon Red game to the chat room on the video game streaming site Twitch. By typing commands in the chat, viewers could control what the character in the game did - but on the scale of thousands or tens of thousands of participants at once. The game turned into a massive social experiment and even spawned a minor cult before the cumulative 1.16 million viewers beat the game in about two and a half weeks.¹
If you aren’t really familiar with Pokemon, all you need to know for today is that it’s a series of video games (later TV show, graphic novels, etc. etc.) that take place in an alternate universe with magical animals called Pokemon. Pokemon battling (sort of like magical dogfighting?) is a core part of this universe, with children setting out at a young age to travel the world, capture many different types of Pokemon, and compete in massive championships, with the winner being crowned a Pokemon Master. It’s a bit ethically murky because at least some Pokemon seem to be sentient, and they can all talk, but only to say their own species name. You know, as I’m writing this description, it just keeps sounding weirder…
Figure 1: A Pokemon called Pikachu. You may have heard of him.
For this week’s stats practical, we’re doing a Twitch-Plays-Pokemon-style walkthrough of a dataset of Pokemon characteristics to practice the linear model. You don’t need to know anything about Pokemon to do this practical besides what’s in the box above, but if you weren’t in the live practical session, you might want to watch the recording before you get started.
Read in the pokemon dataset from the following link, which is a copy of this Pokemon dataset.
Link: https://and.netlify.app/docs/pokemon.csv
pokemon <- readr::read_csv("https://and.netlify.app/docs/pokemon.csv")
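If you want to double-check that the file read in correctly, dplyr's glimpse() gives a compact overview of the columns. This is just an optional sanity check, not part of the task:
# optional sanity check: column names, types, and the first few values
dplyr::glimpse(pokemon)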
First we need to choose a research question to investigate.
As future Pokemon Masters, we want to work out which Pokemon is the strongest. So, we'll keep using attack as our outcome variable, which quantifies a Pokemon's offensive capabilities.
For the predictor, choose one of the following:
- defense: How good the Pokemon's defense is
- height_m: The Pokemon's height in meters
- weight_kg: The Pokemon's weight in kilograms
- speed: How fast the Pokemon is
- hp: How many hit points (a measure of health) the Pokemon has
You should choose a different predictor than the one used in the guided part of the practical. For the purposes of the solutions in this worksheet, I'll use hp.
In your RMarkdown file, write down your prediction about the relationship between your chosen predictor and the outcome, attack.
As before, any prediction that makes sense to you is fine. The key is to make a prediction and think about what that relationship might look like in the data.
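Optionally, before plotting, you can also get a quick numeric overview of the variables you've chosen. A minimal sketch, assuming you're also using hp:
# optional: numeric summaries of the predictor and the outcome
summary(pokemon$hp)
summary(pokemon$attack)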
Next up, we should have a look at our data.
Create a scatterplot of your chosen predictor and attack as the outcome.
pokemon %>%
ggplot(aes(x = hp, y = attack)) +
geom_point()
If you like, label the axes better and clean up the plot with a theme.
pokemon %>%
ggplot(aes(x = hp, y = attack)) +
geom_point() +
scale_x_continuous(name = "Max Hit Points",
# seq() creates a vector of breaks at increments of 25
breaks = seq(0, max(pokemon$hp), by = 25)) +
scale_y_continuous(name = "Attack Stat",
breaks = seq(0, max(pokemon$attack), by = 25)) +
cowplot::theme_cowplot()
Stop and have a look at your plot. How would you draw the line of best fit? Is this the direction of relationship you expected?
Add a line of best fit to your plot. Is this what you expected?
pokemon %>%
ggplot(aes(x = hp, y = attack)) +
geom_point() +
scale_x_continuous(name = "Max Hit Points",
# seq() creates a vector of breaks at increments of 25
breaks = seq(0, max(pokemon$hp), by = 25)) +
scale_y_continuous(name = "Attack Stat",
breaks = seq(0, max(pokemon$attack), by = 25)) +
geom_smooth(method = "lm") +
cowplot::theme_cowplot()
Optionally, add another line to your plot that represents the null model.
Have a look at geom_hline(), or the code for the plots in the lecture!
pokemon %>%
ggplot(aes(x = hp, y = attack)) +
geom_point() +
scale_x_continuous(name = "Max Hit Points",
breaks = seq(0, max(pokemon$hp), by = 25)) +
scale_y_continuous(name = "Attack Stat",
breaks = seq(0, max(pokemon$attack), by = 25)) +
# the null model: a flat line at the mean of the outcome
geom_hline(yintercept = mean(pokemon$attack), linetype = "dashed") +
geom_smooth(method = "lm") +
cowplot::theme_cowplot()
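As an aside, the null model is just an intercept-only model, and you can fit it explicitly if you want to convince yourself that the dashed line really is sitting at the mean of attack. A minimal sketch (null_lm is just a name I've made up here):
# the null model: no predictors, only an intercept
null_lm <- lm(attack ~ 1, data = pokemon)
coef(null_lm)         # the intercept is...
mean(pokemon$attack)  # ...the same as the mean of the outcome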
Now that we have some idea of what to expect, we can create our linear model!
Use the lm() function to run your analysis and save the model in a new object called poke_lm.
# remember to use whichever predictor you've chosen!
poke_lm <- lm(attack ~ hp, data = pokemon)
Call this poke_lm object in the Console to view it, then write out the linear model from the output.
First, to see what’s in our model, we can just call it by typing its name in the console and pressing Enter.
poke_lm
Call:
lm(formula = attack ~ hp, data = pokemon)
Coefficients:
(Intercept) hp
43.5939 0.4969
As we’ve seen before, this simple version of the output gives us two numbers:
- (Intercept): the intercept, here 43.59
- hp: the slope, here 0.5
Keep in mind that the slope will always be named with the predictor that it's associated with!
So, we can write our linear model: \(\text{Attack}_i = 43.59 + 0.5 \times \text{HP}_i\)
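If you'd rather pull the exact (unrounded) values out of the model object than copy them from the printed output, coef() returns them as a named vector. A minimal sketch:
# extract the fitted coefficients as a named numeric vector
b <- coef(poke_lm)
b["(Intercept)"]  # b0, the intercept
b["hp"]           # b1, the slope for hp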
How can you interpret the value of \(b_1\) for this model? Write down your thoughts in your RMarkdown.
For the hp predictor, the value of \(b_1\) is 0.5. This is the predicted change in attack for each unit change in hp. In other words, for each additional hit point a Pokemon has, its attack is predicted to be 0.5 points higher - a positive relationship.
Using your equation, what attack value would you predict a Pokemon with 86 HP to have?
Start with our equation: \(\text{Attack}_i = 43.59 + 0.5 \times \text{HP}_i\)
All we need to do is replace the value of HP with the specific value we're interested in:
\(\text{attack} = 43.59 + 0.5 \times 86\)
\(\text{attack} = 43.59 + 43\)
\(\text{attack} = 86.59\)
Weirdly enough, a Pokemon with 86 HP should have an attack rating of about 86!
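You can also check this against R's own prediction, which uses the unrounded coefficients; it should come out a little lower (around 86.3) because we rounded the slope up to 0.5 in the hand calculation. A minimal sketch:
# predict attack for a (hypothetical) Pokemon with 86 HP
predict(poke_lm, newdata = data.frame(hp = 86))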
Now we have the model parameters, but we don’t want to just describe the line - we want to be able to say something about the population, not just our sample. For this, we need some more info!
Use broom::tidy() to get p-values for your \(b\)s. Is your predictor significant?
poke_lm %>% broom::tidy()
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 43.6 2.88 15.1 1.41e-45
2 hp 0.497 0.0390 12.7 6.27e-34
Yikes, look at those p-values! That's a lot of 0s. I think we can safely say that both these values are smaller than .05, although the one we really care about here is for the predictor, hp.
Add the conf.int = T argument to broom::tidy() to get confidence intervals. How can you interpret these? Do they "agree" with the p-value?
poke_lm %>% broom::tidy(conf.int = T)
# A tibble: 2 x 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 43.6 2.88 15.1 1.41e-45 37.9 49.3
2 hp 0.497 0.0390 12.7 6.27e-34 0.420 0.573
Again, we're mainly interested in \(b_1\), which is the coefficient for hp. First we want to know whether these confidence intervals cross or include 0. Looking at the two values, neither of them is exactly 0, and both are positive. So, they don't cross or include 0.
Remember that the confidence intervals give us a range of likely values of \(b_1\) from other samples. If the confidence intervals don't cross or include 0, as we can see here, it's unlikely that we could have gathered a sample where \(b_1\) was 0. This "agrees" with our p-value, which indicated that the probability of obtaining the results we have, assuming the null hypothesis is true, is very very small. Altogether the evidence suggests that it's quite unlikely that the relationship between HP and attack is actually 0 (no relationship at all).
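If you prefer base R, confint() should give you the same 95% confidence intervals without using broom. A minimal sketch:
# 95% confidence intervals for the intercept and the hp slope
confint(poke_lm, level = 0.95)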
Use broom::glance() to get \(R^2\) and adjusted \(R^2\) for this model. How much of the variance of the outcome is explained by the model for this sample? What would we expect this to be in the population?
poke_lm %>% broom::glance()
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.169 0.168 29.3 162. 6.27e-34 1 -3842. 7690.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
For now we can ignore everything in this output except for the values labeled r.squared and adj.r.squared. To answer the questions, we can simply convert them to percentages by multiplying by 100.
Variance in the outcome explained by the model in the sample: 16.86%
Estimated variance in the outcome explained by the model in the population: 16.76%
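If you'd rather let R do the percentage conversion, you can pull these two columns straight out of the glance() output. A minimal sketch (poke_fit is just a name I've made up here):
# convert R-squared and adjusted R-squared to percentages
poke_fit <- broom::glance(poke_lm)
poke_fit$r.squared * 100      # % of variance explained in the sample
poke_fit$adj.r.squared * 100  # estimated % explained in the population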
Finally, we can get all this same information - except for CIs - from the summary() function. I (Jennifer) like summary() because you can get a good overview of a lot of information quickly, but it's very inconvenient for reporting, so it's good to know how to use the broom functions as well.
Get a summary of your model. What do the stars mean?
poke_lm %>% summary()
Call:
lm(formula = attack ~ hp, data = pokemon)
Residuals:
Min 1Q Median 3Q Max
-162.812 -18.469 -3.344 16.656 108.091
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.59387 2.88447 15.11 <2e-16 ***
hp 0.49687 0.03903 12.73 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 29.34 on 799 degrees of freedom
Multiple R-squared: 0.1686, Adjusted R-squared: 0.1676
F-statistic: 162 on 1 and 799 DF, p-value: < 2.2e-16
We can see all of the same numbers here that we saw before, even labeled the same way. summary() also gives us the stars at the end of each line, which give us an idea of the alpha level at which each element of the model is significant. As you can see, our predictor has three stars, which corresponds to p < .001.
That’s as far as we’ll go today - we’ll practice reporting models after the break, when we have multiple predictors to explore.
The Linear Model is all of our destinies for the next year or so, so it’s important to get comfortable working with it. Feel free to explore this dataset further if you’d like more practice over the holiday!
¹ I (Jennifer) was doing the first year of my PhD in Edinburgh, actually sat at the desk next to our own Milan Valasek (the universe is weird like that!). Because I spent a lot of my time on the Internet instead of, y'know, doing my work, I watched a lot of this unfold in real time. It was even stranger than this makes it sound.