Practical 09
This week, and for the rest of this term (and next year!), we'll be working primarily with the Linear Model. Today we'll put together what we've learned in the lectures and practicals over the last two weeks and start building linear models ourselves.
Set up as usual:
# load the tidyverse packages
library(tidyverse)
The theme for today's task is inspired by a weird Internet thing that happened in the spring of 2014, called Twitch Plays Pokemon. A programmer in Australia hooked up the classic Pokemon Red game to the chat room on the video game streaming site Twitch. By typing commands in the chat, viewers could control what the character in the game did - but on the scale of thousands or tens of thousands of participants at once. The game turned into a massive social experiment and even spawned a minor cult before the cumulative 1.16 million viewers beat the game in about two and a half weeks.¹
If you aren’t really familiar with Pokemon, all you need to know for today is that it’s a series of video games (later TV show, graphic novels, etc. etc.) that take place in an alternate universe with magical animals called Pokemon. Pokemon battling (sort of like magical dogfighting?) is a core part of this universe, with children setting out at a young age to travel the world, capture many different types of Pokemon, and compete in massive championships, with the winner being crowned a Pokemon Master. It’s a bit ethically murky because at least some Pokemon seem to be sentient, and they can all talk, but only to say their own species name. You know, as I’m writing this description, it just keeps sounding weirder…
Figure 1: A Pokemon called Pikachu. You may have heard of him.
For this week’s stats practical, we’re doing a Twitch-Plays-Pokemon-style walkthrough of a dataset of Pokemon characteristics to practice the linear model. You don’t need to know anything about Pokemon to do this practical besides what’s in the box above, but if you weren’t in the live practical session, you might want to watch the recording before you get started.
Read in the pokemon dataset from the following link, which is a copy of this Pokemon dataset.
Link: https://and.netlify.app/docs/pokemon.csv
pokemon <- readr::read_csv("https://and.netlify.app/docs/pokemon.csv")
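If you want to double-check that the file read in correctly, dplyr's glimpse() gives a compact overview of the columns. This is just an optional sanity check, not part of the task:
# optional sanity check: column names, types, and the first few values
dplyr::glimpse(pokemon)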
First we need to choose a research question to investigate.
As future Pokemon Masters, we want to work out which Pokemon is the strongest. So, we'll keep using attack as our outcome variable, which quantifies a Pokemon's offensive capabilities.
For the predictor, choose one of the following:
- defense: How good the Pokemon's defense is
- height_m: The Pokemon's height in meters
- weight_kg: The Pokemon's weight in kilograms
- speed: How fast the Pokemon is
- hp: How many hit points (a measure of health) the Pokemon has
You should choose a different predictor than the one used in the guided part of the practical. For the purposes of the solutions in this worksheet, I'll use hp.
In your RMarkdown file, write down your prediction about the relationship between your chosen predictor and the outcome, attack.
As before, any prediction that makes sense to you is fine. The key is to make a prediction and think about what that relationship might look like in the data.
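Optionally, before plotting, you can also get a quick numeric overview of the variables you've chosen. A minimal sketch, assuming you're also using hp:
# optional: numeric summaries of the predictor and the outcome
summary(pokemon$hp)
summary(pokemon$attack)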
Next up, we should have a look at our data.
Create a scatterplot of your chosen predictor and attack as the outcome.
pokemon %>%
ggplot(aes(x = hp, y = attack)) +
geom_point()
If you like, label the axes better and clean up the plot with a theme.
pokemon %>%
ggplot(aes(x = hp, y = attack)) +
geom_point() +
scale_x_continuous(name = "Max Hit Points",
# seq() creates a vector of breaks at increments of 25
breaks = seq(0, max(pokemon$hp), by = 25)) +
scale_y_continuous(name = "Attack Stat",
breaks = seq(0, max(pokemon$attack), by = 25)) +
cowplot::theme_cowplot()
Stop and have a look at your plot. How would you draw the line of best fit? Is this the direction of relationship you expected?
Add a line of best fit to your plot. Is this what you expected?
pokemon %>%
ggplot(aes(x = hp, y = attack)) +
geom_point() +
scale_x_continuous(name = "Max Hit Points",
# seq() creates a vector of breaks at increments of 25
breaks = seq(0, max(pokemon$hp), by = 25)) +
scale_y_continuous(name = "Attack Stat",
breaks = seq(0, max(pokemon$attack), by = 25)) +
geom_smooth(method = "lm") +
cowplot::theme_cowplot()
Optionally, add another line to your plot that represents the null model.
Have a look at geom_hline(), or the code for the plots in the lecture!
pokemon %>%
ggplot(aes(x = hp, y = attack)) +
geom_point() +
scale_x_continuous(name = "Max Hit Points",
breaks = seq(0, max(pokemon$hp), by = 25)) +
scale_y_continuous(name = "Attack Stat",
breaks = seq(0, max(pokemon$attack), by = 25)) +
# the null model: a flat line at the mean of the outcome
geom_hline(yintercept = mean(pokemon$attack), linetype = "dashed") +
geom_smooth(method = "lm") +
cowplot::theme_cowplot()
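As an aside, the null model is just an intercept-only model, and you can fit it explicitly if you want to convince yourself that the dashed line really is sitting at the mean of attack. A minimal sketch (null_lm is just a name I've made up here):
# the null model: no predictors, only an intercept
null_lm <- lm(attack ~ 1, data = pokemon)
coef(null_lm)         # the intercept is...
mean(pokemon$attack)  # ...the same as the mean of the outcome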
Now that we have some idea of what to expect, we can create our linear model!
Use the lm() function to run your analysis and save the model in a new object called poke_lm.
# remember to use whichever predictor you've chosen!
poke_lm <- lm(attack ~ hp, data = pokemon)
Call this poke_lm object in the Console to view it, then write out the linear model from the output.
First, to see what’s in our model, we can just call it by typing its name in the console and pressing Enter.
poke_lm
Call:
lm(formula = attack ~ hp, data = pokemon)
Coefficients:
(Intercept) hp
43.5939 0.4969
As we’ve seen before, this simple version of the output gives us two numbers:
- (Intercept): the intercept, here 43.59
- hp: the slope, here 0.5
Keep in mind that the slope will always be named with the predictor that it's associated with!
So, we can write our linear model: \(\text{Attack}_i = 43.59 + 0.5 \times \text{HP}_i\)
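If you'd rather pull the exact (unrounded) values out of the model object than copy them from the printed output, coef() returns them as a named vector. A minimal sketch:
# extract the fitted coefficients as a named numeric vector
b <- coef(poke_lm)
b["(Intercept)"]  # b0, the intercept
b["hp"]           # b1, the slope for hp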
How can you interpret the value of \(b_1\) for this model? Write down your thoughts in your RMarkdown.
For the hp predictor, the value of \(b_1\) is 0.5. This is the predicted change in attack for each unit change in hp. In other words, for each additional hit point a Pokemon has, its attack is predicted to be 0.5 points higher - a positive relationship.
Using your equation, what attack value would you predict a Pokemon with 86 HP to have?
Start with our equation: \(\text{Attack}_i = 43.59 + 0.5 \times \text{HP}_i\)
All we need to do is replace the value of HP with the specific value we're interested in:
\(\text{attack} = 43.59 + 0.5 \times 86\)
\(\text{attack} = 43.59 + 43\)
\(\text{attack} = 86.59\)
Weirdly enough, a Pokemon with 86 HP should have an attack rating of about 86!
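You can also check this against R's own prediction, which uses the unrounded coefficients; it should come out a little lower (around 86.3) because we rounded the slope up to 0.5 in the hand calculation. A minimal sketch:
# predict attack for a (hypothetical) Pokemon with 86 HP
predict(poke_lm, newdata = data.frame(hp = 86))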
Now we have the model parameters, but we don’t want to just describe the line - we want to be able to say something about the population, not just our sample. For this, we need some more info!
Use broom::tidy() to get p-values for your \(b\)s. Is your predictor significant?
poke_lm %>% broom::tidy()
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 43.6 2.88 15.1 1.41e-45
2 hp 0.497 0.0390 12.7 6.27e-34
Yikes, look at those p-values! That's a lot of 0s. I think we can safely say that both these values are smaller than .05, although the one we really care about here is for the predictor, hp.
Add the conf.int = T argument to broom::tidy() to get confidence intervals. How can you interpret these? Do they "agree" with the p-value?
poke_lm %>% broom::tidy(conf.int = T)
# A tibble: 2 x 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 43.6 2.88 15.1 1.41e-45 37.9 49.3
2 hp 0.497 0.0390 12.7 6.27e-34 0.420 0.573
Again, we're mainly interested in \(b_1\), which is the coefficient for hp. First we want to know whether these confidence intervals cross or include 0. Looking at the two values, neither of them is exactly 0, and both are positive. So, they don't cross or include 0.
Remember that the confidence intervals give us a range of likely values of \(b_1\) from other samples. If the confidence intervals don't cross or include 0, as we can see here, it's unlikely that we could have gathered a sample where \(b_1\) was 0. This "agrees" with our p-value, which indicated that the probability of obtaining the results we have, assuming the null hypothesis is true, is very very small. Altogether the evidence suggests that it's quite unlikely that the relationship between HP and attack is actually 0 (no relationship at all).
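If you prefer base R, confint() should give you the same 95% confidence intervals without using broom. A minimal sketch:
# 95% confidence intervals for the intercept and the hp slope
confint(poke_lm, level = 0.95)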
Use broom::glance() to get \(R^2\) and adjusted \(R^2\) for this model. How much of the variance of the outcome is explained by the model for this sample? What would we expect this to be in the population?
poke_lm %>% broom::glance()
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.169 0.168 29.3 162. 6.27e-34 1 -3842. 7690.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
For now we can ignore everything in this output except for the values labeled r.squared and adj.r.squared. To answer the questions, we can simply convert them to percentages by multiplying by 100.
Variance in the outcome explained by the model in the sample: 16.86%
Estimated variance in the outcome explained by the model in the population: 16.76%
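If you'd rather let R do the percentage conversion, you can pull these two columns straight out of the glance() output. A minimal sketch (poke_fit is just a name I've made up here):
# convert R-squared and adjusted R-squared to percentages
poke_fit <- broom::glance(poke_lm)
poke_fit$r.squared * 100      # % of variance explained in the sample
poke_fit$adj.r.squared * 100  # estimated % explained in the population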
Finally, we can get all this same information - except for CIs - from the summary() function. I (Jennifer) like summary() because you can get a good overview of a lot of information quickly, but it's very inconvenient for reporting, so it's good to know how to use the broom functions as well.
Get a summary of your model. What do the stars mean?
poke_lm %>% summary()
Call:
lm(formula = attack ~ hp, data = pokemon)
Residuals:
Min 1Q Median 3Q Max
-162.812 -18.469 -3.344 16.656 108.091
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.59387 2.88447 15.11 <2e-16 ***
hp 0.49687 0.03903 12.73 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 29.34 on 799 degrees of freedom
Multiple R-squared: 0.1686, Adjusted R-squared: 0.1676
F-statistic: 162 on 1 and 799 DF, p-value: < 2.2e-16
We can see all of the same numbers here that we saw before, even labeled the same way. summary() also gives us the stars at the end of each line, which give us an idea of the alpha level at which each element of the model is significant. As you can see, our predictor has three stars, which corresponds to p < .001.
That’s as far as we’ll go today - we’ll practice reporting models after the break, when we have multiple predictors to explore.
The Linear Model is all of our destinies for the next year or so, so it’s important to get comfortable working with it. Feel free to explore this dataset further if you’d like more practice over the holiday!
¹ I (Jennifer) was doing the first year of my PhD in Edinburgh, actually sat at the desk next to our own Milan Valasek (the universe is weird like that!). Because I spent a lot of my time on the Internet instead of, y'know, doing my work, I watched a lot of this unfold in real time. It was even stranger than this makes it sound.