Tutorial 9
Last time you learnt about the equation for a straight line and how it can be used to model a relationship between a predictor and an outcome variable by fitting a linear model to data. In this tutorial, we’ll talk about how we find the line of best fit that characterises the linear model and also what we mean by “best fit”. In other words, we’ll see how we (well, R
really) can find the values of the intercept and slope of the line that fits the data best. We will also talk about how the linear model is tested in the framework of Null Hypothesis Significance Testing (NHST). Finally, once we’ve found and tested our linear model, we will see how we can evaluate how good this model is.
Before we start, why not get in the mood by answering a few quiz questions!
Remember, the equation for the linear model is
\[y_i = b_0 + b_1\times x_i + e_i\]
What does \(b_1\) represent in the formula?
The increase in outcome per unit increase in predictor
What value of the outcome does the model \(y_i = 13 - 2.5 \times x_i\) predict when the value of the predictor is 4? (give answer to 2 decimal places)
3.00
What outcome value does the same model predict when the predictor is −1? (give answer to 2 decimal places)
15.50
What kind of relationship is this?
Negative
Remember that \(x\) represents the predictor in the linear model. To solve for the outcome, simply replace \(x\) in the equation with the given value of the predictor and finish up the maths!
Make sure you give your answer to two decimal places, even if there are 0s.
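For example, taking the model from the question above and a predictor value of 4, the substitution works out as
\[y = 13 - 2.5 \times 4 = 13 - 10 = 3.00\]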
The \(i\) subscript in \(y_i\), \(x_i\), and \(e_i\) is simply the index of the given data point. You can imagine it as the row number in your data set: \(i = 1\) is the first row, \(i = 2\) is the second row, \(i = n\) is the \(n^{\text{th}}\) row of your data.
The model then expresses a different value of the outcome variable \(y\) for each value of the predictor variable \(x\). Because each point has a different amount of “error” (difference between what the model predicts for each row and what we observe), there’s also a separate \(e_i\) for each data point.
However, the model intercept and slope (the \(b_0\) and \(b_1\) coefficients) are the same for every point, and so they do not have an \(i\) subscript.
All you need is this tutorial and RStudio. Remember that you can easily switch between windows with the Alt + ↹ Tab (Windows) and ⌘ Command + ⇥ Tab (Mac OS) shortcuts.
Create a new week_09 project and open it in RStudio. Then, create two new folders in the new week_09 folder: r_docs and data. Finally, open a new R Markdown file and save it in the r_docs folder. Get into the habit of creating a new code chunk in the file for each of the tasks below.
Remember, you can add new code chunks by typing ```{r}, pressing Enter, and then typing ``` again.
In this tutorial, we’ll return to the lovely penguins data set from the palmerpenguins package. You will also need the tidyverse and broom packages (install them if you don’t have them yet). Put the code that loads all of these in the setup chunk of your Rmd file and run it.
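If you’re not sure what that should look like, here is a minimal sketch of a setup chunk (just one way of doing it):

# packages used in this tutorial
library(tidyverse)       # data wrangling and plotting
library(broom)           # tidying model output into tibbles
library(palmerpenguins)  # the penguins data set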
Remember how we said in the lecture that there are many ways to draw a line through data. Take a moment to play with the intercept and slope of the line in the plot below. Use the yellow circles to drag the line to a position where you think it adequately fits the scattered data points.
If you’re thinking that it’s not really obvious how precisely to fit the line, you are correct! It’s really difficult to just eyeball the position and orientation. What we need is some sort of criterion for deciding which line fits the data best.
You learnt about the concepts of deviation and sum of squares last term (you can refresh your memory by revisiting the lecture handout) but to reiterate, a deviation is the difference of each point in the data from some reference value, for example the mean. We use deviations to quantify the amount of variability in a variable. However, because the individual deviations of each point from the mean of a variable always add up to zero, we square the deviations and add those up instead. The result of this operation is the sum of squares and it is the basis for the calculation of the variance and the standard deviation (check out the formulae in the handout linked above if you don’t know what they are).
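To make this concrete, here is a small sketch in R (the numbers are made up purely for illustration) showing that raw deviations from the mean add up to zero, while their squares give us the sum of squares and, from it, the variance:

x <- c(2, 4, 7, 9, 13)               # made-up data, just for illustration
deviations <- x - mean(x)            # deviation of each point from the mean
sum(deviations)                      # always (essentially) zero
sum(deviations^2)                    # the sum of squares
sum(deviations^2) / (length(x) - 1)  # ...which the variance is based on
var(x)                               # same number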
Just like we can calculate the distance of points from a point, we can calculate their distance from a line. After all, the mean can be equally well represented as a point on the number line or as a line on a 2-axis plot! Let’s see what that looks like.
In the applet below, change the slope and intercept of the line so that it is the same as the dashed line that represents the mean of the y variable.
What is the intercept of this line?
Mean of y
What is the slope of this line? (2 decimal places)
0.00
The thin dashed lines connecting each point with the line are the deviations and the teal squares are, well, their squares. The purple square under the plot is the sum of squares: its area is equal to the area of all the teal squares combined.
Move the line about and change its slope and watch what happens with the teal and purple squares. Have a go!
Let’s talk about jargon for a second. While the term deviation is reserved for differences between data points and their mean, the vertical distances of points from the line of the linear model are called residuals.
There’s no principled difference between the two, since the mean can be represented as a line in the context of the linear model, but it’s important that you are familiar with the term residual.
So, when we talk about the Sum of Squared Residuals (SSR, or \(SS_R\)), we mean the total area of these teal squares, which is shown in the applet above as the purple square under the plot.
SSR tells us the extent to which the line (i.e., the linear model) does not fit the data. If the points were all perfectly aligned, the line connecting them would have SSR = 0.
There are several ways of finding the line of best fit, depending on what your definition of “best” is, but the most straightforward way is to get the line that results in the smallest possible SSR. This method is called Ordinary Least Squares (OLS) regression because it finds the line with the smallest (least) sum of squared residuals.
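If it helps to see the criterion in code, here is a small sketch (the data and the candidate line are made up for illustration) of how the SSR of any candidate line could be computed:

# sum of squared residuals for a candidate line with intercept b0 and slope b1
ssr <- function(b0, b1, x, y) {
  predicted <- b0 + b1 * x   # what the candidate line predicts for each x
  sum((y - predicted)^2)     # squared vertical distances, added up
}

# made-up data and a made-up candidate line
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
ssr(b0 = 0, b1 = 2, x = x, y = y)

OLS simply picks the intercept and slope that make this number as small as possible.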
Play around with the visualisation and try to find the OLS line of best fit. The line will change colour when you get it right!
What is the value of SSR of the line of best fit in the visualisation? (2 decimal places)
58.80
That’s pretty much all there is to finding the intercept and slope of the line (there’s a little more maths but you don’t have to worry about it). Of course, we don’t have to find the line of best fit ourselves. We can ask R to do it for us.
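As an aside, for a model with a single predictor the OLS solution can even be written down directly. Here is a hedged sketch (again with made-up numbers) that computes the coefficients by hand and checks them against lm():

# made-up data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# closed-form OLS estimates for a single predictor
b1 <- cov(x, y) / var(x)      # slope
b0 <- mean(y) - b1 * mean(x)  # intercept
c(b0, b1)

# R agrees
coef(lm(y ~ x))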
The penguins object from the package palmerpenguins contains the dataset we worked with last term. Save it in your environment as peng_data so that you don’t have to type out palmerpenguins::penguins every time you want to do something with it.
peng_data <- palmerpenguins::penguins
Make sure you know what variables there are in the dataset and look at the rough summary of it.
summary(peng_data)
species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
Let’s build a model of penguin flipper length predicted by the body mass of these majestic creatures.
First of all, let’s visualise the relationship between these two variables on a scatter plot. Create a plot that looks something like this:
peng_data %>%
ggplot2::ggplot(aes(body_mass_g, flipper_length_mm)) +
geom_point(alpha=.4) +
labs(x="Body mass (g)", y="Flipper length (mm)") +
cowplot::theme_cowplot()
The plot gives us a decent impression of what the relationship between the two variables is like.
What can be said of the relationship between body mass and flipper length in the penguins in our sample?
As body mass increases, so does flipper length
Looking at the plot, what can you safely claim about the slope of the line of best fit through the data?
It’s smaller than 1 but larger than 0
The relationship is positive and so the slope of the line must also be positive. However, given the scale of the variables (body mass in 1,000s and flipper length in 100s), a unit change in body mass results in a change in flipper length of less than one.
lm()
Before we fit a model predicting flipper length by body mass to our data, let’s see what the null model looks like. The null – or intercept-only – model is the most basic model of a variable we can build. As the name suggests, it only includes the intercept, meaning that the slope is fixed to the value of zero. Conceptually, this model represents our best guess about the value of any observation of the outcome variable, if we have no information about its relationship with any other variables.
Let’s build us one such model!
Fit the intercept-only model of flipper_length_mm and save its output in an object called null_model. The R formula notation of a null model is outcome ~ 1.
null_model <- lm(flipper_length_mm ~ 1, data = peng_data)
Let’s have a look at what’s inside this object.
null_model
Call:
lm(formula = flipper_length_mm ~ 1, data = peng_data)
Coefficients:
(Intercept)
200.9
What R returns is just a printout of the most basic information about the model contained in the object. The object, however, has much more stuff in it.
Tell R to print out the structure of the null_model object.
Use the str() function to do this.
str(null_model)
List of 12
$ coefficients : Named num 201
..- attr(*, "names")= chr "(Intercept)"
$ residuals : Named num [1:342] -19.92 -14.92 -5.92 -7.92 -10.92 ...
..- attr(*, "names")= chr [1:342] "1" "2" "3" "5" ...
$ effects : Named num [1:342] -3715.57 -13.89 -4.89 -6.89 -9.89 ...
..- attr(*, "names")= chr [1:342] "(Intercept)" "" "" "" ...
$ rank : int 1
$ fitted.values: Named num [1:342] 201 201 201 201 201 ...
..- attr(*, "names")= chr [1:342] "1" "2" "3" "5" ...
$ assign : int 0
$ qr :List of 5
..$ qr : num [1:342, 1] -18.4932 0.0541 0.0541 0.0541 0.0541 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:342] "1" "2" "3" "5" ...
.. .. ..$ : chr "(Intercept)"
.. ..- attr(*, "assign")= int 0
..$ qraux: num 1.05
..$ pivot: int 1
..$ tol : num 1e-07
..$ rank : int 1
..- attr(*, "class")= chr "qr"
$ df.residual : int 341
$ na.action : 'omit' Named int [1:2] 4 272
..- attr(*, "names")= chr [1:2] "4" "272"
$ call : language lm(formula = flipper_length_mm ~ 1, data = peng_data)
$ terms :Classes 'terms', 'formula' language flipper_length_mm ~ 1
.. ..- attr(*, "variables")= language list(flipper_length_mm)
.. ..- attr(*, "factors")= int(0)
.. ..- attr(*, "term.labels")= chr(0)
.. ..- attr(*, "order")= int(0)
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(flipper_length_mm)
.. ..- attr(*, "dataClasses")= Named chr "numeric"
.. .. ..- attr(*, "names")= chr "flipper_length_mm"
$ model :'data.frame': 342 obs. of 1 variable:
..$ flipper_length_mm: int [1:342] 181 186 195 193 190 181 195 193 190 186 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language flipper_length_mm ~ 1
.. .. ..- attr(*, "variables")= language list(flipper_length_mm)
.. .. ..- attr(*, "factors")= int(0)
.. .. ..- attr(*, "term.labels")= chr(0)
.. .. ..- attr(*, "order")= int(0)
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(flipper_length_mm)
.. .. ..- attr(*, "dataClasses")= Named chr "numeric"
.. .. .. ..- attr(*, "names")= chr "flipper_length_mm"
..- attr(*, "na.action")= 'omit' Named int [1:2] 4 272
.. ..- attr(*, "names")= chr [1:2] "4" "272"
- attr(*, "class")= chr "lm"
As you can see, this is quite a long list of things. What’s important to understand is that there’s a lot of useful information returned by lm() and that we can query our objects for all kinds of things using a few handy functions.
One of these functions is summary(), which we have already encountered.
Ask R to give you a summary of the model.
summary(null_model)
Call:
lm(formula = flipper_length_mm ~ 1, data = peng_data)
Residuals:
Min 1Q Median 3Q Max
-28.915 -10.915 -3.915 12.085 30.085
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 200.9152 0.7604 264.2 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.06 on 341 degrees of freedom
(2 observations deleted due to missingness)
Let’s break down the summary that R gives us:
Call:
lm(formula = flipper_length_mm ~ 1, data = peng_data)
This is just R’s way of reminding us what model we are looking at and what function call was used to fit it.
Residuals:
Min 1Q Median 3Q Max
-28.915 -10.915 -3.915 12.085 30.085
These are some basic descriptive statistics of the model residuals. Remember that residuals are just the differences between what the model predicts – the line of best fit – and our actual observed data. The residuals are then a bunch of numbers and there are as many of them as we have data points. This little summary tells us the minimum and maximum value of these numbers, their median, and their 1st and 3rd quartiles.
This information is useful for doing so-called “model diagnostics” which is something we will learn next term so, for the time being, don’t worry about it too much.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 200.9152 0.7604 264.2 <2e-16 ***
This is the main meat of the summary: the model coefficients and their associated statistics. The model we fitted is an intercept-only model so it stands to reason that it only has one coefficient – the intercept. If we also included a predictor, this little table would have two rows, one for the intercept and one for the slope.
Let’s have a look at what the columns represent:
Estimate is the value of \(b_0\), the intercept itself. Since the point of all statistics is to estimate population values based on samples, this \(b_0\) coefficient is also used as the estimate (\(\hat{\beta_0}\)) of the population intercept \(\beta_0\).
Std. Error is the estimated standard error of this estimate. Just like we can estimate the standard error of the mean, SEM, we can estimate the standard error of any other estimate. Yeah, there is a lot of estimation going on in this game.
t value is, well, the t-statistic associated with the estimate and is calculated simply as the ratio of the first two numbers:
\[\begin{aligned}t&=\frac{b_0}{\widehat{SE}_{b_0}}\\&=\frac{200.9152}{0.7604}\\&\approx264.2\end{aligned}\]
Pr(>|t|) is the p-value associated with this t-statistic and, just like any p-value, it tells us the probability of getting a t at least as extreme as the one we got if the null hypothesis is true.
The null hypothesis with respect to every row of this summary table is that the population value of the parameter (in this case \(b_0\)) is zero! If the p-value associated with the test statistic is smaller than our chosen \(\alpha\) level, we reject this null hypothesis and claim that the population value of the given parameter is statistically significantly different from 0.
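If you want to check that last calculation yourself, you can pull the numbers straight out of the summary (just a quick sanity check):

# the t value is simply the estimate divided by its standard error
coefs <- summary(null_model)$coefficients
coefs[1, "Estimate"] / coefs[1, "Std. Error"]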
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This is just a legend for the asterisks next to the p-values:
* – result is significant at \(\alpha\) = .05
** – result is significant at \(\alpha\) = .01
*** – result is significant at \(\alpha\) = .001
. – result is significant at \(\alpha\) = .1 (not really used much)
Residual standard error: 14.06 on 341 degrees of freedom
  (2 observations deleted due to missingness)
This is additional information about our model and, at this stage, you don’t need to worry about it. However, notice that R is telling us that two rows were deleted from the dataset because they contained missing values in the variable used to fit the model!
Without checking the mean of flipper_length_mm, what’s the relationship of the intercept of the null model to the mean of the outcome?
They are the same
The null model represents the mean of the outcome and its only parameter is the intercept, and so the intercept of the null model is the mean of the outcome variable.
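You can verify this with a quick check (remembering that the variable contains missing values):

# the intercept of the null model...
coef(null_model)
# ...is the same as the mean of the outcome (ignoring the NAs)
mean(peng_data$flipper_length_mm, na.rm = TRUE)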
OK, now that we know how to read the output of the linear model, let’s fit the model we’ve been meaning to – one that predicts flipper length by body mass of a penguin.
Fit this model using lm() and save its output into body_mass_model.
body_mass_model <- lm(flipper_length_mm ~ body_mass_g, data = peng_data)
Ask R for the summary of this model.
summary(body_mass_model)
Call:
lm(formula = flipper_length_mm ~ body_mass_g, data = peng_data)
Residuals:
Min 1Q Median 3Q Max
-23.7626 -4.9138 0.9891 5.1166 16.6392
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.367e+02 1.997e+00 68.47 <2e-16 ***
body_mass_g 1.528e-02 4.668e-04 32.72 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.913 on 340 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.759, Adjusted R-squared: 0.7583
F-statistic: 1071 on 1 and 340 DF, p-value: < 2.2e-16
As you can see, the printout now has one more row in the Coefficients table and some additional information about the model fit:
Multiple R-squared: 0.759, Adjusted R-squared: 0.7583
F-statistic: 1071 on 1 and 340 DF, p-value: < 2.2e-16
As explained in the lecture, \(R^2\) (here labelled Multiple R-squared) can be interpreted as the proportion of variance in the outcome accounted for by the predictor.
Let’s dig into this a little.
First, look at this plot:
Look at how much the points are dispersed above and below the purple line that represents the null model (\(\bar{y}\)). That will give you an impression of the total variance in the outcome variable.
Now look at how much the points are scattered with respect to the line of our model (above and below it, not side-to-side). You can see that they vary much less than with respect to the intercept-only model with no predictor.
That means that the predictor accounts for some of the variance in the outcome. So, if we know the body mass of a penguin, we can more accurately guess its flipper length than if we don’t know its body mass!
That’s exactly what it means for a variable to predict another variable – it makes our guess about the likely value of the outcome more “educated” and precise.
As for the multiple \(R^2\) and the F-statistic, we will talk about what they are for next week.
Answer the following questions.
How many coefficients are there in this model? (whole number)
2
What’s the value of the slope? (3 decimal places and watch out for scientific notation!)
0.015
Is the relationship between body mass and flipper length statistically significant at α = .05?
Yes
What proportion of the total variance in flipper length is accounted for by a penguin’s body mass? (in % to 1 decimal place, e.g., 33.3)
75.9
Finally, visualise the model by adding the line of best fit to our scatterplot. Perhaps something like this.
We did this in the previous tutorial. The function is geom_smooth()
peng_data %>%
ggplot2::ggplot(aes(body_mass_g, flipper_length_mm)) +
geom_point(alpha=.4) +
#for colours, change theme_col and second_col to whatever colours you like!
geom_smooth(method = "lm", colour = theme_col, fill = second_col) +
labs(x="Body mass (g)", y="Flipper length (mm)") +
cowplot::theme_cowplot()
The one drawback of the summary() of an lm object is that it doesn’t come in a nice tibble. Luckily, the package broom was written precisely to tidy the output of models into this familiar and convenient format. The package contains a few handy functions we’ll be using quite a bit.
First up, it’s the broom::tidy() function.
Take the body_mass_model object and pipe it into the broom::tidy() function to see what it does.
body_mass_model %>%
broom::tidy()
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 137. 2.00 68.5 5.71e-201
2 body_mass_g 0.0153 0.000467 32.7 4.37e-107
As you can see, everything that was previously in the Coefficients section of the summary() printout is now in a neat tibble. That means that we can easily query individual values using our dplyr chops.
Write code that only prints out the p-value associated with the \(b_1\) coefficient.
You want to filter() this tibble so that it only has a single row and then pull() a single variable out of it.
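For instance, one way of doing it (using the column names from the broom::tidy() output above) might be:

body_mass_model %>%
  broom::tidy() %>%
  dplyr::filter(term == "body_mass_g") %>%  # keep only the slope row
  dplyr::pull(p.value)                      # extract the p-value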
That’s quite useful, don’t you think?
The second function you should be aware of is broom::glance().
Take the body_mass_model object and pipe it into the broom::glance() function to see what it does.
body_mass_model %>%
broom::glance()
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.759 0.758 6.91 1071. 4.37e-107 1 -1146. 2297.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
The output of this function, provided you gave it an lm object, will only ever have one row with all the various model fit statistics. We will talk about some of these in the coming months but for now, the most important is the r.squared column containing, you guessed it, \(R^2\).
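Because the output is a tibble, you can pull out individual fit statistics in the usual way; for example (just a sketch):

body_mass_model %>%
  broom::glance() %>%
  dplyr::pull(r.squared)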
Now that we have the results from our linear model, we can report them. Let’s first see what a pretty standard way of reporting this kind of model is and then break it down piece by piece.
The model accounted for a significant proportion of variance in flipper length, \(R^2 = .76\), 90% CI \([0.72\), \(0.79]\), \(F(1, 340) = 1,070.74\), \(p < .001\). Body mass was positively related to flipper length, with flipper length increasing by 15.28 millimetres with each one kilogramme increase in body mass, 95% CI \([14.36\), \(16.19]\), \(t(340) = 32.72\), \(p < .001\).
So what do we have here…
First, we report that, overall, the model was significant, i.e., that it explained a significant proportion of the outcome’s variance. To evidence this claim, we should report the \(R^2\) with its confidence interval, the F-statistic with its degrees of freedom, and the associated p-value.
Once we’ve reported the overall model fit, we should report the relationship of interest and, once again, provide evidence by quoting the relevant coefficient and its statistical test. In our case, we’re only interested in the slope (\(b_1\)) because it is the slope of the line that tells us about the relationship between our predictor and outcome. To fully report the relationship, we need the value of the \(b_1\) estimate, its confidence interval, and the associated t-statistic, degrees of freedom, and p-value.
It is nice to tell the reader what this relationship looks like by telling them what change in the outcome is associated with a unit change in the predictor. Notice that we reported this in terms of a 1 kg change in body mass, even though the data is in grammes. This means that we’re reporting a \(b\) that’s 1,000 times as large as the one we saw in the model results. The reason for doing this is that the number was so small that rounding it to 2 decimal places resulted in a value and a confidence interval that were not very informative (basically 0.02 [0.01, 0.02]).
If this ever happens to you and you decide to change the scale of any of the variables in the model you must rerun all models and redo all plots to reflect this change.
It is absolutely crucial that all results, tables, and figures are consistent and in agreement with each other!
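As an illustration of what rescaling looks like in practice, here is a sketch (the variable name body_mass_kg is made up for this example) that refits the model with body mass in kilogrammes and asks broom::tidy() for confidence intervals:

peng_data %>%
  dplyr::mutate(body_mass_kg = body_mass_g / 1000) %>%  # grammes to kilogrammes
  lm(flipper_length_mm ~ body_mass_kg, data = .) %>%
  broom::tidy(conf.int = TRUE)

The slope should now come out at about 15.28 (millimetres per kilogramme), matching the value reported above.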
To cement the understanding of proportion of variance explained, let’s calculate \(R^2\) ourselves!
First, let’s introduce another handy broom function.
Pipe the model object into the broom::augment() function and assign the output to an object called body_mass_model_data.
body_mass_model_data <- body_mass_model %>%
broom::augment()
Let’s have a look at what the function returned…
body_mass_model_data
# A tibble: 342 x 9
.rownames flipper_length_mm body_mass_g .fitted .resid .hat
<chr> <int> <int> <dbl> <dbl> <dbl>
1 1 181 3750 194. -13.0 0.00385
2 2 186 3800 195. -8.78 0.00366
3 3 195 3250 186. 8.62 0.00705
4 5 193 3450 189. 3.57 0.00550
5 6 190 3650 192. -2.49 0.00431
6 7 181 3625 192. -11.1 0.00444
7 8 195 4675 208. -13.1 0.00395
8 9 193 3475 190. 3.19 0.00533
9 10 190 4250 202. -11.7 0.00293
10 11 186 3300 187. -1.14 0.00663
# ... with 332 more rows, and 3 more variables: .sigma <dbl>,
# .cooksd <dbl>, .std.resid <dbl>
For this task, we only need to understand a few of these variables:
flipper_length_mm is the original observed flipper length data; this was the outcome variable in our model
body_mass_g contains the original body mass data; this was our predictor
.fitted are the values of the outcome as predicted by the model
.resid are the residuals: the difference between the observed values of the outcome and .fitted
Calculate the variance of the residuals from this tibble and save the value in var_resid.
var_resid <- var(body_mass_model_data$.resid)
What is the variance of the residuals? (to 2 decimal places)
47.65
Calculate the variance of the outcome variable and assign it to var_total.
There may be NAs in the variable, *nudge, nudge, wink, wink*…
What is the total variance of the outcome? (to 2 decimal places)
197.73
Apply the formula for \(R^2\) to get the value.
\(R^2 = 1 - \frac{var_{\text{resid}}}{var_{\text{total}}}\)
1 - var_resid / var_total
[1] 0.7589925
As a sanity check, make sure you get the same value as the one in the model summary (or in the broom::glance() tibble).
That’s all for today. Good job!