Practical 05
Today’s practical will be a collaborative effort. Before you jump in, decide on roles with your team.
Decide who within your team will do the following roles. You should have only one scribe, but you can have more than one of the other roles.
Keep in mind that if someone in your team is usually the scribe, you should switch roles so that everyone gets practice working in RStudio.
At this point, the Scribe should share their screen, and your team should work through the following tasks together.
Just like every week, we want to set up our workspace: project, folder, and document to work in.
If you haven’t done it yet, create a `week_05` R project inside of your module folder and, within it, create the standard folder structure, just like last week.
Create your own new R Markdown file and save it in the `week_05/r_docs` folder you’ve just created. You can give it a title, put your name as the author, and delete any default text or code chunks that you won’t need. Keep the `setup` code chunk, though!
Use this R Markdown file to complete the following tasks, adding code chunks and headings as you go.
In the `setup` code chunk of the .Rmd file, write the code to load the `tidyverse` package.
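Loading the package is the same one-liner as in previous weeks:

```r
library(tidyverse)
```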
Download the Millennium Cohort dataset from the link below, and save it in a new object called `data`.
Link: https://and.netlify.app/docs/mc_data.csv
data <- readr::read_csv("https://and.netlify.app/docs/mc_data.csv")
You should see a new object, `data`, appear in your environment.
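If you’d like a quick sanity check that the file loaded correctly, `glimpse()` (loaded with the tidyverse) prints each column with its type and first few values. This step is optional and not part of the solutions:

```r
# Optional: peek at the columns and their types
data %>% glimpse()
```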
Using R, find out how many participants there are in this dataset.
The easiest way is to simply look at the dataset in your environment pane, which tells you the number of “obs.” (observations, or rows). Or, we can have R count for us.
nrow(data)
[1] 12170
Wow, this is a huge dataset! We’ve said that a bigger sample is better, so this must be excellent.
Our data today comes from the Millennium Cohort, a large group of young people who have been invited to participate in a longitudinal study since birth. The data we are using is from a sweep that was conducted when the cohort was about 10 - 11 years old, and includes questions about both internet/social media usage and wellbeing/happiness.
Have a look at the Codebook below and choose two variables to correlate. Specifically, you should choose one variable about social media or Internet use, and one variable about happiness or wellbeing.
In the solutions, we’ll use `often_messages` and `recently_laugh`. If you want to, you can choose these as well; in that case, your code and output will match the solutions exactly.
However, I’d recommend you choose variables you find interesting. If you do choose different variables, though, keep in mind that your answers - including interpretation - will be different from the solutions.
Variable Name | Description | Scale or Values |
---|---|---|
sex | Sex | 1 = Male, 2 = Female |
age | Age as of last birthday | Years (numeric) |
Internet/Social Media Use | ||
often_use_internet | How often do you use the Internet, not at school? | 1 = Most days, 2 = At least once a week, 3 = At least once a month, 4 = Less often than once a month, 5 = Never |
often_messages | How often do you exchange messages with friends on the Internet? | |
often_social_media | How often do you visit a social networking website on the Internet? | |
Wellbeing/Happiness | ||
happiness_looks | How happy are you with the way you look? | Likert scale, 1 = Completely happy, 7 = Not at all happy |
happiness_life | How happy are you with your life as a whole? | |
recently_happy | In the last four weeks, how often did you feel happy? | 1 = Never, 2 = Almost never, 3 = Sometimes, 4 = Often, 5 = Almost always |
recently_worried | In the last four weeks, how often did you worry about what would happen? | |
recently_sad | In the last four weeks, how often did you feel sad? | |
recently_afraid | In the last four weeks, how often did you feel afraid or scared? | |
recently_laugh | In the last four weeks, how often did you laugh? | |
recently_angry | In the last four weeks, how often did you get angry? |
In your Markdown document, write down your prediction about how these two variables will be correlated. You should mention:
Ideally, your prediction should be based on an understanding of what previous papers have found - for instance, if you had a look at some of the recommended reading from the handout. If you didn’t, just use your own logic and reasoning to make a prediction.
When you’re predicting the direction of the correlation, make sure you read the “Scale or Values” column carefully to interpret the meaning of the numbers you get.
This is more of a comment than a solution, because there isn’t a “correct” prediction to make!
You might predict that as participants send online messages less frequently - that is, as their score on `often_messages` increases - the amount they laugh will go up. (If this seems backwards to you, make sure you read the “Scale or Values” column carefully! It tells you what response these numbers correspond to, and the scale on the Internet/Social Media questions is the reverse of the Happiness/Wellbeing questions.) You could suggest that this is because they’re spending less time online and more time in the “real world” living their best life.
On the other hand, you might predict the reverse: that when participants send online messages less frequently, the frequency of laughter will go down. This could be because they have fewer online social contacts, or see fewer doge memes, so they laugh less often.
Don’t proceed until you have written down your predictions! It’s important to do this before you know the result of the analysis.
Run and report the correlation analysis with the following steps.
Use the `cor.test()` function to output a correlation analysis on the two variables you chose.
If you need help, the Week 5 tutorial explains how to do this. Or, look up the help documentation by running `?cor.test` in the console.
data %>%
  cor.test(~ often_messages + recently_laugh, data = .,
           alternative = "two.sided", method = "pearson")
Pearson's product-moment correlation
data: often_messages and recently_laugh
t = -12.291, df = 12168, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.12824922 -0.09315109
sample estimates:
cor
-0.1107347
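If you find the default output fiddly to read numbers out of, the broom package (part of the tidyverse ecosystem, though you may need to install it separately) can return the same results as a one-row table. This is an optional extra, not required for the practical:

```r
# Tidy the cor.test() output into a one-row data frame
# (assumes the broom package is installed)
cor.test(~ often_messages + recently_laugh, data = data) %>%
  broom::tidy()
```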
Report the results of your analysis. Your reporting should mention which variables you correlated, what their relationship was like (i.e. degree and direction), and give the following information about the analysis:
If you’re not sure exactly how to report this analysis, look back at the Week 4 tutorial and give it your best shot. You can also have a look at a paper reporting a correlation analysis!
You should write this report in your own words, but here’s an example:
“Frequency of online messaging and how often participants reported laughing in the last four weeks were significantly negatively correlated, r(12168) = -.11, p < .001, 95% CI = [-0.13, -0.09].”
In addition to your statistical reporting, interpret the results. This means to explain what the correlation tells you in plain language.
This one’s tricky because of the way the scales are coded, i.e., what a higher score means for each variable. Make sure you check the Codebook carefully!
You should write this report in your own words, but here’s an example:
“In other words, as participants reported less frequent online messaging, they also tended to report laughing less often.”
Note: This is a bit tricky because of how the response scales to these questions are measured. Notice in the Codebook that the `often_messages` variable is, in a sense, backwards - a higher number means less frequent messaging. So, as that variable increases - and online messaging becomes less frequent - the `recently_laugh` variable goes down. We know this because of the negative sign on the correlation (r = -.11). Unlike the `often_messages` variable, the `recently_laugh` variable is coded so that a higher score means more frequent laughter! So, less frequent online messaging is associated with less frequent laughter. Said the other way round: as participants reported laughing more often, their score on `often_messages` tended to decrease, which means they sent more messages online.
If that makes your head hurt, you’re not the only one! However, that’s how the Millennium Cohort study team chose to ask these questions. It’s better to stick with the original response scale, even if it’s a bit confusing.
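Purely as an illustration of what reverse-scoring would look like (the practical recommends sticking with the original coding, and the `often_messages_rev` variable below is hypothetical, not part of the solutions), flipping a 1–5 scale just means subtracting each score from 6:

```r
# Hypothetical reverse-scoring: flip the 1-5 scale so that a
# HIGHER score now means MORE frequent messaging
data_recoded <- data %>%
  mutate(often_messages_rev = 6 - often_messages)

# The correlation flips sign but keeps the same magnitude
cor(data_recoded$often_messages_rev, data_recoded$recently_laugh,
    use = "complete.obs")
```

Reverse-scoring never changes the strength of a correlation, only its sign, which is why checking the Codebook matters more than the sign itself.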
Compare the results you’ve reported to your original prediction. Was this what you expected? Was the correlation in the direction that you predicted? How about strength? Write down your thoughts in your Markdown document.
Once you’ve finished your write-up, go to the Padlet below for your practical session and paste your completed writeup there. Remember, you need to mention:
Practical 01 (Thursday 9am) | Practical 02 (Thursday 11am)
Practical 03 (Thursday 4pm) | Practical 04 (Thursday 6pm)
Practical 05 (Friday 9am)

Have a look through other responses on the Padlet, especially groups that used different variables than you did. Overall, did we find evidence that social media/internet use makes children more unhappy? Write down your thoughts in your Markdown.
So, job done, right? We found a significant correlation and proved definitively that sending online messages makes you laugh more (or whatever variables you used). Tell that to your great-aunt Marge the next time she tells you that you stare at your phone too much!
By now your statistics spidey senses are hopefully tingling. There are a few things here that aren’t quite right. Let’s look into this correlation a little further.
Can we conclude from this analysis that sending online messages makes you laugh more (or whatever variables you used)? Write your thoughts in your Markdown.
It doesn’t matter which variables you used: we can’t conclude that there’s a causal relationship from this analysis. There might actually be one, but correlation does not provide evidence of that! It only tells us how much one variable tends to change in relation to another variable.
Can we conclude that we have proven a relationship between sending online messages and laughing more often (or whatever variables you used)? Write your thoughts in your Markdown.
Remember, statistical analyses don’t prove - they provide evidence. We have some evidence on our hands that there is some relationship between these variables, of a size that we are unlikely to encounter if in fact the null hypothesis is true and the actual value of the correlation between them is 0. (This may not be the case for your analysis, of course, if you chose different variables and had a non-significant result.)
This is a common topic in StatsChats, if you’re keen to hear more.
Even with these caveats in place, it’s still tempting to conclude that because our value of p is so small, there must be something cool or interesting happening here. But…is there? Let’s have a closer look at what this value of r actually means.
Remember that the value of r is an interpretable number: it tells us about the strength of the relationship between our two variables. This makes r a very useful example of an effect size: a number whose magnitude corresponds to the size of the effect we are interested in. The bigger the (absolute) value of r, the stronger the relationship between the variables.
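One quick way to gauge that strength is to square the correlation: r² gives the proportion of variance the two variables share. (This is a standard interpretation of r, not a required step in the practical.)

```r
# Square the observed correlation to get shared variance
r <- -0.11
r^2   # about 0.012: the two variables share only ~1.2% of their variance
```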
What do you understand the absolute value of your correlation coefficient r to mean? Write down your thoughts in your Markdown.
Remember that the absolute value of r must lie between 0 (no relationship at all) and 1 (a perfect relationship). For the two example variables, `often_messages` and `recently_laugh`, our r value is quite small: only -.11.
If you like, you can use the Interpreting Correlations interactive visualisation to get an idea of what this means by setting the value of the correlation to -.11.
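If you can’t access the visualisation, you can get a similar feel by simulating data with a true correlation of about -.11 and plotting it. This sketch uses `MASS::mvrnorm()` (the MASS package ships with R) and is not part of the solutions:

```r
# Simulate 500 pairs of values with a population correlation of -.11
set.seed(1)  # make the simulation reproducible
xy <- MASS::mvrnorm(n = 500, mu = c(0, 0),
                    Sigma = matrix(c(1, -0.11, -0.11, 1), nrow = 2))

cor(xy[, 1], xy[, 2])  # sample correlation will be near -.11
plot(xy)               # the "relationship" is barely visible by eye
```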
At this point it would be very useful to actually graph our data. As we did in the tutorial, this might help us understand what the correlation is actually telling us.
In a real analysis, it is essential to do lots of thorough data exploration and graphing before you go plunging into your analysis! We’re just coming round to it now for dramatic effect 😄
Create a scatterplot of the two variables you chose for your analysis. Does this change your interpretation at all? Write down your thoughts in your Markdown.
Here’s some code to produce a simple plot:
data %>%
  ggplot(aes(often_messages, recently_laugh)) +
  geom_point(position = "jitter", alpha = .4) +
  scale_x_continuous(name = "Frequency of Online Messages") +
  scale_y_continuous(name = "Frequency of Laughter in Last 4 Weeks") +
  theme_minimal()
Any plot that looks something like this is fine, but I’ve added a few things to make it prettier and easier to read.
Yikes - this is pretty hard to interpret. We can immediately see that very few participants responded “Never” or “Almost never” to the frequency of laughter question (bottom two rows have very few dots). We can also see that relatively few participants answered “At least once a month” or “Less often than once a month” to the online messages question (third and fourth columns)1. The parts of the graph with the dark black boxes are the ones with the most responses. This seems to be laughter “almost always” and online messages “most days” (upper left corner); or “Almost always” or “often” for laughter and “Never” for online messages (top two boxes in the far right column).
In other words, children mostly report laughing fairly often; and they tend to report either sending online messages quite frequently, or never.
Overall, it’s clear that the relationship between these variables is not as simple as the value of r would lead us to believe. This is why it’s always important to think about the actual value of the test statistic you have calculated, and to inspect your data thoroughly (for example, by making graphs) instead of only relying on the significance value.
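If you’d like a cleaner view of the same pattern, `geom_count()` draws one point per combination of responses and sizes it by how many participants gave that combination. This is an optional extra, not part of the solutions:

```r
# Point size shows how many participants gave each pair of responses
data %>%
  ggplot(aes(often_messages, recently_laugh)) +
  geom_count() +
  theme_minimal()
```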
Well done!
This is the end of the required portion of the practical, so you’re welcome to jump down to the end if time is up. If you have whizzed through the previous tasks and still have time left, or if you’re just curious, carry on to the next section to dig deeper into what’s going on.
This final section is optional, but recommended. If you chose different variables above, keep in mind that this section will focus on the often_messages
and recently_laugh
variables.
So, how do we reconcile the fact that this correlation is significant, with the fact that the actual value of r is quite small and the scatterplot of the data shows no obvious trend?
The key comes from divorcing the real-world meaning of the word “significant” (i.e. “meaningful”, “important”) from the statistical sense.
Let’s revise what we know about distributions, standard error, and sample size.
Recall that we checked the number of participants at the beginning. How many were there again?
nrow(data)
[1] 12170
In your own words, explain what the relationship is between sample size and standard error. Write your thoughts in your Markdown document.
See Lecture 1 for help!
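To see the relationship concretely, here’s a small sketch (using a made-up population SD of 1) of how the standard error of the mean shrinks as sample size grows:

```r
# Standard error of the mean = sd / sqrt(n), so it shrinks as n increases
sd_x <- 1
sample_sizes <- c(10, 100, 1000, 12170)
round(sd_x / sqrt(sample_sizes), 3)   # 0.316 0.100 0.032 0.009
```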
Using this online calculator, set the value of r to our value (-.11). Then, try changing the value in “Sample Size”, leaving the value of r the same. What happens to the value of t as sample size increases? Write down your thoughts in your Markdown.
There isn’t a “right” answer here, but what you should notice is that for the exact same value of r, the value of t becomes greater as sample size increases.
If you click on “Formula”, you can see the equation to convert r to t. There’s N, which is sample size, in the denominator - so the value of t depends on N.
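That formula is the standard conversion t = r√(N − 2) / √(1 − r²), and you can verify its behaviour directly in R:

```r
# t grows with N even though r stays fixed at -.11
r <- -0.11
for (n in c(100, 1000, 12170)) {
  t_val <- r * sqrt(n - 2) / sqrt(1 - r^2)
  cat("N =", n, " t =", round(t_val, 2), "\n")
}
```

With N = 12170 this gives t ≈ -12.21, close to the -12.29 in the output above; the small difference is because our output used the unrounded r of -0.1107.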
Putting it all together - as sample size increases, what happens to the value of t - and therefore of p? Why should this be the case? Write down your thoughts in your Markdown.
How does the t-distribution work, and what are its critical values?
As we saw in the previous task, the value of t gets bigger - that is, more extreme - as the sample size increases. Remember as well that the shape of the distribution of t depends on the degrees of freedom, becoming more and more like a normal distribution as the degrees of freedom increase. Finally, remember that we can calculate the probability of obtaining a value as extreme as the one we have, or more extreme, under the null hypothesis using the distribution of the test statistic (here, t).
So, with very large sample sizes like the one we have, a few things result:
So, test statistics capturing very small or weak effects are still unlikely to occur if the null hypothesis is true when sample size is very large. This is an example of statistical power, or the ability to detect some effect or relationship. Larger samples have more power to detect small effects, which is one of the reasons that, on the whole, larger samples are better. However, it’s also a good reason to NOT simply rely on significance when you are evaluating how interesting, meaningful, or strong an effect is! Regardless of its significance, we also need to interpret the actual value of r, and use graphs and other descriptives to understand what, if anything, this relationship means.
In short: statistical significance is NOT the same as practical significance.
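A quick simulation makes the point. The data below are made up purely for illustration: 100,000 “participants” with a deliberately tiny true effect still produce a vanishingly small p-value.

```r
# Huge sample + tiny true effect = "significant" result
set.seed(123)                  # make the simulation reproducible
n <- 100000
x <- rnorm(n)
y <- 0.05 * x + rnorm(n)       # true correlation is only about .05

cor.test(x, y)                 # r is tiny, yet p is far below .001
```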
If you got stuck here or are finding this quite difficult, don’t worry - these are complex ideas. As you think about them and use them more, they will become more familiar and more intuitive.
Well done today! We got some practice running a correlation analysis, reporting it in APA style, and examining what those results actually mean. If you did the optional extra tasks, you also looked at why very small effects are still (statistically) significant for very large samples.
You’re welcome - and encouraged! - to keep working with the Millennium Cohort data to practice what we’ve learned today.
Remember, if you get stuck or you have questions, post them on Piazza, or bring them to StatsChats or to drop-ins.
Good job!
This is a type of response bias called extreme responding, in which respondents tend to use only the extreme ends of the scale, and is quite typical of child participants (here’s a paper if you’re interested).↩︎