Tutorial 5
This tutorial covers how to prepare for, complete, and report correlation in R. Before you start this tutorial, you should make sure you review the relevant lecture, as this tutorial assumes you already know what correlation is.
All you need is this tutorial and RStudio. Remember that you can easily switch between windows with the Alt + ↹ Tab (Windows) and ⌘ Command + ⇥ Tab (Mac OS) shortcuts.
Note on code in the tutorial
You can reveal the solution to each task by clicking on the “Solution” button. As always, the point of this tutorial is not to torment you with tasks you cannot do. However, you are strongly encouraged not to reach straight for the answer. Try to complete the task yourself: write the code in your script, run it, see if it works. If it doesn’t, try again. Only reveal the solution if you’re stuck or if you want to check your own solution.
Create a new week_05
project and open it in RStudio. Then, create two new folders in the new week_5
folder: r_docs
and data
. Finally, open a new Markdown file and save it in the r_docs
folder. Since we will be practicing reporting (writing up results), we will need Markdown later on. For the tasks, get into the habit of creating new code chunks as you go.
Remember, you can add new code chunks by:
```
{r}
, press Enter, then ```
again.
Add and run the command to load the tidyverse
package in the setup
code chunk.
If you don’t have it yet, install the GGally
package IN THE CONSOLE. Do not add this command to a code chunk! Then, load the GGally
package in your setup
code chunk.
As usual, we will need some data to work with. Since there’s so much interesting information in it, let’s keep looking at the gensex
dataset we’ve been using so far. Use the link below to read in the data in a new code chunk.
Link: https://and.netlify.app/docs/gensex_clean.csv
As usual, before beginning any analysis, we will want to take a look at our data to make sure it’s clean and correct before we proceed. Since we’re using our familiar gensex
dataset that we’ve cleaned already, we should be pretty happy with this, but let’s have a look again at these variables. The main reason is that you can generate rs all day easily, but if you don’t know what the variables you are correlating actually mean, that doesn’t do you much good!
Variable | Type | Description |
---|---|---|
Duration | Continuous | Time spent completing the questionnaire, in seconds |
Age | Continuous | Participant age in years |
Gender | Categorical | Participant gender |
Gender_comfortable_1 | Continuous |
How comfortable participants are with their response to Gender ; Likert 1 - 9, high score = more comfortable
|
Gender_masc_1 | Continuous | How masculine participants feel they are; Likert 1 - 9, high score = more masculine |
Gender_fem_1 | Continuous | How feminine participants feel they are; Likert 1 - 9, high score = more feminine |
Gender_stability_1 | Continuous | How stable participants perceive their gender identity to be; Likert 1 - 9, high score = more stable |
Sexual_strength_1 | Continuous | How strongly participants experience sexual attraction; Likert 1 - 9, high score = more strongly |
Sexual_fewq_1 | Continuous | How frequently participants experience sexual attraction; Likert 1 - 9, high score = more frequently |
Sexual_gender_1 | Continuous | Sexual preference for genders; Likert 1 - 9, 1 = women, 5 = no preference/no gender, 9 = men |
Romantic_strength_1 | Continuous | How strongly participants experience romantic/emotional attraction; Likert 1 - 9, high score = more strongly |
Romantic_freq_1 | Continuous | How frequently participants experience romantic/emotional attraction; Likert 1 - 9, high score = more frequently |
Romantic_gender_1 | Continuous | Romantic/emotional preference for genders; Likert 1 - 9, 1 = women, 5 = no preference/no gender, 9 = men |
For our practice today, let’s focus on the ratings of either romantic or sexual attraction, since we already looked at gender in the lecture.
Overwrite the gensex
object so that it only contains variables to do with sexual or romantic attraction. Use the codebook above to find each variable, and make sure you understand how each variable was measured.
You need to select
the right variables, but there’s a more efficient way than typing all of them out! See ?dplyr::select
for ideas.
Remember NOT to use quotes within select
!
You can simply type into select()
the names of all the variables you want to keep. This is fine but a bit inefficient and repetitive:
gensex <- gensex %>%
dplyr::select(.data = ., #optional!
Sexual_fewq_1, Sexual_gender_1, Sexual_strength_1,
Romantic_freq_1, Romantic_gender_1, Romantic_strength_1
)
The help documentation suggests a more efficient way using selection helpers, which are special functions that match by pattern. Instead of typing out variable names, we can see that all the variables we want have names that start with either “Sexual” or “Romantic”. So, we can use starts_with()
to get them based on those prefixes:
cor()
To start, we can get a very basic correlation matrix very quickly with the function cor()
from the stats
package, which is automatically loaded. It isn’t very pretty, but it will get us a rough and ready correlation matrix quickly.
Create a correlation matrix of all six variables.
gensex %>%
cor()
Sexual_strength_1 Sexual_fewq_1 Sexual_gender_1
Sexual_strength_1 1.0000000 0.62565969 NA
Sexual_fewq_1 0.6256597 1.00000000 NA
Sexual_gender_1 NA NA 1
Romantic_strength_1 NA NA NA
Romantic_freq_1 0.3430913 0.48449978 NA
Romantic_gender_1 0.0979998 -0.01233245 NA
Romantic_strength_1 Romantic_freq_1
Sexual_strength_1 NA 0.34309127
Sexual_fewq_1 NA 0.48449978
Sexual_gender_1 NA NA
Romantic_strength_1 1 NA
Romantic_freq_1 NA 1.00000000
Romantic_gender_1 NA 0.07396634
Romantic_gender_1
Sexual_strength_1 0.09799980
Sexual_fewq_1 -0.01233245
Sexual_gender_1 NA
Romantic_strength_1 NA
Romantic_freq_1 0.07396634
Romantic_gender_1 1.00000000
Notice that there are a lot of NA values in this table. Why do you think this is? What can we do about it?
Look at the gensex
data to figure out why the correlation matrix contains NAs. Then, produce a correlation matrix that doesn’t have any NAs in it.
A summary
of the dataset might give you a good overview of your variables so you can filter
out the problem.
As we’ve seen before, the result of many calculations is NA if there’s even one NA value in the variable. Sure enough, both the Sexual_gender_1
and Romantic_strength_1
variables contain at least one NA.
gensex %>% summary()
Sexual_strength_1 Sexual_fewq_1 Sexual_gender_1
Min. :1.000 Min. :1.000 Min. :1.00
1st Qu.:7.000 1st Qu.:5.000 1st Qu.:5.00
Median :8.000 Median :6.000 Median :8.00
Mean :7.526 Mean :6.154 Mean :6.64
3rd Qu.:9.000 3rd Qu.:8.000 3rd Qu.:9.00
Max. :9.000 Max. :9.000 Max. :9.00
NA's :3
Romantic_strength_1 Romantic_freq_1 Romantic_gender_1
Min. :2.000 Min. :1.000 Min. :1.000
1st Qu.:7.000 1st Qu.:4.000 1st Qu.:5.000
Median :8.000 Median :6.000 Median :8.000
Mean :7.776 Mean :5.922 Mean :6.631
3rd Qu.:9.000 3rd Qu.:7.000 3rd Qu.:9.000
Max. :9.000 Max. :9.000 Max. :9.000
NA's :2
We can use filter()
to keep only the rows that do NOT contain NAs in both variables:
Then, we can run the same cor()
command as before.
gensex %>%
cor()
Sexual_strength_1 Sexual_fewq_1 Sexual_gender_1
Sexual_strength_1 1.00000000 0.601693729 0.081864835
Sexual_fewq_1 0.60169373 1.000000000 0.001622784
Sexual_gender_1 0.08186484 0.001622784 1.000000000
Romantic_strength_1 0.50969282 0.252900937 0.022522256
Romantic_freq_1 0.31539071 0.470851441 0.109874779
Romantic_gender_1 0.08464036 -0.025282400 0.870330126
Romantic_strength_1 Romantic_freq_1
Sexual_strength_1 0.50969282 0.31539071
Sexual_fewq_1 0.25290094 0.47085144
Sexual_gender_1 0.02252226 0.10987478
Romantic_strength_1 1.00000000 0.51362193
Romantic_freq_1 0.51362193 1.00000000
Romantic_gender_1 0.01380141 0.06792014
Romantic_gender_1
Sexual_strength_1 0.08464036
Sexual_fewq_1 -0.02528240
Sexual_gender_1 0.87033013
Romantic_strength_1 0.01380141
Romantic_freq_1 0.06792014
Romantic_gender_1 1.00000000
We can see that there’s a huge amount of variation in how strongly these variables are correlated. We can also see that diagonal line of 1s, which represent perfect correlations.
In your Markdown document, write your own explanation of why there are absolutely perfect correlations in your matrix.
You can see in the output that the 1s appear on the diagonal where the same variable name is in both the column and the row. As we discussed in the lecture, these 1s represent where each variable is correlated with itself. These 1s are not particularly helpful or informative!
As we covered in the lecture, a correlation matrix like this is the same above the diagonal line of 1s and below it. So, more than of half this big wall of numbers is just repeated information! Let’s use a different function to get a bit more bang for our buck, as it were.
GGally
The GGally
package contains a very useful function called ggscatmat()
, which we can use to help us better understand the relationships between our variables. It outputs a correlation matrix like the one we saw above, but mixed in with some really useful plots.
Put all of your variables into the GGally::ggscatmat()
function and have a look at the output.
Since we want to put in all of the variables we currently have in gensex
, we can just pipe it in.
gensex %>%
ggscatmat()
ggscatmat()
outputs a ggplot
object, so just like any plot we make with ggplot
, we can customise it. Here I’ve just added a line that lets me adjust the aspect ratio to make it a bit more readable. Feel free to fiddle with the number to make it work best for you.
gensex %>%
ggscatmat() +
theme(aspect.ratio=.5)
This plot has a lot going on, but don’t worry! That’s just because there’s a lot of useful information here. Let’s focus on one bit at a time.
First, we can see that we have our six variable names listed along the top of the plot, and then the same six names along the side. Where those columns and rows intersect, some cells of the plot contain numbers, some contain dots, and some contain lines. Let’s look first at the lines, which are all on the diagonal.
We saw before that cor()
gave us 1s on the diagonal, representing each variable correlated with itself. Here, instead of 1s (which aren’t very helpful or interesting), we actually have a histogram of each variable. From the shapes that mostly slope dramatically up to the right, we can see that low ratings are uncommon, and high ratings very common, for most of our variables. The exceptions are the two variables asking about frequency (of sexual and romantic attraction), for which the most common response is 6 - 7 on our 9-point scale.
Next, let’s look at the numbers. You might recognise these values from cor()
above. These are the correlation coefficients for each pair of variables. To figure out which variables are involved in the correlation, go up to the column name and along to the row name.
Finally, we have the dots. These are scatterplots of each pair of variables. Again, to find which variables are in each plot, look at the column name and the row name. So, each unique pair of variables has a scatterplot and a correlation coefficient; and each individual variable has a histogram along the diagonal. Like I said - snazzy!
Answer the following questions using the output from ggscatmat()
.
How many pairs of variables have an inverse relationship?
1
Correct!That’s not right…
What is the value of the correlation r between the two variables with the strongest relationship? Give your answer to two decimal places.
0.87
Correct!That’s not right…
What is the value of the correlation r between the two variables with the weakest relationship? Give your answer to two decimal places.
0
Correct!That’s not right…
To the nearest whole point, what is the most common rating for the variable Sexual_gender_1
?
8
Well done!Not quite - where is the highest point of the distribution?
How can you interpret the scatterplot between Sexual_fewq_1
and Sexual_gender_1
?
No pattern, just a random block of dots
Correct!That’s not right…
We now can easily investigate the correlation between any two variables in our dataset. However, in the lecture we also talked about reporting significance, degrees of freedom, and CIs, which we can’t get easily from this output. So, if we want to report the results of particular correlations, we should have a look at a different function, cor.test()
.
cor.test()
Unlike cor()
, cor.test()
only tests the correlation between pairs of variables. If we want to look at lots of variables at once, we’d be better off with cor()
(or, although we haven’t looked at it in this tutorial, Hmisc::rcorr()
). However, if we want more information about a particular correlation, such as easily reportable degrees of freedom, p-values, and confidence intervals, cor.test()
is the better choice.
Open the help documentation for the cor.test()
function.
?cor.test()
Under the heading Usage, we can see that the function has two methods: one is called “Default S3 method” and the other is mentions “class ‘formula’”. We could use either method, but we’ll use the formula version now - because it’s going to be extremely useful in the future!
Under the heading Arguments, the help documentation says about the formula
argument: “a formula of the form ~ u + v
, where each of u and v are numeric variables giving the data values for one sample.” So, we need to write the first argument as ~ variable_name_1 + variable_name_2
. We then also have a data
argument to specify the data that those variables come from.
Finally, we have two other important arguments with no default:
alternative
must be set to one of the three options "two.sided"
, "less"
, or "greater"
. This determines the type of alternative hypothesis. Because we are going to be using two-tailed tests, we will stick with "two-sided"
.method
must be set to one of the three options "pearson"
, "kendall"
, or "spearman"
. Thus far, we have only been using Pearson correlations, so we’ll stick with that for now....
are other options you can change. We’ll stick with the defaults for now.Use cor.test()
to calculate the correlation between Sexual_strength_1
and Romantic_strength_1
as the output shows below.
To begin, we need our formula. We know the two variables that we want to use: Sexual_strength_1
and Romantic_strength_1
. We begin by writing them in the formula we saw above: ~ Sexual_strength_1 + Romantic_strength_1
. (Note that the order doesn’t matter.)
Next, we need to tell R where these variables come from using the data
argument. We could either put in the name of the dataset, or pipe it in.
Then, we need to specify the values for alternative
and method
, which we already discussed above. Altogether, we get this:
Pearson's product-moment correlation
data: Sexual_strength_1 and Romantic_strength_1
t = 10.244, df = 299, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4208932 0.5888164
sample estimates:
cor
0.5096928
When we run this function, the output is some reasonably easy to read information about the results of our test. We can see here that we have:
Stop and look at the values that the output has given you. Based on what you have learned from the lecture, write down what you think this value of r means, and how you could interpret it for these variables.
First, we can see that there is a significant correlation between how strongly participants experience sexual attraction, and how strongly they experience romantic attraction. This is a reasonably strong correlation, but it’s not as strong as we might expect. It’s also a positive correlation. This tells us that on the whole, people who experience stronger sexual attraction also tend to experience stronger romantic attraction.
Create a scatterplot of these two variables to help you understand the correlation. Does this change your interpretation at all?
Any plot that looks like the one below is fine, but I’ve added a few things to make it prettier and easier to read!
gensex %>%
ggplot(aes(Sexual_strength_1, Romantic_strength_1)) +
geom_point(position = "jitter", alpha = .4) +
scale_x_continuous(breaks = 0:9,
name = "Strength of Sexual Attraction")+
scale_y_continuous(breaks = 0:9,
name = "Strength of Romantic Attraction")+
theme_minimal()
We can see that for both variables, most people tended to answer on the high end of the scale - which gives us the cluster of points in the upper right hand corner. This is somewhat concerning, because not many people answered at the middle to lower ends of either scale, so our value for r might not really represent what’s going on. In other words, we don’t have much information about people who don’t experience sexual or romantic attraction very strongly.
We’ll worry more about this next year, when we study assumptions and bias in data; for now, just notice that this looks a bit odd.
Although we might conclude from this result that the strength of sexual attraction and the strength of romantic attraction tend to change in the same way, that does not mean they are causally related. Our results do not mean that if you experience weaker sexual attraction, that will cause you to also experience weaker romantic attraction, or vice versa. It should be clear from the previous statement that there’s no way to establish which of these two might be the cause and which the effect in any case. What our results do suggest is that if you experience weaker sexual attraction, you tend to experience weaker romantic attraction (and vice versa), to a degree that is unlikely to occur if there is in fact no real relationship at all between the strength of sexual and romantic attraction.
Save the output from cor.test()
as a new object called sex_rom_cor
, so we can use it again later.
sex_rom_cor <- gensex %>%
cor.test(~ Sexual_strength_1 + Romantic_strength_1, data = .,
alternative = "two.sided", method = "pearson")
Our reporting in the tasks above could use some numbers, which we can get out of the output of cor.test()
. Let’s look at how we can formally report the results of a correlation analysis.
To finish our analysis, we have to tell people about it - so we should write up the results of our correlation test as in a report.
Results like this correlation test, that have a small number of values to report, are often reported in brackets in the text of your report. This takes the general form:
(name_of_estimate(degrees_of_freedom) = value_of_estimate, p = exact_p, 95% CI [lower_bound, upper_bound])
To get these values, let’s look again at the results that we already saved in the object sex_rom_cor
.
Call the name of the object sex_rom_cor
to see the displayed output.
sex_rom_cor
Pearson's product-moment correlation
data: Sexual_strength_1 and Romantic_strength_1
t = 10.244, df = 299, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4208932 0.5888164
sample estimates:
cor
0.5096928
This is the same output that we saw above when we ran the analysis. Although this is useful for us, the analysts, to understand the result, it’s not at all formatted well for reports.
You should never have any raw (unformatted) code or output in a report! Always report results in the text, in a figure, or in a formatted table.
Use the general format above and the values to type out the result of the correlation analysis.
We start with the general form:
name_of_estimate(degrees_of_freedom) = value_of_estimate, p = exact_p, 95% CI [lower_bound, upper_bound]
Now we need each of these values from the output:
The only weird thing is the value of p. R is essentially telling us that the value of p is very very very very very small. In APA style (which is the writing/reporting style we use as standard), we should report exact p values unless they are smaller than .001. At that point, we can just write “p < .001”.
Here, the p-value is some number smaller than 2.2e-16 (which means 2.2 with 16 zeros in front of it, or .00000000000000022), which is clearly much smaller than .001!
So, our reporting should look like:
r(299) = .51, p < .001, 95% CI [.42, .59]
To complete our report, we should describe in words what this result means. We essentially include the statistical result as a citation to give evidence for our claim.
Have a go writing out a report of this statistical analysis.
You should definitely write out your own report in your own words, but here’s an example:
“A correlation analysis was performed on participants’ ratings of how strongly they experience sexual and romantic attraction. The results indicated that strength of sexual and romantic attraction were significantly positively correlated (r(299) = .51, p < .001, 95% CI [.42, .59]).”
That’s looking pretty slick! This is exactly the sort of thing you would expect to see in a journal article.
If you’re happy with this and/or ready to be done, you skip down to the Recap to finish up - this is the end of the required bit of the tutorial. Well done!
If you’d like to learn how to get R to do the reporting for you, read on!
There are two ways to do our reporting: typing out the results by hand, as we just did above, or using inline coding. For this module, you are only required to be able to type out the results by hand, as we did above.
However, inline coding is an extremely powerful feature of Markdown documents, but it requires a bit more coding skill. So, if you’re ready to give it a go, let’s get started!
Use class()
and str()
to look at the class and structure of the sex_rom_cor
object.
sex_rom_cor %>% class()
[1] "htest"
sex_rom_cor %>% str()
List of 9
$ statistic : Named num 10.2
..- attr(*, "names")= chr "t"
$ parameter : Named int 299
..- attr(*, "names")= chr "df"
$ p.value : num 2.64e-21
$ estimate : Named num 0.51
..- attr(*, "names")= chr "cor"
$ null.value : Named num 0
..- attr(*, "names")= chr "correlation"
$ alternative: chr "two.sided"
$ method : chr "Pearson's product-moment correlation"
$ data.name : chr "Sexual_strength_1 and Romantic_strength_1"
$ conf.int : num [1:2] 0.421 0.589
..- attr(*, "conf.level")= num 0.95
- attr(*, "class")= chr "htest"
The first command tells us that this object, which contains the results from the function cor.test()
, has the class htest
. What does this mean? Objects of this class tend to have a specific structure, which we can see using str()
. Now we can see that there is a lot more stored in here than it seems. If we want to get out specific values, we can simply subset them using $
, as the output from str()
suggests (notice the $
s in the output). The output from cor.test()
that we have here contains the following useful elements (among others):
statistic
: the calculated value of tparameter
: the degrees of freedom for tp.value
: the p-valueestimate
: the value of rconf.int
: the values of the lower and upper bounds of the confidence interval for rUse subsetting to extract the value of r from the sex_rom_cor
object.
sex_rom_cor$estimate
cor
0.5096928
That’s very cool, but we can do a bit better…
Use subsetting to extract the value of r from the sex_rom_cor
object, then round it to two decimal places.
sex_rom_cor$estimate %>%
round(2)
cor
0.51
That’s looking very promising! Now, if I want to report the correlation coefficient for this analysis as part of my writeup, I could use inline code to do this. Inline code is a small snippet of R code inside the text part of a Markdown document, no different from the code we use inside code chunks. The difference is, we can write lines of code that print out particular numbers - as we just did above - and then insert that code into the text where we want that number to appear. When we knit the document, R will run the inline code and replace it with whatever value the code produces. To do this, I just need to write `r some_r_code`
in my Markdown.
If that’s clear as mud, it might help to see it in action. In the text of my Markdown document, I could write:
“Ratings of the strength of sexual and romantic attraction were correlated at r = `r sex_rom_cor$estimate %>% round(2)`
”
When I knit the Markdown document, this will come out as:
“Ratings of the strength of sexual and romantic attraction were correlated at r = 0.51”
The code `r sex_rom_cor$estimate %>% round(2)`
is evaluated by R to produce a number, and then that number appears in place of the code in the knitted document. Cool, eh?
Once we have the basic idea, we can get almost any value of interest from our sex_rom_cor
object using subsetting. The only other thing you need to know is that if subsetting gives you multiple numbers, you can get out just one of them using square brackets with a number. For example, this code: object_name$variable_name[1]
will give you the first observation in that variable.
Use subsetting with square brackets to get out the lower and higher bounds of the confidence intervals from sex_rom_cor
.
sex_rom_cor$conf.int[1]
[1] 0.4208932
sex_rom_cor$conf.int[2]
[1] 0.5888164
Complete the report of this analysis below, replacing the square brackets with inline code that produces the correct number. Make sure you round all values to two decimal places, if necessary, except for p, which is rounded to three.
Strength of sexual attraction and strength of romantic attraction were significantly correlated, r([degrees of freedom]) = [value of r], 95% CI = [[lower bound], [upper bound]], p = [p-value].
To preview what number your inline code will produce when it’s knitted, you can click anywhere inside the backticks and press Ctrl (or ⌘ Command) + Enter. You’ll see a small pop-up that will show the value that your inline code produces.
Strength of sexual attraction and strength of romantic attraction were significantly positively correlated, r(`r
sex_rom_cor$parameter
`) = `r
sex_rom_cor$estimate %>% round(2)
`
, p = `r
sex_rom_cor$p.value %>% round(3)
`
,95% CI = [`r
sex_rom_cor$conf.int[1] %>% round(2)
`
, `r
sex_rom_cor$conf.int[2] %>% round(2)
`
].
Which should appear in Markdown as:
Strength of sexual attraction and strength of romantic attraction were significantly positively correlated, r(299) = 0.51, p = 0, 95% CI = [0.42, 0.59].
Note: The p-value actually isn’t right here. Since p is so small, it only has 0s to three decimal places, but the value isn’t actually 0. The problem is how round()
works - since we’re rounding to 0.000, the output just becomes 0, which is a bit rubbish! In this case, we’d be better off just typing “p < .001”.
There are other, better ways to report p programmatically (ie using a function) - can you find one that does what you want? If not - can you write one for yourself? Let us know how you get on!
Knit your document to see your completed inline code!
Whew, well done! You’re welcome - and encouraged! - to keep working with the gensex
data to practice what we’ve learned today.
In sum, we’ve covered:
$
to report results with inline codeRemember, if you get stuck or you have questions, post them on Piazza, or bring them to StatsChats or to drop-ins.
Good job!
That’s all for today. See you soon!