Analysing Data: Correlation

This tutorial covers how to prepare for, complete, and report correlation in R. Before you start this tutorial, you should make sure you review the relevant lecture, as this tutorial assumes you already know what correlation is.

Setting Up

All you need is this tutorial and RStudio. Remember that you can easily switch between windows with the Alt + ↹ Tab (Windows) and ⌘ Command + ⇥ Tab (Mac OS) shortcuts.

Task 1

Create a new week_05 project and open it in RStudio. Then, create two new folders in the new week_5 folder: r_docs and data. Finally, open a new Markdown file and save it in the r_docs folder. Since we will be practicing reporting (writing up results), we will need Markdown later on. For the tasks, get into the habit of creating new code chunks as you go.

Remember, you can add new code chunks by:

Using the RStudio toolbar: Click Code > Insert Chunk
Using a keyboard shortcut: the default is Ctrl + Alt + I, but you can change this under Tools > Modify Keyboard Shortcuts…
Typing it out: ```{r}, press Enter, then ``` again.
Copy and pasting a code chunk you already have (but be careful of duplicated chunk names!)

Task 2punk!

Add and run the command to load the tidyverse package in the setup code chunk.

Task 2.1punk!

If you don’t have it yet, install the GGally package IN THE CONSOLE. Do not add this command to a code chunk! Then, load the GGally package in your setup code chunk.

Task 3punk!

As usual, we will need some data to work with. Since there’s so much interesting information in it, let’s keep looking at the gensex dataset we’ve been using so far. Use the link below to read in the data in a new code chunk.

Data Preparation

As usual, before beginning any analysis, we will want to take a look at our data to make sure it’s clean and correct before we proceed. Since we’re using our familiar gensex dataset that we’ve cleaned already, we should be pretty happy with this, but let’s have a look again at these variables. The main reason is that you can generate rs all day easily, but if you don’t know what the variables you are correlating actually mean, that doesn’t do you much good!

Codebook

Variable	Type	Description
Duration	Continuous	Time spent completing the questionnaire, in seconds
Age	Continuous	Participant age in years
Gender	Categorical	Participant gender
Gender_comfortable_1	Continuous	How comfortable participants are with their response to `Gender`; Likert 1 - 9, high score = more comfortable
Gender_masc_1	Continuous	How masculine participants feel they are; Likert 1 - 9, high score = more masculine
Gender_fem_1	Continuous	How feminine participants feel they are; Likert 1 - 9, high score = more feminine
Gender_stability_1	Continuous	How stable participants perceive their gender identity to be; Likert 1 - 9, high score = more stable
Sexual_strength_1	Continuous	How strongly participants experience sexual attraction; Likert 1 - 9, high score = more strongly
Sexual_fewq_1	Continuous	How frequently participants experience sexual attraction; Likert 1 - 9, high score = more frequently
Sexual_gender_1	Continuous	Sexual preference for genders; Likert 1 - 9, 1 = women, 5 = no preference/no gender, 9 = men
Romantic_strength_1	Continuous	How strongly participants experience romantic/emotional attraction; Likert 1 - 9, high score = more strongly
Romantic_freq_1	Continuous	How frequently participants experience romantic/emotional attraction; Likert 1 - 9, high score = more frequently
Romantic_gender_1	Continuous	Romantic/emotional preference for genders; Likert 1 - 9, 1 = women, 5 = no preference/no gender, 9 = men

For our practice today, let’s focus on the ratings of either romantic or sexual attraction, since we already looked at gender in the lecture.

Task 4punk!

Overwrite the gensex object so that it only contains variables to do with sexual or romantic attraction. Use the codebook above to find each variable, and make sure you understand how each variable was measured.

You can simply type into select() the names of all the variables you want to keep. This is fine but a bit inefficient and repetitive:

gensex <- gensex %>% 
  dplyr::select(.data = ., #optional!
    Sexual_fewq_1, Sexual_gender_1, Sexual_strength_1,
    Romantic_freq_1, Romantic_gender_1, Romantic_strength_1
  )

The help documentation suggests a more efficient way using selection helpers, which are special functions that match by pattern. Instead of typing out variable names, we can see that all the variables we want have names that start with either “Sexual” or “Romantic”. So, we can use starts_with() to get them based on those prefixes:

gensex <- gensex %>% 
  dplyr::select(starts_with(c("Sexual", "Romantic")))

Correlation Matrices

Basic Matrix with cor()

To start, we can get a very basic correlation matrix very quickly with the function cor() from the stats package, which is automatically loaded. It isn’t very pretty, but it will get us a rough and ready correlation matrix quickly.

Task 5punk!

gensex %>% 
  cor()

                    Sexual_strength_1 Sexual_fewq_1 Sexual_gender_1
Sexual_strength_1           1.0000000    0.62565969              NA
Sexual_fewq_1               0.6256597    1.00000000              NA
Sexual_gender_1                    NA            NA               1
Romantic_strength_1                NA            NA              NA
Romantic_freq_1             0.3430913    0.48449978              NA
Romantic_gender_1           0.0979998   -0.01233245              NA
                    Romantic_strength_1 Romantic_freq_1
Sexual_strength_1                    NA      0.34309127
Sexual_fewq_1                        NA      0.48449978
Sexual_gender_1                      NA              NA
Romantic_strength_1                   1              NA
Romantic_freq_1                      NA      1.00000000
Romantic_gender_1                    NA      0.07396634
                    Romantic_gender_1
Sexual_strength_1          0.09799980
Sexual_fewq_1             -0.01233245
Sexual_gender_1                    NA
Romantic_strength_1                NA
Romantic_freq_1            0.07396634
Romantic_gender_1          1.00000000

Notice that there are a lot of NA values in this table. Why do you think this is? What can we do about it?

Task 6Prog-rocK

Look at the gensex data to figure out why the correlation matrix contains NAs. Then, produce a correlation matrix that doesn’t have any NAs in it.

As we’ve seen before, the result of many calculations is NA if there’s even one NA value in the variable. Sure enough, both the Sexual_gender_1 and Romantic_strength_1 variables contain at least one NA.

gensex %>% summary()

 Sexual_strength_1 Sexual_fewq_1   Sexual_gender_1
 Min.   :1.000     Min.   :1.000   Min.   :1.00   
 1st Qu.:7.000     1st Qu.:5.000   1st Qu.:5.00   
 Median :8.000     Median :6.000   Median :8.00   
 Mean   :7.526     Mean   :6.154   Mean   :6.64   
 3rd Qu.:9.000     3rd Qu.:8.000   3rd Qu.:9.00   
 Max.   :9.000     Max.   :9.000   Max.   :9.00   
                                   NA's   :3      
 Romantic_strength_1 Romantic_freq_1 Romantic_gender_1
 Min.   :2.000       Min.   :1.000   Min.   :1.000    
 1st Qu.:7.000       1st Qu.:4.000   1st Qu.:5.000    
 Median :8.000       Median :6.000   Median :8.000    
 Mean   :7.776       Mean   :5.922   Mean   :6.631    
 3rd Qu.:9.000       3rd Qu.:7.000   3rd Qu.:9.000    
 Max.   :9.000       Max.   :9.000   Max.   :9.000    
 NA's   :2

We can use filter() to keep only the rows that do NOT contain NAs in both variables:

gensex <- gensex %>% 
  filter(!is.na(Sexual_gender_1) & !is.na(Romantic_strength_1))

Then, we can run the same cor() command as before.

gensex %>% 
  cor()

                    Sexual_strength_1 Sexual_fewq_1 Sexual_gender_1
Sexual_strength_1          1.00000000   0.601693729     0.081864835
Sexual_fewq_1              0.60169373   1.000000000     0.001622784
Sexual_gender_1            0.08186484   0.001622784     1.000000000
Romantic_strength_1        0.50969282   0.252900937     0.022522256
Romantic_freq_1            0.31539071   0.470851441     0.109874779
Romantic_gender_1          0.08464036  -0.025282400     0.870330126
                    Romantic_strength_1 Romantic_freq_1
Sexual_strength_1            0.50969282      0.31539071
Sexual_fewq_1                0.25290094      0.47085144
Sexual_gender_1              0.02252226      0.10987478
Romantic_strength_1          1.00000000      0.51362193
Romantic_freq_1              0.51362193      1.00000000
Romantic_gender_1            0.01380141      0.06792014
                    Romantic_gender_1
Sexual_strength_1          0.08464036
Sexual_fewq_1             -0.02528240
Sexual_gender_1            0.87033013
Romantic_strength_1        0.01380141
Romantic_freq_1            0.06792014
Romantic_gender_1          1.00000000

We can see that there’s a huge amount of variation in how strongly these variables are correlated. We can also see that diagonal line of 1s, which represent perfect correlations.

Task 7Prog-rocK

In your Markdown document, write your own explanation of why there are absolutely perfect correlations in your matrix.

You can see in the output that the 1s appear on the diagonal where the same variable name is in both the column and the row. As we discussed in the lecture, these 1s represent where each variable is correlated with itself. These 1s are not particularly helpful or informative!

As we covered in the lecture, a correlation matrix like this is the same above the diagonal line of 1s and below it. So, more than of half this big wall of numbers is just repeated information! Let’s use a different function to get a bit more bang for our buck, as it were.

Snazzy Matrix with GGally

The GGally package contains a very useful function called ggscatmat(), which we can use to help us better understand the relationships between our variables. It outputs a correlation matrix like the one we saw above, but mixed in with some really useful plots.

Task 8punk!

Put all of your variables into the GGally::ggscatmat() function and have a look at the output.

Since we want to put in all of the variables we currently have in gensex, we can just pipe it in.

gensex %>% 
  ggscatmat()

At least on my computer, the resulting plot is a little crunched up and square. Lucky for us, ggscatmat() outputs a ggplot object, so just like any plot we make with ggplot, we can customise it. Here I’ve just added a line that lets me adjust the aspect ratio to make it a bit more readable. Feel free to fiddle with the number to make it work best for you.

gensex %>% 
  ggscatmat() +
  theme(aspect.ratio=.5)

This plot has a lot going on, but don’t worry! That’s just because there’s a lot of useful information here. Let’s focus on one bit at a time.

First, we can see that we have our six variable names listed along the top of the plot, and then the same six names along the side. Where those columns and rows intersect, some cells of the plot contain numbers, some contain dots, and some contain lines. Let’s look first at the lines, which are all on the diagonal.

We saw before that cor() gave us 1s on the diagonal, representing each variable correlated with itself. Here, instead of 1s (which aren’t very helpful or interesting), we actually have a histogram of each variable. From the shapes that mostly slope dramatically up to the right, we can see that low ratings are uncommon, and high ratings very common, for most of our variables. The exceptions are the two variables asking about frequency (of sexual and romantic attraction), for which the most common response is 6 - 7 on our 9-point scale.

Next, let’s look at the numbers. You might recognise these values from cor() above. These are the correlation coefficients for each pair of variables. To figure out which variables are involved in the correlation, go up to the column name and along to the row name.

Finally, we have the dots. These are scatterplots of each pair of variables. Again, to find which variables are in each plot, look at the column name and the row name. So, each unique pair of variables has a scatterplot and a correlation coefficient; and each individual variable has a histogram along the diagonal. Like I said - snazzy!

Mini-Quiz

Task 9

How many pairs of variables have an inverse relationship?

Correct!That’s not right…

Task 10

What is the value of the correlation r between the two variables with the strongest relationship? Give your answer to two decimal places.

0.87

Correct!That’s not right…

Task 11

What is the value of the correlation r between the two variables with the weakest relationship? Give your answer to two decimal places.

Correct!That’s not right…

We now can easily investigate the correlation between any two variables in our dataset. However, in the lecture we also talked about reporting significance, degrees of freedom, and CIs, which we can’t get easily from this output. So, if we want to report the results of particular correlations, we should have a look at a different function, cor.test().

Pairwise Tests with cor.test()

Unlike cor(), cor.test() only tests the correlation between pairs of variables. If we want to look at lots of variables at once, we’d be better off with cor() (or, although we haven’t looked at it in this tutorial, Hmisc::rcorr()). However, if we want more information about a particular correlation, such as easily reportable degrees of freedom, p-values, and confidence intervals, cor.test() is the better choice.

Task 14punk!

Under the heading Usage, we can see that the function has two methods: one is called “Default S3 method” and the other is mentions “class ‘formula’”. We could use either method, but we’ll use the formula version now - because it’s going to be extremely useful in the future!

Under the heading Arguments, the help documentation says about the formula argument: “a formula of the form ~ u + v, where each of u and v are numeric variables giving the data values for one sample.” So, we need to write the first argument as ~ variable_name_1 + variable_name_2. We then also have a data argument to specify the data that those variables come from.

Task 15Prog-rocK

Use cor.test() to calculate the correlation between Sexual_strength_1 and Romantic_strength_1 as the output shows below.

To begin, we need our formula. We know the two variables that we want to use: Sexual_strength_1 and Romantic_strength_1. We begin by writing them in the formula we saw above: ~ Sexual_strength_1 + Romantic_strength_1. (Note that the order doesn’t matter.)

Next, we need to tell R where these variables come from using the data argument. We could either put in the name of the dataset, or pipe it in.

Then, we need to specify the values for alternative and method, which we already discussed above. Altogether, we get this:

gensex %>% 
  cor.test(~ Sexual_strength_1 + Romantic_strength_1, data = .,
           alternative = "two.sided", method = "pearson")

## Alternative without a pipe
cor.test(~ Sexual_strength_1 + Romantic_strength_1, data = gensex,
           alternative = "two.sided", method = "pearson")

When we run this function, the output is some reasonably easy to read information about the results of our test. We can see here that we have:

Task 16Prog-rocK

Stop and look at the values that the output has given you. Based on what you have learned from the lecture, write down what you think this value of r means, and how you could interpret it for these variables.

First, we can see that there is a significant correlation between how strongly participants experience sexual attraction, and how strongly they experience romantic attraction. This is a reasonably strong correlation, but it’s not as strong as we might expect. It’s also a positive correlation. This tells us that on the whole, people who experience stronger sexual attraction also tend to experience stronger romantic attraction.

Task 16.1jazz...

Create a scatterplot of these two variables to help you understand the correlation. Does this change your interpretation at all?

Any plot that looks like the one below is fine, but I’ve added a few things to make it prettier and easier to read!

gensex %>% 
  ggplot(aes(Sexual_strength_1, Romantic_strength_1)) +
  geom_point(position = "jitter", alpha = .4) +
  scale_x_continuous(breaks = 0:9,
                     name = "Strength of Sexual Attraction")+
  scale_y_continuous(breaks = 0:9,
                     name = "Strength of Romantic Attraction")+
  theme_minimal()

We can see that for both variables, most people tended to answer on the high end of the scale - which gives us the cluster of points in the upper right hand corner. This is somewhat concerning, because not many people answered at the middle to lower ends of either scale, so our value for r might not really represent what’s going on. In other words, we don’t have much information about people who don’t experience sexual or romantic attraction very strongly.

We’ll worry more about this next year, when we study assumptions and bias in data; for now, just notice that this looks a bit odd.

Correlation does not imply causation

Although we might conclude from this result that the strength of sexual attraction and the strength of romantic attraction tend to change in the same way, that does not mean they are causally related. Our results do not mean that if you experience weaker sexual attraction, that will cause you to also experience weaker romantic attraction, or vice versa. It should be clear from the previous statement that there’s no way to establish which of these two might be the cause and which the effect in any case. What our results do suggest is that if you experience weaker sexual attraction, you tend to experience weaker romantic attraction (and vice versa), to a degree that is unlikely to occur if there is in fact no real relationship at all between the strength of sexual and romantic attraction.

Task 17punk!

Save the output from cor.test() as a new object called sex_rom_cor, so we can use it again later.

Our reporting in the tasks above could use some numbers, which we can get out of the output of cor.test(). Let’s look at how we can formally report the results of a correlation analysis.

Reporting Correlations

To finish our analysis, we have to tell people about it - so we should write up the results of our correlation test as in a report.

Results like this correlation test, that have a small number of values to report, are often reported in brackets in the text of your report. This takes the general form:

To get these values, let’s look again at the results that we already saved in the object sex_rom_cor.

Task 18punk!

This is the same output that we saw above when we ran the analysis. Although this is useful for us, the analysts, to understand the result, it’s not at all formatted well for reports.

Task 19Prog-rocK

Use the general format above and the values to type out the result of the correlation analysis.

We start with the general form:

name_of_estimate(degrees_of_freedom) = value_of_estimate, p = exact_p, 95% CI [lower_bound, upper_bound]

Now we need each of these values from the output:

name_of_estimate: r
degrees_of_freedom: 299
value_of_estimate: .51
p = ????
lower_bound: .42
upper_bound: .59

The only weird thing is the value of p. R is essentially telling us that the value of p is very very very very very small. In APA style (which is the writing/reporting style we use as standard), we should report exact p values unless they are smaller than .001. At that point, we can just write “p < .001”.

Here, the p-value is some number smaller than 2.2e-16 (which means 2.2 with 16 zeros in front of it, or .00000000000000022), which is clearly much smaller than .001!

So, our reporting should look like:

r(299) = .51, p < .001, 95% CI [.42, .59]

To complete our report, we should describe in words what this result means. We essentially include the statistical result as a citation to give evidence for our claim.

Task 20Prog-rocK

You should definitely write out your own report in your own words, but here’s an example:

“A correlation analysis was performed on participants’ ratings of how strongly they experience sexual and romantic attraction. The results indicated that strength of sexual and romantic attraction were significantly positively correlated (r(299) = .51, p < .001, 95% CI [.42, .59]).”

That’s looking pretty slick! This is exactly the sort of thing you would expect to see in a journal article.

If you’re happy with this and/or ready to be done, you skip down to the Recap to finish up - this is the end of the required bit of the tutorial. Well done!

OPTIONAL: Inline Coding

There are two ways to do our reporting: typing out the results by hand, as we just did above, or using inline coding. For this module, you are only required to be able to type out the results by hand, as we did above.

However, inline coding is an extremely powerful feature of Markdown documents, but it requires a bit more coding skill. So, if you’re ready to give it a go, let’s get started!

Task 21punk!

Use class() and str() to look at the class and structure of the sex_rom_cor object.

sex_rom_cor %>% class()

[1] "htest"

sex_rom_cor %>% str()

List of 9
 $ statistic  : Named num 10.2
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named int 299
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 2.64e-21
 $ estimate   : Named num 0.51
  ..- attr(*, "names")= chr "cor"
 $ null.value : Named num 0
  ..- attr(*, "names")= chr "correlation"
 $ alternative: chr "two.sided"
 $ method     : chr "Pearson's product-moment correlation"
 $ data.name  : chr "Sexual_strength_1 and Romantic_strength_1"
 $ conf.int   : num [1:2] 0.421 0.589
  ..- attr(*, "conf.level")= num 0.95
 - attr(*, "class")= chr "htest"

The first command tells us that this object, which contains the results from the function cor.test(), has the class htest. What does this mean? Objects of this class tend to have a specific structure, which we can see using str(). Now we can see that there is a lot more stored in here than it seems. If we want to get out specific values, we can simply subset them using $, as the output from str() suggests (notice the $s in the output). The output from cor.test() that we have here contains the following useful elements (among others):

Task 22punk!

Task 22.1punk!

Use subsetting to extract the value of r from the sex_rom_cor object, then round it to two decimal places.

That’s looking very promising! Now, if I want to report the correlation coefficient for this analysis as part of my writeup, I could use inline code to do this. Inline code is a small snippet of R code inside the text part of a Markdown document, no different from the code we use inside code chunks. The difference is, we can write lines of code that print out particular numbers - as we just did above - and then insert that code into the text where we want that number to appear. When we knit the document, R will run the inline code and replace it with whatever value the code produces. To do this, I just need to write `r some_r_code` in my Markdown.

If that’s clear as mud, it might help to see it in action. In the text of my Markdown document, I could write:

“Ratings of the strength of sexual and romantic attraction were correlated at r = `r sex_rom_cor$estimate %>% round(2)`”

“Ratings of the strength of sexual and romantic attraction were correlated at r = 0.51”

The code `r sex_rom_cor$estimate %>% round(2)` is evaluated by R to produce a number, and then that number appears in place of the code in the knitted document. Cool, eh?

Once we have the basic idea, we can get almost any value of interest from our sex_rom_cor object using subsetting. The only other thing you need to know is that if subsetting gives you multiple numbers, you can get out just one of them using square brackets with a number. For example, this code: object_name$variable_name[1] will give you the first observation in that variable.

Task 23Prog-rocK

Use subsetting with square brackets to get out the lower and higher bounds of the confidence intervals from sex_rom_cor.

sex_rom_cor$conf.int[1]

[1] 0.4208932

sex_rom_cor$conf.int[2]

[1] 0.5888164

Task 24jazz...

Complete the report of this analysis below, replacing the square brackets with inline code that produces the correct number. Make sure you round all values to two decimal places, if necessary, except for p, which is rounded to three.

Strength of sexual attraction and strength of romantic attraction were significantly correlated, r([degrees of freedom]) = [value of r], 95% CI = [[lower bound], [upper bound]], p = [p-value].

Strength of sexual attraction and strength of romantic attraction were significantly positively correlated, r(`r sex_rom_cor$parameter`) = `r sex_rom_cor$estimate %>% round(2)`, p = `r sex_rom_cor$p.value %>% round(3)`,95% CI = [`r sex_rom_cor$conf.int[1] %>% round(2)`, `r sex_rom_cor$conf.int[2] %>% round(2)`].

Which should appear in Markdown as:

Strength of sexual attraction and strength of romantic attraction were significantly positively correlated, r(299) = 0.51, p = 0, 95% CI = [0.42, 0.59].

Note: The p-value actually isn’t right here. Since p is so small, it only has 0s to three decimal places, but the value isn’t actually 0. The problem is how round() works - since we’re rounding to 0.000, the output just becomes 0, which is a bit rubbish! In this case, we’d be better off just typing “p < .001”.

There are other, better ways to report p programmatically (ie using a function) - can you find one that does what you want? If not - can you write one for yourself? Let us know how you get on!

Task 25punk!

Recap

Whew, well done! You’re welcome - and encouraged! - to keep working with the gensex data to practice what we’ve learned today.

Remember, if you get stuck or you have questions, post them on Piazza, or bring them to StatsChats or to drop-ins.

Correlation

Author

Affiliation

Published

DOI

Setting Up

Task 1

Task 2punk!

Task 2.1punk!

Task 3punk!

Data Preparation

Codebook

Task 4punk!

Correlation Matrices

Basic Matrix with cor()

Task 5punk!

Task 6Prog-rocK

Task 7Prog-rocK

Snazzy Matrix with GGally

Task 8punk!

Mini-Quiz

Task 9

Task 10

Task 11

Task 12

Task 13

Pairwise Tests with cor.test()

Task 14punk!

Task 15Prog-rocK

Task 16Prog-rocK

Task 16.1jazz...

Correlation does not imply causation

Task 17punk!

Reporting Correlations

Task 18punk!

Task 19Prog-rocK

Task 20Prog-rocK

OPTIONAL: Inline Coding

Task 21punk!

Task 22punk!

Task 22.1punk!

Task 23Prog-rocK

Task 24jazz...

Task 25punk!

Recap

Footnotes

Basic Matrix with `cor()`

Snazzy Matrix with `GGally`

Pairwise Tests with `cor.test()`