Correlation

Tutorial 5

Published

Feb. 22, 2021

DOI

This tutorial covers how to prepare for, complete, and report correlation in R. Before you start this tutorial, you should make sure you review the relevant lecture, as this tutorial assumes you already know what correlation is.

Setting Up

All you need is this tutorial and RStudio. Remember that you can easily switch between windows with the Alt + ↹ Tab (Windows) and ⌘ Command + ⇥ Tab (Mac OS) shortcuts.

Note on code in the tutorial

You can reveal the solution to each task by clicking on the “Solution” button. As always, the point of this tutorial is not to torment you with tasks you cannot do. However, you are strongly encouraged not to reach straight for the answer. Try to complete the task yourself: write the code in your script, run it, see if it works. If it doesn’t, try again. Only reveal the solution if you’re stuck or if you want to check your own solution.

Task 1

Create a new week_05 project and open it in RStudio. Then, create two new folders in the new week_5 folder: r_docs and data. Finally, open a new Markdown file and save it in the r_docs folder. Since we will be practicing reporting (writing up results), we will need Markdown later on. For the tasks, get into the habit of creating new code chunks as you go.

Remember, you can add new code chunks by:

  1. Using the RStudio toolbar: Click Code > Insert Chunk
  2. Using a keyboard shortcut: the default is Ctrl + Alt + I, but you can change this under Tools > Modify Keyboard Shortcuts…
  3. Typing it out: ```{r}, press Enter, then ``` again.
  4. Copy and pasting a code chunk you already have (but be careful of duplicated chunk names!)

 

Task 2punk!

Add and run the command to load the tidyverse package in the setup code chunk.

Task 2.1punk!

If you don’t have it yet, install the GGally package IN THE CONSOLE. Do not add this command to a code chunk! Then, load the GGally package in your setup code chunk.

 

Task 3punk!

As usual, we will need some data to work with. Since there’s so much interesting information in it, let’s keep looking at the gensex dataset we’ve been using so far. Use the link below to read in the data in a new code chunk.

Link: https://and.netlify.app/docs/gensex_clean.csv

 

Data Preparation

As usual, before beginning any analysis, we will want to take a look at our data to make sure it’s clean and correct before we proceed. Since we’re using our familiar gensex dataset that we’ve cleaned already, we should be pretty happy with this, but let’s have a look again at these variables. The main reason is that you can generate rs all day easily, but if you don’t know what the variables you are correlating actually mean, that doesn’t do you much good!

Codebook

Variable Type Description
Duration Continuous Time spent completing the questionnaire, in seconds
Age Continuous Participant age in years
Gender Categorical Participant gender
Gender_comfortable_1 Continuous How comfortable participants are with their response to Gender; Likert 1 - 9, high score = more comfortable
Gender_masc_1 Continuous How masculine participants feel they are; Likert 1 - 9, high score = more masculine
Gender_fem_1 Continuous How feminine participants feel they are; Likert 1 - 9, high score = more feminine
Gender_stability_1 Continuous How stable participants perceive their gender identity to be; Likert 1 - 9, high score = more stable
Sexual_strength_1 Continuous How strongly participants experience sexual attraction; Likert 1 - 9, high score = more strongly
Sexual_fewq_1 Continuous How frequently participants experience sexual attraction; Likert 1 - 9, high score = more frequently
Sexual_gender_1 Continuous Sexual preference for genders; Likert 1 - 9, 1 = women, 5 = no preference/no gender, 9 = men
Romantic_strength_1 Continuous How strongly participants experience romantic/emotional attraction; Likert 1 - 9, high score = more strongly
Romantic_freq_1 Continuous How frequently participants experience romantic/emotional attraction; Likert 1 - 9, high score = more frequently
Romantic_gender_1 Continuous Romantic/emotional preference for genders; Likert 1 - 9, 1 = women, 5 = no preference/no gender, 9 = men

For our practice today, let’s focus on the ratings of either romantic or sexual attraction, since we already looked at gender in the lecture.

Task 4punk!

Overwrite the gensex object so that it only contains variables to do with sexual or romantic attraction. Use the codebook above to find each variable, and make sure you understand how each variable was measured.

Hint

You need to select the right variables, but there’s a more efficient way than typing all of them out! See ?dplyr::select for ideas.

Remember NOT to use quotes within select!

 

Correlation Matrices

Basic Matrix with cor()

To start, we can get a very basic correlation matrix very quickly with the function cor() from the stats package, which is automatically loaded. It isn’t very pretty, but it will get us a rough and ready correlation matrix quickly.

Task 5punk!

Create a correlation matrix of all six variables.

Notice that there are a lot of NA values in this table. Why do you think this is? What can we do about it?

Task 6Prog-rocK

Look at the gensex data to figure out why the correlation matrix contains NAs. Then, produce a correlation matrix that doesn’t have any NAs in it.

Hint

A summary of the dataset might give you a good overview of your variables so you can filter out the problem.

We can see that there’s a huge amount of variation in how strongly these variables are correlated. We can also see that diagonal line of 1s, which represent perfect correlations.

Task 7Prog-rocK

In your Markdown document, write your own explanation of why there are absolutely perfect correlations in your matrix.

As we covered in the lecture, a correlation matrix like this is the same above the diagonal line of 1s and below it. So, more than of half this big wall of numbers is just repeated information! Let’s use a different function to get a bit more bang for our buck, as it were.

Snazzy Matrix with GGally

The GGally package contains a very useful function called ggscatmat(), which we can use to help us better understand the relationships between our variables. It outputs a correlation matrix like the one we saw above, but mixed in with some really useful plots.

Task 8punk!

Put all of your variables into the GGally::ggscatmat() function and have a look at the output.

This plot has a lot going on, but don’t worry! That’s just because there’s a lot of useful information here. Let’s focus on one bit at a time.

First, we can see that we have our six variable names listed along the top of the plot, and then the same six names along the side. Where those columns and rows intersect, some cells of the plot contain numbers, some contain dots, and some contain lines. Let’s look first at the lines, which are all on the diagonal.

We saw before that cor() gave us 1s on the diagonal, representing each variable correlated with itself. Here, instead of 1s (which aren’t very helpful or interesting), we actually have a histogram of each variable. From the shapes that mostly slope dramatically up to the right, we can see that low ratings are uncommon, and high ratings very common, for most of our variables. The exceptions are the two variables asking about frequency (of sexual and romantic attraction), for which the most common response is 6 - 7 on our 9-point scale.

Next, let’s look at the numbers. You might recognise these values from cor() above. These are the correlation coefficients for each pair of variables. To figure out which variables are involved in the correlation, go up to the column name and along to the row name.

Finally, we have the dots. These are scatterplots of each pair of variables. Again, to find which variables are in each plot, look at the column name and the row name. So, each unique pair of variables has a scatterplot and a correlation coefficient; and each individual variable has a histogram along the diagonal. Like I said - snazzy!

Mini-Quiz

Answer the following questions using the output from ggscatmat().

Task 9

How many pairs of variables have an inverse relationship?

1

Correct!That’s not right…

Task 10

What is the value of the correlation r between the two variables with the strongest relationship? Give your answer to two decimal places.

0.87

Correct!That’s not right…

Task 11

What is the value of the correlation r between the two variables with the weakest relationship? Give your answer to two decimal places.

0

Correct!That’s not right…

Task 12

To the nearest whole point, what is the most common rating for the variable Sexual_gender_1?

8

Task 13

How can you interpret the scatterplot between Sexual_fewq_1 and Sexual_gender_1?

No pattern, just a random block of dots

We now can easily investigate the correlation between any two variables in our dataset. However, in the lecture we also talked about reporting significance, degrees of freedom, and CIs, which we can’t get easily from this output. So, if we want to report the results of particular correlations, we should have a look at a different function, cor.test().

Pairwise Tests with cor.test()

Unlike cor(), cor.test() only tests the correlation between pairs of variables. If we want to look at lots of variables at once, we’d be better off with cor() (or, although we haven’t looked at it in this tutorial, Hmisc::rcorr()). However, if we want more information about a particular correlation, such as easily reportable degrees of freedom, p-values, and confidence intervals, cor.test() is the better choice.

Task 14punk!

Open the help documentation for the cor.test() function.

Under the heading Usage, we can see that the function has two methods: one is called “Default S3 method” and the other is mentions “class ‘formula’”. We could use either method, but we’ll use the formula version now - because it’s going to be extremely useful in the future!

Under the heading Arguments, the help documentation says about the formula argument: “a formula of the form ~ u + v, where each of u and v are numeric variables giving the data values for one sample.” So, we need to write the first argument as ~ variable_name_1 + variable_name_2. We then also have a data argument to specify the data that those variables come from.

Finally, we have two other important arguments with no default:

Task 15Prog-rocK

Use cor.test() to calculate the correlation between Sexual_strength_1 and Romantic_strength_1 as the output shows below.

 


    Pearson's product-moment correlation

data:  Sexual_strength_1 and Romantic_strength_1
t = 10.244, df = 299, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4208932 0.5888164
sample estimates:
      cor 
0.5096928 

When we run this function, the output is some reasonably easy to read information about the results of our test. We can see here that we have:

Task 16Prog-rocK

Stop and look at the values that the output has given you. Based on what you have learned from the lecture, write down what you think this value of r means, and how you could interpret it for these variables.

Task 16.1jazz...

Create a scatterplot of these two variables to help you understand the correlation. Does this change your interpretation at all?

Correlation does not imply causation

Although we might conclude from this result that the strength of sexual attraction and the strength of romantic attraction tend to change in the same way, that does not mean they are causally related. Our results do not mean that if you experience weaker sexual attraction, that will cause you to also experience weaker romantic attraction, or vice versa. It should be clear from the previous statement that there’s no way to establish which of these two might be the cause and which the effect in any case. What our results do suggest is that if you experience weaker sexual attraction, you tend to experience weaker romantic attraction (and vice versa), to a degree that is unlikely to occur if there is in fact no real relationship at all between the strength of sexual and romantic attraction.

Task 17punk!

Save the output from cor.test() as a new object called sex_rom_cor, so we can use it again later.

Our reporting in the tasks above could use some numbers, which we can get out of the output of cor.test(). Let’s look at how we can formally report the results of a correlation analysis.

 

Reporting Correlations

To finish our analysis, we have to tell people about it - so we should write up the results of our correlation test as in a report.

Results like this correlation test, that have a small number of values to report, are often reported in brackets in the text of your report. This takes the general form:

(name_of_estimate(degrees_of_freedom) = value_of_estimate, p = exact_p, 95% CI [lower_bound, upper_bound])

To get these values, let’s look again at the results that we already saved in the object sex_rom_cor.

Task 18punk!

Call the name of the object sex_rom_cor to see the displayed output.

This is the same output that we saw above when we ran the analysis. Although this is useful for us, the analysts, to understand the result, it’s not at all formatted well for reports.

You should never have any raw (unformatted) code or output in a report! Always report results in the text, in a figure, or in a formatted table.

Task 19Prog-rocK

Use the general format above and the values to type out the result of the correlation analysis.

To complete our report, we should describe in words what this result means. We essentially include the statistical result as a citation to give evidence for our claim.

Task 20Prog-rocK

Have a go writing out a report of this statistical analysis.

That’s looking pretty slick! This is exactly the sort of thing you would expect to see in a journal article.

If you’re happy with this and/or ready to be done, you skip down to the Recap to finish up - this is the end of the required bit of the tutorial. Well done!

If you’d like to learn how to get R to do the reporting for you, read on!

OPTIONAL: Inline Coding

There are two ways to do our reporting: typing out the results by hand, as we just did above, or using inline coding. For this module, you are only required to be able to type out the results by hand, as we did above.

However, inline coding is an extremely powerful feature of Markdown documents, but it requires a bit more coding skill. So, if you’re ready to give it a go, let’s get started!

Task 21punk!

Use class() and str() to look at the class and structure of the sex_rom_cor object.

The first command tells us that this object, which contains the results from the function cor.test(), has the class htest. What does this mean? Objects of this class tend to have a specific structure, which we can see using str(). Now we can see that there is a lot more stored in here than it seems. If we want to get out specific values, we can simply subset them using $, as the output from str() suggests (notice the $s in the output). The output from cor.test() that we have here contains the following useful elements (among others):

Task 22punk!

Use subsetting to extract the value of r from the sex_rom_cor object.

That’s very cool, but we can do a bit better…

Task 22.1punk!

Use subsetting to extract the value of r from the sex_rom_cor object, then round it to two decimal places.

That’s looking very promising! Now, if I want to report the correlation coefficient for this analysis as part of my writeup, I could use inline code to do this. Inline code is a small snippet of R code inside the text part of a Markdown document, no different from the code we use inside code chunks. The difference is, we can write lines of code that print out particular numbers - as we just did above - and then insert that code into the text where we want that number to appear. When we knit the document, R will run the inline code and replace it with whatever value the code produces. To do this, I just need to write `r some_r_code` in my Markdown.

If that’s clear as mud, it might help to see it in action. In the text of my Markdown document, I could write:

“Ratings of the strength of sexual and romantic attraction were correlated at r = `r sex_rom_cor$estimate %>% round(2)`

When I knit the Markdown document, this will come out as:

“Ratings of the strength of sexual and romantic attraction were correlated at r = 0.51”

The code `r sex_rom_cor$estimate %>% round(2)` is evaluated by R to produce a number, and then that number appears in place of the code in the knitted document. Cool, eh?

Once we have the basic idea, we can get almost any value of interest from our sex_rom_cor object using subsetting. The only other thing you need to know is that if subsetting gives you multiple numbers, you can get out just one of them using square brackets with a number. For example, this code: object_name$variable_name[1] will give you the first observation in that variable.

Task 23Prog-rocK

Use subsetting with square brackets to get out the lower and higher bounds of the confidence intervals from sex_rom_cor.

Task 24jazz...

Complete the report of this analysis below, replacing the square brackets with inline code that produces the correct number. Make sure you round all values to two decimal places, if necessary, except for p, which is rounded to three.

Strength of sexual attraction and strength of romantic attraction were significantly correlated, r([degrees of freedom]) = [value of r], 95% CI = [[lower bound], [upper bound]], p = [p-value].

Hint

To preview what number your inline code will produce when it’s knitted, you can click anywhere inside the backticks and press Ctrl (or ⌘ Command) + Enter. You’ll see a small pop-up that will show the value that your inline code produces.

Task 25punk!

Knit your document to see your completed inline code!

 

Recap

Whew, well done! You’re welcome - and encouraged! - to keep working with the gensex data to practice what we’ve learned today.

In sum, we’ve covered:

Remember, if you get stuck or you have questions, post them on Piazza, or bring them to StatsChats or to drop-ins.

 

Good job!

That’s all for today. See you soon!

 

 

 

Footnotes