Analysing Data: Chi-Square

This tutorial covers how to prepare for, complete, and report chi-square tests in R. Before you start this tutorial, you should make sure you review the relevant lecture, as this tutorial assumes you already know what chi-square is and what it’s for.

Setting Up

All you need is this tutorial and RStudio. Remember that you can easily switch between windows with the Alt + ↹ Tab (Windows) and ⌘ Command + ⇥ Tab (Mac OS) shortcuts.

Task 1

Create a new week_06 project and open it in RStudio. Then, create two new folders in the new week_06 folder: r_docs and data. Finally, open a new Markdown file and save it in the r_docs folder. Since we will be practising reporting (writing up results), we will need Markdown later on. For the tasks, get into the habit of creating new code chunks as you go.

Remember, you can add new code chunks by:

Using the RStudio toolbar: Click Code > Insert Chunk
Using a keyboard shortcut: the default is Ctrl + Alt + I, but you can change this under Tools > Modify Keyboard Shortcuts…
Typing it out: ```{r}, press Enter, then ``` again.
Copying and pasting a code chunk you already have (but be careful of duplicated chunk names!)

Task 2 (punk!)

Add and run the command to load the tidyverse package in the setup code chunk.

In Hot Water

You may remember this ~~horrific abomination~~ interesting interpretation of the tea-making process that blew up last summer. Many people, including my English partner

Here’s the video in question, if you would like a nice steaming cup of absolute outrage.

You may know someone - or be that someone yourself! - who is convinced that you can tell the difference when someone makes tea “wrong”. Well, whether there is a right or wrong way to make a cuppa is a matter of opinion

A Lady Tasting Tea

This particular research question is very appropriate for our statistical test today. In 1935, botanist Dr Blanche Muriel Bristol claimed that she could tell whether the milk had been added before or after the tea. Her claim was apparently substantiated by her subsequent performance on eight randomly presented cups of tea, four with the milk first and four with the tea first. The person testing her on this claim was a dapper young Ronald Fisher, who subsequently described a version of this experiment in the 1935 paper “A Lady Tasting Tea”.⁴ This paper set out Fisher’s idea of the null hypothesis for the first time.

⁴ Arguably, this paper should have been called “A Doctor Tasting Tea”, but seeing as how people still struggle with giving women their proper titles almost a hundred years later, it’s not entirely surprising.

Data and Design

Our experiment today will be based on this excellent investigation by students at the University of Sheffield, which includes full descriptions of the results and instructions for doing your own tea taste test with your family, friends, or flatmates. In case you’re not up for making a few hundred cups of tea in the name of Science, we’ll use that study’s data instead, graciously provided by Dr Tom Stafford.

Task 3 (punk!)

Read in the data hosted at the link below and save it in a new object. I’ve called mine chi_tea, but you do you.

The Tea Taste Test Team (henceforth: T³ Team) took quite a few measures for their experiment. For now, we’ll look at the main result of interest: whether people were able to correctly identify which was the milk-first and which was the tea-first cup.

The T³ Team’s paradigm consisted of a double-blind experiment. First, the experimenter asked their participants a few initial questions about how they typically liked their tea, and whether the participant believed they could tell the difference. The tea-maker prepared two cups of tea out of view of both the experimenter and the participant, one with the milk added first and the other with the tea added first. Then, the participant chose a ticket that randomly determined two things: whether the milk-first cup would be placed on their left or right, and whether they should first taste the cup on their left or right. Once they tasted both cups, the experimenter asked the participant which cup they preferred, and which cup they believed was made with the milk first. Importantly, the experimenter was also not aware of which was which; only once the participant made their judgement did the tea-maker reveal the truth.
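If the hosted file is out of reach, a stand-in with the same shape can be rebuilt from the four cell counts reported later in this tutorial (25, 28, 24 and 18). This is only a sketch: the row order is arbitrary, and coding correct as TRUE/FALSE is an assumption, since the real file may code it differently.

```r
# Stand-in for the real data, rebuilt from the observed counts reported
# later in this tutorial. Row order and the coding of `correct` are assumptions.
cell_counts <- c(25, 28, 24, 18)

chi_tea <- data.frame(
  believe_position = rep(c("left", "right", "left", "right"), times = cell_counts),
  actual_position  = rep(c("left", "left", "right", "right"), times = cell_counts)
)

# A judgement is correct when the believed side matches the actual side
chi_tea$correct <- chi_tea$believe_position == chi_tea$actual_position

nrow(chi_tea)   # one row per participant
names(chi_tea)
```

Because the stand-in only reproduces the two position variables, it is enough for the test of association below but not for any of the T³ Team's other measures.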

Task 4

Task 4.1 (punk!)

nrow(chi_tea)

[1] 95

Task 4.2 (punk!)

ncol(chi_tea)

[1] 3

Task 4.3 (punk!)

names(chi_tea)

[1] "believe_position" "actual_position"  "correct"

Test of Association

As we discussed in the lecture, a test of association (also called a test of independence) investigates whether two variables are associated with each other. In this case, we want to know whether there is an association between the location of the milk-first cup of tea (left or right), and the participant’s belief about the location of the milk-first cup of tea (left or right).

Task 5

Task 5.1 (Prog-rocK)

What is the null hypothesis for this experiment? What pattern would we expect to see in the data if the null hypothesis were true? Write down your thoughts in your Markdown.

In this case, the null hypothesis is that there is no association between actual and believed milk-first tea location.

For this experiment, if the null hypothesis were true, we would expect to see approximately equal numbers of people guessing left or right, no matter whether the milk-first tea is actually on the left or right.

Task 5.2 (Prog-rocK)

What is the alternative hypothesis for this experiment? What pattern would we expect to see in the data if the alternative hypothesis were true? Write down your thoughts in your Markdown.

The alternative hypothesis would be that there is an association between actual and believed milk-first tea location.

For this experiment, evidence for this hypothesis would appear as an imbalance in whether people guess left or right, depending on whether the milk-first tea is on the left or right. This would suggest an association between which side people think the milk-first tea is on, and which side it is actually on.

Note that this alternative hypothesis doesn’t specify what direction the association would be in. We could find an association between people believing the milk-first tea is on the left and it being on the left (i.e. tending to judge correctly); or we could find an association between people believing the milk-first tea is on the left and it being on the right (i.e. tending to judge incorrectly).

Task 5.3 (Prog-rocK)

What do you think we will find? Do you think that people will be able to correctly identify the milk-first tea? Do you think you could? Write down your thoughts in your Markdown.

You can write whatever you like here, although do write something. It’s very important to think about - and document! - what you are expecting to find before you dive into your data.

Me, I drink coffee, and all tea tastes like hot leaf juice to me 🍵 🤷

Visualising the Data

Now that we have our predictions nailed down, let’s have a look at the data! As usual, it’s always important to visualise your data first thing.

Task 6

Let’s build a beautiful bar chart to represent the count data we have. We’ll start with a basic chart and build it up from there to create a publication-worthy, lab-report-quality graph. In the end, it will look like this:

Task 6.1 (punk!)

Start by creating a basic bar chart of the actual_position variable like the one below, and adding a theme to spruce it up.

Hint

We haven’t done bar charts much before, but you do have some practice using ggplot2. Remember that we usually tell ggplot what kind of figure to make using geoms - see if you can guess what function you need!

If you get stuck, I recommend the R Graph Gallery for help and examples, or just Googling “How to make a bar chart in R”!

For a theme, there are several built into ggplot2. Check out theme_minimal() and theme_bw() for clean, simple themes. Naturally, you can also install extra themes. I like theme_cowplot(), but you’ll have to install the cowplot package to use it. Any of these are fine - choose one you like!

For the simple bar chart, we just need to tell R what data we want to use and then the kind of figure to draw.

chi_tea %>% 
  ggplot(aes(x = actual_position)) +
  geom_bar() + # draw a bar plot
  cowplot::theme_cowplot()

It’s pretty boring, but it works!

Task 6.2 (punk!)

Hint

We already told R what to put on the x-axis, and we already have what we want on the y-axis (i.e., counts). So, to add our second variable, we can use the fill = argument in aes() to split up our two actual_position bars by their value on believe_position.
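A sketch of what this step might look like is below. It assumes ggplot2 is installed, uses a stand-in data frame rebuilt from the counts reported later in this tutorial rather than the real file, and swaps in theme_minimal() so that no extra package is needed. With fill = alone, geom_bar() stacks the two colours inside each bar.

```r
library(ggplot2)

# Stand-in data rebuilt from the counts reported later in this tutorial
cell_counts <- c(25, 28, 24, 18)
chi_tea <- data.frame(
  actual_position  = rep(c("left", "left", "right", "right"), times = cell_counts),
  believe_position = rep(c("left", "right", "left", "right"), times = cell_counts)
)

# fill = splits each bar by believe_position; by default the pieces are stacked
p_stacked <- ggplot(chi_tea, aes(x = actual_position, fill = believe_position)) +
  geom_bar() +
  theme_minimal()

p_stacked
```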

Hm…that doesn’t look quite right. Here, the fill argument has filled the bars with different colours. This doesn’t make comparison very easy, so instead I want to split the two bars into four, based on their value of the second variable, believe_position. In other words, I want geom_bar to draw the bars in different positions.

Task 6.3 (punk!)

chi_tea %>% 
  ggplot(aes(x = actual_position, fill = believe_position)) +
  geom_bar(position = "dodge") +
  cowplot::theme_cowplot()

Hey, that’s looking better already! That’s basically the information that we want, and this would be enough if we were just using this graph to visualise the data for our own purposes. However, we have reports to write, so we’ll need to add some formatting to make it look profesh.

Task 6.4 (Prog-rocK)

Hint

There are lots of ways to change the names of scales, including labs(). I like doing them separately with these scale_ functions because you can also use them to change other things, like what numbers appear on the y-axis, and what the labels are for categories. Ultimately you should build your plots in a way that makes sense to you, and that ends up looking the way you want it.

For example, in the solution I’ve also added two more arguments to scale_y_continuous: limits and breaks. The first changes the minimum and maximum values on the y axis, and the second changes how the numbers appear. This is optional, but you can try messing about with these two arguments to see what happens if you want to get a feel for how they work.

Use the help documentation if you’re not sure how to use these functions!

chi_tea %>% 
  ggplot(aes(x = actual_position, fill = believe_position)) +
  geom_bar(position = "dodge") +
  scale_x_discrete(name = "Actual Position of the Milk-First Tea",
                   labels = c("Left", "Right")) +
  scale_y_continuous(name = "Count",
                     limits = c(0, 30),
                     breaks = seq(0, 35, by = 5)) +
  scale_fill_discrete(name = "Believed Position",
                    labels = c("Left", "Right")) +
  cowplot::theme_cowplot()
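To get a feel for the limits and breaks values used in the code above, a small base R check can help. It shows which tick positions seq() generates and which of them sit inside the limits. By default ggplot2 leaves out breaks that fall outside the limits, so the 35 should never appear on the axis; the variable names here are just for illustration.

```r
# The tick positions requested in the breaks argument
y_breaks <- seq(0, 35, by = 5)
y_breaks

# The axis range requested in the limits argument
y_limits <- c(0, 30)

# Breaks that fall inside the limits, i.e. the ones that can be drawn
visible_breaks <- y_breaks[y_breaks >= y_limits[1] & y_breaks <= y_limits[2]]
visible_breaks
```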

Now that we’ve got the text nailed down, let’s do something about those colours.

Task 6.5 (punk!)

Add a type = c(...) argument to scale_fill_discrete to change the colours. Remember you will need to give two colours, one for “left” and one for “right”.

Hint

To add colours, you can use either named colours that R knows, or hexadecimal colour codes.

R recognises a lot of colour names. Just put the name you want in “quotes” to use it.

If you want a very precise shade that isn’t on the list of colour names, you can also use any colour that your heart desires using the colour’s hexadecimal code. The hex code is a series of 6 letters and numbers that uniquely specify any colour in the RGB colour gamut. You can try typing random ones, or use this handy colour selector to get the exact shade you want.

If you wanted the Analysing Data theme colours, you could use darkcyan or #009FA7 for teal, and purple4 or #52006F for the purple.
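If you are ever unsure what a hex code encodes, base R's col2rgb() breaks it into red, green and blue values from 0 to 255. The sketch below checks the two hex codes mentioned here; the names hex_purple and hex_teal are just labels chosen for the example.

```r
# col2rgb() converts colour names or hex codes to red, green, blue values (0-255)
and_colours <- col2rgb(c(hex_purple = "#52006F", hex_teal = "#009FA7"))
and_colours

# The same function works on colour names, which makes shades easy to compare
col2rgb(c("purple4", "darkcyan"))
```

Reading the hex code in pairs gives the same answer by hand: in #52006F, 52 is the red value, 00 the green and 6F the blue, so a colour with no green and plenty of red and blue comes out purple.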

chi_tea %>% 
  ggplot(aes(x = actual_position, fill = believe_position)) +
  geom_bar(position = "dodge") +
  scale_x_discrete(name = "Actual Position of the Milk-First Tea",
                   labels = c("Left", "Right")) +
  scale_y_continuous(name = "Count",
                     limits = c(0, 30),
                     breaks = seq(0, 35, by = 5)) +
  scale_fill_discrete(name = "Believed Position",
                    labels = c("Left", "Right"),
                    type = c("#52006F", "#009FA7")) + # AnD colours, heck yeah!!
  cowplot::theme_cowplot()

Now, as stylish and attractive as that Analysing Data colour palette is, it isn’t really suitable for formal reporting. Let’s tone it down a bit.

Task 6.6 (punk!)

chi_tea %>% 
  ggplot(aes(x = actual_position, fill = believe_position)) +
  geom_bar(position = "dodge") +
  scale_x_discrete(name = "Actual Position of the Milk-First Tea",
                   labels = c("Left", "Right")) +
  scale_y_continuous(name = "Count",
                     limits = c(0, 30),
                     breaks = seq(0, 35, by = 5)) +
  scale_fill_discrete(name = "Believed\nPosition",
                    labels = c("Left", "Right"),
                    type = c("grey77", "grey37")) + # boring but professional!
  cowplot::theme_cowplot()

Lab Report Plots

For the Green lab reports, you must include a bar chart of your results. You don’t have to only use grey, or any particular theme, but you should choose colour and styling that look like something you might see in a journal article.

Task 7

To interpret the plot, look at one bit of it at a time. So, let’s look at the scenario where the milk-first tea was actually on the participant’s left. In this case, people actually thought it was on the right slightly more often than they thought it was on the left. If we look at the other half of the plot, when the milk-first tea was on the right, people tended to believe it was the one on the left!

Overall, people were incorrect more often than they were correct - but not by much, as we can tell because the bars are of fairly similar heights.
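To put a number on "not by much", the counts behind the bars (reported in full later in this tutorial) can be added up directly. This is a rough sketch that types the four counts in by hand rather than reading them from the data.

```r
# Observed counts (rows: actual position, columns: believed position)
observed <- matrix(c(25, 24, 28, 18), nrow = 2,
                   dimnames = list(actual = c("left", "right"),
                                   believed = c("left", "right")))

# Correct judgements sit on the diagonal: believed side matches actual side
n_correct    <- sum(diag(observed))
n_incorrect  <- sum(observed) - n_correct
prop_correct <- n_correct / sum(observed)

c(correct = n_correct, incorrect = n_incorrect)
round(prop_correct, 2)
```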

Now that we’ve had a look at our data, we can move on to our statistical testing!

Chai-Square⁵

⁵ Get it??

Running the Analysis

So, we’ve made our predictions, and we’ve had a look at the data. Now we’re ready to conduct our χ² test.

Task 8

Task 8.1 (Prog-rocK)

Use the chisq.test() function to perform a χ² test of association on the chi_tea data. Your output should look like the one below.

Hint

All you need to put into the chisq.test() function are the two variables we are using. You should use subsetting with $ here; this function doesn’t play well with pipes!
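As a hedged sketch of the call the hint describes, the code below runs the test on a stand-in data frame rebuilt from the counts reported later in this tutorial (not the real file) and saves the result as chi_tea_test, the object name used in the later tasks.

```r
# Stand-in data rebuilt from the observed counts reported later in this tutorial
cell_counts <- c(25, 28, 24, 18)
chi_tea <- data.frame(
  actual_position  = rep(c("left", "left", "right", "right"), times = cell_counts),
  believe_position = rep(c("left", "right", "left", "right"), times = cell_counts)
)

# Pass the two variables with $ and save the result for later use
chi_tea_test <- chisq.test(chi_tea$actual_position, chi_tea$believe_position)
chi_tea_test
```

Saving the result in an object, rather than just printing it, is what makes the observed and expected frequencies available later on.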

Reporting the Results

Now that we’ve got our test result, let’s report it in APA style. This takes the general form:

name_of_estimate(degrees_of_freedom) = value_of_estimate, p = exact_p

We don’t have confidence intervals for this test, so we don’t need to report them.

Task 9 (Prog-rocK)

We start with the general form:

name_of_estimate(degrees_of_freedom) = value_of_estimate, p = exact_p

Now we need each of these values from the output:

name_of_estimate: χ²
degrees_of_freedom: 1
value_of_estimate: 0.58
p: .448

So, our reporting should look like: χ²(1) = 0.58, p = .448

Note: write χ² by typing $\chi^2$ in your Markdown!
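If you would rather not round by hand, the pieces can be formatted with sprintf(). This is only a sketch: the three values are typed in from the test output reported in this tutorial, and the variable names are made up for the example.

```r
# Values taken from the test output reported in this tutorial
chi_sq   <- 0.577
deg_free <- 1L
p_value  <- 0.448

# APA style drops the leading zero from p-values
p_text <- sub("^0", "", sprintf("%.3f", p_value))

# Two decimal places for the statistic, three for p
report <- sprintf("$\\chi^2$(%d) = %.2f, p = %s", deg_free, chi_sq, p_text)
report
```

The doubled backslash is only needed inside the R string; printed with cat(report), the text reads $\chi^2$(1) = 0.58, p = .448, ready to paste into Markdown.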

We should also describe in words what this result means. We essentially include the statistical result as a citation to give evidence for our claim.

Task 9.1 (Prog-rocK)

You should definitely write out your own report in your own words, but here’s an example:

“A chi-square test of association was performed on participants’ believed location of the milk-first tea, versus its actual location. The results indicated no association between believed and actual location (χ²(1) = 0.58, p = .448).”

That’s looking pretty slick! This is exactly the sort of thing you would expect to see in a journal article - or a lab report 😉

Task 9.2 (punk!)

Observed and Expected Frequencies

To finish off our reporting, it’s good practice to report the actual numbers or frequencies that went into the chi-square test. Luckily, we can get this very easily out of the chi_tea_test object we’ve just created.

Task 10

Task 10.1

str(chi_tea_test)

List of 9
 $ statistic: Named num 0.577
  ..- attr(*, "names")= chr "X-squared"
 $ parameter: Named int 1
  ..- attr(*, "names")= chr "df"
 $ p.value  : num 0.448
 $ method   : chr "Pearson's Chi-squared test with Yates' continuity correction"
 $ data.name: chr "chi_tea$actual_position and chi_tea$believe_position"
 $ observed : 'table' int [1:2, 1:2] 25 24 28 18
  ..- attr(*, "dimnames")=List of 2
  .. ..$ chi_tea$actual_position : chr [1:2] "left" "right"
  .. ..$ chi_tea$believe_position: chr [1:2] "left" "right"
 $ expected : num [1:2, 1:2] 27.3 21.7 25.7 20.3
  ..- attr(*, "dimnames")=List of 2
  .. ..$ chi_tea$actual_position : chr [1:2] "left" "right"
  .. ..$ chi_tea$believe_position: chr [1:2] "left" "right"
 $ residuals: 'table' num [1:2, 1:2] -0.447 0.502 0.461 -0.518
  ..- attr(*, "dimnames")=List of 2
  .. ..$ chi_tea$actual_position : chr [1:2] "left" "right"
  .. ..$ chi_tea$believe_position: chr [1:2] "left" "right"
 $ stdres   : 'table' num [1:2, 1:2] -0.966 0.966 0.966 -0.966
  ..- attr(*, "dimnames")=List of 2
  .. ..$ chi_tea$actual_position : chr [1:2] "left" "right"
  .. ..$ chi_tea$believe_position: chr [1:2] "left" "right"
 - attr(*, "class")= chr "htest"

We can see that there’s quite a bit stored in this object - more than appears when we just call its name. Among other useful info, we can see that there are $ observed and $ expected objects stored here.
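The method entry in the output above mentions Yates' continuity correction, which chisq.test() applies to 2 × 2 tables by default. As a rough sketch of what that means, the statistic can be rebuilt by hand from the observed and expected counts by subtracting 0.5 from each absolute difference before squaring. The counts are typed in from the output above, and the object names are made up for the example.

```r
# Observed counts as listed under $observed above
observed <- matrix(c(25, 24, 28, 18), nrow = 2)

# Let chisq.test() supply the expected counts and the reference statistic
test_result <- chisq.test(observed)
expected    <- test_result$expected

# Yates' correction: shrink each |observed - expected| by 0.5 before squaring
chi_sq_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

# The uncorrected Pearson statistic, for comparison
chi_sq_plain <- sum((observed - expected)^2 / expected)

round(c(yates = chi_sq_yates, plain = chi_sq_plain), 3)
```

The corrected value is the one R reports, and it is always a little smaller than the uncorrected one, which makes the test slightly more conservative.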

Task 10.2 (punk!)

chi_tea_test$observed

                       chi_tea$believe_position
chi_tea$actual_position left right
                  left    25    28
                  right   24    18

You might notice that these are the same counts that appeared in our graph up above. Even though this information is presented visually there, it’s still a good idea to include these numbers in your report.

Task 10.3 (Prog-rocK)

You should write this in your own words, but here’s an example:

“Of the participants who had the milk-first tea positioned on their left, 25 correctly believed it to be the cup on the left, while 28 believed it was the cup on the right. For participants who had the milk-first tea on their right, 18 correctly believed that it was the cup on the right, whereas 24 believed it was on the left. Overall, participants were incorrect about the location of the milk-first tea more often than they were correct.”

Task 11

You should write this in your own words, but here’s an example:

"In this experiment, participants were given two cups of tea at the same time, one on the left and one on the right, and asked to taste both. Then they were asked which cup they believed contained the tea that had milk added first (before the tea was poured in).

Of the participants who had the milk-first tea positioned on their left, 25 correctly believed it to be the cup on the left, while 28 believed it was the cup on the right. For participants who had the milk-first tea on their right, 18 correctly believed that it was the cup on the right, whereas 24 believed it was on the left. Overall, participants were incorrect about the location of the milk-first tea more often than they were correct.

A chi-square test of association was performed on participants’ believed location of the milk-first tea, versus its actual location. The results indicated no association between believed and actual location (χ²(1) = 0.58, p = .448)."

Before we finish up, there’s one more thing we should check. Remember from the lecture that the χ² test requires all expected frequencies to be greater than five.

Task 12 (Prog-rocK)

To get this information, we can just pull the table of expected frequencies from the chi_tea_test object. Then, we just look at the values in the table to check that they are all bigger than five.

chi_tea_test$expected

                       chi_tea$believe_position
chi_tea$actual_position     left    right
                  left  27.33684 25.66316
                  right 21.66316 20.33684

All four values are well above five, so we’re all good!
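For a sense of where these expected frequencies come from, each one is the row total times the column total, divided by the overall total. For the top-left cell that is 53 × 49 / 95, which is about 27.34. The sketch below rebuilds the whole table that way from counts typed in by hand, and lets R do the "greater than five" check.

```r
# Observed counts (rows: actual position, columns: believed position)
observed <- matrix(c(25, 24, 28, 18), nrow = 2,
                   dimnames = list(actual = c("left", "right"),
                                   believed = c("left", "right")))

# Expected count = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)

# TRUE only if every expected count is greater than five
all(expected > 5)
```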

Recap

Well done conducting your χ² analysis. I highly recommend reading all of the T³ Team’s findings - they asked a lot of interesting questions and have done a great job presenting their results.

Remember, if you get stuck or you have questions, post them on Piazza, or bring them to StatsChats or to drop-ins.
