Analysing Data: Describing data

Guided part

This worksheet builds on the guided part of the practical that preceded it. You can download the R script from this live-coding session.

This worksheet gives you the opportunity to practise what you learnt in the first part of the practical. To make things interesting, let’s use data from an actual paper by Swiderska and Küster (2018) exploring ways to increase people’s capacity to empathise with robots. You can find the data from the study at https://and.netlify.app/docs/harm_data.csv (you don’t need to download this file).

Setup

Task 1

First of all, if you haven’t done it yet, create a week_02 R project inside of your module folder and within it, create the standard folder structure, just like last week.

Task 2

Download this R Markdown file and save/move it into your r_docs folder.

Use the R Markdown file you downloaded in task 2 to complete the following tasks.

Part 1: Inspecting and wrangling data

Task 3

In the setup code chunk of the .Rmd file, write the code to load the packages you will need to complete this practical: tidyverse, kableExtra, and cowplot should be enough. You might need to install cowplot if you haven’t used it yet.

# add library() commands to load all packages you need for this document.
library(tidyverse)
library(kableExtra)
library(cowplot)

Task 4

In the read-in code chunk, write the code to read the data into RStudio.

All you need to do is copy the URL (address) of the file as a "character string" into the readr::read_csv() function and assign its output to an object.

# write a line of code to read in the data
data <- read_csv("https://and.netlify.app/docs/harm_data.csv")

Task 5

In the inspect chunk, write code to complete the following tasks:

Task 5.1

Ask R to give you the number of columns, the number of rows, and the column names of the dataset.

The names(), ncol(), and nrow() functions are useful here.

# check number of rows and columns, and names of variables
nrow(data)

[1] 217

ncol(data)

[1] 12

names(data)

 [1] "ID"             "Condition"      "Humanness"     
 [4] "Harm"           "Gender"         "Age"           
 [7] "Pain"           "Experience"     "Agency"        
[10] "Consciousness"  "Empathy"        "Attractiveness"

OK, that’s quite a lot of columns. Let’s only keep a few to make things easier.

Task 5.2

Discard all columns except for "ID", "Condition", "Humanness", "Harm", "Gender", and "Age".

You want to be selecting columns.

# only keep the ID, Condition, Humanness, Harm, Gender, and Age variables in the dataset
data <- data %>%
  dplyr::select(ID, Condition, Humanness, Harm, Gender, Age)

Task 5.3

Add code that gives you the age range (minimum and maximum) of participants in the data set.

There are oh-so-many ways to do this but one you’re already familiar with involves summarising the data.

# check range of the Age column
data %>%
  dplyr::summarise(min = min(Age),
                   max = max(Age))

# A tibble: 1 x 2
    min   max
  <dbl> <dbl>
1     1    44

# alternatives
data %>% dplyr::pull(Age) %>% min()

[1] 1

data %>% dplyr::pull(Age) %>% max()

[1] 44

min(data$Age)

[1] 1

max(data$Age)

[1] 44
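One more base R option, offered here as a small sketch rather than as part of the original solution, is range(), which returns the minimum and the maximum in a single call. The toy_ages vector below is invented for illustration; with the real data you would pass data$Age instead.

```r
# hypothetical ages, invented for illustration only
toy_ages <- c(18, 1, 44, 23, 19)

# range() returns a vector of length 2: c(minimum, maximum)
age_range <- range(toy_ages)
age_range
```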

Task 6

Your code should make sure that you’re not analysing data from minors. But before we remove any participant from our data, it is crucially important to keep a record of how many we excluded.

Task 6.1

In the clean chunk, create an object remove_age and in it, store the number of participants who are younger than 16 years old.

Did someone say filter?

# how many rows are we about to remove?
remove_age <- data %>%
  dplyr::filter(Age < 16) %>%
  nrow()

Task 6.2

Add a line in the clean code chunk that only keeps data from participants who are 16+.

# only keep participants aged 16 or over in your data
data <- data %>%
  dplyr::filter(Age >= 16)
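A quick sanity check can confirm that the number of excluded rows and the number of retained rows add up to the original number of rows. The sketch below uses a made-up tibble called toy (all names and values here are assumptions for illustration, not the study data) so that it runs on its own.

```r
library(dplyr)

# hypothetical toy data, invented for illustration only
toy <- tibble::tibble(ID = 1:5, Age = c(18, 1, 44, 15, 16))

n_before <- nrow(toy)

# count the rows about to be removed, then remove them
toy_removed <- toy %>%
  dplyr::filter(Age < 16) %>%
  nrow()
toy_clean <- toy %>%
  dplyr::filter(Age >= 16)

# excluded + retained rows should add up to the original number of rows
toy_removed + nrow(toy_clean) == n_before
```

Note that filter() drops rows where the condition evaluates to NA, so if Age contained missing values, neither filter would keep those rows and this check would return FALSE.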

Task 7

In the descriptives code chunk, write code that creates:

Task 7.1

A tibble of descriptive statistics (mean, standard deviation, minimum, maximum) for the Age variable. It should look like this:

# A tibble: 1 x 4
   mean    sd   min   max
  <dbl> <dbl> <dbl> <dbl>
1  22.3  6.20    18    44

We learnt how to summarise data in practical 9 last term.

age_desc <- data %>%
  dplyr::summarise(mean = mean(Age),
                   sd = sd(Age),
                   min = min(Age),
                   max = max(Age))
# let's see
age_desc

Task 7.2

A tibble with Ns, %s, and age (mean and SD) breakdown by categories of the Gender variable. Something like this:

# A tibble: 2 x 5
  Gender     n  perc age_mean age_sd
* <chr>  <int> <dbl>    <dbl>  <dbl>
1 Female   125  61.9     21.8   5.43
2 Male      77  38.1     23.1   7.24

PAAS practical 10 will come in handy here too.

The perc column requires a little bit of thinking. Think about how you can use the n column and the number of rows in the data to derive the percentage. If you get stuck, check out the solution to last term’s practical 10.

gender_desc <- data %>%
  dplyr::group_by(Gender) %>%
  dplyr::summarise(n = dplyr::n(),
                   perc = n / nrow(data) * 100,
                   age_mean = mean(Age),
                   age_sd = sd(Age))
# let's see
gender_desc
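The percentage logic can also be written without referring to the data object by name: once you have the group counts, each count divided by the sum of all counts gives the proportion. The sketch below shows this on a made-up tibble called toy (an assumption for illustration only); it is one possible alternative, not the worksheet's required solution.

```r
library(dplyr)

# hypothetical toy data, invented for illustration only
toy <- tibble::tibble(
  Gender = c("Female", "Female", "Female", "Male"),
  Age = c(20, 22, 24, 30)
)

# count() gives one row per category with its n;
# n / sum(n) turns each count into a proportion of the total
toy_desc <- toy %>%
  dplyr::count(Gender) %>%
  dplyr::mutate(perc = n / sum(n) * 100)
toy_desc
```

With three of the four toy rows being "Female", the perc column comes out as 75 and 25.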

Part 2: Tables and Visualisations

Now that we’ve inspected, cleaned, and summarised our data, let’s present it to the revered reader!

Task 8

Edit the code in the table_1 code chunk at the bottom of the document, giving it your tibble with the age breakdown by gender, to create a nicely formatted table in your document. Make sure the table shows values to 2 decimal places.

Table 1: *Descriptive statistics by categories of gender*
Gender	N	%	M_age	SD_Age
Female	125	61.88	21.82	5.43
Male	77	38.12	23.13	7.24

# provide tibble to push to kbl() and fill in missing column names
gender_desc %>%
  kableExtra::kbl(col.names = c("Gender", "*N*", "%", "*M*~age~", "*SD*~Age~"),
        caption = "*Descriptive statistics by categories of gender*",
        digits = 2) %>%
  kableExtra::kable_styling()

Task 9

Complete the code for the histogram of age and bar chart of gender in the prepare-plots chunk. They should look like this (feel free to choose the colours you like):

# you can store plots inside objects too!
age_hist <- data %>%
  ggplot2::ggplot(aes(x = Age)) +
  geom_histogram(fill = "white", colour = "black") +
  labs(x = "Participants' age in years", y = "N") +
  cowplot::theme_cowplot()

gender_bar <- data %>%
  ggplot2::ggplot(aes(x = Gender)) +
  geom_bar(fill = "seagreen") +
  labs(x = "Participants' gender", y = "N") +
  cowplot::theme_cowplot()

Since the plots are assigned to objects using the <- operator, they will not be shown when you run the code in the chunk. To see what they look like you need to either only run the part of the code to the right of the <- or run the chunk and then type the name of the object into the console.
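As a minimal sketch of this behaviour, the block below builds a histogram from a made-up data frame called toy (invented for illustration; the bins value is also an arbitrary choice) and then prints it by typing the object's name.

```r
library(ggplot2)

# hypothetical toy data, invented for illustration only
toy <- data.frame(Age = c(18, 19, 19, 21, 25, 30))

# assigning the plot to an object draws nothing
toy_hist <- ggplot(toy, aes(x = Age)) +
  geom_histogram(bins = 5, fill = "white", colour = "black")

# typing the object's name (or calling print() on it) draws the plot
toy_hist
```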

Task 10

Complete the code for the age_by_condition_gender plot so that it looks like this:

age_by_condition_gender <- data %>%
  dplyr::group_by(Gender, Harm, Humanness) %>%
  dplyr::summarise(m_age = mean(Age)) %>%
  ggplot2::ggplot(aes(x = Gender, y = m_age, colour = Harm)) +
  geom_point(size = 3) +
  facet_wrap(~Humanness) +
  theme_bw()

Notice how the facet_wrap() function is used to create two plots faceted by the Humanness variable.

Task 11

Complete the Write-up section of the .Rmd file.

4.1 Method

4.1.1 Participants

The study was conducted on a sample of 202 volunteers (M_age = 22.32, SD_age = 6.20). The data were collected anonymously online. Data from 15 participants were excluded due to unlikely values of age. Table 1 shows the distribution of gender as well as an age breakdown by individual gender categories.

[…]

4.1.3 Procedure

Participants were presented with pictures of the avatars. Their task was to evaluate the degree to which mental capacities (experience, agency, consciousness, and pain) could be attributed to the faces and the extent to which the presented avatars elicited empathic reactions. Every page of the survey consisted of the respective face displayed above a 7-point, Likert-type response scale (1 = “Strongly disagree” to 7 = “Strongly agree”). The survey was delivered via EFS Survey (Version 9.0, QuestBack AG, Germany). The experiment followed a 2 (Harm: harmed vs. control) × 2 (Robotization: human vs. robotic) between-subjects factorial design.

We lifted the above from the original paper. You should not!
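Values in a write-up are usually reported to two decimal places rather than at full precision. The sketch below shows one way this could be done with round() and format(); the full-precision mean and SD of age from this sample (22.3168317 and 6.1973043) are hard-coded under the assumed names m_age and sd_age so that the block runs on its own. In your .Rmd they would come from the summary tibble you created earlier.

```r
# full-precision values, hard-coded for illustration only
m_age <- 22.3168317
sd_age <- 6.1973043

# round() gives the value to 2 decimal places
m_rounded <- round(m_age, 2)
m_rounded

# format() with nsmall = 2 keeps the trailing zero, returning a character string
sd_formatted <- format(round(sd_age, 2), nsmall = 2)
sd_formatted
```

The same calls can be placed in inline R code in the text of the document, so the reported numbers update automatically if the data change.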

Task 12

In the print-plots chunk, put the name of the object that contains the age_by_condition_gender plot so that it gets printed out in the section of the document corresponding to where the chunk is. Don’t forget the figure caption!

age_by_condition_gender

Figure 1: Mean age by gender and levels of the Humanness and Harm variables.

Task 13

Knit (generate) the document from your R Markdown file and rejoice in its beauty.

That’s all for this week. You’ve done quite a lot today. You learnt why it’s important to audit your data. You practised creating pipelines, grouping, and summarising data. You looked at the breakdown of data by levels of a variable, created tables of basic descriptive statistics, and visualised the relationships between variables with some very pretty figures. Finally, you learnt how to write up the Participants and Procedure sections of a paper.

Well done!