Practical 03
This worksheet builds on the guided part of the practical that preceded it. You can download the R script from this live-coding session.
Last time, you were given a nice and tidy data set from a paper by Swiderska and Küster (2018) about humans’ ability to empathise with robots. However, as is the case only too often, when we get our hands on a data set, it can be in quite a state. For that reason, it’s important to learn how to tidy up messy data with typos, missing values, and other sorts of imperfections.
In this practical, we’ll be using an untidied version of the same dataset we worked with last time. You can find the data at https://and.netlify.app/docs/harm_data_messy.csv (you don’t need to download this file, just read it straight into RStudio).
Just like every week, we want to work in a new R project. If you haven’t done it yet, create a week_03 R project inside of your module folder and, within it, create the standard folder structure, just like last week.
Download this R Markdown file and save/move it into your r_docs folder.
Use the R Markdown file you downloaded in task 2 to complete the following tasks.
In the setup code chunk of the .Rmd file, write the code to load the tidyverse package you will need to complete this practical.
In the read-in code chunk, write the code to read the data into RStudio.
All you need to do is copy the URL (address) of the file as a "character string" into the readr::read_csv() function and assign its output to an object.
# complete the line to read in the data
data <- read_csv("https://and.netlify.app/docs/harm_data_messy.csv")
In the inspect chunk, write code that does the following:
Ask R to give you the number of columns, the number of rows, the column names of the dataset, and a rough summary of each variable, just like we did last time.
The names(), ncol(), and nrow() functions are useful here.
ncol(data)
[1] 15
nrow(data)
[1] 217
names(data)
[1] "ID" "Condition" "Humanness"
[4] "Harm" "Gender" "Age"
[7] "AgeNew" "Country" "Ethnicity"
[10] "Pain" "Experience" "Agency"
[13] "Consciousness" "Empathy" "Attractiveness"
summary(data)
       ID          Condition      Humanness         Harm
 Min.   : 60.0   Min.   :1.00   Min.   :1.000   Length:217
 1st Qu.:200.0   1st Qu.:1.00   1st Qu.:1.000   Class :character
 Median :359.0   Median :3.00   Median :2.000   Mode  :character
 Mean   :324.6   Mean   :2.53   Mean   :1.507
 3rd Qu.:433.0   3rd Qu.:4.00   3rd Qu.:2.000
 Max.   :558.0   Max.   :4.00   Max.   :2.000

     Gender          Age                AgeNew         Country
 Min.   :1.000   Length:217         Min.   : 1.00   Min.   :  7.0
 1st Qu.:1.000   Class :character   1st Qu.:18.00   1st Qu.:180.0
 Median :2.000   Mode  :character   Median :19.00   Median :194.0
 Mean   :1.619                      Mean   :20.84   Mean   :164.6
 3rd Qu.:2.000                      3rd Qu.:23.00   3rd Qu.:194.0
 Max.   :2.000                      Max.   :44.00   Max.   :206.0
 NA's   :2

   Ethnicity          Pain         Experience        Agency
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000
 1st Qu.:5.000   1st Qu.:3.250   1st Qu.:3.143   1st Qu.:4.000
 Median :6.000   Median :6.000   Median :5.071   Median :4.857
 Mean   :5.318   Mean   :4.902   Mean   :4.636   Mean   :4.656
 3rd Qu.:6.000   3rd Qu.:7.000   3rd Qu.:6.429   3rd Qu.:5.714
 Max.   :7.000   Max.   :8.000   Max.   :7.000   Max.   :7.000
 NA's   :3       NA's   :3       NA's   :3       NA's   :3

 Consciousness      Empathy      Attractiveness
 Min.   :1.000   Min.   :1.000   Min.   :1.000
 1st Qu.:3.500   1st Qu.:3.750   1st Qu.:1.000
 Median :4.750   Median :4.857   Median :4.000
 Mean   :4.507   Mean   :4.589   Mean   :3.853
 3rd Qu.:6.000   3rd Qu.:5.857   3rd Qu.:6.000
 Max.   :7.000   Max.   :7.000   Max.   :7.000
 NA's   :3       NA's   :3
You may not remember, but this dataset has even more columns than last time. Let’s get rid of a few.
Discard the "AgeNew", "Country", and "Ethnicity" variables from the dataset.
You want to be selecting columns. The function also works with a minus sign for removing columns. If you don’t know how to use the function, ask the Internet.
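One way to approach this, sketched on a small made-up tibble: the object `toy` and all its values are invented for illustration and are not the practical's data, and the sketch assumes the function the hint alludes to is dplyr::select().

```r
library(dplyr)

# toy stand-in for the real dataset; all values are invented
toy <- tibble(
  ID = c(1, 2, 3),
  Age = c("21", "34", "19"),
  AgeNew = c(21, 34, 19),
  Country = c(194, 180, 7),
  Ethnicity = c(6, 5, 1)
)

# a minus sign in front of a column name drops that column
toy <- toy %>%
  dplyr::select(-AgeNew, -Country, -Ethnicity)

# check which columns are left
names(toy)
```

The same pattern should carry over to the real data object once you swap `toy` for it.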
Have a good look at the summary of the data. Use this code book to figure out the answer to the following questions.
Variable name | Level | Description |
---|---|---|
ID | nominal | Unique participant ID |
Condition | nominal | Experimental condition; combination of Humanness and Harm |
Humanness | nominal | Avatar type: 1 = Human, 2 = Robot |
Harm | nominal | Avatar state: 1 = Unharmed, 2 = Harmed |
Gender | nominal | Participant gender: 1 = Male, 2 = Female |
Age | continuous | Participant age in years |
Pain | ordinal | Does avatar have capacity to feel pain? 1 = Strongly disagree, 7 = Strongly agree |
Experience | ordinal | Does avatar have experience? 1 = Strongly disagree, 7 = Strongly agree |
Agency | ordinal | Does avatar have agency? 1 = Strongly disagree, 7 = Strongly agree |
Consciousness | ordinal | Does avatar have consciousness? 1 = Strongly disagree, 7 = Strongly agree |
Empathy | ordinal | Level of empathy for avatar: 1-7; Higher number means more empathy felt |
Attractiveness | ordinal | Attractiveness of avatar: 1-7; Higher number means more attractive |
There is no R involved in this task: it’s just you, the code book, your eyes, and your brain. Look carefully!
Are there any variables that have different classes than they should have?
The biggest issue is that Age has been interpreted as "character" instead of as numeric. We don’t know why yet; we’ll have to figure it out.
Other than that, Condition, Humanness, and Gender are numeric because they have not been labelled. This is quite a common occurrence: categorical variables may not come with labels but just as numbers. We’ll take care of that later.
Are there any values in the numeric variables that make no sense, given the information in the code book?
The last six variables are supposed to have been measured on a 7-point scale each, yet Pain has a maximum value of 8. Mischief is afoot, my friends!
Are there any missing values (NAs) in the dataset? If so, in which variables and how many?
There are two NAs in Gender and three each in the Pain, Experience, Agency, Consciousness, and Empathy variables.
The same number of NAs may be indicative of participants not completing the questionnaire but, to know for certain, we need to look closer.
Are the levels of the Harm variable coded correctly?
You can create a simple table of the variable to see all its levels.
There seem to be a few typos there:
data$Harm %>% table()
.
Hamerd Harmed Unharmed
4 108 105
Now that you are relatively familiar with the data, let’s start cleaning it!
Insert a data-cleaning code chunk in the .Rmd file and use it to recode the Humanness, Harm, and Gender variables as per the code book.
Numeric values inside dplyr::recode() must be surrounded with backticks.
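A minimal sketch of what such recoding could look like, using an invented tibble `toy` rather than the real data. It assumes that Humanness and Gender hold the numeric codes from the code book and that Harm is already text and only needs its typo fixed, as the table of Harm above suggests.

```r
library(dplyr)

# toy stand-in for the real dataset; all values are invented
toy <- tibble(
  Humanness = c(1, 2, 2),
  Harm = c("Harmed", "Hamerd", "Unharmed"),
  Gender = c(2, 1, NA)
)

toy <- toy %>%
  dplyr::mutate(
    # numeric codes go in backticks, labels in quotes
    Humanness = dplyr::recode(Humanness, `1` = "Human", `2` = "Robot"),
    # character values that are not mentioned stay as they are
    Harm = dplyr::recode(Harm, "Hamerd" = "Harmed"),
    # missing values stay missing
    Gender = dplyr::recode(Gender, `1` = "Male", `2` = "Female")
  )

toy
```

Note that recent versions of dplyr mark recode() as superseded; it still works, which is why the sketch sticks with the function named in the hint.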
OK, let’s now deal with Pain.
In the same chunk, write code that modifies the Pain variable so that any value above 7 is declared missing. But before you do, keep a record of how many such cases there were (just like we did last week)!
This is the kind of situation where replace() comes in handy.
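Here is a hedged sketch of the count-then-replace pattern on invented values. The names `toy` and `pain_removed` are made up for this illustration; pick whatever names suit your own script.

```r
library(dplyr)

# toy stand-in for the Pain column; values are invented
toy <- tibble(Pain = c(3, 8, 7, NA, 8))

# record how many impossible values there are before touching them;
# na.rm = TRUE because the column already contains NAs
pain_removed <- sum(toy$Pain > 7, na.rm = TRUE)

# replace() takes the vector, the positions to change, and the new value;
# which() returns only the positions where the condition is TRUE
toy <- toy %>%
  dplyr::mutate(Pain = replace(Pain, which(Pain > 7), NA))

pain_removed
toy$Pain
```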
Time to deal with that pesky Age now. Let’s see what’s up with it.
Can you figure out what caused Age to be read in as "character" and not numeric? You can just look at the variable by printing it out but, for extra points, try to identify the problem with code.
You can filter those values that turn to missing if you convert the variable into numeric.
as.numeric() is the function that does that.
data %>%
  dplyr::filter(is.na(as.numeric(Age)))
# A tibble: 1 x 12
ID Condition Humanness Harm Gender Age Pain Experience Agency
<dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 376 1 Human Unha~ Female 18 y~ 6 6.14 5.57
# ... with 3 more variables: Consciousness <dbl>, Empathy <dbl>,
# Attractiveness <dbl>
Now that you know what causes the issue, let’s fix it.
First, keep a record of how many cases you’re about to manipulate.
## age_char seems like an appropriate name...
age_char <- data %>%
  dplyr::filter(is.na(as.numeric(Age))) %>%
  nrow()
Recode the variable to get rid of this value.
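The sketch below shows one possible way to do this on an invented tibble. The offending entry is truncated in the printout above, so the string "nineteen" and its replacement "19" are purely hypothetical stand-ins; substitute the value you actually find in the data and the age it stands for.

```r
library(dplyr)

# toy stand-in for the Age column; "nineteen" is a made-up offending entry
toy <- tibble(Age = c("21", "nineteen", "34"))

# swap the text entry for the number it stands for (still stored as text)
toy <- toy %>%
  dplyr::mutate(Age = dplyr::recode(Age, "nineteen" = "19"))

# check: no value should turn into NA on conversion any more
sum(is.na(as.numeric(toy$Age)))
```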
Checking whether your code does what it should is important. Make sure what you did worked!
You can just do whatever you did to find the offending value of Age in the first place.
## now should be an empty tibble (0 rows)
data %>%
  dplyr::filter(is.na(as.numeric(Age)))
# A tibble: 0 x 12
# ... with 12 variables: ID <dbl>, Condition <dbl>, Humanness <chr>,
# Harm <chr>, Gender <chr>, Age <chr>, Pain <dbl>,
# Experience <dbl>, Agency <dbl>, Consciousness <dbl>,
# Empathy <dbl>, Attractiveness <dbl>
We fixed the odd value of Age but the variable is still "character". Let’s turn it into numeric.
data <- data %>%
  dplyr::mutate(Age = as.numeric(Age))
Now that Age is a numeric variable, let’s have a look if there are any unusual/unacceptable values. If so, make a note of how many and then remove the corresponding cases from the data set. Let’s consider everyone younger than 16 or older than 100 out of the acceptable range.
We went through a very similar procedure last time, so feel free to check.
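As a rough sketch of the procedure on invented values (`toy` and `age_removed` are names made up for this example): count the out-of-range cases first, then keep only the acceptable ones. Keeping rows where Age is missing is an assumption of this sketch, not something the task prescribes.

```r
library(dplyr)

# toy stand-in for the real dataset; all values are invented
toy <- tibble(ID = 1:5, Age = c(21, 3, 45, 150, NA))

# record how many cases fall outside the accepted 16-100 range
age_removed <- toy %>%
  dplyr::filter(Age < 16 | Age > 100) %>%
  nrow()

# keep acceptable ages; is.na(Age) keeps rows with a missing Age
# (an assumption), because filter() drops rows where the condition is NA
toy <- toy %>%
  dplyr::filter(is.na(Age) | (Age >= 16 & Age <= 100))

age_removed
toy
```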
Save in your environment the information about how many participants had missing values in the Gender variable.
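A short sketch of one way to store that count, again on invented values; `gender_missing` is a hypothetical object name.

```r
library(dplyr)

# toy stand-in for the Gender column; values are invented
toy <- tibble(Gender = c("Female", NA, "Male", NA))

# keep only the rows with a missing Gender and count them
gender_missing <- toy %>%
  dplyr::filter(is.na(Gender)) %>%
  nrow()

gender_missing
```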
The last data cleaning task is quite hard to code so don’t beat yourself up if you can’t do it!
There seem to be a few participants who didn’t bother completing the study. Can you identify how many they were and remove their data from the dataset?
Let’s say you want to remove everyone who has at least 4 missing values in their data.
First, it’s best to create a column, let’s say NA_count, that contains the number of NAs for each participant.
The rowSums() function calculates, well, sums for each row. We want to use it to count NAs. To do that, we need to know whether a value is an NA or not. You already know the function that does that.
The function, however, will only work with a single variable. To use the same function on all variables/columns of your dataset, you can use the dplyr::across() function. Check out its documentation to see how it’s used.
Once you have the NA_count column, discarding rows whose value in this column reaches our criterion of 4 NAs is fairly trivial.
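The hints above can be put together roughly as follows. This is a sketch on an invented tibble, not the one true solution: `toy` and `incomplete` are made-up names, and dropping the helper column at the end is a choice of this example.

```r
library(dplyr)

# toy stand-in for the real dataset; all values are invented
toy <- tibble(
  ID = 1:3,
  Pain = c(5, NA, 2),
  Experience = c(4, NA, NA),
  Agency = c(6, NA, 3),
  Empathy = c(5, NA, 4)
)

# across(everything()) hands all columns to is.na();
# rowSums() then counts the TRUEs in each row
toy <- toy %>%
  dplyr::mutate(NA_count = rowSums(is.na(dplyr::across(dplyr::everything()))))

# record how many participants have at least 4 missing values
incomplete <- toy %>%
  dplyr::filter(NA_count >= 4) %>%
  nrow()

# keep everyone below the criterion and drop the helper column
toy <- toy %>%
  dplyr::filter(NA_count < 4) %>%
  dplyr::select(-NA_count)

incomplete
toy
```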
In the descriptives code chunk, write code that creates a tibble with Ns, %s, and age (mean and SD) breakdown by categories of the Gender variable. Something like this:
# A tibble: 3 x 5
Gender n perc age_mean age_sd
* <fct> <int> <dbl> <dbl> <dbl>
1 Female 131 61.8 23.6 8.13
2 Male 79 37.3 24.3 8.59
3 (Missing) 2 0.943 23 4.24
You did this last time!
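gender_desc <- data %>%
  # turn NAs into an explicit "(Missing)" level so they get their own row
  # (forcats >= 1.0.0 deprecates this in favour of fct_na_value_to_level())
  dplyr::mutate(Gender = forcats::fct_explicit_na(as.factor(Gender))) %>%
  dplyr::group_by(Gender) %>%
  dplyr::summarise(n = dplyr::n(),
                   perc = dplyr::n()/nrow(data) * 100,
                   age_mean = mean(Age, na.rm = TRUE),
                   age_sd = sd(Age, na.rm = TRUE))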
# call the object's name to print it out
gender_desc
Knit the document from your R Markdown file and make sure everything looks OK.
And that’s it for the practical. There’s an extra task below for the keen beans so if you still have it in you, feel free to tackle it. If not, you can do it once you’ve rested.
Today you learnt how to thoroughly inspect messy data, paying close attention to classes and ranges of your columns. You also practised recoding categorical variables, finding and fixing entry errors, and dealing with improbable or impossible values. You also learnt how to think about missing data and how to handle them.
Complete the Participants section in Write-up (in the .Rmd file) and copy over the rest from last time.
The study was conducted on a sample of 217 volunteers, with data collected anonymously on-line. Data from two participants were excluded due to age values outside of the accepted range of 16-100. Data from a further three participants were removed due to a high number of missing values. Finally, two observations of the “Pain” variable were declared missing due to impossible values. This resulted in a sample size of N = 212 (Mage = 23.84, SDage = 8.26; 61.8% female, 37.3% male).
[…]
Well done!