Analysing Data: Data cleaning and treating NAs

Guided part

This worksheet builds on the guided part of the practical that preceded it. You can download the R script from this live-coding session.

Last time, you were given a nice and tidy data set from a paper by Swiderska and Küster (2018) about humans’ ability to empathise with robots. However, as is the case only too often, when we get our hands on a data set, it can be in quite a state. For that reason, it’s important to learn how to tidy up messy data with typos, missing values, and other sorts of imperfections.

In this practical, we’ll be using an untidied version of the same dataset we worked with last time. You can find the data at https://and.netlify.app/docs/harm_data_messy.csv (you don’t need to download this file, just read it straight into RStudio).

Setup

Task 1

Just like every week, we want to work in a new R project. If you haven’t done it yet, create a week_03 R project inside of your module folder and within it, create the standard folder structure, just like last week.

Task 2

Download this R Markdown file and save/move it into your r_docs folder.

Use the R Markdown file you downloaded in Task 2 to complete the following tasks.

Task 3

In the setup code chunk of the .Rmd file, write the code to load the tidyverse package you will need to complete this practical.

# add library() commands to load all packages you need for this document.

library(tidyverse)

Task 4

In the read-in code chunk, write the code to read the data into RStudio.

All you need to do is copy the URL (address) of the file as a "character string" into the readr::read_csv() function and assign its output to an object.

# complete the line to read in the data

data <- read_csv("https://and.netlify.app/docs/harm_data_messy.csv")

Inspecting data

Task 5

In the inspect chunk, write code that does the following:

Task 5.1

Ask R to give you the number of columns, the number of rows, the column names of the dataset, and a rough summary of each variable, just like we did last time.

The names(), ncol(), nrow(), and summary() functions are useful here.

ncol(data)

[1] 15

sample_size <- nrow(data) # we'll use this one for write-up
names(data)

 [1] "ID"             "Condition"      "Humanness"     
 [4] "Harm"           "Gender"         "Age"           
 [7] "AgeNew"         "Country"        "Ethnicity"     
[10] "Pain"           "Experience"     "Agency"        
[13] "Consciousness"  "Empathy"        "Attractiveness"

summary(data)

       ID          Condition      Humanness         Harm          
 Min.   : 60.0   Min.   :1.00   Min.   :1.000   Length:217        
 1st Qu.:200.0   1st Qu.:1.00   1st Qu.:1.000   Class :character  
 Median :359.0   Median :3.00   Median :2.000   Mode  :character  
 Mean   :324.6   Mean   :2.53   Mean   :1.507                     
 3rd Qu.:433.0   3rd Qu.:4.00   3rd Qu.:2.000                     
 Max.   :558.0   Max.   :4.00   Max.   :2.000                     
                                                                  
     Gender          Age                AgeNew         Country     
 Min.   :1.000   Length:217         Min.   : 1.00   Min.   :  7.0  
 1st Qu.:1.000   Class :character   1st Qu.:18.00   1st Qu.:180.0  
 Median :2.000   Mode  :character   Median :19.00   Median :194.0  
 Mean   :1.619                      Mean   :20.84   Mean   :164.6  
 3rd Qu.:2.000                      3rd Qu.:23.00   3rd Qu.:194.0  
 Max.   :2.000                      Max.   :44.00   Max.   :206.0  
 NA's   :2                                                         
   Ethnicity          Pain         Experience        Agency     
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:5.000   1st Qu.:3.250   1st Qu.:3.143   1st Qu.:4.000  
 Median :6.000   Median :6.000   Median :5.071   Median :4.857  
 Mean   :5.318   Mean   :4.902   Mean   :4.636   Mean   :4.656  
 3rd Qu.:6.000   3rd Qu.:7.000   3rd Qu.:6.429   3rd Qu.:5.714  
 Max.   :7.000   Max.   :8.000   Max.   :7.000   Max.   :7.000  
 NA's   :3       NA's   :3       NA's   :3       NA's   :3      
 Consciousness      Empathy      Attractiveness 
 Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:3.500   1st Qu.:3.750   1st Qu.:1.000  
 Median :4.750   Median :4.857   Median :4.000  
 Mean   :4.507   Mean   :4.589   Mean   :3.853  
 3rd Qu.:6.000   3rd Qu.:5.857   3rd Qu.:6.000  
 Max.   :7.000   Max.   :7.000   Max.   :7.000  
 NA's   :3       NA's   :3

You may not remember but this is even more columns than last time. Let’s get rid of a few.

Task 5.2

Discard the "AgeNew", "Country", and "Ethnicity" variables from the dataset.

You want to be selecting columns. The function also works with a minus sign for removing columns. If you don’t know how to use the function, ask the Internet.

data <- data %>%
  dplyr::select(-c(AgeNew, Country, Ethnicity))
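If you want to convince yourself that the minus sign really drops columns, here is a minimal sketch on a toy tibble. The values are made up for illustration; only the column names echo the real data.

```r
library(dplyr)

# toy data with made-up values
toy <- tibble(
  ID = 1:2,
  Age = c("18", "21"),
  AgeNew = c(18, 21),
  Country = c(194, 7)
)

# drop the two columns we do not need; everything else is kept
toy_small <- toy %>%
  select(-c(AgeNew, Country))

names(toy_small)
```

Checking names() before and after is a quick way to make sure you removed what you meant to remove and nothing else.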

Task 6

Have a good look at the summary of the data. Use this code book to figure out the answer to the following questions.

Table 1: Data set code book
Variable name	Level	Description
ID	nominal	Unique participant ID
Condition	nominal	Experimental condition; combination of Humanness and Harm
Humanness	nominal	Avatar type: 1 = Human, 2 = Robot
Harm	nominal	Avatar state: 1 = Unharmed, 2 = Harmed
Gender	nominal	Participant gender: 1 = Male, 2 = Female
Age	continuous	Participant age in years
Pain	ordinal	Does avatar have capacity to feel pain? 1 = Strongly disagree, 7 = Strongly agree
Experience	ordinal	Does avatar have experience? 1 = Strongly disagree, 7 = Strongly agree
Agency	ordinal	Does avatar have agency? 1 = Strongly disagree, 7 = Strongly agree
Consciousness	ordinal	Does avatar have consciousness? 1 = Strongly disagree, 7 = Strongly agree
Empathy	ordinal	Level of empathy for avatar: 1-7; Higher number means more empathy felt
Attractiveness	ordinal	Attractiveness of avatar: 1-7; Higher number means more attractive

There is no R involved in this task: it’s just you, the code book, your eyes, and your brain. Look carefully!

Task 6.1

Are there any variables that have different classes than they should have?

The biggest issue is that Age has been interpreted as "character" instead of as numeric. We don’t know why yet; we’ll have to figure it out.

Other than that, Condition, Humanness, and Gender are numeric because they have not been labelled. This is quite a common occurrence: categorical variables may not come with labels but just as numbers. We’ll take care of that later.
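The task itself is meant to be done by eye, but once you are done you could double-check the classes with code. This is only a sketch on a toy tibble with made-up values, not the real data:

```r
library(dplyr)

# toy data (made-up values); one Age entry contains text,
# so the whole column ends up stored as character
toy <- tibble(
  ID = c(1, 2, 3),
  Humanness = c(1, 2, 1),
  Age = c("18", "21", "18 years")
)

# class of every column in one go
col_classes <- sapply(toy, class)
col_classes
```

Any column whose class does not match the level given in the code book deserves a closer look.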

Task 6.2

Are there any values in the numeric variables that make no sense, given the information in the code book?

The last six variables are supposed to have been measured on a 7-point scale each, yet Pain has a maximum value of 8. Mischief is afoot, my friends!

Task 6.3

Are there any missing values (NAs) in the dataset? If so, in which variables and how many?

There are two NAs in Gender and three each in the Pain, Experience, Agency, Consciousness, and Empathy variables.

The same number of NAs may be indicative of participants not completing the questionnaire but to know for certain, we need to look closer.

Task 6.4

Are the levels of the Harm variable coded correctly?

You can create a simple table of the variable to see all its levels.

There seem to be a few typos there:

data$Harm %>% table()

.
  Hamerd   Harmed Unharmed 
       4      108      105
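An alternative to table() is dplyr::count(), which returns a tibble and can sort the levels by frequency, so rare levels (often typos) sink to the bottom. A minimal sketch on made-up values:

```r
library(dplyr)

# toy data with one misspelt level
toy <- tibble(Harm = c("Harmed", "Unharmed", "Hamerd", "Harmed", "Unharmed"))

# one row per level, most frequent first
harm_counts <- toy %>%
  count(Harm, sort = TRUE)

harm_counts
```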

Cleaning

Now that you are relatively familiar with the data, let’s start cleaning it!

Task 7

Insert a data-cleaning code chunk in the .Rmd file and use it to recode the Humanness, Harm, and Gender variables as per the code book.

Numeric values inside dplyr::recode() must be surrounded with backticks.

data <- data %>%
  dplyr::mutate(Humanness = dplyr::recode(Humanness, `1` = "Human", `2` = "Robot"),
                Harm = dplyr::recode(Harm, "Hamerd" = "Harmed"),
                Gender = dplyr::recode(Gender, `1` = "Male", `2` = "Female"))
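To see the two flavours of recode() in isolation, here is a small sketch on made-up vectors. Numeric values need backticks, while character values can be quoted; character values that are not mentioned are left as they are. (In recent dplyr releases recode() is marked as superseded, but it still works.)

```r
library(dplyr)

# numeric codes become labels; note the backticks around 1 and 2
humanness <- c(1, 2, 2, 1)
humanness_lab <- recode(humanness, `1` = "Human", `2` = "Robot")

# character values: only the typo is changed, the rest stays untouched
harm <- c("Harmed", "Hamerd", "Unharmed")
harm_fixed <- recode(harm, "Hamerd" = "Harmed")

humanness_lab
harm_fixed
```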

OK, let’s now deal with Pain.

Task 8

In the same chunk, write code that modifies the Pain variable so that any value above 7 is declared missing. But before you do, keep a record of how many such cases there were (just like we did last week)!

This is the kind of situation where replace() comes in handy.

# how many cases of Pain > 7 are there?
# pain_impossible is not a title of a metal song, AFAIK
pain_impossible <- data %>%
  dplyr::filter(Pain > 7) %>%
  nrow()

# declare those cases missing
data <- data %>%
  dplyr::mutate(Pain = replace(Pain, Pain > 7, NA))
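If you are unsure what replace() does with values that are already missing, this minimal sketch on a made-up vector may help. It uses base R only.

```r
# made-up ratings: two impossible values (8 and 9) and one existing NA
pain <- c(3, 8, 7, NA, 9)

# count the impossible values; na.rm = TRUE ignores the existing NA
n_impossible <- sum(pain > 7, na.rm = TRUE)

# replace everything above 7 with NA; the existing NA stays NA
pain_clean <- replace(pain, pain > 7, NA)

n_impossible
pain_clean
```

Counting first and replacing second matters: once the values are NA, you can no longer tell them apart from values that were missing to begin with.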

Time to deal with that pesky Age now. Let’s see what’s up with it.

Task 9

Can you figure out what caused Age to be read in as "character" and not numeric? You can just look at the variable by printing it out but, for extra points, try to identify the problem with code.

You can filter those values that turn to missing if you convert the variable into numeric.

as.numeric() is the function that does that.

data %>%
  dplyr::filter(is.na(as.numeric(Age)))

# A tibble: 1 x 12
     ID Condition Humanness Harm  Gender Age    Pain Experience Agency
  <dbl>     <dbl> <chr>     <chr> <chr>  <chr> <dbl>      <dbl>  <dbl>
1   376         1 Human     Unha~ Female 18 y~     6       6.14   5.57
# ... with 3 more variables: Consciousness <dbl>, Empathy <dbl>,
#   Attractiveness <dbl>
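Here is the same trick on a made-up character vector, as a sketch of why it works: as.numeric() cannot make sense of text such as "18 years", so it returns NA (with a warning about NAs introduced by coercion, which is expected here and silenced below).

```r
# made-up ages stored as text, one of them with a stray unit
age <- c("18", "21", "18 years", "35")

# anything that is not a clean number becomes NA
age_num <- suppressWarnings(as.numeric(age))

# values that turned into NA but were not missing to begin with
problem_values <- age[is.na(age_num) & !is.na(age)]

problem_values
```

The extra !is.na(age) condition separates values that failed to convert from values that were genuinely missing in the first place.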

Task 10

Now that you know what causes the issue, let’s fix it.

Task 10.1

First, keep a record of how many cases you’re about to manipulate.

## age_char seems like an appropriate name...
age_char <- data %>%
  dplyr::filter(is.na(as.numeric(Age))) %>%
  nrow()

Task 10.2

Recode the variable to get rid of this value.

data <- data %>%
  dplyr::mutate(Age = dplyr::recode(Age, "18 years" = "18"))

Task 10.3

Checking whether your code does what it should is important. Make sure what you did worked!

You can just do whatever you did to find the offending value of Age in the first place.

## now should be an empty tibble (0 rows)
data %>%
  dplyr::filter(is.na(as.numeric(Age)))

# A tibble: 0 x 12
# ... with 12 variables: ID <dbl>, Condition <dbl>, Humanness <chr>,
#   Harm <chr>, Gender <chr>, Age <chr>, Pain <dbl>,
#   Experience <dbl>, Agency <dbl>, Consciousness <dbl>,
#   Empathy <dbl>, Attractiveness <dbl>

Task 10.4

We fixed the odd value of Age but the variable is still "character". Let’s turn it into numeric.

data <- data %>%
  dplyr::mutate(Age = as.numeric(Age))

Task 11

Now that Age is a numeric variable, let’s check whether there are any unusual/unacceptable values. If so, make a note of how many and then remove the corresponding cases from the data set. Let’s consider everyone younger than 16 or older than 100 to be outside the acceptable range.

We went through a very similar procedure last time, feel free to check.

## how many bad Age values are there (more than 100, less than 16)
bad_age <- data %>%
  dplyr::filter(Age > 100 | Age < 16) %>%
  nrow()

## remove them
data <- data %>%
  # make sure filtering logic is correct!
  dplyr::filter(Age <= 100 & Age >= 16)
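One thing worth knowing about this kind of filter: dplyr::filter() only keeps rows where the condition is TRUE, so a row with a missing Age would be dropped as well. Whether the real data contain missing ages is not something this sketch assumes; it uses made-up values to show the behaviour.

```r
library(dplyr)

# toy data: two impossible ages, one missing age
toy <- tibble(ID = 1:5, Age = c(18, 3, 150, NA, 44))

# number of out-of-range ages (the NA row is not counted)
bad_age <- toy %>%
  filter(Age > 100 | Age < 16) %>%
  nrow()

# keeps only rows where the condition is TRUE, so the NA row goes too
kept <- toy %>%
  filter(Age <= 100 & Age >= 16)

# keep missing ages explicitly if you do not want to lose them
kept_with_na <- toy %>%
  filter(is.na(Age) | (Age <= 100 & Age >= 16))

bad_age
kept$ID
kept_with_na$ID
```

If the number of rows you lose is larger than the number of bad values you counted, missing values are the first thing to check.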

Task 12

Save in your environment the information about how many participants had missing values in the Gender variable.

na_gender <- data %>%
  dplyr::filter(is.na(Gender)) %>%
  nrow()

The last data cleaning task is quite hard to code, so don’t beat yourself up if you can’t do it!

Task 13

There seem to be a few participants who didn’t bother completing the study. Can you identify how many they were and remove their data from the dataset?

Let’s say you want to remove everyone who has at least 4 missing values in their data.

First, it’s best to create a column, let’s say NA_count, that contains the number of NAs for each participant.

The rowSums() function calculates, well, sums for each row. We want to use it to count NAs. To do that, we need to know whether a value is an NA or not. You already know the function that does that.

Inside dplyr::mutate(), however, you would normally apply that function to a single variable. To use it on all variables/columns of your dataset at once, you can use the dplyr::across() function. Check out its documentation to see how it’s used.

Once you have the NA_count column, discarding rows whose value in this column meets or exceeds our criterion of 4 NAs is fairly trivial.

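# count NAs in each row
# (across() without column selection is deprecated in recent dplyr,
# so we ask for everything() explicitly)
data <- data %>%
  dplyr::mutate(NA_count = rowSums(is.na(dplyr::across(dplyr::everything()))))
# how many had 4 or more NAs
cba <- data %>%
  dplyr::filter(NA_count >= 4) %>%
  nrow()
# get rid of them
data <- data %>%
  dplyr::filter(NA_count < 4)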
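If the nested call looks intimidating, it can help to run it on a tiny example first. The sketch below uses a made-up tibble and the NA_count name from the hints: across(everything()) hands all columns to is.na(), which returns TRUE/FALSE for every cell, and rowSums() adds up the TRUEs in each row.

```r
library(dplyr)

# toy data: participant 2 skipped everything, participant 3 skipped one item
toy <- tibble(
  ID = 1:3,
  Pain = c(5, NA, NA),
  Experience = c(4, NA, 6),
  Agency = c(3, NA, 2),
  Empathy = c(6, NA, 5)
)

# number of missing values per row
toy <- toy %>%
  mutate(NA_count = rowSums(is.na(across(everything()))))

# rows that meet the "at least 4 NAs" criterion
n_dropouts <- toy %>%
  filter(NA_count >= 4) %>%
  nrow()

toy$NA_count
n_dropouts
```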

Descriptives

Task 14

In the descriptives code chunk, write code that creates a tibble with Ns, %s, and age (mean and SD) breakdown by categories of the Gender variable. Something like this:

# A tibble: 3 x 5
  Gender        n   perc age_mean age_sd
* <fct>     <int>  <dbl>    <dbl>  <dbl>
1 Female      131 61.8       23.6   8.13
2 Male         79 37.3       24.3   8.59
3 (Missing)     2  0.943     23     4.24

You did this last time!

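gender_desc <- data %>%
  # fct_explicit_na() turns NAs into an explicit "(Missing)" level; in forcats 1.0.0
  # and later it is deprecated in favour of fct_na_value_to_level(level = "(Missing)")
  dplyr::mutate(Gender = forcats::fct_explicit_na(as.factor(Gender))) %>%
  dplyr::group_by(Gender) %>%
  dplyr::summarise(n = dplyr::n(),
                   perc = dplyr::n()/nrow(data) * 100,
                   age_mean = mean(Age, na.rm = TRUE),
                   age_sd = sd(Age, na.rm = TRUE))
# call the object's name to print it out
gender_desc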

Task 15

Knit the document from your R Markdown file and make sure everything looks OK.

And that’s it for the practical. There’s an extra task below for the keen beans so if you still have it in you, feel free to tackle it. If not, you can do it once you’ve rested.

Today you learnt how to thoroughly inspect messy data, paying close attention to classes and ranges of your columns. You also practised recoding categorical variables, finding and fixing entry errors, and dealing with improbable or impossible values. You also learnt how to think about missing data and how to handle them.

EXTRA: Write-up

Task 16

Complete the Participants section in Write-up (in the .Rmd file) and copy over the rest from last time.

4.1.1 Participants

The study was conducted on a sample of 217 volunteers, with data collected anonymously online. Data from two participants were excluded because their age values fell outside the accepted range of 16-100. Data from a further three participants were removed due to a high number of missing values. Finally, two observations of the “Pain” variable were declared missing due to impossible values. This resulted in a sample size of N = 212 (M_age = 23.84, SD_age = 8.26; 61.8% female, 37.3% male).
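The final N in a paragraph like this is simple arithmetic on the counts saved during cleaning. The sketch below types the values in by hand, taken from the paragraph above, purely for illustration; in your .Rmd they would come from the sample_size, bad_age, and cba objects instead. The n_final and sentence names are made up for this example.

```r
# values copied from the write-up for illustration only
sample_size <- 217  # rows in the raw data
bad_age <- 2        # removed for out-of-range age
cba <- 3            # removed for too many missing values

# final sample size after both exclusions
n_final <- sample_size - bad_age - cba

# build the sentence from the numbers rather than typing them in
sentence <- sprintf("This resulted in a sample size of N = %d.", n_final)
sentence
```

Computing the numbers from stored objects means the write-up updates itself if the data or the exclusion criteria ever change.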

[…]

Well done!