Data Cleaning and Methods

Practical 04

Analysing Data (University of Sussex)
04-13-2021

Overview

Today we will work on a collaborative data cleaning and write-up task. You should work with your team to inspect, clean, summarise, and write up a report on participant demographics.

What For?

You will NOT be required to do any data cleaning for your lab report. However, inspecting and cleaning data is a critical part of working with data. You should think of this practical as an unmarked warm-up for the lab report, revision of the skills we’ve covered in the last few weeks, and practice for your final year project down the line.

Part 1: Data Cleaning

Teamwork!

Today’s writeup will be a collaborative effort. Before you jump in, decide on roles with your team.

Task 1

Decide who within your team will take on the following roles. You should have only one scribe, but you can have more than one of the other roles.

Keep in mind that if someone in your team is usually the scribe, you should switch roles so that everyone gets practice working in RStudio.

At this point, the Scribe should share their screen, and your team should work through the following tasks together.

Setup

Task 2

Download the adata package, which contains the data, by running the following code IN THE CONSOLE.

remotes::install_url("http://and.netlify.app/pkg/adata.zip", 
                     build = FALSE, upgrade = "never")

Task 3

Just like every week, we want to work in a new R project. If you haven’t done it yet, create a week_04 R project inside of your module folder and within it, create the standard folder structure, just like last week.

Task 4

Download this R Markdown file and save/move it into your r_docs folder.

Use this R Markdown file to complete the following tasks.

Task 5

In the setup code chunk of the .Rmd file, write the code to load the tidyverse package you will need to complete this practical.

# add library() commands to load all packages you need for this document.

library(tidyverse)

Task 6

Complete the read-in code chunk by storing your candidate number in the candidate_number object, then running the entire chunk.

For the solutions, I’ll use a particular number to generate the data. You can get the same data by using the same number once the solutions are published. However, when you use a different number (such as your candidate number!), your data and variable names will be different than the ones in the solutions.

This is a good thing; it means that you can practice translating slightly different code/data to your own by analogy. However, you should be aware that this means you will not have exact solutions for this practical.

## Seed for solutions
candidate_number <- 99999

data <- adata::demo_data(seed = candidate_number)

You should see a new object, data, appear in your environment.

Inspection

Task 7

Have a look at the data object and the variables it contains.

## There are lots of ways to do this
## Use one or all of them!

data

View(data)

summary(data)

Task 8

Compare the variables in data with the codebook, below. Which variables have problems? What might those problems be?

The variable names might not look exactly the same as these - they may have CAPS or full stops instead of underscores. We did this on purpose! You should still easily be able to tell which variables are which.

Codebook


Cleaning

Before we do anything else, let’s first record how many participants we started with. This will help us later on to report our exclusions.

Task 9

Save the number of participants in a new object called n_initial_px.

n_initial_px <- nrow(data)

Next, let’s look at each variable in turn. Remember that there are three main steps for each variable:

  1. Identify issues with the variable, or any cases that should be excluded.
  2. If necessary, save the number of cases who will be excluded.
  3. Change the variable or remove the cases.

Withdrawals

We are going to start with the variable containing requests to withdraw, which is where you should always start as well. This is about respecting your participants’ ethical rights: always remove people who withdrew before you do any other analysis.

Task 10

Identify if any participants chose to withdraw.

If no one withdrew, the withdraw variable should contain only NAs. Does it?

## Option: create a table of the values in this variable
data$withdraw %>% 
  table()
.
Yes 
  6 

## Option: ask R if all of the values in the withdraw variable are NAs
all(is.na(data$withdraw))
[1] FALSE

So, it seems some people did ask to withdraw. In that case, we must remove them before we continue.
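
One detail about the table above: table() leaves NAs out by default, which is why only the Yes count shows up. The sketch below uses a small made-up vector (not the adata data) and the useNA argument of table() to show the NAs as well.

```r
# Toy values, invented for illustration only
toy_withdraw <- c(NA, "Yes", NA, NA, "Yes")

# Default behaviour: NAs are silently left out of the counts
counts_default <- table(toy_withdraw)

# useNA = "ifany" adds an <NA> entry whenever there are missing values
counts_with_na <- table(toy_withdraw, useNA = "ifany")

counts_default
counts_with_na
```

Seeing the NA count next to the Yes count is a quick way to check that the two add up to the number of rows.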

Task 10.1

Identify which cases (rows) should be excluded.

We know that is.na() will tell us which values are NAs (it returns TRUE for NA values). However, we want the rows for which the value of withdraw is NOT NA. In other words, we want to negate the output of is.na().

If you’re not sure how to do this, have one of your Fixers Google “negation operator in R”!

## Use ! to negate
data %>% 
  dplyr::filter(!is.na(withdraw))
# A tibble: 6 x 4
  id_code gender age_years withdraw
  <fct>    <int> <chr>     <chr>   
1 UCIK         2 22        Yes     
2 KURL         2 20        Yes     
3 EBBT         1 20        Yes     
4 QCNP         1 22        Yes     
5 XGZA         1 19        Yes     
6 TRCB         2 23        Yes     

Task 10.2

Save the number of people who will be excluded for this reason in a new object called n_withdraw.

n_withdraw <- data %>% 
  dplyr::filter(!is.na(withdraw)) %>% 
  nrow()

Task 10.3

Remove them and save this change to the dataset.

Remember that this filter command should be the reverse of the one we were just using!

data <- data %>% 
  dplyr::filter(is.na(withdraw))
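
To see why the two filter conditions are exact opposites, here is a minimal sketch in base R on a made-up vector (the values and the object names toy_withdraw and n_kept are invented for illustration). It shows what is.na() and !is.na() return underneath the filter() calls.

```r
# Toy values, invented for illustration only
toy_withdraw <- c(NA, "Yes", NA, NA, "Yes")

# TRUE where there is NO withdrawal request
no_request <- is.na(toy_withdraw)

# ! flips every TRUE to FALSE and vice versa:
# TRUE where someone DID ask to withdraw
asked_to_withdraw <- !is.na(toy_withdraw)

# Summing a logical vector counts the TRUEs
n_withdraw <- sum(asked_to_withdraw)
n_kept <- sum(no_request)

n_withdraw
n_kept
```

Because every row is either NA or not NA, the two counts always add up to the total number of rows, so nobody is lost or counted twice.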

Age

Next, let’s have a look at the variable containing participants’ ages. As we’ve seen before, this variable should be numeric, but something has gone wrong. Let’s have a look.

Task 11

Complete the following steps to change the age variable to numeric.

Task 11.1

Identify what has caused this variable to be read in as character instead of numeric.

Make a table of the age variable to see if you can spot any issues.

As a simple check, let’s make a table of all of the values of age:

data$age_years %>% table()
.
14 15 16 17 18 19 1w 20 21 22 23 24 26  4  6 
 2  2  7 12 12 19  1 25 21 22  8  7  2  1  1 

We can spot that there’s one value that isn’t only a number, but instead has a letter in it. This doesn’t tell us which case that is, but it does tell us what the problem is!

Task 11.2

Identify which cases should be excluded.

Here we need to work out where the non-numeric characters are in our data. If you don’t remember how to do this, last week’s live-coding practical explained it!

To get the row that contains that value, we’ll use a two-step process that we demonstrated in the live-coding last week. First, we create a new variable forcing the age variable into numeric, knowing that it will create NAs where there are non-numeric characters. Then, we can filter based on both our new age variable and the original one; the row with a character typo is the mismatch between them.

data %>% 
  dplyr::mutate(age_num = as.numeric(age_years)) %>% 
  dplyr::filter(is.na(age_num) & !is.na(age_years))
# A tibble: 1 x 5
  id_code gender age_years withdraw age_num
  <fct>    <int> <chr>     <chr>      <dbl>
1 BFPK         2 1w        <NA>          NA

Task 11.3

Save the number of people who will be excluded for this reason in a new object called n_age_typo.

n_age_typo <- data %>% 
  dplyr::mutate(age_num = as.numeric(age_years)) %>% 
  dplyr::filter(is.na(age_num) & !is.na(age_years)) %>% 
  nrow()

Task 11.4

Change this mistyped value of age to NA.

There are lots of ways to do this. I like na_if() from dplyr. All you have to do is tell it the variable and which value(s) you want to replace. Easy!

data <- data %>% 
  dplyr::mutate(age_years = dplyr::na_if(age_years, "1w"))

Make sure you save this change back to your dataset.

Task 11.5

Change the age variable from character to numeric.

data <- data %>% 
  dplyr::mutate(age_years = as.numeric(age_years))
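
The whole sequence from Tasks 11.2 to 11.5 can be sketched on a small made-up vector. This version uses plain base R indexing instead of mutate() and na_if(), purely so that it runs without any packages; the values and the helper names age_num and is_typo are invented for illustration.

```r
# Toy values, invented for illustration: one typo ("1w") and one genuine NA
age_years <- c("19", "1w", "22", NA, "4")

# Step 1: coerce a copy to numeric. Anything that is not a clean number
# becomes NA (R warns about this; the warning is silenced here)
age_num <- suppressWarnings(as.numeric(age_years))

# Step 2: a typo is a value that exists as text but is NA as a number.
# The genuine NA is NA in both, so it is not flagged
is_typo <- is.na(age_num) & !is.na(age_years)
age_years[is_typo]

# Step 3: count the typos before changing anything
n_age_typo <- sum(is_typo)

# Step 4: replace the typo with NA, then convert the variable for real
age_years[is_typo] <- NA
age_years <- as.numeric(age_years)
age_years
```

Note that as.numeric() in the solution code will print a "NAs introduced by coercion" warning when it meets the typo. That warning is expected here; it is exactly the behaviour the two-step check relies on.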

Well done! Now that they’re numbers, we can finally have a look at the actual values in the age variable. We should remove missing values of age as well as unethical or impossible values.

Task 12

Complete the following tasks to finish cleaning up the age variable.

Task 12.1

Identify which cases have unethical or impossible ages - younger than 18 or older than 100.

data %>% 
  dplyr::filter(age_years < 18 | age_years > 100)
# A tibble: 25 x 4
   id_code gender age_years withdraw
   <fct>    <int>     <dbl> <chr>   
 1 RTSI         1        17 <NA>    
 2 VFMU         1        17 <NA>    
 3 EJPM         1        16 <NA>    
 4 GTPG         2         4 <NA>    
 5 GIBD         1        16 <NA>    
 6 GAEU         1        17 <NA>    
 7 OHBT         1        16 <NA>    
 8 DLMA         2        16 <NA>    
 9 NCAN         2        14 <NA>    
10 EJPC         1         6 <NA>    
# ... with 15 more rows

Task 12.2

Save the number of people who will be excluded because of unethical or impossible ages in a new object called n_age_bad.

n_age_bad <- data %>% 
  dplyr::filter(age_years < 18 | age_years > 100) %>% 
  nrow()

Task 12.3

Before we carry on, save the number of people who will be excluded because of missing data in the age variable in a new object called n_age_na.

n_age_na <- data %>% 
  dplyr::filter(is.na(age_years)) %>% 
  nrow()

Task 12.4

Remove everyone with missing or unethical values of age and save this change to the dataset.

data <- data %>% 
  dplyr::filter(age_years >= 18 & age_years <= 100)
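
It may not be obvious why this single filter also removes the missing ages. filter() keeps only rows where the condition is TRUE, and a comparison involving NA gives NA rather than TRUE, so those rows are dropped too. Here is a minimal sketch on made-up data (the toy data frame and the object kept are invented for illustration).

```r
library(dplyr)

# Toy data, invented for illustration (not the adata dataset)
toy <- data.frame(
  id_code   = c("A", "B", "C", "D", "E"),
  age_years = c(17, 18, NA, 25, 101)
)

# Count each kind of exclusion BEFORE removing anything
n_age_bad <- toy %>%
  dplyr::filter(age_years < 18 | age_years > 100) %>%
  nrow()

n_age_na <- toy %>%
  dplyr::filter(is.na(age_years)) %>%
  nrow()

# Rows where the condition is FALSE *or* NA are dropped,
# so the missing age disappears along with the out-of-range ones
kept <- toy %>%
  dplyr::filter(age_years >= 18 & age_years <= 100)

kept
```

This is also why the counts need to be saved first: once the rows are gone, there is no way to tell how many were missing and how many were out of range.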

Whew, well done!

Gender

Finally, we need to check the gender variable. According to our codebook, this should be a factor with four levels. If you’re not sure how to change a variable to a factor, just recoding as a character variable is fine.

Task 13

Complete the following tasks to clean up the gender variable.

Task 13.1

Get a table of the values of gender to check whether we do in fact have four categories.

data$gender %>% 
  table()
.
 1  2  3  4 
67 37  9  3 

Task 13.2

Check whether there are any NAs in the gender variable.

data %>% 
  filter(is.na(gender))
# A tibble: 0 x 4
# ... with 4 variables: id_code <fct>, gender <int>, age_years <dbl>,
#   withdraw <chr>

Great - we can simply recode our data, without having to remove anyone.

Task 13.3

Recode the gender variable with character values OR change into a factor (your choice!).

If you’re curious about factors, check out the Week 3 Level Up seminR.

If you just want to change the numeric values into character values, we can do this easily with recode.

data <- data %>% 
  dplyr::mutate(gender = dplyr::recode(gender,
                                       `1` = "Female", `2` = "Male",
                                       `3` = "Other", `4` = "Prefer Not To Say"))

If you want to change this variable into a factor, we can instead add labels to the existing numeric values using factor. Note that labels are applied in numeric order (so the first label goes to 1, the second to 2, etc.).

data <- data %>% 
  dplyr::mutate(gender = factor(gender,
                                labels = c("Female", "Male", "Other", "Prefer Not To Say")))

The practical upshot of either method is the same for our purposes.
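
One thing to watch if your data differ from the solutions: factor() with only a labels argument works when all four codes actually appear in your data. If one code happens to be absent, the number of labels no longer matches the number of values found. A possible safeguard, sketched here on a made-up vector, is to also state the levels explicitly (the object names gender_labels and gender_fct are invented for illustration).

```r
# Toy values, invented for illustration: note that nobody has code 4
gender <- c(1L, 2L, 1L, 3L, 2L)

gender_labels <- c("Female", "Male", "Other", "Prefer Not To Say")

# Stating levels = 1:4 tells R which codes are possible,
# so the four labels line up even though code 4 never occurs
gender_fct <- factor(gender, levels = 1:4, labels = gender_labels)

# The unused category is kept, with a count of zero
table(gender_fct)
```

With the solutions' data all four codes are present, so both versions should give the same result there.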

Task 14

Get a summary of your variables again. Is everything in order?

summary(data)
    id_code                  gender     age_years    
 ABUM   :  1   Female           :67   Min.   :18.00  
 AXOL   :  1   Male             :37   1st Qu.:19.00  
 BDDR   :  1   Other            : 9   Median :21.00  
 BGGO   :  1   Prefer Not To Say: 3   Mean   :20.74  
 BNXE   :  1                          3rd Qu.:22.00  
 BOXU   :  1                          Max.   :26.00  
 (Other):110                                         
   withdraw        
 Length:116        
 Class :character  
 Mode  :character  
                   
                   
                   
                   

Now we can see that the age variable is numeric, with a reasonable and ethical range of ages; that gender is now character (or factor, if you used that method); and we were careful to remove anyone who wanted their data withdrawn.

Task 15

To finish up, save the final number of participants in a new object called n_final_px.

n_final_px <- nrow(data)

Notice that this is exactly the same command we ran way back at the beginning to count the initial number of participants. We get a different answer because we’ve changed what data contains in between.
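
A quick way to check that no exclusions were missed is to confirm that the counts add up. The sketch below hard-codes the values from the solutions' data; with your own candidate number, use your stored objects instead. The object n_excluded is invented for illustration.

```r
# Values from the solutions' data; yours will differ
n_initial_px <- 150
n_withdraw   <- 6
n_age_na     <- 3    # counted after the typo became NA, so it includes the typo
n_age_bad    <- 25
n_final_px   <- 116

# n_age_typo is deliberately left out: it is already part of n_age_na
n_excluded <- n_withdraw + n_age_na + n_age_bad

# Should be TRUE if every removed participant is accounted for
n_initial_px - n_excluded == n_final_px
```

If this comes out FALSE, some rows were removed without being counted, or counted twice.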

Well done!

 

Part 2: Reporting

Now we’ve got a clean and carefully checked dataset, we should write up a description of our participants. This is an essential part of scientific articles, because it’s important to be clear about who you tested and why you excluded anyone.

Task 16

Write a short paragraph reporting the following:

Remember that you’ve stored these numbers in handy objects!

You can call them in your console, or see them in your environment.
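
As a rough sketch of how stored objects can feed into a sentence, the code below pastes three counts into text. The values are hard-coded from the solutions' data so that the example runs on its own, and the object name report is invented for illustration.

```r
# Values from the solutions' data; in your document these objects already exist
n_initial_px <- 150
n_withdraw   <- 6
n_final_px   <- 116

# paste0() glues text and numbers together without adding spaces
report <- paste0(
  "Initially, ", n_initial_px, " participants took part in the study. ",
  n_withdraw, " participants requested to withdraw and were removed. ",
  "In total, ", n_final_px, " participants were included in the analysis."
)

cat(report, "\n")
```

In the text of an R Markdown document you can get the same effect with inline code, for example writing `r n_initial_px` in the middle of a sentence, so the number updates automatically when the data change.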

Generally, you do not need to report data wrangling (such as changing variable types or creating factors); this is an expected part of your data management, so you don’t need to explain that you’ve done it, or how. You do need to report material changes to your dataset, such as removing cases.

You should write up your description in your own words, but here’s an example:

“Initially, 150 participants took part in the study. Before any other analysis, 6 participants requested to withdraw and were removed. A further 3 participants neglected to give their ages, including 1 participant who had a typo in their age, and 25 were below 18 years old; all 28 of these participants were removed for ethical reasons. In total, 116 participants were included in the analysis.”

It’s typical to give some summary descriptives about your participants as well. Let’s do that and add it to our report.

Task 17

Calculate the mean and sd of age for your final participants, and save these values as mean_age and sd_age.

mean_age <- mean(data$age_years)
sd_age <- sd(data$age_years)
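
For reporting, these values usually need rounding. Here is a small sketch on a made-up vector of ages (the values and the object name age_report are invented for illustration), which also shows why missing ages had to be dealt with before this step.

```r
# Toy ages, invented for illustration only
ages <- c(18, 19, 21, 22, 26)

mean_age <- mean(ages)
sd_age <- sd(ages)

# round() gives a number rounded to two decimal places
round(mean_age, 2)
round(sd_age, 2)

# sprintf() gives text with exactly two decimals, handy for a write-up
age_report <- sprintf("M = %.2f, SD = %.2f", mean_age, sd_age)
age_report

# A single NA makes the mean NA unless you ask R to skip missing values
mean(c(ages, NA))
mean(c(ages, NA), na.rm = TRUE)
```

Because the cleaned dataset has no missing ages left, mean() and sd() work on it without na.rm.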

Task 18

Create a summary table of your participants that gives the following information:

Format your table with nice headings and values rounded to two decimal places.

You may need to use the kableExtra package…

data %>% 
  dplyr::group_by(gender) %>%  # split up by gender
  dplyr::summarise(
    n = dplyr::n(),            # count number in each group
    mean = mean(age_years),
    sd = sd(age_years),
    min = min(age_years),
    max = max(age_years)
  ) %>% 
  # kable() comes from knitr and kable_styling() from kableExtra;
  # naming the package avoids a "could not find function" error
  knitr::kable(col.names = c("Gender", "*N*", "*M*~age~", "*SD*~age~", "Min~age~", "Max~age~"),
               digits = 2) %>% 
  kableExtra::kable_styling()

Task 19

Add the overall mean and SD of age and the table to your report. Then, knit your Markdown to see how it looks!

In addition to the paragraph above, you may add something like this:

“In the final sample, the mean age was 20.74 (SD = 1.79). A summary of participant descriptives is given in the table below.”

Then, put your table underneath!

Well done!