Analysing Data: Data Cleaning and Methods

Today we will work on a collaborative data cleaning and write-up task. You should work with your team to inspect, clean, summarise, and write up a report on participant demographics.

What For?

You will NOT be required to do any data cleaning for your lab report. However, inspecting and cleaning data is a critical part of working with it. You should think of this practical as an unmarked warm-up for the lab report, revision of the skills we’ve covered in the last few weeks, and practice for your final year project down the line.

Part 1: Data Cleaning

Teamwork!

Today’s writeup will be a collaborative effort. Before you jump in, decide on roles with your team.

Task 1 (punk!)

Decide who within your team will do the following roles. You should have only one scribe, but you can have more than one of the other roles.

Keep in mind that if someone in your team is usually the scribe, you should switch roles so that everyone gets practice working in RStudio.

At this point, the Scribe should share their screen, and your team should work through the following tasks together.

Setup

Task 2 (punk!)

Download the adata package, which contains the data, by running the following code IN THE CONSOLE.

Task 3 (punk!)

Just like every week, we want to work in a new R project. If you haven’t done it yet, create a week_04 R project inside of your module folder and within it, create the standard folder structure, just like last week.

Task 4 (punk!)

Task 5 (punk!)

In the setup code chunk of the .Rmd file, write the code to load the tidyverse package you will need to complete this practical.

Task 6 (punk!)

Complete the read-in code chunk by storing your candidate number in the candidate_number object, then running the entire chunk.

For the solutions, I’ll use a particular number to generate the data. You can get the same data by using the same number once the solutions are published. However, when you use a different number (such as your candidate number!), your data and variable names will be different from the ones in the solutions.

This is a good thing; it means that you can practice translating slightly different code/data to your own by analogy. However, you should be aware that this means you will not have exact solutions for this practical.

Inspection

Task 7 (punk!)

Task 8 (Prog-rocK)

Compare the variables in data with the codebook, below. Which variables have problems? What might those problems be?

Hint

The variable names might not look exactly the same as these - they may have CAPS or full stops instead of underscores. We did this on purpose! You should still easily be able to tell which variables are which.

Codebook

Cleaning

Before we do anything else, let’s first record how many participants we started with. This will help us later on to report our exclusions.

Task 9
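A minimal sketch of recording the starting sample size, assuming the dataset is stored in an object called data, as in the rest of this practical. The toy tibble is made up purely so the example runs on its own; n_initial_px is the object name used later in the reporting section.

```r
# Hypothetical toy data standing in for the real dataset
data <- tibble::tibble(
  id_code = c("AAAA", "BBBB", "CCCC", "DDDD"),
  age_years = c("19", "1w", "17", NA),
  withdraw = c(NA, NA, "Yes", NA)
)

# Count the rows (one row per participant) before any exclusions
n_initial_px <- nrow(data)
n_initial_px
```

Storing this number before removing anyone is what makes it possible to report exclusions accurately later.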

Next, let’s look at each variable in turn. Remember that there are three main steps for each variable:

Withdrawals

We are going to start with the variable containing requests to withdraw, which is where you should always start as well. This is about respecting your participants’ ethical rights: always remove people who withdrew first, before you do any other analysis.

Task 10 (punk!)

Hint

If no one withdrew, the withdraw variable should contain only NAs. Does it?

##Option: create a table of the values in this variable
data$withdraw %>% 
  table()

.
Yes 
  6

##Option: ask R if all of the values in the withdraw variable are NAs
all(is.na(data$withdraw))

[1] FALSE

So, it seems some people did ask to withdraw. In that case, we must remove them before we continue.

Task 10.1 (Prog-rocK)

Hint

We know that is.na() will tell us which values are NAs (it returns TRUE for NA values). However, we want the rows for which the value of withdraw is NOT NA. In other words, we want to negate the output of is.na().

If you’re not sure how to do this, have one of your Fixers Google “negation operator in R”!

##Use ! to negate
data %>% 
  dplyr::filter(!is.na(withdraw))

# A tibble: 6 x 4
  id_code gender age_years withdraw
  <fct>    <int> <chr>     <chr>   
1 UCIK         2 22        Yes     
2 KURL         2 20        Yes     
3 EBBT         1 20        Yes     
4 QCNP         1 22        Yes     
5 XGZA         1 19        Yes     
6 TRCB         2 23        Yes

Task 10.2 (punk!)

Save the number of people who will be excluded for this reason in a new object called n_withdraw.

Task 10.3 (punk!)

Hint

Remember that this filter command should be the reverse of the one we were just using!
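One way these two steps might look, as a sketch on made-up data rather than the official solution. It assumes the variable is called withdraw and contains NA for everyone who did not ask to withdraw; your variable name may differ.

```r
library(dplyr)

# Hypothetical toy data: two of five participants asked to withdraw
data <- tibble::tibble(
  id_code = c("AAAA", "BBBB", "CCCC", "DDDD", "EEEE"),
  withdraw = c(NA, "Yes", NA, "Yes", NA)
)

# Count the people who asked to withdraw (withdraw is NOT NA)
n_withdraw <- data %>%
  dplyr::filter(!is.na(withdraw)) %>%
  nrow()

# Keep only the people who did not withdraw (withdraw IS NA),
# and save the change back to the dataset
data <- data %>%
  dplyr::filter(is.na(withdraw))
```

Counting before filtering matters here: once the rows are removed, there is nothing left to count.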

Age

Next, let’s have a look at the variable containing participants’ ages. As we’ve seen before, this variable should be numeric, but something has gone wrong. Let’s have a look.

Task 11

Task 11.1 (punk!)

Identify what has caused this variable to be read in as character instead of numeric.

Hint

Make a table of the age variable to see if you can spot any issues.

As a simple check, let’s make a table of all of the values of age:

data$age_years %>% table()

.
14 15 16 17 18 19 1w 20 21 22 23 24 26  4  6 
 2  2  7 12 12 19  1 25 21 22  8  7  2  1  1

We can spot that there’s one value that isn’t only a number, but instead has a letter in it. This doesn’t tell us which case that is, but it does tell us what the problem is!

Task 11.2 (Prog-rocK)

Hint

Here we need to work out where the non-numeric characters are in our data. If you don’t remember how to do this, last week’s live-coding practical explained it!

To get the row that contains that value, we’ll use a two-step process that we demonstrated in the live-coding last week. First, we create a new variable forcing the age variable into numeric, knowing that it will create NAs where there are non-numeric characters. Then, we can filter based on both our new age variable and the original one; the row with a character typo is the mismatch between them.

data %>% 
  dplyr::mutate(age_num = as.numeric(age_years)) %>% 
  dplyr::filter(is.na(age_num) & !is.na(age_years))

# A tibble: 1 x 5
  id_code gender age_years withdraw age_num
  <fct>    <int> <chr>     <chr>      <dbl>
1 BFPK         2 1w        <NA>          NA

Task 11.3

Save the number of people who will be excluded for this reason in a new object called n_age_typo.

n_age_typo <- data %>% 
  dplyr::mutate(age_num = as.numeric(age_years)) %>% 
  dplyr::filter(is.na(age_num) & !is.na(age_years)) %>% 
  nrow()

Task 11.4

There are lots of ways to replace the typo with NA. I like na_if() from dplyr. All you have to do is tell it the variable and which value you want to replace. Easy!

data <- data %>% 
  dplyr::mutate(age_years = dplyr::na_if(age_years, "1w"))

Make sure you save this change back to your dataset.

Task 11.5

Well done! We can now finally have a look at the actual values in the age variable, now that they’re numbers. We should remove both missing values of age as well as unethical or impossible values.
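The conversion that turns age into numbers can be sketched as follows. This is an illustration on made-up data, assuming the only non-numeric value is the typo "1w" seen in the solutions; your own typo may look different.

```r
library(dplyr)

# Hypothetical toy data: age was read in as character because of one typo
data <- tibble::tibble(
  id_code = c("AAAA", "BBBB", "CCCC"),
  age_years = c("19", "1w", "22")
)

data <- data %>%
  dplyr::mutate(
    age_years = dplyr::na_if(age_years, "1w"),  # replace the typo with NA
    age_years = as.numeric(age_years)           # then convert to numeric
  )

class(data$age_years)
```

Replacing the typo first means as.numeric() has nothing left that it cannot convert, so it runs without a coercion warning.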

Task 12

Task 12.1 (punk!)

Identify which cases have unethical or impossible ages - younger than 18 or older than 100.

data %>% 
  dplyr::filter(age_years < 18 | age_years > 100)

# A tibble: 25 x 4
   id_code gender age_years withdraw
   <fct>    <int>     <dbl> <chr>   
 1 RTSI         1        17 <NA>    
 2 VFMU         1        17 <NA>    
 3 EJPM         1        16 <NA>    
 4 GTPG         2         4 <NA>    
 5 GIBD         1        16 <NA>    
 6 GAEU         1        17 <NA>    
 7 OHBT         1        16 <NA>    
 8 DLMA         2        16 <NA>    
 9 NCAN         2        14 <NA>    
10 EJPC         1         6 <NA>    
# ... with 15 more rows

Task 12.2 (punk!)

Save the number of people who will be excluded because of unethical or impossible ages in a new object called n_age_bad.

Task 12.3 (punk!)

Before we carry on, save the number of people who will be excluded because of missing data in the age variable in a new object called n_age_na.

Task 12.4 (Prog-rocK)

Remove everyone with missing or unethical values of age and save this change to the dataset.
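A possible approach to Tasks 12.2 to 12.4, sketched on made-up data and assuming age has already been converted to numeric. The object names n_age_bad and n_age_na come from the tasks above; the toy values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, with age already converted to numeric
data <- tibble::tibble(
  id_code = c("AAAA", "BBBB", "CCCC", "DDDD", "EEEE"),
  age_years = c(19, 17, NA, 4, 23)
)

# Number of cases with unethical or impossible ages
n_age_bad <- data %>%
  dplyr::filter(age_years < 18 | age_years > 100) %>%
  nrow()

# Number of cases with missing ages
n_age_na <- data %>%
  dplyr::filter(is.na(age_years)) %>%
  nrow()

# Keep ages from 18 to 100; filter() also drops rows where the
# condition is NA, so missing ages are removed in the same step
data <- data %>%
  dplyr::filter(age_years >= 18 & age_years <= 100)
```

Both counts are saved before the final filter, for the same reason as with withdrawals: the rows need to exist in order to be counted.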

Gender

Finally, we need to check the gender variable. According to our codebook, this should be a factor with four levels. If you’re not sure how to change a variable to a factor, just recoding as a character variable is fine.

Task 13

Task 13.1 (punk!)

Get a table of the values of gender to check whether we do in fact have four categories.

data$gender %>% 
  table()

.
 1  2  3  4 
67 37  9  3

Task 13.2 (punk!)

data %>% 
  dplyr::filter(is.na(gender))

# A tibble: 0 x 4
# ... with 4 variables: id_code <fct>, gender <int>, age_years <dbl>,
#   withdraw <chr>

Task 13.3 (Prog-rocK)

Recode the gender variable with character values OR change into a factor (your choice!).

Hint

If you’re curious about factors, check out the Week 3 Level Up seminR.

If you just want to change the numeric values into character values, we can do this easily with recode.

data <- data %>% 
  dplyr::mutate(gender = dplyr::recode(gender,
                                       `1` = "Female", `2` = "Male",
                                       `3` = "Other", `4` = "Prefer Not To Say"))

If you want to change this variable into factors, we can instead add labels to the existing numeric values using factor. Note that labels are applied in numeric order (so the first label to 1, the second to 2, etc.)

data <- data %>% 
  dplyr::mutate(gender = factor(gender,
                                labels = c("Female", "Male", "Other", "Prefer Not To Say")))

The practical upshot of either method is the same for our purposes.
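One caveat with the factor method is worth a small sketch. When only labels are given, factor() matches them to the sorted values that actually appear in the data, so if nobody in your dataset chose one of the categories, four labels will no longer line up with the codes. The example below uses made-up data and the same code-to-label mapping as the recode solution; giving the levels explicitly is one way to guard against this, not necessarily the approach intended in the practical.

```r
# Hypothetical toy data in which nobody chose category 3
gender <- c(1L, 2L, 2L, 4L, 1L)

gender_labels <- c("Female", "Male", "Other", "Prefer Not To Say")

# Giving levels explicitly ties each label to its code, even when a
# code does not appear in the data
gender_fct <- factor(gender, levels = 1:4, labels = gender_labels)

table(gender_fct)
```

With explicit levels, the empty category still shows up in the table with a count of zero, which is usually what you want in a participant summary.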

Task 14 (punk!)

summary(data)

    id_code                  gender     age_years    
 ABUM   :  1   Female           :67   Min.   :18.00  
 AXOL   :  1   Male             :37   1st Qu.:19.00  
 BDDR   :  1   Other            : 9   Median :21.00  
 BGGO   :  1   Prefer Not To Say: 3   Mean   :20.74  
 BNXE   :  1                          3rd Qu.:22.00  
 BOXU   :  1                          Max.   :26.00  
 (Other):110                                         
   withdraw        
 Length:116        
 Class :character  
 Mode  :character

Now we can see that the age variable is numeric, with a reasonable and ethical range of ages; that gender is now character (or factor, if you used that method); and we were careful to remove anyone who wanted their data withdrawn.

Task 15 (punk!)

To finish up, save the final number of participants in a new object called n_final_px.

n_final_px <- nrow(data)

Notice that this is exactly the same command we ran way back at the beginning to count the initial number of participants. We get a different answer because we’ve changed what data contains in between.

Part 2: Reporting

Now we’ve got a clean and carefully checked dataset, we should write up a description of our participants. This is an essential part of scientific articles, because it’s important to be clear about who you tested and why you excluded anyone.

Task 16 (Prog-rocK)

Hint

Remember that you’ve stored these numbers in handy objects!

n_initial_px: original number of participants
n_withdraw: number who withdrew
n_age_typo: number with typos in age
n_age_na: number with NAs in age
n_age_bad: number with unethical or impossible age values
n_final_px: final number of participants

You can call them in your console, or see them in your environment.

Generally, you do not need to report data wrangling (such as changing variable types or creating factors); this is an expected part of your data management, so you don’t need to explain that you’ve done it, or how. You do need to report material changes to your dataset, such as removing cases.

You should write up your description in your own words, but here’s an example:

“Initially, 150 participants took part in the study. Before any other analysis, 6 participants requested to withdraw and were removed. A further 3 participants neglected to give their ages, including 1 participant who had a typo in their age, and 25 were below 18 years old; all 28 of these participants were removed for ethical reasons. In total, 116 participants were included in the analysis.”
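Before writing up, it can help to check that the exclusions add up. The sketch below uses the numbers from the solutions dataset: n_age_bad is the 25 rows returned in Task 12.1, and n_age_na is assumed to already include the one typo that was replaced with NA. Your own numbers will differ; n_age_removed is a helper name introduced only for this illustration.

```r
# Values taken from the solutions dataset
n_initial_px <- 150
n_withdraw   <- 6
n_age_na     <- 3   # includes the 1 typo that was replaced with NA
n_age_bad    <- 25
n_final_px   <- 116

# Total removed because of age (hypothetical helper object)
n_age_removed <- n_age_na + n_age_bad

# The exclusions should account for the whole drop in sample size
stopifnot(n_initial_px - n_withdraw - n_age_removed == n_final_px)
n_age_removed
```

In R Markdown, inline code such as `r n_withdraw` prints the stored value in the knitted text, so the paragraph stays in step with the data instead of relying on numbers typed by hand.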

It’s typical to give some summary descriptives about your participants as well. Let’s do that and add it to our report.

Task 17 (punk!)

Calculate the mean and sd of age for your final participants, and save these values as mean_age and sd_age.
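A minimal sketch of this step, assuming the cleaned age variable is numeric and called age_years. The three ages below are made up so that the example runs on its own.

```r
# Hypothetical toy data standing in for the cleaned dataset
data <- tibble::tibble(age_years = c(18, 20, 22))

mean_age <- mean(data$age_years)
sd_age <- sd(data$age_years)

# Round only when reporting, not when storing the values
round(mean_age, 2)
round(sd_age, 2)
```

Because missing ages were removed during cleaning, there is no need for na.rm = TRUE here; if any NAs were left, both functions would return NA.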

Task 18 (jazz...)

Create a summary table of your participants that gives the following information:

Hint

You may need to use the kableExtra package…

data %>% 
  group_by(gender) %>%  # split up by gender
  summarise(
    n = dplyr::n(), #count number in each group
    mean = mean(age_years),
    sd = sd(age_years),
    min = min(age_years),
    max = max(age_years)
  ) %>% 
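  # kable() comes from knitr and kable_styling() from kableExtra; calling them
  # with :: avoids a "could not find function" error if they are not loaded
  knitr::kable(col.names = c("Gender", "*N*", "*M*~age~", "*SD*~age~", "Min~age~", "Max~age~"),
               digits = 2) %>% 
  kableExtra::kable_styling()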

Task 19

Add the overall mean and SD of age and the table to your report. Then, knit your Markdown to see how it looks!

In addition to the paragraph above, you may add something like this:

“In the final sample, the mean age was 20.74 (SD = 1.79). A summary of participant descriptives is given in the table below.”

Then, put your table underneath!
