Data Cleaning and Methods
...with solutions

Practical 04

Published

April 12, 2021

Overview

Today we will work on a collaborative data cleaning and write-up task. You should work with your team to inspect, clean, summarise, and write up a report on participant demographics.

What For?

You will NOT be required to do any data cleaning for your lab report. However, inspecting and cleaning data is a critical part of working with data. You should think of this practical as an unmarked warm-up for the lab report, a revision of the skills we’ve covered in the last few weeks, and practice for your final year project down the line.

Part 1: Data Cleaning

Teamwork!

Today’s writeup will be a collaborative effort. Before you jump in, decide on roles with your team.

Task 1 (punk!)

Decide who within your team will take on the following roles. You should have only one scribe, but you can have more than one of the other roles.

Keep in mind that if someone in your team is usually the scribe, you should switch roles so that everyone gets practice working in RStudio.

At this point, the Scribe should share their screen, and your team should work through the following tasks together.

Setup

Task 2 (punk!)

Install the adata package, which contains the data, by running the following code IN THE CONSOLE.

remotes::install_url("http://and.netlify.app/pkg/adata.zip", 
                     build = FALSE, upgrade = "never")

Task 3 (punk!)

Just like every week, we want to work in a new R project. If you haven’t done it yet, create a week_04 R project inside of your module folder and within it, create the standard folder structure, just like last week.

Task 4 (punk!)

Download this R Markdown file and save/move it into your r_docs folder.

Use this R Markdown file to complete the following tasks.

Task 5 (punk!)

In the setup code chunk of the .Rmd file, write the code to load the tidyverse package you will need to complete this practical.

Task 6 (punk!)

Complete the read-in code chunk by storing your candidate number in the candidate_number object, then running the entire chunk.

For the solutions, I’ll use a particular number to generate the data. You can get the same data by using the same number once the solutions are published. However, when you use a different number (such as your candidate number!), your data and variable names will be different from the ones in the solutions.

This is a good thing; it means that you can practice translating slightly different code/data to your own by analogy. However, you should be aware that this means you will not have exact solutions for this practical.

You should see a new object, data, appear in your environment.

Inspection

Task 7 (punk!)

Have a look at the data object and the variables it contains.
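As a rough sketch of what inspecting a dataset can look like, the code below builds a small, entirely hypothetical dataset (toy_data, with made-up variable names and values) and looks at it in two ways. Your own data object will have different names and values.

```r
library(dplyr)

# Hypothetical toy data standing in for the practical's `data` object;
# the variable names and values here are assumptions for illustration only
toy_data <- tibble(
  withdraw = c(NA, "Please remove my data", NA),
  age      = c("23", "19", "2o"),
  gender   = c(1, 2, 3)
)

glimpse(toy_data)  # variable names, types, and the first few values
summary(toy_data)  # a quick summary of each variable
```

Notice that glimpse() shows age as character (chr) rather than numeric, which is the kind of clue to look out for.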

Task 8 (Prog-rocK)

Compare the variables in data with the codebook, below. Which variables have problems? What might those problems be?

Hint

The variable names might not look exactly the same as these - they may have CAPS or full stops instead of underscores. We did this on purpose! You should still easily be able to tell which variables are which.

Codebook


Cleaning

Before we do anything else, let’s first record how many participants we started with. This will help us later on to report our exclusions.

Task 9

Save the number of participants in a new object called n_initial_px.
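A minimal sketch of one way to do this, using a hypothetical toy dataset (toy_data) in place of your own data object:

```r
library(dplyr)

# Hypothetical toy data; each row is one participant
toy_data <- tibble(
  id  = 1:4,
  age = c("23", "19", "2o", NA)
)

# The number of rows is the number of participants we started with
n_initial_px <- nrow(toy_data)
n_initial_px
```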

Next, let’s look at each variable in turn. Remember that there are three main steps for each variable:

  1. Identify issues with the variable, or any cases that should be excluded.
  2. If necessary, save the number of cases who will be excluded.
  3. Change the variable or remove the cases.

Withdrawals

We are going to start with the variable containing requests to withdraw, which is where you should always start as well. This is about respecting your participants’ ethical rights: always remove people who withdrew before you do any other analysis.

Task 10 (punk!)

Identify if any participants chose to withdraw.

Hint

If no one withdrew, the withdraw variable should contain only NAs. Does it?

So, it seems some people did ask to withdraw. In that case, we must remove them before we continue.

Task 10.1 (Prog-rocK)

Identify which cases (rows) should be excluded.

Hint

We know that is.na() will tell us which values are NAs (returns TRUE for NA values). However, we want the rows for which the value of withdraw is NOT NA. In other words, we want to negate the output of is.na().

If you’re not sure how to do this, have one of your Fixers Google “negation operator in R”!
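To see what negation does, here is a small sketch on a made-up vector (the values are hypothetical, not taken from the real data). Putting ! in front of a logical value flips TRUE to FALSE and FALSE to TRUE.

```r
# Hypothetical withdraw responses: NA means no request to withdraw
toy_withdraw <- c(NA, "Please remove my data", NA)

is_missing  <- is.na(toy_withdraw)   # TRUE where the value is NA
not_missing <- !is.na(toy_withdraw)  # TRUE where the value is NOT NA

is_missing
not_missing
```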

Task 10.2 (punk!)

Save the number of people who will be excluded for this reason in a new object called n_withdraw.

n_withdraw <- data %>% 
  dplyr::filter(!is.na(withdraw)) %>% 
  nrow()

Task 10.3 (punk!)

Remove them and save this change to the dataset.

Hint

Remember that this filter command should be the reverse of the one we were just using!
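As a sketch of that reversed filter, the code below keeps only the rows where withdraw IS NA. It uses a hypothetical toy dataset (toy_data); with your own data, the variable name may look a little different.

```r
library(dplyr)

# Hypothetical toy data: participant 2 asked to withdraw
toy_data <- tibble(
  id       = 1:4,
  withdraw = c(NA, "Please remove my data", NA, NA)
)

# Keep only the people who did NOT ask to withdraw,
# and overwrite the dataset to save the change
toy_data <- toy_data %>%
  dplyr::filter(is.na(withdraw))

nrow(toy_data)
```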

Age

Next, let’s have a look at the variable containing participants’ ages. As we’ve seen before, this variable should be numeric, but something has gone wrong. Let’s have a look.

Task 11

Complete the following steps to change the age variable to numeric.

Task 11.1 (punk!)

Identify what has caused this variable to be read in as character instead of numeric.

Hint

Make a table of the age variable to see if you can spot any issues.
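For example, a table of some made-up ages (toy_age is hypothetical) might look like the sketch below. Any value that isn't a plain number tends to stand out in the list of values.

```r
# Hypothetical ages stored as character, including one typo and one NA
toy_age <- c("23", "19", "2o", "45", NA, "19")

# useNA = "ifany" also counts missing values, if there are any
age_counts <- table(toy_age, useNA = "ifany")
age_counts
```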

Task 11.2 (Prog-rocK)

Identify which cases should be excluded.

Hint

Here we need to work out where the non-numeric characters are in our data. If you don’t remember how to do this, last week’s live-coding practical explained it!
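One possible approach, which is not necessarily the one used in last week’s practical, is to search for any character that is not a digit. The sketch below does this with base R’s grepl() on a hypothetical toy dataset (toy_data).

```r
library(dplyr)

# Hypothetical toy data with one mistyped age ("2o" instead of "20")
toy_data <- tibble(
  id  = 1:5,
  age = c("23", "19", "2o", "45", NA)
)

# "[^0-9]" matches any single character that is not a digit,
# so grepl() is TRUE for values containing letters or symbols
age_typos <- toy_data %>%
  dplyr::filter(!is.na(age), grepl("[^0-9]", age))

age_typos
```

This pattern assumes ages were meant to be whole numbers; it would also flag a value such as "20.5".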

Task 11.3

Save the number of people who will be excluded for this reason in a new object called n_age_typo.

Task 11.4

Change this mistyped value of age to NA.

Task 11.5

Change the age variable from character to numeric.
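The three steps above could be sketched as follows, again on a hypothetical toy dataset (toy_data) and using the same digit-matching pattern as one possible way of finding the typo.

```r
library(dplyr)

# Hypothetical toy data with one mistyped age
toy_data <- tibble(
  id  = 1:5,
  age = c("23", "19", "2o", "45", NA)
)

# Count the cases with a mistyped age
n_age_typo <- toy_data %>%
  dplyr::filter(grepl("[^0-9]", age)) %>%
  nrow()

# Replace the mistyped value with NA, then convert to numeric
toy_data <- toy_data %>%
  mutate(
    age = replace(age, grepl("[^0-9]", age), NA),
    age = as.numeric(age)
  )

toy_data
```

Replacing the typo with NA first means as.numeric() has nothing left that it can't convert, so it runs without a warning.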

Well done! Now that the values in the age variable are numbers, we can finally have a look at them. We should remove both missing values of age and unethical or impossible values.

Task 12

Complete the following tasks to finish cleaning up the age variable.

Task 12.1 (punk!)

Identify which cases have unethical or impossible ages - younger than 18 or older than 100.

Task 12.2 (punk!)

Save the number of people who will be excluded because of unethical or impossible ages in a new object called n_age_bad.

Task 12.3 (punk!)

Before we carry on, save the number of people who will be excluded because of missing data in the age variable in a new object called n_age_na.

Task 12.4 (Prog-rocK)

Remove everyone with missing or unethical values of age and save this change to the dataset.
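Here is one way the steps in Task 12 might fit together, sketched on a hypothetical toy dataset (toy_data) where age is already numeric. Note that in this sketch, the count of missing ages would include any value that was changed to NA earlier on.

```r
library(dplyr)

# Hypothetical toy data: two bad ages (17 and 150) and one missing age
toy_data <- tibble(
  id  = 1:6,
  age = c(23, 17, 45, NA, 150, 30)
)

# Count unethical or impossible ages (NAs are not counted here)
n_age_bad <- toy_data %>%
  dplyr::filter(age < 18 | age > 100) %>%
  nrow()

# Count missing ages
n_age_na <- toy_data %>%
  dplyr::filter(is.na(age)) %>%
  nrow()

# Keep only valid ages; filter() also drops rows where the
# condition is NA, so missing ages are removed at the same time
toy_data <- toy_data %>%
  dplyr::filter(age >= 18, age <= 100)
```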

Whew, well done!

Gender

Finally, we need to check the gender variable. According to our codebook, this should be a factor with four levels. If you’re not sure how to change a variable to a factor, just recoding as a character variable is fine.

Task 13

Complete the following tasks to clean up the gender variable.

Task 13.1 (punk!)

Get a table of the values of gender to check whether we do in fact have four categories.

Task 13.2 (punk!)

Check whether there are any NAs in the gender variable.
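Both checks can be sketched on a made-up vector. The numeric codes in toy_gender are an assumption for illustration; your gender variable may be coded differently.

```r
# Hypothetical gender codes (the coding scheme is an assumption)
toy_gender <- c(1, 2, 2, 3, 4, 1)

# How many participants fall in each category?
gender_counts <- table(toy_gender, useNA = "ifany")
gender_counts

# How many values are missing?
n_gender_na <- sum(is.na(toy_gender))
n_gender_na
```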

Great - we can simply recode our data, without having to remove anyone.

Task 13.3 (Prog-rocK)

Recode the gender variable with character values OR change into a factor (your choice!).

Hint

If you’re curious about factors, check out the Week 3 Level Up seminR.
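If you’d like to try the factor route, a minimal sketch is below. The numeric codes and the four labels are hypothetical placeholders; use the codes and labels given in your codebook instead.

```r
library(dplyr)

# Hypothetical toy data with gender stored as numeric codes
toy_data <- tibble(gender = c(1, 2, 2, 3, 4, 1))

# levels = the codes as they appear in the data;
# labels = the names to show instead (placeholders here)
toy_data <- toy_data %>%
  mutate(
    gender = factor(gender,
                    levels = c(1, 2, 3, 4),
                    labels = c("female", "male", "non-binary", "other"))
  )

levels(toy_data$gender)
table(toy_data$gender)
```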

Task 14 (punk!)

Get a summary of your variables again. Is everything in order?

Task 15 (punk!)

To finish up, save the final number of participants in a new object called n_final_px.
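As a sketch, the final count works just like the initial one, and it can be handy to check that the exclusions add up. All the numbers below are hypothetical, and the check assumes that the missing-age count already includes any value changed to NA earlier.

```r
# Hypothetical cleaned dataset with six participants left
toy_clean <- data.frame(id = 1:6)
n_final_px <- nrow(toy_clean)

# Hypothetical counts saved during cleaning
n_initial_px <- 10
n_withdraw   <- 2
n_age_bad    <- 1
n_age_na     <- 1

# Sanity check: initial minus all exclusions should equal final
exclusions_add_up <-
  n_initial_px - n_withdraw - n_age_bad - n_age_na == n_final_px
exclusions_add_up
```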

Well done!

 

Part 2: Reporting

Now we’ve got a clean and carefully checked dataset, we should write up a description of our participants. This is an essential part of scientific articles, because it’s important to be clear about who you tested and why you excluded anyone.

Task 16 (Prog-rocK)

Write a short paragraph reporting the following:

Hint

Remember that you’ve stored these numbers in handy objects!

You can call them in your console, or see them in your environment.

Generally, you do not need to report data wrangling (such as changing variable types or creating factors); this is an expected part of your data management, so you don’t need to explain that you’ve done it, or how. You do need to report material changes to your dataset, such as removing cases.
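To illustrate how the stored numbers can feed into a write-up, the sketch below pastes some hypothetical counts into a sentence. The wording and the numbers are made up for illustration and are not a model answer.

```r
# Hypothetical counts; yours will come from your own cleaning steps
n_initial_px <- 10
n_withdraw   <- 2
n_age_bad    <- 1
n_age_na     <- 1
n_final_px   <- 6

# Build a reporting sentence from the stored objects
report <- paste0(
  "Of the ", n_initial_px, " participants who began the study, ",
  n_withdraw, " withdrew, ",
  n_age_bad, " reported an age below 18 or above 100, and ",
  n_age_na, " had missing age data, leaving ",
  n_final_px, " participants in the final sample."
)

cat(report)
```

In an R Markdown document, the same objects can instead be dropped straight into your text using inline code, so the numbers update automatically if the data change.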

It’s typical to give some summary descriptives about your participants as well. Let’s do that and add it to our report.

Task 17 (punk!)

Calculate the mean and sd of age for your final participants, and save these values as mean_age and sd_age.
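A minimal sketch with a hypothetical toy dataset (toy_data); in your own code, use your cleaned dataset and your own age variable.

```r
library(dplyr)

# Hypothetical cleaned data with numeric ages
toy_data <- tibble(age = c(23, 19, 45, 30))

# pull() extracts the age column as a plain vector
mean_age <- toy_data %>% pull(age) %>% mean()
sd_age   <- toy_data %>% pull(age) %>% sd()

mean_age
sd_age
```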

Task 18 (jazz...)

Create a summary table of your participants that gives the following information:

Format your table with nice headings and values rounded to two decimal places.

Hint

You may need to use the kableExtra package…
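As a rough sketch of one such table, the code below summarises a hypothetical toy dataset (toy_data) by gender. The choice of columns (number of participants, mean age, and SD of age per gender) is an assumption for illustration, as are the headings and caption.

```r
library(dplyr)

# Hypothetical cleaned data
toy_data <- tibble(
  gender = factor(c("female", "male", "female",
                    "non-binary", "male", "non-binary")),
  age    = c(23, 19, 45, 30, 52, 28)
)

# One row per gender: count, mean age, and SD of age
age_summary <- toy_data %>%
  group_by(gender) %>%
  summarise(n = n(), mean_age = mean(age), sd_age = sd(age))

# digits = 2 rounds the values; col.names sets nicer headings
age_table <- knitr::kable(
  age_summary,
  digits    = 2,
  col.names = c("Gender", "N", "Mean age", "SD age"),
  caption   = "Participant age by gender (toy data)"
)
age_table

# Optionally, pipe the table into kableExtra::kable_styling()
# for extra formatting when knitting to HTML
```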

Task 19

Add the overall mean and SD of age and the table to your report. Then, knit your Markdown to see how it looks!

Well done!

 
