Practical 04
Today we will work on a collaborative data cleaning and write-up task. You should work with your team to inspect, clean, summarise, and write up a report on participant demographics.
You will NOT be required to do any data cleaning for your lab report. However, inspecting and cleaning data is a critical part of working with data. You should think of this practical as an unmarked warm-up for the lab report, revision of the skills we’ve covered in the last few weeks, and practice for your final year project down the line.
Today’s writeup will be a collaborative effort. Before you jump in, decide on roles with your team.
Decide who within your team will do the following roles. You should have only one scribe, but you can have more than one of the other roles.
Keep in mind that if someone in your team is usually the scribe, you should switch roles so that everyone gets practice working in RStudio.
At this point, the Scribe should share their screen, and your team should work through the following tasks together.
Download the adata package, which contains the data, by running the following code IN THE CONSOLE.
build = FALSE, upgrade = "never")
Just like every week, we want to work in a new R project. If you haven’t done it yet, create a week_04 R project inside of your module folder and, within it, create the standard folder structure, just like last week.
Download this R Markdown file and save/move it into your r_docs folder.
Use this R Markdown file to complete the following tasks.
In the setup code chunk of the .Rmd file, write the code to load the tidyverse package, which you will need to complete this practical.
Complete the read-in code chunk by storing your candidate number in the candidate_number object, then running the entire chunk.
For the solutions, I’ll use a particular number to generate the data. You can get the same data by using the same number once the solutions are published. However, when you use a different number (such as your candidate number!), your data and variable names will be different than the ones in the solutions.
This is a good thing; it means that you can practice translating slightly different code/data to your own by analogy. However, you should be aware that this means you will not have exact solutions for this practical.
## Seed for solutions
candidate_number <- 99999
data <- adata::demo_data(seed = candidate_number)
You should see a new object, data, appear in your environment.
Have a look at the data object and the variables it contains.
Compare the variables in data with the codebook, below. Which variables have problems? What might those problems be?
The variable names might not look exactly the same as these - they may have CAPS or full stops instead of underscores. We did this on purpose! You should still easily be able to tell which variables are which.
Before we do anything else, let’s first record how many participants we started with. This will help us later on to report our exclusions.
Save the number of participants in a new object called n_initial_px
n_initial_px <- nrow(data)
Next, let’s look at each variable in turn. Remember that there are three main steps for each variable:
We are going to start with the variable containing requests to withdraw - which is where you should always start from as well. This is about respecting your participants’ ethical rights; you should always remove people who withdrew first, before you do any other analysis.
Identify if any participants chose to withdraw.
If no one withdrew, the withdraw variable should contain only NAs. Does it?
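One way to check this is sketched below on a small made-up data frame. The object toy_data and its values are hypothetical, not the practical's data, so swap in your own dataset and variable name.

```r
# Hypothetical toy data: withdraw is NA unless the participant asked to withdraw
toy_data <- data.frame(
  id_code = c("AAAA", "BBBB", "CCCC", "DDDD"),
  withdraw = c(NA, "I wish to withdraw", NA, NA)
)

# TRUE only if every value of withdraw is NA (i.e. no one withdrew)
only_nas <- all(is.na(toy_data$withdraw))
only_nas

# Tabulate the values; useNA = "ifany" makes table() count the NAs as well
table(toy_data$withdraw, useNA = "ifany")
```

If only_nas is FALSE, at least one participant has something other than NA in the withdraw variable.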
So, it seems some people did ask to withdraw. In that case, we must remove them before we continue.
Identify which cases (rows) should be excluded.
We know that is.na() will tell us which values are NAs (it returns TRUE for NA values). However, we want the rows for which the value of withdraw is not NA. In other words, we want to negate the output of is.na().
If you’re not sure how to do this, have one of your Fixers Google “negation operator in R”!
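As a rough illustration of negation, here is a sketch on made-up data (toy_data and its contents are hypothetical). It assumes the dplyr package is installed, and uses ! to flip the TRUE/FALSE values that is.na() returns.

```r
library(dplyr)

# Hypothetical toy data: two of the four participants asked to withdraw
toy_data <- tibble(
  id_code = c("AAAA", "BBBB", "CCCC", "DDDD"),
  withdraw = c(NA, "I wish to withdraw", NA, "withdraw please")
)

# is.na() is TRUE for missing values; ! flips TRUE and FALSE
is.na(toy_data$withdraw)
!is.na(toy_data$withdraw)

# Keep only the rows where withdraw is NOT NA, i.e. people who asked to withdraw
withdrawn <- toy_data %>%
  dplyr::filter(!is.na(withdraw))
withdrawn
```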
Save the number of people who will be excluded for this reason in a new object called n_withdraw
Remove them and save this change to the dataset.
Remember that this filter command should be the reverse of the one we were just using!
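The two steps (count, then remove) might look something like the sketch below. It runs on hypothetical toy data rather than the practical's dataset, and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: two of the four participants asked to withdraw
toy_data <- tibble(
  id_code = c("AAAA", "BBBB", "CCCC", "DDDD"),
  withdraw = c(NA, "I wish to withdraw", NA, "withdraw please")
)

# Count the people who asked to withdraw (withdraw is NOT NA)
n_withdraw <- toy_data %>%
  dplyr::filter(!is.na(withdraw)) %>%
  nrow()

# Keep only the people who did not withdraw (withdraw IS NA),
# and save the change back to the same object
toy_data <- toy_data %>%
  dplyr::filter(is.na(withdraw))
```

Note that the count has to be saved before the rows are removed; afterwards there is nothing left to count.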
Next, let’s have a look at the variable containing participants’ ages. As we’ve seen before, this variable should be numeric, but something has gone wrong. Let’s have a look.
Complete the following steps to change the age variable to numeric.
Identify what has caused this variable to be read in as character instead of numeric.
Make a table of the age variable to see if you can spot any issues.
As a simple check, let’s make a table of all of the values of age:
data$age_years %>% table()
14 15 16 17 18 19 1w 20 21 22 23 24 26  4  6
 2  2  7 12 12 19  1 25 21 22  8  7  2  1  1
We can spot that there’s one value that isn’t only a number, but instead has a letter in it. This doesn’t tell us which case that is, but it does tell us what the problem is!
Identify which cases should be excluded.
Here we need to work out where the non-numeric characters are in our data. If you don’t remember how to do this, last week’s live-coding practical explained it!
To get the row that contains that value, we’ll use a two-step process that we demonstrated in the live-coding last week. First, we create a new variable forcing the age variable into numeric, knowing that it will create NAs where there are non-numeric characters. Then, we can filter based on both our new age variable and the original one; the row with a character typo is the mismatch between them.
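data %>%
  dplyr::mutate(age_num = as.numeric(age_years)) %>%
  # keep rows where the conversion produced NA but the original value was not NA
  dplyr::filter(is.na(age_num) & !is.na(age_years))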
# A tibble: 1 x 5
id_code gender age_years withdraw age_num
<fct> <int> <chr> <chr> <dbl>
1 BFPK 2 1w <NA> NA
Save the number of people who will be excluded for this reason in a new object called n_age_typo
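One possible way to get this count is to reuse the same mismatch filter and pipe the result into nrow(). The sketch below uses hypothetical toy data and assumes dplyr is installed; suppressWarnings() is only there to hide the expected "NAs introduced by coercion" warning.

```r
library(dplyr)

# Hypothetical toy data: one typo ("1w") and one genuinely missing age
toy_data <- tibble(
  id_code = c("AAAA", "BBBB", "CCCC", "DDDD"),
  age_years = c("19", "1w", NA, "22")
)

n_age_typo <- toy_data %>%
  # non-numeric text becomes NA when forced into numeric
  dplyr::mutate(age_num = suppressWarnings(as.numeric(age_years))) %>%
  # a typo is NA in the new variable but NOT NA in the original
  dplyr::filter(is.na(age_num) & !is.na(age_years)) %>%
  nrow()
n_age_typo
```

The genuinely missing age is not counted here, because it is NA in both variables.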
Change this mistyped value of age to NA. One way to do this is with na_if() from dplyr. All you have to do is tell it the variable and which value(s) you want to replace. Easy!
Make sure you save this change back to your dataset.
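A minimal sketch of replacing one specific value with NA, assuming dplyr's na_if() is the function you want and using hypothetical toy data (your mistyped value may well be different from "1w"):

```r
library(dplyr)

# Hypothetical toy data with one mistyped age
toy_data <- tibble(
  id_code = c("AAAA", "BBBB", "CCCC"),
  age_years = c("19", "1w", "22")
)

# na_if(x, y) returns x with every value equal to y replaced by NA
toy_data <- toy_data %>%
  dplyr::mutate(age_years = dplyr::na_if(age_years, "1w"))
toy_data
```

The variable is still character at this point, so the value to replace is written in quotes.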
Change the age variable from character to numeric.
data <- data %>%
dplyr::mutate(age_years = as.numeric(age_years))
Well done! We can now finally have a look at the actual values in the age variable, now that they’re numbers. We should remove both missing values of age as well as unethical or impossible values.
Complete the following tasks to finish cleaning up the age variable.
Identify which cases have unethical or impossible ages - younger than 18 or older than 100.
data %>%
dplyr::filter(age_years < 18 | age_years > 100)
# A tibble: 25 x 4
id_code gender age_years withdraw
<fct> <int> <dbl> <chr>
1 RTSI 1 17 <NA>
2 VFMU 1 17 <NA>
3 EJPM 1 16 <NA>
4 GTPG 2 4 <NA>
5 GIBD 1 16 <NA>
6 GAEU 1 17 <NA>
7 OHBT 1 16 <NA>
8 DLMA 2 16 <NA>
9 NCAN 2 14 <NA>
10 EJPC 1 6 <NA>
# ... with 15 more rows
Save the number of people who will be excluded because of unethical or impossible ages in a new object called n_age_bad
Before we carry on, save the number of people who will be excluded because of missing data in the age variable in a new object called n_age_na
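Both counts can be obtained with the same filter-then-nrow() pattern. The sketch below uses hypothetical toy data and assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data: three bad ages (17, 4, 150) and two missing ages
toy_data <- tibble(
  id_code = c("AAAA", "BBBB", "CCCC", "DDDD", "EEEE", "FFFF", "GGGG"),
  age_years = c(17, 21, NA, 4, 19, NA, 150)
)

# Ages younger than 18 or older than 100; filter() drops rows where the
# condition is NA, so missing ages are not counted here
n_age_bad <- toy_data %>%
  dplyr::filter(age_years < 18 | age_years > 100) %>%
  nrow()

# Missing ages are counted separately with is.na()
n_age_na <- toy_data %>%
  dplyr::filter(is.na(age_years)) %>%
  nrow()
```

Because filter() only keeps rows where the condition is TRUE, a later filter on the ethical age range will also drop the rows with missing ages. That is why the number of NAs needs saving first.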
Remove everyone with missing or unethical values of age and save this change to the dataset.
data <- data %>%
dplyr::filter(age_years >= 18 & age_years <= 100)
Whew, well done!
Finally, we need to check the gender variable. According to our codebook, this should be a factor with four levels. If you’re not sure how to change a variable to a factor, just recoding as a character variable is fine.
Complete the following tasks to clean up the gender variable.
Get a table of the values of gender to check whether we do in fact have four categories.
data$gender %>% table()
 1  2  3  4
67 37  9  3
Check whether there are any NAs in the gender variable.
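A quick sketch of one way to check, using a hypothetical vector in place of the real gender variable:

```r
# Hypothetical stand-in for the gender variable
toy_gender <- c(1, 2, 2, 3, 4, 1)

# Count the missing values; 0 means there are no NAs
n_gender_na <- sum(is.na(toy_gender))
n_gender_na

# anyNA() gives a single TRUE/FALSE answer to the same question
anyNA(toy_gender)
```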
Great - we can simply recode our data, without having to remove anyone.
Recode the gender variable with character values OR change into a factor (your choice!).
If you’re curious about factors, check out the Week 3 Level Up seminR.
data <- data %>%
dplyr::mutate(gender = recode(gender,
`1` = "Female", `2` = "Male",
`3` = "Other", `4` = "Prefer Not To Say"))
If you create a factor instead, note that labels are applied in numeric order (so the first label goes to 1, the second to 2, etc.)
The practical upshot of either method is the same for our purposes.
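For the factor route, a minimal sketch with base R's factor() might look like this. The vector toy_gender is hypothetical, and the labels follow the same coding as the recode example.

```r
# Hypothetical stand-in for the gender variable
toy_gender <- c(1, 2, 2, 3, 4, 1)

# levels gives the values in the data; labels gives their names, in the same order
gender_fct <- factor(
  toy_gender,
  levels = 1:4,
  labels = c("Female", "Male", "Other", "Prefer Not To Say")
)
gender_fct
```

Spelling out levels explicitly means the labels line up with the codes even if one of the categories happens not to appear in the data.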
Get a summary of your variables again. Is everything in order?
id_code gender age_years
ABUM : 1 Female :67 Min. :18.00
AXOL : 1 Male :37 1st Qu.:19.00
BDDR : 1 Other : 9 Median :21.00
BGGO : 1 Prefer Not To Say: 3 Mean :20.74
BNXE : 1 3rd Qu.:22.00
BOXU : 1 Max. :26.00
Class :character
Mode :character
Now we can see that the age variable is numeric, with a reasonable and ethical range of ages; that gender is now character (or factor, if you used that method); and we were careful to remove anyone who wanted their data withdrawn.
To finish up, save the final number of participants in a new object called n_final_px
n_final_px <- nrow(data)
Notice that this is exactly the same command we ran way back at the beginning to count the initial number of participants. We get a different answer because we’ve changed what data contains in between.
Well done!
Now we’ve got a clean and carefully checked dataset, we should write up a description of our participants. This is an essential part of scientific articles, because it’s important to be clear about who you tested and why you excluded anyone.
Write a short paragraph reporting the following:
Remember that you’ve stored these numbers in handy objects!
n_initial_px: original number of participants
n_withdraw: number who withdrew
n_age_typo: number with typos in age
n_age_na: number with NAs in age
n_age_bad: number with unethical or improbable age values
n_final_px: final number of participants
You can call them in your console, or see them in your environment.
Generally, you do not need to report data wrangling (such as changing variable types or creating factors); this is an expected part of your data management, so you don’t need to explain that you’ve done it, or how. You do need to report material changes to your dataset, such as removing cases.
You should write up your description in your own words, but here’s an example:
“Initially, 150 participants took part in the study. Before any other analysis, 6 participants requested to withdraw and were removed. A further 3 participants neglected to give their ages, including 1 participant who had a typo in their age, and 25 were below 18 years old; all 28 of these participants were removed for ethical reasons. In total, 116 participants were included in the analysis.”
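Before you write your paragraph, it can be worth checking that your exclusions add up. The sketch below types in the numbers from the example paragraph by hand; in your own document you would use the objects you have already saved. It assumes, as in the example, that the count of missing ages already includes the 1 typo.

```r
# Numbers taken from the example paragraph; yours will differ
n_initial_px <- 150
n_withdraw <- 6
n_age_na <- 3    # missing ages, assumed to include the 1 typo
n_age_bad <- 25
n_final_px <- 116

# Total removed because of age
n_excluded_age <- n_age_na + n_age_bad
n_excluded_age

# Stops with an error if the exclusions do not account for the difference
stopifnot(n_initial_px - n_withdraw - n_excluded_age == n_final_px)
```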
It’s typical to give some summary descriptives about your participants as well. Let’s do that and add it to our report.
Calculate the mean and sd of age for your final participants, and save these values as mean_age and sd_age.
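A small sketch of this step, using a hypothetical vector of ages in place of the cleaned age variable:

```r
# Hypothetical stand-in for the cleaned age variable
toy_ages <- c(18, 19, 21, 22, 20)

# Save the mean and standard deviation so they can be reported later
mean_age <- mean(toy_ages)
sd_age <- sd(toy_ages)

# Round only when reporting, not when saving
round(mean_age, 2)
round(sd_age, 2)
```

With your own data, replace toy_ages with the age variable from your dataset. There should be no NAs left in it by this point.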
Create a summary table of your participants that gives the following information:
Format your table with nice headings and values rounded to two decimal places.
You may need to use the kableExtra package.
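data %>%
  dplyr::group_by(gender) %>% # split up by gender
  dplyr::summarise( # one row of descriptives per gender
    n = dplyr::n(), # count number in each group
    mean = mean(age_years),
    sd = sd(age_years),
    min = min(age_years),
    max = max(age_years)
  ) %>%
  knitr::kable(col.names = c("Gender", "*N*", "*M*~age~", "*SD*~age~", "Min~age~", "Max~age~"),
               digits = 2) %>%
  kableExtra::kable_styling() # assumed styling step; other kableExtra formatting could go here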
Add the overall mean and SD of age and the table to your report. Then, knit your Markdown to see how it looks!
In addition to the paragraph above, you may add something like this:
“In the final sample, the mean age was 20.74 (SD = 1.79). A summary of participant descriptives is given in the table below.”
Then, put your table underneath!
Well done!