
Level Up 05: Cleaning DiRty Data

1 / 12

Setup & Suggested Workflow

Tasks:

Open/create your seminRs project & create a new R Markdown document for this week

Load the tidyverse & stats packages in the setup chunk e.g. library(tidyverse) & library(stats)

In a new chunk, read in the data:

dirty_data <- readr::read_csv("https://and.netlify.app/seminr/05/level_up/data/dirty_data.csv")

Reminder:

File > New Project... > New Directory > New Project > Give your project a name & location

File > New File > R Markdown... Remember to save this file in your r_docs folder!

2 / 12

Session Objectives

Cleaning DiRty Data

  • Very rarely do we work with well-behaved, pretty data

  • Most datasets have unhelpful variable names, missing data, impossible values, typos etc.

  • Recap useful functions & techniques for cleaning data








To check your work against the solutions later, download the R Markdown

3 / 12

About the Dataset

Research Q:

Does people's empathy towards avatars differ when the avatars are human-like vs robotic?

Measures:

  • Four experimental design related variables: Participant ID, Condition, Humanness (robot or human), & Harm (harmed or unharmed)

  • Two participant demographic variables: Gender (1 = M, 2 = F) & Age (in years)

  • Four outcome variables: Pain, Experience, Empathy, & Attractiveness, all on scales from 1 - 7

i.e. can the avatar feel pain, does it have experience, the participant's level of empathy for the avatar, & the avatar's attractiveness

4 / 12

Initial Inspection

Whole Dataset:

summary() & names()

summary(data)
names(data)

Specific Variable Sub$etting:

table() is useful for checking categories & counts

table(data$column)
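As a minimal sketch of these checks, assume a small made-up data frame called toy_data (its column names and values are hypothetical, not taken from the seminar dataset):

```r
# Hypothetical toy data; not the seminar dataset
toy_data <- data.frame(
  id = 1:5,
  harm = c("harmed", "unharmed", "harmd", "harmed", NA),
  pain = c(3, 7, 12, 1, 5),
  stringsAsFactors = FALSE
)

# Whole-dataset checks: ranges, NA counts & variable names
summary(toy_data)
names(toy_data)

# Category counts for one variable
# useNA = "ifany" adds a count of missing values, which table() omits by default
harm_counts <- table(toy_data$harm, useNA = "ifany")
harm_counts
```

Here the counts would show the misspelt category "harmd" sitting alongside "harmed", and summary() would show a maximum for pain that is outside the expected range.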

Task: Run the 3 functions above on your overall dataset & your character variables. What are your initial thoughts?

Check each variable closely, remembering the possible scores, expected data types, missing data etc.

5 / 12

Problems

  1. Unhelpful variable names

  2. Pain has some impossible values

  3. Age is character data & has impossible values

  4. Harm contains some typos

  5. Some missing data








Remember to use # to comment descriptions of what you're doing in a code chunk & any decisions you've made!

6 / 12

Problem 1: Unhelpful Variable Names

  • We have some super unhelpful variable names in our dataset (which is really common)
  • Very easy fix with dplyr::rename()
data <- data %>%
  dplyr::rename(new_name = old_name)
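A short sketch of the renaming pattern on made-up data; toy_data, Q001 and mood are hypothetical names chosen for illustration only:

```r
library(dplyr)  # provides rename() & the %>% pipe

# Hypothetical toy data with an unhelpful survey-style column name
toy_data <- data.frame(id = 1:3, Q001 = c(4, 6, 2))

# The new name goes on the left of =, the old name on the right
toy_data <- toy_data %>%
  dplyr::rename(mood = Q001)

names(toy_data)
```

Several variables can be renamed in one call by separating the new_name = old_name pairs with commas.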

Task:

Rename variable Q14097 to be Empathy and variable Q15620 to be Attractiveness

7 / 12

Problem 2: Impossible Pain Values

  • All of our measures were on a 7-point scale, so any values above 7 signify some error
  • We can't guess what our participants meant so we have to replace these values with NA
  • To do that, we can use a combination of mutate() & replace()
data <- data %>%
  dplyr::mutate(variable = replace(variable, variable > 10, NA))

This example takes a variable (called variable) and replaces any scores above 10 with NA
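To see the pattern run end to end, here is a sketch on made-up data; toy_data and score are hypothetical, and the cut-off of 10 simply mirrors the example above:

```r
library(dplyr)

# Hypothetical scores on a 1-10 scale, including two impossible values
toy_data <- data.frame(id = 1:5, score = c(3, 11, 7, 99, 10))

# Overwrite score with a copy in which values above 10 become NA
toy_data <- toy_data %>%
  dplyr::mutate(score = replace(score, score > 10, NA))

toy_data$score
```

The rows themselves are kept; only the out-of-range scores are set to missing, so the participants' other responses remain available.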

Task:

Replace the problematic pain values (i.e. those > 7) with NA

8 / 12

Problem 3: Age Issues

  • 3-step process here to fix our character data & impossible values
  • We need to recode any problematic values, then convert our data to be numeric, & then remove any impossible/unethical ages
# Step 1: recoding problematic text entries in a variable that should be numeric
data <- data %>%
  dplyr::mutate(variable = dplyr::recode(variable, "50 years old" = "50"))
# Step 2: changing a variable from character data to numeric data
data <- data %>%
  dplyr::mutate(variable = as.numeric(variable))
# Step 3: removing cases with impossible/unethical values
data <- data %>%
  dplyr::filter(variable <= 100 & variable >= 18)
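The three steps can also be chained in a single pipeline. The sketch below assumes a made-up data frame (toy_data and its age values are hypothetical) and uses the same 18-100 limits as the example above:

```r
library(dplyr)

# Hypothetical ages stored as text, as might come from a free-text box
toy_data <- data.frame(
  id = 1:4,
  age = c("25", "50 years old", "17", "130"),
  stringsAsFactors = FALSE
)

toy_data <- toy_data %>%
  # Step 1: replace the text entry with a plain number (still character data)
  dplyr::mutate(age = dplyr::recode(age, "50 years old" = "50")) %>%
  # Step 2: convert from character to numeric
  dplyr::mutate(age = as.numeric(age)) %>%
  # Step 3: keep only rows with ages from 18 to 100
  dplyr::filter(age >= 18 & age <= 100)

toy_data$age
```

Note that filter() drops whole rows, and it also drops rows where the condition evaluates to NA, so any age that could not be converted in Step 2 would be removed here too.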

Task:

Follow this 3-step process with Age to recode any problematic values & remove any impossible/unethical data

9 / 12

Problem 4: Harm Contains Typos

  • By using table() earlier, we found that Harm contains some spelling errors
  • Easy to recode using mutate() & recode()
data <- data %>%
  dplyr::mutate(variable = dplyr::recode(variable, "old_value" = "new_value"))
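A sketch of the same pattern with more than one typo, using made-up data; toy_data, group and its category labels are hypothetical:

```r
library(dplyr)

# Hypothetical grouping variable with a misspelling & inconsistent capitalisation
toy_data <- data.frame(
  id = 1:4,
  group = c("control", "contrl", "treatment", "Treatment"),
  stringsAsFactors = FALSE
)

# Each "old_value" = "new_value" pair fixes one typo; unmatched values are left as they are
toy_data <- toy_data %>%
  dplyr::mutate(group = dplyr::recode(group,
                                      "contrl" = "control",
                                      "Treatment" = "treatment"))

group_counts <- table(toy_data$group)
group_counts
```

Re-running table() afterwards is a quick way to confirm that only the intended categories remain.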

Task:

Recode your Harm variable so that the typos are corrected

10 / 12

Problem 5: Some Missing Data

  • A super easy way of removing all cases with missing data is by using na.omit()
clean_data <- stats::na.omit(data)
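Because na.omit() removes every row containing at least one NA, it can help to count how many cases will be lost first. A sketch on made-up data (toy_data and toy_clean are hypothetical names):

```r
# Hypothetical toy data with missing values in two different columns
toy_data <- data.frame(
  id = 1:4,
  score = c(3, NA, 7, 5),
  group = c("a", "b", NA, "a"),
  stringsAsFactors = FALSE
)

# Number of rows with at least one missing value (these will be dropped)
n_incomplete <- sum(!stats::complete.cases(toy_data))
n_incomplete

# Keep only the complete cases
toy_clean <- stats::na.omit(toy_data)
nrow(toy_clean)
```

In this toy example two of the four rows are incomplete, so half of the cases are removed.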

Tasks:

Remove all cases with missing data from your dataset & save the result as clean_data

Run your initial checks again on our new fresh dataset to see if we've missed anything!

11 / 12
