# Level Up 05: Cleaning DiRty Data

---

# Setup & Suggested Workflow

### .orange[Tasks]: 
Open/create your seminRs project & create a new R Markdown document for this week

Load the tidyverse & stats packages in the setup chunk e.g. .orange[library(stats)]

In a new chunk, read in the data:

```r
dirty_data <- readr::read_csv("https://and.netlify.app/seminr/05/level_up/data/dirty_data.csv")
```

#### .orange[Reminder]:

File > New Project... New Directory > New Project > *Give your project a name & location*

File > New File > R Markdown...  *Remember to save this file in your r_docs folder!*

---

# Session Objectives

### Cleaning DiRty Data

- Very rarely do we work with well-behaved, pretty data

- Most datasets have unhelpful variable names, missing data, impossible values, typos etc.

- Recap useful functions & techniques for cleaning data

To check your work against the solutions later, download the [R Markdown](https://and.netlify.app/seminr/05/level_up/levelup_05.Rmd)

---

# About the Dataset

### Research Q:

Do people differ in their empathy towards avatars when the avatars are human-like vs robotic?

### Measures:

- Four experimental design-related variables: Participant ID, Condition, Humanness (robot or human), & Harm (harmed or unharmed)

- Two participant demographic variables: Gender (1 = M, 2 = F) & Age (in years)

- Four outcome variables: Pain, Experience, Empathy, & Attractiveness, all on scales from 1 to 7

i.e. whether the avatar can feel pain, whether it has experience, the participant's level of empathy for the avatar, & the avatar's attractiveness

---

# Initial Inspection

### Whole Dataset:

summary() & names()

```r
summary(data)

names(data)
```

### Specific Variable Sub$etting:

table() is useful for checking categories & counts

```r
table(data$column)
```
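
As a rough illustration, the sketch below runs the three inspection functions on a small made-up data frame. The object `toy`, its columns, and its values are invented for this example and are not the seminar data.

```r
# Hypothetical toy data (not the seminar dataset)
toy <- data.frame(
  group = c("a", "b", "b", "B"),
  score = c(3, 7, 12, NA),
  stringsAsFactors = FALSE
)

summary(toy)       # per-column overview, including the range & number of NAs for score
names(toy)         # column names
table(toy$group)   # counts per category; the stray "B" shows up as its own category
```

Here `summary()` flags the out-of-range score and the missing value, while `table()` reveals the inconsistent category label.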

.orange[Task]: Run the 3 functions above on your overall dataset & your character variables. What are your initial thoughts?

Check each variable closely, remembering the possible scores, expected data types, missing data etc.

---

# Problems

1. Unhelpful variable names

1. Pain has some impossible values

1. Age is character data & has impossible values

1. Harm contains some typos

1. Some missing data

Remember to use .orange[#] to comment descriptions of what you're doing in a code chunk & any decisions you've made!

---

# Problem 1: Unhelpful Variable Names

- We have some super unhelpful variable names in our dataset (which is really common)  
- Very easy fix with dplyr::rename()

```r
# new_name = old_name (the pipe already passes data in, so .data = . isn't needed)
data <- data %>%
  dplyr::rename(new_name = old_name)
```
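
A minimal sketch of the same pattern on made-up data, assuming a hypothetical column called `Q001` that we would rather call `mood` (neither name comes from the seminar dataset):

```r
library(dplyr)

# Hypothetical toy data with an unhelpful column name
toy <- data.frame(id = 1:3, Q001 = c(4, 6, 2))

# new_name = old_name: Q001 becomes mood
toy <- toy %>%
  dplyr::rename(mood = Q001)

names(toy)
```

Several columns can be renamed in one call by separating the `new_name = old_name` pairs with commas.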

### .orange[Task]:

Rename variable Q14097 to be Empathy and variable Q15620 to be Attractiveness

---

# Problem 2: Impossible Pain Values

- All of our measures were on a 7-point scale, so any values above 7 signify some error  
- We can't guess what our participants meant so we have to replace these values with NA  
- To do that, we can use a combination of mutate() & replace()

```r
data <- data %>%
  dplyr::mutate(variable = replace(variable, variable > 10, NA))
```

This example takes a variable (called variable) and replaces any scores above 10 with NA
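
To see the pattern in action, here is a small sketch on invented data. It assumes a hypothetical `rating` column that was meant to be on a 1 to 5 scale, so the cut-off differs from the one you need for the task.

```r
library(dplyr)

# Hypothetical toy data: rating should only run from 1 to 5
toy <- data.frame(id = 1:4, rating = c(2, 9, 5, 11))

# Any rating above 5 is replaced with NA; everything else is left alone
toy <- toy %>%
  dplyr::mutate(rating = replace(rating, rating > 5, NA))

toy$rating
```

Only the out-of-range entries change; the rows themselves stay in the dataset.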

### .orange[Task]:

Replace the problematic pain values (i.e. those > 7) with NA

---

# Problem 3: Age Issues

- 3-step process here to fix our character data & impossible values      
- We need to recode any problematic values, then convert our data to be numeric, & then remove any impossible/unethical ages

```r
# Step 1: recoding problematic text entries in a variable that should be numeric
data <- data %>%
  dplyr::mutate(variable = dplyr::recode(variable, "50 years old" = "50"))

# Step 2: changing a variable from character data to numeric data
data <- data %>%
  dplyr::mutate(variable = as.numeric(variable))

# Step 3: removing impossible/unethical values
# (filter() drops the whole row, including rows where variable is NA)
data <- data %>%
  dplyr::filter(variable <= 100 & variable >= 18)
```
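
The three steps can also be chained together. The sketch below does this on made-up data, with a hypothetical `years` column and invented values, so the problematic entries in your Age variable will look different.

```r
library(dplyr)

# Hypothetical toy data: years was read in as character because of "thirty"
toy <- data.frame(
  id = 1:5,
  years = c("25", "thirty", "17", "240", "41"),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  dplyr::mutate(years = dplyr::recode(years, "thirty" = "30")) %>%  # Step 1: fix the text entry
  dplyr::mutate(years = as.numeric(years)) %>%                       # Step 2: character to numeric
  dplyr::filter(years >= 18 & years <= 100)                          # Step 3: keep plausible adult ages

toy
```

If Step 1 misses an entry that isn't a plain number, `as.numeric()` turns it into NA with a warning, and `filter()` then drops that row, so it is worth checking `table()` on the variable first.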

### .orange[Task]:

Follow this 3-step process with Age to recode any problematic values & remove any impossible/unethical data

---

# Problem 4: Harm Contains Typos

- By using table() earlier, we found that Harm contains some spelling errors  
- Easy to recode using mutate() & recode()

```r
data <- data %>%
  dplyr::mutate(variable = dplyr::recode(variable, "old_value" = "new_value"))
```
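
As a sketch of how this looks with more than one typo, the example below uses an invented `colour` column. The misspellings are made up, so check `table()` on Harm to find the real ones.

```r
library(dplyr)

# Hypothetical toy data with two misspelled categories
toy <- data.frame(
  colour = c("red", "rde", "blue", "bleu", "red"),
  stringsAsFactors = FALSE
)

table(toy$colour)   # before: four categories, two of them typos

# "old_value" = "new_value" pairs, separated by commas
toy <- toy %>%
  dplyr::mutate(colour = dplyr::recode(colour, "rde" = "red", "bleu" = "blue"))

table(toy$colour)   # after: only "blue" & "red" remain
```

Values that aren't named in `recode()` are left unchanged.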

### .orange[Task]:

Recode your Harm variable so that the typos are corrected

---

# Problem 5: Some Missing Data

- A super easy way of removing all cases with missing data is by using na.omit()

```r
clean_data <- stats::na.omit(data)
```
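
A quick sketch on invented data shows what this does: any row with at least one NA, in any column, is dropped. The object names `toy` and `toy_clean` are made up for this example.

```r
# Hypothetical toy data with missing values in two different columns
toy <- data.frame(
  id = 1:4,
  score = c(3, NA, 5, 1),
  group = c("a", "b", NA, "b"),
  stringsAsFactors = FALSE
)

toy_clean <- stats::na.omit(toy)

nrow(toy)         # number of rows before
nrow(toy_clean)   # number of rows after: only complete cases are kept
```

Because whole rows are removed, it is worth comparing `nrow()` before & after to see how many participants you lose.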

### .orange[Tasks]:

Remove all cases with missing data from your dataset, & save the result as clean_data

Run your initial checks again on your fresh new dataset to see if we've missed anything!

---