class: center, middle, inverse, title-slide

# Level Up 05: Cleaning DiRty Data

---
class: inverse

# Setup & Suggested Workflow

### .orange[Tasks]:

Open/create your seminRs project & create a new R Markdown document for this week

Load the tidyverse & stats packages in the setup chunk, e.g. .orange[library(stats)]

In a new chunk, read in the data:

```r
dirty_data <- readr::read_csv("https://and.netlify.app/seminr/05/level_up/data/dirty_data.csv")
```

#### .orange[Reminder]:

File > New Project... New Directory > New Project > *Give your project a name & location*

File > New File > R Markdown... *Remember to save this file in your r_docs folder!*

---
class: inverse

# Session Objectives

### Cleaning DiRty Data

- Very rarely do we work with well-behaved, pretty data

- Most datasets have unhelpful variable names, missing data, impossible values, typos etc.

- Recap useful functions & techniques for cleaning data

<br>
<br>
<br>
<br>
<br>
<br>
<br>

To check your work against the solutions later, download the [R Markdown](https://and.netlify.app/seminr/05/level_up/levelup_05.Rmd)

---
class: inverse

# About the Dataset

### Research Q: Do people differ in their empathy towards avatars when they are human-like vs robotic?

### Measures:

- Four experimental design variables: Participant ID, Condition, Humanness (robot or human), & Harm (harmed or unharmed)

- Two participant demographic variables: Gender (1 = M, 2 = F) & Age (in years)

- Four outcome variables: Pain, Experience, Empathy, & Attractiveness, all on scales from 1 - 7, i.e.
can the avatar feel pain, does it have experience, the participant's level of empathy for the avatar, & avatar attractiveness

---
class: inverse

# Initial Inspection

### Whole Dataset: summary() & names()

```r
summary(data)
names(data)
```

### Specific Variable Sub$etting: table() is useful for checking categories & counts

```r
table(data$column)
```

.orange[Task]: Run the 3 functions above on your overall dataset & your character variables. What are your initial thoughts? Check each variable closely, remembering the possible scores, expected data types, missing data etc.

---
class: inverse

# Problems

1. Unhelpful variable names
1. Pain has some impossible values
1. Age is character data & has impossible values
1. Harm contains some typos
1. Some missing data

<br>
<br>
<br>
<br>
<br>
<br>
<br>

Remember to use .orange[#] to comment descriptions of what you're doing in a code chunk & any decisions you've made!

---
class: inverse

# Problem 1: Unhelpful Variable Names

- We have some super unhelpful variable names in our dataset (which is really common)

- Very easy fix with dplyr::rename()

```r
# The new name goes on the left, the old name on the right
data <- data %>%
  dplyr::rename(new_name = old_name)
```

### .orange[Task]: Rename variable Q14097 to be Empathy and variable Q15620 to be Attractiveness

---
class: inverse

# Problem 2: Impossible Pain Values

- All of our measures were on a 7-point scale, so any values above 7 signify some error

- We can't guess what our participants meant, so we have to replace these values with NA

- To do that, we can use a combination of mutate() & replace()

```r
data <- data %>%
  dplyr::mutate(variable = replace(variable, variable > 10, NA))
```

This example takes a variable (called variable) and replaces any scores above 10 with NA

### .orange[Task]: Replace the problematic pain values (i.e. those > 7) with NA

---
class: inverse

# Problem 3: Age Issues

- 3-step process here to fix our character data & impossible values

- We need to recode any problematic values, then convert our data to be numeric, & then remove any impossible/unethical ages

```r
# Step 1: recode problematic character entries in a variable that should be numeric
data <- data %>%
  dplyr::mutate(variable = dplyr::recode(variable, "50 years old" = "50"))

# Step 2: convert the variable from character data to numeric data
data <- data %>%
  dplyr::mutate(variable = as.numeric(variable))

# Step 3: remove impossible/unethical values (rows where the variable is NA are dropped too)
data <- data %>%
  dplyr::filter(variable <= 100 & variable >= 18)
```

### .orange[Task]: Follow this 3-step process with Age to recode any problematic values & remove any impossible/unethical data

---
class: inverse

# Problem 4: Harm Contains Typos

- By using table() earlier, we found that Harm contains some spelling errors

- Easy to recode using mutate() & recode()

```r
data <- data %>%
  dplyr::mutate(variable = dplyr::recode(variable, "old_value" = "new_value"))
```

### .orange[Task]: Recode your Harm variable so that the typos are corrected

---
class: inverse

# Problem 5: Some Missing Data

- A super easy way of removing all cases with missing data is by using na.omit()

```r
clean_data <- stats::na.omit(data)
```

### .orange[Tasks]:

Remove all cases with missing data from your dataset & save the result as clean_data

Run your initial checks again on our fresh new dataset to see if we've missed anything!
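
As a rough illustration of how the five fixes fit together, here is a minimal sketch on a small made-up dataset. The object `toy_data`, its column names (including `Q001`), and all of its values are assumptions for illustration only. They are not taken from the real dirty_data file, so your own variable names, typos, & problem values will differ.

```r
library(dplyr)

# Hypothetical toy data (NOT the real dirty_data): names & values are made up
toy_data <- tibble::tibble(
  id   = 1:6,
  Q001 = c(3, 5, 12, 7, 2, NA),   # unhelpful name; 12 is off the 1-7 scale
  age  = c("25", "30 years old", "19", "17", "40", "22"),
  harm = c("harmed", "hamred", "unharmed", "harmed", "unharmed", "harmed")
)

toy_clean <- toy_data %>%
  # Problem 1: new name on the left, old name on the right
  dplyr::rename(pain = Q001) %>%
  # Problem 2: scores above the top of the 7-point scale become NA
  dplyr::mutate(pain = replace(pain, pain > 7, NA)) %>%
  # Problem 3: recode the text entry, convert to numeric, keep ages 18-100
  dplyr::mutate(age = dplyr::recode(age, "30 years old" = "30"),
                age = as.numeric(age)) %>%
  dplyr::filter(age >= 18 & age <= 100) %>%
  # Problem 4: correct the spelling error
  dplyr::mutate(harm = dplyr::recode(harm, "hamred" = "harmed")) %>%
  # Problem 5: drop any remaining rows that contain missing values
  stats::na.omit()

# Re-run the initial checks on the cleaned toy data
summary(toy_clean)
table(toy_clean$harm)
```

Both filter() & na.omit() remove whole rows here, so comparing nrow(toy_data) with nrow(toy_clean) is a quick way to see how many cases the cleaning decisions cost.
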

---
class: center, middle

<div class="padlet-embed" style="border:1px solid rgba(0,0,0,0.1);border-radius:2px;box-sizing:border-box;overflow:hidden;position:relative;width:100%;background:#F4F4F4"><p style="padding:0;margin:0"><iframe src="https://uofsussex.padlet.org/embed/nrud4gk8x63gbfdc" frameborder="0" allow="camera;microphone;geolocation" style="width:100%;height:608px;display:block;padding:0;margin:0"></iframe></p><div style="padding:8px;text-align:right;margin:0;"><a href="https://padlet.com?ref=embed" style="padding:0;margin:0;border:none;display:block;line-height:1;height:16px" target="_blank"><img src="https://padlet.net/embeds/made_with_padlet.png" width="86" height="16" style="padding:0;margin:0;background:none;border:none;display:inline;box-shadow:none" alt="Made with Padlet"></a></div></div>