# Essentials 02: Pipes & Data Wrangling

---

# SeminRs

- Informal, optional weekly sessions to help build a 'portfolio of skills'

- 1-2 hours of instruction, demos, walk throughs & activities to try out

### *Essentials*

- Focused on the fundRmental skills to help you get started with learning R  
- Covering: basic wrangling & visualising data, '.bold.pink[pretty]' R Markdown, inline code, debugging...

### *Level Up*

- Focused on more advanced programming skills & applying these skills to new .bold.orange[fun] topics  
- Covering: Papaja, advanced wrangling & manipulation of data, '.bold.italic.pink[even prettier]' R Markdown, spotifyR...

<br>

*Session topics are not fixed - use the Padlet linked on Canvas for suggestions!*

---

# Setup & Suggested Workflow

- Create one R project file for all seminR sessions

- Within this directory, create two folders, r_docs & data, & save all Rmds to r_docs and all datasets to data

- Make a cheat sheet of useful functions and .orange[#] comment their meaning and usage as you go through seminRs, practicals, tutorials etc.

#### .orange[Reminder]:

File > New Project... New Directory > New Project > *Give your project a name & location*

File > New File > R Markdown...  *Remember to save this file in your r_docs folder!*

#### .orange[Tasks]: 
Open your seminRs project, create a new Rmd file, & read in the data

```r
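# Option 1: read the file from your local data folder
spotify_data <- readr::read_csv("../data/spotify_decades_data.csv")
# Option 2: read the same file directly from GitHub
spotify_data <- readr::read_csv("https://raw.githubusercontent.com/de84sussex/DS_spotify/main/spotify_decades_data.csv")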
```

---

# Session Objectives

### Using Pipes & Wrangling Data with dplyr

- Understand pipes .orange[%>%] & how to use them

- Use different .orange[dplyr] functions to perform basic data wrangling

To check your answers or follow along, [download the solutions Rmd](https://and.netlify.app/seminr/02/essentials/essentials_02.Rmd)

---

# The Pipe .orange[%>%]

Part of the magrittr package; it loads with library(tidyverse) or library(magrittr)

### Uses

- Use to chain multiple commands together  
- Pipe the values on the left into the functions/commands on the right 
- Avoids nested functions  
- Produces easy-to-read & efficient code 
- Easier to spot errors  
- Can add in as many commands/steps as you want  
- Keyboard shortcut: ctrl/cmd + shift + m

---

# The Pipe .orange[%>%]

### Nested Examples .orange[()]

f(x)

f(x, y)

h(g(f(x)))

```r
# Assumes the penguins data is already loaded (e.g. from palmerpenguins)
nested_smry <- dplyr::mutate(
  dplyr::summarise(
    dplyr::group_by(
      penguins, species), 
    m_mass = mean(body_mass_g, na.rm = TRUE), 
    sd = sd(body_mass_g, na.rm = TRUE), 
    n = dplyr::n()), 
  se = sd/sqrt(n))
```


### Piped Examples .orange[%>%]

x %>% f

x %>% f(y)

x %>% f %>% g %>% h

```r
piped_smry <- penguins %>%
  dplyr::group_by(.data = ., species) %>%
  dplyr::summarise(.data = .,
    m_mass = mean(body_mass_g, na.rm = TRUE),
    sd = sd(body_mass_g, na.rm = TRUE),
    n = dplyr::n()) %>%
  dplyr::mutate(se = sd/sqrt(n))
```
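
To check that both styles really give the same result, here is a minimal sketch. It uses the built-in mtcars data in place of penguins (an assumption, so that it runs without extra packages), and the object and column names are illustrative only.

```r
library(dplyr)

# Nested: read from the inside out
nested_check <- dplyr::summarise(
  dplyr::group_by(mtcars, cyl),
  m_mpg = mean(mpg),
  n = dplyr::n()
)

# Piped: read from top to bottom
piped_check <- mtcars %>%
  dplyr::group_by(.data = ., cyl) %>%
  dplyr::summarise(.data = ., m_mpg = mean(mpg), n = dplyr::n())

# Compare the two summary tables
same_result <- isTRUE(all.equal(nested_check, piped_check))
same_result
```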


---

# The Dreaded .Dot

The dot (.) is used as a placeholder for the piped input in function arguments

```r
summary <- some_data %>%
  dplyr::group_by(.data = ., a_category) %>% 
  ...
```

The first argument for group_by() is the data .orange[.data = some_data]

When using the pipe, we don't need to specify this argument because it takes the output of the step before it

It's good practice to specify the arguments of functions, so we use a . to stand in for the piped input

```r
piped_smry_1 <- penguins %>%
  dplyr::group_by(.data = ., species) %>%
  ...

# Or

piped_smry_2 <- penguins %>%
  dplyr::group_by(., species) %>%
  ...
```
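
The dot becomes essential when the piped data is not the first argument of a function. A small sketch, using base R's lm() and the built-in mtcars data as stand-ins (neither is part of the session materials):

```r
library(dplyr)

# lm() takes the formula first and the data second,
# so the dot tells the pipe where to place mtcars
model <- mtcars %>%
  lm(mpg ~ wt, data = .)

# The same model written without the pipe
model_check <- lm(mpg ~ wt, data = mtcars)

# Compare the fitted coefficients of the two models
same_coefs <- isTRUE(all.equal(coef(model), coef(model_check)))
same_coefs
```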

---

# Wrangling Data with dplyr

- Part of the tidyverse  
- Pronounced as 'd plier' (like pliers)
- Contains really useful functions for manipulating data
- Functions follow the same grammar where .data is the first input

We've already covered:

- .orange[select()] for selecting variables/columns from their names  
- .orange[filter()] for selecting rows based on their values  
- .orange[mutate()] for creating new variables based on existing variables
- .orange[summarise()] for creating a summary table of multiple values
- .orange[group_by()] for grouping the data by category so that later functions run per group

We're going to practise these & some additional dplyr functions:

- .orange[rename()] for renaming variables
- .orange[pull()] for 'pulling out' column values

---

# 1. select()

- .orange[select()] is for selecting specific columns
- Sub in the data, output name, & the columns you want to keep/remove

### Examples

```r
output_1 <- dplyr::select(.data = data, column_1) # to select column_1
output_2 <- dplyr::select(.data = data, -column_1, -column_2) # to remove column_1 & column_2
output_3 <- data %>% dplyr::select(.data = ., column_1) # to use with %>% 
```
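
The same patterns can be tried on a tiny data frame first. This is only a sketch: toy and its columns are hypothetical and are not part of the seminR dataset.

```r
library(dplyr)

# Hypothetical toy data for illustration
toy <- data.frame(
  id = 1:3,
  name = c("a", "b", "c"),
  score = c(10, 20, 30)
)

# Keep only the name column
keep_one <- toy %>% dplyr::select(.data = ., name)

# Remove id & score, keeping everything else
drop_two <- toy %>% dplyr::select(.data = ., -id, -score)

names(keep_one)
names(drop_two)
```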

### Task:

Create a new code chunk, & using the spotify_data you loaded in at the start, remove the song_id, decade_fct & is_local columns & assign the result back to the spotify_data object

.orange[Hint: Remember to load tidyverse/dplyr to use these functions & use names() to easily see the names of the columns!]

---

# 2. filter()

- .orange[filter()] is for selecting specific rows
- Sub in the data, output name, & the rows you want to keep based on some conditions

### Examples

```r
output_1 <- dplyr::filter(.data = data, column_1 == "some text") # keep rows where column_1 equals "some text"
output_2 <- dplyr::filter(.data = data, column_1 != "some text") # keep rows where column_1 does not equal "some text"
output_3 <- dplyr::filter(.data = data, column_2 < 50) # keep rows where value is smaller than 50
output_4 <- dplyr::filter(.data = data, column_3 < 2 & column_4 == FALSE) # keep rows that meet BOTH conditions 
output_5 <- data %>% dplyr::filter(.data = ., column_1 == "some text") # to use with %>% 
```
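
To see how a combined condition behaves, here is a minimal sketch on made-up data; toy, score & flagged are hypothetical names chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data for illustration
toy <- data.frame(
  name = c("a", "b", "c", "d"),
  score = c(10, 60, 75, 90),
  flagged = c(FALSE, TRUE, FALSE, FALSE)
)

# Keep rows that meet BOTH conditions:
# score above 50 AND not flagged
kept <- toy %>%
  dplyr::filter(.data = ., score > 50 & flagged == FALSE)

kept$name
```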

### Task:

Using the spotify_data, keep only the tracks where track.popularity is above 50 & track.explicit is FALSE & assign the result back to the spotify_data object

---

# 3. mutate()

- .orange[mutate()] is for creating new columns from existing ones
- Sub in the data, output name, the column you want to mutate, & the operation(s)

### Examples

```r
output_1 <- dplyr::mutate(.data = data, new_column = old_column*500) # multiply values by 500
output_2 <- dplyr::mutate(.data = data, new_column = old_column^4) # power of 4
output_3 <- data %>% dplyr::mutate(.data = ., new_column = old_column/2) # to use with %>% 
```
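
A unit conversion is a typical use of mutate(). The sketch below uses made-up data; toy, length_cm & length_m are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data for illustration
toy <- data.frame(
  name = c("a", "b"),
  length_cm = c(150, 200)
)

# Add a new column converting centimetres to metres;
# the existing columns are kept unchanged
toy <- toy %>%
  dplyr::mutate(.data = ., length_m = length_cm / 100)

names(toy)
```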

### Task:

Using the spotify_data, create a new column called 'duration_secs' by calculating the track duration in seconds from the track.duration_ms column & assign it back to the spotify_data object

Extra task: create a duration_mins column with the track duration converted to minutes

---

# 4. summarise()

- .orange[summarise()] is for creating summary tables of variables
- Sub in the data, output name, & the column names & operations

### Examples

```r
output <- data %>% 
  summarise(.data = ., 
    col_name_1 = mean(column), 
    col_name_2 = sd(column)
    )
```
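
As a rough illustration of what comes back, this sketch summarises a single made-up column; toy and the summary column names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data for illustration
toy <- data.frame(score = c(2, 4, 6, 8))

# Collapse the whole column into a one-row summary table
smry <- toy %>%
  dplyr::summarise(.data = .,
    m_score = mean(score),
    sd_score = sd(score),
    min_score = min(score),
    max_score = max(score)
  )

smry
```

The result has one row and one column per statistic named in summarise().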

### Task:

Edit the example code above to create a summary table from spotify_data containing the mean, sd, min & max values of track.popularity. Name the new summary table object pop_smry

---

# 5. group_by()

- .orange[group_by()] is for performing operations per group, useful with summarise()
- Sub in the data, output name, grouping variable, & the column names & operations

### Examples

```r
output <- data %>% 
  group_by(.data = ., categorical_column) %>% 
  summarise(.data = ., 
    col_name_1 = mean(column), 
    col_name_2 = sd(column)
    )
```
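
Grouping first means the summary has one row per group instead of one row overall. A minimal sketch with made-up data (toy, group & score are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data for illustration
toy <- data.frame(
  group = c("x", "x", "y", "y"),
  score = c(1, 3, 10, 20)
)

# One summary row per level of group
grp_smry <- toy %>%
  dplyr::group_by(.data = ., group) %>%
  dplyr::summarise(.data = .,
    m_score = mean(score),
    n = dplyr::n()
  )

grp_smry
```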

### Task:

Edit the code for the pop_smry object you created in the previous task so that the summary is grouped by playlist_name

---

# 6. rename()

- .orange[rename()] is for renaming columns
- Sub in the data, output name, & the columns you want to rename

### Examples

```r
output <- rename(.data = data, new_column_name = old_column_name)

output <- data %>% rename(.data = ., new_column_name = old_column_name)
```
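
Note the order: the new name goes on the left of the = and the old name on the right. A small sketch on made-up data (toy and its column names are hypothetical):

```r
library(dplyr)

# Hypothetical toy data for illustration
toy <- data.frame(
  old_name = 1:3,
  other = c("a", "b", "c")
)

# new name = old name; all other columns are left as they are
renamed <- toy %>%
  dplyr::rename(.data = ., new_name = old_name)

names(renamed)
```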

### Task:

Using spotify_data, rename the playlist_name variable to be called decade & assign it back to the spotify_data object

---

# 7. pull()

- .orange[pull()] is for 'pulling' out the values from a column as a vector, more useful in the future
- Sub in the data, output name, & column you want to 'pull' the values from

### Examples

```r
output <- pull(.data = data, column_1)

output <- data %>% pull(.data = ., column_1)
```
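
The difference from select() is the type of object you get back. A minimal sketch on made-up data (toy and its columns are hypothetical):

```r
library(dplyr)

# Hypothetical toy data for illustration
toy <- data.frame(
  name = c("a", "b", "c"),
  score = c(10, 20, 30)
)

# pull() returns the column's values as a plain vector
scores_vec <- toy %>% dplyr::pull(.data = ., score)

# select() returns a data frame with one column
scores_df <- toy %>% dplyr::select(.data = ., score)

is.data.frame(scores_vec)
is.data.frame(scores_df)
```

Because scores_vec is a vector, it can go straight into functions such as mean() that expect one.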

### Task:

Create a new object called songs, which consists of the track.name from spotify_data

---

# For extRa fun

Try putting all the commands we've used today into one long pipe!

1. Re-load your data with a new name, e.g.,  
data_2 <- readr::read_csv("../data/spotify_decades_data.csv")  
1. rename() playlist_name to be called decade   
1. select() all variables except song_id, decade_fct, is_local   
1. filter() so that track.popularity is greater than 50 & track.explicit is FALSE  
1. mutate() to create a new column called duration_secs from track.duration_ms divided by 1000  
1. group_by() to group the following summary table by decade (variable created in step 2)  
1. summarise() to create a table of the mean, sd, min, & max track.popularity  
1. pull() to pull out the mean popularity values (hint: make sure to give the column name you created in the step before)
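
If you get stuck, the overall shape of such a pipe can be sketched on different data. The example below uses the built-in mtcars dataset rather than the Spotify data, so the column choices, thresholds & new names are assumptions made purely for illustration.

```r
library(dplyr)

mean_mpg <- mtcars %>%
  dplyr::rename(.data = ., cylinders = cyl) %>%      # rename a column
  dplyr::select(.data = ., -qsec, -vs) %>%           # drop unwanted columns
  dplyr::filter(.data = ., hp > 100 & am == 1) %>%   # keep rows meeting both conditions
  dplyr::mutate(.data = ., wt_lbs = wt * 1000) %>%   # wt is recorded in 1000 lbs
  dplyr::group_by(.data = ., cylinders) %>%          # group by the renamed column
  dplyr::summarise(.data = .,
    m_mpg = mean(mpg),
    sd_mpg = sd(mpg),
    min_mpg = min(mpg),
    max_mpg = max(mpg)
  ) %>%
  dplyr::pull(.data = ., m_mpg)                      # pull out the group means

mean_mpg
```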

---