# Essentials 03: Data Wrangling cont.

---

# SeminRs

- Informal, optional weekly sessions to help build a 'portfolio of skills'

- 1-2 hours of instruction, demos, walk throughs & activities to try out

### *Essentials*

- Focused on the fundRmental skills to help you get started with learning R  
- Covering: basic wrangling & visualising data, '.bold.pink[pretty]' R Markdown, inline code, debugging...

### *Level Up*

- Focused on more advanced programming skills & applying these skills to new .bold.orange[fun] topics  
- Covering: Papaja, advanced wrangling & manipulation of data, '.bold.italic.pink[even prettier]' R Markdown, spotifyR...

<br>

*Session topics are not fixed - use the [Padlet](https://uofsussex.padlet.org/de84/seminrs) for suggestions!*

---

# Setup & Suggested Workflow

- Create one R project for all seminR sessions. Inside it, create an r_docs folder & a data folder, & save all your Rmds & datasets to these folders respectively

- Make a cheat sheet of useful functions and .orange[#] comment their meaning and usage as you go through seminRs, practicals, tutorials etc.

#### .orange[Tasks]: Open/create your seminRs project &

1. Open last week's Rmd, & run your 'data' chunk *or*

1. Create a new Rmd file & read in the data below

```r
spotify_data <- readr::read_csv("https://raw.githubusercontent.com/de84sussex/DS_spotify/main/spotify_decades_data.csv")
```

#### .orange[Reminder]:

File > New Project... > New Directory > New Project > *Give your project a name & location*

File > New File > R Markdown...  *Remember to save this file in your r_docs folder!*

---

# Session Objectives

### Wrangling Data with dplyr

- Use different .orange[dplyr] functions to perform basic data wrangling   
- Practise using pipes .orange[%>%] to chain our commands together

To check your answers or follow along, [download the solutions Rmd](https://and.netlify.app/seminr/03/essentials/essentials_03.Rmd)

---

# Pipes Recap .orange[%>%]

### Nested Example .orange[()]

```r
am_routine <- leave_house(get_dressed(get_ready(wake_up(person = me, time = "too_early"), 
    existential_crisis = TRUE, breakfast = FALSE), clothing = "pyjamas", 
    footwear = "fluffy_slippers"), university = FALSE, zoomiversity = TRUE)
```

### Piped Example .orange[%>%]

```r
am_routine <- me %>%
    wake_up(person = ., time = "too_early") %>% 
    get_ready(person = ., existential_crisis = TRUE, breakfast = FALSE) %>% 
    get_dressed(person = ., clothing = "pyjamas", footwear = "fluffy_slippers") %>% 
    leave_house(person = ., university = FALSE, zoomiversity = TRUE)
```

---

# Wrangling Data with dplyr

- Part of the tidyverse  
- Pronounced as 'd plier' (like pliers)
- Contains really useful functions for manipulating data
- Functions follow the same grammar where .data is the first input

We've already covered:

- .orange[select()] for selecting variables/columns from their names  
- .orange[filter()] for selecting rows based on their values  
- .orange[mutate()] for creating new variables based on existing variables
- .orange[summarise()] for creating a summary table of multiple values
- .orange[group_by()] for performing operations separately for each group/category

We're going to practise these & some additional dplyr functions:

- .orange[rename()] for renaming variables
- .orange[pull()] for 'pulling out' column values

---

# About the [Data..](https://developer.spotify.com/documentation/web-api/reference/#endpoint-get-audio-features)

| Column Name         | Description                                         | Column Name         | Description                                         |
| ------------------- |-----------------------------------------------------| ------------------- |-----------------------------------------------------|
| song_id             | Song identifier                                     | acousticness        | 0 to 1, 1 = track is acoustic                       | 
| playlist_name       | Playlist identifier                                 | liveness            | 0 to 1, 1 = performed live                          |
| decade_fct          | Decade of song release                              | tempo               | Tempo of a track in beats per minute (BPM)          |
| track_artists       | Name(s) of artists                                  | instrumentalness    | 0 to 1, 1 = no vocal content                        |    
| track.name          | Name of track                                       | valence             | 0 to 1, 0 = negative mood, 1 = positive mood        |
| danceability        | 0 to 1, 1 = most danceable                          | track.popularity    | 0 to 100, 0 = not popular, 100 = most popular       |
| energy              | 0 to 1, 1 = high intensity & activity               | track.duration_ms   | The duration of the track in milliseconds           |
| loudness            | -60 to 0 dB, closer to 0 is louder                  | track.explicit      | T/F if song contains explicit content or not        |
| speechiness         | 0 to 1, 1 = spoken word                             | is_local            | T/F if song is a local file or not                  |

---

# 1. select()

- .orange[select()] is for selecting specific columns
- Sub in the data, output name, & the columns you want to keep/remove

### Examples

```r
output_1 <- dplyr::select(.data = data, column_1) # to select column_1
output_2 <- dplyr::select(.data = data, -column_1, -column_2) # to remove column_1 & column_2
output_3 <- data %>% dplyr::select(.data = ., column_1) # to use with %>% 
```
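
To see the templates above in action, here is a small sketch that should run as-is. The `toy_tracks` data & its column names are made up for illustration; they are not the real Spotify columns.

```r
library(dplyr)

# Hypothetical toy data, not the real Spotify file
toy_tracks <- data.frame(
  track_id = c("a1", "b2", "c3"),
  title    = c("Song A", "Song B", "Song C"),
  tempo    = c(120, 95, 140),
  is_demo  = c(FALSE, FALSE, TRUE)
)

# Keep only the columns you name
kept <- toy_tracks %>% select(title, tempo)

# Remove columns by putting a minus sign in front of each name
dropped <- toy_tracks %>% select(-track_id, -is_demo)

# Both routes leave the same two columns
names(kept)     # "title" "tempo"
names(dropped)  # "title" "tempo"
```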

### Task:

Create a new code chunk &, using the spotify_data you loaded in at the start, remove the song_id, decade_fct & is_local columns, then assign the result back to the spotify_data object

.orange[Hint: Remember to load tidyverse/dplyr to use these functions & use names() to easily see the names of the columns!]

---

# 2. filter()

- .orange[filter()] is for selecting specific rows
- Sub in the data, output name, & the rows you want to keep based on some conditions

### Examples

```r
output_1 <- dplyr::filter(.data = data, column_1 == "some text") # keep rows equal to "some text"
output_2 <- dplyr::filter(.data = data, column_1 != "some text") # keep rows NOT equal to "some text"
output_3 <- dplyr::filter(.data = data, column_2 < 50) # keep rows where value is smaller than 50
output_4 <- dplyr::filter(.data = data, column_3 < 2 & column_4 == FALSE) # keep rows that meet BOTH conditions 
output_5 <- data %>% dplyr::filter(.data = ., column_1 == "some text") # to use with %>% 
```
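
A small sketch of the same pattern on made-up data. `toy_tracks` & its columns are hypothetical, so swap in your own data & conditions.

```r
library(dplyr)

# Hypothetical toy data, not the real Spotify file
toy_tracks <- data.frame(
  title   = c("Song A", "Song B", "Song C", "Song D"),
  tempo   = c(120, 95, 140, 160),
  is_live = c(FALSE, FALSE, TRUE, FALSE)
)

# Keep rows that meet BOTH conditions, joined with &
fast_studio <- toy_tracks %>%
  filter(tempo > 100 & is_live == FALSE)

# Separating conditions with a comma also means "and"
fast_studio_comma <- toy_tracks %>%
  filter(tempo > 100, is_live == FALSE)

fast_studio$title  # "Song A" "Song D"
```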

### Task:

Using the spotify_data, keep only the tracks that have a popularity above 50 & are not explicit (i.e., FALSE), then assign the result back to the spotify_data object

---

# 3. mutate()

- .orange[mutate()] is for creating new columns from existing ones
- Sub in the data, output name, the column you want to mutate, & the operation(s)

### Examples

```r
output_1 <- dplyr::mutate(.data = data, new_column = old_column*500) # multiply values by 500
output_2 <- dplyr::mutate(.data = data, new_column = old_column^4) # power of 4
output_3 <- data %>% dplyr::mutate(.data = ., new_column = old_column/2) # to use with %>% 
```
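
Here is a minimal sketch on made-up data. The `toy_tracks` object & the `length_mins` column are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, not the real Spotify file
toy_tracks <- data.frame(
  title       = c("Song A", "Song B"),
  length_mins = c(3.5, 4)
)

# Add a new column calculated from an existing one;
# the original column stays in the data
toy_tracks <- toy_tracks %>%
  mutate(length_secs = length_mins * 60)

toy_tracks$length_secs  # 210 240
```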

### Task:

Using the spotify_data, create a new column called 'duration_secs' by calculating the track duration in seconds from the track.duration_ms column & assign it back to the spotify_data object

---

# 4. summarise()

- .orange[summarise()] is for creating summary tables of variables
- Sub in the data, output name, & the column names & operations

### Examples

```r
output <- data %>% 
  summarise(.data = ., 
    col_name_1 = mean(column), 
    col_name_2 = sd(column)
    )
```
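
The same template filled in with made-up data, as a rough sketch. `toy_tracks`, `tempo` & the summary column names are all hypothetical.

```r
library(dplyr)

# Hypothetical toy data, not the real Spotify file
toy_tracks <- data.frame(tempo = c(100, 120, 140))

# Each argument becomes one column in the summary table
tempo_smry <- toy_tracks %>%
  summarise(
    mean_tempo = mean(tempo),
    sd_tempo   = sd(tempo),
    n_tracks   = n()  # n() counts the rows
  )

tempo_smry  # one row: mean_tempo = 120, sd_tempo = 20, n_tracks = 3
```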

### Task:

Edit the example code above to create a summary table of the spotify_data containing the mean, sd, min & max values of track.popularity. Name the new summary table object pop_smry

---

# 5. group_by()

- .orange[group_by()] is for performing operations per group, useful with summarise()
- Sub in the data, output name, grouping variable, & the column names & operations

### Examples

```r
output <- data %>% 
  group_by(.data = ., categorical_column) %>% 
  summarise(.data = ., 
    col_name_1 = mean(column), 
    col_name_2 = sd(column)
    )
```
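
A small sketch of grouping before summarising, using made-up data. `toy_tracks`, `genre` & `tempo` are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data, not the real Spotify file
toy_tracks <- data.frame(
  genre = c("pop", "pop", "rock", "rock"),
  tempo = c(100, 120, 130, 150)
)

# group_by() on its own changes nothing you can see;
# it tells summarise() to work out the values once per group
by_genre <- toy_tracks %>%
  group_by(genre) %>%
  summarise(
    mean_tempo = mean(tempo),
    max_tempo  = max(tempo)
  )

by_genre  # two rows: pop (110, 120) & rock (140, 150)
```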

### Task:

Edit the pop_smry object you created in the previous task to be grouped by playlist_name

---

# 6. rename()

- .orange[rename()] is for renaming columns
- Sub in the data, output name, & the columns you want to rename

### Examples

```r
output <- rename(.data = data, new_column_name = old_column_name)

output <- data %>% rename(.data = ., new_column_name = old_column_name)
```
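
A quick sketch on made-up data; note the order is new name = old name. `toy_tracks` & the `title` name are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, not the real Spotify file
toy_tracks <- data.frame(
  track.name = c("Song A", "Song B"),
  tempo      = c(100, 120)
)

# new name on the left, old name on the right
renamed <- toy_tracks %>%
  rename(title = track.name)

names(renamed)  # "title" "tempo"
```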

### Task:

Using spotify_data, rename the playlist_name variable to be called decade & assign it back to the spotify_data object

---

# 7. pull()

- .orange[pull()] is for 'pulling' the values out of a column as a vector; this will be more useful in later sessions
- Sub in the data, output name, & column you want to 'pull' the values from

### Examples

```r
output <- pull(.data = data, column_1)

output <- data %>% pull(.data = ., column_1)
```
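
To show how pull() differs from select(), here is a minimal sketch on made-up data. `toy_tracks` & its columns are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, not the real Spotify file
toy_tracks <- data.frame(
  title = c("Song A", "Song B"),
  tempo = c(100, 120)
)

# select() gives back a data frame with one column
tempo_col <- toy_tracks %>% select(tempo)

# pull() gives back the values themselves as a vector
tempo_vec <- toy_tracks %>% pull(tempo)

is.data.frame(tempo_col)  # TRUE
is.numeric(tempo_vec)     # TRUE
mean(tempo_vec)           # 110, works directly on a vector
```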

### Task:

Create a new object called songs, which consists of the track.name from spotify_data

---

# For extRa fun

Try putting all the commands we've used today into one long pipe!

First load in your data with a new name, e.g.,

```r
data_2 <- readr::read_csv("https://raw.githubusercontent.com/de84sussex/DS_spotify/main/spotify_decades_data.csv")
```

1. rename() playlist_name to be called decade   
1. select() all variables except song_id, decade_fct, is_local   
1. filter() so that track.popularity is greater than 50 & track.explicit is FALSE  
1. mutate() to create a new column called duration_secs from track.duration_ms divided by 1000  
1. group_by() to group the following summary table by decade (variable created in step 1)  
1. summarise() to create a table of the mean, sd, min, & max track.popularity  
1. pull() to pull out the mean popularity values (hint: make sure to give the column name you created in the step before)
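
If you get stuck, here is a rough sketch of what a long pipe with the same seven steps can look like. It uses made-up data (`toy_tracks` & all of its column names are hypothetical), so you will still need to adapt it to the Spotify columns.

```r
library(dplyr)

# Hypothetical toy data, not the real Spotify file
toy_tracks <- data.frame(
  track_id    = c("a1", "b2", "c3", "d4", "e5"),
  genre_label = c("pop", "pop", "rock", "rock", "rock"),
  tempo       = c(100, 120, 130, 150, 90),
  length_ms   = c(180000, 200000, 240000, 210000, 195000),
  is_demo     = c(FALSE, FALSE, FALSE, FALSE, TRUE)
)

mean_tempos <- toy_tracks %>%
  rename(genre = genre_label) %>%               # 1. rename a column
  select(-track_id) %>%                         # 2. drop a column
  filter(tempo > 95 & is_demo == FALSE) %>%     # 3. keep rows meeting both conditions
  mutate(length_secs = length_ms / 1000) %>%    # 4. make a new column
  group_by(genre) %>%                           # 5. group by the renamed column
  summarise(                                    # 6. one summary row per group
    mean_tempo = mean(tempo),
    sd_tempo   = sd(tempo)
  ) %>%
  pull(mean_tempo)                              # 7. pull out one summary column

mean_tempos  # 110 140
```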

<br>

---