class: center, middle, inverse, title-slide

# Essentials 03: Data Wrangling cont.

---
class: inverse

# SeminRs

- Informal, optional weekly sessions to help build a 'portfolio of skills'
- 1-2 hours of instruction, demos, walk throughs & activities to try out

### *Essentials*

- Focused on the fundRmental skills to help you get started with learning R
- Covering: basic wrangling & visualising data, '.bold.pink[pretty]' R Markdown, inline code, debugging...

### *Level Up*

- Focused on more advanced programming skills & applying these skills to new .bold.orange[fun] topics
- Covering: Papaja, advanced wrangling & manipulation of data, '.bold.italic.pink[even prettier]' R Markdown, spotifyR...

<br>

*Session topics are not fixed - use the [Padlet](https://uofsussex.padlet.org/de84/seminrs) for suggestions!*

---
class: inverse

# Setup & Suggested Workflow

- Create one R project file for all seminR sessions, create an r_docs & data folder & save all Rmds and datasets to these folders respectively
- Make a cheat sheet of useful functions and .orange[#] comment their meaning and usage as you go through seminRs, practicals, tutorials etc.

#### .orange[Tasks]:

Open/create your seminRs project &

1. Open last week's Rmd, & run your 'data' chunk *or*
1. Create a new Rmd file & read in the data below

```r
spotify_data <- readr::read_csv("https://raw.githubusercontent.com/de84sussex/DS_spotify/main/spotify_decades_data.csv")
```

#### .orange[Reminder]:

File > New Project... New Directory > New Project > *Give your project a name & location*

File > New File > R Markdown...
*Remember to save this file in your r_docs folder!*

---
class: inverse

# Session Objectives

### Wrangling Data with dplyr

- Use different .orange[dplyr] functions to perform basic data wrangling
- Practise using pipes .orange[%>%] to chain our commands together

<br>
<br>
<br>

To check your answers or follow along, [download the solutions Rmd](https://and.netlify.app/seminr/03/essentials/essentials_03.Rmd)

---
class: inverse

# Pipes Recap .orange[%>%]

### Nested Example .orange[()]

```r
am_routine <- leave_house(
  get_dressed(
    get_ready(
      wake_up(person = me, time = "too_early"),
      existential_crisis = TRUE, breakfast = FALSE
    ),
    clothing = "pyjamas", footwear = "fluffy_slippers"
  ),
  university = FALSE, zoomiversity = TRUE
)
```

### Piped Example .orange[%>%]

```r
am_routine <- me %>%
  wake_up(person = ., time = "too_early") %>%
  get_ready(person = ., existential_crisis = TRUE, breakfast = FALSE) %>%
  get_dressed(person = ., clothing = "pyjamas", footwear = "fluffy_slippers") %>%
  leave_house(person = ., university = FALSE, zoomiversity = TRUE)
```

---
class: inverse

# Wrangling Data with dplyr

- Part of the tidyverse
- Pronounced as 'd plier' (like pliers)
- Contains really useful functions for manipulating data
- Functions follow the same grammar where .data is the first input

We've already covered:

- .orange[select()] for selecting variables/columns by their names
- .orange[filter()] for selecting rows based on their values
- .orange[mutate()] for creating new variables based on existing variables
- .orange[summarise()] for creating a summary table of multiple values
- .orange[group_by()] for performing operations per group

We're going to practise these & some additional dplyr functions:

- .orange[rename()] for renaming variables
- .orange[pull()] for 'pulling out' column values

---

# About the [Data..](https://developer.spotify.com/documentation/web-api/reference/#endpoint-get-audio-features)

.b[

| Column Name         | Description                                         | Column Name         | Description                                         |
| ------------------- |-----------------------------------------------------| ------------------- |-----------------------------------------------------|
| song_id             | Song identifier                                     | acousticness        | 0 to 1, 1 = track is acoustic                       |
| playlist_name       | Playlist identifier                                 | liveness            | 0 to 1, 1 = performed live                          |
| decade_fct          | Decade of song release                              | tempo               | Tempo of a track in beats per minute (BPM)          |
| track_artists       | Name(s) of artists                                  | instrumentalness    | 0 to 1, 1 = no vocal content                        |
| track.name          | Name of track                                       | valence             | 0 to 1, 0 = negative mood, 1 = positive mood        |
| danceability        | 0 to 1, 1 = most danceable                          | track.popularity    | 0 to 100, 0 = not popular, 100 = most popular       |
| energy              | 0 to 1, 1 = high intensity & activity               | track.duration_ms   | The duration of the track in milliseconds           |
| loudness            | -60 to 0 dB, closer to 0 is louder                  | track.explicit      | T/F if song contains explicit content or not        |
| speechiness         | 0 to 1, 1 = spoken word                             | is_local            | T/F if song is a local file or not                  |

]

---
class: inverse

# 1. select()

- .orange[select()] is for selecting specific columns
- Sub in the data, output name, & the columns you want to keep/remove

### Examples

```r
output_1 <- dplyr::select(.data = data, column_1)              # to select column_1
output_2 <- dplyr::select(.data = data, -column_1, -column_2)  # to remove column_1 & column_2
output_3 <- data %>% dplyr::select(.data = ., column_1)        # to use with %>%
```

### Task:

Create a new code chunk, & using the spotify_data you loaded in at the start, remove the song_id, decade_fct & is_local columns & assign the result back to the spotify_data object

.orange[Hint: Remember to load tidyverse/dplyr to use these functions & use names() to easily see the names of the columns!]

---
class: inverse

# 2. filter()

- .orange[filter()] is for selecting specific rows
- Sub in the data, output name, & the rows you want to keep based on some conditions

### Examples

```r
output_1 <- dplyr::filter(.data = data, column_1 == "some text")           # keep rows equal to "some text"
output_2 <- dplyr::filter(.data = data, column_1 != "some text")           # keep rows not equal to "some text"
output_3 <- dplyr::filter(.data = data, column_2 < 50)                     # keep rows where value is smaller than 50
output_4 <- dplyr::filter(.data = data, column_3 < 2 & column_4 == FALSE)  # keep rows that meet BOTH conditions
output_5 <- data %>% dplyr::filter(.data = ., column_1 == "some text")     # to use with %>%
```

### Task:

Using the spotify_data, keep only the tracks where popularity is above 50 & the track is not explicit (i.e., FALSE) & assign the result back to the spotify_data object

---
class: inverse

# 3. mutate()

- .orange[mutate()] is for creating new columns from existing ones
- Sub in the data, output name, the column you want to mutate, & the operation(s)

### Examples

```r
output_1 <- dplyr::mutate(.data = data, new_column = old_column * 500)       # multiply values by 500
output_2 <- dplyr::mutate(.data = data, new_column = old_column^4)           # raise values to the power of 4
output_3 <- data %>% dplyr::mutate(.data = ., new_column = old_column / 2)   # to use with %>%
```

### Task:

Using the spotify_data, create a new column called 'duration_secs' by calculating the track duration in seconds from the track.duration_ms column & assign the result back to the spotify_data object

.orange[ExtRa task]: create a duration_mins column with the track duration converted to minutes

---
class: inverse

# 4. summarise()

- .orange[summarise()] is for creating summary tables of variables
- Sub in the data, output name, & the column names & operations

### Examples

```r
output <- data %>%
  summarise(.data = .,
            col_name_1 = mean(column),
            col_name_2 = sd(column)
  )
```

### Task:

Edit the example code above to create a summary table of the mean, sd, min & max values of track.popularity in spotify_data & name the new summary table object pop_smry

---
class: inverse

# 5. group_by()

- .orange[group_by()] is for performing operations per group, useful with summarise()
- Sub in the data, output name, grouping variable, & the column names & operations

### Examples

```r
output <- data %>%
  group_by(.data = ., categorical_column) %>%
  summarise(.data = .,
            col_name_1 = mean(column),
            col_name_2 = sd(column)
  )
```

### Task:

Edit the code for the pop_smry object you created in the previous task so that the summary is grouped by playlist_name

---
class: inverse

# 6. rename()

- .orange[rename()] is for renaming columns
- Sub in the data, output name, & the columns you want to rename

### Examples

```r
output <- rename(.data = data, new_column_name = old_column_name)
output <- data %>% rename(.data = ., new_column_name = old_column_name)  # to use with %>%
```

### Task:

Using spotify_data, rename the playlist_name variable to be called decade & assign the result back to the spotify_data object

---
class: inverse

# 7. pull()

- .orange[pull()] is for 'pulling' out values/data from a column, more useful in the future
- Sub in the data, output name, & column you want to 'pull' the values from

### Examples

```r
output <- pull(.data = data, column_1)
output <- data %>% pull(.data = ., column_1)  # to use with %>%
```

### Task:

Create a new object called songs, which consists of the track.name values from spotify_data

---
class: inverse

# For extRa fun

Try putting all the commands we've used today into one long pipe!
First load in your data with a new name i.e.,

```r
data_2 <- readr::read_csv("https://raw.githubusercontent.com/de84sussex/DS_spotify/main/spotify_decades_data.csv")
```

1. rename() playlist_name to be called decade
1. select() all variables except song_id, decade_fct, is_local
1. filter() so that track.popularity is greater than 50 & track.explicit is FALSE
1. mutate() to create a new column called duration_secs from track.duration_ms divided by 1000
1. group_by() to group the following summary table by decade (variable created in step 1)
1. summarise() to create a table of the mean, sd, min, & max track.popularity
1. pull() to pull out the mean popularity values (hint: make sure to give the column name you created in the step before)

<br>

.orange[*All these commands don't make much sense to have in one pipe, but just for pRactice* :)]

---
class: center, middle

<div class="padlet-embed" style="border:1px solid rgba(0,0,0,0.1);border-radius:2px;box-sizing:border-box;overflow:hidden;position:relative;width:100%;background:#F4F4F4"><p style="padding:0;margin:0"><iframe src="https://uofsussex.padlet.org/embed/nrud4gk8x63gbfdc" frameborder="0" allow="camera;microphone;geolocation" style="width:100%;height:608px;display:block;padding:0;margin:0"></iframe></p><div style="padding:8px;text-align:right;margin:0;"><a href="https://padlet.com?ref=embed" style="padding:0;margin:0;border:none;display:block;line-height:1;height:16px" target="_blank"><img src="https://padlet.net/embeds/made_with_padlet.png" width="86" height="16" style="padding:0;margin:0;background:none;border:none;display:inline;box-shadow:none" alt="Made with Padlet"></a></div></div>
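
---
class: inverse

# Worked Sketch: One Long Pipe

If it helps to see the seven dplyr functions from this session chained together before trying them on the Spotify data, here is a minimal sketch on a tiny made-up tibble. The object names (toy_songs, toy_smry, toy_means), the column names and all the values are hypothetical and chosen only for illustration; they are not taken from spotify_data.

```r
library(dplyr)

# Made-up toy data: hypothetical columns & values, NOT the real spotify_data
toy_songs <- tibble::tibble(
  playlist    = c("80s", "80s", "90s", "90s", "90s"),
  title       = c("Song A", "Song B", "Song C", "Song D", "Song E"),
  popularity  = c(72, 40, 65, 88, 55),
  explicit    = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  duration_ms = c(210000, 185000, 240000, 200000, 230000)
)

toy_smry <- toy_songs %>%
  rename(decade = playlist) %>%                   # new_name = old_name
  select(-title) %>%                              # drop a column we don't need
  filter(popularity > 50 & explicit == FALSE) %>% # keep rows meeting BOTH conditions
  mutate(duration_secs = duration_ms / 1000) %>%  # milliseconds to seconds
  group_by(decade) %>%                            # one summary row per decade
  summarise(
    mean_pop = mean(popularity),
    n_songs  = n()
  )

# pull() returns the column as a plain vector rather than a table
toy_means <- toy_smry %>% pull(mean_pop)
toy_means
```

Each step passes its result to the next one through %>%, so no intermediate objects are needed until the summary table is stored in toy_smry.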