class: center, middle, inverse, title-slide

# Essentials 01: Importing & Exploring Data

---
class: inverse

# SeminRs

- Informal, optional weekly sessions to help build a 'portfolio of skills'
- 1-2 hours of instruction, demos, walkthroughs & activities to try out

### *Essentials*

- Focused on the fundRmental skills to help you get started with learning R
- Covering: basic wrangling & visualising data, '.bold.pink[pretty]' R Markdown, inline code, debugging...

### *Level Up*

- Focused on more advanced programming skills & applying these skills to new .bold.orange[fun] topics
- Covering: Papaja, advanced wrangling & manipulation of data, '.bold.italic.pink[even prettier]' R Markdown, spotifyR...

<br>

*Session topics are not fixed - use the Padlet linked on Canvas for suggestions!*

---
class: inverse

# Suggested Workflow

- Create an R project file for all seminR sessions
- Within this directory, create an r_docs & data folder
- Save all Rmds and datasets to these folders respectively

<br>

File > New Project... > New Directory > New Project > *Give your project a name & location*

<br>

- Open a new Rmd file for today's session
- Make a cheat sheet of useful functions and .orange[#] comment their meaning and usage as you go through seminRs, practicals, tutorials etc.

---
class: inverse

# Session Objectives

### Importing & Exploring Data

- Understand directories
- Read in datasets
- Use different functions to explore datasets
- Understand different data types

---
class: inverse

# Directories & Paths Recap

Reading in data requires knowledge of your files & folders

- Directories = Folders
- Paths = Directions

Two types of paths:

- Absolute: C:/Users/danie/Documents/seminRs_21/images/image.png
- Relative: ./images/image.png

<br>

--

### .orange[Top Tip: get & *stay* organised!]
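---
class: inverse

The two path types above can be tried out in R itself. This is a minimal sketch using base R functions only; the folder and file names are the ones used in this session, and whether the file actually exists on your machine is an assumption, so no output is shown.

```r
# Where is R currently looking? This is the working directory,
# the starting point for every relative path
getwd()

# Build a relative path from folder and file names
csv_path <- file.path(".", "data", "spotify_decades_data.csv")
csv_path

# Check whether a file really exists at that path (TRUE or FALSE)
file.exists(csv_path)

# Turn the relative path into an absolute one
# mustWork = FALSE stops R from throwing an error if the file is missing
normalizePath(csv_path, mustWork = FALSE)
```

If file.exists() returns FALSE, compare the output of getwd() with where the file was actually saved.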
---
class: inverse

# Importing Datasets

- A dataset is a collection of data: columns represent variables & rows represent cases/people/entities
- Lots of different data formats exist - common ones are: .csv, .sav, .Rdata...
- There are different functions, packages, & methods we can use to load them in R
- We're going to focus on *readr*, others include *haven*, *foreign*, *readxl*

---
class: inverse

# readr

Part of the tidyverse, can be loaded using .orange[library(tidyverse)] or .orange[library(readr)]

```r
# Read the file and print it straight to the console
readr::read_csv("./data/spotify_decades_data.csv")

# Read the file and store it as an object called data
data <- readr::read_csv("./data/spotify_decades_data.csv")
```

- First argument is the path to the file
- Once it's run, it prints out the column specification, which gives the column names and data types

### .orange[Task]

1. Download the dataset (make sure it saves as a .csv file)
1. Check it's saved to your data folder/move it there manually
1. In a new chunk in your Rmd file, use read_csv to open the file & name the object *data*

```r
https://raw.githubusercontent.com/de84sussex/DS_spotify/main/spotify_decades_data.csv
```

---
class: inverse

# Data Types

.pull-left[
<img src="./images/data_types.png" width="75%" />
]

.pull-right[

- ### Double (dbl)
- ### Integer (int)
- ### Character (chr)
- ### Logical (lgl)
- ### Factor (fct)

The things we can do with our variables depend on their data type - get to know your data!
]

---
class: inverse

# Getting to know your data...
Below are some useful functions to explore your data. Run these only in the console, or make sure to delete them from your Rmd file after you've seen the output.

All you need to do is put the name of your object (aka your dataset) into the brackets of the function.

.pull-left[
To see your raw data:

- objectname + ctrl/command & enter
- print()
- View()
- head()
- tail()
]

.pull-right[
To see characteristics of your data:

- str() & dplyr::glimpse()
- summary()
- class()
- names()
- ncol()
- nrow()
]

---
class: inverse

# .orange[Task]

.pull-left[
.bold[Using a mix of these functions:]

- View()
- head()
- tail()
- ncol()
- nrow()
- summary()
- class()
- names()
- str()

<br>

<img src="./images/QR_spotify.png" width="20%" />
]

.pull-right[
.bold[Find out & write on the Padlet:]

1. 3 artists
1. A song title where the song_id matches your day of birth (e.g. 11th March = 11)
1. How many columns/variables
1. How many rows
1. How many logical variables
1. How many factor variables
1. Name the first 6 artists in the dataset (don't use View() for this one)
1. Name the last 6 artists in the dataset (don't use View() for this one)
1. Name 5 columns
1. Find out the mean for the variable 'danceability'
1. Find out the data type for 'track.popularity'
]

---
class: center, middle

<div class="padlet-embed" style="border:1px solid rgba(0,0,0,0.1);border-radius:2px;box-sizing:border-box;overflow:hidden;position:relative;width:100%;background:#F4F4F4"><p style="padding:0;margin:0"><iframe src="https://uofsussex.padlet.org/embed/ngyekjt3k16qvv7e" frameborder="0" allow="camera;microphone;geolocation" style="width:100%;height:608px;display:block;padding:0;margin:0"></iframe></p><div style="padding:8px;text-align:right;margin:0;"><a href="https://padlet.com?ref=embed" style="padding:0;margin:0;border:none;display:block;line-height:1;height:16px" target="_blank"><img src="https://padlet.net/embeds/made_with_padlet.png" width="86" height="16" style="padding:0;margin:0;background:none;border:none;display:inline;box-shadow:none" alt="Made with Padlet"></a></div></div>

---
class: inverse, center, middle

# .orange[*My* Solutions...]

---
class: inverse

### 1. 3 artists

### 2. A song title where the song_id matches your day of birth

```r
View(data)
data
print(data)
```

<p align="center"> <img src="./images/view.png" width="60%" /> </p>

---
class: inverse

### 3. How many columns/variables

```r
ncol(data)
```

```
## [1] 18
```

### 4. How many rows

```r
nrow(data)
```

```
## [1] 60
```

---
class: inverse

### 5. How many logical variables & 6. How many factor variables

```r
str(data)
```

```
## tibble [60 x 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ song_id          : num [1:60] 1 2 3 4 5 6 7 8 9 10 ...
##  $ playlist_name    : chr [1:60] "decades_70" "decades_70" "decades_70" "decades_70" ...
##  $ decade_fct       : num [1:60] 1 1 1 1 1 1 1 1 1 1 ...
##  $ track_artists    : chr [1:60] "Elton John" "Stevie Wonder" "Eric Clapton" "Chaka Khan" ...
##  $ track.name       : chr [1:60] "Rocket Man (I Think It's Going To Be A Long, Long Time)" "Signed, Sealed, Delivered (I'm Yours)" "Wonderful Tonight" "I'm Every Woman" ...
##  $ danceability     : num [1:60] 0.601 0.67 0.572 0.617 0.838 0.808 0.579 0.631 0.482 0.7 ...
##  $ energy           : num [1:60] 0.532 0.619 0.214 0.879 0.806 0.535 0.508 0.59 0.835 0.816 ...
##  $ loudness         : num [1:60] -9.12 -10.37 -15.62 -7.56 -9.74 ...
##  $ speechiness      : num [1:60] 0.0286 0.0323 0.0293 0.0455 0.0408 0.0353 0.027 0.0297 0.0539 0.044 ...
##  $ acousticness     : num [1:60] 0.432 0.0514 0.649 0.127 0.213 0.179 0.00574 0.00367 0.0191 0.00115 ...
##  $ liveness         : num [1:60] 0.0925 0.0492 0.125 0.339 0.354 0.158 0.0575 0.0537 0.162 0.0901 ...
##  $ tempo            : num [1:60] 136.6 108.8 95.5 114.5 123.1 ...
##  $ instrumentalness : num [1:60] 6.25e-06 0.00 1.29e-01 5.71e-05 2.03e-03 9.91e-05 4.94e-04 2.99e-03 1.43e-02 1.23e-03 ...
##  $ valence          : num [1:60] 0.341 0.807 0.485 0.746 0.846 0.848 0.609 0.927 0.776 0.838 ...
##  $ track.popularity : num [1:60] 81 1 76 58 0 72 62 66 66 63 ...
##  $ track.duration_ms: num [1:60] 281613 160500 225026 247413 361093 ...
##  $ track.explicit   : logi [1:60] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ is_local         : logi [1:60] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   song_id = col_double(),
##   ..   playlist_name = col_character(),
##   ..   decade_fct = col_double(),
##   ..   track_artists = col_character(),
##   ..   track.name = col_character(),
##   ..   danceability = col_double(),
##   ..   energy = col_double(),
##   ..   loudness = col_double(),
##   ..   speechiness = col_double(),
##   ..   acousticness = col_double(),
##   ..   liveness = col_double(),
##   ..   tempo = col_double(),
##   ..   instrumentalness = col_double(),
##   ..   valence = col_double(),
##   ..   track.popularity = col_double(),
##   ..   track.duration_ms = col_double(),
##   ..   track.explicit = col_logical(),
##   ..   is_local = col_logical()
##   .. )
```

---
class: inverse

### 7. Name the first 6 artists in the dataset

```r
head(data)
```

```
## # A tibble: 6 x 18
##   song_id playlist_name decade_fct track_artists track.name danceability energy
##     <dbl> <chr>              <dbl> <chr>         <chr>             <dbl>  <dbl>
## 1       1 decades_70             1 Elton John    Rocket Ma~        0.601  0.532
## 2       2 decades_70             1 Stevie Wonder Signed, S~        0.67   0.619
## 3       3 decades_70             1 Eric Clapton  Wonderful~        0.572  0.214
## 4       4 decades_70             1 Chaka Khan    I'm Every~        0.617  0.879
## 5       5 decades_70             1 Marvin Gaye   Got To Gi~        0.838  0.806
## 6       6 decades_70             1 Michael Jack~ Rock with~        0.808  0.535
## # ... with 11 more variables: loudness <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, liveness <dbl>, tempo <dbl>, instrumentalness <dbl>,
## #   valence <dbl>, track.popularity <dbl>, track.duration_ms <dbl>,
## #   track.explicit <lgl>, is_local <lgl>
```

---
class: inverse

### 8. Name the last 6 artists in the dataset

```r
tail(data, 10) # tail() shows the last 6 rows by default; the 10 changes this to the last 10
```

```
## # A tibble: 10 x 18
##    song_id playlist_name decade_fct track_artists track.name danceability energy
##      <dbl> <chr>              <dbl> <chr>         <chr>             <dbl>  <dbl>
##  1      51 decades_00             4 Usher, Lil J~ Yeah!             0.895  0.795
##  2      52 decades_00             4 Rihanna       Pon de Re~        0.779  0.64
##  3      53 decades_00             4 Toni Braxton  He Wasn't~        0.739  0.947
##  4      54 decades_00             4 Ja Rule, Ash~ Always On~        0.839  0.706
##  5      55 decades_00             4 Black Eyed P~ I Gotta F~        0.743  0.766
##  6      56 decades_00             4 Kanye West    Through T~        0.571  0.739
##  7      57 decades_00             4 Kings of Leon Sex on Fi~        0.544  0.903
##  8      58 decades_00             4 Jennifer Lop~ I'm Real ~        0.708  0.587
##  9      59 decades_00             4 Britney Spea~ Oops!...I~        0.751  0.834
## 10      60 decades_00             4 Beyonce, Sea~ Baby Boy ~        0.655  0.488
## # ... with 11 more variables: loudness <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, liveness <dbl>, tempo <dbl>, instrumentalness <dbl>,
## #   valence <dbl>, track.popularity <dbl>, track.duration_ms <dbl>,
## #   track.explicit <lgl>, is_local <lgl>
```

---
class: inverse

### 9. Name 5 columns

```r
names(data)
```

```
##  [1] "song_id"           "playlist_name"     "decade_fct"
##  [4] "track_artists"     "track.name"        "danceability"
##  [7] "energy"            "loudness"          "speechiness"
## [10] "acousticness"      "liveness"          "tempo"
## [13] "instrumentalness"  "valence"           "track.popularity"
## [16] "track.duration_ms" "track.explicit"    "is_local"
```

---
class: inverse

### 10. Find out the mean for the variable 'danceability'

```r
summary(data)
```

```
##     song_id      playlist_name        decade_fct   track_artists
##  Min.   : 1.00   Length:60          Min.   :1.00   Length:60
##  1st Qu.:15.75   Class :character   1st Qu.:1.75   Class :character
##  Median :30.50   Mode  :character   Median :2.50   Mode  :character
##  Mean   :30.50                      Mean   :2.50
##  3rd Qu.:45.25                      3rd Qu.:3.25
##  Max.   :60.00                      Max.   :4.00
##   track.name         danceability        energy          loudness
##  Length:60          Min.   :0.4090   Min.   :0.2140   Min.   :-15.625
##  Class :character   1st Qu.:0.5955   1st Qu.:0.5807   1st Qu.: -9.703
##  Mode  :character   Median :0.7020   Median :0.7175   Median : -7.403
##                     Mean   :0.6847   Mean   :0.6885   Mean   : -7.844
##                     3rd Qu.:0.7705   3rd Qu.:0.8295   3rd Qu.: -5.649
##                     Max.   :0.9200   Max.   :0.9470   Max.   : -1.915
##   speechiness       acousticness         liveness           tempo
##  Min.   :0.02610   Min.   :0.000154   Min.   :0.03270   Min.   : 74.38
##  1st Qu.:0.03238   1st Qu.:0.023175   1st Qu.:0.08518   1st Qu.: 95.05
##  Median :0.04130   Median :0.087450   Median :0.11800   Median :104.64
##  Mean   :0.07131   Mean   :0.145696   Mean   :0.17277   Mean   :108.64
##  3rd Qu.:0.06588   3rd Qu.:0.197250   3rd Qu.:0.25450   3rd Qu.:119.11
##  Max.   :0.33200   Max.   :0.649000   Max.   :0.69600   Max.   :174.43
##  instrumentalness      valence       track.popularity track.duration_ms
##  Min.   :0.000000   Min.   :0.1610   Min.   : 0.00    Min.   :160500
##  1st Qu.:0.000000   1st Qu.:0.5690   1st Qu.:58.00    1st Qu.:225856
##  Median :0.000041   Median :0.7460   Median :66.50    Median :244607
##  Mean   :0.014627   Mean   :0.6993   Mean   :57.85    Mean   :251161
##  3rd Qu.:0.000934   3rd Qu.:0.8462   3rd Qu.:75.00    3rd Qu.:269540
##  Max.   :0.429000   Max.   :0.9800   Max.   :83.00    Max.   :391376
##  track.explicit   is_local
##  Mode :logical   Mode :logical
##  FALSE:55        FALSE:60
##  TRUE :5
##
##
##
```

---
class: inverse

### 11. Find out the data type for 'track.popularity'

```r
str(data)
```

Or

```r
class(data$track.popularity)
```

```
## [1] "numeric"
```

---
class: center, middle

<div class="padlet-embed" style="border:1px solid rgba(0,0,0,0.1);border-radius:2px;box-sizing:border-box;overflow:hidden;position:relative;width:100%;background:#F4F4F4"><p style="padding:0;margin:0"><iframe src="https://uofsussex.padlet.org/embed/nrud4gk8x63gbfdc" frameborder="0" allow="camera;microphone;geolocation" style="width:100%;height:608px;display:block;padding:0;margin:0"></iframe></p><div style="padding:8px;text-align:right;margin:0;"><a href="https://padlet.com?ref=embed" style="padding:0;margin:0;border:none;display:block;line-height:1;height:16px" target="_blank"><img src="https://padlet.net/embeds/made_with_padlet.png" width="86" height="16" style="padding:0;margin:0;background:none;border:none;display:inline;box-shadow:none" alt="Made with Padlet"></a></div></div>
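---
class: inverse

For questions 5 & 6, counting the logical and factor variables by eye from str() works, but the count can also be computed. This is a minimal sketch using base R only, assuming a small made-up data frame called *toy* (hypothetical, standing in for *data*; its values are not taken from the Spotify dataset).

```r
# A tiny made-up data frame standing in for `data` (illustrative values only)
toy <- data.frame(
  song_id = c(1, 2, 3),
  track_artists = c("A", "B", "C"),
  track.explicit = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# class() of every column, returned as a named character vector
col_types <- sapply(toy, class)
col_types

# How many columns there are of each type
table(col_types)

# How many logical columns, and how many factor columns
n_logical <- sum(sapply(toy, is.logical))
n_factor <- sum(sapply(toy, is.factor))
n_logical
n_factor
```

Swapping *toy* for *data* in the sapply() lines should give the same kind of count for the real dataset.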