Essentials 01: Importing & Exploring Data

SeminRs

Informal, optional weekly sessions to help build a 'portfolio of skills'
1-2 hours of instruction, demos, walkthroughs & activities to try out

Essentials

Focused on the fundRmental skills to help you get started with learning R
Covering: basic wrangling & visualising data, 'pretty' R Markdown, inline code, debugging...

Level Up

Focused on more advanced programming skills & applying these skills to new fun topics
Covering: Papaja, advanced wrangling & manipulation of data, 'even prettier' R Markdown, spotifyR...

Session topics are not fixed - use the Padlet linked on Canvas for suggestions!

Suggested Workflow

Create an R project file for all seminR sessions
Within this directory, create two folders: r_docs & data
Save all Rmds to r_docs and all datasets to data

File > New Project... > New Directory > New Project > give your project a name & location
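
The two folders can also be created from the console. This is a minimal sketch in base R, assuming a throwaway project folder under tempdir(). The name seminRs_demo is hypothetical, so swap in your own project folder.

```r
# Hypothetical project location for illustration only; use your own project folder
project_dir <- file.path(tempdir(), "seminRs_demo")

# Create the project folder plus the two sub-folders suggested above
for (folder in c("r_docs", "data")) {
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# List what now sits inside the project folder
created <- list.files(project_dir)
created
```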

Open a new Rmd file for today's session
Make a cheat sheet of useful functions, and use # comments to note their meaning and usage as you go through seminRs, practicals, tutorials etc.

Session Objectives

Importing & Exploring Data

Understand directories
Read in datasets
Use different functions to explore datasets
Understand different data types

Directories & Paths Recap

Reading in data requires knowledge of your files & folders

Directories = Folders
Paths = Directions

Two types of paths:

Absolute

C:/Users/danie/Documents/seminRs_21/images/image.png

Relative

./images/image.png
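
To see the difference between the two path types in R, here is a small base R sketch. It assumes nothing about your files: the images/image.png location is only the example from above, so file.exists() will return FALSE unless you happen to have that file.

```r
# Where R currently "is": relative paths start from this folder
getwd()

# Build a relative path piece by piece (file.path() joins the parts with "/")
relative_path <- file.path(".", "images", "image.png")
relative_path

# The same location written as an absolute path from the working directory
absolute_path <- file.path(getwd(), "images", "image.png")
absolute_path

# Check whether a file actually exists at that path
file.exists(relative_path)
```

If you work inside an R project, the working directory is usually the project folder, which is what keeps relative paths short.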

Top Tip: get & stay organised!

Importing Datasets

A dataset is a collection of data: columns represent variables & rows represent cases/people/entities
Lots of different data formats exist - common ones are: .csv, .sav, .Rdata...
There are different functions, packages & methods we can use to load them into R
We're going to focus on readr; others include haven, foreign, readxl

readr

Part of the tidyverse; load it with library(tidyverse) or library(readr)

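# Read the file and print it to the console (nothing is saved)
readr::read_csv("./data/spotify_decades_data.csv")
# Read the file and store it in an object called data (adjust the path to where your file is saved)
data <- readr::read_csv("../spotify_decades_data.csv")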

First argument is the path to the file
Once run, it prints the column specification, which lists the column names and their data types
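
To try read_csv() without downloading anything, the sketch below writes a tiny stand-in CSV to a temporary file and reads it back. The object name mini_data and the temporary file are assumptions for illustration; the values are copied from the first rows of the Spotify dataset.

```r
library(readr)

# Write a tiny stand-in CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "song_id,track_artists,danceability,track.explicit",
    "1,Elton John,0.601,FALSE",
    "2,Stevie Wonder,0.67,FALSE",
    "3,Eric Clapton,0.572,FALSE"
  ),
  csv_path
)

# Read it in and store it as an object
mini_data <- read_csv(csv_path)

# Re-print the column specification that read_csv() guessed
spec(mini_data)
```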

Task

Download the dataset (make sure it saves as a .csv file)
Check it's saved to your data folder/move it there manually
In a new chunk in your Rmd file, use read_csv() to read in the file & name the object data

https://raw.githubusercontent.com/de84sussex/DS_spotify/main/spotify_decades_data.csv

Data Types

Double (dbl)
Integer (int)
Character (chr)
Logical (lgl)
Factor (fct)

The things we can do with our variables depend on their data type - get to know your data!
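
A quick way to get a feel for these types is to make one value of each and ask R what it is. This is a base R sketch; the object names are made up for illustration.

```r
# One example value of each type
a_double    <- 0.601
an_integer  <- 60L            # the L suffix makes a whole number an integer
a_character <- "Elton John"
a_logical   <- FALSE
a_factor    <- factor(c("decades_70", "decades_00"))

class(a_double)      # "numeric" (doubles and integers both count as numeric)
typeof(a_double)     # "double"
class(an_integer)    # "integer"
class(a_character)   # "character"
class(a_logical)     # "logical"
class(a_factor)      # "factor"

# Arithmetic works on numbers but not on text
a_double * 2
result <- try(a_character * 2, silent = TRUE)  # fails: text cannot be multiplied
inherits(result, "try-error")
```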

Getting to know your data...

Below are some useful functions for exploring your data. Run them in the console only, or delete them from your Rmd file once you've seen the output.

All you need to do is put the name of your object (aka your dataset) inside the function's brackets

To see your raw data:

objectname + ctrl/command & enter
print()
View()
head()
tail()

To see characteristics of your data:

str() & dplyr::glimpse()
summary()
class()
names()
ncol()
nrow()
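
If you want to practise before loading the Spotify data, these functions work on any data frame. The sketch below uses iris, a dataset that ships with R, purely as a stand-in.

```r
# iris ships with R, so it works as a stand-in dataset here
head(iris)       # first 6 rows
tail(iris, 3)    # last 3 rows

str(iris)        # structure: each column with its type
summary(iris)    # summary statistics per column
class(iris)      # what kind of object it is
names(iris)      # column names

n_columns <- ncol(iris)
n_rows <- nrow(iris)
n_columns
n_rows
```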

Task

Using a mix of these functions:

View()
head()
tail()
ncol()
nrow()
summary()
class()
names()
str()

Find out & write on the Padlet:

3 artists
A song title where the song_id matches your day of birth (e.g. 11th March = 11)
How many columns/variables
How many rows
How many logical variables
How many factor variables
Name the first 6 artists in the dataset (don't use View() for this one)
Name the last 6 artists in the dataset (don't use View() for this one)
Name 5 columns
Find out the mean for the variable 'danceability'
Find out the data type for 'track.popularity'

My Solutions

1. 3 artists

2. A song title where the song_id matches your day of birth

View(data) 
data 
print(data)
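
Scrolling through the printed data works, but you can also ask for the matching row directly. This is an alternative sketch rather than the solution used in the session: mini_data is a hypothetical stand-in built from the first three rows of the dataset, and 3 stands in for a day of birth.

```r
# Stand-in data frame built from the first three rows of the dataset
mini_data <- data.frame(
  song_id = c(1, 2, 3),
  track_artists = c("Elton John", "Stevie Wonder", "Eric Clapton"),
  track.name = c(
    "Rocket Man (I Think It's Going To Be A Long, Long Time)",
    "Signed, Sealed, Delivered (I'm Yours)",
    "Wonderful Tonight"
  ),
  stringsAsFactors = FALSE
)

# Keep the rows where song_id is 3, then pull out the song title
birthday_song <- mini_data$track.name[mini_data$song_id == 3]
birthday_song
```

With the full dataset, the same pattern would be data$track.name[data$song_id == 11] for the 11th.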

3. How many columns/variables

ncol(data)

## [1] 18

4. How many rows

nrow(data)

## [1] 60

5. How many logical variables & 6. How many factor variables

str(data)

## tibble [60 x 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ song_id          : num [1:60] 1 2 3 4 5 6 7 8 9 10 ...
##  $ playlist_name    : chr [1:60] "decades_70" "decades_70" "decades_70" "decades_70" ...
##  $ decade_fct       : num [1:60] 1 1 1 1 1 1 1 1 1 1 ...
##  $ track_artists    : chr [1:60] "Elton John" "Stevie Wonder" "Eric Clapton" "Chaka Khan" ...
##  $ track.name       : chr [1:60] "Rocket Man (I Think It's Going To Be A Long, Long Time)" "Signed, Sealed, Delivered (I'm Yours)" "Wonderful Tonight" "I'm Every Woman" ...
##  $ danceability     : num [1:60] 0.601 0.67 0.572 0.617 0.838 0.808 0.579 0.631 0.482 0.7 ...
##  $ energy           : num [1:60] 0.532 0.619 0.214 0.879 0.806 0.535 0.508 0.59 0.835 0.816 ...
##  $ loudness         : num [1:60] -9.12 -10.37 -15.62 -7.56 -9.74 ...
##  $ speechiness      : num [1:60] 0.0286 0.0323 0.0293 0.0455 0.0408 0.0353 0.027 0.0297 0.0539 0.044 ...
##  $ acousticness     : num [1:60] 0.432 0.0514 0.649 0.127 0.213 0.179 0.00574 0.00367 0.0191 0.00115 ...
##  $ liveness         : num [1:60] 0.0925 0.0492 0.125 0.339 0.354 0.158 0.0575 0.0537 0.162 0.0901 ...
##  $ tempo            : num [1:60] 136.6 108.8 95.5 114.5 123.1 ...
##  $ instrumentalness : num [1:60] 6.25e-06 0.00 1.29e-01 5.71e-05 2.03e-03 9.91e-05 4.94e-04 2.99e-03 1.43e-02 1.23e-03 ...
##  $ valence          : num [1:60] 0.341 0.807 0.485 0.746 0.846 0.848 0.609 0.927 0.776 0.838 ...
##  $ track.popularity : num [1:60] 81 1 76 58 0 72 62 66 66 63 ...
##  $ track.duration_ms: num [1:60] 281613 160500 225026 247413 361093 ...
##  $ track.explicit   : logi [1:60] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ is_local         : logi [1:60] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   song_id = col_double(),
##   ..   playlist_name = col_character(),
##   ..   decade_fct = col_double(),
##   ..   track_artists = col_character(),
##   ..   track.name = col_character(),
##   ..   danceability = col_double(),
##   ..   energy = col_double(),
##   ..   loudness = col_double(),
##   ..   speechiness = col_double(),
##   ..   acousticness = col_double(),
##   ..   liveness = col_double(),
##   ..   tempo = col_double(),
##   ..   instrumentalness = col_double(),
##   ..   valence = col_double(),
##   ..   track.popularity = col_double(),
##   ..   track.duration_ms = col_double(),
##   ..   track.explicit = col_logical(),
##   ..   is_local = col_logical()
##   .. )
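
In the str() output above, two columns are listed as logi and none as Factor (note that decade_fct is numeric despite its name). Instead of counting by eye, you could let R count for you. This is a sketch on a hypothetical stand-in, mini_data, built from a few columns of the first three rows.

```r
# Stand-in data frame using a few columns from the first three rows
mini_data <- data.frame(
  song_id = c(1, 2, 3),
  track_artists = c("Elton John", "Stevie Wonder", "Eric Clapton"),
  track.explicit = c(FALSE, FALSE, FALSE),
  is_local = c(FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# sapply() asks the question of every column; sum() counts the TRUE answers
n_logical <- sum(sapply(mini_data, is.logical))
n_factor <- sum(sapply(mini_data, is.factor))

n_logical
n_factor
```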

7. Name the first 6 artists in the dataset

head(data)

## # A tibble: 6 x 18
##   song_id playlist_name decade_fct track_artists track.name danceability energy
##     <dbl> <chr>              <dbl> <chr>         <chr>             <dbl>  <dbl>
## 1       1 decades_70             1 Elton John    Rocket Ma~        0.601  0.532
## 2       2 decades_70             1 Stevie Wonder Signed, S~        0.67   0.619
## 3       3 decades_70             1 Eric Clapton  Wonderful~        0.572  0.214
## 4       4 decades_70             1 Chaka Khan    I'm Every~        0.617  0.879
## 5       5 decades_70             1 Marvin Gaye   Got To Gi~        0.838  0.806
## 6       6 decades_70             1 Michael Jack~ Rock with~        0.808  0.535
## # ... with 11 more variables: loudness <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, liveness <dbl>, tempo <dbl>, instrumentalness <dbl>,
## #   valence <dbl>, track.popularity <dbl>, track.duration_ms <dbl>,
## #   track.explicit <lgl>, is_local <lgl>

8. Name the last 6 artists in the dataset

tail(data, 10) # tail(data) alone shows the last 6 rows; the second argument changes this to the last 10

## # A tibble: 10 x 18
##    song_id playlist_name decade_fct track_artists track.name danceability energy
##      <dbl> <chr>              <dbl> <chr>         <chr>             <dbl>  <dbl>
##  1      51 decades_00             4 Usher, Lil J~ Yeah!             0.895  0.795
##  2      52 decades_00             4 Rihanna       Pon de Re~        0.779  0.64 
##  3      53 decades_00             4 Toni Braxton  He Wasn't~        0.739  0.947
##  4      54 decades_00             4 Ja Rule, Ash~ Always On~        0.839  0.706
##  5      55 decades_00             4 Black Eyed P~ I Gotta F~        0.743  0.766
##  6      56 decades_00             4 Kanye West    Through T~        0.571  0.739
##  7      57 decades_00             4 Kings of Leon Sex on Fi~        0.544  0.903
##  8      58 decades_00             4 Jennifer Lop~ I'm Real ~        0.708  0.587
##  9      59 decades_00             4 Britney Spea~ Oops!...I~        0.751  0.834
## 10      60 decades_00             4 Beyonce, Sea~ Baby Boy ~        0.655  0.488
## # ... with 11 more variables: loudness <dbl>, speechiness <dbl>,
## #   acousticness <dbl>, liveness <dbl>, tempo <dbl>, instrumentalness <dbl>,
## #   valence <dbl>, track.popularity <dbl>, track.duration_ms <dbl>,
## #   track.explicit <lgl>, is_local <lgl>
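
Some artist names are cut off in the printed tibble. Running head() or tail() on a single column shows the values in full. This sketch uses a hypothetical stand-in vector containing only artists whose names appear untruncated in the output above.

```r
# Stand-in vector: four artists whose names are printed in full above
artists <- c("Rihanna", "Toni Braxton", "Kanye West", "Kings of Leon")

# head() and tail() also work on a single column (a vector)
first_two <- head(artists, 2)
last_two <- tail(artists, 2)

first_two
last_two
```

With the full dataset, the equivalent would be head(data$track_artists) and tail(data$track_artists).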

9. Name 5 columns

names(data)

##  [1] "song_id"           "playlist_name"     "decade_fct"       
##  [4] "track_artists"     "track.name"        "danceability"     
##  [7] "energy"            "loudness"          "speechiness"      
## [10] "acousticness"      "liveness"          "tempo"            
## [13] "instrumentalness"  "valence"           "track.popularity" 
## [16] "track.duration_ms" "track.explicit"    "is_local"

10. Find out the mean for the variable 'danceability'

summary(data)

##     song_id      playlist_name        decade_fct   track_artists     
##  Min.   : 1.00   Length:60          Min.   :1.00   Length:60         
##  1st Qu.:15.75   Class :character   1st Qu.:1.75   Class :character  
##  Median :30.50   Mode  :character   Median :2.50   Mode  :character  
##  Mean   :30.50                      Mean   :2.50                     
##  3rd Qu.:45.25                      3rd Qu.:3.25                     
##  Max.   :60.00                      Max.   :4.00                     
##   track.name         danceability        energy          loudness      
##  Length:60          Min.   :0.4090   Min.   :0.2140   Min.   :-15.625  
##  Class :character   1st Qu.:0.5955   1st Qu.:0.5807   1st Qu.: -9.703  
##  Mode  :character   Median :0.7020   Median :0.7175   Median : -7.403  
##                     Mean   :0.6847   Mean   :0.6885   Mean   : -7.844  
##                     3rd Qu.:0.7705   3rd Qu.:0.8295   3rd Qu.: -5.649  
##                     Max.   :0.9200   Max.   :0.9470   Max.   : -1.915  
##   speechiness       acousticness         liveness           tempo       
##  Min.   :0.02610   Min.   :0.000154   Min.   :0.03270   Min.   : 74.38  
##  1st Qu.:0.03238   1st Qu.:0.023175   1st Qu.:0.08518   1st Qu.: 95.05  
##  Median :0.04130   Median :0.087450   Median :0.11800   Median :104.64  
##  Mean   :0.07131   Mean   :0.145696   Mean   :0.17277   Mean   :108.64  
##  3rd Qu.:0.06588   3rd Qu.:0.197250   3rd Qu.:0.25450   3rd Qu.:119.11  
##  Max.   :0.33200   Max.   :0.649000   Max.   :0.69600   Max.   :174.43  
##  instrumentalness      valence       track.popularity track.duration_ms
##  Min.   :0.000000   Min.   :0.1610   Min.   : 0.00    Min.   :160500   
##  1st Qu.:0.000000   1st Qu.:0.5690   1st Qu.:58.00    1st Qu.:225856   
##  Median :0.000041   Median :0.7460   Median :66.50    Median :244607   
##  Mean   :0.014627   Mean   :0.6993   Mean   :57.85    Mean   :251161   
##  3rd Qu.:0.000934   3rd Qu.:0.8462   3rd Qu.:75.00    3rd Qu.:269540   
##  Max.   :0.429000   Max.   :0.9800   Max.   :83.00    Max.   :391376   
##  track.explicit   is_local      
##  Mode :logical   Mode :logical  
##  FALSE:55        FALSE:60       
##  TRUE :5                        
##                                 
##                                 
##
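
summary() reports every column at once. If you only want one number, you can point a function at a single column with $. This sketch uses a hypothetical stand-in, mini_data, holding the first three danceability values, so its mean will differ from the full dataset's.

```r
# Stand-in data frame with the first three danceability values
mini_data <- data.frame(danceability = c(0.601, 0.67, 0.572))

# mean() on one column, picked out with $
mean_danceability <- mean(mini_data$danceability)
mean_danceability

# summary() also works on a single column
summary(mini_data$danceability)
```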

11. Find out the data type for 'track.popularity'

str(data)

class(data$track.popularity)

## [1] "numeric"


