
Utilizing spotifyr and rvest

Every year, Spotify Wrapped brings a buzz of excitement, with users of every stripe hastening to get a glimpse of the music that’s been in their ears all year. You see such posts flooding your Twitter timeline, your IG stories, you name it.

Behind the scenes, Spotify captures a wealth of information on each track and makes it available via its API. A quick glimpse through the documentation is enough to tell you that you’ll be able to do a host of things, from analyzing your playtime to tracking how your preferences change over the course of your listening. Perhaps more, but we’re getting ahead of ourselves here.

The spotifyr package contains a host of helper functions so that you don’t need to write your own to parse through Spotify’s API.

In this analysis, I’ll start by exploring my listening patterns using the functionality spotifyr provides, after which I’ll augment that by analyzing the lyrics of my more recent tracks. The lyrics part is made possible by the web scraping capabilities of the rvest package.

Let’s begin.

Initial setup

Getting access to Spotify’s API is fairly simple: follow the documentation linked above to register an application. You’ll need the resulting credentials to access some of the more advanced functionality in spotifyr.

Once you’ve created an application, Spotify will give you a client ID and a client secret. You’ll need both of them to access the API.

Predictably, spotifyr uses the credentials you just received to generate authentication tokens for you, which it then uses to proceed further.

library(tidyverse)
library(spotifyr)
library(lubridate)
library(stringr)
library(tidytext)
library(ggthemes)
library(hrbrthemes)
library(rvest)

authorization_code <- get_spotify_authorization_code(scope = scopes()[c(1:19)])

Quick note on the get_spotify_authorization_code function. A look at the help file tells us this:

get_spotify_authorization_code(
  client_id = Sys.getenv("SPOTIFY_CLIENT_ID"),
  client_secret = Sys.getenv("SPOTIFY_CLIENT_SECRET"),
  scope = scopes()
)

This is where the API credentials we got earlier come in handy. You can modify the .Renviron file present in your project with the following code:

SPOTIFY_CLIENT_ID = "ID_HERE"
SPOTIFY_CLIENT_SECRET = "SECRET_HERE"

This way, you won’t need to define something sensitive like an API key within your program script(s). Additionally, the scope parameter refers to the range of API capabilities Spotify gives you access to. By default, spotifyr tries to gain authorization for all scopes. However, as of recent developments, not every scope works: of the 25 possible scopes, the last few do not.

Hence the scopes()[c(1:19)] section of code.
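Before requesting an authorization code, it can help to confirm that the session actually sees both environment variables. The snippet below is a minimal sketch using only base R; the variable names match the ones shown above, and the messages are purely illustrative.

```r
# Names of the environment variables spotifyr reads by default
required <- c("SPOTIFY_CLIENT_ID", "SPOTIFY_CLIENT_SECRET")

# Sys.getenv() returns "" for variables that are not set
values <- Sys.getenv(required)
missing_vars <- required[!nzchar(values)]

if (length(missing_vars) > 0) {
  message("Not set: ", paste(missing_vars, collapse = ", "))
} else {
  message("Both Spotify credentials are available to this R session.")
}
```

If a variable shows up as not set right after editing .Renviron, restarting the R session usually fixes it, since the file is read at startup.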

Okay, now that we’ve got our hands on API keys, and have corresponding environment variables set up that enable us to authenticate our app… we can begin fiddling around with the package in earnest.

Analyzing recently played tracks

To begin exploring what sort of data Spotify gives us, let’s look at recent songs I’ve played, up to 10 of them. This is done via the get_my_recently_played() function.

get_my_recently_played(limit = 10, authorization = authorization_code) %>% 
  glimpse()
## Rows: 10
## Columns: 31
## $ played_at                          <chr> "2023-05-25T17:00:28.965Z", "2023-0…
## $ context                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ track.artists                      <list> [<data.frame[1 x 6]>], [<data.fram…
## $ track.available_markets            <list> <"AR", "AU", "AT", "BE", "BO", "BR…
## $ track.disc_number                  <int> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1
## $ track.duration_ms                  <int> 244586, 233719, 110942, 234946, 151…
## $ track.explicit                     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ track.href                         <chr> "https://api.spotify.com/v1/tracks/…
## $ track.id                           <chr> "0WQiDwKJclirSYG9v5tayI", "4ZuC5MfG…
## $ track.is_local                     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ track.name                         <chr> "There Is a Light That Never Goes O…
## $ track.popularity                   <int> 80, 61, 48, 65, 54, 76, 42, 72, 60,…
## $ track.preview_url                  <chr> "https://p.scdn.co/mp3-preview/23f6…
## $ track.track_number                 <int> 9, 7, 1, 12, 9, 5, 13, 1, 24, 2
## $ track.type                         <chr> "track", "track", "track", "track",…
## $ track.uri                          <chr> "spotify:track:0WQiDwKJclirSYG9v5ta…
## $ track.album.album_type             <chr> "album", "album", "album", "album",…
## $ track.album.artists                <list> [<data.frame[1 x 6]>], [<data.frame…
## $ track.album.available_markets      <list> <"AD", "AE", "AG", "AL", "AM", "AO"…
## $ track.album.href                   <chr> "https://api.spotify.com/v1/albums…
## $ track.album.id                     <chr> "5Y0p2XCgRRIjna91aQE8q7", "33qkK1b…
## $ track.album.images                 <list> [<data.frame[3 x 3]>], [<data.frame…
## $ track.album.name                   <chr> "The Queen Is Dead", "Unknown Pleas…
## $ track.album.release_date           <chr> "1986-06-16", "1979-06-01", "2006-…
## $ track.album.release_date_precision <chr> "day", "day", "day", "day", "day", …
## $ track.album.total_tracks           <int> 10, 22, 25, 14, 12, 9, 40, 17, 24, …
## $ track.album.type                   <chr> "album", "album", "album", "album",…
## $ track.album.uri                    <chr> "spotify:album:5Y0p2XCgRRIjna91aQE8…
## $ track.album.external_urls.spotify  <chr> "https://open.spotify.com/album/5Y0…
## $ track.external_ids.isrc            <chr> "GBCRL1100054", "GBAAP0600166", "QM…
## $ track.external_urls.spotify        <chr> "https://open.spotify.com/track/0WQ…

A wealth of information in front of us. A few key highlights:

  • The played_at column contains a timestamp of when you played the corresponding track.
  • track.artists contains a list of artists, per song. Presumably this is to handle cases in which multiple artists contribute to a single song. We might have to deal with this later.
  • track.duration_ms contains the duration of each song, denoted in milliseconds.
  • track.id and track.name contain the unique ID assigned via Spotify to each track, along with their actual name.
  • Predictably, track.album.artists, track.album.name, track.album.id and that entire family contain information pertaining to the album a track belongs to.

Apart from data related to track names and album information, Spotify also grants us access to data pertaining to the actual contents of each track itself. Information like their tempo, their acousticness, and so on. You can explore all they offer here.

Of course, I’m interested in looking at the audio features of the songs I’m listening to. Luckily, the get_track_audio_features() function lets me do just that.

Ideally, I’d like to do a combination of some rudimentary data cleaning, followed by some charts. I’m going to create a function that’ll combine this small workflow into one.

Before we begin, though, I’d like to change the default theme ggplot2 uses.

theme_set(theme_solarized_2())
import_roboto_condensed()

Okay. Onto the function. The function below primarily has 3 segments:

  • Initial transformations
  • Reshaping
  • Plotting

Let’s go through the code segment by segment:

played_tracks_analysis <- function(df) {
  df_processed <- df %>%
    mutate(
      artist.name = map_chr(track.artists, ~ .$name[1]),
      played_at = as_datetime(played_at),
      track.duration = lubridate::ms(paste((track.duration_ms / 1000) %/% 60,
                                           ":",
                                           as.integer(round((
                                             track.duration_ms / 1000
                                           ) %% 60), 0),
                                           sep = ""
      ))
    ) %>%
    select(
      track.id,
      track.name,
      artist.name,
      track.album.name,
      track.duration,
      track.popularity,
      played_at
    )
  
  df_ids <- df %>%
    pull(track.id) %>%
    as.vector()

Most of the above is fairly self-explanatory. We simply modify data types explicitly (as in the case of the played_at and track.duration columns), followed by choosing columns relevant to us. It’s worth highlighting how I’ve handled the data present in the track.artists column. Since it’s a list-column, we need a way to pick specific elements for the purpose of our analysis, instead of all the elements present in each list within that column. I do this via map_chr(), which executes a function on each element of a vector, and returns a character vector. In this case, I’m executing an anonymous function that returns the first artist name from each list item.

Hence, ~.$name[1].
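To see what that anonymous function does in isolation, here is a small sketch on made-up data. The track_artists list below is a hypothetical stand-in for the track.artists list-column (one data frame per track, with a name column); the second variant, which keeps every artist, is an alternative and not what the analysis uses.

```r
library(purrr)

# Toy stand-in for the track.artists list-column: one data frame per track
track_artists <- list(
  data.frame(name = c("Queen", "David Bowie"), stringsAsFactors = FALSE),
  data.frame(name = "Joy Division", stringsAsFactors = FALSE)
)

# Keep only the first listed artist for each track
first_artist <- map_chr(track_artists, ~ .$name[1])

# Alternative: collapse all artists of a track into a single string
all_artists <- map_chr(track_artists, ~ paste(.$name, collapse = ", "))
```

Taking the first artist keeps one row per track, at the cost of ignoring collaborators.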

After this, I store all the track IDs in a separate vector.

Okay. Onto the next segment.

df_audio_features_long <-
    get_track_audio_features(ids = df_ids) %>%
    mutate(
      track.duration = lubridate::ms(paste((duration_ms / 1000) %/% 60,
                                           ":",
                                           as.integer(round((duration_ms / 1000) %% 60), 0),
                                           sep = ""
      )),
      track_minutes = minute(track.duration)
    ) %>%
    select(-duration_ms) %>%
    pivot_longer(
      cols = c(
        "danceability":"energy",
        "loudness",
        "speechiness":"tempo",
        "track_minutes"
      ),
      names_to = "audio_feature"
    ) %>%
    select(id, everything(),-c("uri":"analysis_url"))

As you can see, I’m passing the IDs I stored earlier to the get_track_audio_features() function. I then convert the duration_ms column to track.duration (milliseconds to a period of minutes and seconds, which I believe is more intuitive) and use lubridate::minute() to extract the minutes component of that period as track_minutes. I then use the pivot_longer() function from tidyr to change the format of the dataframe from wide to long. I do this for a visualization I have in mind that we will be getting to in just a minute. We conclude by selecting what’s relevant.
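As an aside, the duration conversion can be written more compactly. The sketch below is an alternative, not the code used in this post: it rounds to whole seconds first and lets lubridate::seconds_to_period() do the splitting, which also avoids the edge case where rounding the seconds separately produces a value of 60. The duration_ms values are made up for illustration.

```r
library(lubridate)

# Example durations in milliseconds (the last one rounds up to exactly 3 minutes)
duration_ms <- c(244586, 233719, 179600)

# Round to whole seconds, then convert to a minutes-and-seconds period
track_duration <- seconds_to_period(round(duration_ms / 1000))

# minute() extracts the minutes component of each period
track_minutes <- minute(track_duration)
```

Note that for tracks longer than an hour, minute() would return only the minutes within the hour, which is also true of the original approach.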

Finally, we wrap up by:

df_audio_features_long %>%
    ggplot(aes(value)) +
    geom_density(aes(fill = audio_feature), alpha = 0.5) +
    scale_fill_brewer(palette = "Set3", type = "qual") +
    facet_wrap(~ audio_feature, scale = "free") +
    labs(
      title = "Distribution of audio features",
      subtitle = "Recently played tracks only",
      x = "",
      y = "Density",
      fill = "Audio feature"
    )

A simple facet plot, showing the distribution of each audio feature Spotify tracks for the songs I’ve listened to recently.
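The facet plot relies on the long format produced by pivot_longer(): one row per track and feature, so that facet_wrap() can split on the feature name. A toy illustration of that reshape, with invented ids and values, might look like this:

```r
library(tidyr)

# Wide format: one row per track, one column per audio feature
wide <- data.frame(
  id = c("a", "b"),
  danceability = c(0.5, 0.7),
  energy = c(0.8, 0.4),
  stringsAsFactors = FALSE
)

# Long format: one row per (track, feature) pair; values land in "value"
long <- pivot_longer(
  wide,
  cols = c(danceability, energy),
  names_to = "audio_feature"
)
```

The value column is the default name pivot_longer() gives to the reshaped values, which is why the plot maps aes(value).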

Combining everything, we end up with:

played_tracks_analysis <- function(df) {
  df_processed <- df %>%
    mutate(
      artist.name = map_chr(track.artists, ~ .$name[1]),
      played_at = as_datetime(played_at),
      track.duration = lubridate::ms(paste((track.duration_ms / 1000) %/% 60,
                                           ":",
                                           as.integer(round((
                                             track.duration_ms / 1000
                                           ) %% 60), 0),
                                           sep = ""
      ))
    ) %>%
    select(
      track.id,
      track.name,
      artist.name,
      track.album.name,
      track.duration,
      track.popularity,
      played_at
    )
  
  df_ids <- df %>%
    pull(track.id) %>%
    as.vector()
  
  df_audio_features_long <-
    get_track_audio_features(ids = df_ids) %>%
    mutate(
      track.duration = lubridate::ms(paste((duration_ms / 1000) %/% 60,
                                           ":",
                                           as.integer(round((duration_ms / 1000) %% 60), 0),
                                           sep = ""
      )),
      track_minutes = minute(track.duration)
    ) %>%
    select(-duration_ms) %>%
    pivot_longer(
      cols = c(
        "danceability":"energy",
        "loudness",
        "speechiness":"tempo",
        "track_minutes"
      ),
      names_to = "audio_feature"
    ) %>%
    select(id, everything(),-c("uri":"analysis_url"))
  
  df_audio_features_long %>%
    ggplot(aes(value)) +
    geom_density(aes(fill = audio_feature), alpha = 0.5) +
    scale_fill_brewer(palette = "Set3", type = "qual") +
    facet_wrap(~ audio_feature, scale = "free") +
    labs(
      title = "Distribution of audio features",
      x = "",
      y = "Density",
      fill = "Audio feature"
    ) +
    theme(legend.position = "bottom")
  
  
}

played_tracks_analysis(get_my_recently_played(limit = 50, authorization = authorization_code))

There you have it! A concise little plot showing what sort of songs I’ve been listening to lately (the 50 most recent, at least), in terms of their acoustic characteristics. For instance, a quick look at the speechiness and instrumentalness charts tells me that I’ve lately been listening to a lot of songs with low amounts of actual spoken words, and to more instrumental pieces. That completely makes sense, since I mainly listen to OSTs while working (as I am right now at the time of writing). For a full explanation of what these features mean, I’d recommend checking out the official API documentation, but for the most part the names are fairly self-explanatory.

Analyzing the playlists I listen to

If you’re like me and keep unreasonably long custom playlists… this section might be of relevance. I primarily maintain two custom playlists with songs I can listen to multiple times:

  • One containing songs in English
  • One with everything that isn’t English. This includes OSTs, instrumental music, and songs from other languages.

The Spotify API (or the spotifyr package more like) has, you guessed it, a get_playlist_audio_features() function that does exactly what the name says. All it needs is your username, the unique ID of your playlist (you can see this when you opt to share your playlist), and the authentication token you should already have at this point.

I’ve created playlists_audio_features that simply contains the results of a call to get_playlist_audio_features(), which I’ll be using further.

You can do something really similar with get_my_top_artists_or_tracks()

Let’s have a look at what the variable we just created contains.

playlists_audio_features %>% 
  names()
##  [1] "playlist_id"                        "playlist_name"                     
##  [3] "playlist_img"                       "playlist_owner_name"               
##  [5] "playlist_owner_id"                  "danceability"                      
##  [7] "energy"                             "key"                               
##  [9] "loudness"                           "mode"                              
## [11] "speechiness"                        "acousticness"                      
## [13] "instrumentalness"                   "liveness"                          
## [15] "valence"                            "tempo"                             
## [17] "track.id"                           "analysis_url"                      
## [19] "time_signature"                     "added_at"                          
## [21] "is_local"                           "primary_color"                     
## [23] "added_by.href"                      "added_by.id"                       
## [25] "added_by.type"                      "added_by.uri"                      
## [27] "added_by.external_urls.spotify"     "track.artists"                     
## [29] "track.available_markets"            "track.disc_number"                 
## [31] "track.duration_ms"                  "track.episode"                     
## [33] "track.explicit"                     "track.href"                        
## [35] "track.is_local"                     "track.name"                        
## [37] "track.popularity"                   "track.preview_url"                 
## [39] "track.track"                        "track.track_number"                
## [41] "track.type"                         "track.uri"                         
## [43] "track.album.album_type"             "track.album.artists"               
## [45] "track.album.available_markets"      "track.album.href"                  
## [47] "track.album.id"                     "track.album.images"                
## [49] "track.album.name"                   "track.album.release_date"          
## [51] "track.album.release_date_precision" "track.album.total_tracks"          
## [53] "track.album.type"                   "track.album.uri"                   
## [55] "track.album.external_urls.spotify"  "track.external_ids.isrc"           
## [57] "track.external_urls.spotify"        "video_thumbnail.url"               
## [59] "key_name"                           "mode_name"                         
## [61] "key_mode"

At first glance… the majority of column names seem to be columns we’ve come across already, making subsequent steps a lot easier.

We first modify the data we just generated as follows. As you can see, the transformations are fairly simple, with map_chr and ~.$name[1] being used once again to retain a singular artist from the track.artists list-column.

playlists_audio_processed <- playlists_audio_features %>%
  mutate_at(c("key", "mode"), as.factor) %>%
  mutate(
    added_at = as_datetime(added_at),
    year = year(added_at),
    month = month(added_at),
    artists = map_chr(track.artists, ~ .$name[1]),
  )

I choose to visualize the above as a facetted barplot. I use slice_max() with a group_by() to keep only a specific number of rows belonging to each group in our grouped dataframe. We choose to keep ties in this case.

After that, we wrap things up with the tried and tested duo of geom_col() and facet_wrap().

playlists_audio_processed %>%
  count(artists, year, sort = TRUE) %>%
  mutate(year = as.factor(year)) %>%
  group_by(year) %>%
  slice_max(n = 5, order_by = n) %>%
  ungroup() %>% 
  ggplot(aes(reorder_within(
    x = artists, by = n, within = year
  ))) +
  geom_col(aes(y = n, fill = year), color = 'black') +
  coord_flip() +
  scale_x_reordered() +
  scale_fill_brewer(palette = "Set3", type = "qual") +
  facet_wrap(~ year, scale = "free_y") +
  labs(
    x = "Artist listened to",
    y = "# of tracks added",
    title = "# of tracks added by artists, per year",
    subtitle = "Top-5 artists yearly",
    fill = "Year"
  ) +
  theme(legend.position = 'bottom')
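To make the tie-keeping behaviour of slice_max() concrete, here is a small sketch on invented counts. The column name tracks is hypothetical (the pipeline above uses n from count()); the point is only to show how the default with_ties = TRUE can return more rows than requested.

```r
library(dplyr)

# Invented per-year counts; B and C are tied in 2020
added <- tibble(
  year = c(2020, 2020, 2020, 2020, 2021, 2021),
  artists = c("A", "B", "C", "D", "E", "F"),
  tracks = c(5, 3, 3, 1, 4, 2)
)

# Default behaviour: ties at the cutoff are all kept
top_with_ties <- added %>%
  group_by(year) %>%
  slice_max(order_by = tracks, n = 2) %>%
  ungroup()

# Strict top-2 per group: ties are broken by row order
top_no_ties <- added %>%
  group_by(year) %>%
  slice_max(order_by = tracks, n = 2, with_ties = FALSE) %>%
  ungroup()
```

With ties kept, 2020 contributes three artists rather than two, which is why some facets in a plot like the one above can show more than five bars.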

It doesn’t take a music aficionado to see how the artists whose songs I choose to add to my library have changed over the years (well, playlist, but I only really listen to songs from my playlists and not simply my library, so it’s essentially the same). I go from listening to (and liking) songs by Queen to The Offspring to Dire Straits. However, come 2021 and… that changes. This is the year I first played Supergiant’s Hades and really began to appreciate video game music, so much so that I began listening to video game OST albums. In that same year, you’ll find Darren Korb, who is the man behind the Hades soundtrack, and Borislav Slavov of Divinity: Original Sin II, to name a few. You’ll also begin to find popular Japanese acts like RADWIMPS and Aimer from this year onwards.

How can I further extend what we have so far? Simple enough. Let’s look at lyrics.

The rvest package enables us to scrape elements off a webpage of our choosing via a few lines of code. This, in combination with the SelectorGadget Chrome Extension enables us to pinpoint what element exactly we need to extract, without going through the cumbersome Inspect element process.

I’m going to try scraping lyrics off the songlyrics.com website. I had to run test scripts on a few other websites first, because not every website allows itself to be scraped.

Before we actually use rvest, we have another problem on our hands. We need valid links to actually scrape through… links we simply do not have right now. To create valid links, we need to look at the format of links already present on our website of choice, following which we need to manipulate the data we have right now (track and artist names) into a format that matches what songlyrics.com already has.

The format of a valid link (in this case) resembles the following:

https://www.songlyrics.com/<ARTIST_NAME>/<TRACK_NAME>-lyrics

Names are lower-cased and spaces become hyphens. Of course, punctuation marks like question marks, exclamation marks, etc. are removed as well. What does this mean? Well… regex to the rescue!

Let’s have a look at my 50 most recently played tracks.

You can’t look at more than your 50 most recent songs.

recently_played <- get_my_recently_played(limit = 50, authorization = authorization_code)
recently_played %>% 
  glimpse()
## Rows: 50
## Columns: 31
## $ played_at                          <chr> "2023-05-25T17:00:28.965Z", "2023-0…
## $ context                            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ track.artists                      <list> [<data.frame[1 x 6]>], [<data.fram…
## $ track.available_markets            <list> <"AR", "AU", "AT", "BE", "BO", "BR…
## $ track.disc_number                  <int> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,…
## $ track.duration_ms                  <int> 244586, 233719, 110942, 234946, 151…
## $ track.explicit                     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ track.href                         <chr> "https://api.spotify.com/v1/tracks/…
## $ track.id                           <chr> "0WQiDwKJclirSYG9v5tayI", "4ZuC5MfG…
## $ track.is_local                     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ track.name                         <chr> "There Is a Light That Never Goes O…
## $ track.popularity                   <int> 80, 61, 48, 65, 54, 76, 42, 72, 60,…
## $ track.preview_url                  <chr> "https://p.scdn.co/mp3-preview/23f6…
## $ track.track_number                 <int> 9, 7, 1, 12, 9, 5, 13, 1, 24, 2, 2,…
## $ track.type                         <chr> "track", "track", "track", "track",…
## $ track.uri                          <chr> "spotify:track:0WQiDwKJclirSYG9v5ta…
## $ track.album.album_type             <chr> "album", "album", "album", "album",…
## $ track.album.artists                <list> [<data.frame[1 x 6]>], [<data.fram…
## $ track.album.available_markets      <list> <"AD", "AE", "AG", "AL", "AM", "AO…
## $ track.album.href                   <chr> "https://api.spotify.com/v1/albums/…
## $ track.album.id                     <chr> "5Y0p2XCgRRIjna91aQE8q7", "33qkK1br…
## $ track.album.images                 <list> [<data.frame[3 x 3]>], [<data.fram…
## $ track.album.name                   <chr> "The Queen Is Dead", "Unknown Pleas…
## $ track.album.release_date           <chr> "1986-06-16", "1979-06-01", "2006-0…
## $ track.album.release_date_precision <chr> "day", "day", "day", "day", "day", …
## $ track.album.total_tracks           <int> 10, 22, 25, 14, 12, 9, 40, 17, 24, …
## $ track.album.type                   <chr> "album", "album", "album", "album",…
## $ track.album.uri                    <chr> "spotify:album:5Y0p2XCgRRIjna91aQE8…
## $ track.album.external_urls.spotify  <chr> "https://open.spotify.com/album/5Y0…
## $ track.external_ids.isrc            <chr> "GBCRL1100054", "GBAAP0600166", "QM…
## $ track.external_urls.spotify        <chr> "https://open.spotify.com/track/0WQ…

I’m also going to be storing the base URL of the website we’ll be looking at.

html_url <- "https://www.songlyrics.com/"

Okay… now to get to the nitty-gritty of actually manipulating the artist and track names we have so that they fit nicely with the base URL we just stored. Remember library(stringr)? Now is where we use it.

Using a mix of str_to_lower() (to convert all characters to lower-case, for consistency) and str_replace_all(), I first replace all spaces with hyphens. This is to maintain consistency with the nomenclature used by the website I’ve chosen.

recently_played %>%
  mutate(
    artist.name = map_chr(track.artists, ~ .$name[1]),
    url_string = paste(
      str_to_lower(str_replace_all(artist.name, " ", "-")),
      "/",
      str_to_lower(str_replace_all(track.name, " ", "-")),
      sep = ""
    )
  ) %>% 
  select(url_string) %>% 
  head(10)
##                                                                                      url_string
## 1                               the-smiths/there-is-a-light-that-never-goes-out---2011-remaster
## 2                                                       joy-division/shadowplay---2007-remaster
## 3                                                             jeremy-soule/reign-of-the-septims
## 4                                                                    twenty-one-pilots/hometown
## 5                                                               chuck-berry/rock-and-roll-music
## 6                                                                   steve-miller-band/the-joker
## 7  bruce-springsteen/born-in-the-u.s.a.---live-at-la-coliseum,-los-angeles,-ca---september-1985
## 8                                                                          måneskin/own-my-mind
## 9                                                                             coldplay/miracles
## 10                                                                            john-waite/change

Now, here’s the interesting part. Spotify contains multiple versions of songs. Remasters, live versions, you name it. That’s why sometimes, depending on when I run this code, I can get a string that looks like this:

franz-ferdinand/this-fffire---new-version

See the triple hyphen? That’s because the original title separates the version suffix with " - " (space, hyphen, space), and each of those spaces has now been replaced by a hyphen. There’s a similar pattern for remasters too, like so:

david-bowie/modern-love---2018-remaster

Clearly, we have further cleaning to do. This is where we use regex. We’ll be tacking on the following section of code to what we’ve got so far.

mutate(
    url_to_scrape = str_split(url_string, "---")[[1]][1],
    url_to_scrape = str_replace_all(url_to_scrape, "[!.?,\'\"]", ""),
    url_to_scrape = paste(html_url, url_to_scrape, "-lyrics", sep = "")
  )

Let’s break this down.

  • str_split() returns a list. Since I’m using the --- characters as a delimiter, we only really need the first element of the list being returned. This is, of course, based off the assumption of naming conventions following a consistent format, something akin to <SONG_NAME>---remaster.
  • We want to replace punctuation characters since they’re liable to break links. I use str_replace_all() to replace punctuation marks with empty characters.
  • What follows next is a call to paste() to combine the base URL, the manipulated string so far, and the hard-coded string -lyrics.
  • That’s how david-bowie/modern-love---2018-remaster becomes this: https://www.songlyrics.com/david-bowie/modern-love-lyrics.
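The steps above can be traced on a single string. This is a minimal sketch outside of mutate(), using the example from the last bullet; the step_1 to step_3 names are only for illustration.

```r
library(stringr)

html_url <- "https://www.songlyrics.com/"
url_string <- "david-bowie/modern-love---2018-remaster"

# 1. Keep only what comes before the first "---"
step_1 <- str_split(url_string, "---")[[1]][1]

# 2. Drop punctuation that is liable to break the link
step_2 <- str_replace_all(step_1, "[!.?,'\"]", "")

# 3. Glue the base URL, the cleaned string, and the "-lyrics" suffix together
step_3 <- paste(html_url, step_2, "-lyrics", sep = "")
```

Here step_1 is "david-bowie/modern-love" and step_3 is the final link, https://www.songlyrics.com/david-bowie/modern-love-lyrics.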

Let’s combine what we’ve got so far.

recently_played_processed <- recently_played %>%
  mutate(
    artist.name = map_chr(track.artists, ~ .$name[1]),
    url_string = paste(
      str_to_lower(str_replace_all(artist.name, " ", "-")),
      "/",
      str_to_lower(str_replace_all(track.name, " ", "-")),
      sep = ""
    )
  ) %>%
  rowwise() %>%
  mutate(
    url_to_scrape = str_split(url_string, "---")[[1]][1],
    url_to_scrape = str_replace_all(url_to_scrape, "[!.?,\'\"]", ""),
    url_to_scrape = paste(html_url, url_to_scrape, "-lyrics", sep = "")
  )

recently_played_processed %>% 
  select(url_to_scrape)
## # A tibble: 50 × 1
## # Rowwise: 
##    url_to_scrape                                                                
##    <chr>                                                                        
##  1 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-l…
##  2 https://www.songlyrics.com/joy-division/shadowplay-lyrics                    
##  3 https://www.songlyrics.com/jeremy-soule/reign-of-the-septims-lyrics          
##  4 https://www.songlyrics.com/twenty-one-pilots/hometown-lyrics                 
##  5 https://www.songlyrics.com/chuck-berry/rock-and-roll-music-lyrics            
##  6 https://www.songlyrics.com/steve-miller-band/the-joker-lyrics                
##  7 https://www.songlyrics.com/bruce-springsteen/born-in-the-usa-lyrics          
##  8 https://www.songlyrics.com/måneskin/own-my-mind-lyrics                       
##  9 https://www.songlyrics.com/coldplay/miracles-lyrics                          
## 10 https://www.songlyrics.com/john-waite/change-lyrics                          
## # ℹ 40 more rows

You’ll notice a call to rowwise(). A snippet from ?rowwise():

rowwise() allows you to compute on a data frame a row-at-a-time. This is most useful when a vectorised function doesn’t exist.

If I didn’t have the call to rowwise(), here’s what the generated URLs would look like.

##                                                                        url_to_scrape
## 1  https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 2  https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 3  https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 4  https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 5  https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 6  https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 7  https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 8  https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 9  https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 10 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 11 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 12 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 13 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 14 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 15 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 16 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 17 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 18 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 19 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 20 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 21 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 22 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 23 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 24 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 25 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 26 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 27 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 28 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 29 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 30 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 31 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 32 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 33 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 34 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 35 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 36 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 37 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 38 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 39 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 40 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 41 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 42 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 43 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 44 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 45 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 46 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 47 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 48 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 49 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 50 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics

Every row simply ends up with the value computed for the first row (which is generated the first time the operations are run). Converting the preceding dataframe into a rowwise_df gets around this by making our operations execute on a row-by-row basis.
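To make the difference concrete, here is a minimal sketch on made-up data. Note that slugify() is a hypothetical helper invented for this illustration (it is not the function used above); like many string helpers, it only looks at the first element of whatever it is given.

```r
library(dplyr)

# Hypothetical helper: splits only the FIRST element of x and glues it back together
slugify <- function(x) {
  paste(strsplit(tolower(x), " ")[[1]], collapse = "-")
}

tracks <- tibble(track = c("Wet Sand", "The Joker", "Born In The USA"))

# Without rowwise(): slugify() receives the whole column at once,
# so the single result built from row 1 is recycled into every row
naive <- tracks %>%
  mutate(slug = slugify(track))

# With rowwise(): slugify() is called once per row
by_row <- tracks %>%
  rowwise() %>%
  mutate(slug = slugify(track)) %>%
  ungroup()
```

Printing naive shows the same slug three times, while by_row has one slug per track.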

Okay, great. So we now have links that at least appear to be valid. We still need to work on actually extracting lyrics from those websites, and to handle cases in which the artificially generated links don’t actually exist.

Enter rvest.

Remember the SelectorGadget Chrome extension I mentioned earlier? Installing that makes leveraging the capabilities of rvest (and other alternatives like Python’s BeautifulSoup) a lot easier. These work on the principle of using identifiers known as selectors that correspond to a specific element present on a webpage. Instead of diving within nested div tags present on the source code of the webpage, you simply activate the extension and click the element you’re focusing on. It takes a few tries, but you should be able to find selectors corresponding to the element of concern.

Selectors are generally of two types: XPath and CSS. I choose to use XPath here.
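If you're wondering how the two compare, here's a small sketch that picks the same element both ways. The HTML fragment is made up for illustration (built with rvest's minimal_html()); only the songLyricsDiv ID mirrors the selector used in this post.

```r
library(rvest)

# Hypothetical page fragment, not the real site markup
page <- minimal_html('
  <div id="header">Site title</div>
  <p id="songLyricsDiv">La la la</p>
')

# XPath: any element whose id attribute equals "songLyricsDiv"
via_xpath <- page %>%
  html_elements(xpath = '//*[@id = "songLyricsDiv"]') %>%
  html_text()

# CSS: the same element, addressed with the #id shorthand
via_css <- page %>%
  html_elements(css = "#songLyricsDiv") %>%
  html_text()

identical(via_xpath, via_css)
```

For a simple ID lookup like this one, the choice is mostly a matter of taste; XPath tends to earn its keep when you need to navigate relative to other elements.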

Let’s do a test run, with a hard-coded URL.

test_url <- "https://www.songlyrics.com/system-of-a-down/toxicity-lyrics"

test_url %>% 
  read_html() %>% 
  html_elements(xpath = '//*[(@id = "songLyricsDiv")]') %>% 
  html_text()
## [1] "Conversion software version 7.0\nLooking at life through the eyes of a tire hub\nEating seeds as a pastime activity\nThe toxicity of our city, of our city\nNow, what do you own the world?\nHow do you own disorder, disorder\nNow somewhere between the sacred silence\nSacred silence and sleep\nSomewhere, between the sacred silence and sleep\nMore wood for the fires, loud neighbours\nFlashlight reveries caught in the headlights of a truck\nEating seeds as a pastime activity\nThe toxicity of our city, of our city\nNow, what do you own the world?\nHow do you own disorder, disorder\nNow somewhere between the sacred silence\nSacred silence and sleep\nSomewhere between the sacred silence and sleep\nNow, what do you own the world?\nHow do you own disorder, disorder\nNow somewhere between the sacred silence\nSacred silence and sleep\nSomewhere, between the sacred silence and sleep\nWhen I became the sun\nI shone life into the man's hearts\nWhen I became the sun\nI shone life into the man's hearts"

The read_html()->html_elements()->html_text() chain of functions works like this:

  • read_html() downloads the page at the URL we specified and parses its HTML.
  • html_elements() returns the elements matching the selector we pass in. Here that's an XPath expression matching the element with the ID songLyricsDiv.
  • html_text() returns the text contained in the matched elements.
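Here's a rough sketch of the same chain with each step stored separately, so you can inspect the intermediate objects. To keep it offline, minimal_html() with a made-up fragment stands in for read_html(url); the noSuchId selector is likewise just for illustration.

```r
library(rvest)

# Stand-in for read_html(url): a parsed document built from a made-up fragment
page <- minimal_html('<p id="songLyricsDiv">Line one</p>')

# Step 2: the set of elements matching the selector
nodes <- html_elements(page, xpath = '//*[@id = "songLyricsDiv"]')
length(nodes)

# Step 3: the text inside those elements, as a character vector
lyrics <- html_text(nodes)

# A selector that matches nothing yields an empty result rather than an error
missing <- html_elements(page, xpath = '//*[@id = "noSuchId"]') %>%
  html_text()
length(missing)
```

That last case is worth keeping in mind: a page that loads fine but lacks the element gives you an empty character vector, not an error.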

And there you go! With barely 5 lines of code, we have lyrics from a link we generated!

Of course, our work isn’t done yet. Our code won’t work if the link doesn’t exist, since the code will have nothing to scrape. I choose to handle this by returning NA if the code throws an error, followed by wrapping up everything nicely in a convenient function:

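scrape_lyrics <- function(url) {
  # Returns the lyrics as a single string, or NA if they can't be retrieved
  tryCatch(
    {
      lyrics <- read_html(x = url) %>%
        html_elements(xpath = '//*[(@id = "songLyricsDiv")]') %>%
        html_text()
      # A page without the lyrics element gives character(0); treat that as NA too
      if (length(lyrics) == 0) NA_character_ else lyrics[[1]]
    },
    # A dead link (or any other read error) ends up here
    error = function(e) NA_character_
  )
}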
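If you'd rather not hand-write the error handling, purrr::possibly() wraps a function so that it returns a default value instead of failing. The sketch below tries this without touching the network: a temporary file plays the role of a lyrics page, and a path that doesn't exist plays the role of a dead link. The names extract_lyrics and safe_extract are made up for this example.

```r
library(rvest)
library(purrr)

# Hypothetical stand-in for a lyrics page, written to a temporary file
page_path <- tempfile(fileext = ".html")
writeLines('<html><body><p id="songLyricsDiv">La la la</p></body></html>', page_path)

# A path that does not exist, standing in for a dead link
bad_path <- file.path(tempdir(), "no-such-page.html")

extract_lyrics <- function(path) {
  read_html(path) %>%
    html_elements(xpath = '//*[@id = "songLyricsDiv"]') %>%
    html_text()
}

# possibly() returns `otherwise` whenever the wrapped function throws an error
safe_extract <- possibly(extract_lyrics, otherwise = NA_character_)

found <- safe_extract(page_path)
not_found <- safe_extract(bad_path)
```

Here found holds the text of the fake page and not_found is NA, which is the behaviour we want for links that go nowhere.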

Here’s the fun part. Now we run our function on each URL we generated earlier, via map_chr() from the purrr package. This lets us run a function of our choosing on each element in a vector, returning the results as a character vector.

recently_played_processed <- recently_played_processed %>% 
  mutate(lyrics = map_chr(url_to_scrape, scrape_lyrics))
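As a quick aside on map_chr(): it expects the function to return exactly one string per input, which is why it fits a scraper that yields either the lyrics or NA. A tiny sketch, with a made-up fake_scrape() and example.com URLs in place of the real thing:

```r
library(purrr)

# Hypothetical URLs, for illustration only
urls <- c("https://example.com/a-lyrics", "https://example.com/b-lyrics")

# Stand-in for scrape_lyrics(): returns exactly one string per URL
fake_scrape <- function(url) paste("lyrics from", basename(url))

as_list <- map(urls, fake_scrape)     # map() always gives back a list
as_chr <- map_chr(urls, fake_scrape)  # map_chr() gives a character vector
```

The character vector is what lets the result slot straight into a dataframe column via mutate().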

recently_played_processed %>% 
  filter(!is.na(lyrics)) %>% 
  select(artist.name, track.name, lyrics)
## # A tibble: 44 × 3
## # Rowwise: 
##    artist.name              track.name                                    lyrics
##    <chr>                    <chr>                                         <chr> 
##  1 The Smiths               There Is a Light That Never Goes Out - 2011 … "Take…
##  2 Joy Division             Shadowplay - 2007 Remaster                    "To t…
##  3 Twenty One Pilots        Hometown                                      "[Hoo…
##  4 Chuck Berry              Rock And Roll Music                           "Just…
##  5 Steve Miller Band        The Joker                                     "Some…
##  6 Bruce Springsteen        Born In The U.S.A. - Live at LA Coliseum, Lo… "We d…
##  7 Coldplay                 Miracles                                      "From…
##  8 John Waite               Change                                        "Peop…
##  9 Electric Light Orchestra Confusion                                     "Ever…
## 10 Red Hot Chili Peppers    Wet Sand                                      "My s…
## # ℹ 34 more rows

Fantastic stuff, if I say so myself. Keep in mind you can do this with get_my_top_artists_or_tracks() as well, although running it might take a while depending on how many tracks you specify.

Analyzing the scraped lyrics

Look at how far we’ve come. Now let’s look (briefly, I know this has been a ride) at how we can handle the text. An older blog post of mine deals with this area in more detail, but that doesn’t mean we’re not going to try a few things on the wealth of data we have now.

We begin by tokenizing our text so far via tidytext::unnest_tokens(), followed by removing common stopwords.

recently_played_tokenized <- recently_played_processed %>% 
  filter(!is.na(lyrics)) %>% 
  select(artist.name, track.name, lyrics) %>% 
  unnest_tokens(word, lyrics) %>% 
  anti_join(stop_words, by = "word")
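To see what these two steps do, here's a minimal sketch on a single made-up row (the toy tibble is a hypothetical stand-in for recently_played_processed, borrowing one line of lyrics from the test run earlier):

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row stand-in for recently_played_processed
toy <- tibble(
  artist.name = "Demo Artist",
  track.name = "Demo Track",
  lyrics = "The toxicity of our city"
)

toy_tokens <- toy %>%
  unnest_tokens(word, lyrics) %>%     # one lowercased word per row
  anti_join(stop_words, by = "word")  # drop common words such as "the" and "of"

toy_tokens
```

unnest_tokens() lowercases and splits the text into one word per row, and the anti-join then keeps only the words that don't appear in tidytext's stop_words list.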

We’re cruising! We’ve got a dataframe containing our tokenized lyrics. What’s left is some rudimentary text analysis. If you’ve been keeping up so far, why, the world is your oyster.

recently_played_tokenized %>% 
  inner_join(get_sentiments("bing"), by = "word") %>% 
  count(word, sentiment, sort = TRUE) %>% 
  mutate(word_sentiment_count = if_else(str_detect(sentiment, "negative"), -1*n, n)) %>% 
  group_by(sentiment) %>% 
  slice_max(n, n = 10) %>% 
  ungroup() %>% 
  mutate(word = reorder_within(word, word_sentiment_count, sentiment)) %>% 
  ggplot(aes(word, word_sentiment_count)) + 
  geom_col(aes(fill = sentiment)) +
  scale_x_reordered() +
  coord_flip() +
  labs(title = "Word count per sentiment", subtitle = "Bing lexicon used", fill = "Sentiment", x = "Word", y = "Word count")

In the interest of brevity, I’m going to leave the rest to the reader. We have a tokenized dataframe of song lyrics; what’s next is up to you. A few ideas:
  • Track sentiment over time. Are you listening to songs that get sadder by the year?
  • Experiment with different lexicons. The Bing lexicon only classifies words as positive or negative, whereas the NRC lexicon, for instance, has 10 categories (eight emotions plus positive and negative).
  • Explore the text on a per-artist basis. What words appear more often in songs performed by a particular artist?
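As a starting point for the per-artist idea, here's a rough sketch using tf-idf, which favours words that are frequent for one artist but rare across the others. The tokens tibble is made-up data shaped like recently_played_tokenized; swap in the real dataframe to try it on your own listening history.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokenized lyrics, shaped like recently_played_tokenized
tokens <- tibble(
  artist.name = c("Artist A", "Artist A", "Artist A", "Artist B", "Artist B", "Artist B"),
  word        = c("light", "light", "love", "rain", "rain", "love")
)

artist_words <- tokens %>%
  count(artist.name, word, sort = TRUE) %>%
  bind_tf_idf(word, artist.name, n) %>%   # treat each artist as one "document"
  group_by(artist.name) %>%
  slice_max(tf_idf, n = 1, with_ties = FALSE) %>%
  ungroup()

artist_words
```

In this toy data "love" is sung by both artists, so its tf-idf is zero and each artist's most distinctive word comes out on top instead.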

I might play around with my data outside this blog, but for the time being, thank you for sticking to the end of the post. I’m happy to answer any questions you might have!