Every year, Spotify Wrapped brings with it a buzz of fervour, with listeners of every stripe hastening to catch a glimpse of the music that’s been in their ears all year. You see such posts flooding your Twitter timeline, your IG stories, you name it.
Behind the scenes, Spotify captures a wealth of information on each track and makes it available via its API. A quick glimpse through the documentation is enough to tell you that you’ll be able to do a host of things, from analyzing your playtime to tracking how your preferences change over the course of your listening. Perhaps more, but we’re getting ahead of ourselves here.
The spotifyr package contains a host of helper functions so that you don’t need to write your own to parse through Spotify’s API.
Through this analysis, I’ll be starting with exploring my listening patterns via the functionality made available by spotifyr, after which I’ll augment it by analyzing the lyrics of my more recent tracks. This is of course made possible via the web scraping capabilities given via the rvest package.
Let’s begin.
Initial setup
Getting an API key from Spotify is fairly simple. You’ll need it to access some of the more advanced functionality in spotifyr; follow the documentation linked above to create one.
Once you’ve created an application, Spotify will give you a client_id and a client_secret. You’ll need both of them to access the API.
Predictably, spotifyr uses the credentials you just received to generate authentication tokens for you, which it then uses to proceed further.
library(tidyverse)
library(spotifyr)
library(lubridate)
library(stringr)
library(tidytext)
library(ggthemes)
library(hrbrthemes)
library(rvest)
authorization_code <- get_spotify_authorization_code(scope = scopes()[c(1:19)])
Quick note on the get_spotify_authorization_code function. A look at the help file tells us this:
get_spotify_authorization_code(
client_id = Sys.getenv("SPOTIFY_CLIENT_ID"),
client_secret = Sys.getenv("SPOTIFY_CLIENT_SECRET"),
scope = scopes()
)
This is where the API credentials we got earlier come in handy. You can modify the .Renviron file present in your project with the following code:
SPOTIFY_CLIENT_ID = "ID_HERE"
SPOTIFY_CLIENT_SECRET = "SECRET_HERE"
This way, you won’t need to define something sensitive like an API key within your program script(s). Additionally, the scope parameter refers to the range of API capabilities Spotify gives you access to. By default, spotifyr tries to gain authentication for all scopes. However, as of this writing, not every scope works. Of the 25 possible scopes, the ones at the end of the list returned by scopes() do not work, so I keep only the first 19. Hence the scopes()[c(1:19)] section of code.
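If you’d like to confirm that the credentials in .Renviron are actually visible to your R session before authenticating, a minimal sketch could look like the following. The helper name spotify_credentials_set() is my own invention and not part of spotifyr; the two variable names are the ones shown in the help file above.

```r
# Hypothetical helper (not part of spotifyr): reports whether both
# environment variables that spotifyr reads by default are non-empty.
spotify_credentials_set <- function() {
  vars <- c("SPOTIFY_CLIENT_ID", "SPOTIFY_CLIENT_SECRET")
  values <- Sys.getenv(vars, unset = "")
  all(nzchar(values))
}

# TRUE once .Renviron has been edited and the R session restarted
spotify_credentials_set()
```

R only reads .Renviron at startup, so a FALSE here right after editing the file usually just means the session needs a restart.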
Okay, now that we’ve got our hands on API keys, and have corresponding environment variables set up that enable us to authenticate our app… we can begin fiddling around with the package in earnest.
Analyzing recently played tracks
To begin exploring what sort of data Spotify gives us, let’s look at recent songs I’ve played, up to 10 of them. This is done via the get_my_recently_played() function.
get_my_recently_played(limit = 10, authorization = authorization_code) %>%
glimpse()
## Rows: 10
## Columns: 31
## $ played_at <chr> "2023-05-25T17:00:28.965Z", "2023-0…
## $ context <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ track.artists <list> [<data.frame[1 x 6]>], [<data.fram…
## $ track.available_markets <list> <"AR", "AU", "AT", "BE", "BO", "BR…
## $ track.disc_number <int> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1
## $ track.duration_ms <int> 244586, 233719, 110942, 234946, 151…
## $ track.explicit <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ track.href <chr> "https://api.spotify.com/v1/tracks/…
## $ track.id <chr> "0WQiDwKJclirSYG9v5tayI", "4ZuC5MfG…
## $ track.is_local <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ track.name <chr> "There Is a Light That Never Goes O…
## $ track.popularity <int> 80, 61, 48, 65, 54, 76, 42, 72, 60,…
## $ track.preview_url <chr> "https://p.scdn.co/mp3-preview/23f6…
## $ track.track_number <int> 9, 7, 1, 12, 9, 5, 13, 1, 24, 2
## $ track.type <chr> "track", "track", "track", "track",…
## $ track.uri <chr> "spotify:track:0WQiDwKJclirSYG9v5ta…
## $ track.album.album_type <chr> "album", "album", "album", "album",…
## $ track.album.artists <list> [<data.frame[1 x 6]>], [<data.frame…
## $ track.album.available_markets <list> <"AD", "AE", "AG", "AL", "AM", "AO"…
## $ track.album.href <chr> "https://api.spotify.com/v1/albums…
## $ track.album.id <chr> "5Y0p2XCgRRIjna91aQE8q7", "33qkK1b…
## $ track.album.images <list> [<data.frame[3 x 3]>], [<data.frame…
## $ track.album.name <chr> "The Queen Is Dead", "Unknown Pleas…
## $ track.album.release_date <chr> "1986-06-16", "1979-06-01", "2006-…
## $ track.album.release_date_precision <chr> "day", "day", "day", "day", "day", …
## $ track.album.total_tracks <int> 10, 22, 25, 14, 12, 9, 40, 17, 24, …
## $ track.album.type <chr> "album", "album", "album", "album",…
## $ track.album.uri <chr> "spotify:album:5Y0p2XCgRRIjna91aQE8…
## $ track.album.external_urls.spotify <chr> "https://open.spotify.com/album/5Y0…
## $ track.external_ids.isrc <chr> "GBCRL1100054", "GBAAP0600166", "QM…
## $ track.external_urls.spotify <chr> "https://open.spotify.com/track/0WQ…
A wealth of information in front of us. A few key highlights:
- The played_at column contains a timestamp of when you played the corresponding track.
- track.artists contains a list of artists, per song. Presumably this is to handle cases in which multiple artists contribute to a single song. We might have to deal with this later.
- track.duration_ms contains the duration of each song, denoted in milliseconds.
- track.id and track.name contain the unique ID assigned by Spotify to each track, along with its actual name.
- Predictably, track.album.artists, track.album.name, track.album.id and that entire family contain information pertaining to the album a track belongs to.
Apart from data related to track names and album information, Spotify also grants us access to data pertaining to the actual contents of each track itself. Information like their tempo, their acousticness, and so on. You can explore all they offer here.
Of course, I’m interested in looking at the audio features of songs I’m listening to. Luckily, the get_track_audio_features() function lets me do just that.
Ideally, I’d like to do a combination of some rudimentary data cleaning, followed by some charts. I’m going to create a function that’ll combine this small workflow into one.
Before we begin, though, I’d like to change the default theme ggplot2 uses.
theme_set(theme_solarized_2())
import_roboto_condensed()
Okay. Onto the function. The function below primarily has 3 segments:
- Initial transformations
- Reshaping
- Plotting
Let’s go through the code segment by segment:
played_tracks_analysis <- function(df) {
df_processed <- df %>%
mutate(
artist.name = map_chr(track.artists, ~ .$name[1]),
played_at = as_datetime(played_at),
track.duration = lubridate::ms(paste((track.duration_ms / 1000) %/% 60,
":",
as.integer(round((
track.duration_ms / 1000
) %% 60), 0),
sep = ""
))
) %>%
select(
track.id,
track.name,
artist.name,
track.album.name,
track.duration,
track.popularity,
played_at
)
df_ids <- df %>%
pull(track.id) %>%
as.vector()
Most of the above is fairly self-explanatory. We simply modify data types explicitly (as in the case of the played_at and track.duration columns), followed by choosing columns relevant to us. It’s worth highlighting how I’ve handled the data present in the track.artists column. Since it’s a list-column, we needed a way to pick specific elements for the purpose of our analysis, instead of all the elements present in each list within that column. I do this via map_chr(), which executes a function on each element of a vector, and returns a character vector. In this case, I’m executing an anonymous function to return the first artist name from each list-item. Hence, ~.$name[1].
After this, I store all the track IDs in a separate vector.
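To make the list-column handling concrete, here is a small sketch of the same idea on toy data. It uses base R’s vapply() as a stand-in for map_chr() so that it runs without any packages; the artist names are made up for illustration.

```r
# Toy stand-in for the track.artists list-column: one data frame per track,
# one row per credited artist (names are made up).
track_artists <- list(
  data.frame(name = c("Artist A", "Artist B"), stringsAsFactors = FALSE),
  data.frame(name = "Artist C", stringsAsFactors = FALSE)
)

# vapply() is the base-R analogue of purrr::map_chr(): apply a function to
# each element and insist on exactly one string per element.
first_artist <- vapply(
  track_artists,
  function(artists) artists$name[1],
  character(1)
)
first_artist
```

The second artist on the first track is silently dropped, which is the trade-off being made in the function above as well.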
Okay. Onto the next segment.
df_audio_features_long <-
get_track_audio_features(ids = df_ids) %>%
mutate(
track.duration = lubridate::ms(paste((duration_ms / 1000) %/% 60,
":",
as.integer(round((duration_ms / 1000) %% 60), 0),
sep = ""
)),
track_minutes = minute(track.duration)
) %>%
select(-duration_ms) %>%
pivot_longer(
cols = c(
"danceability":"energy",
"loudness",
"speechiness":"tempo",
"track_minutes"
),
names_to = "audio_feature"
) %>%
select(id, everything(),-c("uri":"analysis_url"))
As you can see, I’m passing the IDs I stored earlier as a parameter to the get_track_audio_features() function. I then convert the duration_ms column to track.duration, a lubridate period of minutes and seconds (I believe that is more intuitive than milliseconds), and use lubridate::minute() to pull out the minutes component as track_minutes. Next, I use the pivot_longer() function from tidyr to change the format of the dataframe so far, from wide to long. I do this for a visualization I have in mind that we will be getting to in just a minute. We conclude by selecting what’s relevant.
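The duration arithmetic in both mutate() calls boils down to integer division and modulo. As a rough sketch in base R, using the first three track.duration_ms values from the glimpse() output earlier (rounding to whole seconds first is my own choice here, so that a remainder like 59.6 seconds cannot show up as 60):

```r
# First three track.duration_ms values from the glimpse() output
duration_ms <- c(244586, 233719, 110942)

# Round to whole seconds before splitting into minutes and seconds
total_seconds <- round(duration_ms / 1000)
minutes <- total_seconds %/% 60   # whole minutes
seconds <- total_seconds %% 60    # leftover seconds

# Zero-padded "m:ss" labels
duration_label <- sprintf("%d:%02d", as.integer(minutes), as.integer(seconds))
duration_label
```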
Finally, we wrap up by plotting:
df_audio_features_long %>%
ggplot(aes(value)) +
geom_density(aes(fill = audio_feature), alpha = 0.5) +
scale_fill_brewer(palette = "Set3", type = "qual") +
facet_wrap(~ audio_feature, scale = "free") +
labs(
title = "Distribution of audio features",
subtitle = "Recently played tracks only",
x = "",
y = "Density",
fill = "Audio feature"
)
A simple facet plot, showing the distribution of the audio features Spotify tracks for the songs I’ve listened to recently.
Combining everything, we end up with:
played_tracks_analysis <- function(df) {
df_processed <- df %>%
mutate(
artist.name = map_chr(track.artists, ~ .$name[1]),
played_at = as_datetime(played_at),
track.duration = lubridate::ms(paste((track.duration_ms / 1000) %/% 60,
":",
as.integer(round((
track.duration_ms / 1000
) %% 60), 0),
sep = ""
))
) %>%
select(
track.id,
track.name,
artist.name,
track.album.name,
track.duration,
track.popularity,
played_at
)
df_ids <- df %>%
pull(track.id) %>%
as.vector()
df_audio_features_long <-
get_track_audio_features(ids = df_ids) %>%
mutate(
track.duration = lubridate::ms(paste((duration_ms / 1000) %/% 60,
":",
as.integer(round((duration_ms / 1000) %% 60), 0),
sep = ""
)),
track_minutes = minute(track.duration)
) %>%
select(-duration_ms) %>%
pivot_longer(
cols = c(
"danceability":"energy",
"loudness",
"speechiness":"tempo",
"track_minutes"
),
names_to = "audio_feature"
) %>%
select(id, everything(),-c("uri":"analysis_url"))
df_audio_features_long %>%
ggplot(aes(value)) +
geom_density(aes(fill = audio_feature), alpha = 0.5) +
scale_fill_brewer(palette = "Set3", type = "qual") +
facet_wrap(~ audio_feature, scale = "free") +
labs(
title = "Distribution of audio features",
x = "",
y = "Density",
fill = "Audio feature"
) +
theme(legend.position = "bottom")
}
played_tracks_analysis(get_my_recently_played(limit = 50, authorization = authorization_code))
There you have it! A concise little plot showing what sort of songs I’ve been listening to lately (the 50 most recent, at least), in terms of their acoustic characteristics. For instance, a quick look at the speechiness and instrumentalness charts tells me that I’ve lately been listening to a lot of songs with low amounts of actual spoken speech, with more instrumental pieces. This completely makes sense since I mainly listen to OSTs while working (as I am right now at the time of writing). For a full explanation of what these features mean, I’d recommend checking out the official API documentation, but for the most part the names are fairly self-explanatory.
Analyzing the playlists I listen to
If you’re like me and keep unreasonably long custom playlists… this section might be of relevance. I primarily maintain two custom playlists with songs I can listen to multiple times:
- One containing songs in English
- One with everything that isn’t English. This includes OSTs, instrumental music, and songs from other languages.
The Spotify API (or rather the spotifyr package) has, you guessed it, a get_playlist_audio_features() function that does exactly what the name says. All the function needs is your username, the unique ID of your playlist (you can see this when you opt to share your playlist), along with the authentication token you should already have at this point.
I’ve created playlists_audio_features, which simply contains the results of a call to get_playlist_audio_features(), and which I’ll be using further.
You can do something really similar with get_my_top_artists_or_tracks().
Let’s have a look at what the variable we just created contains.
playlists_audio_features %>%
names()
## [1] "playlist_id" "playlist_name"
## [3] "playlist_img" "playlist_owner_name"
## [5] "playlist_owner_id" "danceability"
## [7] "energy" "key"
## [9] "loudness" "mode"
## [11] "speechiness" "acousticness"
## [13] "instrumentalness" "liveness"
## [15] "valence" "tempo"
## [17] "track.id" "analysis_url"
## [19] "time_signature" "added_at"
## [21] "is_local" "primary_color"
## [23] "added_by.href" "added_by.id"
## [25] "added_by.type" "added_by.uri"
## [27] "added_by.external_urls.spotify" "track.artists"
## [29] "track.available_markets" "track.disc_number"
## [31] "track.duration_ms" "track.episode"
## [33] "track.explicit" "track.href"
## [35] "track.is_local" "track.name"
## [37] "track.popularity" "track.preview_url"
## [39] "track.track" "track.track_number"
## [41] "track.type" "track.uri"
## [43] "track.album.album_type" "track.album.artists"
## [45] "track.album.available_markets" "track.album.href"
## [47] "track.album.id" "track.album.images"
## [49] "track.album.name" "track.album.release_date"
## [51] "track.album.release_date_precision" "track.album.total_tracks"
## [53] "track.album.type" "track.album.uri"
## [55] "track.album.external_urls.spotify" "track.external_ids.isrc"
## [57] "track.external_urls.spotify" "video_thumbnail.url"
## [59] "key_name" "mode_name"
## [61] "key_mode"
At first glance… the majority of column names seem to be columns we’ve come across already, making subsequent steps a lot easier.
We first modify the data we just generated as follows. As you can see, the transformations are fairly simple, with map_chr and ~.$name[1] being used once again to retain a single artist from the track.artists list-column.
playlists_audio_processed <- playlists_audio_features %>%
mutate_at(c("key", "mode"), as.factor) %>%
mutate(
added_at = as_datetime(added_at),
year = year(added_at),
month = month(added_at),
artists = map_chr(track.artists, ~ .$name[1]),
)
I choose to visualize the above as a faceted barplot. I use slice_max() with a group_by() to keep only a specific number of rows belonging to each group in our grouped dataframe. We choose to keep ties in this case.
After that, we wrap things up with the tried and tested duo of geom_col() and facet_wrap().
playlists_audio_processed %>%
count(artists, year, sort = TRUE) %>%
mutate(year = as.factor(year)) %>%
group_by(year) %>%
slice_max(n = 5, order_by = n) %>%
ungroup() %>%
ggplot(aes(reorder_within(
x = artists, by = n, within = year
))) +
geom_col(aes(y = n, fill = year), color = 'black') +
coord_flip() +
scale_x_reordered() +
scale_fill_brewer(palette = "Set3", type = "qual") +
facet_wrap(~ year, scale = "free_y") +
labs(
x = "Artist listened to",
y = "# of tracks added",
title = "# of tracks added by artists, per year",
subtitle = "Top-5 artists yearly",
fill = "Year"
) +
theme(legend.position = 'bottom')
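Keeping ties means a group can end up with more rows than the number you asked for. A small sketch on made-up counts (the artists and numbers are purely illustrative) shows the effect:

```r
library(dplyr)

# Made-up counts for illustration
toy_counts <- tibble(
  year   = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  artist = c("A", "B", "C", "D", "E", "F", "G"),
  n      = c(5, 3, 3, 1, 4, 2, 1)
)

top_two <- toy_counts %>%
  group_by(year) %>%
  slice_max(order_by = n, n = 2) %>%  # with_ties = TRUE is the default
  ungroup()

top_two
```

For 2020, artists B and C are tied for second place, so that year keeps three rows instead of two. Passing with_ties = FALSE would cap every group at exactly two rows.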
It doesn’t take a music aficionado to see how the artists whose songs I choose to add to my library change over the years (well, playlist, but I only really listen to songs from my playlists and not simply my library, so it’s essentially the same). I go from listening to (and liking) songs by Queen to The Offspring to Dire Straits. However, come 2021 and… that changes. This is the year I first played Supergiant’s Hades and really began to appreciate video game music, so much so that I began listening to video game OST albums. In that same year, you’ll find Darren Korb, who is the man behind the Hades soundtrack, and Borislav Slavov of Divinity: Original Sin II, to name a few. You’ll also begin to find popular Japanese acts like RADWIMPS and Aimer from this year onwards.
How can I further extend what we have so far? Simple enough. Let’s look at lyrics.
Web scraping and link cleaning
The rvest package enables us to scrape elements off a webpage of our choosing via a few lines of code. This, in combination with the SelectorGadget Chrome extension, enables us to pinpoint exactly which element we need to extract, without going through the cumbersome Inspect element process.
I’m going to try scraping lyrics off the songlyrics.com website. I had to run test scripts on a few other websites first, because not every website allows itself to be scraped.
Before we actually use rvest, we have another problem on our hands. We need valid links to actually scrape through… links we simply do not have right now. To create valid links, we need to look at the format of links already present on our website of choice, following which we need to manipulate the data we have right now (track and artist names) into a format that matches what songlyrics.com already has.
The format of a valid link (in this case) resembles the following:
https://www.songlyrics.com/<ARTIST_NAME>/<TRACK_NAME>-lyrics
Of course, punctuation marks like question marks, exclamation marks, etc. are removed as well. What does this mean? Well… regex to the rescue!
Let’s have a look at my 50 most recently played tracks.
You can’t look at more than your 50 most recent songs.
recently_played <- get_my_recently_played(limit = 50, authorization = authorization_code)
recently_played %>%
glimpse()
## Rows: 50
## Columns: 31
## $ played_at <chr> "2023-05-25T17:00:28.965Z", "2023-0…
## $ context <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ track.artists <list> [<data.frame[1 x 6]>], [<data.fram…
## $ track.available_markets <list> <"AR", "AU", "AT", "BE", "BO", "BR…
## $ track.disc_number <int> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1,…
## $ track.duration_ms <int> 244586, 233719, 110942, 234946, 151…
## $ track.explicit <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ track.href <chr> "https://api.spotify.com/v1/tracks/…
## $ track.id <chr> "0WQiDwKJclirSYG9v5tayI", "4ZuC5MfG…
## $ track.is_local <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ track.name <chr> "There Is a Light That Never Goes O…
## $ track.popularity <int> 80, 61, 48, 65, 54, 76, 42, 72, 60,…
## $ track.preview_url <chr> "https://p.scdn.co/mp3-preview/23f6…
## $ track.track_number <int> 9, 7, 1, 12, 9, 5, 13, 1, 24, 2, 2,…
## $ track.type <chr> "track", "track", "track", "track",…
## $ track.uri <chr> "spotify:track:0WQiDwKJclirSYG9v5ta…
## $ track.album.album_type <chr> "album", "album", "album", "album",…
## $ track.album.artists <list> [<data.frame[1 x 6]>], [<data.fram…
## $ track.album.available_markets <list> <"AD", "AE", "AG", "AL", "AM", "AO…
## $ track.album.href <chr> "https://api.spotify.com/v1/albums/…
## $ track.album.id <chr> "5Y0p2XCgRRIjna91aQE8q7", "33qkK1br…
## $ track.album.images <list> [<data.frame[3 x 3]>], [<data.fram…
## $ track.album.name <chr> "The Queen Is Dead", "Unknown Pleas…
## $ track.album.release_date <chr> "1986-06-16", "1979-06-01", "2006-0…
## $ track.album.release_date_precision <chr> "day", "day", "day", "day", "day", …
## $ track.album.total_tracks <int> 10, 22, 25, 14, 12, 9, 40, 17, 24, …
## $ track.album.type <chr> "album", "album", "album", "album",…
## $ track.album.uri <chr> "spotify:album:5Y0p2XCgRRIjna91aQE8…
## $ track.album.external_urls.spotify <chr> "https://open.spotify.com/album/5Y0…
## $ track.external_ids.isrc <chr> "GBCRL1100054", "GBAAP0600166", "QM…
## $ track.external_urls.spotify <chr> "https://open.spotify.com/track/0WQ…
I’m also going to be storing the base URL of the website we’ll be looking at.
html_url <- "https://www.songlyrics.com/"
Okay… now to get to the nitty-gritty of actually manipulating the artist and track names we have so that they fit nicely with the base URL we just stored. Remember library(stringr)? Now is where we use it.
Using a mix of str_to_lower() (to convert all characters to lower-case, for consistency) and str_replace_all(), I first replace all spaces with hyphens. This is to maintain consistency with the nomenclature used by the website I’ve chosen.
recently_played %>%
mutate(
artist.name = map_chr(track.artists, ~ .$name[1]),
url_string = paste(
str_to_lower(str_replace_all(artist.name, " ", "-")),
"/",
str_to_lower(str_replace_all(track.name, " ", "-")),
sep = ""
)
) %>%
select(url_string) %>%
head(10)
## url_string
## 1 the-smiths/there-is-a-light-that-never-goes-out---2011-remaster
## 2 joy-division/shadowplay---2007-remaster
## 3 jeremy-soule/reign-of-the-septims
## 4 twenty-one-pilots/hometown
## 5 chuck-berry/rock-and-roll-music
## 6 steve-miller-band/the-joker
## 7 bruce-springsteen/born-in-the-u.s.a.---live-at-la-coliseum,-los-angeles,-ca---september-1985
## 8 måneskin/own-my-mind
## 9 coldplay/miracles
## 10 john-waite/change
Now, here’s the interesting part. Spotify contains multiple versions of songs. Remasters, live versions, you name it. That’s why sometimes, depending on when I run this code, I can get a string that looks like this:
franz-ferdinand/this-fffire---new-version
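See the triple hyphen? That’s because Spotify separates the version suffix from the title with a hyphen flanked by spaces (“ - ”); each of those two spaces has now been replaced with a hyphen, leaving three in a row. There’s a similar pattern for remasters too, like so:
david-bowie/modern-love---2018-remaster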
Clearly, we have further cleaning to do. This is where we use regex. We’ll be tacking on the following section of code to what we’ve got so far.
mutate(
  url_to_scrape = str_split(url_string, "---")[[1]][1],
  url_to_scrape = str_replace_all(url_to_scrape, "[!.?,\'\"]", ""),
  url_to_scrape = paste(html_url, url_to_scrape, "-lyrics", sep = "")
)
Let’s break this down.
- str_split() returns a list. Since I’m using the --- characters as a delimiter, we only really need the first element of the list being returned. This is, of course, based off the assumption of naming conventions following a consistent format, something akin to <SONG_NAME>---remaster.
- We want to replace punctuation characters since they’re liable to break links. I use str_replace_all() to replace punctuation marks with empty strings.
- What follows next is a call to paste() to combine the base URL, the manipulated string so far, and the hard-coded string -lyrics.
- That’s how david-bowie/modern-love---2018-remaster becomes this: https://www.songlyrics.com/david-bowie/modern-love-lyrics.
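The steps above can also be folded into one small helper. This is only a sketch in base R (gsub(), sub() and tolower() standing in for the stringr calls), and make_lyrics_url() is a name I made up for illustration. It assumes, like the code above, that everything after the first --- is a version suffix that can be discarded.

```r
# Hypothetical helper: builds a songlyrics.com-style URL from an artist
# name and a Spotify track name.
make_lyrics_url <- function(artist, track,
                            base_url = "https://www.songlyrics.com/") {
  # Spaces become hyphens, everything is lower-cased
  slug <- paste0(
    gsub(" ", "-", artist, fixed = TRUE),
    "/",
    gsub(" ", "-", track, fixed = TRUE)
  )
  slug <- tolower(slug)
  # Drop the version suffix (remaster, live, ...) after the first "---"
  slug <- sub("---.*$", "", slug)
  # Strip punctuation that is liable to break links
  slug <- gsub("[!.?,'\"]", "", slug)
  paste0(base_url, slug, "-lyrics")
}

make_lyrics_url("David Bowie", "Modern Love - 2018 Remaster")
```

Because sub() and gsub() are vectorised, the helper can be applied to whole columns at once.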
Let’s combine what we’ve got so far.
recently_played_processed <- recently_played %>%
mutate(
artist.name = map_chr(track.artists, ~ .$name[1]),
url_string = paste(
str_to_lower(str_replace_all(artist.name, " ", "-")),
"/",
str_to_lower(str_replace_all(track.name, " ", "-")),
sep = ""
)
) %>%
rowwise() %>%
mutate(
url_to_scrape = str_split(url_string, "---")[[1]][1],
url_to_scrape = str_replace_all(url_to_scrape, "[!.?,\'\"]", ""),
url_to_scrape = paste(html_url, url_to_scrape, "-lyrics", sep = "")
)
recently_played_processed %>%
select(url_to_scrape)
## # A tibble: 50 × 1
## # Rowwise:
## url_to_scrape
## <chr>
## 1 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-l…
## 2 https://www.songlyrics.com/joy-division/shadowplay-lyrics
## 3 https://www.songlyrics.com/jeremy-soule/reign-of-the-septims-lyrics
## 4 https://www.songlyrics.com/twenty-one-pilots/hometown-lyrics
## 5 https://www.songlyrics.com/chuck-berry/rock-and-roll-music-lyrics
## 6 https://www.songlyrics.com/steve-miller-band/the-joker-lyrics
## 7 https://www.songlyrics.com/bruce-springsteen/born-in-the-usa-lyrics
## 8 https://www.songlyrics.com/måneskin/own-my-mind-lyrics
## 9 https://www.songlyrics.com/coldplay/miracles-lyrics
## 10 https://www.songlyrics.com/john-waite/change-lyrics
## # ℹ 40 more rows
You’ll notice a call to rowwise(). A snippet from ?rowwise():
rowwise() allows you to compute on a data frame a row-at-a-time. This is most useful when a vectorised function doesn’t exist.
If I didn’t have the call to rowwise(), here’s what the generated URLs would look like.
## url_to_scrape
## 1 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 2 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 3 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 4 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 5 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 6 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 7 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 8 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 9 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 10 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 11 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 12 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 13 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 14 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 15 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 16 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 17 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 18 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 19 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 20 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 21 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 22 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 23 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 24 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 25 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 26 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 27 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 28 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 29 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 30 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 31 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 32 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 33 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 34 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 35 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 36 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 37 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 38 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 39 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 40 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 41 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 42 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 43 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 44 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 45 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 46 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 47 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 48 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 49 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
## 50 https://www.songlyrics.com/the-smiths/there-is-a-light-that-never-goes-out-lyrics
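Every row simply gets replaced with the value from the first row. Without rowwise(), mutate() hands str_split() the entire column at once, so [[1]][1] picks the first piece of the first track only, and that single value is then recycled across all 50 rows. Converting the preceding dataframe into a rowwise_df gets us past this by making our operations execute on a row-by-row basis.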
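The recycling is easy to reproduce outside of a dataframe. Here is a short sketch using base R’s strsplit() in place of str_split() (both return a list with one element per input string), on three slugs in the format generated earlier:

```r
url_string <- c(
  "david-bowie/modern-love---2018-remaster",
  "coldplay/miracles",
  "joy-division/shadowplay---2007-remaster"
)

# A list with one element per input string
pieces <- strsplit(url_string, "---")

# What the non-rowwise version computes: the first piece of the FIRST string only
only_first <- pieces[[1]][1]
only_first

# A vectorised alternative: take the first piece of EVERY string
first_piece <- vapply(pieces, function(p) p[1], character(1))
first_piece
```

Either approach, rowwise() or indexing each list element individually, gives one cleaned slug per track.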
Okay, great. So we now have links that at least appear to be valid. We still need to work on actually extracting lyrics from those websites, and to handle cases in which the artificially generated links don’t actually exist.
Enter rvest.
Remember the SelectorGadget Chrome extension I mentioned earlier? Installing that makes leveraging the capabilities of rvest (and other alternatives like Python’s BeautifulSoup) a lot easier. These work on the principle of using identifiers known as selectors that correspond to a specific element present on a webpage. Instead of diving into the nested div tags present in the source code of the webpage, you simply activate the extension and click the element you’re focusing on. It takes a few tries, but you should be able to find selectors corresponding to the element of concern.
Selectors are generally of two types: XPath and CSS. I choose to use XPath here.
Let’s do a test run, with a hard-coded URL.
test_url <- "https://www.songlyrics.com/system-of-a-down/toxicity-lyrics"
test_url %>%
read_html() %>%
html_elements(xpath = '//*[(@id = "songLyricsDiv")]') %>%
html_text()
## [1] "Conversion software version 7.0\nLooking at life through the eyes of a tire hub\nEating seeds as a pastime activity\nThe toxicity of our city, of our city\nNow, what do you own the world?\nHow do you own disorder, disorder\nNow somewhere between the sacred silence\nSacred silence and sleep\nSomewhere, between the sacred silence and sleep\nMore wood for the fires, loud neighbours\nFlashlight reveries caught in the headlights of a truck\nEating seeds as a pastime activity\nThe toxicity of our city, of our city\nNow, what do you own the world?\nHow do you own disorder, disorder\nNow somewhere between the sacred silence\nSacred silence and sleep\nSomewhere between the sacred silence and sleep\nNow, what do you own the world?\nHow do you own disorder, disorder\nNow somewhere between the sacred silence\nSacred silence and sleep\nSomewhere, between the sacred silence and sleep\nWhen I became the sun\nI shone life into the man's hearts\nWhen I became the sun\nI shone life into the man's hearts"
The read_html() -> html_elements() -> html_text() chain of functions works like this:
- read_html() reads all the HTML elements off the URL we specified.
- html_elements() returns the contents of the element we specify via the xpath parameter; in this case, a div with the ID of songLyricsDiv.
- html_text() returns all the text from the element specified.
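If you want to poke at this chain without hitting the website, rvest’s minimal_html() can build a page from a string. The sketch below uses made-up markup; only the songLyricsDiv ID mirrors the real site. It also shows what comes back when the XPath matches nothing.

```r
library(rvest)

# A tiny in-memory page, so nothing is fetched over the network.
# The markup is made up; only the id mirrors the real site.
page <- minimal_html('<div id="songLyricsDiv">First line of a song</div>')

found <- page %>%
  html_elements(xpath = '//*[(@id = "songLyricsDiv")]') %>%
  html_text()

# An XPath that matches nothing yields a zero-length character vector
not_found <- page %>%
  html_elements(xpath = '//*[(@id = "noSuchDiv")]') %>%
  html_text()

found
length(not_found)
```

The zero-length result is worth keeping in mind: a page can load successfully and still contain no lyrics element.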
And there you go! With barely 5 lines of code, we have lyrics from a link we generated!
Of course, our work isn’t done yet. Our code won’t work if the link doesn’t exist, since the code will have nothing to scrape. I choose to handle this by returning NA if the code throws an error, followed by wrapping up everything nicely in a convenient function:
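scrape_lyrics <- function(url) {
  lyrics <- tryCatch(
    read_html(x = url) %>%
      html_elements(xpath = '//*[(@id = "songLyricsDiv")]') %>%
      html_text(),
    # A missing page raises an error; treat it as "no lyrics"
    error = function(e) {
      NA_character_
    }
  )
  # A page that loads but lacks the lyrics element gives a zero-length
  # result; return NA so map_chr() always receives exactly one string
  if (length(lyrics) == 0) {
    return(NA_character_)
  }
  lyrics[1]
}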
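The error-to-NA pattern is easier to see in isolation. In the sketch below, fetch_page() is a made-up stand-in for a network call that fails on unknown URLs, and the example.com addresses are placeholders.

```r
# Hypothetical stand-in for a page fetch: errors for unknown URLs
fetch_page <- function(url) {
  known <- c("https://example.com/exists" = "some lyrics")
  if (!url %in% names(known)) {
    stop("HTTP error 404")
  }
  unname(known[url])
}

# Wrap the risky call: any error becomes a missing value instead of
# aborting the whole run
safe_fetch <- function(url) {
  tryCatch(fetch_page(url), error = function(e) NA_character_)
}

urls <- c("https://example.com/exists", "https://example.com/missing")
results <- vapply(urls, safe_fetch, character(1), USE.NAMES = FALSE)
results
```

Returning NA_character_ rather than a bare NA keeps the result a character vector, which matters for functions that insist on one string per element.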
Here’s the fun part. Now we run our function on each URL we generated earlier, via map_chr() from the purrr package. This lets us run a function of our choosing on each element in a vector, returning the results as a character vector.
recently_played_processed <- recently_played_processed %>%
mutate(lyrics = map_chr(url_to_scrape, scrape_lyrics))
recently_played_processed %>%
filter(!is.na(lyrics)) %>%
select(artist.name, track.name, lyrics)
## # A tibble: 44 × 3
## # Rowwise:
## artist.name track.name lyrics
## <chr> <chr> <chr>
## 1 The Smiths There Is a Light That Never Goes Out - 2011 … "Take…
## 2 Joy Division Shadowplay - 2007 Remaster "To t…
## 3 Twenty One Pilots Hometown "[Hoo…
## 4 Chuck Berry Rock And Roll Music "Just…
## 5 Steve Miller Band The Joker "Some…
## 6 Bruce Springsteen Born In The U.S.A. - Live at LA Coliseum, Lo… "We d…
## 7 Coldplay Miracles "From…
## 8 John Waite Change "Peop…
## 9 Electric Light Orchestra Confusion "Ever…
## 10 Red Hot Chili Peppers Wet Sand "My s…
## # ℹ 34 more rows
Fantastic stuff, if I say so myself. Keep in mind you can do this with get_my_top_artists_or_tracks() as well, although running it might take a while depending on how many tracks you specify.
Analyzing the scraped lyrics
Look at how far we’ve come. Now (briefly, I know this has been a ride) let’s look at how we can handle the text. An older blog post of mine deals with this area in more detail, but that doesn’t mean we’re not going to try a few things on the wealth of data we have now.
We begin by tokenizing our text so far via tidytext::unnest_tokens(), followed by removing common stopwords.
recently_played_tokenized <- recently_played_processed %>%
filter(!is.na(lyrics)) %>%
select(artist.name, track.name, lyrics) %>%
unnest_tokens(word, lyrics) %>%
anti_join(stop_words)
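As a rough mental model of what tokenizing and stopword removal do to a single lyric, here is a base-R approximation. It is not how tidytext is implemented, and the stop word list is a tiny made-up one; the lyric fragment comes from the test scrape earlier.

```r
lyrics <- "Now, what do you own the world? How do you own disorder, disorder"

# Lower-case, then split on anything that is not a letter or apostrophe
tokens <- strsplit(tolower(lyrics), "[^a-z']+")[[1]]
tokens <- tokens[nzchar(tokens)]

# Tiny made-up stop word list; tidytext's stop_words is far longer
toy_stop_words <- c("now", "what", "do", "you", "the", "how")

# Equivalent in spirit to the anti_join(): drop every token on the list
kept <- tokens[!tokens %in% toy_stop_words]
kept
```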
We’re cruising! We’ve got a dataframe containing our tokenized lyrics. What’s left is some rudimentary text analysis. If you’ve been keeping up so far, why, the world is your oyster.
recently_played_tokenized %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
mutate(word_sentiment_count = if_else(str_detect(sentiment, "negative"), -1*n, n)) %>%
group_by(sentiment) %>%
slice_max(n, n = 10) %>%
ungroup() %>%
mutate(word = reorder_within(word, word_sentiment_count, sentiment)) %>%
ggplot(aes(word, word_sentiment_count)) +
geom_col(aes(fill = sentiment)) +
scale_x_reordered() +
coord_flip() +
labs(title = "Word count per sentiment", subtitle = "Bing lexicon used", fill = "Sentiment", x = "Word", y = "Word count")
A few directions you could take this further:
- Track sentiment over time. Are you listening to songs that get sadder by the year?
- Experiment with different lexicons. The Bing lexicon only classifies words as positive or negative, whereas the NRC lexicon has 10 different categories (eight emotions plus positive and negative), for instance.
- Explore the text on a per-artist basis. What words appear more often in songs performed by a particular artist?
I might play around with my data outside this blog, but for the time being, thank you for sticking to the end of the post. I’m happy to answer any questions you might have!