Getting the hang of the tidymodels metapackage has been on my radar for quite a while now. Developed from the ground up as the successor to the caret package, tidymodels aims to do for machine learning and modelling what the tidyverse did for data analysis and visualization: provide a suite of well-integrated packages that make it easier for us to focus on getting results rather than the nitty-gritty of individual packages.
Running a basic analysis on a tidytuesday dataset is a great way to get a taste of tidymodels, especially considering the hiatus my blog has been on. I haven't had the opportunity to use R much lately, so I'm trying to be more regular about it in my free time.
Anyway, down to business.
library(tidyverse)
library(tidymodels)
spotify_songs <-
readr::read_csv(
'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv'
) # reading in the data
spotify_songs %>%
glimpse() # inspecting the data
## Rows: 32,833
## Columns: 23
## $ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
## $ track_name <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
## $ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
## $ track_popularity <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
## $ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
## $ track_album_name <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
## $ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
## $ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
## $ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
## $ playlist_subgenre <chr> "dance pop", "dance pop", "dance pop", "dance…
## $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
## $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
## $ key <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
## $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
## $ mode <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
## $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
## $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
## $ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
## $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
## $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
## $ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
## $ duration_ms <dbl> 194754, 162600, 176616, 169093, 189052, 16304…
There seem to be quite a few things of note captured in the data! We see a couple of identifiers for each song (or track, as it's called here), along with playlist identifiers. Details like a track's popularity and the playlist genre stand out too, but it's the latter half of the variables that seem particularly intriguing. Have a look at the Spotify API here.
I feel like we could build a basic classification model to predict song genres from the given features, but before we do that, let's have a better look at the data first.
spotify_songs <- spotify_songs %>%
mutate_if(is.character, as.factor)
spotify_songs %>%
count(playlist_genre, sort = TRUE)
## # A tibble: 6 × 2
## playlist_genre n
## <fct> <int>
## 1 edm 6043
## 2 rap 5746
## 3 pop 5507
## 4 r&b 5431
## 5 latin 5155
## 6 rock 4951
spotify_songs %>%
count(track_artist, sort = TRUE)
## # A tibble: 10,693 × 2
## track_artist n
## <fct> <int>
## 1 Martin Garrix 161
## 2 Queen 136
## 3 The Chainsmokers 123
## 4 David Guetta 110
## 5 Don Omar 102
## 6 Drake 100
## 7 Dimitri Vegas & Like Mike 93
## 8 Calvin Harris 91
## 9 Hardwell 84
## 10 Kygo 83
## # ℹ 10,683 more rows
Wow, a lot of EDM in this dataset. The differences in numbers aren't all that big amongst the genres, though. Exploring how the features behave across genres could prove to be a worthwhile endeavour, however.
I start off with a pretty basic visualization depicting song durations across playlist genres.
theme_set(theme_light())
spotify_songs %>%
ggplot(aes(duration_ms)) +
geom_density(aes(fill = playlist_genre), alpha = 0.8) +
geom_vline(
aes(xintercept = median(duration_ms)),
linetype = 'dashed',
linewidth = 1,
alpha = 0.8
) +
scale_x_continuous(labels = scales::number_format()) +
scale_y_continuous(labels = scales::number_format()) +
facet_wrap(~ playlist_genre) +
theme(axis.text.x = element_text(angle = 30)) +
labs(
title = 'Distribution of song lengths',
subtitle = 'By playlist genre\nDashed line represents median song duration',
x = 'Duration (in ms)',
fill = 'Playlist genre'
) +
scale_fill_brewer(palette = 'Paired')
Rock songs seem to be the longest, and EDM songs the shortest. Not surprising there. We can quickly confirm this below:
spotify_songs %>%
mutate(playlist_genre = fct_reorder(playlist_genre, duration_ms, median)) %>%
ggplot(aes(playlist_genre, duration_ms)) +
geom_boxplot(aes(color = playlist_genre)) +
coord_flip() +
scale_y_continuous(labels = scales::number_format()) +
scale_color_brewer(palette = 'Paired') +
labs(
title = 'Median duration of songs',
subtitle = 'By playlist genre',
x = 'Playlist genre',
y = 'Duration (in ms)',
color = 'Playlist genre'
)
We could try inspecting each of the features across genres, but recreating similar plots for each feature would be tedious. I'm going to reshape the data using a relatively new function in the tidyverse, pivot_longer(), which is essentially a replacement for the gather() function.
spotify_songs %>%
pivot_longer(danceability:duration_ms, names_to = "feature") %>%
ggplot(aes(value)) +
geom_density(aes(fill = playlist_genre), alpha = 0.5) +
facet_wrap( ~ feature, scales = 'free') +
labs(title = 'Distribution of different features', subtitle = 'By playlist genre', fill = 'Playlist genre') +
scale_fill_brewer(palette = 'Paired') +
scale_y_continuous(labels = scales::number_format()) +
scale_x_continuous(labels = scales::number_format()) +
theme(axis.text.x = element_text(angle = 90))
There you go. In one plot, we can see the distributions of each of the features present in the dataset, split across genres. Pretty neat.
The instrumentalness variable does not seem to have a lot of variation across genres, so I'm going to remove it from consideration; it's not really going to add anything to the model I'll be attempting to train. Variables like key, valence, speechiness, danceability and energy, on the other hand, seem to be remarkably different across genres.
This brings us to the question I was looking to answer at the start of this blog post. I'm going to use the rsample, recipes, parsnip and yardstick packages from the tidymodels metapackage to split and preprocess the data, train a few machine learning algorithms, and evaluate their performance.
Let’s get started!
model_df <- spotify_songs %>%
select(-contains('track'),
-playlist_id,
-playlist_name,
-playlist_subgenre,
-instrumentalness)
set.seed(100)
model_splits <- model_df %>%
initial_split(prop = 0.8)
training_df <- training(model_splits)
test_df <- testing(model_splits)
dim(training_df)
## [1] 26266 12
dim(test_df)
## [1] 6567 12
The initial_split() function splits data into training and testing sets. If you look at what the function returns, you'll essentially just see a summary of the split, and no data. We can access the data via the training() and testing() functions respectively. The rsample package also contains a ton of other options, including resampling; I haven't checked them out yet, but I hope to do so soon!
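As a quick aside, here's a minimal sketch of two of those options: a split stratified by the outcome, and a set of cross-validation folds. (The stratification variable and fold count here are just illustrative choices; I don't use these objects below.)
set.seed(100)
# a split stratified by the outcome, so genre proportions are preserved
# in both the training and testing sets
strat_splits <- initial_split(model_df, prop = 0.8, strata = playlist_genre)
# five cross-validation folds of the training data, also stratified by genre
cv_folds <- vfold_cv(training_df, v = 5, strata = playlist_genre)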
The recipes package makes it incredibly easy to preprocess different datasets: we specify the operations we wish to perform only once. Before recipes, we would have to remember the preprocessing steps we performed on our training set, taking care to perform exactly the same steps, in the same order, on our testing set. The recipes package removes this hassle from our workflow by letting us specify a recipe, which is essentially just a sequence of preprocessing steps we wish to perform. We do this via the variety of step_* functions present in the package.
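Just to illustrate the idea, a recipe chaining a few common steps might look something like this sketch (I stick to a single normalization step for the actual model below):
# a hypothetical recipe chaining several step_* functions (not used below)
example_recipe <- recipe(playlist_genre ~ ., data = training_df) %>%
  step_zv(all_predictors()) %>%                                  # drop zero-variance predictors
  step_corr(all_numeric(), -all_outcomes(), threshold = 0.9) %>% # drop highly correlated predictors
  step_normalize(all_numeric(), -all_outcomes())                 # centre and scale numeric predictors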
The predictors we wish to use have very different distributions, so a reliable option is to normalize them, i.e. centre and scale them to unit variance.
model_recipe <- recipe(playlist_genre ~ ., data = training_df) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
prep()
The prep() function estimates the required parameters for our preprocessing steps based on our training data. These same parameters are then applied to other datasets via the bake() function, enabling us to preprocess varying datasets in exactly the same way. I do this on our testing data below.
test_df_preprocessed <- bake(model_recipe, test_df)
Now that we’ve seen the data, it’s time to try and build some machine learning models!
The problem in front of us is essentially one of multiclass classification, and to solve it I'll be building three models:
- Decision Trees
- Random Forests
- K-Nearest Neighbours
Before the parsnip package came into being, R programmers had to keep in mind the varying interfaces and parameters of several machine learning packages. The same algorithm could have differently named parameters across packages! (Think randomForest and ranger with ntree and num.trees, respectively.) The parsnip package gives us one consistent interface, no matter the algorithm, as the sketch below shows.
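For instance, a single parsnip specification can be pointed at either engine, and translate() shows how the unified arguments map to each engine's own parameter names. (A quick sketch; I don't set the number of trees for the models below.)
# one specification, two engines: parsnip's `trees` argument becomes
# `num.trees` for ranger and `ntree` for randomForest
rf_spec <- rand_forest(mode = 'classification', trees = 500)
rf_spec %>% set_engine('ranger') %>% translate()
rf_spec %>% set_engine('randomForest') %>% translate()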
Let's train the decision tree first, using the decision_tree() function from parsnip.
set.seed(100)
rpart_model <- decision_tree(mode = 'classification') %>%
set_engine('rpart') %>%
fit(playlist_genre ~ ., data = juice(model_recipe))
summary(rpart_model)
## Length Class Mode
## lvl 6 -none- character
## spec 7 decision_tree list
## fit 14 rpart list
## preproc 1 -none- list
## elapsed 1 -none- list
## censor_probs 0 -none- list
rpart_model
## parsnip model object
##
## n= 26266
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 26266 21349 edm (0.19 0.16 0.17 0.16 0.17 0.15)
## 2) speechiness< 0.2330702 19630 15527 edm (0.21 0.16 0.19 0.15 0.092 0.19)
## 4) danceability>=-0.493343 13361 10389 edm (0.22 0.21 0.2 0.17 0.11 0.093)
## 8) tempo>=0.04157124 6171 3868 edm (0.37 0.15 0.17 0.1 0.12 0.084)
## 16) tempo< 0.2733446 3403 1484 edm (0.56 0.12 0.17 0.055 0.035 0.052) *
## 17) tempo>=0.2733446 2768 2162 rap (0.14 0.18 0.18 0.16 0.22 0.12) *
## 9) tempo< 0.04157124 7190 5341 latin (0.093 0.26 0.22 0.23 0.1 0.1)
## 18) duration_ms< 0.3397166 5311 3739 latin (0.1 0.3 0.25 0.17 0.1 0.078)
## 36) danceability>=0.3549124 3060 1928 latin (0.087 0.37 0.21 0.17 0.12 0.042) *
## 37) danceability< 0.3549124 2251 1550 pop (0.12 0.2 0.31 0.16 0.078 0.13) *
## 19) duration_ms>=0.3397166 1879 1160 r&b (0.065 0.15 0.13 0.38 0.11 0.17) *
## 5) danceability< -0.493343 6269 3790 rock (0.18 0.06 0.19 0.12 0.053 0.4) *
## 3) speechiness>=0.2330702 6636 3892 rap (0.12 0.15 0.08 0.2 0.41 0.036) *
The juice() function helps us extract the preprocessed training data from the recipe we prepped earlier.
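If I recall correctly, on recent versions of recipes this is equivalent to baking the prepped recipe with new_data = NULL, so the two calls below should return the same preprocessed training set:
# both should return the training data with the prepped steps applied
juiced <- juice(model_recipe)
baked <- bake(model_recipe, new_data = NULL)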
Note the parsnip syntax being used here. You create a model specification using a function, then decide the engine used to implement the algorithm via set_engine(). The fit() function takes a formula specifying the outcome and the predictors being used. This approach stays the same across all the algorithms supported by parsnip, which is incredibly helpful. No longer does one have to look up documentation on the arguments and parameters required for the model we wish to train!
Let's look at the accuracy here. Keep in mind that with six genres, randomly guessing would only be correct about 1/6 of the time, i.e. with a probability of roughly 0.17.
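Another useful baseline (a quick sketch) is the no-information rate: the accuracy we'd get by always predicting the most common genre in the test set.
# proportion of each genre in the test set; the largest value is the
# accuracy of a model that always predicts that genre
test_df_preprocessed %>%
  count(playlist_genre) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop))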
rpart_model %>%
predict(new_data = test_df_preprocessed) %>%
mutate(truth = test_df_preprocessed$playlist_genre, model = 'decision tree') %>%
metrics(truth = truth, estimate = .pred_class)
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.399
## 2 kap multiclass 0.278
That's more than twice as good as randomly guessing. Not exactly impressive, but let's see how the other algorithms turn out.
Let’s train a random forest on the data now. Notice how the syntax hardly changes.
set.seed(100)
rf_model <- rand_forest(mode = 'classification') %>%
set_engine('ranger') %>%
fit(playlist_genre ~ ., data = juice(model_recipe))
rf_model %>%
predict(new_data = test_df_preprocessed) %>%
mutate(truth = test_df_preprocessed$playlist_genre) %>%
metrics(truth = truth, estimate = .pred_class)
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.554
## 2 kap multiclass 0.464
55%! That's a remarkable improvement over the previous model! The accuracy is of course still low, which really brings home how complex a task classifying songs by genre is, and how much careful feature selection it might take.
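To dig a little deeper than overall accuracy, yardstick's conf_mat() gives us a confusion matrix showing which genres the random forest mixes up (a quick sketch):
# confusion matrix for the random forest on the test set
rf_model %>%
  predict(new_data = test_df_preprocessed) %>%
  mutate(truth = test_df_preprocessed$playlist_genre) %>%
  conf_mat(truth = truth, estimate = .pred_class)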
The last model I'll be training is a nearest neighbours classifier, done similarly via parsnip.
knn_model <- nearest_neighbor(mode = 'classification') %>%
set_engine('kknn') %>%
fit(playlist_genre ~ ., data = juice(model_recipe))
knn_model
## parsnip model object
##
##
## Call:
## kknn::train.kknn(formula = playlist_genre ~ ., data = data, ks = min_rows(5, data, 5))
##
## Type of response variable: nominal
## Minimal misclassification: 0.5092134
## Best kernel: optimal
## Best k: 5
knn_model %>%
predict(new_data = test_df_preprocessed) %>%
mutate(truth = test_df_preprocessed$playlist_genre, model = 'knn') %>%
metrics(truth = truth, estimate = .pred_class)
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.444
## 2 kap multiclass 0.332
44%. That’s pretty okay, all things considered.
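To wrap up the comparison, we could stack the predictions from all three models and compute the metrics per model in one go; yardstick's metric functions respect dplyr groups, so something like this sketch should work:
# collect predictions from all three models and compare them side by side
bind_rows(
  rpart_model %>% predict(new_data = test_df_preprocessed) %>%
    mutate(model = 'decision tree'),
  rf_model %>% predict(new_data = test_df_preprocessed) %>%
    mutate(model = 'random forest'),
  knn_model %>% predict(new_data = test_df_preprocessed) %>%
    mutate(model = 'knn')
) %>%
  mutate(truth = rep(test_df_preprocessed$playlist_genre, 3)) %>%
  group_by(model) %>%
  metrics(truth = truth, estimate = .pred_class)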
I think that’s it for this blog post! Here’s what we learned:
- It's always a good idea to do some EDA; at the very least you'll make things easier for yourself by removing unnecessary variables.
- The recipes package is a great way to ensure preprocessing is consistent.
- The parsnip package makes modelling across a variety of widely different models a lot easier than it used to be.
I'll try exploring the dials and tune packages for hyperparameter tuning in tidymodels in subsequent posts!