Getting the hang of the tidymodels metapackage has been on my radar for quite a while now. Developed from the ground up as the successor to the caret package, tidymodels aims to do for machine learning and modelling what the tidyverse did for data analysis and visualization: provide a suite of well-integrated packages that make it easier for us to focus on getting results rather than the nitty-gritty of individual packages.
Running a basic analysis on a tidytuesday dataset is a great way to get a taste of tidymodels, especially considering the hiatus my blog has been on. I haven't had the opportunity to use R much lately, so I'm trying to be more regular about it in my free time.
Anyway, down to business.
library(tidyverse)
library(tidymodels)
spotify_songs <-
readr::read_csv(
'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv'
) # reading in the data
spotify_songs %>%
glimpse() # inspecting the data
## Rows: 32,833
## Columns: 23
## $ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
## $ track_name <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
## $ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
## $ track_popularity <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
## $ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
## $ track_album_name <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
## $ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
## $ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
## $ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
## $ playlist_subgenre <chr> "dance pop", "dance pop", "dance pop", "dance…
## $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
## $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
## $ key <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
## $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
## $ mode <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
## $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
## $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
## $ instrumentalness <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
## $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
## $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
## $ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
## $ duration_ms <dbl> 194754, 162600, 176616, 169093, 189052, 16304…
There seem to be quite a few things of note captured in the data! We see a couple of identifiers for each song (or track, as it's called here), along with playlist identifiers. Details like a track's popularity and the playlist genre stand out too, but it's the latter half of the variables that seem particularly intriguing. Have a look at the Spotify API here.
I feel like we could build a basic classification model to predict song genres from the given features, but before we do that, let's have a better look at the data first.
spotify_songs <- spotify_songs %>%
mutate_if(is.character, as.factor)
spotify_songs %>%
count(playlist_genre, sort = TRUE)
## # A tibble: 6 × 2
## playlist_genre n
## <fct> <int>
## 1 edm 6043
## 2 rap 5746
## 3 pop 5507
## 4 r&b 5431
## 5 latin 5155
## 6 rock 4951
spotify_songs %>%
count(track_artist, sort = TRUE)
## # A tibble: 10,693 × 2
## track_artist n
## <fct> <int>
## 1 Martin Garrix 161
## 2 Queen 136
## 3 The Chainsmokers 123
## 4 David Guetta 110
## 5 Don Omar 102
## 6 Drake 100
## 7 Dimitri Vegas & Like Mike 93
## 8 Calvin Harris 91
## 9 Hardwell 84
## 10 Kygo 83
## # ℹ 10,683 more rows
Wow, a lot of EDM in this dataset. The differences in numbers aren't all that big amongst the genres, though. Exploring how the features behave across genres could prove to be a worthwhile endeavour, however.
I start off with a pretty basic visualization depicting song durations across playlist genres.
theme_set(theme_light())
spotify_songs %>%
ggplot(aes(duration_ms)) +
geom_density(aes(fill = playlist_genre), alpha = 0.8) +
geom_vline(
aes(xintercept = median(duration_ms)),
linetype = 'dashed',
linewidth = 1,
alpha = 0.8
) +
scale_x_continuous(labels = scales::number_format()) +
scale_y_continuous(labels = scales::number_format()) +
facet_wrap(~ playlist_genre) +
theme(axis.text.x = element_text(angle = 30)) +
labs(
title = 'Distribution of song lengths',
subtitle = 'By playlist genre\nDashed line represents median song duration',
x = 'Duration (in ms)',
fill = 'Playlist genre'
) +
scale_fill_brewer(palette = 'Paired')
Rock songs seem to be the longest, and EDM songs the shortest. Not surprising there. We can quickly confirm this below:
spotify_songs %>%
mutate(playlist_genre = fct_reorder(playlist_genre, duration_ms, median)) %>%
ggplot(aes(playlist_genre, duration_ms)) +
geom_boxplot(aes(color = playlist_genre)) +
coord_flip() +
scale_y_continuous(labels = scales::number_format()) +
scale_color_brewer(palette = 'Paired') +
labs(
title = 'Median duration of songs',
subtitle = 'By playlist genre',
x = 'Playlist genre',
y = 'Duration (in ms)',
color = 'Playlist genre'
)
We could try inspecting each of the features across genres, but recreating similar plots for each feature would be tedious. I'm going to reshape the data using a relatively new function in the tidyverse, pivot_longer(), which is essentially a replacement for the gather() function.
spotify_songs %>%
pivot_longer(danceability:duration_ms, names_to = "feature") %>%
ggplot(aes(value)) +
geom_density(aes(fill = playlist_genre), alpha = 0.5) +
facet_wrap( ~ feature, scales = 'free') +
labs(title = 'Distribution of different features', subtitle = 'By playlist genre', fill = 'Playlist genre') +
scale_fill_brewer(palette = 'Paired') +
scale_y_continuous(labels = scales::number_format()) +
scale_x_continuous(labels = scales::number_format()) +
theme(axis.text.x = element_text(angle = 90))
There you go. In one plot, we can see the distributions of each of the features present in the dataset, split across genres. Pretty neat.
The instrumentalness variable does not seem to have a lot of variation across genres, so I'm going to remove it from consideration; it's not really going to add anything to the model I'll be attempting to train. Variables like key, valence, speechiness, danceability and energy, on the other hand, seem to be remarkably different across genres.
This brings us to the question I was looking to answer at the start of this blog post. I'm going to use the rsample, recipes, parsnip and yardstick packages from the tidymodels metapackage to split and preprocess the data, train a few machine learning algorithms, and evaluate their performance.
Let’s get started!
model_df <- spotify_songs %>%
select(-contains('track'),
-playlist_id,
-playlist_name,
-playlist_subgenre,
-instrumentalness)
set.seed(100)
model_splits <- model_df %>%
initial_split(prop = 0.8)
training_df <- training(model_splits)
test_df <- testing(model_splits)
dim(training_df)
## [1] 26266 12
dim(test_df)
## [1] 6567 12
The initial_split() function splits data into training and testing sets. If you look at what the function returns, you'll essentially just see a summary of the split, and no data. We can access the data via the training() and testing() functions respectively. The rsample package also contains a ton of other options, including resampling; I haven't checked them out yet, but I hope to do so soon!
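As a quick aside, here's a minimal sketch of two of those options: a split stratified by the outcome, and a set of cross-validation folds. (The stratification variable and fold count here are just illustrative choices; I don't use these objects below.)
set.seed(100)
# a split stratified by the outcome, so genre proportions are preserved
# in both the training and testing sets
strat_splits <- initial_split(model_df, prop = 0.8, strata = playlist_genre)
# five cross-validation folds of the training data, also stratified by genre
cv_folds <- vfold_cv(training_df, v = 5, strata = playlist_genre)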
The recipes package makes it incredibly easy to preprocess different datasets: we specify the operations we wish to perform only once. Before recipes, we would have to remember the preprocessing steps we performed on our training set, taking care to perform exactly the same steps, in the same order, on our testing set. The recipes package removes this hassle from our workflow by letting us specify a recipe, which is essentially just a sequence of preprocessing steps we wish to perform. We do this via the variety of step_* functions present in the package.
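Just to illustrate the idea, a recipe chaining a few common steps might look something like this sketch (I stick to a single normalization step for the actual model below):
# a hypothetical recipe chaining several step_* functions (not used below)
example_recipe <- recipe(playlist_genre ~ ., data = training_df) %>%
  step_zv(all_predictors()) %>%                                  # drop zero-variance predictors
  step_corr(all_numeric(), -all_outcomes(), threshold = 0.9) %>% # drop highly correlated predictors
  step_normalize(all_numeric(), -all_outcomes())                 # centre and scale numeric predictors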
The predictors we wish to use have very different distributions, so a reliable option is to normalize them, i.e. centre and scale them to unit variance.
model_recipe <- recipe(playlist_genre ~ ., data = training_df) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
prep()
The prep() function estimates the required parameters for our preprocessing steps based on our training data. These same parameters are then applied to other datasets via the bake() function, enabling us to preprocess varying datasets in exactly the same way. I do this on our testing data below.
test_df_preprocessed <- bake(model_recipe, test_df)
Now that we’ve seen the data, it’s time to try and build some machine learning models!
The problem in front of us is essentially one of multiclass classification, and to solve it I'll be building three models:
- Decision Trees
- Random Forests
- K-Nearest Neighbours
Before the parsnip package came into being, R programmers had to keep in mind the varying interfaces and parameters of several machine learning packages. The same algorithm could have differently named parameters across packages! (Think randomForest and ranger with ntree and num.trees, respectively.) The parsnip package gives us one consistent interface, no matter the algorithm, as the sketch below shows.
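For instance, a single parsnip specification can be pointed at either engine, and translate() shows how the unified arguments map to each engine's own parameter names. (A quick sketch; I don't set the number of trees for the models below.)
# one specification, two engines: parsnip's `trees` argument becomes
# `num.trees` for ranger and `ntree` for randomForest
rf_spec <- rand_forest(mode = 'classification', trees = 500)
rf_spec %>% set_engine('ranger') %>% translate()
rf_spec %>% set_engine('randomForest') %>% translate()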
Let's train the decision tree first, using the decision_tree() function from parsnip.
set.seed(100)
rpart_model <- decision_tree(mode = 'classification') %>%
set_engine('rpart') %>%
fit(playlist_genre ~ ., data = juice(model_recipe))
summary(rpart_model)
## Length Class Mode
## lvl 6 -none- character
## spec 7 decision_tree list
## fit 14 rpart list
## preproc 1 -none- list
## elapsed 1 -none- list
## censor_probs 0 -none- list
rpart_model
## parsnip model object
##
## n= 26266
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 26266 21349 edm (0.19 0.16 0.17 0.16 0.17 0.15)
## 2) speechiness< 0.2330702 19630 15527 edm (0.21 0.16 0.19 0.15 0.092 0.19)
## 4) danceability>=-0.493343 13361 10389 edm (0.22 0.21 0.2 0.17 0.11 0.093)
## 8) tempo>=0.04157124 6171 3868 edm (0.37 0.15 0.17 0.1 0.12 0.084)
## 16) tempo< 0.2733446 3403 1484 edm (0.56 0.12 0.17 0.055 0.035 0.052) *
## 17) tempo>=0.2733446 2768 2162 rap (0.14 0.18 0.18 0.16 0.22 0.12) *
## 9) tempo< 0.04157124 7190 5341 latin (0.093 0.26 0.22 0.23 0.1 0.1)
## 18) duration_ms< 0.3397166 5311 3739 latin (0.1 0.3 0.25 0.17 0.1 0.078)
## 36) danceability>=0.3549124 3060 1928 latin (0.087 0.37 0.21 0.17 0.12 0.042) *
## 37) danceability< 0.3549124 2251 1550 pop (0.12 0.2 0.31 0.16 0.078 0.13) *
## 19) duration_ms>=0.3397166 1879 1160 r&b (0.065 0.15 0.13 0.38 0.11 0.17) *
## 5) danceability< -0.493343 6269 3790 rock (0.18 0.06 0.19 0.12 0.053 0.4) *
## 3) speechiness>=0.2330702 6636 3892 rap (0.12 0.15 0.08 0.2 0.41 0.036) *
The juice() function helps us extract the preprocessed training data from the recipe we prepped earlier.
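If I recall correctly, on recent versions of recipes this is equivalent to baking the prepped recipe with new_data = NULL, so the two calls below should return the same preprocessed training set:
# both should return the training data with the prepped steps applied
juiced <- juice(model_recipe)
baked <- bake(model_recipe, new_data = NULL)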
Note the parsnip syntax being used here. You create a model specification using a function, then decide the engine used to implement the algorithm via set_engine(). The fit() function takes a formula specifying the outcome and the predictors being used. This approach stays the same across all the algorithms supported by parsnip, which is incredibly helpful. No longer does one have to look up documentation on the arguments and parameters required for the model we wish to train!
Let's look at the accuracy here. Keep in mind that with six genres, randomly guessing would only be correct about 1/6 of the time, i.e. with a probability of roughly 0.17.
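Another useful baseline (a quick sketch) is the no-information rate: the accuracy we'd get by always predicting the most common genre in the test set.
# proportion of each genre in the test set; the largest value is the
# accuracy of a model that always predicts that genre
test_df_preprocessed %>%
  count(playlist_genre) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop))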
rpart_model %>%
predict(new_data = test_df_preprocessed) %>%
mutate(truth = test_df_preprocessed$playlist_genre, model = 'decision tree') %>%
metrics(truth = truth, estimate = .pred_class)
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.399
## 2 kap multiclass 0.278
That's more than twice as good as randomly guessing. Not exactly impressive, but let's see how the other algorithms turn out.
Let’s train a random forest on the data now. Notice how the syntax hardly changes.
set.seed(100)
rf_model <- rand_forest(mode = 'classification') %>%
set_engine('ranger') %>%
fit(playlist_genre ~ ., data = juice(model_recipe))
rf_model %>%
predict(new_data = test_df_preprocessed) %>%
mutate(truth = test_df_preprocessed$playlist_genre) %>%
metrics(truth = truth, estimate = .pred_class)
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.554
## 2 kap multiclass 0.464
55%! That's a remarkable improvement over the previous model! The accuracy is of course still low, which really brings home how complex a task classifying songs by genre is, and how much careful feature selection it might take.
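To dig a little deeper than overall accuracy, yardstick's conf_mat() gives us a confusion matrix showing which genres the random forest mixes up (a quick sketch):
# confusion matrix for the random forest on the test set
rf_model %>%
  predict(new_data = test_df_preprocessed) %>%
  mutate(truth = test_df_preprocessed$playlist_genre) %>%
  conf_mat(truth = truth, estimate = .pred_class)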
The last model I'll be training is a nearest neighbours classifier, done similarly via parsnip.
knn_model <- nearest_neighbor(mode = 'classification') %>%
set_engine('kknn') %>%
fit(playlist_genre ~ ., data = juice(model_recipe))
knn_model
## parsnip model object
##
##
## Call:
## kknn::train.kknn(formula = playlist_genre ~ ., data = data, ks = min_rows(5, data, 5))
##
## Type of response variable: nominal
## Minimal misclassification: 0.5092134
## Best kernel: optimal
## Best k: 5
knn_model %>%
predict(new_data = test_df_preprocessed) %>%
mutate(truth = test_df_preprocessed$playlist_genre, model = 'knn') %>%
metrics(truth = truth, estimate = .pred_class)
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.444
## 2 kap multiclass 0.332
44%. That’s pretty okay, all things considered.
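To wrap up the comparison, we could stack the predictions from all three models and compute the metrics per model in one go; yardstick's metric functions respect dplyr groups, so something like this sketch should work:
# collect predictions from all three models and compare them side by side
bind_rows(
  rpart_model %>% predict(new_data = test_df_preprocessed) %>%
    mutate(model = 'decision tree'),
  rf_model %>% predict(new_data = test_df_preprocessed) %>%
    mutate(model = 'random forest'),
  knn_model %>% predict(new_data = test_df_preprocessed) %>%
    mutate(model = 'knn')
) %>%
  mutate(truth = rep(test_df_preprocessed$playlist_genre, 3)) %>%
  group_by(model) %>%
  metrics(truth = truth, estimate = .pred_class)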
I think that’s it for this blog post! Here’s what we learned:
- It's always a good idea to do some EDA; at the very least you'll make things easier for yourself by removing unnecessary variables.
- The recipes package is a great way to ensure preprocessing is consistent.
- The parsnip package makes modelling across a variety of widely different models a lot easier than it used to be.
I'll try exploring the dials and tune packages for hyperparameter tuning in tidymodels in subsequent posts!