The TidyTuesday project by the R for Data Science community is a great resource for practicing up on Data analytics. Every Tuesday, they release a dataset to the public, and I’ve found it to be a great source of motivation and practice. Every dataset is different, and every dataset has its own story to tell. Plus, it’s always a treat to look at the various ways in which people go through the same dataset.
I’ve always been an avid fan of anime in general, ever since I first started watching Naruto back in 8th Grade. Doing something in R related to anime has been in the back of my mind for a while now, so when I came across this particular dataset on the TidyTuesday initiate, I was stoked! I decided to get my hands dirty with this dataset, which brings us to this, my second blog post.
The first thing I do is load the dataset via the URL provided on the TidyTuesday page of the dataset.
After reading it, I load up the tidyverse
package to streamline my analysis, along with briefly going through the data via the glimpse()
function. At first glance, a few columns seem to be of note to me, namely the synopsis
, score
, and studio
columns.
I use select()
to keep only the columns I think are relevant to my analysis. In this case, I remove columns containing Japanese characters, redundant information or columns that simply don’t seem that interesting to me. Keeping columns that you won’t be using will just slow down your analysis later on.
tidy_anime <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-04-23/tidy_anime.csv")
library(tidyverse)
tidy_anime %>% glimpse()
## Rows: 77,911
## Columns: 28
## $ animeID <dbl> 1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6…
## $ name <chr> "Cowboy Bebop", "Cowboy Bebop", "Cowboy Bebop", "Cowboy…
## $ title_english <chr> "Cowboy Bebop", "Cowboy Bebop", "Cowboy Bebop", "Cowboy…
## $ title_japanese <chr> "カウボーイビバップ", "カウボーイビバップ", "カウボーイ…
## $ title_synonyms <chr> "[]", "[]", "[]", "[]", "[]", "[]", "[\"Cowboy Bebop: K…
## $ type <chr> "TV", "TV", "TV", "TV", "TV", "TV", "Movie", "Movie", "…
## $ source <chr> "Original", "Original", "Original", "Original", "Origin…
## $ producers <chr> "Bandai Visual", "Bandai Visual", "Bandai Visual", "Ban…
## $ genre <chr> "Action", "Adventure", "Comedy", "Drama", "Sci-Fi", "Sp…
## $ studio <chr> "Sunrise", "Sunrise", "Sunrise", "Sunrise", "Sunrise", …
## $ episodes <dbl> 26, 26, 26, 26, 26, 26, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ status <chr> "Finished Airing", "Finished Airing", "Finished Airing"…
## $ airing <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ start_date <date> 1998-04-03, 1998-04-03, 1998-04-03, 1998-04-03, 1998-0…
## $ end_date <date> 1999-04-02, 1999-04-02, 1999-04-02, 1999-04-02, 1999-0…
## $ duration <chr> "24 min per ep", "24 min per ep", "24 min per ep", "24 …
## $ rating <chr> "R - 17+ (violence & profanity)", "R - 17+ (violence & …
## $ score <dbl> 8.81, 8.81, 8.81, 8.81, 8.81, 8.81, 8.41, 8.41, 8.41, 8…
## $ scored_by <dbl> 405664, 405664, 405664, 405664, 405664, 405664, 120243,…
## $ rank <dbl> 26, 26, 26, 26, 26, 26, 164, 164, 164, 164, 164, 164, 1…
## $ popularity <dbl> 39, 39, 39, 39, 39, 39, 449, 449, 449, 449, 449, 449, 4…
## $ members <dbl> 795733, 795733, 795733, 795733, 795733, 795733, 197791,…
## $ favorites <dbl> 43460, 43460, 43460, 43460, 43460, 43460, 776, 776, 776…
## $ synopsis <chr> "In the year 2071, humanity has colonized several of th…
## $ background <chr> "When Cowboy Bebop first aired in spring of 1998 on TV …
## $ premiered <chr> "Spring 1998", "Spring 1998", "Spring 1998", "Spring 19…
## $ broadcast <chr> "Saturdays at 01:00 (JST)", "Saturdays at 01:00 (JST)",…
## $ related <chr> "{'Adaptation': [{'mal_id': 173, 'type': 'manga', 'name…
tidy_anime %>% head(20) #different genres are separated into different rows
## # A tibble: 20 × 28
## animeID name title_english title_japanese title_synonyms type source
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 Cowboy Bebop Cowboy Bebop カウボーイビ… "[]" TV Origi…
## 2 1 Cowboy Bebop Cowboy Bebop カウボーイビ… "[]" TV Origi…
## 3 1 Cowboy Bebop Cowboy Bebop カウボーイビ… "[]" TV Origi…
## 4 1 Cowboy Bebop Cowboy Bebop カウボーイビ… "[]" TV Origi…
## 5 1 Cowboy Bebop Cowboy Bebop カウボーイビ… "[]" TV Origi…
## 6 1 Cowboy Bebop Cowboy Bebop カウボーイビ… "[]" TV Origi…
## 7 5 Cowboy Bebo… Cowboy Bebop… カウボーイビ… "[\"Cowboy Be… Movie Origi…
## 8 5 Cowboy Bebo… Cowboy Bebop… カウボーイビ… "[\"Cowboy Be… Movie Origi…
## 9 5 Cowboy Bebo… Cowboy Bebop… カウボーイビ… "[\"Cowboy Be… Movie Origi…
## 10 5 Cowboy Bebo… Cowboy Bebop… カウボーイビ… "[\"Cowboy Be… Movie Origi…
## 11 5 Cowboy Bebo… Cowboy Bebop… カウボーイビ… "[\"Cowboy Be… Movie Origi…
## 12 5 Cowboy Bebo… Cowboy Bebop… カウボーイビ… "[\"Cowboy Be… Movie Origi…
## 13 5 Cowboy Bebo… Cowboy Bebop… カウボーイビ… "[\"Cowboy Be… Movie Origi…
## 14 5 Cowboy Bebo… Cowboy Bebop… カウボーイビ… "[\"Cowboy Be… Movie Origi…
## 15 5 Cowboy Bebo… Cowboy Bebop… カウボーイビ… "[\"Cowboy Be… Movie Origi…
## 16 5 Cowboy Bebo… Cowboy Bebop… カウボーイビ… "[\"Cowboy Be… Movie Origi…
## 17 6 Trigun Trigun トライガン "[]" TV Manga
## 18 6 Trigun Trigun トライガン "[]" TV Manga
## 19 6 Trigun Trigun トライガン "[]" TV Manga
## 20 6 Trigun Trigun トライガン "[]" TV Manga
## # ℹ 21 more variables: producers <chr>, genre <chr>, studio <chr>,
## # episodes <dbl>, status <chr>, airing <lgl>, start_date <date>,
## # end_date <date>, duration <chr>, rating <chr>, score <dbl>,
## # scored_by <dbl>, rank <dbl>, popularity <dbl>, members <dbl>,
## # favorites <dbl>, synopsis <chr>, background <chr>, premiered <chr>,
## # broadcast <chr>, related <chr>
theme_set(theme_light())
tidy_anime_select <- tidy_anime %>%
select(-title_japanese, -title_synonyms, -background, -related)
It’s always a good idea to look at the distribution of the data you have. I don’t have a particular preference for anime, but I’m still curious as to what genre generally gets the highest ratings from the plethora of users out there. To get some idea of the same, I plot a histogram of the ratings, across genres, using facet_wrap()
and geom_histogram()
.
tidy_anime_select %>%
filter(!is.na(genre)) %>%
ggplot(aes(score)) +
geom_histogram(aes(fill = genre)) +
facet_wrap(~genre) +
guides(fill = FALSE) +
labs(title = "Distribution of ratings across genres")
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 173 rows containing non-finite values (`stat_bin()`).
Notice I used colorRampPalette()
to create a custom color scheme. Most pre-defined palettes present in RColorBrewer
have around 8 colors, which leads to a problem when you’ve got more than 8 categorical variables. colorRampPalette()
creates a new palette having as many unique colors as you want by interpolating between the colors that are already present in the pre-defined palette.
In the following section of code, I create a simple bar-plot that shows the median ratings for each genre, followed by the median ratings for the studios with the 20 highest median ratings. I do this using geom_col()
and geom_label()
.
#creating a custom color scheme
library(RColorBrewer)
custColorsDiscrete <- colorRampPalette(brewer.pal(9, "Blues"))(41)
tidy_anime_select %>%
group_by(genre) %>%
summarise(avgScore = median(score, na.rm = TRUE)) %>%
arrange(desc(avgScore)) %>%
mutate(genre = fct_reorder(genre, avgScore)) %>%
filter(!is.na(genre)) %>%
ggplot(aes(genre, avgScore)) +
geom_col(aes(fill = genre), color = "black") +
geom_label(aes(label = avgScore),
label.padding = unit(0.10, "lines"),
alpha = 0.8,
nudge_y = -0.15) +
coord_flip() +
scale_fill_manual(values = custColorsDiscrete) +
labs(title = "Median rating for all genres",
x = "Genre",
y = "Median score",
subtitle = "People love mystery thrillers") +
guides(fill = FALSE)
custColorsDiscrete2 <- colorRampPalette(brewer.pal(9, "Blues"))(21)
tidy_anime_select %>%
group_by(studio) %>%
summarize(avgScore = median(score, na.rm = TRUE)) %>%
arrange(desc(avgScore)) %>%
top_n(20) %>%
mutate(studio = fct_reorder(studio, avgScore)) %>%
ggplot(aes(studio, avgScore)) +
geom_col(aes(fill = studio), color = "black") +
geom_label(aes(label = avgScore),
label.padding = unit(0.15, "lines"),
alpha = 0.8,
nudge_y = -0.20) +
scale_fill_manual(values = custColorsDiscrete2) +
coord_flip() +
guides(fill = FALSE) +
labs(title = "Median rating per studio",
subtitle = "Top 20 studios taken",
x = "Studio",
y = "Median score")
## Selecting by avgScore
There are more than 40 genres, which really only leads to cluttered visualizations. To clear up my visualizations a bit, I store the most common genres across the dataset in a vector called topGenres
.
A couple of studios I’m familiar with include studios like Toei animation, Ufotable, and of course, Studio Ghibli. However, I’m not really sure as to how common they really are, and how diverse they are in their offerings.
I attempt to answer that with the following code.
I group the dataset according to the genre and studio, after which I can easily see the most common genre-studio pairings via summarize()
and arrange()
. I also use the topGenres
vector created earlier to look at studios that only have shows present in the most common genres.
fct_reorder()
from the forcats
package is something I’ve used countless times to reorder categorical features based on a metric.
topGenres <- tidy_anime_select %>%
count(genre, sort = TRUE) %>%
head() %>%
pull(genre) #41 genres
tidy_anime %>%
filter(!is.na(studio) & !is.na(genre)) %>%
group_by(genre, studio) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
top_n(5) %>%
add_count(genre) %>%
ungroup() %>%
filter(genre %in% topGenres) %>%
mutate(studio = fct_reorder(studio, count, sum)) %>%
ggplot(aes(studio, count)) +
geom_col(aes(fill = genre), alpha = 0.8) +
coord_flip() +
scale_fill_brewer(palette = "Spectral") +
labs(title = "Top anime studios and the genres of their products",
x = "Studio name",
y = "Number of shows",
fill = "Genre")
## `summarise()` has grouped output by 'genre'. You can override using the
## `.groups` argument.
## Selecting by count
Eventually I want to be predicting the score based off various other parameters present in the dataset. Looking at the distribution, the score
variable seems to be present in a log-normal distribution. I’ll try to predict the logarithm of the scores later.
tidy_anime_select %>%
ggplot(aes(score)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 174 rows containing non-finite values (`stat_bin()`).
The `description` column features a wealth of unstructured data worth looking into. The `tidytext` package is a great tool for exploring the relationships within text, while also being great with the rest of the `tidyverse`.
The unnest_tokens
function here tokenizes the text into individual words by default, after some basic pre-processing steps (removing numbers, making everything lower case, etc.) I then use anti_join()
with a lexicon of stop words that is provided by the package itself to get rid of words used commonly (words like a, the, an, of…).
I then perform steps similar to what I performed earlier while examining the relationships between studios and genres, except this time I also use scale_x_reordered()
(from the drlib
package) to reorder the variables within each individual facet. This gives us the most common words present in each popular genre.
library(tidytext)
tidy_anime_words <- tidy_anime_select %>%
unnest_tokens(word, synopsis) %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
tidy_anime_words %>%
filter(genre %in% topGenres) %>%
count(genre, word, sort = TRUE) %>%
filter(!word %in% c("source", "mal", "written", "rewrite", "world", "life", "ann")) %>%
group_by(genre) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = fct_reorder(word, n)) %>%
ggplot(aes(reorder_within(word, n, genre),n)) +
geom_col(aes(fill = genre)) +
coord_flip() +
scale_x_reordered() +
facet_wrap(~genre, scales = "free") +
labs(title = "Most common words in synopses per genre",
subtitle = "Extremely common words removed",
x = "Word",
y = "Count") +
guides(fill = FALSE)
## Selecting by n
When dealing with text data, there may be some words that are common throughout the dataset despite not being a common English stop word. It is highly likely, then, that these words might be present across categories, thus limiting the insights you could gather from each category (mysterious in the above plot, for example).
The tf-idf metric attempts to solve this problem by trying to find words that are more specific to a particular category than the rest. Extremely common words have lower ranks according to this metric. The bind_tf_idf()
function from the tidytext
package helps us calculate the metric for each word present within each genre, visualized here:
tidy_anime_words %>%
filter(genre %in% topGenres) %>%
count(genre, word, sort = TRUE) %>%
bind_tf_idf(word, genre, n) %>%
arrange(desc(tf_idf)) %>%
group_by(genre) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = fct_reorder(word, tf_idf)) %>%
ggplot(aes(reorder_within(word, tf_idf, genre),tf_idf)) +
geom_col(aes(fill = genre)) +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(labels = scales::percent_format()) +
facet_wrap(~genre, scales = "free") +
labs(title = "Most specific words in synopses per genre",
subtitle = "Most of them seem to be character names",
x = "Word",
y = "tf_idf",
fill = "Genre")
## Selecting by tf_idf
Recall that our data also contains a variety of columns containing time data, leading to a wealth of questions worth looking into. This could allow us to get information pertaining to, say, timings related to specific genres, or whether timings affect average ratings.
The lubridate
package contains a multitude of functions that help in parsing data-time data, present in a variety of formats. I use that here in order to parse the time-related features as date-time features, and not as regular numeric/character features.
I then use geom_density()
to look at the density of the time slots between the most common genres.
library(lubridate)
tidy_anime_select %>%
mutate(year = year(start_date)) %>%
ggplot(aes(year)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 238 rows containing non-finite values (`stat_bin()`).
tidy_anime_time <- tidy_anime_select %>%
extract(broadcast, into = c("time"), regex = "(\\d\\d:\\d\\d)") %>%
select(time, genre, score) %>%
filter(!is.na(time),
genre %in% topGenres) %>%
mutate(time = hm(time),
hour = hour(time))
tidy_anime_time %>%
ggplot(aes(hour)) +
geom_density(aes(fill = genre), alpha = 0.5) +
labs(title = "Density of time slots",
subtitle = "Adventure shows show a sharp spike around 18:00 JST",
x = "Hour",
y = "Density",
fill = "Genre",
caption = "All timings in JST") +
scale_y_continuous(labels = scales::percent_format())
I also attempted to train a rudimentary linear model to see the impact the hour a show airs might have on average ratings. Turns out, the hour is statistically significant when it comes to predicting ratings.
I use the tidy()
function from the broom
package to convert my linear model into a tibble, after which I plot the average impact each hour has on ratings, along with their respective confidence intervals.
library(broom)
tidy_anime_time %>%
mutate(hour = as.factor(hour)) %>%
lm(score~hour, data = .) %>%
tidy(conf.int = TRUE) %>%
mutate(term = str_replace(term, "hour", "")) %>% #for readability
filter(term!='(Intercept)') %>%
arrange(desc(estimate)) %>%
mutate(term = fct_reorder(term, estimate)) %>%
ggplot(aes(term, estimate)) +
geom_point() +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high)) +
coord_flip() +
labs(title = "Impact of airing hours on average ratings",
subtitle = "Afternoon shows do not seem to go well with the viewers",
x = "Hour",
y = "Impact on rating",
caption = "Hours in JST")
I use the glmnet
and broom
packages here. The cast_sparse()
function present in the tidytext
package makes converting our data frame into a sparse matrix extremely easy. Our matrix will contain 1 row for each unique anime, and 1 column for each word present in our data set.
I use cross-validation to train my model.
library(Matrix)
library(glmnet)
tidy_anime_word_matrix <- tidy_anime_words %>%
filter(str_detect(word, "[a-zA-Z]")) %>%
cast_sparse(animeID, word)
dim(tidy_anime_word_matrix)
## [1] 13114 42100
anime_ids <- as.integer(rownames(tidy_anime_word_matrix))
scores <- log2(tidy_anime_words$score[anime_ids]) #for getting the log of the scores present in the sparse matrix
#Training a cross-validated lasso model
cv_glmnet_mod <- cv.glmnet(tidy_anime_word_matrix, scores)
Upon visualizing the model, we can see that overfitting takes place relatively soon, after we cross around 527 terms. The dotted lines represent the values of lambda that minimized the error of our model. We can easily retrieve those exact values by using broom
.
In order to see the impact each word present in our data has on the score
response variable, we need to first get the model that had the best fit on the data as a function of the lambda penalty parameter. This is done by looking at the glmnet.fit
model to find the optimal, final model built after cross-validation.
To get the impact of each word based off our best fit model, I simply filter for the terms which were put into the model when the value of lambda led to predictions within 1 standard error, after which I use tidy()
to visualize the corresponding terms along with their coefficients.
plot(cv_glmnet_mod)
cv_glmnet_mod$glmnet.fit %>%
tidy() %>%
filter(term %in% c("official", "video", "season", "franchise", "rewrite")) %>%
ggplot(aes(lambda, estimate)) +
geom_line(aes(color = term)) +
labs(title = "Lambda vs estimate",
subtitle = "Overfitting on a few of these terms",
caption = "Terms picked arbitrarily")
cv_glmnet_mod$glmnet.fit %>%
tidy() %>%
filter(lambda == cv_glmnet_mod$lambda.1se) %>%
filter(term!="(Intercept)") %>%
mutate(direction = ifelse(estimate>0,"positive", "negative")) %>%
group_by(direction) %>%
top_n(9, abs(estimate)) %>%
ungroup() %>%
mutate(term = fct_reorder(term, estimate)) %>%
ggplot(aes(term, estimate)) +
geom_col(aes(fill = direction)) +
coord_flip() +
labs(title = "Impact of each word on anime score",
subtitle = "Calculated using lasso regression",
x = "Word",
y = "Coefficient",
fill = "Direction of impact",
caption = "Majorly Japanese words seem to be negative impactors on the overall score")
And there you have it. Through this analysis we did quite a few things:
- Exploratory visualizations,
- tf-idf to find out words specific within genres,
- Manipulating time data, and using the same as predictors,
- Manipulating text as predictors.
That was fun, but there are quite a few things I could’ve done as well, or improved upon:
- Exploring the other features of the dataset.
- Scrape the webpages provided in the dataset.
- Explore other features as predictors.
- Trying out different machine learning models to maximize accuracy.