Analyzing Global Suicide Data - Nishank Saini's blog

I occassionally browse through Kaggle for datasets that might pique my interest. The Suicide rates dataset immediately grabbed my attention. Given the rise in the number of teen suicides over the years, it was a no-brainer for me to attempt to analyze the dataset.

The first thing I do is of course, load up the tidyverse package.

library(tidyverse)

While beginning a data analysis, it’s always good practice to simply look at the underlying data first. The glimpse() and the str() functions are a great way to achieve this.

Basic data pre-processing

suicideData <- read_csv("../../../static/data/suicide-data.csv")
suicideData %>% glimpse()

## Rows: 27,820
## Columns: 12
## $ country              <chr> "Albania", "Albania", "Albania", "Albania", "Alba…
## $ year                 <dbl> 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1…
## $ sex                  <chr> "male", "male", "female", "male", "male", "female…
## $ age                  <chr> "15-24 years", "35-54 years", "15-24 years", "75+…
## $ suicides_no          <dbl> 21, 16, 14, 1, 9, 1, 6, 4, 1, 0, 0, 0, 2, 17, 1, …
## $ population           <dbl> 312900, 308000, 289700, 21800, 274300, 35600, 278…
## $ `suicides/100k pop`  <dbl> 6.71, 5.19, 4.83, 4.59, 3.28, 2.81, 2.15, 1.56, 0…
## $ `country-year`       <chr> "Albania1987", "Albania1987", "Albania1987", "Alb…
## $ `HDI for year`       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ `gdp_for_year ($)`   <dbl> 2156624900, 2156624900, 2156624900, 2156624900, 2…
## $ `gdp_per_capita ($)` <dbl> 796, 796, 796, 796, 796, 796, 796, 796, 796, 796,…
## $ generation           <chr> "Generation X", "Silent", "Generation X", "G.I. G…

Quite a few columns have names that aren’t very intuitive, or that have features that will make further analysis just a little inconvenient (note the $ symbols present in a few column names).

Right off the bat, we can see that the dataset contains a year variable, however, this variable seems to be of the dbl type, something that is not particularly helpful when dealing with dates. Instead, I use the year() function from the lubridate package in order to process the variable as a date variable successfully.

Seeing as the dataset contains the absolute number of suicides, I decided to use mutate() to create a new variable, suicideRate, that contains the percentage of suicides for the countries present in the dataset, this might help a little bit in dealing with countries with extreme populations.

I also use the rename() function to change the column names of some columns to something a bit cleaner.

suicideDataProcessed <- suicideData %>% 
mutate(suicideRate = (suicides_no/population)*100, 
       year = lubridate::year(lubridate::years(year)), 
       country = as.factor(country), 
       generation = as.factor(generation), 
       row_id = row_number()) %>% 
rename(gdp_per_capita = `gdp_per_capita ($)`, 
       gdp_for_year = `gdp_for_year ($)`)

theme_set(theme_light())

Exploratory Data Analysis

I’ll be using ggplot to generate a few visualizations in order to get a better understanding of the underlying patterns in the data set. For example, the first question that comes to mind is to look at the distribution of suicides amongst the different age groups shown in the dataset, I do this using geom_histogram:

suicideDataProcessed %>% 
ggplot(aes(x = suicideRate, fill = age, alpha = 0.5)) + 
geom_histogram() + 
facet_wrap(~age) +
guides(alpha = FALSE) +
labs(title = "Distribution of suicide rates across age groups", subtitle = "Lots of countries having low suicide rates for the youngest age group", x = "Suicide rates (in %age)")

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

From the above, we see that there is a very notable spike right at the beginning of the histogram for the youngest age group, this corresponds to the fact that lots of countries have low suicide rates amongst the youth. Notably, the 75+ age group appears to have a slightly more significant tail as compared to the other age groups, indicating the fact that some countries have high suicide rates when it comes to the older age groups.

I plot a similar histogram here, however this time, I look at the distribution of suicide rates amongst the different generations shown in the data set.

I also use an additional aesthetic in the plot to look at the distribution of age groups within generations, mainly because I’ve always been curious as to which age group exactly belonged to generation X, which age group belonged to generation Y, and so on.

suicideDataProcessed %>% 
ggplot(aes(x = suicideRate, fill = age, alpha = 0.5)) + 
geom_histogram(position = "dodge") + 
facet_wrap(~generation) +
guides(alpha = FALSE) +
labs(title = "Suicide rates across generations", subtitle = "Distribution within generations visible as well", x = "Suicide rate (in %age)")

Interestingly enough, we find a variety of groups belonging to the generation groups present in the data. I’d have thought generations like those of the boomers or millenials would largely comprise of single groups. There are a few generations unknown to me here, like the G.I generation as well as the Silent generation.

This graph largely resembles the previous graph, with major spikes occurring amongst the younger generations.

Recall that our dataset also contains a year variable. I’d love to explore the time-related aspect of the data, which I achieve by using geom_smooth() to inspect the relationship between year and suicideRate.

The method = "lm" in the geom_smooth() call tells the function to plot a linear regression line between the two variables.

suicideDataProcessed %>% 
ggplot(aes(year, suicideRate, color = age)) +
geom_smooth(method = "lm") +
labs(y = "Suicide rate (in %age)", title = "Suicide rates over time", subtitle = "Suicide rates for the youngest age group nearly constant over time")

## `geom_smooth()` using formula = 'y ~ x'

The rates show a negative trend over time, especially in the case of the 75+ years age group. This is encouraging news, however, keep in mind that this does not necessarily mean that the suicides have dropped, only that the rates have dropped. The population boom could easily be to blame for this instead of better social and cultural factors in society.

When it comes to inspecting suicides globally, a key question that comes to mind is whether third-world countries tend to have more suicides, corresponding to a lower standards of living, or if first-world countries also have significant numbers of suicides.

We can do this by using group_by() and summarize() to look at the average suicides for each country. Again, I use median instead of mean.

I also use fct_reorder() for reording the factor variables in the graph.

#Creating a custom color palette
library(RColorBrewer)
custColors <- rev(colorRampPalette(brewer.pal(9,"RdBu"))(20)) #reversing the color palette for better aesthetics

suicideDataProcessed %>% 
group_by(country) %>% 
summarize(count = n(), medianSuicides = median(suicides_no))  %>% 
mutate(country = fct_reorder(country, medianSuicides)) %>% 
top_n(20) %>% 
ungroup() %>% 
ggplot(aes(country, medianSuicides)) + 
geom_col(aes(fill = country),color = "black") +
coord_flip() +
scale_fill_manual(values = custColors) +
labs(title = "Median number of suicides per country", x = "Country", y = "Median suicides") +
guides(fill = FALSE)

## Selecting by medianSuicides

The United States and Japan have the highest average number of suicides, in stark contrast to their high standards of living.

In a previous graph, we looked at the distribution of suicide rates across different age groups/generations. Now, we’ll be looking at the median number of suicides per age group and generation, in a manner similar to the previous graph.

I convert both the age and generation variables to factor variables in order to use fct_reorder.

suicideDataProcessed %>% 
group_by(age) %>% 
summarize(count = n(), medianSuicides = median(suicides_no)) %>% 
mutate(age = as.factor(age), age = fct_reorder(age, medianSuicides)) %>% 
ungroup() %>% 
ggplot(aes(age, medianSuicides)) +
geom_col(aes(fill = age), color = "black") +
coord_flip() +
scale_fill_brewer(palette = "Reds", direction = 1) +
labs(title = "Median suicides per age group", x = "Age group", y = "Median suicides")

suicideDataProcessed %>% 
group_by(generation) %>% 
summarize(count = n(), medianSuicides = median(suicides_no)) %>% 
mutate(generation = as.factor(generation), generation = fct_reorder(generation, medianSuicides)) %>% 
ungroup() %>% 
ggplot(aes(generation, medianSuicides)) +
geom_col(aes(fill = generation), color = "black") +
scale_fill_brewer(palette = "Reds", direction = 1) +
coord_flip() +
labs(title = "Median suicides per generation", x = "Generation", y = "Median suicides")

Amidst all the news of millenial suicides, we find that it is actually the 35-54 age group that has the highest average suicides in the data.

The graph below this tries to answer a question previously raised over the course of this analysis: the effect of population on the suicide rate. I use geom_smooth() again here to effectively visualise the relationship between the two variables.

suicideDataProcessed %>% 
select(suicideRate, year, population)  %>% 
ggplot(aes(population, suicideRate)) +
geom_smooth(method = "lm") +
scale_x_continuous(labels = scales::comma_format()) + 
labs(title = "Suicide rates vs population", subtitle = "Suicide rates increase and become more variable", y = "Suicide rates (in %age)", x = "Population")

## `geom_smooth()` using formula = 'y ~ x'

According to the above graph, suicide rates should be increasing on average as the population increases, however, the variability increases drastically as well, shedding light on the fact that suicide rates do not depend solely on population.

Coming back to looking at the data over the years, I’d like to visualize something similar to a timeline, that tries to show the trends, if any, of suicides over the years. I decided to do this using a basic line plot, achieved using geom_line().

Again, I use group_by() and summarize() to calculate the median number of suicides for each year in the dataset. I use scale_color_continuous to use custom colours for the plot, with red for more suicides, and blue for fewer suicides.

I also vary the sizes of the points in accordance to the median suicides.

suicideDataProcessed %>% 
group_by(year) %>% 
summarize(count = n(), medianSuicides = median(suicides_no)) %>% 
ungroup() %>% 
ggplot(aes(year, medianSuicides)) +
geom_point(aes(size = medianSuicides, color = medianSuicides)) +
geom_line(aes(color = medianSuicides,alpha = 0.2)) +
geom_hline(aes(yintercept = max(medianSuicides),alpha = 0.2), linetype = "dotted") +
geom_hline(aes(yintercept = median(medianSuicides),alpha = 0.2), linetype = "dotted") +
geom_hline(aes(yintercept = min(medianSuicides),alpha = 0.2), linetype = "dotted") +
scale_color_continuous(low = "blue", high = "red") +
guides(alpha = FALSE) +
labs(title = "Median suicides across the years", subtitle = "Median suicides overall is roughly 25", y = "Median suicides", x = "Year")

Visualizing the data on a world map

This dataset has data related to countries, while also featuring a very convenient country variable. What better way to make use of this than a map visualization?

To do this, I first used the map_data function that is a part of the ggplot2 package, and then store the data in a variable called worldData. I’ll use this variable later to plot the full visualization.

worldData <- map_data('world')
str(worldData)

## 'data.frame':	99338 obs. of  6 variables:
##  $ long     : num  -69.9 -69.9 -69.9 -70 -70.1 ...
##  $ lat      : num  12.5 12.4 12.4 12.5 12.5 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region   : chr  "Aruba" "Aruba" "Aruba" "Aruba" ...
##  $ subregion: chr  NA NA NA NA ...

I write a custom function below to look at country names in the worldData dataset. The function uses the str_detect() function from the stringr package to look at country names that have common strings present in them, for example, Saint Vincent and Saint Lucia. This function helps to get an idea of how to deal with mismatched country names, as shown subsequently.

countryCount <- function(countryName){
worldData %>% 
filter(str_detect(region, pattern = countryName)) %>% 
group_by(region) %>% 
summarize(count = n())
}

countryNames <- c("Korea", "Antigua", "Barbuda", "Saint", "Trinidad", "Verde")
lapply(countryNames, countryCount)

## [[1]]
## # A tibble: 2 × 2
##   region      count
##   <chr>       <int>
## 1 North Korea   257
## 2 South Korea   260
## 
## [[2]]
## # A tibble: 1 × 2
##   region  count
##   <chr>   <int>
## 1 Antigua    12
## 
## [[3]]
## # A tibble: 1 × 2
##   region  count
##   <chr>   <int>
## 1 Barbuda    10
## 
## [[4]]
## # A tibble: 7 × 2
##   region                    count
##   <chr>                     <int>
## 1 Saint Barthelemy             11
## 2 Saint Helena                  7
## 3 Saint Kitts                  13
## 4 Saint Lucia                  10
## 5 Saint Martin                  7
## 6 Saint Pierre and Miquelon    24
## 7 Saint Vincent                10
## 
## [[5]]
## # A tibble: 1 × 2
##   region   count
##   <chr>    <int>
## 1 Trinidad    30
## 
## [[6]]
## # A tibble: 1 × 2
##   region     count
##   <chr>      <int>
## 1 Cape Verde    87

After using str() to inspect the new variable we just created, we notice that it contains a variable that contains country names, similar to our original dataset; however, the column names are different. I fix this using a simple rename() call.

There are a few countries that are labelled differently in the different datasets, which means using a join operation may not show us the full picture. In order to deal with the mismatched countries, I first use an anti_join() to get the countries that are labelled differently. I will deal with them subsequently.

suicideDataProcessed %>% 
rename(region = country) %>% 
mutate(region = as.character(region)) %>% 
anti_join(worldData) %>% 
group_by(region) %>% 
summarize(count = n())

## Joining with `by = join_by(region)`

## # A tibble: 10 × 2
##    region                       count
##    <chr>                        <int>
##  1 Antigua and Barbuda            324
##  2 Cabo Verde                      12
##  3 Macau                           12
##  4 Republic of Korea              372
##  5 Russian Federation             324
##  6 Saint Kitts and Nevis           36
##  7 Saint Vincent and Grenadines   300
##  8 Trinidad and Tobago            324
##  9 United Kingdom                 372
## 10 United States                  372

worldData <- worldData %>% 
mutate(region = ifelse(str_detect(region, "Korea"),"Korea", region),
      region = ifelse(str_detect(region, "Barbuda"),"Antigua", region))

The worldData variable contains data that will enable me to plot a world map using ggplot, however, I need to plot the dataset I have on top of the world map. To do this, I use a full_join(), after renaming the country column to region.

I also use mutate() to ensure that as many countries are matched as possible, as some of the countries are just labelled differently.

The remainder of the code is pretty similar to what I’ve been doing so far. I use the viridis package to generate colours for the worldmap.

library(viridis)

## Loading required package: viridisLite

suicideDataProcessed %>% 
mutate(country = as.character(country), 
       country = ifelse(country == "Russian Federation", "Russia", country),
       country = ifelse(country == "United Kingdom", "UK", country),
       country = ifelse(country == "United States", "USA", country),
       country = ifelse(country == "Trinidad and Tobago", "Trinidad", country),
       country = ifelse(country == "Saint Kitts and Nevis", "Saint Kitts", country),
       country = ifelse(country == "Saint Vincent and Grenadines", "Saint Vincent", country),
       country = ifelse(country == "Cabo Verde", "Cape Verde", country),
       country = ifelse(country == "Antigua and Barbuda", "Antigua", country),
       country = ifelse(country == "Republic of Korea", "Korea", country)
      ) %>%
group_by(country) %>% 
rename(region = country) %>%
summarize(count = n(), medianSuicides = median(suicides_no)) %>% 
full_join(worldData, by = "region") %>% 
ungroup() %>% 
ggplot(aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = medianSuicides)) +
scale_fill_viridis(option = "B", direction = -1) +
labs(title = "Median suicides globally", subtitle = "Gray areas represent missing data", x = "", y = "") +
coord_map("globular") +
theme(axis.text.x = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks = element_blank()
    )

And there you have it, a world map, coloured according to the average number of suicides per country. Note how dark the US, Japan, and Russia are, this directly corresponds to an earlier graph wherein we plotted the same data using a bar plot.

The graph has a lot of missing data, I tried to eliminate the ones that could be eliminated by simple renaming, but upon closer inspection, the dataset has no information when it comes to countries like China and India, surprisingly.

Creating a predictive model

The original dataset has quite a few variables present, but what I’d like to look at is to see if I can use them to roughly predict suicide numbers for a particular year.

We can create a very basic linear regression model using lm() that uses several dependent variables present in our data to predict the number of suicides, our independent variable. To do so for each country, I use the map function from the purrr package. This enables me to define a function, and apply it to different groups of data.

I first split the data into different data sets using the split() function, this creates a separate data frame for each country present originally. I then use map() as well as lm() to train a linear regression model for each individual country present, after which I again use summary() to generate some information on each individual model trained. The map_dfr() function works similarly to the map() function, with the only difference being that the former returns a dataframe, whereas the latter returns a list.

gather() is used here to reshape the data from the wide format returned by map_dfr() to long data, in order to effectively visualize it using ggplot2. I visualize the r.squared metric for each model as a qualitative indicator, with lower values indicating better fits.

suicideDataProcessed %>% 
select(country, sex, age, suicides_no, population, generation, year) %>% 
mutate(sex = as.factor(sex), age = as.factor(age)) %>% 
split(.$country) %>% 
purrr::map(~lm(suicides_no ~ population + age + sex + generation + year, data = .)) %>% 
purrr::map(summary) %>%   
map_dfr("r.squared") %>% 
gather(country, r.squared, Albania:Uzbekistan) %>% 
mutate(country = as.factor(country), country = fct_reorder(country, r.squared)) %>% 
filter(!is.na(r.squared), r.squared > 0.8) %>% 
ggplot(aes(country, r.squared)) +
geom_line(group = 1, aes(colour = r.squared)) +
geom_hline(aes(yintercept = median(r.squared)), linetype = "dotted") +
coord_flip() +
scale_color_viridis(direction = -1) +
labs(title = "R squared for each country", subtitle = "Dotted line represents the median R squared", x= "Country", y = "R squared")

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `country = fct_reorder(country, r.squared)`.
## Caused by warning:
## ! `fct_reorder()` removing 2 missing values.
## ℹ Use `.na_rm = TRUE` to silence this message.
## ℹ Use `.na_rm = FALSE` to preserve NAs.

The next graph works similarly to the previous one, with one key difference lying in its using of the tidy() function from the broom package. This function essentially converts model output to a dataframe, so that it can be manipulated with relative ease using packages like dplyr and ggplot2.

The p-value of a hypothesis test indicates the significance of your results, with low values indicating that the null hypothesis is _rejected`, and high values indicating that we fail to reject the null hypothesis.

In other words, when training machine learning models, a low p-value associated with a feature generally means that the feature is significant.

We could look at the p-values of the variables associated with our earlier linear model the following way:

suicideDataProcessed %>% 
select(country, sex, age, suicides_no, population, generation, year) %>% 
mutate(sex = as.factor(sex), age = as.factor(age)) %>% 
split(.$country) %>% 
purrr::map(~lm(suicides_no ~ population + age + sex + generation+year-1, data = .)) %>% 
map_dfr(broom::tidy) %>% 
filter(!is.na(statistic), !is.na(p.value)) %>% 
group_by(term) %>% 
summarize(avEst=median(estimate), avError = median(std.error), avStat = median(statistic), avPValue = median(p.value)) %>% 
ggplot(aes(term, avPValue)) +
geom_point() +
geom_line(group = 1) +
coord_flip() +
labs(title = "P value versus features", subtitle = "Lower p values indicate better results", x = "Term", y = "P-value")

As we can see from the above graph, the `year`, `sex`, and `population` variables are generally considered strong indicators when it comes to predicting suicides for a particular country, with `age groups` and `generations` being comparitively less significant.

Conclusion

Some key takeaways that we can take from the data:

Higher standards of living do not necessarily indicate lower suicide rates.
Suicice numbers were decreasing up till 2010, after which they started rising again.
Middle-aged people are more likely to commit suicides globally.
The year, sex, and population are strong indicators when used to predict the number of suicides for a country.
I got to learn a lot of new things while exploring this data, which only leads me to believe I’ve got so much more to learn when it comes to data analysis.

Some things I could’ve done better:

Use more complicated machine learning algorithms.
Explore the impact of gdp on suicides.
Use broom and purrr a lot more.