Slightly more than a year back India went through the 2019 general elections.
At the time, the country primarily had 2 parties in the fray: the Bhartiya Janta Party, a party that won with a significant majority in the previous 2014 elections, and the Indian National Congress, a party that has historically governed for most of modern India’s history.
Politics in India is a messy, complicated affair, and for someone like me who found the barrage of potential schemes and regulations daunting, doubly so. Even the election manifestos released by each party are not that easy to comprehend, which is not a good thing, since I believe people should know what they’re voting for.
This led me to try and analyze the manifestos the only way I know how: via R!
But that wasn’t enough. I didn’t want only me to know the results of my analysis, I wanted to share it with anybody who’s interested. This led me to 2 options:
A Shiny app would let users play around with my R code, in an intuitive, interactive manner. This could be any user with access to internet. A Plumber API would let users do the same, albeit with some technical knowledge required, thereby limiting the potential reach of my app.
Thus, my first Shiny app was born. I’m planning on working on a Plumber API soon, and I will be covering it in a subsequent blog post.
This blog post will be divided into 2 parts.
- The first part will deal with the actual analysis, including data tidying, tokenization, visualizations, and more.
- The second part will revolve around converting my code into a Shiny app.
Let’s begin.
Part 1
Loading in the data
For this step, my initial approach was to scrape the data based off their websites. However, that proved to be harder than expected, mainly because the manifestos were not present as actual web pages, but as embedded documents.
Fortunately for me, though, those documents were easily downloaded from each party’s official website:
library(pdftools) # for handling PDFs
library(tidyverse) # for streamlining data analysis
library(tidytext) # for text analysis
theme_set(theme_light())
bjpRaw <- pdf_text("../../../static/data/bjp.pdf")
congressRaw <- pdf_text("../../../static/data/congress.pdf")
Note: Since I want my Shiny app to be based off user input, I’ll be writing the code for most of the plots below as
functions
. That way, I’ll easily be able to parse the input values received in the Shiny app to the code instead of hardcoding values.
Initial cleaning
While loading the files, I noticed parsing errors with some words, leading them to be truncated. Word stemming is out of scope for some of these truncated words, so I’ve tried to simply filter out the faulty words, as far as possible.
I’m also extracting and cleaning the individual manifestors separately, after which I use rbind()
to combine them into the data frame we’ll be working with.
I’m using tidytext
’s unnest_tokens()
function to tokenize the data into individual words, after which I’ll use the inbuilt stop-words lexicon to filter out stop-words via anti_join()
.
filterWords <-
c(
"impo",
"ance",
"oppo",
"unities",
"di",
"erent",
"unde",
"ake",
"ts",
"bene",
"se",
"ing",
"electri",
"cation",
"commi",
"ed",
"diately",
"ve",
"pa",
"er",
"icipation",
"fu"
)
bjp <- bjpRaw %>%
tbl_df() %>%
unnest_tokens(word, value) %>% # tokenizing the data
mutate(party = "bjp",
section = row_number() %/% 10) %>% # adding a section variable for granularity, helps in tf-idf and correlations
filter(!word %in% filterWords,
str_detect(word, "[a-zA-Z]")) # keeping only valid words
congress <- congressRaw %>%
tbl_df() %>%
unnest_tokens(word, value) %>%
mutate(party = "congress",
section = row_number() %/% 10) %>%
filter(!word %in% filterWords)
df <- rbind(bjp, congress) # combining
dfProcessed <- df %>%
anti_join(stop_words) %>%
filter(!str_detect(word, "[0-9]"), # keeping only those words with no numeric characters
str_detect(word,"[a-zA-Z]")) %>%
filter(!word %in% filterWords)
Now that the data is loaded and cleaned, we can begin with our visualizations.
The first plot is a simple bar graph, displaying the most commonly occurring words in each manifesto. The user decides how many words to display.
Exploratory analysis
wordCount <- function(num) {
dfProcessed %>%
count(word, party, sort = TRUE) %>%
filter(!word %in% filterWords) %>%
group_by(party) %>%
arrange(desc(n)) %>%
top_n(num) %>%
ungroup() %>%
ggplot(aes(reorder_within(word, n, party), n)) +
geom_col(aes(fill = party)) +
geom_label(aes(label = n), alpha = 0.2, size = 3) +
scale_x_reordered() +
coord_flip() +
facet_wrap( ~ party, scales = "free") +
labs(
title = paste(num,"most common words per party"),
y = "Count",
x = "Word",
caption = "Sources: https://www.bjp.org/en/manifesto2019\nhttps://manifesto.inc.in/pdf/english.pdf"
)
}
wordCount(10) # an example
Let’s plot how often words occur in both manifestos.
I do this by simply calculating the total number of words per party, the number of times a particular word appears per manifesto, and taking their ratio, after which I use pivot_wider()
to reshape the data to a format more suitable for the plot I have in mind.
library(scales)
freqPercent <- dfProcessed %>%
group_by(party) %>%
count(word, sort = TRUE) %>%
left_join(dfProcessed %>%
group_by(party) %>%
summarise(total = n())) %>% # total words in processed data frames per party
mutate(freq = n / total) # percentage of a particular word occurring for a particular party
freqPlot <- function() {
freqPercent %>%
select(party, word, freq) %>%
top_n(500) %>% # to prevent overplotting
pivot_wider(names_from = party, values_from = freq) %>%
arrange(bjp, congress) %>%
ggplot(aes(bjp, congress)) +
geom_jitter(
alpha = 0.2,
size = 4,
width = 0.25,
height = 0.25
) +
geom_text(aes(label = word),
size = 4.5,
check_overlap = TRUE,
vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
geom_abline(color = "red",
alpha = 0.8,
linetype = "dashed") +
labs(title = "Word frequences in the manifestos by BJP and Congress",
subtitle = "Words above and below the red line are more frequently\npresent in Congress and BJP respectively",
caption = "Sources: https://www.bjp.org/en/manifesto2019\nhttps://manifesto.inc.in/pdf/english.pdf")
}
freqPlot()
In the Shiny app, I’ll be removing the
top_n()
call I used in the above section. I used it here to prevent overplotting in the.Rmd
document, since I could not find a reliable way to resize ggplots in.Rmd
files. In Shiny, we can set up a dimension of the plot easily, so hopefully we’ll still be getting an insightful graph without too much information loss.
The next plot we’ll be working on is the tf-idf
plot. For the uninitiated, tf-idf
is a metric that gives words more importance, based on how often they appear in 1 document as compared to other documents. This prevents commonly occurring words being given extra weightage, leading us to visualize what topics each party talks about more in comparison with the other.
The bind_tf_idf()
function from tidytext
is used here.
tfidfWords <- function(partyInput, num) {
dfProcessed %>%
filter(party == partyInput) %>% # calculating only based on user input
filter(!word %in% filterWords) %>%
count(word, section) %>%
bind_tf_idf(word, section, n) %>% # generating tf-idf values per word, per document section
arrange(desc(tf_idf)) %>%
head(num) %>%
mutate(word = fct_reorder(word, tf_idf)) %>%
ggplot(aes(word, tf_idf)) +
geom_col(aes(fill = tf_idf), color = "black") +
coord_flip() +
scale_fill_distiller(direction = 1, palette = "RdPu") + # setting a color palette manually
labs(
title = paste("Words more specific in", partyInput, "'s manifesto"),
subtitle = paste("Top",num,"Words weighted by tf-idf"),
caption = "Sources: https://www.bjp.org/en/manifesto2019\nhttps://manifesto.inc.in/pdf/english.pdf"
)
}
tfidfWords("bjp", 10) # an example
Network graphs
Network graphs are an intuitive way of visualizing connections between terms occurring in corpora. They enable us to effectively grasp which words frequently occur besides each other in the form of n-grams
, or which words seem to be correlated with each other, through the document (section
in this case.)
The igraph
and ggraph
libraries provide utilities for converting data frames into just such a graph.
An
n-gram
is basically a unit of words adjacent to each other. A unigram is an individual word, a bi-gram is 2 words occurring in succession, a tri-gram comprises of 3 words, and so on.
The code below converts the raw data into a bi-gram tokenized, cleaned data frame, suitable for network plotting.
library(igraph)
library(ggraph)
library(widyr)
bjpBG <- bjpRaw %>%
tbl_df() %>%
unnest_tokens(word, value, token = "ngrams", n = 2) %>% # tokenizing via n-grams (bigrams in this case)
mutate(party = "bjp") %>%
filter(str_detect(word,"[a-zA-Z]"))
congressBG <- congressRaw %>%
tbl_df() %>%
unnest_tokens(word, value, token = "ngrams", n = 2) %>%
mutate(party = "congress")
dfBG <- rbind(bjpBG, congressBG) # combining the bi-gram dataframes
dfBGProcessed <- dfBG %>%
separate(word, into = c("word1", "word2"), sep = " ") %>% # separating the words based on the occurrence of a space
filter(
!word1 %in% stop_words$word & # filtering stop words and non-numeric characters only
!word2 %in% stop_words$word &
is.na(as.numeric(word1)) & is.na(as.numeric(word2))
) %>%
filter(!word1 %in% filterWords &
!word2 %in% filterWords) %>%
filter(str_detect(word1,"[a-zA-Z]") &
str_detect(word2,"[a-zA-Z]")) %>%
mutate(word1 = ifelse(word1=="sta", "start", word1), # cleaning up a few truncated words
word2 = ifelse(word2=="suppo", "support", word2),
word1 = ifelse(word1=="nancial", "financial", word1)) %>%
count(word1, word2, party, sort = TRUE)
Now that the data has been converted into a suitable format, we can work on our function for plotting the data below:
bgPlotFunc <- function(partyInput) {
set.seed(100)
dfBGProcessed %>%
filter(party == partyInput) %>% # takes in user input
filter(n > 5) %>% # to avoid overplotting
graph_from_data_frame() %>% # converting the data frame into a network graph
ggraph(layout='nicely') + # converting the graph into a ggplot object
geom_edge_link(aes(edge_colour = n, edge_width = n)) + # defining edges
geom_node_point(alpha = 0.8, size = 3) + # defining points
geom_node_text(aes(label = name), repel = TRUE) + # defining labels
scale_edge_color_distiller(palette = "RdPu", direction = 1) + # defining colour
guides(alpha = FALSE, edge_alpha = FALSE) +
labs(title = paste("Frequent word pairings in", partyInput, "'s manifesto"),
subtitle = "Darker, wider colors are more frequent",
caption = "Sources: https://www.bjp.org/en/manifesto2019\nhttps://manifesto.inc.in/pdf/english.pdf") +
theme_void()
}
bgPlotFunc("congress") # an example
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Correlation plots are another way of visualizing word-pairings across a document, with the added advantage of showing words that occur in the same document, but not necessarily adjacent to each other, unlike the bi-gram plot we just plotted. The pairwise_cor()
function from the widyr
package generates correlations for each word, in each document section
.
corWords <- function(partyInput) {
set.seed(100)
dfProcessed %>%
filter(party == partyInput) %>% # filtering based on user input
filter(!word %in% filterWords) %>%
group_by(word) %>%
filter(n() > 20) %>% # filtering out words that occur less than 20 times
pairwise_cor(word, section, sort = TRUE, upper = FALSE) %>% # correlations for each word per section
filter(!is.na(correlation),
correlation > 0.04, # filtering out words with low correlations
!item1 %in% filterWords & # cleaning up output
!item2 %in% filterWords) %>%
graph_from_data_frame() %>% # plotting the network graph
ggraph(layout='nicely') +
geom_edge_link(aes(edge_colour = correlation, edge_width = correlation)) +
geom_node_point(size = 3, alpha = 0.8) +
geom_node_text(aes(label = name), repel = TRUE) +
scale_edge_color_distiller(palette = "RdPu", direction = 1) +
labs(title = paste("Correlated words in",partyInput,"'s manifesto"),
subtitle = "Darker, wider colors are more correlated",
caption = "Sources: https://www.bjp.org/en/manifesto2019\nhttps://manifesto.inc.in/pdf/english.pdf") +
theme_void()
}
corWords("congress") # an example
Topic modelling
Topic modelling is similar to unsupervised clustering in the sense that it helps analysts find groups of observations previously left unseen. It enables us to find groups of abstract topics or themes across documents, facilitating the discovery of hidden semantic structures, if any.
The stm
package in R is an R implementation of the Structured Topic Model, a general framework for topic modeling with document-level covariate information.
Training an STM understandably takes some time.
library(stm)
topicWords <- function(partyInput) {
dfSparse <- dfProcessed %>%
filter(party == partyInput) %>%
count(section, word, sort = TRUE) %>%
cast_sparse(section, word, n) # requires sparse matrix as input
topicModel <- # the topic model algorithm
stm(dfSparse,
K = 3,
verbose = FALSE,
init.type = "LDA") # using latent dirichlet allocation
tdBeta <- tidy(topicModel) # converting model results to a tibble via broom::tidy()
tdBeta %>%
group_by(topic) %>%
arrange(desc(beta)) %>%
top_n(20, beta) %>% # keeping the 20 most influential words in a particular topic
ungroup() %>%
mutate(topic = paste("Topic", topic),
term = fct_reorder(term, beta)) %>%
ggplot(aes(term, beta)) +
geom_col(alpha = 0.8, aes(fill = topic)) +
facet_wrap( ~ topic, scales = "free") +
coord_flip() +
labs(title = paste("Possible topics for ", partyInput),
subtitle = "Topic modelling via LDA used",
caption = "Sources: https://www.bjp.org/en/manifesto2019\nhttps://manifesto.inc.in/pdf/english.pdf")
}
topicWords("congress") # an example
If you’re still here, thank you for reading so far! I’m done with the analysis, so if that’s what you were interested in, you may stop reading now.
The next part deals with converting what we did above into a living, breathing application on the internet, available to anyone and everyone with an internet connection. Read on to find out how!
Part 2
I first learned about Shiny way back in college, when we had to create interactive plots for a college course. Surprisingly though it took me quite a while to get started with it, and once I did, I couldn’t help but marvel at the ease of execution. A few lines of code, and anybody can create a web app with just a little bit of R knowledge.
Shiny apps traditionally consist of 2 parts:
- a
ui
file, describing the layout of your application - a
server
file, that contains the code you’ll be using to handle the data being displayed
You can put the above segments into 2 separate files, or you could put them as functions in a single file that you’d be using to create the app.
I’m using the shiny
and the shinydashboard
packages primarily to set up my application.
Let’s begin with the ui
file, step by step.
The first thing I do before doing anything related to
Shiny
is load up the code I wrote above containing all my functions and variables via asource()
call. This is so that I can access the results of my code in the Shiny app.
Shiny UI works by using specific elements to define the app layout, placed in a “layout” function that governs how each element is placed. By default, the fluidPage()
function is used, which basically adjusts the dimensions of your app to match the dimensions of the browser window its open in.
I’m using dashboardPage()
from the shinydashboard
package since I want to be creating a dashboard, and this package has several helpful functions that facilitate this.
The dashboardPage()
function always takes in the following parameters:
- a
header
, containing stuff like the title of the app, - a
sidebar
, containing stuff like a menu, or tabs for navigating through the app, - and a
body
, containing the results of the executed code
In the code below, I’m designing a tabbed user interface via tabItem()
.
Shiny also has several *input()
and *output()
functions, determining the kind of data they’ll be handling. For example, plotOutput()
will be rendering graphs, whereas textOutput()
will be rendering text. Similarly for the *input()
family. These functions also have associated IDs, letting us access them in our server
code.
library(shiny)
library(shinydashboard)
library(dashboardthemes)
source("script.R")
ui <- dashboardPage(
dashboardHeader(title = "Manifesto-viz"),
dashboardSidebar(
sidebarMenu(
menuItem("Word frequencies", tabName = "wordFreq"),
menuItem("tf-idf", tabName = "tfidf"),
menuItem("Bigrams", tabName = "bigrams"),
menuItem("Correlated words", tabName = "corWords"),
menuItem("Topics per party", tabName = "topicWords")
)
),
dashboardBody(
shinyDashboardThemes(theme = "grey_dark"), # theming
tabItems # setting up app tabs
(
tabItem
(tabName = "wordFreq",
fluidRow(
box(
title = "Word frequencies",
#background = "black",
width = 12,
solidHeader = TRUE,
plotOutput("freqPlot", height = 600) # displaying a graph, along with assigning an ID to the result
)
),
fluidRow(
box(
numericInput( # taking numeric user input
inputId = "numInput",
label = "Number of most frequent words",
value = 10,
min = 1,
max = 30
),
title = "Word count per manifesto",
solidHeader = TRUE,
width = 12,
#background = "black",
plotOutput("facetPlot") # plot output
)
)),
tabItem(tabName = "tfidf",
fluidRow(
box(
numericInput( # numeric input
inputId = "numInputTFIDF",
label = "Number of words more specific to a party",
value = 10,
min = 1,
max = 30
),
title = "Words more specific per manifesto",
#background = "black",
width = 12,
solidHeader = TRUE,
selectInput( # drop down
inputId = "selectParty",
label = "Party",
choices = c("BJP", "Congress")
),
plotOutput("tfidfPlot") # plot output
)
)),
tabItem(tabName = "bigrams",
fluidRow(
box(
title = "Frequent word pairings",
#background = "black",
width = 12,
solidHeader = TRUE,
selectInput( # drop down
inputId = "selectPartyBG",
label = "Party",
choices = c("BJP", "Congress")
),
plotOutput("bgPlot") # plot output
)
)),
tabItem(tabName = "corWords",
fluidRow(
box(
title = "Correlated words from each manifesto",
solidHeader = TRUE,
#background = "black",
width = 12,
selectInput( # drop down
inputId = "selectPartyCor",
label = "Party",
choices = c("BJP", "Congress")
),
plotOutput("corWords") # plot output
)
)),
tabItem(tabName = "topicWords",
fluidRow(
box(
solidHeader = TRUE,
#background = "black",
width = 12,
title = "Topic modelling of each manifesto",
selectInput( # drop down
inputId = "selectPartyTopic",
label = "Party",
choices = c("BJP", "Congress")
),
plotOutput("topicWords", width = 1000) # plot output
)
))
)
)
)
Notice how each of the
*input()
and*output()
functions has an ID being passed as a parameter. We’ll be using them now.
The above code defined what our application looks like, but we haven’t really handled its behaviour yet.
Luckily, most of the work is already being done in the functions we defined in Part 1 of the document. All we need to do now is call specific functions in specific elements of our application.
The input
and output
parameters being passed to server
let us access the elements we defined in ui
via the standard $
notation. This enables us to treat each element like an object, objects which can be assigned values of our choosing.
Our entire application almost entirely relies on plots, so we’ll be using the renderPlot()
function to render the plots defined in our functions, which will then be assigned to the output
s we defined under ui
. We’ll be assigning each function to its corresponding output
attribute.
In the code below, I’m calling each of the functions we declared in part 1 to their corresponding ui
elements.
Here’s the server
code:
server <- function(input, output) {
output$freqPlot <- renderPlot({ # rendering plot output
freqPlot() # calling a function from the script in part 1
})
wordInput <- reactive({
wordCount(input$numInput)
})
output$facetPlot <- renderPlot({
wordInput() # rendering and assigning a plot
})
tfidfInput <- reactive({
tfidfWords(stringr::str_to_lower(input$selectParty), # parsing user input
input$numInputTFIDF)
})
output$tfidfPlot <- renderPlot({
tfidfInput() # rendering and assigning a plot
})
output$bgPlot <- renderPlot({
bgPlotFunc(stringr::str_to_lower(input$selectPartyBG)) # parsing user input
})
output$corWords <- renderPlot({
corWords(stringr::str_to_lower(input$selectPartyCor)) # parsing user input
})
topicReactive <- reactive({
topicWords(stringr::str_to_lower(input$selectPartyTopic)) # parsing user input
})
output$topicWords <- renderPlot({
topicReactive() # rendering and assigning a plot
})
}
# Run the application
shinyApp(ui = ui, server = server)
You might’ve noticed the reactive()
calls in the above snippet. I’ve used them in computationally intensive parts of my app (for example, the topic modelling), as they control which parts of my apps update at which time, instead of going through each computation repetitively. Reactive expressions limit the code that will be executed again based on user input.
If you’re still here, thank you for reading this far! That was a long post, but through it we went covered quite a bit of things:
- Reading, cleaning and extracting data from pdfs via
pdftools
pivot_wider()
- Text analysis via
tidytext
tf-idf
- Correlations and bigrams
- Network plots
- Topic modelling
- Shiny UI elements
- The
input
andoutput
Shiny objects reactive
expressions
This was honestly exhausting, but I feel relieved that I finally got around doing it. It’s always exciting giving other people the opportunity to interact with your code, after all.
I’m welcome to suggestions and feedback, as always. You could use the comments below or contact me at my email.
The app lives here.