Analyzing election manifestos: Creating a Shiny app!

Slightly more than a year ago, India went through its 2019 general elections.

At the time, the country primarily had 2 parties in the fray: the Bharatiya Janata Party, which won with a significant majority in the previous 2014 elections, and the Indian National Congress, which has governed for most of modern India’s history.

Politics in India is a messy, complicated affair, and for someone like me who found the barrage of potential schemes and regulations daunting, doubly so. Even the election manifestos released by each party are not that easy to comprehend, which is not a good thing, since I believe people should know what they’re voting for.

This led me to try and analyze the manifestos the only way I know how: via R!

But that wasn’t enough. I didn’t want to be the only one who knew the results of my analysis; I wanted to share them with anybody who’s interested. This left me with 2 options:

  • A Shiny app would let users play around with my R code in an intuitive, interactive manner. This could be any user with access to the internet.
  • A Plumber API would let users do the same, albeit with some technical knowledge required, thereby limiting the potential reach of my app.

Thus, my first Shiny app was born. I’m planning on working on a Plumber API soon, and I will be covering it in a subsequent blog post.

This blog post will be divided into 2 parts.

  • The first part will deal with the actual analysis, including data tidying, tokenization, visualizations, and more.
  • The second part will revolve around converting my code into a Shiny app.

Let’s begin.

Part 1

Loading in the data

For this step, my initial approach was to scrape the data off each party’s website. However, that proved to be harder than expected, mainly because the manifestos were not present as actual web pages, but as embedded documents.

Fortunately for me, though, those documents were easily downloaded from each party’s official website:

library(pdftools) # for handling PDFs
library(tidyverse) # for streamlining data analysis
library(tidytext) # for text analysis

theme_set(theme_light())
bjpRaw <- pdf_text("../../../static/data/bjp.pdf")
congressRaw <- pdf_text("../../../static/data/congress.pdf")

Note: Since I want my Shiny app to be driven by user input, I’ll be writing the code for most of the plots below as functions. That way, I can easily pass the input values received by the Shiny app into the code instead of hardcoding values.

Initial cleaning

While loading the files, I noticed parsing errors with some words, leading them to be truncated. Word stemming can’t recover most of these truncated fragments, so I’ve simply tried to filter out the faulty words, as far as possible.

I’m also extracting and cleaning the individual manifestos separately, after which I use rbind() to combine them into the data frame we’ll be working with.

I’m using tidytext’s unnest_tokens() function to tokenize the data into individual words, after which I’ll use the stop_words lexicon built into tidytext to filter out stop words via anti_join().

filterWords <-
  c(
    "impo",
    "ance",
    "oppo",
    "unities",
    "di",
    "erent",
    "unde",
    "ake",
    "ts",
    "bene",
    "se",
    "ing",
    "electri",
    "cation",
    "commi",
    "ed",
    "diately",
    "ve",
    "pa",
    "er",
    "icipation",
    "fu"
  )
bjp <- bjpRaw %>%
  as_tibble_col() %>% # tbl_df() is deprecated; this yields a tibble with a "value" column, one row per page
  unnest_tokens(word, value) %>% # tokenizing the data
  mutate(party = "bjp",
         section = row_number() %/% 10) %>% # adding a section variable for granularity, helps in tf-idf and correlations
  filter(!word %in% filterWords,
         str_detect(word, "[a-zA-Z]")) # keeping only valid words

congress <- congressRaw %>%
  as_tibble_col() %>%
  unnest_tokens(word, value) %>%
  mutate(party = "congress",
         section = row_number() %/% 10) %>% 
  filter(!word %in% filterWords,
         str_detect(word, "[a-zA-Z]")) # same cleaning as bjp above


df <- rbind(bjp, congress) # combining

dfProcessed <- df %>%
  anti_join(stop_words) %>%
  filter(!str_detect(word, "[0-9]"), # keeping only those words with no numeric characters
         str_detect(word,"[a-zA-Z]")) %>%
  filter(!word %in% filterWords)

Now that the data is loaded and cleaned, we can begin with our visualizations.

Exploratory analysis

The first plot is a simple bar graph, displaying the most commonly occurring words in each manifesto. The user decides how many words to display.

wordCount <- function(num) {
  dfProcessed %>%
    count(word, party, sort = TRUE) %>%
    filter(!word %in% filterWords) %>% 
    group_by(party) %>%
    arrange(desc(n)) %>% 
    top_n(num) %>%
    ungroup() %>%
    ggplot(aes(reorder_within(word, n, party), n)) +
    geom_col(aes(fill = party)) +
    geom_label(aes(label = n), alpha = 0.2, size = 3) +
    scale_x_reordered() +
    coord_flip() +
    facet_wrap( ~ party, scales = "free") +
    labs(
      title = paste(num,"most common words per party"),
      y = "Count",
      x = "Word",
      caption = "Sources: https://www.bjp.org/en/manifesto2019\nhttps://manifesto.inc.in/pdf/english.pdf"
    )
}

wordCount(10) # an example

Let’s plot how often words occur in both manifestos.

I do this by calculating the total number of words per party and the number of times a particular word appears in each manifesto, then taking their ratio. After that, I use pivot_wider() to reshape the data into a format more suitable for the plot I have in mind.
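As a toy illustration of what pivot_wider() does here (the numbers are made up), reshaping turns one row per party-word pair into one row per word, with a column per party:

tibble(party = c("bjp", "congress"), word = "jobs", freq = c(0.010, 0.015)) %>%
  pivot_wider(names_from = party, values_from = freq)
# a single row: word = "jobs", bjp = 0.010, congress = 0.015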

library(scales)
freqPercent <- dfProcessed %>%
  group_by(party) %>%
  count(word, sort = TRUE) %>% 
  left_join(dfProcessed %>%
              group_by(party) %>%
              summarise(total = n())) %>% # total words in processed data frames per party
  mutate(freq = n / total) # percentage of a particular word occurring for a particular party


freqPlot <- function() {
  
  freqPercent %>%
    select(party, word, freq) %>%
    top_n(500) %>% # to prevent overplotting
    pivot_wider(names_from = party, values_from = freq) %>% 
    arrange(bjp, congress) %>% 
    ggplot(aes(bjp, congress)) +
    geom_jitter(
      alpha = 0.2,
      size = 4,
      width = 0.25,
      height = 0.25
    ) +
    geom_text(aes(label = word),
              size = 4.5,
              check_overlap = TRUE,
              vjust = 1.5) +
    scale_x_log10(labels = percent_format()) +
    scale_y_log10(labels = percent_format()) +
    geom_abline(color = "red",
                alpha = 0.8,
                linetype = "dashed") +
    labs(title = "Word frequences in the manifestos by BJP and Congress",
         subtitle = "Words above and below the red line are more frequently\npresent in Congress and BJP respectively",
         caption = "Sources: https://www.bjp.org/en/manifesto2019\nhttps://manifesto.inc.in/pdf/english.pdf")
  
  
}

freqPlot() 

In the Shiny app, I’ll be removing the top_n() call I used in the above section. I used it here to prevent overplotting in the .Rmd document, since I could not find a reliable way to resize ggplots in .Rmd files. In Shiny, we can set a plot’s dimensions easily, so hopefully we’ll still get an insightful graph without too much information loss.
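For instance, a plot’s dimensions can be fixed at the UI level in Shiny (the height value here is just an illustration; the actual UI code appears in Part 2):

plotOutput("freqPlot", height = 600) # renders the plot 600 pixels tall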

The next plot we’ll be working on is the tf-idf plot. For the uninitiated, tf-idf is a metric that weights a word by how often it appears in one document relative to how rarely it appears in the other documents. This prevents commonly occurring words from being given extra weightage, letting us visualize which topics each party talks about more in comparison with the other.
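Concretely, tf-idf multiplies a word’s term frequency within a document by its inverse document frequency across documents. Here’s a minimal sketch of that computation, assuming a hypothetical data frame counts with columns word, section, and n (the same shape we pass to bind_tf_idf() below):

totalSections <- n_distinct(counts$section)

counts %>%
  group_by(section) %>%
  mutate(tf = n / sum(n)) %>% # term frequency: the word's share of its section
  group_by(word) %>%
  mutate(idf = log(totalSections / n_distinct(section))) %>% # rarer across sections = higher idf
  ungroup() %>%
  mutate(tf_idf = tf * idf) # high for words frequent in one section but rare elsewhere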

The bind_tf_idf() function from tidytext is used here.

tfidfWords <- function(partyInput, num) {
  dfProcessed %>%
    filter(party == partyInput) %>% # calculating only based on user input
    filter(!word %in% filterWords) %>% 
    count(word, section) %>%
    bind_tf_idf(word, section, n) %>% # generating tf-idf values per word, per document section
    arrange(desc(tf_idf)) %>%
    head(num) %>%
    mutate(word = fct_reorder(word, tf_idf)) %>%
    ggplot(aes(word, tf_idf)) +
    geom_col(aes(fill = tf_idf), color = "black") +
    coord_flip() +
    scale_fill_distiller(direction = 1, palette = "RdPu") + # setting a color palette manually
    labs(
      title = paste("Words more specific in", partyInput, "'s manifesto"),
      subtitle = paste("Top",num,"Words weighted by tf-idf"),
      caption = "Sources: https://www.bjp.org/en/manifesto2019\nhttps://manifesto.inc.in/pdf/english.pdf"
    )
}

tfidfWords("bjp", 10) # an example

Network graphs

Network graphs are an intuitive way of visualizing connections between terms occurring in corpora. They let us grasp which words frequently occur beside each other in the form of n-grams, or which words are correlated with each other across the document (or, in this case, across document sections).

The igraph and ggraph libraries provide utilities for converting data frames into just such a graph.

An n-gram is a sequence of n adjacent words. A unigram is an individual word, a bi-gram is 2 words occurring in succession, a tri-gram comprises 3 words, and so on.
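As a quick toy illustration (the sentence is made up), tokenizing by bigrams slides a two-word window over the text:

tibble(value = "jobs for every young indian") %>%
  unnest_tokens(word, value, token = "ngrams", n = 2)
# yields: "jobs for", "for every", "every young", "young indian"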

The code below converts the raw data into a bi-gram tokenized, cleaned data frame, suitable for network plotting.

library(igraph)
library(ggraph)
library(widyr)

bjpBG <- bjpRaw %>%
  as_tibble_col() %>% # as above, replacing the deprecated tbl_df()
  unnest_tokens(word, value, token = "ngrams", n = 2) %>% # tokenizing via n-grams (bigrams in this case)
  mutate(party = "bjp") %>% 
  filter(str_detect(word,"[a-zA-Z]"))

congressBG <- congressRaw %>%
  as_tibble_col() %>%
  unnest_tokens(word, value, token = "ngrams", n = 2) %>%
  mutate(party = "congress") %>%
  filter(str_detect(word, "[a-zA-Z]")) # same cleaning as bjpBG

dfBG <- rbind(bjpBG, congressBG) # combining the bi-gram dataframes

dfBGProcessed <- dfBG %>%
  separate(word, into = c("word1", "word2"), sep = " ") %>% # separating the words based on the occurrence of a space
  filter(
    !word1 %in% stop_words$word & # dropping stop words
      !word2 %in% stop_words$word &
      is.na(as.numeric(word1)) & is.na(as.numeric(word2)) # and purely numeric tokens (throws harmless coercion warnings)
  ) %>%
  filter(!word1 %in% filterWords &
         !word2 %in% filterWords) %>%
  filter(str_detect(word1,"[a-zA-Z]") &
         str_detect(word2,"[a-zA-Z]")) %>% 
  mutate(word1 = ifelse(word1=="sta", "start", word1), # cleaning up a few truncated words
         word2 = ifelse(word2=="suppo", "support", word2),
         word1 = ifelse(word1=="nancial", "financial", word1)) %>% 
  count(word1, word2, party, sort = TRUE)

Now that the data has been converted into a suitable format, we can work on our function for plotting the data below:

bgPlotFunc <- function(partyInput) {
  set.seed(100)
  
  dfBGProcessed %>%
    filter(party == partyInput) %>% # takes in user input
    filter(n > 5) %>% # to avoid overplotting
    graph_from_data_frame() %>% # converting the data frame into a network graph
    ggraph(layout='nicely') + # converting the graph into a ggplot object
    geom_edge_link(aes(edge_colour = n, edge_width = n)) + # defining edges
    geom_node_point(alpha = 0.8, size = 3) + # defining points
    geom_node_text(aes(label = name), repel = TRUE) + # defining labels
    scale_edge_color_distiller(palette = "RdPu", direction = 1) + # defining colour
    guides(alpha = "none", edge_alpha = "none") + # "none" replaces the deprecated FALSE
    labs(title = paste("Frequent word pairings in", partyInput, "'s manifesto"),
         subtitle = "Darker, wider colors are more frequent",
         caption = "Sources: https://www.bjp.org/en/manifesto2019\nhttps://manifesto.inc.in/pdf/english.pdf") +
    theme_void()
}

bgPlotFunc("congress") # an example

Correlation plots are another way of visualizing word pairings across a document, with the added advantage of showing words that occur in the same document section but are not necessarily adjacent to each other, unlike in the bi-gram plot above. The pairwise_cor() function from the widyr package computes a correlation for each pair of words based on their co-occurrence across document sections.

corWords <- function(partyInput) {
  set.seed(100)
  dfProcessed %>%
    filter(party == partyInput) %>% # filtering based on user input
    filter(!word %in% filterWords) %>%
    group_by(word) %>%
    filter(n() > 20) %>% # keeping only words that occur more than 20 times
    pairwise_cor(word, section, sort = TRUE, upper = FALSE) %>% # correlations for each word per section
    filter(!is.na(correlation),
           correlation > 0.04, # filtering out words with low correlations
           !item1 %in% filterWords & # cleaning up output
           !item2 %in% filterWords) %>%
    graph_from_data_frame() %>% # plotting the network graph
    ggraph(layout='nicely') +
    geom_edge_link(aes(edge_colour = correlation, edge_width = correlation)) +
    geom_node_point(size = 3, alpha = 0.8) +
    geom_node_text(aes(label = name), repel = TRUE) +
    scale_edge_color_distiller(palette = "RdPu", direction = 1) +
    labs(title = paste("Correlated words in",partyInput,"'s manifesto"),
         subtitle = "Darker, wider colors are more correlated",
         caption = "Sources: https://www.bjp.org/en/manifesto2019\nhttps://manifesto.inc.in/pdf/english.pdf") +
    theme_void()
}

corWords("congress") # an example

Topic modelling

Topic modelling is similar to unsupervised clustering in the sense that it helps analysts find groupings in the data that were previously unseen. It enables us to find abstract topics or themes across documents, facilitating the discovery of hidden semantic structures, if any.

The stm package is an R implementation of the Structural Topic Model, a general framework for topic modelling with document-level covariate information.

Training an STM understandably takes some time.

library(stm)

topicWords <- function(partyInput) {
  dfSparse <- dfProcessed %>%
    filter(party == partyInput) %>%
    count(section, word, sort = TRUE) %>%
    cast_sparse(section, word, n) # requires sparse matrix as input
  
  topicModel <- # the topic model algorithm
    stm(dfSparse,
        K = 3,
        verbose = FALSE,
        init.type = "LDA") # using latent dirichlet allocation
  
  
  tdBeta <- tidy(topicModel) # converting model results to a tibble via tidytext's tidy() method for STM objects
  
  tdBeta %>% 
    group_by(topic) %>%
    arrange(desc(beta)) %>%
    top_n(20, beta) %>% # keeping the 20 most influential words in a particular topic
    ungroup() %>%
    mutate(topic = paste("Topic", topic),
           term = fct_reorder(term, beta)) %>%
    ggplot(aes(term, beta)) +
    geom_col(alpha = 0.8, aes(fill = topic)) +
    facet_wrap( ~ topic, scales = "free") +
    coord_flip() +
    labs(title = paste("Possible topics for ", partyInput),
         subtitle = "Topic modelling via LDA used",
         caption = "Sources: https://www.bjp.org/en/manifesto2019\nhttps://manifesto.inc.in/pdf/english.pdf")
  
  
}
topicWords("congress") # an example

If you’re still here, thank you for reading so far! I’m done with the analysis, so if that’s what you were interested in, you may stop reading now.

The next part deals with converting what we did above into a living, breathing application, available to anyone and everyone with an internet connection. Read on to find out how!

Part 2

I first learned about Shiny way back in college, when we had to create interactive plots for a course. Surprisingly, though, it took me quite a while to get started with it, and once I did, I couldn’t help but marvel at the ease of execution. A few lines of code, and anybody with just a little bit of R knowledge can create a web app.

Shiny apps traditionally consist of 2 parts:

  • a ui file, describing the layout of your application
  • a server file, containing the code that handles the data being displayed

You can put these two segments into separate files, or you can define both in a single file that you then use to create the app, as in the minimal sketch below.
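For reference, here’s a minimal single-file skeleton (a toy example with made-up IDs, not the actual app):

library(shiny)

ui <- fluidPage(
  numericInput("n", "Number of points", value = 50), # an input with ID "n"
  plotOutput("scatter") # an output with ID "scatter"
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    plot(runif(input$n), runif(input$n)) # re-runs whenever input$n changes
  })
}

shinyApp(ui = ui, server = server)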

I’m using the shiny and the shinydashboard packages primarily to set up my application.

Let’s begin with the ui file, step by step.

The first thing I do before doing anything related to Shiny is load up the code I wrote above containing all my functions and variables via a source() call. This is so that I can access the results of my code in the Shiny app.

Shiny UI works by using specific elements to define the app layout, placed inside a “layout” function that governs how each element is arranged. By default, the fluidPage() function is used, which adjusts the dimensions of your app to match those of the browser window it’s open in.

I’m using dashboardPage() from the shinydashboard package, since I’m creating a dashboard and the package provides several helper functions for exactly that.

The dashboardPage() function always takes in the following parameters:

  • a header, containing stuff like the title of the app,
  • a sidebar, containing stuff like a menu, or tabs for navigating through the app,
  • and a body, containing the results of the executed code

In the code below, I’m designing a tabbed user interface via tabItem().

Shiny also has families of *Input() and *Output() functions, named for the kind of data they handle. For example, plotOutput() renders graphs, whereas textOutput() renders text; similarly, numericInput() collects numbers and selectInput() offers a drop-down. Each of these functions takes an ID, letting us access the element in our server code.

library(shiny)
library(shinydashboard)
library(dashboardthemes)

source("script.R")

ui <- dashboardPage(
  dashboardHeader(title = "Manifesto-viz"),
  dashboardSidebar(
    sidebarMenu(
      menuItem("Word frequencies", tabName = "wordFreq"),
      menuItem("tf-idf", tabName = "tfidf"),
      menuItem("Bigrams", tabName = "bigrams"),
      menuItem("Correlated words", tabName = "corWords"),
      menuItem("Topics per party", tabName = "topicWords")
    )
  ),
  dashboardBody(
    shinyDashboardThemes(theme = "grey_dark"), # theming
    tabItems( # setting up app tabs
      tabItem(
        tabName = "wordFreq",
        fluidRow(
          box(
            title = "Word frequencies",
            #background = "black",
            width = 12,
            solidHeader = TRUE,
            plotOutput("freqPlot", height = 600) # displaying a graph, along with assigning an ID to the result
          )
        ),
        fluidRow(
          box(
            numericInput( # taking numeric user input
              inputId = "numInput",
              label = "Number of most frequent words",
              value = 10,
              min = 1,
              max = 30
            ),
            title = "Word count per manifesto",
            solidHeader = TRUE,
            width = 12,
            #background = "black",
            plotOutput("facetPlot") # plot output
          )
        )),
      tabItem(tabName = "tfidf",
              fluidRow(
                box(
                  numericInput( # numeric input
                    inputId = "numInputTFIDF",
                    label = "Number of words more specific to a party",
                    value = 10,
                    min = 1,
                    max = 30
                  ),
                  title = "Words more specific per manifesto",
                  #background = "black",
                  width = 12,
                  solidHeader = TRUE,
                  selectInput( # drop down
                    inputId = "selectParty",
                    label = "Party",
                    choices = c("BJP", "Congress")
                  ),
                  plotOutput("tfidfPlot") # plot output
                )
              )),
      tabItem(tabName = "bigrams",
              fluidRow(
                box(
                  title = "Frequent word pairings",
                  #background = "black",
                  width = 12,
                  solidHeader = TRUE,
                  selectInput( # drop down
                    inputId = "selectPartyBG",
                    label = "Party",
                    choices = c("BJP", "Congress")
                  ),
                  plotOutput("bgPlot") # plot output
                )
              )),
      tabItem(tabName = "corWords",
              fluidRow(
                box(
                  title = "Correlated words from each manifesto",
                  solidHeader = TRUE,
                  #background = "black",
                  width = 12,
                  selectInput( # drop down
                    inputId = "selectPartyCor",
                    label = "Party",
                    choices = c("BJP", "Congress")
                  ),
                  plotOutput("corWords") # plot output
                )
              )),
      tabItem(tabName = "topicWords",
              fluidRow(
                box(
                  solidHeader = TRUE,
                  #background = "black",
                  width = 12,
                  title = "Topic modelling of each manifesto",
                  selectInput( # drop down
                    inputId = "selectPartyTopic",
                    label = "Party",
                    choices = c("BJP", "Congress")
                  ),
                  plotOutput("topicWords", width = 1000) # plot output
                )
              ))
    )
  )
)

Notice how each of the *Input() and *Output() functions has an ID being passed as a parameter. We’ll be using them now.

The above code defined what our application looks like, but we haven’t really handled its behaviour yet.

Luckily, most of the work is already being done in the functions we defined in Part 1 of the document. All we need to do now is call specific functions in specific elements of our application.

The input and output parameters being passed to server let us access the elements we defined in ui via the standard $ notation. This enables us to treat each element like an object, objects which can be assigned values of our choosing.

Our application relies almost entirely on plots, so we’ll be using the renderPlot() function to render the plots defined in our functions, assigning each result to the corresponding output we defined under ui.

In the code below, I’m wiring each of the functions we declared in Part 1 to its corresponding ui element.

Here’s the server code:

server <- function(input, output) {
  output$freqPlot <- renderPlot({ # rendering plot output
    freqPlot() # calling a function from the script in part 1
  })
  
  wordInput <- reactive({
    wordCount(input$numInput)
  })
  
  output$facetPlot <- renderPlot({
    wordInput() # rendering and assigning a plot
  })
  
  tfidfInput <- reactive({
    tfidfWords(stringr::str_to_lower(input$selectParty), # parsing user input
               input$numInputTFIDF)
  })
  
  output$tfidfPlot <- renderPlot({
    tfidfInput() # rendering and assigning a plot
  })
  
  output$bgPlot <- renderPlot({
    bgPlotFunc(stringr::str_to_lower(input$selectPartyBG)) # parsing user input
  })
  
  output$corWords <- renderPlot({
    corWords(stringr::str_to_lower(input$selectPartyCor)) # parsing user input
  })
  
  topicReactive <- reactive({
    topicWords(stringr::str_to_lower(input$selectPartyTopic)) # parsing user input
  })
  
  output$topicWords <- renderPlot({
    topicReactive() # rendering and assigning a plot
  })
}

# Run the application
shinyApp(ui = ui, server = server)

You might’ve noticed the reactive() calls in the above snippet. I’ve used them in the computationally intensive parts of the app (for example, the topic modelling), as they control which parts of the app update when, instead of re-running every computation on each change. A reactive expression caches its result and re-executes only when the inputs it depends on change.
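Here’s a stripped-down sketch of the idea (slowSummary and input$x are hypothetical names, not part of the app):

# inside a server function
slowSummary <- reactive({
  Sys.sleep(2) # stand-in for an expensive computation
  summary(rnorm(input$x))
})

output$minVal <- renderText({ slowSummary()["Min."] }) # both outputs share one cached result;
output$maxVal <- renderText({ slowSummary()["Max."] }) # the slow code re-runs only when input$x changes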

If you’re still here, thank you for reading this far! That was a long post, but we covered quite a few things along the way:

  • Reading, cleaning and extracting text from PDFs via pdftools
  • pivot_wider()
  • Text analysis via tidytext
  • tf-idf
  • Correlations and bigrams
  • Network plots
  • Topic modelling
  • Shiny UI elements
  • The input and output Shiny objects
  • reactive expressions

This was honestly exhausting, but I feel relieved that I finally got around to doing it. It’s always exciting to give other people the opportunity to interact with your code, after all.

I welcome suggestions and feedback, as always. You can use the comments below or contact me via email.

The app lives here.