Easy word cloud with R — Peru’s presidential debate version

Peru was in a political crisis during the last months of 2020, which made the 2021 presidential election especially important for the general population.

I decided to show, in a simple way, the proposals presented by the candidates during the first presidential debate and share the result on Twitter. A word cloud was an easy way to compare the candidates’ proposals.

Let’s start!

The easiest and fastest way to obtain the transcript of the debate was to use YouTube's automatic captions, which were generated after the live broadcast. Automatic captions are not perfect, but because each candidate speaks without interruption during the debate, the errors are minimal.

Video of the debate used for this example

Now we can save the YouTube captions, with their timestamps, in a .txt file. Because the debate is divided into topic blocks and the candidates speak in a fixed order, it is easy to split the transcript by hand. The result is our dataset: one file for each candidate and each debate segment.

Dataset generated in .txt files of the debate’s transcript
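As a small sketch of how this file layout can be used later, the snippet below splits a file name such as vm_corrupcion.txt into the candidate's initials and the topic. The `<initials>_<topic>.txt` pattern is an assumption inferred from the two file names that appear in this post, and `parse_name` is a hypothetical helper, not part of the original code.

```r
# Hypothetical helper: split "<initials>_<topic>.txt" into its two parts.
# The naming pattern is an assumption based on the file names in this post.
parse_name = function(path) {
  stem = tools::file_path_sans_ext(basename(path))  # e.g. "vm_corrupcion"
  parts = strsplit(stem, "_", fixed = TRUE)[[1]]
  data.frame(
    candidato = toupper(parts[1]),
    tema = paste(parts[-1], collapse = "_"),  # keeps topics that contain "_"
    stringsAsFactors = FALSE
  )
}

archivos = c("debate/vm_corrupcion.txt", "debate/vm_pandemia.txt")
indice = do.call(rbind, lapply(archivos, parse_name))
print(indice)
```

The files do not need to exist for this to run, since only the names are parsed.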

Now, Let’s Code

The libraries that we will use for this project are:

  • tidyverse, because no R project is complete without it
  • tm (text mining), to run operations on the texts
  • SnowballC, which provides stemmers for Spanish, English and other languages; the stopword list used below comes from tm's stopwords()
  • wordcloud2, to draw the word cloud. The wordcloud package also works, but wordcloud2 produces a fancier, more interactive graphic.

library(tidyverse)
library(tm) # text mining
library(SnowballC) # Snowball stemmers, loaded alongside tm
library(wordcloud2) # word cloud with better interactivity

We read the .txt file with readLines, using UTF-8 encoding because the debate is in Spanish and the transcript contains accented characters:

x = readLines("debate/vm_corrupcion.txt", encoding="UTF-8")

The loaded files have the following form:

[1] "a diferencia de algunos candidatos aquí"  "09:08"                                   
[3] "presentes que están siendo procesados" "09:10"
[5] "por haber recibido financiamiento turbio" "09:12"
[7] "en sus campañas electorales o por haber" "09:15"
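The timestamps are removed later by the cleaning steps, but they can also be dropped right after loading. Here is a minimal base R sketch of that alternative, on a toy vector with the same shape as the output above. The names `es_tiempo`, `texto` and `tiempos` are mine, and the pattern assumes timestamps look like "09:08" or "1:02:33".

```r
# Toy input in the same shape as the caption export shown above
x = c("a diferencia de algunos candidatos aquí", "09:08",
      "presentes que están siendo procesados", "09:10")

# TRUE for elements that are only a timestamp such as "09:08" or "1:02:33"
es_tiempo = grepl("^[0-9]{1,2}(:[0-9]{2}){1,2}$", trimws(x))

texto = x[!es_tiempo]    # spoken lines only
tiempos = x[es_tiempo]   # timestamps, kept in case they are useful later
print(texto)
```

Filtering this way keeps the timestamp lines from becoming empty documents in the corpus, although the word counts end up the same either way.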

Cleaning .txt files

Transform the vector into a corpus (a list-like collection of documents) to make the text easier to clean:

x = Corpus(VectorSource(x))
  • Remove numbers, punctuation, extra spaces and symbols from the text
  • Change the words to lowercase
  • Remove the Spanish stopwords
  • Remove some extra words that are not in the stopword list
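To make these steps concrete, here is a rough base R sketch of their effect on a single line. It is only an approximation for illustration: the sample sentence and the three-word stopword list are made up, and the regular expressions are stand-ins, not what tm runs internally.

```r
# Made-up sample line and a tiny illustrative stopword list (not tm's list)
linea = "En el 2021, los candidatos presentaron 3 propuestas!"
mis_stopwords = c("en", "el", "los")

limpia = tolower(linea)                      # lowercase
limpia = gsub("[[:digit:]]+", "", limpia)    # remove numbers
limpia = gsub("[[:punct:]]+", "", limpia)    # remove punctuation

# Remove whole words only, so "en" inside "presentaron" is left alone
patron = paste0("\\b(", paste(mis_stopwords, collapse = "|"), ")\\b")
limpia = gsub(patron, "", limpia, perl = TRUE)

limpia = trimws(gsub("\\s+", " ", limpia))   # collapse repeated spaces
print(limpia)
```

The tm pipeline applies the equivalent steps to every line in the corpus: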
x = x %>%
  tm_map(removeNumbers) %>% # remove the timestamps
  tm_map(removePunctuation) %>% # remove punctuation
  tm_map(stripWhitespace) %>% # collapse repeated spaces
  tm_map(content_transformer(tolower)) %>% # convert to lowercase
  tm_map(removeWords, stopwords("spanish")) %>% # remove Spanish stopwords
  tm_map(removeWords,
         c("entonces", "aquí", "el",
           "de", "en", "que",
           "por", "los", "las", "para",
           "una", "con", "este")) # remove my own stopwords
  • Now that the text is clean, it is time to shape the data for plotting
  • We build a term-document matrix that counts how often each word appears, then convert it to a regular matrix:
dtm = TermDocumentMatrix(x)
matriz = as.matrix(dtm)
Matrix generated with as.matrix()
  • Sum the rows of the matrix to create a named vector called palabras, which holds each word and its frequency, sorted from most to least frequent
palabras = sort(rowSums(matriz), decreasing = TRUE)
palabras with words and frequency
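To see what rowSums and sort are doing here, a toy matrix helps. The terms and counts below are made up; only the shape (terms in rows, documents in columns) matches what as.matrix() returns for a term-document matrix.

```r
# Toy term-document matrix: rows are terms, columns are caption lines
matriz = matrix(
  c(1, 0, 2,
    0, 1, 0,
    1, 1, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(Terms = c("salud", "vacunas", "pandemia"),
                  Docs = c("1", "2", "3"))
)

# Total count per term across all documents, most frequent first
palabras = sort(rowSums(matriz), decreasing = TRUE)
print(palabras)
```

The result is a named numeric vector: names(palabras) gives the words and the values give their counts, which is what the data frame in the next step uses.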
  • Finally, we create a data frame with the words and add a column with the name of the candidate

x_df = data.frame(
  row.names = NULL,
  palabra = names(palabras),
  freq = palabras,
  candidato = "iniciales") # placeholder for the candidate's initials
Final dataframe generated

Now that we have the clean data frame, I’m going to combine all the steps into a function called nube. It takes the lines read from the .txt file and the candidate's initials (iniciales).

nube = function(x, iniciales) {
  # Build the corpus from the function argument x
  x = Corpus(VectorSource(x))
  x = x %>%
    tm_map(removeNumbers) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace) %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeWords, stopwords("spanish")) %>%
    tm_map(removeWords,
           c("entonces", "aquí", "el", "de", "en", "que",
             "por", "los", "las", "para", "una", "con", "este"))
  # Count how often each word appears, most frequent first
  dtm = TermDocumentMatrix(x)
  matriz = as.matrix(dtm)
  palabras = sort(rowSums(matriz), decreasing = TRUE)
  x_df = data.frame(
    row.names = NULL,
    palabra = names(palabras),
    freq = palabras,
    candidato = iniciales)
  return(x_df)
}

Creating a word cloud

With the function in place, we can create the word cloud. We just need to load the .txt file, clean the data with the nube function and draw the word cloud.

  • Use the data frame columns with the words and their frequencies. size sets the size of the words, and color sets the text colors: use 'random-dark' (or 'random-light')

vm = readLines("debate/vm_pandemia.txt", encoding = "UTF-8")
df_vm = nube(vm, "VM")
wordcloud2(data = df_vm[1:2], size = 0.8, color = 'random-dark')

And voilà, our word cloud is generated.

We can compare the candidates' word clouds topic by topic, which gives us an idea of what each candidate is proposing.
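The same comparison can also be read as a table. Here is a minimal base R sketch that stacks the per-candidate data frames and keeps the most frequent words for each one. The data is made up and only mimics the three columns that nube returns; the second candidate's initials "XY" and the helper `top_palabras` are hypothetical.

```r
# Toy data frames shaped like the output of nube(); the words, counts and
# the initials "XY" are invented for illustration
df_vm = data.frame(palabra = c("salud", "vacunas", "empleo"),
                   freq = c(5, 3, 1), candidato = "VM",
                   stringsAsFactors = FALSE)
df_xy = data.frame(palabra = c("empleo", "salud", "seguridad"),
                   freq = c(4, 2, 2), candidato = "XY",
                   stringsAsFactors = FALSE)

todos = rbind(df_vm, df_xy)

# Hypothetical helper: the n most frequent words for each candidate
top_palabras = function(df, n = 2) {
  partes = split(df, df$candidato)
  tops = lapply(partes, function(d) head(d[order(-d$freq), ], n))
  resultado = do.call(rbind, tops)
  rownames(resultado) = NULL
  resultado
}

resumen = top_palabras(todos)
print(resumen)
```

Words that show up near the top for one candidate but not the other are usually the ones that stand out when the clouds are placed side by side.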

You can see the comparison across the different topics and candidates on my RPubs page, at this link.

The complete code is available on my Git page here.

Ana Muñoz Maquera

Business Engineer, Data Scientist in HR & Datacamp Student. bit.ly/anamumaq_