Easy word cloud with R — Peru’s presidential debate version

Peru was in political crisis during the last months of 2020. Because of this, the 2021 presidential election took on greater importance for the general population.

I decided to show, in a simple way, the proposals the candidates put forward during the first presidential debate and to share the result on Twitter. Word clouds were an easy way to compare the candidates’ proposals.

Let’s start!

The easiest and fastest way to obtain the transcript of the debate was to use the YouTube captions, which were generated automatically after the live broadcast. Automatic captions are not perfect, but because each candidate speaks in turn without interruption, the errors are minimal.

We can now save the YouTube captions, with their timestamps, in a .txt file. Because the debate is divided into blocks and the candidates present in a fixed order, it is easy to split the transcript by hand. The result is our dataset, with one text for each candidate and each debate segment.

Now, Let’s Code

The libraries that we will use for this project are:

  • tidyverse, because there is no R project without it
  • tm (text mining), to perform operations on the texts
  • SnowballC, for stemming in Spanish or English (the stopword lists themselves ship with tm)
  • wordcloud2, to draw the word cloud. You could also use wordcloud, but wordcloud2 produces a fancier graph.
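As a minimal sketch, loading these packages could look like the following. It assumes all four are already installed (for example with install.packages()); it is not the original code from the post.

```r
# Load the packages listed above; each must already be installed.
library(tidyverse)   # general data manipulation
library(tm)          # corpus handling and text cleaning
library(SnowballC)   # Snowball stemmers
library(wordcloud2)  # interactive word clouds
```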

Read the .txt file with readLines, using UTF-8 encoding because the debate is in Spanish and the transcript contains words with special characters.
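A minimal sketch of this step is shown here. The temporary file, the variable names archivo and texto, and the two sample lines (including the timestamp format) are hypothetical; they exist only so the example runs on its own, and the real input would be the transcript saved from YouTube.

```r
# Hypothetical stand-in for a transcript file. "\u00f3" is the letter ó,
# written as an escape so the example does not depend on the script's encoding.
archivo <- tempfile(fileext = ".txt")
writeLines(
  c("0:01 educaci\u00f3n y salud para todos",
    "0:05 m\u00e1s inversi\u00f3n en educaci\u00f3n"),
  archivo, useBytes = TRUE
)

# Read the file line by line, declaring the encoding as UTF-8 so that
# accented characters are marked correctly.
texto <- readLines(archivo, encoding = "UTF-8")

length(texto)  # number of lines read
```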

The loaded files have the following form:

Cleaning .txt files

Transform the character vector into a corpus (a list-like structure from tm) to make the text easier to clean:

  • Remove numbers, punctuation, extra spaces and symbols from the corpus
  • Change the words to lowercase
  • Remove the Spanish stopwords
  • Remove some additional words that are not in the stopword list
  • Now that the text is clean, it’s time to shape the data for the graphics
  • Create a table with the frequency of each word and transform it into a matrix
  • Sum the rows of the matrix to create a vector called palabras that holds the frequency of each word
  • Finally, create a data frame from palabras and add a column with the name of the candidate
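The steps above can be sketched as follows. This is an illustration under stated assumptions, not the original code: the sample lines, the extra word "gracias", the initials "XX", the use of VCorpus, and every name other than palabras are hypothetical.

```r
library(tm)

# Hypothetical sample lines standing in for one candidate's transcript.
texto <- c("0:01 La salud y la educaci\u00f3n, para todos",
           "0:05 M\u00e1s inversi\u00f3n en salud, gracias")

# Wrap the character vector in a corpus so tm's cleaning functions apply.
corpus <- VCorpus(VectorSource(texto))

corpus <- tm_map(corpus, removeNumbers)                  # digits, e.g. timestamps
corpus <- tm_map(corpus, removePunctuation)              # punctuation and symbols
corpus <- tm_map(corpus, content_transformer(tolower))   # lowercase
corpus <- tm_map(corpus, removeWords, stopwords("spanish"))
corpus <- tm_map(corpus, removeWords, c("gracias"))      # extra words, chosen by hand
corpus <- tm_map(corpus, stripWhitespace)                # collapse leftover spaces

# Term-document matrix: one row per word, one column per line of text.
matriz <- as.matrix(TermDocumentMatrix(corpus))

# Sum each row to get the total frequency of every word.
palabras <- sort(rowSums(matriz), decreasing = TRUE)

# Data frame with the word, its frequency and the candidate's initials.
df <- data.frame(word = names(palabras), freq = as.numeric(palabras),
                 candidato = "XX", row.names = NULL)
```

Lowercasing runs before stopword removal because the stopword list is lowercase, so a capitalised "La" would otherwise survive.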

Now that we have the data frame and the clean data, I’m going to combine all the steps into one function, which I am calling nube. It takes the .txt file and the initials of the candidate, passed as iniciales.
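One way such a function could look is sketched below. The names nube and iniciales come from the text; the argument names archivo and extra, the output column names, and the sample file are assumptions made for illustration.

```r
library(tm)

# nube(): read one transcript file, clean it and return a word-frequency
# data frame tagged with the candidate's initials.
# `extra` (words to drop beyond the stopword list) is a hypothetical argument.
nube <- function(archivo, iniciales, extra = character(0)) {
  texto  <- readLines(archivo, encoding = "UTF-8")
  corpus <- VCorpus(VectorSource(texto))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, c(stopwords("spanish"), extra))
  corpus <- tm_map(corpus, stripWhitespace)

  # Total frequency of each word across all lines, most frequent first.
  palabras <- sort(rowSums(as.matrix(TermDocumentMatrix(corpus))),
                   decreasing = TRUE)
  data.frame(word = names(palabras), freq = as.numeric(palabras),
             candidato = iniciales, row.names = NULL)
}

# Usage with a throwaway file and made-up initials.
archivo <- tempfile(fileext = ".txt")
writeLines(c("0:01 salud y trabajo", "0:04 trabajo para el pais"), archivo)
df <- nube(archivo, "XX")
```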

Creating a word cloud

Now that we have the data frame, we can create the word cloud. We just need to load the .txt file, clean the data with the nube function and draw the word cloud.

  • Pass the data frame with the word and frequency columns; size sets the size of the words, and color takes ‘random-dark’ (or ‘random-light’) for the text colors
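A minimal sketch of the call, assuming a data frame whose first two columns hold the words and their frequencies. The words and counts below are made up for illustration and are not results from the debate.

```r
library(wordcloud2)

# Hypothetical word/frequency data frame with illustrative values only.
df <- data.frame(word = c("salud", "trabajo", "seguridad", "economia"),
                 freq = c(12, 9, 7, 4))

# size scales the words; color accepts "random-dark" or "random-light".
wc <- wordcloud2(data = df, size = 0.8, color = "random-dark")

wc  # in an interactive session, printing the widget renders the cloud
```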

And voilà, our word cloud is generated.

We can compare the candidates’ word clouds by proposed topic, which gives us an idea of what each candidate is proposing.

You can see the comparison across the different topics and candidates on my RPubs page, at this link.

I have left the complete code on my git page here

Business Engineer, Data Analyst, Business Hacker & Datacamp Student. bit.ly/anamumaq_