Trump and Clinton Speeches, Step 1: Text Mining

This post was originally published at ODSC. Check out their curated blog posts!


Part 1: Obtaining Transcripts for Campaign Trail Speeches
The political season is long and arduous. As a former Ohioan, I dreaded every election year because it was punctuated with endless negative and inflammatory campaign ads. Now that I live in the Democratic stronghold state of Massachusetts, I am relieved that my TV and radio are mostly free from such informationally barren commercials. Instead, I try to focus on the candidates’ topics and statements outside of commercials. Besides, I always wonder who is swayed by these ads…but that is an analysis for another time!

As a data scientist with a passion for text mining, I am keenly interested in word choice. Over the course of the last year, Hillary Clinton and Donald Trump have spoken multiple times a day. These speeches provide an ample corpus for text mining. Of course, I am not alone in my interest in candidate word choice. The 24-hour news cycle spends considerable time on individual pithy candidate comments or out-of-context quotes. For example, Hillary Clinton’s comment calling Trump supporters “deplorables” was covered incessantly for weeks and, of course, Donald Trump has done some name calling too. Name calling is surely uncivil and probably not a good “look” for a candidate, but the issues of today go beyond such nonsense.

So I propose a quantitative and analytical approach, based on multiple speeches, to draw out the styles and topics of Hillary Clinton’s and Donald Trump’s speeches. The analysis should lead to a balanced understanding of the candidates, free from news anchor opinions. This blog series is broken up into four parts to illustrate common text mining techniques applied to Trump and Clinton speeches.

The blog sequence covers:

1. Obtaining Transcripts for Candidate Speeches
2. Organizing Speeches & Initial Metrics
3. Topic Modeling Visualizations
4. Comparing Trump & Clinton

Whether you are a Democrat or a Republican, I hope you enjoy the series and learn something along the way.

Finding Reliable Text
Surprisingly, it was difficult to get full transcripts of the stump speeches. I suspect the average American relies on news articles with commentary, live feeds and social media. Further, I didn’t want to rely on liberal or conservative websites, or on manual transcriptions that could be biased.

So I settled on YouTube’s closed captioning data from actual Clinton and Trump speeches. I have to assume Google’s transcribing software is not politically motivated, so any transcription errors should be unbiased. After some developer sleuthing, I found that the caption data is in an XML file.

To gather Trump and Clinton speeches, I started with the Right Side Broadcasting YouTube Channel. The channel uploads Trump rally speeches from around the country with consistent titles that include states and dates. Later I found another YouTube channel, RBC Broadcasting, that covers both candidates’ speeches. The blog series uses speeches from both channels.

The Right Side Broadcasting YouTube Channel page.

Follow these steps to get a single video’s captions. Using Chrome, navigate to a video link such as https://www.youtube.com/watch?v=uXiJ8gudUwo. Once there, right-click anywhere on the page and select “Inspect.”

This will open the Chrome developer console alongside the video. First, click “Network” to switch away from the HTML information. Next, type “timed” into the filter box. Lastly, click the “cc” icon on the video player to enable closed captioning.

The steps to identify a speech’s closed caption information.

If the video offers closed captioning, the developer panel will display a file starting with “timedtext” as shown below. Hover over the file name, right-click and choose “open link in a new tab.” Once opened, you can see that the XML contains each word along with second-by-second timing information. The URLs expire, so be sure to save or parse the XML immediately.
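Because the caption links expire, it can help to grab the XML and flatten it into plain text right away. Below is a minimal Python sketch of that step. It is not the code used for this series, and it rests on assumptions: the caption file is taken to have a <transcript> root holding <text start="..." dur="..."> elements, which may differ from what your video returns, so check the file in your browser first. The function names fetch_caption_xml, parse_captions and to_speech, and the sample captions themselves, are made up for illustration.

```python
import html
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical sample in the ASSUMED layout: a <transcript> root with
# <text start="..." dur="..."> children. Real caption files may differ.
SAMPLE_XML = """<transcript>
  <text start="0.5" dur="2.1">thank you very much</text>
  <text start="2.6" dur="3.0">it&amp;#39;s great to be in Ohio</text>
</transcript>"""


def fetch_caption_xml(url):
    """Download the caption XML right away, since the link expires."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_captions(xml_text):
    """Return a list of (start_seconds, duration_seconds, words) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.iter("text"):
        start = float(node.get("start", 0.0))
        dur = float(node.get("dur", 0.0))
        # Caption text is sometimes double-escaped (e.g. &amp;#39;), so
        # unescape once more after the XML parser has decoded it.
        words = html.unescape("".join(node.itertext()))
        words = words.replace("\n", " ").strip()
        if words:
            rows.append((start, dur, words))
    return rows


def to_speech(rows):
    """Join the caption fragments into one block of text for mining."""
    return " ".join(words for _, _, words in rows)


if __name__ == "__main__":
    # With a live link you would call fetch_caption_xml(url) instead.
    rows = parse_captions(SAMPLE_XML)
    print(to_speech(rows))
```

Keeping the start times next to the words means you can still slice a speech by time later, while to_speech gives the single string most text mining tools expect.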

 

Now that you have the data set, check out the rest of this post at ODSC!
