Topic Modeling of WSJ’s YouTube Advertising Story

The tweet that led me to analyze this issue


While browsing Twitter today, I saw a couple of people I follow tweet about a controversy involving a Wall Street Journal journalist. YouTuber h3h3productions released a video claiming that WSJ used fake screenshots for one of its stories. The WSJ story claimed that YouTube continued to run advertisements on racist videos, even though doing so is against the website’s policy. However, h3h3productions claimed that the journalist used doctored screenshots as evidence for the story. After seeing h3h3productions’ video, I decided to conduct topic modeling on tweets using the keyword “WSJ.”

To conduct topic modeling, I used an R script that my professor provided. First, I installed all of the necessary R packages and added the keys and tokens from a Twitter app I had created to the code. Then I pulled 1,000 tweets containing “wsj” and ran the code that removes stop words. Next, I ran code that grouped words frequently found together into five distinct topics. This let me create files for Gephi and Tableau, the two visualization programs that I used.
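To give a sense of what the stop-word step does, here is a minimal sketch in Python rather than R. It is not my professor's script: the stop-word list, the function names, and the sample tweets are all made up for illustration, and it only covers cleaning and counting words, not the step that assigns words to five topics.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "on", "of", "to", "and", "for", "that", "rt"}


def tokenize(tweet):
    """Lowercase a tweet, drop URLs, and return its non-stop-word tokens."""
    text = re.sub(r"https?://\S+", " ", tweet.lower())
    words = re.findall(r"[a-z0-9#@']+", text)
    return [w for w in words if w not in STOP_WORDS]


def term_frequencies(tweets):
    """Count how often each remaining term appears across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts


# Made-up example tweets, not taken from the 1,000 tweets I collected.
sample_tweets = [
    "RT @example: the WSJ story on YouTube ads https://example.com/x",
    "Watch the video and share the video",
]

frequencies = term_frequencies(sample_tweets)
print(frequencies.most_common(3))
```

Once the filler words are gone, the terms that are left are the ones a topic model can group together.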

Topics and frequently used terms

The topics and frequently used terms, as determined by R, imported into an Excel spreadsheet

Gephi is a visualization program that lets the user see which nodes in a network are connected. The visualization I produced connected Twitter users who used the keyword “wsj.” I color-coded the nodes by modularity, which means that each node is colored according to the community that Gephi detected for it. I also sized the nodes and node labels by degree, which means that bigger nodes and labels have more connections to other nodes. The edges, which are the lines connecting the nodes, are colored according to the topic that R assigned. WSJ and h3h3productions have the highest degree of all the nodes.
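Degree is simple enough to compute by hand. The sketch below is only an illustration of the idea, assuming an undirected network where repeated edges count once; the account names other than WSJ and h3h3productions are placeholders, the edges are made up, and Gephi's own degree statistic may be calculated differently (for example, on a directed graph).

```python
from collections import defaultdict


def degrees(edges):
    """Return {node: degree} for an undirected edge list.

    Duplicate edges and self-loops are ignored in this sketch.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {node: len(adjacent) for node, adjacent in neighbors.items()}


# Made-up edges for illustration, not my actual Gephi data.
edges = [
    ("user_a", "WSJ"),
    ("user_b", "WSJ"),
    ("user_b", "h3h3productions"),
    ("user_c", "WSJ"),
    ("user_a", "WSJ"),  # repeated edge, counted once
]

deg = degrees(edges)
# Sort by degree (highest first), then by name to break ties.
ranked = sorted(deg.items(), key=lambda item: (-item[1], item[0]))
print(ranked)
```

The node at the top of the ranked list is the one that would be drawn largest in a degree-sized layout.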

Gephi visualization


I also used Tableau, another visualization program, to analyze the tweets. The first visualization I created was a timeline that showed how frequently each topic was tweeted about over time. This timeline shows that most of the tweets belonged to the topic that I called “Information Spread,” followed by “Anti-WSJ.” The tweets belonging to “Information Spread” included words that are often used to get people to share a video or other information, and tweets belonging to “Anti-WSJ” included words that are often used when people are unhappy with the way information is being presented. I also created a box visualization in Tableau, which shows how many tweets fall under each topic. Again, most of the tweets belong to “Information Spread” and “Anti-WSJ.”
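Both Tableau views boil down to counting tweets per topic, either overall or per time period. Here is a rough sketch of that counting, assuming each tweet has already been labeled with a time bucket and a topic; the records, the "day 1"/"day 2" labels, and the "Other" topic are placeholders I made up, not my real data.

```python
from collections import Counter

# Made-up (time bucket, topic) records for illustration only.
records = [
    ("day 1", "Information Spread"),
    ("day 1", "Information Spread"),
    ("day 1", "Anti-WSJ"),
    ("day 2", "Information Spread"),
    ("day 2", "Anti-WSJ"),
    ("day 2", "Other"),
]


def topic_totals(rows):
    """Count tweets per topic (the numbers behind the box view)."""
    return Counter(topic for _, topic in rows)


def topic_timeline(rows):
    """Count tweets per (time bucket, topic) pair (the numbers behind the timeline)."""
    return Counter(rows)


totals = topic_totals(records)
timeline = topic_timeline(records)
print(totals.most_common())
print(timeline[("day 1", "Information Spread")])
```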


Based on these visualizations, it seems likely that people are tagging WSJ and h3h3productions to draw attention to the controversy. The Gephi visualization shows that WSJ and h3h3productions have the highest degree in the network, which means that they are connected to the most other accounts. In addition, according to the two Tableau visualizations, most of the tweets belong to the topic called “Information Spread,” which contains words that are often used to urge people to share something on the internet, such as h3h3productions’ video.

