Christoph Spörlein

The large-scale spread of online communication (e.g., in the form of blogs, comment threads or text messenger services) has generated a plethora of new data sources that can provide valuable and relevant insights into social science research questions. This short tutorial introduces the basics of mining and analyzing social media data using R.


Prerequisites

Before we can actually analyze anything related to our research questions, we will need to have a couple of programs installed. This tutorial relies exclusively on R, and if you are unfamiliar with it, don’t worry: R is easy to learn, it is free, and it comes with a huge ecosystem of packages, so the time you invest pays off well beyond this tutorial.

So first things first, get R and RStudio.

Here is my short R Tutorial in case you need to freshen up on your R skills.

Next, you will need accounts for all social media services you plan to use (e.g., YouTube, Twitter, reddit).

In order to access social media data, you will also need access to the APIs (application programming interfaces) of the various social media sites. These APIs can be accessed directly, but relying on packages to do this from inside R is typically more convenient. Still, you will need to register your “project” and get the necessary credentials that allow you to interface with the services. In the following, I will give you a step-by-step overview for YouTube and Twitter. Luckily, you don’t need special credentials to access data from Reddit.

YouTube

  1. Go to this page and log in.
  2. Navigate to the APIs Console.
  3. Create a new project and click on it when creation is complete.
  4. Click on APIs & Services
  5. Click on + activate APIs and Services
  6. Search for YouTube Data API v3 and activate it.
  7. Click on Credentials or Create credentials.
  8. Click on “Create API key” and save the key somewhere.

Twitter

  1. Go to this page and log in.
  2. Click on “Create an app” (depending on your specific registration you may need to add a valid mobile phone number).
  3. Apply for developer access and fill out all the information requested (you might want to save what you filled in in a separate document).
  4. Confirm your Developer Account and wait for it to be reviewed (which took 7 hours in my case).
  5. Click on the link in the application confirmation email.
  6. Click on your app name and then select “Apps” and “Create App”.
  7. Fill out name, description and website (can be anything). In the field “Tell us how the app will be used”, you could paste the information you gave under point 3 above. Click on create app.
  8. Click on the “Keys and Tokens” tab and save the API key and the API secret key.
  9. Now also generate an access token and save both the token and the token secret.

Setting up R

Installing and loading required packages

Social media packages

In general, you only need to install packages when you (re-)install R, but you will need to load them every time you want to use them. In order to make your scripts run with as little user input as possible, it can be useful to have R install packages as needed and subsequently load them.
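For instance, here is a minimal sketch of that install-if-missing pattern (the package names are just the ones used in this tutorial):

# install each package only if it is missing, then load it
pkgs <- c("vosonSML", "tidyverse")
for (p in pkgs) {
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
  library(p, character.only = TRUE)
}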

vosonSML is a package that provides access to YouTube, Twitter and Reddit.

install.packages(c("vosonSML"))
library(vosonSML)

The tuber package is an alternative to vosonSML when trying to access YouTube data and provides additional functionality regarding the acquisition of video and channel meta data. I will not demonstrate its use but feel free to check out this vignette for additional information.

#install.packages(c("tuber"))
#library(tuber)

Most packages come with extensive help files which you can access by typing:

?vosonSML

or to get help on specific functions, just type ? and the function name:

?Authenticate
?Collect

Utility packages

In addition to packages that help you get data, you will also need packages that help you work with the data. The tidyverse package gives you easy access to powerful data handling tools (e.g., dplyr and stringr) as well as ggplot2 - the package to create amazing visualizations of your results. So go ahead, install and load the tidyverse package:

install.packages(c("tidyverse"))
library(tidyverse)

Reproducible research and open science

You might also think about getting a GitHub account to save and share your work more easily. Here is a great tutorial helping you to set up a GitHub workflow within RStudio. Obviously, you can also go the more tedious route of manually uploading your files to GitHub at regular intervals. Whatever floats your boat! Sharing code and data becomes increasingly more important and the sooner you familiarize yourself with this aspect of doing science, the better for you down the road.


Downloading data

Accessing and downloading data is fairly simple and nothing like scraping websites “manually” - given you have the necessary credentials. In case you want to learn how to scrape content from websites directly, you can find a short case study here.

As a cautionary note, you should be aware that services like YouTube or Twitter may change their rules of data access at any time! You may have a great research question and plans to collect data, but access can be revoked at a moment’s notice and you are left with nothing. That is one of the main drawbacks of relying on free access to interesting data collected by companies.

Getting YouTube data

Let’s assume we installed all necessary packages in an earlier session and are starting with a clean slate. Anytime that is the case, we want to start by loading packages. Almost by default now, I always start by loading the tidyverse package. To get YouTube data, we need the vosonSML package.

library(tidyverse)
library(vosonSML)

Next, we need to tell the YouTube API that we are authorized to access and download their data. For that reason, we retrieved an API key earlier (see Section Prerequisites -> YouTube). For convenience, store that key in an object and tell YouTube your intent to access data:

apikey <- "xxxxxxxx"
key <- Authenticate("youtube", 
                    apiKey=apikey)

When running this for the first time, you may be redirected to a site in your browser asking you to confirm your identity. Notice that there may be a “.httr-oauth” file in your working directory now. This file stores your credentials, letting you skip the “identify yourself via the browser” step in future data requests. Note also that when you use GitHub, this file is automatically added to the .gitignore file and thus not uploaded to GitHub, where others would have access to your credentials. Pretty neat!

For this example, we want to collect the comments from this video. Obviously, feel free to keep listening to it while you follow this tutorial! There are a couple of things you should note when working with YouTube videos: first, video URLs have a regular structure where https://www.youtube.com/watch?v= is followed by a unique video ID (G1IbRujko-A); second, this video ID is all we need to download the comment data.

# You can either use this function to extract the video ID
videos <- GetYoutubeVideoIDs(c("https://www.youtube.com/watch?v=G1IbRujko-A"))

# or supply it "manually": 
videos <- c("G1IbRujko-A")

# Either way works. If you want to download comments for multiple videos, just add their IDs to the vector like so: 
# videos <- c("ID1","ID2","etc")

# This will use the key and download the data
yt_data <- key %>%
  Collect(videos)

What did we get?

str(yt_data)
## Classes 'datasource', 'youtube' and 'data.frame': 7617 obs. of  9 variables:
##  $ Comment           : chr  "Put this at 1.5x and thank me later lads" "Saxophone teacher: What inspired you to learn this instrument. \n\nMe:" "3:51:19 Best part" "So many dislikes holyshit" ...
##  $ User              : chr  "<U+05E2><U+05DE><U+05D9><U+05EA> <U+05E2><U+05E0><U+05E3>" "KillerMachine30_YT" "Tyler Phommachak" "Lavenyus Manufacturing" ...
##  $ ReplyCount        : chr  "0" "0" "0" "0" ...
##  $ LikeCount         : chr  "0" "1" "0" "0" ...
##  $ PublishTime       : chr  "2019-04-17T05:42:56.000Z" "2019-04-17T03:39:31.000Z" "2019-04-17T03:12:54.000Z" "2019-04-17T01:46:36.000Z" ...
##  $ CommentId         : chr  "UgxraIpGf3Fq0CJITR94AaABAg" "Ugz26jSTzSzMwFbHt_l4AaABAg" "Ugy8tUjOmHvI21fY3rB4AaABAg" "UgzK2fn5UbQaJRfiR914AaABAg" ...
##  $ ParentID          : chr  "None" "None" "None" "None" ...
##  $ ReplyToAnotherUser: chr  "FALSE" "FALSE" "FALSE" "FALSE" ...
##  $ VideoID           : chr  "G1IbRujko-A" "G1IbRujko-A" "G1IbRujko-A" "G1IbRujko-A" ...
names(yt_data)
## [1] "Comment"            "User"               "ReplyCount"        
## [4] "LikeCount"          "PublishTime"        "CommentId"         
## [7] "ParentID"           "ReplyToAnotherUser" "VideoID"
nrow(yt_data)
## [1] 7617

We collected the comment text, who made the comment, how many replies it got, how many likes, when it was published, a unique comment ID, the ID of its parent comment (if applicable), whether it was a reply to another user, and to which video the comment belongs. So when you collect comments from multiple videos, this last variable enables you to identify where they came from or match them to specific videos.

YouTube comments have a very specific structure: there are parent comments, which are direct replies to a video, and there are child comments, which are replies to parent comments or to other child comments. So when ParentID=="None", the comment is a parent comment; when it has a different value, it is a child comment and that value refers to the parent comment to which it replies. The "ReplyToAnotherUser" variable records essentially the same information, but rather than recording the comment ID (see the CommentId variable), it records the user name.
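As a quick sketch (using the yt_data columns shown above), this is how you could split the data into parent and child comments:

# parent comments reply directly to the video, child comments reply within a thread
parents  <- yt_data %>% filter(ParentID == "None")
children <- yt_data %>% filter(ParentID != "None")
nrow(parents)
nrow(children)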

So how extensive is the comment network in this example? How frequently do users engage with what others wrote?

ggplot(yt_data, aes(x=as.numeric(ReplyCount))) +
  geom_bar() + 
  theme_minimal() + 
  labs(x="Reply Count", 
       y="Count", 
       title="Interactions in YouTube video comment section", 
       subtitle = "Source: www.youtube.com/watch?v=G1IbRujko-A")

Apparently, the vast majority of parent comments receive no replies at all, whereas only 61 out of 5629 parent comments - roughly 1.08 percent - receive more than a handful of replies. Then again, studying commenting behavior or interaction in comment threads using this particular video may not be the smartest choice of data to begin with…


Getting Twitter data

Getting Twitter data works very similarly. You first feed in your credentials:

appname <- "xxxxx"
myapikey <- "xxxxxxx"
myapisecret <- "xxxxx"
myaccesstoken <- "xxxxxx"
myaccesstokensecret <- "xxxx"

Then you authenticate your access and download data by specifying a search term:

tw_data <- Authenticate("twitter", 
                        appname=appname, 
                        apiKey=myapikey, 
                        apiSecret=myapisecret, 
                        accessToken=myaccesstoken, 
                        accessTokenSecret=myaccesstokensecret) %>%
  Collect(searchTerm="#brexit", numTweets=20)

There are a number of additional options you need to be aware of (a sketch using them follows this list):

  • searchType: can be “recent”, “mixed” or “popular”. Default is “recent”
  • includeRetweets: Default is TRUE
  • retryOnRateLimit: Default is FALSE.
  • geocode: “latitude, longitude, radius”, e.g., geocode= “37.78,-122.4, 1mi”
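For illustration, here is a hedged sketch of how these options slot into the Collect() call (the values are purely illustrative):

tw_data_recent <- Authenticate("twitter", 
                               appname=appname, 
                               apiKey=myapikey, 
                               apiSecret=myapisecret, 
                               accessToken=myaccesstoken, 
                               accessTokenSecret=myaccesstokensecret) %>%
  Collect(searchTerm="#brexit", 
          numTweets=50, 
          searchType="recent", 
          includeRetweets=FALSE, 
          retryOnRateLimit=TRUE)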

So, what did we get from this small data collection?

str(tw_data[1:14])
## Classes 'tbl_df', 'tbl', 'datasource', 'twitter' and 'data.frame':   16 obs. of  14 variables:
##  $ user_id             : chr  "95285344" "497698757" "2509738688" "964990428134207488" ...
##  $ status_id           : chr  "1118425742059167744" "1118425731543990272" "1118425719451918336" "1118425716763373568" ...
##  $ created_at          : POSIXct, format: "2019-04-17 08:07:30" "2019-04-17 08:07:28" ...
##  $ screen_name         : chr  "PhilDuck" "angewick" "martinnewby_1" "edbutt78" ...
##  $ text                : chr  "The new #Brexit Party, like everything #Farage touches, is all about division\nhttps://t.co/MggExzl2kM" "1/ A thread about the Government's response to the petition asking for a Public Inquiry into illegality in the "| __truncated__ "Revealed: The DUP arranged after the EU ref for NI public bodies to discuss 'investment opportunities' with Ric"| __truncated__ "@Jo2901F @UKIP @GerardBattenMEP @Nigel_Farage bolted when we won the referendum and wanted UKIP to fail, now th"| __truncated__ ...
##  $ source              : chr  "Twitter for Android" "Twitter for iPhone" "Twitter for Android" "Twitter for iPhone" ...
##  $ display_text_width  : num  118 140 140 279 113 140 144 140 140 140 ...
##  $ reply_to_status_id  : chr  NA NA NA "1118423313393569793" ...
##  $ reply_to_user_id    : chr  NA NA NA "321436614" ...
##  $ reply_to_screen_name: chr  NA NA NA "Jo2901F" ...
##  $ is_quote            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ is_retweet          : logi  TRUE TRUE TRUE FALSE TRUE TRUE ...
##  $ favorite_count      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ retweet_count       : int  155 170 508 0 100 1 1047 170 20 808 ...
names(tw_data)
##  [1] "user_id"                 "status_id"              
##  [3] "created_at"              "screen_name"            
##  [5] "text"                    "source"                 
##  [7] "display_text_width"      "reply_to_status_id"     
##  [9] "reply_to_user_id"        "reply_to_screen_name"   
## [11] "is_quote"                "is_retweet"             
## [13] "favorite_count"          "retweet_count"          
## [15] "hashtags"                "symbols"                
## [17] "urls_url"                "urls_t.co"              
## [19] "urls_expanded_url"       "media_url"              
## [21] "media_t.co"              "media_expanded_url"     
## [23] "media_type"              "ext_media_url"          
## [25] "ext_media_t.co"          "ext_media_expanded_url" 
## [27] "ext_media_type"          "mentions_user_id"       
## [29] "mentions_screen_name"    "lang"                   
## [31] "quoted_status_id"        "quoted_text"            
## [33] "quoted_created_at"       "quoted_source"          
## [35] "quoted_favorite_count"   "quoted_retweet_count"   
## [37] "quoted_user_id"          "quoted_screen_name"     
## [39] "quoted_name"             "quoted_followers_count" 
## [41] "quoted_friends_count"    "quoted_statuses_count"  
## [43] "quoted_location"         "quoted_description"     
## [45] "quoted_verified"         "retweet_status_id"      
## [47] "retweet_text"            "retweet_created_at"     
## [49] "retweet_source"          "retweet_favorite_count" 
## [51] "retweet_retweet_count"   "retweet_user_id"        
## [53] "retweet_screen_name"     "retweet_name"           
## [55] "retweet_followers_count" "retweet_friends_count"  
## [57] "retweet_statuses_count"  "retweet_location"       
## [59] "retweet_description"     "retweet_verified"       
## [61] "place_url"               "place_name"             
## [63] "place_full_name"         "place_type"             
## [65] "country"                 "country_code"           
## [67] "geo_coords"              "coords_coords"          
## [69] "bbox_coords"             "status_url"             
## [71] "name"                    "location"               
## [73] "description"             "url"                    
## [75] "protected"               "followers_count"        
## [77] "friends_count"           "listed_count"           
## [79] "statuses_count"          "favourites_count"       
## [81] "account_created_at"      "verified"               
## [83] "profile_url"             "profile_expanded_url"   
## [85] "account_lang"            "profile_banner_url"     
## [87] "profile_background_url"  "profile_image_url"
tw_data$text[1:3]
## [1] "The new #Brexit Party, like everything #Farage touches, is all about division\nhttps://t.co/MggExzl2kM"                                                                                                                                                                                  
## [2] "1/ A thread about the Government's response to the petition asking for a Public Inquiry into illegality in the #Brexit referendum, because this sort of self-serving and bilious dismissal of the public needs calling out sometimes. And I'm cross.\nhttps://t.co/RoyFiRmxPH"           
## [3] "Revealed: The DUP arranged after the EU ref for NI public bodies to discuss 'investment opportunities' with Richard Cook - the Tory behind a hidden pro-union business group that donated £435,000 to the DUP during its Brexit campaign.\n\nhttps://t.co/xUgtwP8Lqr @irish_news #Brexit"

Overall, you will get 88 variables with information on the user, the content and how the tweet is linked to other content.

vosonSML has built-in functions to create user networks and, in the case of Twitter data, also “semantic” networks:

actor_nw <- tw_data %>%
  Create("actor")
## Generating twitter actor network...
## Done.
plot(actor_nw$graph)


semantic_nw <- tw_data %>%
  Create("semantic", 
         termFreq=20, 
         removeTermsOrHashtags=c("#brexit"))
## Generating twitter semantic network...
## Done.
plot(semantic_nw$graph)

“brexit” and “campaign” seem to be highly important words in the 20 tweets we collected, as exemplified by their central position and the large number of arrows (or edges) directed at them. Later in this tutorial, you will learn more sophisticated methods to a) plot networks and b) create semantic networks, so don’t be put off by the look of this example; publication-quality visualizations of networks are right around the corner.


Getting Reddit data

Since this is also done with the same package, getting Reddit data involves the same routine: authenticate and then collect.

red_auth <- Authenticate("reddit")
red_data <- red_auth %>%
  Collect("https://www.reddit.com/r/de/comments/bd8i60/tetraeder/")
str(red_data)
## Classes 'datasource', 'reddit' and 'data.frame': 25 obs. of  19 variables:
##  $ id              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ structure       : chr  "1" "1_1" "1_1_1" "1_1_1_1" ...
##  $ post_date       : chr  "14-04-19" "14-04-19" "14-04-19" "14-04-19" ...
##  $ comm_date       : chr  "15-04-19" "15-04-19" "15-04-19" "15-04-19" ...
##  $ num_comments    : num  25 25 25 25 25 25 25 25 25 25 ...
##  $ subreddit       : chr  "de" "de" "de" "de" ...
##  $ upvote_prop     : num  0.94 0.94 0.94 0.94 0.94 0.94 0.94 0.94 0.94 0.94 ...
##  $ post_score      : num  458 458 458 458 458 458 458 458 458 458 ...
##  $ author          : chr  "lebadger" "lebadger" "lebadger" "lebadger" ...
##  $ user            : chr  "noodleboiiii" "chinupf" "noodleboiiii" "chinupf" ...
##  $ comment_score   : num  34 15 19 5 5 5 1 3 6 5 ...
##  $ controversiality: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ comment         : chr  "Wenn es 4 Uhr morgens ist und du in 3h zur Chemievorlesung musst, aber dann laut ueber sowas lachst, statt zu s"| __truncated__ "Und, wie war die Vorlesung? Ü" "Hasse mich selbst, da ich nicht geschlafen habe; musste gerade aber nochmal über das Meme lachen also war es okay." "In der Uni lernt man ja auch was fürs Leben" ...
##  $ title           : chr  "Tetraeder" "Tetraeder" "Tetraeder" "Tetraeder" ...
##  $ post_text       : chr  "" "" "" "" ...
##  $ link            : chr  "https://i.redd.it/7t5htsyy2bs21.jpg" "https://i.redd.it/7t5htsyy2bs21.jpg" "https://i.redd.it/7t5htsyy2bs21.jpg" "https://i.redd.it/7t5htsyy2bs21.jpg" ...
##  $ domain          : chr  "i.redd.it" "i.redd.it" "i.redd.it" "i.redd.it" ...
##  $ URL             : chr  "https://www.reddit.com/r/de/comments/bd8i60/tetraeder/?ref=search_posts" "https://www.reddit.com/r/de/comments/bd8i60/tetraeder/?ref=search_posts" "https://www.reddit.com/r/de/comments/bd8i60/tetraeder/?ref=search_posts" "https://www.reddit.com/r/de/comments/bd8i60/tetraeder/?ref=search_posts" ...
##  $ thread_id       : chr  "bd8i60" "bd8i60" "bd8i60" "bd8i60" ...
names(red_data)
##  [1] "id"               "structure"        "post_date"       
##  [4] "comm_date"        "num_comments"     "subreddit"       
##  [7] "upvote_prop"      "post_score"       "author"          
## [10] "user"             "comment_score"    "controversiality"
## [13] "comment"          "title"            "post_text"       
## [16] "link"             "domain"           "URL"             
## [19] "thread_id"

Note that the structure of Reddit threads can be quite complex. Contrary to YouTube, where you have a parent comment with x child comments replying to it, redditors can directly reply to child comments, creating a complicated web of communication. The structure variable captures this in a stylized fashion:

red_data$structure
##  [1] "1"           "1_1"         "1_1_1"       "1_1_1_1"     "1_2"        
##  [6] "1_2_1"       "1_2_1_1"     "1_2_1_1_1"   "1_2_2"       "1_2_2_1"    
## [11] "1_2_2_1_1"   "1_2_2_1_1_1" "1_2_2_1_2"   "1_2_2_2"     "1_2_3"      
## [16] "2"           "3"           "4"           "5"           "6"          
## [21] "6_1"         "6_1_1"       "7"           "8"           "9"

In YouTube lingo, there are 9 parent comments (numbers 1 to 9) with varying degrees of comment activity. Comment 1 has two direct replies (1_1 and 1_2), while 1_2 has three replies of its own (1_2_1, 1_2_2 and 1_2_3).
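If you want to work with this nesting directly, here is a small sketch: the depth of each comment can be derived by counting the underscores in structure (str_count() comes from stringr, which is loaded with the tidyverse):

red_data %>%
  mutate(depth = str_count(structure, "_") + 1) %>%
  count(depth)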

As with YouTube and Twitter data, we can visualize the communication network easily with the Create() function:

network <- Create(red_data, 
                  type="actor")
## Generating reddit actor network...
## Done.
plot(network$graph)


Analysing social media data

One of the main advantages of social media data relates to the mode by which the data is generated. In contrast to classical survey research, the impetus to comment on something rests completely with the individual and is not based on input by the researcher in the form of a direct question. So you get information regarding a specific issue from individuals who feel they need to chime into the discussion using their own words. But that is also one of the aspects of social media data that makes working with it more complex and difficult compared to survey data: you need to work with texts which are more or less completely unstandardized compared to survey questions. That means you need to deal with misspellings, with emoticons, and with the complexity of written language when it comes to negations, metaphors, slang, irony, and so on. This short tutorial will by no means cover all of these aspects but rather focuses on providing you with the basics of content analysis. The decisions regarding whether and how you deal with these issues should be guided mainly by your research questions.

NLP (natural language processing) is a highly active field of research and who knows, maybe good approaches to dealing with some of the more difficult aspects of written language are right around the corner…

Data handling

For this part of the tutorial, we will use a collection of Trump’s 2016 campaign tweets.

In addition to our trusty tidyverse package, we will rely on two packages specialized in working with and analyzing textual data: the tidytext and quanteda packages. Both packages do essentially the same things but with different approaches. quanteda provides more extensive capabilities - especially with regard to describing text (here is a great cheatsheet). The underlying ideas of working with text data are, however, similar, so understanding the basic concepts is independent of the packages used here. I will switch between the two packages frequently because functions in one may be more intuitive to use than in the other package.

To be more transparent about which package a specific function comes from, I will start using packagename::command(), which tells R to use this package to execute the function (this can be important when two packages share the same command name) but, more importantly, tells you where to look for help in case you need more information. OK, let’s start by loading the packages and the data:

library(tidyverse)
library(tidytext)
library(quanteda)

# read in Trump tweets and drop some variables
dt_tweets <- read.csv2("Donald-Tweets!.csv", 
                       sep=",", 
                       header=TRUE, 
                       stringsAsFactors=FALSE) %>%
  select(-Tweet_Url, -twt_favourites_IS_THIS_LIKE_QUESTION_MARK, -X, -X.1, -Type)
str(dt_tweets)
## 'data.frame':    7375 obs. of  7 variables:
##  $ Date      : chr  "16-11-11" "16-11-11" "16-11-11" "16-11-11" ...
##  $ Time      : chr  "15:26:37" "13:33:35" "11:14:20" "2:19:44" ...
##  $ Tweet_Text: chr  "Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z" "Busy day planned in New York. Will soon be making some very important decisions on the people who will be runni"| __truncated__ "Love the fact that the small groups of protesters last night have passion for our great country. We will all co"| __truncated__ "Just had a very open and successful presidential election. Now professional protesters, incited by the media, a"| __truncated__ ...
##  $ Media_Type: chr  "photo" "" "" "" ...
##  $ Hashtags  : chr  "ThankAVet" "" "" "" ...
##  $ Tweet_Id  : chr  "7.97E+17" "7.97E+17" "7.97E+17" "7.97E+17" ...
##  $ Retweets  : int  41112 28654 50039 67010 36688 44655 225164 45492 17169 19710 ...

Corpus

Text documents can be stored together with meta information about the documents in a so-called corpus.

# convert the data set to a text corpus
tweet_corp <- quanteda::corpus(dt_tweets, 
                               text_field="Tweet_Text")

# quick summary
summary(tweet_corp, 
        n=2)

Several things you should note here: the first four variables in the summary have been added to the tweets. Text provides an ID for each tweet, Types refers to the number of distinct tokens, and Tokens refers to the total number of tokens in each document. Here, tokens are essentially words (and punctuation), so Types is the number of distinct words and Tokens the number of words. Sentences gives the number of sentences in each tweet. Let’s cross-reference this with the actual content of the tweets:

# look at the content of specific tweets
quanteda::texts(tweet_corp[2])
##                                                                                                                               text2 
## "Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government!"

Both tweets are composed of two sentences. Check!

text2 has 23 words and two punctuation symbols totaling 25 tokens. Check!

Because Types==24 for text2, there should be a duplicate word or punctuation sign… Aha, it’s “be”! Check!
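If you want to verify such counts programmatically, quanteda has counting helpers (a quick sketch; the results should match the summary above):

quanteda::ntoken(tweet_corp[2])  # should be 25 tokens for text2
quanteda::ntype(tweet_corp[2])   # should be 24 distinct types for text2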

There are a number of interesting things we could visualize now, but let’s just look at whether there is a relationship between the number of distinct tokens and the time a tweet was posted (with the assumption that tweets posted late in the evening or at night contain less elaborate text):

# let's make this simple and just extract the hours of a day as a variable
library(lubridate)

# a corpus is a list, so we need to address variables in lists with [[]]
# in other words, the variable Time is stored in the documents element of the corpus
quanteda::docvars(tweet_corp, "hr") <- hour(hms::as.hms(tweet_corp$documents[["Time"]]))

# the summary function per default only summarizes 100 rows
# hence, the n option specifying we want the summary for all documents in the corpus
sum.corpus <- summary(tweet_corp, n=ndoc(tweet_corp))
ggplot(sum.corpus, aes(x=hr, y=Types)) +
  geom_jitter() +
  geom_smooth() +
  theme_minimal() + 
  labs(x="Hour", 
       y="Number of unique types per tweet")

At first glance, there seems to be a bump in the number of types from early morning (6 to 7am) until noon. However, the differences in the number of types per tweet across the day are relatively small. One certainly notices periods of sleep in the lower frequency of tweeting.

Maybe tweeting becomes less “elaborate” later in the campaign when fatigue sets in?

# let's make new date variable R recognizes as such
quanteda::docvars(tweet_corp, "datum") <- ymd(tweet_corp$documents[["Date"]])

sum.corpus <- summary(tweet_corp, n=ndoc(tweet_corp))
ggplot(sum.corpus, aes(x=datum, y=Types)) +
  geom_point() +
  geom_smooth() +
  theme_minimal() + 
  labs(x="Date", 
       y="Number of unique types per tweet")

The trends don’t seem to be very systematic here.

You can also “tokenize” the tweets which means you cut up each tweet text into its components (words, sentences, characters). Typically, you want to tokenize by words:

tok_tweets <- quanteda::tokens(tweet_corp,
                     what="word",
                     remove_numbers=TRUE,
                     remove_punct=TRUE,
                     remove_symbols=TRUE,
                     remove_separators=TRUE,
                     remove_twitter=TRUE,
                     remove_url=TRUE,
                     ngrams=1)
head(tok_tweets,
     n=2)
## tokens from 2 documents.
## text1 :
##  [1] "Today"     "we"        "express"   "our"       "deepest"  
##  [6] "gratitude" "to"        "all"       "those"     "who"      
## [11] "have"      "served"    "in"        "our"       "armed"    
## [16] "forces"    "ThankAVet"
## 
## text2 :
##  [1] "Busy"       "day"        "planned"    "in"         "New"       
##  [6] "York"       "Will"       "soon"       "be"         "making"    
## [11] "some"       "very"       "important"  "decisions"  "on"        
## [16] "the"        "people"     "who"        "will"       "be"        
## [21] "running"    "our"        "government"

There are a lot of options used here, simply to demonstrate the types of questions you face when handling text data. Do you need numbers? Do you need symbols? Are URLs important to your questions? quanteda gives you the freedom to decide all that, but the decision is yours and should be guided by your research question. Try to remember the “ngrams=1” option; we will come back to it later on.

There are typically many words in text documents that do not convey a lot of information when considered as single entities (although they may be highly important for context!). These are called “stop words” and can be removed automatically by storing them in a dictionary (i.e., a list of words):

tok_tweets <- quanteda::tokens_remove(tok_tweets,
                                       c("These","are","my","stopwords", "in", "be"))
head(tok_tweets,
     n=2)
## tokens from 2 documents.
## text1 :
##  [1] "Today"     "we"        "express"   "our"       "deepest"  
##  [6] "gratitude" "to"        "all"       "those"     "who"      
## [11] "have"      "served"    "our"       "armed"     "forces"   
## [16] "ThankAVet"
## 
## text2 :
##  [1] "Busy"       "day"        "planned"    "New"        "York"      
##  [6] "Will"       "soon"       "making"     "some"       "very"      
## [11] "important"  "decisions"  "on"         "the"        "people"    
## [16] "who"        "will"       "running"    "our"        "government"

You can also supply preconstructed dictionaries taking care of stop words:

# stop words from the tidytext package
data("stop_words")
nrow(stop_words)
## [1] 1149
head(stop_words)
tok_tweets <- quanteda::tokens_remove(tok_tweets,
                                       stop_words$word)
head(tok_tweets,
     n=2)
## tokens from 2 documents.
## text1 :
## [1] "express"   "deepest"   "gratitude" "served"    "armed"     "forces"   
## [7] "ThankAVet"
## 
## text2 :
## [1] "Busy"       "day"        "planned"    "York"       "decisions" 
## [6] "people"     "running"    "government"

Wow, a collection of almost 1,200 stop words. Well, that’s amazing when you analyze English-language text. You will not be so fortunate when analyzing other languages. But you are not the first person to encounter this problem, and there are of course online resources that supply stop word collections for other languages. As a first start, you may want to check out the stopwords package, which covers many of the larger languages (e.g., Spanish, French, German, Chinese, Arabic, Swahili as well as Esperanto or Latin).
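As a quick sketch, pulling German stop words from that package would look something like this (assuming you have it installed):

# install.packages("stopwords")
head(stopwords::stopwords("de"))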

Another very important aspect of text data is stemming. Stemming reduces words to their common word stem. For example, rather than counting running, run, runs, ran, etc. as separate words, stemming replaces all of them with “run”. Thus, the primary purpose of stemming is to reduce the number of distinct words (or types) by reducing them to their common word stem.

tok_tweets <- quanteda::tokens_wordstem(tok_tweets)
head(tok_tweets,
     n=2)
## tokens from 2 documents.
## text1 :
## [1] "express"   "deepest"   "gratitud"  "serv"      "arm"       "forc"     
## [7] "ThankAVet"
## 
## text2 :
## [1] "Busi"   "day"    "plan"   "York"   "decis"  "peopl"  "run"    "govern"

The ordering of stemming and removing stop words obviously matters. If you don’t want it to matter, then you need to supply stemmed stop words.
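Here is a minimal sketch of that second route, stemming the stop word list itself before removal (using SnowballC, which we will also use below):

library(SnowballC)
stemmed_stops <- unique(wordStem(stop_words$word))
# tok_tweets <- quanteda::tokens_remove(tok_tweets, stemmed_stops)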

Working with a corpus can be more cumbersome because it requires more work to generate new variables or access list entries. Our tweets are also fairly simple, so we could just work with the tidytext package to achieve the same result within a framework you are (likely) more used to:

# Stemming requires additional help from the SnowballC package
library(SnowballC)

# read tibble
tweet_tidy <- as_tibble(dt_tweets) %>% 
  # tokenize the tweets
  tidytext::unnest_tokens(word, Tweet_Text) %>%
  # remove stop words
  anti_join(stop_words) %>%
  # and stem the words
  mutate(word=wordStem(word))

head(tweet_tidy)

Some differences are of note here: because we did not work with a corpus, every row in our data set now corresponds to a word. In addition, the only way to identify complete tweets is through the Tweet_Id, which is somewhat less convenient than having the complete tweet stored in a corpus. Moreover, tidytext has somewhat less functionality in that we cannot easily remove URLs or Twitter-specific characters (and by easily I mean by simply setting an option to TRUE). But it does have its advantages, especially when it comes to plotting aspects of the data or manipulating data, so don’t discount it yet.

Describing text data

Word frequencies

The easiest way to visualize the content of text is to look at word frequencies. Let’s stay with tidytext for the moment. Our tweet_tidy data set has removed punctuation and stop words from Trump’s tweets and stemmed the remaining words. Next, we simply count how often words show up and plot the 25 most frequently used words:

# we need to remove additional words from urls, etc. first
add_stopw <- as.data.frame(c("http","t.co","amp", "rt"), stringsAsFactors = FALSE)
colnames(add_stopw) <- "word"

tweet_count <- tweet_tidy %>%
  anti_join(add_stopw) %>%
  count(word, sort=T) %>%
  mutate(word=reorder(word,n))

ggplot(tweet_count[1:25,], aes(x=word, y=n)) +
  geom_col() + 
  coord_flip() +
  theme_minimal()

A lot of references to his Twitter handle and campaign. Another way to present this is through word clouds:

library(ggwordcloud)

ggplot(tweet_count %>% filter(n>100), aes(label=word, size=n)) +
       geom_text_wordcloud(eccentricity = 1) +
       scale_size_area(max_size = 15) +
       theme_minimal()

Let’s replicate this using quanteda and learn something about the “Document-Term-Matrix” (DTM) as we go along. A DTM completely restructures your data so that the columns represent every single word in your corpus and the rows represent every document (i.e., every tweet) in your corpus. The cells themselves record the number of times a word occurs in a given document. As you might expect, a DTM is typically sparse, meaning that there are many empty cells. So let’s use our corpus and, instead of tokenizing it, transform it into a DTM:

# note that quanteda refers to a DTM as a Document-FEATURE-Matrix
tweet_dtm <- quanteda::dfm(tweet_corp,
             remove_punct=TRUE,
             remove_symbols=TRUE,
             remove_separators=TRUE,
             remove_twitter=TRUE,
             remove_url=TRUE) %>%
  dfm_remove(pattern=c(stopwords("english"), 
                       unlist(c(stop_words[,1])),
                       unlist(c(add_stopw)))) %>%
  dfm_wordstem()

tweet_dtm
## Document-feature matrix of: 7,375 documents, 9,120 features (99.9% sparse).
head(tweet_dtm, n = 5, nf = 5)
## Document-feature matrix of: 5 documents, 5 features (80.0% sparse).
## 5 x 5 sparse Matrix of class "dfm"
##        features
## docs    express deepest gratitud serv arm
##   text1       1       1        1    1   1
##   text2       0       0        0    0   0
##   text3       0       0        0    0   0
##   text4       0       0        0    0   0
##   text5       0       0        0    0   0

Overall, there are 7375 tweets in this DTM and 9120 distinct words. As expected, the overwhelming majority of cells are empty (sparsity > 99%). Moreover, you can see that the features are ordered chronologically (i.e., the first columns come from the first tweet, “Today we express our deepest gratitude to all those who …”). These five stemmed words from the first tweet do not appear in any of the next four tweets.

In order to get the top 25 words, we can use the topfeatures() function and plot them subsequently:

top25 <- topfeatures(tweet_dtm, n=25)
data_top25 <- as_tibble(cbind(names(top25), as.vector(top25))) 
data_top25$V2 <- as.numeric(data_top25$V2)
data_top25 <- data_top25 %>%
  mutate(V1=reorder(V1,V2))

# gives plots identical to the tidytext versions above
ggplot(data_top25, aes(x=V1, y=V2)) +
  geom_col() + 
  coord_flip() +
  theme_minimal() +
  labs(x="Words", 
       y="Frequency")

textplot_wordcloud(tweet_dtm,
                    min_count = 100) 

In general, you will probably not use DTMs too often for typical social science research questions. They are often used in so-called “bag-of-words” analyses where you essentially throw all words into a bag, thus losing any information you had on context, collocation, etc. This is the basic principle of spam filters for email programs: collect a ton of emails, have someone classify them as spam or ham (i.e., not spam), convert the email corpus to a DTM and model how the counts of words differ between spam and ham by training a basic classifier (e.g., you can think of logistic regression models as a classifier). What you will likely find is that spam contains words like “sexy”, “money” and “win” more frequently than ham, thereby enabling you to differentiate between the two. That is typically not something we are in the business of doing. But of course you can easily think of research questions where DTMs and bag-of-words approaches might come in handy, such as “how does the choice of words differ between programs of different political parties?” or “how can we identify subtle cues of applicant rejection in telephone interviews?”.
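Just to make the bag-of-words idea concrete, here is a toy sketch with four made-up “emails” (not part of this tutorial’s data; with such a tiny example, glm() will warn about perfectly separated fitted probabilities):

toy_mails <- c("win money now with this sexy offer",
               "meeting notes attached, see you tomorrow",
               "win big money fast",
               "lunch tomorrow at noon?")
is_spam <- c(1, 0, 1, 0)
toy_dtm <- quanteda::dfm(toy_mails)
# word counts become predictors in a simple classifier
train <- cbind(is_spam, quanteda::convert(toy_dtm, to = "data.frame"))
glm(is_spam ~ win + money, data = train, family = binomial)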

Word collocation

Word frequencies are a simple tool to look at single words. What can be more insightful is to look at which words frequently occur together. If you still remember Trump’s campaign, “crooked Hillary” would be an example of two words frequently coming up together. And based on the word frequencies, both “Hillary” and “crooked” are among his top 25 most frequently tweeted words.

Word collocation is assessed using so-called “n-grams” (yes, this is where you need that mental note from the quanteda::tokens() option above). In our example, “crooked Hillary” would represent a “2-gram” or bigram. In other words, n simply refers to the number of words you are looking at. “Make America great again” is probably the most frequent “4-gram” in his tweets (actually, that’s probably only true when we don’t remove stop words). Let’s look at some common bi-, tri- and quatro(?)-grams:

# it's functions time!!!!!!!111!!!11
# to avoid copy pasting walls of code

ngram <- function(nofngram, ntop) {
  data <- quanteda::tokens(tweet_corp,
                     what="word",
                     remove_numbers=TRUE,
                     remove_punct=TRUE,
                     remove_symbols=TRUE,
                     remove_separators=TRUE,
                     remove_twitter=TRUE,
                     remove_url=TRUE) %>%
    quanteda::tokens_remove(pattern=c(stopwords("english"), 
                       unlist(c(stop_words[,1])),
                       unlist(c(add_stopw)))) %>%  
    quanteda::tokens_wordstem() %>%  
    quanteda::tokens_ngrams(n=nofngram) %>%   
    quanteda::dfm() %>%   
    quanteda::topfeatures(n=ntop)
  return(data)
}


tok_bi <- ngram(2,10)
tok_tri <- ngram(3,10)
tok_quatro <- ngram(4,10)


# small function to handle data
countplot <- function(dataset){
  dataset <- as_tibble(cbind(names(dataset), as.vector(dataset))) 
  dataset$V2 <- as.numeric(dataset$V2)
  dataset <- dataset %>%
    mutate(V1=reorder(V1,V2))
  return(dataset)
}

# the same plotting code for each n-gram set
ggplot(countplot(tok_bi), aes(x=V1, y=V2)) +
  geom_col() + 
  coord_flip() +
  theme_minimal() +
  labs(x="Words", 
       y="Frequency",
       title="Bigrams")

ggplot(countplot(tok_tri), aes(x=V1, y=V2)) +
  geom_col() + 
  coord_flip() +
  theme_minimal() +
  labs(x="Words", 
       y="Frequency",
       title="Trigrams")

ggplot(countplot(tok_quatro), aes(x=V1, y=V2)) +
  geom_col() + 
  coord_flip() +
  theme_minimal() +
  labs(x="Words", 
       y="Frequency",
       title="Quatrograms")

Easy enough. And as expected, there is a lot of mentioning of “Crooked Hillary”. Many collocations refer to names (“donald trump”, especially among bigrams) or names with adjectives (“goofi elizabeth warren” or “lyin ted cruz”, among trigrams). Overall, surprisingly few political concepts but a lot of name-calling.

And this type of analysis can be very insightful, for example when contrasting how the behavior of social groups is described. Think of investigating which adjectives are collocated with the word immigrant in the news, or what “women” and “men” do in Jane Austen novels (spoiler: women remember, read and feel while men stop, take and reply).

Word and text similarity

Another descriptive aspect of texts is how similar they are to other texts. Here is where quanteda shines as you would need to calculate all this “manually” using tidytext.

Similarity can be assessed for single words or whole texts. Let’s try a single word first. Which words are similar to “hillari”:

similar_wrds <- quanteda::textstat_simil(tweet_dtm, 
                               "hillari", 
                               margin="features")
head(similar_wrds[order(similar_wrds[,1], decreasing = T),], 10)
##   hillari     crook   clinton judgement      beat     berni    e-mail 
## 1.0000000 0.5531597 0.4603057 0.1867358 0.1752339 0.1664330 0.1340344 
##    sander       bad       rig 
## 0.1209902 0.1172424 0.1136254

Which words are distant to “hillari”:

distant_wrds <- quanteda::textstat_dist(tweet_dtm, 
                               "hillari", 
                               margin="features")
head(distant_wrds[order(distant_wrds[,1], decreasing = T),], 10)
##       realdonaldtrump                 trump             trump2016 
##              45.95650              44.04543              34.68429 
##                  poll makeamericagreatagain               america 
##              33.89690              33.34666              33.18132 
##                 peopl                  vote                   cnn 
##              32.14032              31.27299              30.08322 
##                donald 
##              30.04996

We can do the same for whole documents, but be aware that doing so is somewhat questionable given that we are working with tweets. Consider this: we are comparing a single tweet to all other tweets. So just for completeness’ sake, here is how you would compare the first tweet to all other tweets. How does it compare to its most similar counterpart?

similar_docs <- quanteda::textstat_simil(tweet_dtm, 
                               1, 
                               margin="documents")

sim_doc <- similar_docs[order(similar_docs[,1], decreasing = T),][2]

# Here are the original tweets
quanteda::texts(tweet_corp[1])
##                                                                                                                         text1 
## "Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z"
quanteda::texts(tweet_corp[names(sim_doc)])
##                                                                                                                                       text3509 
## "I will take care of the Veterans who have served this country so bravely.\n#ThankAVet Video: https://t.co/WH9GSeSH29 https://t.co/WMR7jnmsyz"

There you have it.

Readability and Lexical diversity

Lexical diversity is another descriptive measure for whole documents, measuring how diverse a text is in terms of the words used. Now, tweets are not the best source to demonstrate this aspect. So let me use another data source here, to which you unfortunately do not have access (yet).

Theodore Abel was an American sociologist who in the 1930s had the genius idea of faking a contest encouraging members of the NSDAP who joined before 1933 to write down their life story and motivation for joining up. I will use two of the roughly 600 letters he received to illustrate lexical diversity. Here you can read the first and second letter.

########################## reading data
text1 <- read_lines("Wilhelm Naatz236350.txt")
text2 <- read_lines("FC Prinz v SchaumburgLippe238439.txt")
names <- c("Wilhelm Naatz", "Prinz Schaumburg-Lippe")

nazis <- as_tibble(cbind(rbind(text1,text2),names)) %>%
  rename(text=V1)
str(nazis)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  2 variables:
##  $ text : chr  "Wilhelm Naatz Duisburg, Lützowstr.19 Lebenslauf Ich wurde am 26. Mai 1904 als Sohn des Mühlenbesitzers und Land"| __truncated__ "Friedrich Christian Prinz zu Schaumburg-Lippe. Ich wurde im Jahre 1906 in ein Milieu, in eine Welt hineingebore"| __truncated__
##  $ names: chr  "Wilhelm Naatz" "Prinz Schaumburg-Lippe"
# German characters can be annoying. Make sure to have .txt files saved using utf-8 encoding. Otherwise you may be in a world of pain.

nazi_corp <- corpus(nazis)
summary(nazi_corp)
########################## calculating readability
# Note that there are dozens of different readability measures, so please consult the help file for more information
textstat_readability(nazi_corp,
                     measure="Flesch")
########################## calculating diversity
# Can only be calculated for tokens or a DTM
nazi_dtm <- dfm(nazi_corp)
textstat_lexdiv(nazi_dtm, 
                measure="all")

Surprisingly, the text by Wilhelm Naatz, a tradesman and farmer’s son, is both more readable and more lexically diverse compared to the letter written by someone from the nobility.

Sentiment analysis

And now to the big one. Sentiment analysis is pretty popular at the moment, with applications in many diverse fields (e.g., sentiment analyses of tweets regarding companies to inform quantitative stock trading). Sentiment analysis assigns sentiment scores to the words in a document (typically related to how “positive” or “negative” certain words are). Hence, you can get document-specific summaries of how sentiment changes within documents (e.g., think of books) or you can aggregate sentiment over documents in order to answer other questions (e.g., how does the average sentiment differ between books?). Here, we want to track the sentiment of Trump’s tweets over his campaign. So first, we need to get sentiment scores. There are many options, and you are again facing the issue of language dependency: there will be many dictionaries for English, but you may be out of luck for other languages. Moreover, these sentiment dictionaries can be topic-specific. But as always, you are most likely not the first person who has to deal with this, and there are many resources online that may have what you are looking for. For instance, here are one, two, three resources for German.

Fortunately, the tidytext package has us covered with respect to English and comes with three general purpose sentiment dictionaries. Note that not all words get a sentiment score. Moreover, sentiment analysis is another one of those “bag-of-word”-type analyses so qualifiers like “not”, etc. are NOT taken into account!

# The three options are 
# afinn: scores positive and negative sentiment ranging from [-5;5]
# bing: scores into "positive" and "negative"
# nrc: scores emotions like anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise and trust

senti <- get_sentiments("afinn")
head(senti)
# Let's read our tweets in again and merge the sentiments
tweet_tidy <- as_tibble(dt_tweets) %>% 
  # tokenize the tweets
  tidytext::unnest_tokens(word, Tweet_Text) %>%
  # merge sentiment
  inner_join(senti)

# Note: inner_join keeps only those rows that are present in both data sets
# now we can group the sentiment by tweet and compute a positivity score by totaling the sentiments

positivity <- tweet_tidy %>%
  group_by(Tweet_Id) %>%
  summarise(positiv=sum(score))
head(positivity)
# it would be great to plot this score over time again, so let's merge the positivity score back to the tweet_tidy data
tweet_tidy <- as_tibble(dt_tweets) %>% 
  inner_join(positivity, by="Tweet_Id") %>%
  mutate(datum=ymd(Date))

ggplot(tweet_tidy, aes(x=datum, y=positiv)) + 
  geom_point() +
  geom_smooth() +
  theme_minimal() +
  labs(x="Date",
       y="Average sentiment")

So over time, his tweets became progressively more negative in sentiment. What do the most negative and the most positive tweets look like?

cat(tweet_tidy$Tweet_Text[tweet_tidy$positiv==max(tweet_tidy$positiv)][5])
## The @WSJ Wall Street Journal loves to write badly about me. They better be careful or I will unleash big time on them. Look forward to it!
cat(tweet_tidy$Tweet_Text[tweet_tidy$positiv==min(tweet_tidy$positiv)][1])
## Trump rally disrupter was once on Clintons payroll
## https://t.co/75oLLuD4SI

Actually, there are 100 tweets with the maximum value and 56 with the minimum value.
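In case you wonder where those counts come from, a quick sketch:

sum(tweet_tidy$positiv == max(tweet_tidy$positiv))
sum(tweet_tidy$positiv == min(tweet_tidy$positiv))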

How do people react to the sentiment of these tweets? Let’s look at the number of retweets separately for negative and positive sentiment tweets:

tweet_tidy <- tweet_tidy %>%
  mutate(senti_indi=
           case_when(positiv<=0 ~ "negative",
                     positiv>0 ~ "positive"))

ggplot(tweet_tidy, aes(x=senti_indi, y=log(Retweets), fill=senti_indi)) +
  geom_violin() + 
  theme_minimal() + 
  labs(x="Average sentiment") + 
  theme(legend.position = "none")

Negative tweets get retweeted considerably more often than positive tweets. How about angry vs. joyful tweets?

senti <- get_sentiments("nrc")
head(senti)
anger_joy <- as_tibble(dt_tweets) %>% 
  tidytext::unnest_tokens(word, Tweet_Text) %>%
  inner_join(senti) %>%
  mutate(joy=
           case_when(sentiment=="joy" ~ 1,
                     sentiment!="joy" ~ 0),
         anger=
           case_when(sentiment=="anger" ~ 1,
                     sentiment!="anger" ~ 0)) %>%
  group_by(Tweet_Id) %>%
  summarise(joy=mean(joy, na.rm=T),
            anger=mean(anger, na.rm=T))

tweet_tidy <- as_tibble(dt_tweets) %>% 
  inner_join(anger_joy, by="Tweet_Id") %>%
  mutate(datum=ymd(Date))

df_joy <- tweet_tidy %>% 
  select(joy, datum, Retweets) %>%
  mutate(emotion="Joy") %>%
  rename(value=joy)
df_anger <- tweet_tidy %>% 
  select(anger, datum, Retweets) %>%
  mutate(emotion="Anger") %>%
  rename(value=anger)
df_emo <- rbind(df_joy,df_anger)


ggplot(df_emo, aes(x=datum, y=value, color=emotion, group=emotion)) + 
  geom_point() +
  geom_smooth() +
  theme_minimal() +
  labs(x="Date",
       y="Percent of tweet text expressing either joy or anger")

df_emo <- df_emo %>%
  # filtering for above average values, reasonable in this case because the group means are fairly close (7 vs. 8 percent)
  filter(value>mean(value, na.rm=T))

ggplot(df_emo, aes(x=emotion, y=log(Retweets), fill=emotion)) +
  geom_violin() + 
  theme_minimal() + 
  labs(x="Tweets with high emotional salience regarding ...") + 
  theme(legend.position = "none")

A similar pattern compared to the negative sentiment analysis: angrier tweets receive more retweets. What do joyful and angry tweets look like?

cat(tweet_tidy$Tweet_Text[tweet_tidy$joy==max(tweet_tidy$joy)][1])
## RT @marklevinshow: Trump: Rove is a clown and a loser http://t.co/jfDKiaTJeN
cat(tweet_tidy$Tweet_Text[tweet_tidy$anger==max(tweet_tidy$anger)][1])
## Will be interviewed on @FoxNews by @JudgeJeanine tonight at 9:00 P.M. Enjoy!

Well, I wouldn’t count the first example as a particularly joyful tweet. But that’s just me.

But what this demonstrates quite nicely is the general drawback of bag-of-words approaches in that the broader context in which words - and thus sentiment attached to these words - occur doesn’t matter. Clowns may make you laugh but getting called a clown is a whole different situation.

Topic models and POS tagging

Enough with the child’s play of tweeting; let’s use some more interesting documents to introduce topic models and part-of-speech (POS) tagging. Topic models are another bag-of-words method which focuses on discovering latent “topics” in a collection of documents. In short, topic models compare the frequency of certain words across documents to group documents together into “topics”. In other words, while every topic is a mixture of words, every document also represents a mixture of topics. Note that this method is exploratory, meaning it will not tell you how many topics are in your data per se - that is something you need to specify and the validity of which you need to explore.

POS tagging comes in handy for topic models as it allows us to select words by type - that is, whether they are nouns, verbs, adjectives, adverbs, etc. Because topics are probably better captured by nouns (and adjectives), identifying nouns in the first place is crucial.

For this part of the tutorial, we will rely on a more extensive collection of Theodore Abel’s Nazi biograms. More specifically, our data set represents a random sample of 50 biograms.

library(textreadr)

sample <- readRDS("nazi_sample.rds")

biograms <- list()

for (i in 1:length(sample)) {
  x <- read_rtf(sample[i])  
  x <- paste(x, collapse = " ") 
  biograms[i] <- x
}

df <- data.frame(matrix(unlist(biograms), 
                        nrow=length(sample), 
                        byrow=T), 
                 stringsAsFactors=FALSE)

# get names as new variable
nazi_names <- str_sub(sample,1,-11)

df_biograms <- cbind(df,nazi_names, 
                     stringsAsFactors=FALSE) 
names(df_biograms) <- c("text","author")

OK, just so you can follow what I did here: I have a sample of file names that I read using readRDS(); then I generate an empty list to store the letters in. The loop goes over every file, loads it, collapses the line breaks and saves the result in the list. Then I transfer the list into a data frame, extract the names of the individuals who wrote the letters and attach them as a new variable to the letters.

POS tagging

First things first, let’s identify nouns in these letters using POS tagging which is implemented in the udpipe package:

library(udpipe)

# download the language model if you don't already have it. udpipe has around 50 languages, check the help file for additional information.
#ud_model <- udpipe_download_model(language = "german")

# note: udpipe_download_model() returns an object whose file_model element is the path to the
# downloaded .udpipe file, so run it at least once (or pass that file path to udpipe_load_model() directly)
ud_model <- udpipe_load_model(ud_model$file_model)

# depending on the size of the text corpus, this may take some time to run. For this 10% sample of Nazi biograms, it took around 5 minutes
annot_bio <- udpipe_annotate(ud_model, x = df_biograms$text, doc_id = df_biograms$author)
annot_bio <- as.data.frame(annot_bio)
head(annot_bio)

Our new data set is completely restructured. Rows are grouped by doc_id, corresponding to the letters. In addition, there are IDs for each paragraph and sentence. Next, the actual sentence content is recorded in sentence. But here is the important part: every row corresponds to a token (i.e., a word or punctuation mark). Hence, the original sentence “August Kirwa, Trier, Löwenbrückenerstr.15, Trier, den 11.” is split into 12 token rows. Two things to note: first, sentences are split by looking for “. [dot-space]”. Thus, this sentence is cut short, leaving out " August 1934", because of the split after “11.”. That is also the reason why “Löwenbrückenerstr.15” is counted as one token and not used to split the sentence, as the " [space]" is missing after the “.[dot]”. Although this is of little importance here, it may matter in your application, and you should remedy it with some regular expression magic or by tuning the udpipe tokenizer (you can also supply the udpipe_annotate() function with a data set where each row represents a sentence, thus enabling you to preconstruct a “correct” data set; see the udpipe documentation for more information).

Next, each token's ID is recorded within sentences (so these are not unique IDs for the whole corpus), as well as the actual token itself. The lemma column is very important again: it records a lemmatization of the tokens. Lemmatization is very similar to stemming, but while stemming works by "cutting off" parts of words (so that the result need not be a "word" you could use in a sentence), lemmatization uses linguistic models to get at the "root" or "base form" of words.
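
To see the difference on our own data, here is a quick sketch that puts Snowball stems (from the SnowballC package) next to the udpipe lemmas:

# compare stemming and lemmatization for the annotated tokens
library(SnowballC)

annot_bio %>%
  filter(upos %in% c("NOUN", "VERB")) %>%
  transmute(token,
            stem = wordStem(token, language = "german"),
            lemma) %>%
  head(10)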

upos records what we are after, the "universal part of speech tag". These tags are pretty self-explanatory but in case you need a refresher, check out this documentation. xpos records the same information with additional language-specific detail. feats gives you morphological features (i.e., a word's gender, whether it is singular or plural, etc.). And maybe also of relevance in some projects is what dep_rel records, namely the "universal dependency relations".

Let’s look at one sentence to get a better idea regarding the wealth of information POS tagging can give us:

annot_bio[79:96, c("token","lemma","upos","feats","dep_rel")]

What are some common nouns and adjectives in these letters?

annot_plot <- annot_bio %>%
  filter(upos %in% c("NOUN","ADJ")) %>%
  group_by(upos) %>%
  count(lemma, sort=T)

library(ggwordcloud)
ggplot(annot_plot %>% filter(n>20), aes(label = lemma, size = n, color=n)) +
  geom_text_wordcloud(eccentricity = 1) +
  scale_size_area(max_size = 10) +
  theme_minimal() +
  scale_color_gradient(low="#fb6a4a", high="#67000d") + 
  facet_wrap(~upos)

# a different approach is to look for keywords and rank them using the textrank algorithm (which builds on Google's PageRank)
library(textrank)
rank <- textrank_keywords(annot_bio$lemma, 
                          relevant = annot_bio$upos %in% c("NOUN", "ADJ"), 
                          ngram_max = 8, sep = " ")
rank <- as_tibble(subset(rank$keywords, ngram > 1 & freq >= 5))
ggplot(rank, aes(label=keyword, size=freq, color=freq)) +
  geom_text_wordcloud(eccentricity = 1) +
  scale_size_area(max_size = 15) +
  theme_minimal() +
  scale_color_gradient(low="#fb6a4a", high="#67000d")

Unsurprisingly for letters detailing individuals' life stories, there are a lot of references to "my X" (e.g., my cv, my father, my mother, my wife, my comrade, my duty) - and, with respect to retelling the rise of a movement, a lot of "our Y" (e.g., our "Führer", our enemy, our struggle, our people, our fatherland).

POS tagging is a highly versatile tool enabling sophisticated analyses. Remember the word collocation examples discussed previously (with bigrams and trigrams, etc.)? You could do this with POS tagging as well, looking for collocations of nouns and adjectives and much more (see the sketch below).
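
For instance, udpipe ships with a cooccurrence() function. Here is a sketch, following the udpipe documentation, that counts how often nouns and adjectives appear together within the same sentence:

# co-occurrence of nouns and adjectives within the same sentence
cooc <- cooccurrence(x = subset(annot_bio, upos %in% c("NOUN", "ADJ")),
                     term = "lemma",
                     group = c("doc_id", "paragraph_id", "sentence_id"))
head(cooc)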

Topic models

Topic models work on DTMs, so let's convert the data set to one using udpipe's built-in functions. For simplicity's sake, let's focus only on nouns:

nazi_corp <- annot_bio %>%
  filter(upos %in% c("NOUN"))

dtf <- document_term_frequencies(nazi_corp, document = "doc_id", term = "lemma")
nazi_dtm <- document_term_matrix(x = dtf)
nazi_dtm <- dtm_remove_lowfreq(nazi_dtm, minfreq = 5)

As I said before, topic models are an exploratory method, so we need to come up with a good approximation of the number of topics in the biograms. As a starting point, let's work with 6 topics for now and run our first topic model. Latent Dirichlet allocation (LDA) is a very popular algorithm for topic models and is implemented in the topicmodels package:

library(topicmodels)

nazi_lda <- LDA(nazi_dtm, 
                k = 6, 
                control = list(seed = 12345))

as_tibble(terms(nazi_lda, 10))
nazi_topics <- broom::tidy(nazi_lda, matrix="beta")

nazi_topterms <- nazi_topics %>%
  group_by(topic) %>% 
  top_n(10, beta) %>%
  ungroup() %>%  
  arrange(topic, -beta) %>%
  mutate(term = reorder(term, beta))

ggplot(nazi_topterms, aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

As a next step, you should probably start validating the number of topics LDA extracts (e.g., by using the ldatuning package to explore data-driven suggestions concerning the number of topics; see the sketch below). I will bypass this step - assuming that we have the correct number of topics (likely not true) - and go straight to demonstrating the stm package that facilitates exploring the topics and their meaning.
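
For completeness, here is a sketch of what such a check could look like with ldatuning. The metric names follow the package documentation; note that this re-estimates an LDA for every candidate number of topics and can therefore take a while:

#install.packages("ldatuning")
library(ldatuning)

k_search <- FindTopicsNumber(nazi_dtm,
                             topics = 2:15,
                             metrics = c("Griffiths2004", "CaoJuan2009",
                                         "Arun2010", "Deveaud2014"),
                             control = list(seed = 12345))
FindTopicsNumber_plot(k_search)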

Why demonstrate the topicmodels package first? The stm package really shines when you are able to provide meta information on the texts. Currently, the Nazi biograms have little meaningful meta information attached to them, but this will change as I am working on a meta data set collecting information on the writers' socioeconomic background, war experience, social status, etc. Hence, this analysis will be considerably expanded in the future (a hypothetical sketch of how such covariates would enter the model follows at the end of this section). For a full demonstration, you can check out this article in the meantime.

OK, so to use the stm package, we need to have our data as a DTM again. I want to use the lemmatized nouns for this, so I construct a new data set from them, effectively regenerating the letter data set by pasting the lemmas together into one big string per author:

library(stm)
library(quanteda)   # corpus(), dfm() and convert() used below come from quanteda

nazi_corp <- annot_bio %>%
  filter(upos %in% c("NOUN"))

dat <- c()

# paste all lemmas belonging to one author together into a single string
for (i in unique(nazi_corp$doc_id)) {
  temp <- nazi_corp %>% filter(doc_id==i)
  temp2 <- paste(temp$lemma, collapse=" ") 
  temp3 <- cbind(i, temp2)
  dat <- rbind(dat,temp3)
}

dat <- as_tibble(dat)
colnames(dat) <- c("docid","text")

# make corpus
nazi_corp <- corpus(dat)
# make dtm
nazi_dtm <- dfm(nazi_corp)
nazi_dtm <- dfm_remove(nazi_dtm, c("jahr","|","tag","zeit"))
# convert to correct format
nazi_dtm_stm <- convert(nazi_dtm, to = "stm")

# finally, run stm
nazi_stm <- stm(documents= nazi_dtm_stm$documents, 
                vocab=nazi_dtm_stm$vocab, 
                K = 6,
                verbose=F)

plot(nazi_stm, 
     type = "summary")

# This plot can be insightful. For me, it is just too messy
#plot(nazi_stm, 
#     type = "labels")
# And here is how you compare two topics:
plot(nazi_stm,
     type = "perspectives",
     topics=1:2)

The stm package also offers a way to assess the number of topics based on the data:

nazi_k <- searchK(nazi_dtm_stm$documents, 
                  nazi_dtm_stm$vocab, 
                  K = 2:10,
                  verbose=F)
plot(nazi_k)
# between 8 and 10 topics seem optimal
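
And here is the hypothetical sketch announced above: once the meta data set on the writers exists, covariates could enter the stm via the prevalence argument and be explored with estimateEffect(). The meta object and the variable names (social_status, war_experience) are placeholders that do not exist yet, which is why the code is commented out:

# purely hypothetical sketch - 'meta' and its variables are placeholders
#nazi_stm_meta <- stm(documents = nazi_dtm_stm$documents, 
#                     vocab = nazi_dtm_stm$vocab, 
#                     K = 6,
#                     prevalence = ~ social_status + war_experience,
#                     data = meta,
#                     verbose = F)
#effects <- estimateEffect(1:6 ~ social_status, nazi_stm_meta, metadata = meta)
#plot(effects, covariate = "social_status", topics = 1)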

Text networks

Text networks take the methodology used to model social relations between individuals and apply it to text documents. The relationships (i.e., the ties) between words can thus be used to describe similarities between words. In a similar vein, the relationships between authors can be modeled through the kinds of words that co-occur in the documents they wrote.

All we need to visualize these networks are functions implemented in the textnets package. Note that because I don't like how the textnets data handling functions treat German text data, I create the data sets myself but provide the equivalent textnets code as comments:

# the textnets package is not on CRAN so install it from GitHub using devtools
#library(devtools)
#install_github("cbail/textnets")
library(textnets)

# data needs to be in "tidy" format, meaning one row per document with additional information on the author. We already prepared this previously in the df_biograms data frame
#annot_bio <- PrepText(df_biograms, 
#                          groupvar = "author", 
#                          textvar = "text", 
#                          node_type = "words", 
#                          tokenizer = "words", 
#                          pos = "nouns", 
#                          remove_stop_words = TRUE, 
#                          compound_nouns = TRUE,
#                          language="german")
# Important to notice here is the "node_type" argument. By specifying "words", we will create networks for words. We specify the alternative, "groups", for an analysis later on.

# here is my custom data preparation using the POS tagging implemented in the udpipe library. Note that textnets also uses udpipe but for some reason, the results are very different.

# read POSed texts
annot_bio <- readRDS("annot_bio.rds") %>%
  # keep only nouns
  filter(upos=="NOUN") %>%
  # keep only lemmatized words and authors
  select(lemma, doc_id) %>%
  # kick out some stopwords
  filter(!lemma %in% c("Jahr","Tag","Zeit")) %>%
  # group by authors
  group_by(doc_id) %>%
  # count word frequencies by author
  count(lemma, sort=T) %>%
  ungroup() %>%
  rename(count=n) %>%
  # keep only words occurring more than 5 times
  filter(count>5) %>%
  # this orders the variables to work with textnets functions
  select(lemma, doc_id, count)

# create the network
nazi_network <- CreateTextnet(annot_bio)

# a first visualization
VisTextNet(nazi_network)

# the same visualization but interactive
#VisTextNetD3(nazi_network)
# usually better to save it locally using this set of commands:
#library(htmlwidgets)
#vis <- VisTextNetD3(nazi_network,
#                    height=500,
#                    width=700,
#                    bound=FALSE,
#                    zoom=TRUE,
#                    charge=-10)
#saveWidget(vis, "nazi_textnet.html")
# the "charge" option controls the "closeness" of nodes. The more negative the charge, the more strongly they are repelled from one another.

Note that the node color corresponds to text communities (the same color indicates a strong relationship between its components). There are some options of note to the VisTextNet() function. For example, betweenness=T sizes words according to their betweenness centrality (i.e., the extent to which a word sits between clusters in a sort of "brokerage position"); see the sketch below.
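
A quick sketch of that option (commented out here, as with the other optional visualizations; double-check the argument against the textnets version you installed):

# size words by their betweenness centrality
#VisTextNet(nazi_network, betweenness = TRUE)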

The textnets package also calculates a number of network measures for you:

# get text communities and plot their constituting words
text_communities <- TextCommunities(nazi_network)

ggplot(text_communities %>% filter(modularity_class %in% c(1,8,10,9)), 
       aes(label=group, 
       color=modularity_class)) +
  geom_text_wordcloud(eccentricity = 1) +
  scale_size_area(max_size = 15) +
  theme_minimal() +  
  facet_wrap(~modularity_class)

text_centrality <- TextCentrality(nazi_network)
text_centrality[text_centrality$betweenness_centrality>quantile(text_centrality$betweenness_centrality, 0.9),]

Conversely, we can create the same network representing the relationship between authors based on the words they use:

annot_bio <- annot_bio %>%
  # this orders the variables to work with textnets functions
  select(doc_id, lemma, count)

nazi_network <- CreateTextnet(annot_bio)
 
VisTextNet(nazi_network)

Packages overview

I hope you found this tutorial to be a comprehensive first step into the world of text mining. The following table provides a short overview of many of the packages used and of the analysis steps for which they are useful when working with text data.

Name Analysis step
vosonSML downloading data
quanteda corpus generation, tokenization, document-term-matrix generation, stop word, punctuation, etc. removal, n-grams, lexical diversity, readability
tidytext tokenization, n-grams, sentiment data
stopwords removing stop words
SnowballC stemming, stop words
textreadr reading text data files (e.g., .txt, .rtf, .doc)
udpipe POS tagging, tokenization, lemmatization, n-grams, co-occurrence, keywords
textrank keyword extraction
topicmodels topic models
stm topic models, explore and visualize topics, topic models including meta data
textnets text networks, topic models