The large-scale spread of online communication (e.g., in the form of blogs, comment threads or text messenger services) has generated a plethora of new data sources that can provide valuable and relevant insights into social science research questions. This short tutorial introduces the basics of mining and analyzing social media data using R.
Before we can actually analyze anything related to our research questions, we need to install a couple of programs. This tutorial relies exclusively on R, and if you are unfamiliar with it, don't worry: R is easy to learn, and the time you invest in it pays off across the entire research workflow.
So first things first, get R and RStudio.
Here is my short R Tutorial in case you need to freshen up on your R skills.
Next, you will need accounts for all social media services you plan to use (e.g., YouTube, Twitter, reddit).
In order to access social media data, you will also need access to the APIs (application programming interfaces) of the various social media sites. These APIs can be accessed directly, but relying on packages to do this from inside R is typically more convenient. Still, you will need to register your “project” and obtain the necessary credentials that allow you to interface with the services. In the following, I give you a step-by-step overview for YouTube and Twitter. Luckily, you don’t need special credentials to access data from Reddit.
In addition to packages that help you get data, you will also need packages that help you work with the data. The tidyverse package gives you easy access to powerful data handling tools (e.g., dplyr and stringr) as well as ggplot2 - the package to create amazing visualizations of your results. So go ahead, install and load the tidyverse package:
install.packages(c("tidyverse"))
library(tidyverse)
You might also think about getting a GitHub account to save and share your work more easily. Here is a great tutorial helping you to set up a GitHub workflow within RStudio. Obviously, you can also go the more tedious route of manually uploading your files to GitHub at regular intervals. Whatever floats your boat! Sharing code and data is becoming increasingly important, and the sooner you familiarize yourself with this aspect of doing science, the better for you down the road.
Accessing and downloading data is fairly simple and nothing like scraping websites “manually” - provided you have the necessary credentials. In case you want to learn how to scrape content from websites directly, you can find a short case study here.
As a cautionary note, you should be aware that services like YouTube or Twitter may change their rules of data access at any time! So you may have a great research question and plans to collect data, but that may all be over at a moment's notice, leaving you with nothing. That is one of the main drawbacks of relying on free access to interesting data collected by companies.
Let’s assume we installed all necessary packages in an earlier session and are starting with a clean slate. Whenever that is the case, we want to start by loading packages. Almost by default now, I always start by loading the tidyverse package. To get YouTube data, we also need the vosonSML package.
library(tidyverse)
library(vosonSML)
Next, we need to tell the YouTube API that we are authorized to access and download their data. For that reason, we retrieved an API key earlier (see Section Prerequisites -> YouTube). For convenience, store that key in an object and tell YouTube your intent to access data:
apikey <- "xxxxxxxx"
key <- Authenticate("youtube",
apiKey=apikey)
When running this for the first time, you may be redirected to a site in your browser asking you to confirm your identity. Notice that there may now be a “.httr-oauth” file in your working directory. This file stores your credentials and lets you skip the “identify yourself via the browser” step in future data requests. Note also that when you use GitHub, this file will automatically be added to the .gitignore file and thus not uploaded to GitHub, where others would have access to your credentials. Pretty neat!
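If your setup does not take care of this automatically, you can add the file to .gitignore yourself. Here is a minimal sketch (assuming your working directory is the project root):
# Sketch: append ".httr-oauth" to .gitignore if it is not listed there yet
if (!file.exists(".gitignore") ||
    !".httr-oauth" %in% readLines(".gitignore")) {
  write(".httr-oauth", file = ".gitignore", append = TRUE)
}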
For this example, we want to collect the comments from this video. Obviously, feel free to keep listening to it while you follow this tutorial! There are a couple of things you should note when working with YouTube videos: first, video URLs have a regular structure in which https://www.youtube.com/watch?v= is followed by a unique video ID (here: G1IbRujko-A). This video ID is what we need to download the comment data.
# You can either use this function to extract the video ID
videos <- GetYoutubeVideoIDs(c("https://www.youtube.com/watch?v=G1IbRujko-A"))
# or supply it "manually":
videos <- c("G1IbRujko-A")
# Either way works. If you want to download comments for multiple videos, just add their IDs to the vector like so:
# videos <- c("ID1","ID2","etc")
# This will use the key and download the data
yt_data <- key %>%
  Collect(videos)
What did we get?
str(yt_data)
## Classes 'datasource', 'youtube' and 'data.frame': 7617 obs. of 9 variables:
## $ Comment : chr "Put this at 1.5x and thank me later lads" "Saxophone teacher: What inspired you to learn this instrument. \n\nMe:" "3:51:19 Best part" "So many dislikes holyshit" ...
## $ User : chr "<U+05E2><U+05DE><U+05D9><U+05EA> <U+05E2><U+05E0><U+05E3>" "KillerMachine30_YT" "Tyler Phommachak" "Lavenyus Manufacturing" ...
## $ ReplyCount : chr "0" "0" "0" "0" ...
## $ LikeCount : chr "0" "1" "0" "0" ...
## $ PublishTime : chr "2019-04-17T05:42:56.000Z" "2019-04-17T03:39:31.000Z" "2019-04-17T03:12:54.000Z" "2019-04-17T01:46:36.000Z" ...
## $ CommentId : chr "UgxraIpGf3Fq0CJITR94AaABAg" "Ugz26jSTzSzMwFbHt_l4AaABAg" "Ugy8tUjOmHvI21fY3rB4AaABAg" "UgzK2fn5UbQaJRfiR914AaABAg" ...
## $ ParentID : chr "None" "None" "None" "None" ...
## $ ReplyToAnotherUser: chr "FALSE" "FALSE" "FALSE" "FALSE" ...
## $ VideoID : chr "G1IbRujko-A" "G1IbRujko-A" "G1IbRujko-A" "G1IbRujko-A" ...
names(yt_data)
## [1] "Comment" "User" "ReplyCount"
## [4] "LikeCount" "PublishTime" "CommentId"
## [7] "ParentID" "ReplyToAnotherUser" "VideoID"
nrow(yt_data)
## [1] 7617
We collect the comment text, who made the comment, how many replies and likes it received, when it was published, a unique comment ID, the ID of its parent comment (if applicable), whether it was a reply to another user, and to which video the comment belongs. So when you collect comments from multiple videos, the VideoID variable enables you to identify where each comment came from and match comments to specific videos.
YouTube comments have a very specific structure: there are parent comments, which are direct replies to a video, and there are child comments, which are replies to parent comments or to other child comments. So when ParentID == "None", the comment is a parent comment; when it has a different value, it is a child comment, and that value refers to the parent comment to which the child comment replies. The ReplyToAnotherUser variable records essentially the same information, but rather than storing the comment ID (see the CommentId variable), it stores the user name.
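To make this concrete, here is a small sketch (using the variable names shown above) that counts how many comments are parent comments and how many are replies:
# Sketch: parent comments (direct replies to the video) vs. child comments
yt_data %>%
  mutate(is_parent = ParentID == "None") %>%
  count(is_parent)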
So how extensive is the comment network in this example? How frequently do users engage with what others wrote?
ggplot(yt_data, aes(x = as.numeric(ReplyCount))) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Reply Count",
       y = "Count",
       title = "Interactions in YouTube video comment section",
       subtitle = "Source: www.youtube.com/watch?v=G1IbRujko-A")
Apparently, the vast majority of parent comments receive no replies at all, whereas only 61 out of 5629 parent comments - roughly 1.08 percent - attract a more substantial number of replies. Then again, studying commenting behavior or interaction in comment threads using this particular video may not be the smartest choice of data to begin with…
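If you prefer the underlying numbers over the plot, a sketch along these lines tabulates the reply counts of the parent comments (again relying on the variable names shown above):
# Sketch: distribution of direct replies across parent comments
yt_data %>%
  filter(ParentID == "None") %>%
  count(replies = as.numeric(ReplyCount)) %>%
  arrange(desc(replies))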
Getting Twitter data works very similarly. You first feed in your credentials:
appname <- "xxxxx"
myapikey <- "xxxxxxx"
myapisecret <- "xxxxx"
myaccesstoken <- "xxxxxx"
myaccesstokensecret <- "xxxx"
Then you authenticate your access and download data by specifying a search term:
tw_data <- Authenticate("twitter",
appname=appname,
apiKey=myapikey,
apiSecret=myapisecret,
accessToken=myaccesstoken,
accessTokenSecret=myaccesstokensecret) %>%
Collect(searchTerm="#brexit", numTweets=20)
There are a number of additional options for Collect() that you should be aware of (for example, numTweets controls how many tweets are retrieved); see ?Collect for the full list of arguments.
So, what did we get from this small data collection?
str(tw_data[1:14])
## Classes 'tbl_df', 'tbl', 'datasource', 'twitter' and 'data.frame': 16 obs. of 14 variables:
## $ user_id : chr "95285344" "497698757" "2509738688" "964990428134207488" ...
## $ status_id : chr "1118425742059167744" "1118425731543990272" "1118425719451918336" "1118425716763373568" ...
## $ created_at : POSIXct, format: "2019-04-17 08:07:30" "2019-04-17 08:07:28" ...
## $ screen_name : chr "PhilDuck" "angewick" "martinnewby_1" "edbutt78" ...
## $ text : chr "The new #Brexit Party, like everything #Farage touches, is all about division\nhttps://t.co/MggExzl2kM" "1/ A thread about the Government's response to the petition asking for a Public Inquiry into illegality in the "| __truncated__ "Revealed: The DUP arranged after the EU ref for NI public bodies to discuss 'investment opportunities' with Ric"| __truncated__ "@Jo2901F @UKIP @GerardBattenMEP @Nigel_Farage bolted when we won the referendum and wanted UKIP to fail, now th"| __truncated__ ...
## $ source : chr "Twitter for Android" "Twitter for iPhone" "Twitter for Android" "Twitter for iPhone" ...
## $ display_text_width : num 118 140 140 279 113 140 144 140 140 140 ...
## $ reply_to_status_id : chr NA NA NA "1118423313393569793" ...
## $ reply_to_user_id : chr NA NA NA "321436614" ...
## $ reply_to_screen_name: chr NA NA NA "Jo2901F" ...
## $ is_quote : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ is_retweet : logi TRUE TRUE TRUE FALSE TRUE TRUE ...
## $ favorite_count : int 0 0 0 0 0 0 0 0 0 0 ...
## $ retweet_count : int 155 170 508 0 100 1 1047 170 20 808 ...
names(tw_data)
## [1] "user_id" "status_id"
## [3] "created_at" "screen_name"
## [5] "text" "source"
## [7] "display_text_width" "reply_to_status_id"
## [9] "reply_to_user_id" "reply_to_screen_name"
## [11] "is_quote" "is_retweet"
## [13] "favorite_count" "retweet_count"
## [15] "hashtags" "symbols"
## [17] "urls_url" "urls_t.co"
## [19] "urls_expanded_url" "media_url"
## [21] "media_t.co" "media_expanded_url"
## [23] "media_type" "ext_media_url"
## [25] "ext_media_t.co" "ext_media_expanded_url"
## [27] "ext_media_type" "mentions_user_id"
## [29] "mentions_screen_name" "lang"
## [31] "quoted_status_id" "quoted_text"
## [33] "quoted_created_at" "quoted_source"
## [35] "quoted_favorite_count" "quoted_retweet_count"
## [37] "quoted_user_id" "quoted_screen_name"
## [39] "quoted_name" "quoted_followers_count"
## [41] "quoted_friends_count" "quoted_statuses_count"
## [43] "quoted_location" "quoted_description"
## [45] "quoted_verified" "retweet_status_id"
## [47] "retweet_text" "retweet_created_at"
## [49] "retweet_source" "retweet_favorite_count"
## [51] "retweet_retweet_count" "retweet_user_id"
## [53] "retweet_screen_name" "retweet_name"
## [55] "retweet_followers_count" "retweet_friends_count"
## [57] "retweet_statuses_count" "retweet_location"
## [59] "retweet_description" "retweet_verified"
## [61] "place_url" "place_name"
## [63] "place_full_name" "place_type"
## [65] "country" "country_code"
## [67] "geo_coords" "coords_coords"
## [69] "bbox_coords" "status_url"
## [71] "name" "location"
## [73] "description" "url"
## [75] "protected" "followers_count"
## [77] "friends_count" "listed_count"
## [79] "statuses_count" "favourites_count"
## [81] "account_created_at" "verified"
## [83] "profile_url" "profile_expanded_url"
## [85] "account_lang" "profile_banner_url"
## [87] "profile_background_url" "profile_image_url"
tw_data$text[1:3]
## [1] "The new #Brexit Party, like everything #Farage touches, is all about division\nhttps://t.co/MggExzl2kM"
## [2] "1/ A thread about the Government's response to the petition asking for a Public Inquiry into illegality in the #Brexit referendum, because this sort of self-serving and bilious dismissal of the public needs calling out sometimes. And I'm cross.\nhttps://t.co/RoyFiRmxPH"
## [3] "Revealed: The DUP arranged after the EU ref for NI public bodies to discuss 'investment opportunities' with Richard Cook - the Tory behind a hidden pro-union business group that donated £435,000 to the DUP during its Brexit campaign.\n\nhttps://t.co/xUgtwP8Lqr @irish_news #Brexit"
Overall, you will get 88 variables with information on the user, the content and how the tweet is linked to other content.
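For most analyses you will only need a handful of these variables, so it can be convenient to reduce the data to a smaller subset right away. A quick sketch using the column names listed above:
# Sketch: keep only the variables needed for a simple content analysis
tw_small <- tw_data %>%
  select(created_at, screen_name, text, is_retweet, retweet_count)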
vosonSML has built-in functions to create actor (user) networks and, in the case of Twitter data, also “semantic” networks:
actor_nw <- tw_data %>%
  Create("actor")
## Generating twitter actor network...
## Done.
plot(actor_nw$graph)
semantic_nw <- tw_data %>%
  Create("semantic",
         termFreq = 20,
         removeTermsOrHashtags = c("#brexit"))
## Generating twitter semantic network...
## Done.
plot(semantic_nw$graph)
“brexit” and “campaign” appear to be highly important words in the 20 tweets we collected, as indicated by their central position and the large number of arrows (or edges) directed at them. Later in this tutorial, you will learn more sophisticated methods to a) plot networks and b) create semantic networks, so don't be put off by the look of this example; publication-quality visualizations of networks are right around the corner.
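You can also check this visual impression numerically. Create() stores the network in the $graph slot (an igraph object), so a sketch like the following lists the terms with the most incoming edges:
# Sketch: rank terms in the semantic network by the number of incoming edges
library(igraph)
sort(degree(semantic_nw$graph, mode = "in"), decreasing = TRUE)[1:5]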
Since this is also done with the same package, getting Reddit data involves the same routine: authenticate, then collect.
red_auth <- Authenticate("reddit")
red_data <- red_auth %>%
  Collect("https://www.reddit.com/r/de/comments/bd8i60/tetraeder/")
str(red_data)
## Classes 'datasource', 'reddit' and 'data.frame': 25 obs. of 19 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ structure : chr "1" "1_1" "1_1_1" "1_1_1_1" ...
## $ post_date : chr "14-04-19" "14-04-19" "14-04-19" "14-04-19" ...
## $ comm_date : chr "15-04-19" "15-04-19" "15-04-19" "15-04-19" ...
## $ num_comments : num 25 25 25 25 25 25 25 25 25 25 ...
## $ subreddit : chr "de" "de" "de" "de" ...
## $ upvote_prop : num 0.94 0.94 0.94 0.94 0.94 0.94 0.94 0.94 0.94 0.94 ...
## $ post_score : num 458 458 458 458 458 458 458 458 458 458 ...
## $ author : chr "lebadger" "lebadger" "lebadger" "lebadger" ...
## $ user : chr "noodleboiiii" "chinupf" "noodleboiiii" "chinupf" ...
## $ comment_score : num 34 15 19 5 5 5 1 3 6 5 ...
## $ controversiality: num 0 0 0 0 0 0 0 0 0 0 ...
## $ comment : chr "Wenn es 4 Uhr morgens ist und du in 3h zur Chemievorlesung musst, aber dann laut ueber sowas lachst, statt zu s"| __truncated__ "Und, wie war die Vorlesung? Ü" "Hasse mich selbst, da ich nicht geschlafen habe; musste gerade aber nochmal über das Meme lachen also war es okay." "In der Uni lernt man ja auch was fürs Leben" ...
## $ title : chr "Tetraeder" "Tetraeder" "Tetraeder" "Tetraeder" ...
## $ post_text : chr "" "" "" "" ...
## $ link : chr "https://i.redd.it/7t5htsyy2bs21.jpg" "https://i.redd.it/7t5htsyy2bs21.jpg" "https://i.redd.it/7t5htsyy2bs21.jpg" "https://i.redd.it/7t5htsyy2bs21.jpg" ...
## $ domain : chr "i.redd.it" "i.redd.it" "i.redd.it" "i.redd.it" ...
## $ URL : chr "https://www.reddit.com/r/de/comments/bd8i60/tetraeder/?ref=search_posts" "https://www.reddit.com/r/de/comments/bd8i60/tetraeder/?ref=search_posts" "https://www.reddit.com/r/de/comments/bd8i60/tetraeder/?ref=search_posts" "https://www.reddit.com/r/de/comments/bd8i60/tetraeder/?ref=search_posts" ...
## $ thread_id : chr "bd8i60" "bd8i60" "bd8i60" "bd8i60" ...
names(red_data)
## [1] "id" "structure" "post_date"
## [4] "comm_date" "num_comments" "subreddit"
## [7] "upvote_prop" "post_score" "author"
## [10] "user" "comment_score" "controversiality"
## [13] "comment" "title" "post_text"
## [16] "link" "domain" "URL"
## [19] "thread_id"
Note that the structure of Reddit threads can be quite complex. Unlike YouTube, where you have a parent comment with x child comments replying to it, redditors can reply directly to child comments, creating a complicated web of communication. The structure variable captures this in a stylized fashion:
red_data$structure
## [1] "1" "1_1" "1_1_1" "1_1_1_1" "1_2"
## [6] "1_2_1" "1_2_1_1" "1_2_1_1_1" "1_2_2" "1_2_2_1"
## [11] "1_2_2_1_1" "1_2_2_1_1_1" "1_2_2_1_2" "1_2_2_2" "1_2_3"
## [16] "2" "3" "4" "5" "6"
## [21] "6_1" "6_1_1" "7" "8" "9"
In YouTube lingo, there are 9 parent comments (numbers 1 to 9) with varying degrees of comment activity. Comment 1 has two direct replies (1_1 and 1_2), while 1_2 has three replies of its own (1_2_1, 1_2_2 and 1_2_3).
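A small sketch (using str_count() from stringr, which is part of the tidyverse) can translate these structure strings into nesting depths:
# Sketch: derive the nesting depth of each comment from its structure string
red_data %>%
  mutate(depth = str_count(structure, "_") + 1) %>%
  count(depth)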
As with YouTube and Twitter data, we can visualize the communication network easily with the Create() function:
network <- Create(red_data,
                  type = "actor")
## Generating reddit actor network...
## Done.
plot(network$graph)
I hope you found this tutorial to be a comprehensive first step into the world of text mining. The following table provides a short overview of many of the packages used and the steps in working with text data for which they are useful.
Name | Analysis step |
---|---|
vosonSML | downloading data |
quanteda | corpus generation, tokenization, document-term-matrix generation, removal of stop words, punctuation, etc., n-grams, lexical diversity, readability |
tidytext | tokenization, n-grams, sentiment data |
stopwords | removing stop words |
SnowballC | stemming, stop words |
textreadr | reading text data files (e.g., .txt, .rtf, .doc) |
udpipe | POS tagging, tokenization, lemmatization, n-grams, co-occurrences, keywords |
textrank | keyword extraction |
topicmodels | topic models |
stm | topic models, explore and visualize topics, topic models including meta data |
textnets | text networks, topic models |
Benoit et al. 2019. quanteda: Quantitative Analysis of Textual Data
Graham & Ackland. 2018. vosonSML: Tools for Collecting Social Media Data and Generating Networks for Analysis
Grün & Hornik. 2011. topicmodels: An R Package for Fitting Topic Models
Puschmann. 2018. Automatisierte Inhaltsanalyse mit R
Roberts et al. 2014. stm: R Package for Structural Topic Models
Silge & Robinson. 2017. Text Mining with R. O’Reily
Social media packages
In general, you only need to install packages when you (re-)install R, but you need to load them every time you want to use them. In order to make your scripts run with as little user input as possible, it can be useful to have R install packages as needed and subsequently load them, as sketched below.
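A minimal sketch of this pattern (the package names are just examples):
# Sketch: install packages only if they are missing, then load them
pkgs <- c("tidyverse", "vosonSML")
missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
invisible(lapply(pkgs, library, character.only = TRUE))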
vosonSML is a package providing access to YouTube, Twitter and Reddit from a single interface.
install.packages(c("vosonSML")) library(vosonSML)
The tuber package is an alternative to vosonSML for accessing YouTube data and provides additional functionality for acquiring video and channel metadata. I will not demonstrate its use here, but feel free to check out this vignette for additional information.
#install.packages(c("tuber")) #library(tuber)
Most packages come with extensive help files which you can access by typing:
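# For example, to open the package-level help of vosonSML (used here as an illustration):
help(package = "vosonSML")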
or to get help on specific functions, just type ? and the function name:
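# For example, to pull up the help page of the Collect() function:
?Collect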