Christoph Spörlein

Prerequisites

Before we can actually analyze anything related to our research questions, we will need to have a couple of programs installed.

So first things first, get R and RStudio.

R is easy to learn, even more so when you have experience with other statistical software such as SPSS or Stata. In the following, I will guide you through common data handling tasks in social science research.

This tutorial follows some conventions when presenting you with information:

  • packages are written in italic (e.g., the tidyverse package)
  • variables are written in bold (e.g., age or education)
  • commands are written in bold with parentheses (e.g., summary() or mean())

I don’t strive to give you the most elegant or optimal way to do something but merely one way to do it - one way that I hope is easy to read and follow. Some of the code snippets I use to demonstrate ideas could easily be written much, much shorter, but in doing so they may become much harder to follow for beginners. As you gain proficiency, you will almost automatically switch to the shorter solutions you see other people use.

I will also not explain every command in great detail. It is useful to look at the corresponding help files to get more information by typing ?COMMAND in the console.
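
For example, to open the help file for the mean() function:

# open the help page in the help tab
?mean
# the same, written as a regular function call
help(mean)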

Every subsection has some exercises for you to practice what has been discussed. Solutions to these exercises can be found at the very end of this document.


Setting up R

The RStudio IDE

Go ahead and open RStudio, which is an IDE (integrated development environment) for R. In general, RStudio will make your life considerably easier than working in base R would. One advantage is its well-arranged layout.

The blue box is where you open your scripts and write commands to be executed.

The results of these commands are printed in the console (yellow box). You execute commands by selecting the lines of code you want to execute and then pressing CTRL+ENTER. I personally want to run code using just one hand, so I changed this to CTRL+R under Tools -> Modify Keyboard Shortcuts -> Source -> Run current line/selection.

The red box lists all the objects currently in your work space. Objects can be many things: vectors, matrices, data sets, functions.

The green box has several tabs, of which you will mostly need three: a file browser, a plot window and the help tab. In the file browser, you can navigate the folders on your computer and set the working directory by clicking on "More" -> Set as working directory (doing this by command is probably better: setwd("FOLDER/SUBFOLDER/")). You can always get help for commands you don't know, or when you want to check out all available options and their defaults, by typing ?COMMAND.


Installing and loading packages

Base R only has - you guessed it - pretty basic capabilities. Oftentimes, you will need more specialized tools for your problems. You could write them yourself, but chances are someone has already done it for you by writing a package and publishing it on CRAN (the main repository for curated R packages) or GitHub.

In general, you will need to install packages only once, for example when you (re-)install R. You can install packages either by typing install.packages("PACKAGENAME") into the console or your script, or you do it "manually" by clicking on Tools -> Install Packages -> Type in the package name and click Install. Make sure to have "Install dependencies" checked because many packages rely on other packages to run. RStudio will automatically install them as well when the box is checked.

You will need to load packages every time you want to use them in a session. Thus, it is good practice to start your scripts by loading all necessary packages using library() or require().
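
A typical script header might therefore look like this (a minimal sketch; the package name is just an example):

# install once, e.g., after a fresh R installation
install.packages("tidyverse")

# load at the start of every session/script
library(tidyverse)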


Some basics

R is an object-oriented programming language, which basically means that everything you do can be stored in an object and accessed later on.

Loading a data set? Store in an object!

Changing the data set? Store it in another object!

Running a regression? Store in an object!

Making a graph? Store it in yet another object!

Writing a function? Save it in a new script file! Just kidding! Store it in an object!

This is an exciting aspect of R, especially when you previously worked extensively in Stata or SPSS where you essentially have one active data set to work on. In R, you will generate new objects left and right - sometimes storing little pieces of information for later use in some other function. Let’s look at a quick example that also teaches you something about different types of objects in R.

We can easily generate a small data set by making a couple of vectors and combining them to a tibble (which closely resembles what you will recognize as a data set):

library(tidyverse)

# generate a vector with numbers going from 1 to 5
x <- 1:5
x
## [1] 1 2 3 4 5

# a vector with 5 strings
z <- c("apple","banana","sausage","strawberry","ham")
z
## [1] "apple"      "banana"     "sausage"    "strawberry" "ham"

# another vector with 5 strings
y <- c("fruit","fruit","not fruit","fruit","not fruit")
y
## [1] "fruit"     "fruit"     "not fruit" "fruit"     "not fruit"

# store all three vectors as a tibble
data <- tibble(x,z,y)
# give the columns names (i.e., variable names)
colnames(data) <- c("num","shopping","groupID") 

# print the whole data set
data

Generating an object is as simple as assigning some content to it (i.e., the numbers 1 to 5 are assigned via "<-" to the object x). In order to see what is stored in an object, simply type its name. Note the c() command, which just creates a vector with the elements inside its parentheses.

So here we generate three objects and then bind them together treating them as column vectors (think of this as one variable in a data set). Then we assign each column a name (think of these as variable names). The last line simply prints the result which looks like a tiny data set.

Written under the variable names in the output of the data set, you have "int" and "chr", telling you that num contains integers (numeric data) and the other two variables characters (strings). Always check how R stores specific variables; this will save you a lot of time spent troubleshooting errors. Sometimes it may be necessary to overwrite how variables are stored:

# convert numeric to character
data$num <- as.character(data$num)
is.numeric(data$num)
## [1] FALSE
is.character(data$num)
## [1] TRUE

# convert back to numeric
data$num <- as.numeric(data$num)

#convert character to factor
data$shopping <- as.factor(data$shopping)
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    5 obs. of  3 variables:
##  $ num     : num  1 2 3 4 5
##  $ shopping: Factor w/ 5 levels "apple","banana",..: 1 2 4 5 3
##  $ groupID : chr  "fruit" "fruit" "not fruit" "fruit" ...

Here we convert num to character, check whether it is numeric (FALSE) and check whether it is character (TRUE). It worked! And then we convert it back to numeric. Here, we reference the num variable in the data data set with the dollar sign $. There are of course other conversion functions (e.g., as.data.frame(), as.matrix(), etc.). Sometimes you may also need as.factor() to convert something to a factor (i.e., categorical) variable. str() gives a short summary of the data set where the variable types are also given.

You can generate new variables by simply adding them to an existing data set.

data$sausage <- c("FALSE","FALSE","TRUE","FALSE","TRUE")
data

Another useful bit of information is how to access elements of data sets. Keep in mind, data sets are essentially n-by-m matrices where n is the number of rows and m the number of columns. In R, you can also access cells, rows and/or columns of a data frame by referencing them directly using the brackets [ ]. Here are different ways to access rows, columns or individual cells:

# all elements of a row
data[1,]

# the same output by saying "show me all rows except row 2 to 5"
data[-c(2:5),]

# all elements of a column
data[,1]

# all elements of a column by name
data[,"num"]

# a cell
data[1,1]

This should cover the very basics.

In addition, here is a list of commands that come in handy once in a while. They should be rather self-explanatory but feel free to check their help page for more information:

nrow(data)
## [1] 5
ncol(data)
## [1] 4
dim(data)
## [1] 5 4
unique(data$groupID)
## [1] "fruit"     "not fruit"
length(data$groupID)
## [1] 5
names(data)
## [1] "num"      "shopping" "groupID"  "sausage"
summary(data)
##       num          shopping   groupID            sausage         
##  Min.   :1   apple     :1   Length:5           Length:5          
##  1st Qu.:2   banana    :1   Class :character   Class :character  
##  Median :3   ham       :1   Mode  :character   Mode  :character  
##  Mean   :3   sausage   :1                                        
##  3rd Qu.:4   strawberry:1                                        
##  Max.   :5

Exercises

  1. Generate a vector with numbers from 1 to 100 named ID.
  2. Generate a vector of 100 random draws from a normal distribution with mean 0 and standard deviation of 5 (?rnorm). Name it x. Before you run the command, set the seed for the random number generator to 1234 to generate reproducible results (set.seed(1234)).
  3. Make a data set out of the vectors called myfirstdataset.
  4. Generate a new variable for myfirstdataset which transforms x as (500+60*x) and call it y.
  5. What are the means of x and y?
  6. What is the value of x for row 14?
  7. Store row 99 in a vector called row99.
  8. Convert ID to character.
  9. Generate a new variable for myfirstdataset which takes the value 0 for x<0 and 1 for x>=0. Name it dummy.

Data handling

For this more practical part, we will use a couple of packages, some of which only serve a single purpose such as loading data. Make sure to have them installed beforehand. Here is a list of all packages used in this part of the tutorial:

  • readxl
  • tidyverse
  • foreign
install.packages(c("readxl","tidyverse","foreign"))

Loading Data

Most projects start by loading some data. Oftentimes, these data sets are in one of four formats: .csv, .xlsx, .sav and .dta. R has a built-in function to deal with .csv data, but we will need a package to deal with Excel data (readxl) and one package to deal with SPSS and Stata data (foreign). For beginners, it can be helpful to set the working directory directly although some argue that this is not good practice (I enjoy setting working directories!).

Let’s load the packages, set the working directory and read in some data:

library(readxl)
library(foreign)

setwd("THIS/IS/WHERE/I/STORE/MY/STUFF/")
# Note the use of / instead of \

data_csv <- read.csv2("test.csv", header=TRUE, sep=";")
# read.csv2 has three important arguments:
# file = "test.csv" ; assuming this file is in the current working directory
# header = TRUE     ; meaning that the header of the files contains the variable names
# sep = ";"         ; meaning that cells are separated by ";"

data_xlsx <- read_excel("test.xlsx", sheet=1)
# read_excel has two important arguments:
# path = "test.xlsx"; assuming this file is in the current working directory
# sheet = 1         ; for excel files with many sheets, you can either load them by their position (1=first, 2=second) or by name ("country")

data_sav <- read.spss("test.sav", to.data.frame=TRUE)
# read.spss has one especially important argument:
# to.data.frame = TRUE ; without it, read.spss returns a list rather than a data frame

data_dta <- read.dta("test.dta", convert.factors=FALSE)
# read.dta has two important arguments:
# file = "test.dta"       ; assuming this file is in the current working directory
# convert.factors = FALSE ; keeps labelled Stata variables as their numeric codes instead of converting them to factors

foreign only reads .dta files up to version 12. Hence, make sure to use “saveold FILE, version(12)” in Stata to be able to load these files. There is also the haven package as part of the tidyverse but it transforms the variables into the labelled-class which can lead to issues for example when trying to plot them using ggplot().
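
If you do receive a .dta file from a newer Stata version, one possible workaround (a sketch, assuming the haven package is installed) is to read it with haven and then strip the labelled class afterwards:

library(haven)

# read_dta() also handles .dta files from recent Stata versions
data_dta <- read_dta("test.dta")

# zap_labels() drops the value labels so variables behave like plain numeric/character vectors
data_dta <- zap_labels(data_dta)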


Transforming data

Once we have a data set, we may still need to transform the data to better suit our needs. For most if not all our data handling needs, we will rely on the tidyverse package. To demonstrate common workflows, let’s use the built-in Star Wars data set, which records attributes for each character. Load the tidyverse package, load the Star Wars data and inspect it:

library(tidyverse)

# load data, only the first 10 variables
data <- starwars[,1:10]

# inspect the data
head(data)

The head() command gives you the first 6 rows of a data set. For each variable, it also shows its type (e.g., chr=character data, int=integer, dbl=double/numeric). Another common command to inspect data is str(data). I do not show this here because the output is fairly large but do try it yourself.


Piping

One of the appeals of the tidy-data approach is being able to “pipe” commands. Piping allows you to string together various commands with the pipe symbol (%>%), which makes your code easily readable. For example, I can read a data set, drop some individuals, select some attributes, generate a new variable, etc. in one “go”. Although you may not understand every command used here in detail, you probably won’t have too hard a time following the logic:

data_droids <- data %>%
  filter(species=="Droid") %>%
  select(name, height, mass, homeworld) %>%
  mutate(cm.per.kg=height/mass)

str(data_droids)
## Classes 'tbl_df', 'tbl' and 'data.frame':    5 obs. of  5 variables:
##  $ name     : chr  "C-3PO" "R2-D2" "R5-D4" "IG-88" ...
##  $ height   : int  167 96 97 200 NA
##  $ mass     : num  75 32 32 140 NA
##  $ homeworld: chr  "Tatooine" "Naboo" "Tatooine" NA ...
##  $ cm.per.kg: num  2.23 3 3.03 1.43 NA

Dummy and categorical variables

Let’s assume we are interested in how non-human species differ from humans. To do so, we will need a dummy variable differentiating between humans and non-humans. Most data transformations work with the mutate() command. When it comes to generating dummy variables or grouping together categories of categorical variables, the case_when() command is our friend.

dataV2 <- data %>%
  mutate(non_human=
           case_when(species=="Human" ~ 0,
                     species!="Human" ~ 1))

table(dataV2$non_human)
## 
##  0  1 
## 35 47

OK, this might still be somewhat overwhelming so let me just verbalize what is done here: I save a new data set dataV2 by using the original data set data and generate a new variable called non_human. This variable takes the value 0 when the species variable indicates humans, and 1 when it indicates not human (!=). And then I tabulate the new variable (yes, the default R table sucks… The janitor package has the tabyl() command which is much nicer).
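
If you want to try that nicer table, a minimal sketch (assuming the janitor package is installed) looks like this:

library(janitor)

# tabyl() returns counts and percentages as a data frame
tabyl(dataV2, non_human)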

In practice, you can string along any number of commands by piping, and you will have access to newly generated variables along the “way”. For example, I might immediately generate another new variable which reduces the number of categories of hair color to something more human-friendly:

dataV2 <- data %>%
  mutate(non_human=
           case_when(species=="Human" ~ 0,
                     species!="Human" | is.na(species) ~ 1)) %>%
  mutate(hair_color_hmn=
           case_when(hair_color=="black" ~ "black",
                     hair_color=="blond" ~ "blond",
                     hair_color=="brown" ~ "brown",
                     hair_color!="black" | hair_color!="blond" | hair_color!="brown" ~ "other"))

table(dataV2$non_human, dataV2$hair_color_hmn)
##    
##     black blond brown other
##   0     8     3    14    10
##   1     5     0     4    38

Note that the way I construct the new variables here is certainly not the most efficient way, but it is very readable in that you can almost “narrate” the flow of commands and understand what is going on. Personally, I’d rather have longer but easy-to-read code than short but cryptic code.


Constructing indices

You are probably as interested as I am in Jabba’s body mass index so let’s calculate it for every individual in the data set:

dataV2 <- dataV2 %>%
  mutate(BMI=mass/((height/100)^2))

dataV2 %>% 
  filter(name=="Jabba Desilijic Tiure") %>%
  select(name, BMI)

summary(dataV2$BMI)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   12.89   21.51   24.67   32.05   26.60  443.43      28

According to human standards, Jabba would be hyper obese.

You may have noticed that the second block of code filters for Jabba and only keeps the name and BMI variables. However, after that, I summarize the data and get information for the complete data set. The second block of code does not save the filtering and variable selection in a new data set (i.e., the “dataV2 <-” is missing), hence it just prints the filtered result in the console, which - due to the filtering - consists only of Jabba’s information.


Group-means

Another common task, especially in cross-national or multilevel research more generally, is to use aggregated information as independent variables (e.g., country-level general social trust or the percent of foreign-born across within-country regions). In the universe of Star Wars, we might be interested in which planets have a substantial human representation.

dataV2 <- dataV2 %>%
  group_by(homeworld) %>%
  mutate(perc_human=abs(mean(non_human,na.rm=TRUE)-1)) %>%
  ungroup() 

head(dataV2[,c("homeworld","perc_human")])

To get the percentage of humans per planet, we first group the data by homeworld and then make use of the dummy variable we constructed earlier. This may be a bit convoluted, but what abs(mean(non_human,na.rm=TRUE)-1) does is calculate the mean of non_human per planet, which gives you the percentage of non-humans. To get the percentage of humans, we subtract 1 and take the absolute value of the result. For example, the proportion of non-humans from Tatooine is 0.2; 0.2-1=-0.8, the absolute value of which is 0.8. Note also that we specify na.rm=TRUE for the mean() function. If you don’t do that and there are missing observations in the data, the group mean will be returned as NA.

head(dataV2[,c("homeworld","perc_human")]) does nothing more than print the first six observations for the homeworld and perc_human columns.
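
As an aside, the abs() detour can be avoided: because non_human is a 0/1 dummy, 1 minus its group mean already is the share of humans. A minimal equivalent sketch (perc_human_alt is just a hypothetical name):

dataV2 <- dataV2 %>%
  group_by(homeworld) %>%
  mutate(perc_human_alt = 1 - mean(non_human, na.rm = TRUE)) %>%
  ungroup()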


Merging data

R is an object-oriented language, meaning that you can have any number of data sets (which are in effect just objects) loaded simultaneously. So in contrast to what you might be used to from Stata or SPSS, you can load your data set with country characteristics, work on it, load your household survey and work on it, without ever having to close any of the data sets. This is tremendously advantageous when you want to work on data sets and then merge them together.

Merging data in R is straightforward. There are four ways to merge data sets: left_join(), right_join(), inner_join() and full_join(), but in practice you mostly need full_join() and left_join() (see “Combine Data Sets” on the reference sheet for a more detailed description). full_join() takes the two data sets and merges them, keeping all data from both data sets. To demonstrate this, we could have gotten the percentage of humans per homeworld and saved it in a different data set (maybe because we needed it in another analysis and want to retain access to it). Then we would need to merge it to the original data set in order to work with it. Let’s do this with full_join():

data_planets <- dataV2 %>%
  group_by(homeworld) %>%
  summarize(perc_human2=abs(mean(non_human,na.rm=TRUE)-1))

data.joined <- dataV2 %>%
  full_join(data_planets, by="homeworld")

summary(data.joined[,c("perc_human", "perc_human2")])
##    perc_human      perc_human2    
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.4545   Median :0.4545  
##  Mean   :0.4023   Mean   :0.4023  
##  3rd Qu.:0.8000   3rd Qu.:0.8000  
##  Max.   :1.0000   Max.   :1.0000

Based on the summary() output, we get the very same results. Note that I used summarize() instead of mutate() here. summarize() is a great function to aggregate values. In combination with group_by(), you can think of it as aggregating by groups.
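
To see the difference between the two: combined with group_by(), mutate() keeps all rows and repeats the group value within each group, while summarize() collapses the data to one row per group. A quick sketch using a hypothetical mean_height variable:

# mutate(): still 87 rows, the mean height repeated within each homeworld
dataV2 %>%
  group_by(homeworld) %>%
  mutate(mean_height = mean(height, na.rm = TRUE)) %>%
  ungroup()

# summarize(): one row per homeworld
dataV2 %>%
  group_by(homeworld) %>%
  summarize(mean_height = mean(height, na.rm = TRUE))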

Let’s say you are faced with a common situation where you have cross-national survey data from Europe and you have another data set with income inequality data for every country on the planet. full_join() would retain the income inequality values for all countries, even those that are not in your survey data. We typically don’t want that, so we would just use left_join(), assuming we do this “from the perspective of the survey data” (the left data set). left_join() joins the “left” data set with the “right” data set (i.e., the income inequality data) and only keeps observations from the left data set. In Stata terms, this is analogous to typing “drop if _merge==2” after merging data sets, where 2 refers to “observation appeared in using only”. Hence, the left data set would be the master data set in Stata lingo.

ESS.merged <- ESS %>%
  left_join(income_inequality, by="country")

Let’s further assume that we actually want to model within-country regional inequality and regions are just consecutively numbered within countries. Then we simply specify two variables to merge on:

ESS_regions.merged <- ESS %>%
  left_join(income_inequality, by=c("country","region"))

Missing data

You may have noticed that there is missing information in the Star Wars data set (e.g., hair color for R2-D2 and C-3PO). Missing data is represented by NA in R. Many commands, like mean(), require you to specifically tell R how you want to deal with missing data (i.e., by saying na.rm=TRUE). There are many ways to deal with missing information - the two most popular ones being to focus on complete cases (i.e., ignoring the issue) or imputing missing values.
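
A quick illustration of why na.rm=TRUE matters (using a hypothetical vector):

miss_example <- c(1, 2, NA, 4)

mean(miss_example)               # returns NA because one value is missing
mean(miss_example, na.rm = TRUE) # returns 2.33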

Dropping missing information can be done in many ways:

summary(dataV2[,1:3])
##      name               height           mass        
##  Length:87          Min.   : 66.0   Min.   :  15.00  
##  Class :character   1st Qu.:167.0   1st Qu.:  55.60  
##  Mode  :character   Median :180.0   Median :  79.00  
##                     Mean   :174.4   Mean   :  97.31  
##                     3rd Qu.:191.0   3rd Qu.:  84.50  
##                     Max.   :264.0   Max.   :1358.00  
##                     NA's   :6       NA's   :28

# drop missing data
data_nomiss <- na.omit(dataV2)

summary(data_nomiss[,1:3])
##      name               height         mass       
##  Length:29          Min.   : 88   Min.   : 20.00  
##  Class :character   1st Qu.:170   1st Qu.: 75.00  
##  Mode  :character   Median :180   Median : 79.00  
##                     Mean   :178   Mean   : 77.77  
##                     3rd Qu.:188   3rd Qu.: 83.00  
##                     Max.   :228   Max.   :136.00

Originally, we had 87 Star Wars characters. But because some variables have missing information, we end up with only 29 characters with complete information. na.omit() effectively does what you may know as list-wise deletion. As soon as an observation has missing information on any variable, it is dropped. So we might not even use hair color in a future analysis, but R2-D2 and C-3PO will get deleted nonetheless. Oh dear.

Imputation methods take a different route by “imputing” values to fill the missing information. If you want state-of-the-art imputation routines, check out the mice package. Here, we simply do mean imputation for the two numeric variables height and mass using two different approaches.

data_nomiss <- dataV2 %>%
  mutate(height=
           replace(height, 
                   is.na(height), 
                   mean(height, na.rm=T)))

data_nomiss$mass[is.na(data_nomiss$mass)] <- mean(data_nomiss$mass, na.rm=T)

summary(data_nomiss[,c("height","mass")])
##      height           mass        
##  Min.   : 66.0   Min.   :  15.00  
##  1st Qu.:167.5   1st Qu.:  75.00  
##  Median :178.0   Median :  84.00  
##  Mean   :174.4   Mean   :  97.31  
##  3rd Qu.:190.5   3rd Qu.:  97.31  
##  Max.   :264.0   Max.   :1358.00

Let’s verbalize what is happening here: I generate a new data set data_nomiss by using the original data and mutating the height data. I replace the height values which are missing by the mean height.

The second approach is how data handling used to be done before the tidyverse entered the stage. A lot more difficult to read, IMO. The right-hand side of the assignment arrow (<-) is easy enough to follow - it calculates the mean mass in the original data set. The left-hand side can be verbalized as follows: “I want to work with the mass variable in the data_nomiss data set. More specifically, I want to change all those rows which contain missing observations”. One important note to avoid confusion: in the basics section, I told you to access values using [rows, columns] (also called sub-setting) by referencing rows and columns. Here, there is no “,” because I already tell R that I want to focus on the mass column by saying data_nomiss$mass. Everything in the brackets thus refers to the rows of the mass column. I could have also used

data_nomiss[is.na(data_nomiss$mass), "mass"] <- mean(data_nomiss$mass, na.rm=T)

to reference the missing observations in mass using sub-setting with rows and columns simultaneously. I hope you see that the old way of doing this is considerably more involved as well as hard to read and parse.

Note that missing data for character data or strings may not be represented by “NA” but may be written as an empty string “”.
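
Since the Star Wars data already uses proper NAs, here is just a hypothetical sketch of how you could recode empty strings to NA using dplyr’s na_if():

# turn "" into NA so that is.na() and na.omit() treat it as missing
dataV2$hair_color <- na_if(dataV2$hair_color, "")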


Character data

When working with social media data, you will inadvertently have to deal with character data in the form of strings. The tidyverse contains the stringr package to help you with this daunting task. Fortunately, functions dealing with string data follow a convenient naming convention: every function starts with str_ . Try typing ?str_ into the RStudio console and a selector should open automatically which lets you scroll through all the various string-related functions available (provided you have the tidyverse package loaded). Let’s demonstrate some commands by converting the names to lower case, generating a variable which counts the number of characters in each name, and splitting the name into first and last name:

dataV2 <- dataV2 %>%
  mutate(name.low=
           str_to_lower(name),
         n.name=
           str_length(name))

# check to see what happened
dataV2[1:5,c("name","name.low")]

# summary statistics for the number of characters in names
summary(dataV2$n.name)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     7.5    10.0    10.3    13.0    21.0

# which characters have the fewest characters. Ha!
dataV2$name[dataV2$n.name==min(dataV2$n.name)]
## [1] "Rey" "BB8"

# which characters have the most characters. Ha again!
dataV2$name[dataV2$n.name==max(dataV2$n.name)]
## [1] "Jabba Desilijic Tiure" "Wicket Systri Warrick"

# splitting the name in first and last name
names <- str_split_fixed(dataV2$name, " ", n=3)
names[1:10,]
##       [,1]      [,2]          [,3]  
##  [1,] "Luke"    "Skywalker"   ""    
##  [2,] "C-3PO"   ""            ""    
##  [3,] "R2-D2"   ""            ""    
##  [4,] "Darth"   "Vader"       ""    
##  [5,] "Leia"    "Organa"      ""    
##  [6,] "Owen"    "Lars"        ""    
##  [7,] "Beru"    "Whitesun"    "lars"
##  [8,] "R5-D4"   ""            ""    
##  [9,] "Biggs"   "Darklighter" ""    
## [10,] "Obi-Wan" "Kenobi"      ""

Well, our idea of having first and last name as separate variables appears too simplistic, as many characters also have a middle name. We do have the first names in one column, but the second column contains the last name for some (who have an empty third column) whereas it contains the middle name for others. Let’s fix this by overwriting the middle name with the last name for those who have a middle name:

# Convert matrix to tibble
names <- as_tibble(names)
# when you convert a matrix to a data frame, R names the columns V1 through Vn (n = number of columns)

# replace middle name column with last name when middle names are present
names$V2[names$V3!=""] <- names$V3[names$V3!=""]

# assign variable names
colnames(names) <- c("firstname","lastname","bla")

# combine the original data set with the names. Note: this only works because we did not change the order of the observations. Had we changed that, or if the number of observations differed between the two data sets, we would need to merge them using some common ID variable
dataV2 <- cbind(dataV2,names[,1:2])

# inspect data
head(dataV2[,c("name","firstname","lastname")], n=7)

That was rather tedious. If only there were some way to make use of the regularly occurring patterns in strings…

Right, there is something called regular expressions (regex) which enable you to specify any pattern to look for in strings. As a note of caution, regex can be difficult to wrap one’s head around. At least that was my experience. Let’s gently introduce some regex ideas with a not so contrived example:

Assume for a moment, you work at a large research institute, you want to write a research grant and you are looking for collaborators. Unfortunately, only researchers with a PhD are allowed to submit grants. Now, you would like to write an email to all your coworkers holding a PhD, which you would have to manually extract from a list of all your institute’s employees. Blarg, too much work. But wait a minute, people holding a PhD have a “Dr.” in front of their names. A regular pattern you can look for! So you “CTRL+F” and put in “Dr.” and voilà, it highlights all coworkers with a PhD.

You can do the same in R - but automatically:

coworkers <- as_tibble(c("Prof. Dr. Beate Meier","Prof. Dr. Dr. Manuela Richter","Dr. Andre Giant", "Mandy Brückner","Regine Schmidt","Alfred Mäh","Dr. Heinz Erhardt"))
colnames(coworkers) <- "Name"

coworkers$PhD <- str_detect(coworkers$Name, pattern="Dr. ")
coworkers

Easy enough.

Now let’s assume I want to automate my emails as much as possible. Since I am on a first-name basis with all my coworkers, simply pasting “Dear NAME,” together sounds too formal (“Dear Prof. Dr. Dr. Manuela Richter”). So I want to delete all that title stuff. And there are two title patterns: “Prof.” and “Dr.” with various frequencies. Can you think of a more abstract way to describe this pattern? It is basically some characters (either 2 or 4 of them) followed by a “.”.

Regular expressions have several so-called character classes. For example:

  • [:digit:] = Numbers (1,2,3,4)
  • [:lower:] = lower case characters (a,b,c,d)
  • [:upper:] = upper case characters (A,B,C,D)
  • [:alpha:] = alphabetic characters (A,a,B,b)
  • [:alnum:] = alphabetic and numeric characters (1,2,A,a)
  • [:punct:] = punctuation (.,;,!)
  • \< = word beginning
  • \> = word ending
  • see this link for a more extensive overview

Hence, Prof. or Dr. could be represented by “[:alpha:].”, right? Wrong! But it’s not your fault. There are a number of metacharacters that must be “escaped” by “\”. Well, it turns out, “\” is also one of those metacharacters so you also need to escape it.

“.”, for instance, represents “any character”. What would happen if we use “[:alpha:].”? Think about it for a minute.

coworkers$PhD_new <- str_detect(coworkers$Name, pattern="[:alpha:].")
coworkers

Hm, it didn’t differentiate at all. That is because “[:alpha:].” literally means “find me any alphabetic character followed by any character”. And that obviously applies to any name. Escaping it properly should work, right? Yes and no. Let’s try it:

coworkers$PhD_new <- str_detect(coworkers$Name, pattern="[:alpha:]\\.")
coworkers

It did what we wanted it to do, but we can improve upon it. Right now, it looks for “any (one) character followed by a .”. So in effect, it matches on either “f.” or “r.”. Sometimes it is better to be a little more specific and give R less room to interpret stuff. Think about a more realistic example where you don’t only have names but sentences. With this approach, it will match any and every properly constructed sentence ending in “f.” or “r.”. That’s not good. So we need some quantifiers:

  • * = 0 or more
  • *? = 0 or more, ungreedy
  • + = 1 or more
  • ? = 0 or 1
  • ?? = 0 or 1, ungreedy
  • +? = 1 or more, ungreedy
  • {3} = exactly three times
  • {3,} = three or more times
  • {3,5} = three to five times

By default, matching is greedy, i.e., a quantifier matches as much as possible. The “ungreedy” variants match as little as possible instead.
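
A quick sketch of the difference (using a hypothetical name):

# greedy: grabs as much as possible before the final "."
str_extract("Prof. Dr. Meier", ".+\\.")
# would return "Prof. Dr."

# ungreedy: stops at the first possible match
str_extract("Prof. Dr. Meier", ".+?\\.")
# would return "Prof."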

Back to our example. A more specific way to match academic titles could be to use “[:alpha:]{2,4}\\. ” (note the trailing space). Here, we are telling R to look for “between 2 and 4 alphabetic characters followed by a . and a space”. Let’s see if it works:

coworkers$PhD_new <- str_detect(coworkers$Name, pattern="[:alpha:]{2,4}\\. ")
coworkers

Seems to have worked. Now we want to delete the titles from the names to write a more personal email:

email.header <- coworkers %>% 
  filter(PhD_new==TRUE) %>%
  mutate(informal=
           str_replace_all(Name,pattern="[:alpha:]{2,4}\\. ", replacement=""))

email.header

Note that I am using str_replace_all() rather than str_replace() because some of my coworkers have multiple titles and I want to get rid of all of them.

Now it is time to write our email. The paste() function is another amazing tool that lets you paste different content together as a string. The next chunk of code defines my email text, generates a firstname variable, removes the trailing space and generates a new variable called email.text by pasting my coworkers’ first names to the email text:

email.body <- c(",\nRemember that thing we talked about? Yes, that thing. Are you still interested in applying for a grant to study it? \nSee you in the cafeteria! I heard they have Knusperschnitzel on the menu today!\n\nBest, Christoph")

email.header <- email.header %>%
  mutate(firstname=
           str_extract(Name, pattern="[:alpha:]+ "),
         firstname=
           str_replace(firstname, " ","")) %>%
  mutate(email.text=
           paste("Dear ",firstname,email.body, sep=""))

# Print one email using cat() to properly display it
cat(unlist(email.header[1,"email.text"]))
## Dear Beate,
## Remember that thing we talked about? Yes, that thing. Are you still interested in applying for a grant to study it? 
## See you in the cafeteria! I heard they have Knusperschnitzel on the menu today!
## 
## Best, Christoph

I could have also used paste0() since I don’t want the chunks to be separated by anything, but paste() is more versatile so you should see it in use at least once. Also note that we used a different method to get the firstname variable, namely str_extract() with a regular expression, as opposed to splitting the string at " " separating first and last name. The “\n” represents a newline.
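
A quick comparison of the two (with hypothetical inputs):

paste("Dear", "Beate", sep=" ")  # "Dear Beate"
paste0("Dear ", "Beate")         # "Dear Beate"; paste0() never inserts a separator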

I hope this introduced you gently to the idea of regular expressions and how to use them to your advantage. As with everything in life, practice makes perfect so don’t expect to master regular expressions following just this short tutorial!


Exercises

Part I: From data to analysis

  1. Download wave 8 of the European Social Survey in Stata format, load and inspect it.
  2. We ultimately want to estimate a regression model predicting ethnic income differences across European societies. Hence, we need information on income, a majority-minority indicator, education, age and gender. Find the corresponding variables (here is the documentation), check how they are stored and change any variables stored in a format you think is inappropriate (Hint: we want to treat income and age as continuous variables).
  3. Clean the data, generate a majority-minority as well as female dummy and transform education into a factor.
  4. Generate country-level variables: percent foreign-born population and the percent agreeing that most people can be trusted (values 7-10). Merge it back to the ESS data.
  5. Run a linear multilevel model with individuals clustered in countries where income is regressed on the variables of interest (lmer() function of the lme4 package). Let the slope of immigrant (i.e., the ethnic differences) vary across countries. Store the results in an object called ESS.model.
  6. Expand the model by including a cross-level interaction between avg_trust and immigrant.

Part II: Regular expressions/character data

Consider the following: Learning R has been tremendous fun for you! There are so many problems in your life that R can solve. One of your biggest problems to date has been to manage your grocery shopping. You have your long shopping list, you go shopping, you get some items but not others. Now you want R to automatically delete the items you have managed to buy from your shopping list (what a crazy innovative idea! and I am wasting it on this tutorial?! I should found a start-up…) using the information on your receipt.

Here is your shopping list for this week:

  • 1x Eggs
  • 3x Milk
  • 1x Bread
  • 20x Beer
  • 2x M&Ms
  1. Store the shopping list in an R object. Create a data frame where the item is stored in one column and the quantifier in another column.

Here is a scan of your receipt (just imagine it) detailing what you bought:

  • 1x Freiland Eggs
  • 2x Frankenland Milk
  • 2x Sappel Bread
  • 19x Huppendorfer Beer (drank one on your way home, shame on you)
  • 2x M&Ms
  2. Also store this in an object and make a data frame out of it.
  3. Now update the shopping list by “subtracting” from it what you bought.
  4. Open and drink one of the remaining 19 beers!

Plotting data

Another great advantage of working with R is to be able to generate high quality graphs with ggplot2. Building those graphs is highly intuitive once you understand the basic logic. To demonstrate this, let me read your mind and reenact a thought process you probably had when making a graph in some other program in the past:

Your thought process, and how it translates to ggplot2 logic:

“I have this data set and would like to plot the relationship between x and y.”

ggplot(data=dataset, aes(x=x, y=y))

“I think this relationship is best represented as a scatter plot.”

ggplot(data=dataset, aes(x=x, y=y)) +
  geom_point()

“Oh, I should color the points by z.”

ggplot(data=dataset, aes(x=x, y=y)) +
  geom_point(aes(color=z))

“Right, I need to label the axes.”

ggplot(data=dataset, aes(x=x, y=y)) +
  geom_point(aes(color=z)) +
  labs(x="This is the x-axis",
       y="Is this the y-axis?",
       color="The values of z")

“Hm, it would be interesting to make this plot for every group of w?!”

ggplot(data=dataset, aes(x=x, y=y)) +
  geom_point(aes(color=z)) +
  labs(x="This is the x-axis",
       y="Is this the y-axis?",
       color="The values of z") +
  facet_wrap(~w)

“Ugh, I don’t want all these lines and borders. Give me something simple!”

ggplot(data=dataset, aes(x=x, y=y)) +
  geom_point(aes(color=z)) +
  labs(x="This is the x-axis",
       y="Is this the y-axis?",
       color="The values of z") +
  facet_wrap(~w) +
  theme_minimal()

“Wait, did I already write 800 lines of Stata code?”

ggsave(last_plot())

And so on. ggplots are essentially built sequentially, layer by layer. Of course you don’t need to literally build them step by step; the full command in the second-to-last row of the table also works without all the steps coming before it.

(Almost) every aspect you can think of in a graph can be changed in ggplot; it is only a matter of understanding which aspect is controlled by which ggplot layer.

Let’s use our Star Wars data set again and generate some plots. First, what is the relationship between height and mass and does it differ between humans and non-humans?

data_plot <- starwars %>%
  mutate(non_human=
           case_when(species=="Human" ~ "Human",
                     species!="Human" | is.na(species) ~ "Non-Human"))

ggplot(data=data_plot, aes(x=height, y=mass)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~as.factor(non_human)) + 
  labs(title="The relationship between height and mass", subtitle = "in the Star Wars universe", x="Height (in cm)", y="Mass (in kg)") + 
  theme_minimal()

Man, Jabba is ruining our beautiful graph :(

We can also look at the distribution of eye color by hair color, just to have a contrived example demonstrating bar plots:

data_plot <- data_plot %>%
  mutate(hair_color_hmn=
           case_when(hair_color=="black" ~ "black",
                     hair_color=="blond" ~ "blond",
                     hair_color=="brown" ~ "brown",
                     hair_color!="black" | hair_color!="blond" | hair_color!="brown" | is.na(hair_color) ~ "other"),
         eye_color_hmn=
           case_when(eye_color=="blue" ~ "blue",
                     eye_color=="brown" ~ "brown",
                     eye_color!="blue" | eye_color!="brown" ~ "other"))

# stacked bar plots
ggplot(data_plot, aes(x = hair_color_hmn, fill=eye_color_hmn)) + 
      geom_bar(aes(fill=eye_color_hmn), position = 'fill', color="black") + 
      labs(x="Hair color", y="Percent", fill="Eye color", title="Hair color by eye color", subtitle="in the Star Wars universe") +
      theme_minimal() +
      theme(axis.text.x=element_text(angle=90)) +
      scale_y_continuous(labels=scales::percent) + 
      scale_fill_manual(values=c("steelblue1","brown","white"))


# grouped bar plots
ggplot(data_plot, aes(x = hair_color_hmn, fill=eye_color_hmn)) + 
      geom_bar(aes(fill=eye_color_hmn,y = (..count..)/sum(..count..)), position = 'dodge', color="black") + 
      labs(x="Hair color", y="Percent", fill="Eye color", title="Hair color by eye color", subtitle="in the Star Wars universe") +
      theme_minimal() +
      theme(axis.text.x=element_text(angle=90)) +
      scale_y_continuous(labels=scales::percent) + 
      scale_fill_viridis_d()

Notice the use of scale_y_continuous() to control the appearance of y-axis aspects for continuous variables. It is probably not helpful to discuss every detail in these graphs, but the following overview covers common things to tinker with:

Main aspects

  • fill= controls how bars, points, etc. are filled, often by specifying another variable (e.g., bars are filled by gender or education)
  • color= controls how the borders of bars, points, etc. are colored, often by specifying another variable
  • alpha= controls the opaqueness of bars, points, etc. 1 is solid; smaller values make plotted things progressively more see-through
  • geom_X() where X represents the overall appearance of the plot. Common examples include geom_bar(), geom_density(), geom_boxplot(), geom_point() and geom_line()

Scales

  • scale_+_discrete() where + represents x or y. Enables you to control breaks, limits and labels of categorical variables
  • scale_+_continuous() where + represents x or y. Enables you to control breaks, limits and labels of continuous variables
  • scale_~_discrete() where ~ represents fill, color or alpha. Lets you control aspects related to fill, color or alpha when the variables used to set these aspects are discrete
  • scale_~_continuous() where ~ represents fill, color or alpha. Lets you control aspects related to fill, color or alpha when the variables used to set these aspects are continuous
  • scale_~_manual() where ~ represents fill, color or alpha. Gives you even more control by allowing you to specify custom mappings of data levels to aesthetic values, e.g., define the colors used for filling with values=c("blue","green","red") rather than relying on predefined color schemes and mappings

Labels

  • labs() is used to label aspects of the graph. Important arguments are x, y, title and subtitle; if present, also legend, fill and color

Themes

  • theme_X() gives your plot a different pre-constructed theme. Common examples include theme_bw(), theme_grey(), theme_minimal() and theme_classic(). If you plan to change other aspects controlled by theme(), make sure to position theme_X() before theme()
  • theme() gives you direct control of axis text, tick marks, legend positions, etc. Check out this and this website for more inspiration.

This overview is by no means exhaustive but will be sufficient for most of your needs to control a graph’s appearance.

ggplot2 provides a tremendous variety of plot types. Instead of showing you example after example, this website has tons of examples grouped by topic.

One final point: Frequently, you will want to plot several graphs and arrange them in some way. The gridExtra package has the grid.arrange() function to help you do exactly that:

library(gridExtra)

# scatter plot
p1 <- ggplot(data=data_plot, aes(x=height, y=mass)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~as.factor(non_human)) + 
  labs(title="The relationship between height and mass", subtitle = "in the Star Wars universe", x="Height (in cm)", y="Mass (in kg)") + 
  theme_minimal()

# stacked bar plots
p2 <- ggplot(data_plot, aes(x = hair_color_hmn, fill=eye_color_hmn)) + 
      geom_bar(aes(fill=eye_color_hmn), position = 'fill', color="black") + 
      labs(x="Hair color", y="Percent", fill="Eye color", title="Hair color by eye color", subtitle="in the Star Wars universe") +
      theme_minimal() +
      theme(axis.text.x=element_text(angle=90)) +
      scale_y_continuous(labels=scales::percent) + 
      scale_fill_manual(values=c("steelblue1","brown","white"))

# grouped bar plots
p3 <- ggplot(data_plot, aes(x = hair_color_hmn, fill=eye_color_hmn)) + 
      geom_bar(aes(fill=eye_color_hmn,y = (..count..)/sum(..count..)), position = 'dodge', color="black") + 
      labs(x="Hair color", y="Percent", fill="Eye color", title="Hair color by eye color", subtitle="in the Star Wars universe") +
      theme_minimal() +
      theme(axis.text.x=element_text(angle=90)) +
      scale_y_continuous(labels=scales::percent) + 
      scale_fill_viridis_d()

# to show plot in the plot window
grid.arrange(p1,p2,p3, layout_matrix = rbind(c(1,1),
                                             c(2,3)))

# how to save this plot with high quality:
#ggsave("Figure.png", plot=last_plot(), dpi=300, width=10, height=14, units="in" )

Reproducible research

Projects

Until now, I avoided talking about workflows to get you to focus on thinking about doing your regular data wrangling and analysis tasks in R. But from this point onward, you should establish a routine of working with “projects” and “RMarkdown”.

Projects are a great way to provide minimal structure to your work by keeping all your data files and R scripts (and whatever else may be useful to you) in a single folder. In addition, there is no need to set working directories within projects.

To use projects, simply click on File -> New Project and follow the instructions. Typically, you want to select “New Directory”, then “New Project”, give the project a name and select a folder to store it in. The next time you work on your project, open it by either double-clicking on the .Rproj file in your project folder or by clicking on File -> Recent Projects and selecting the appropriate entry.

RMarkdown

While the idea of projects is to help you organize your files in the background, RMarkdown helps you - among other uses - report what you did with your data in order to arrive at the results you presented in a poster, a paper or a presentation.

Well, you might say, I am already doing that, it is called the data and methods section in academic papers. True, but when somebody wants to replicate what you did - maybe because they have a similar research question or you introduced a new way to measure something or they suspect you made an error - then simply going through the code is easier than parsing what you wrote in a paper and hoping you mentioned all your steps and didn’t forget a tiny but important detail.

Well, you might say, but that is what online appendices or repositories are for. Also true, but the social sciences are a diverse field with some people using this software and others the other one - yes, the truly awful one. By using RMarkdown, you reduce how much expertise in your software of choice your peers need - at least to some degree. That is because RMarkdown enables you to distribute a description of your workflow in common formats such as PDF or Microsoft Word. Still, there may be some R code chunks in there which may be obtuse for non-experts in R, but that’s where RMarkdown’s main advantage comes in: being able to combine code, output and running text in a legible way.

Enough convincing; when taking one of my practical courses, you will have no choice but to use it anyway. LOL, or whatever you young people say currently.

A First Document

You can open a new document by clicking on File -> New File -> R Markdown. The next window allows you to specify a title, an author and an output format (Don’t worry, you can change any and all of it in a heartbeat later). Doing so will generate a skeletal RMarkdown file with some stuff already written in it. Ignore all of that for now and focus on the following picture:

(Figure: a screenshot of an RMarkdown document, with the Knit button circled in red and the components listed below marked.)

First things first: to actually generate your output file, you need to press the “Knit” button highlighted by the red circle. Pressing the button will convert the RMarkdown file to your output of choice. Don’t worry that this looks slightly different from what happens when you execute “regular” R code.

What this figure more importantly shows is that RMarkdown files have a couple of important components:

  • one and only one YAML header
  • code chunks
  • formatted text
  • inline code

The YAML header is where you define attributes of your file such as title, author, date, font and figure options, custom style sheets, but also, crucially, the output format. Hence, you can change all of that by replacing it with something else. For example, the YAML header for this document reads like this and mostly defines the attributes of the resulting html-document:

title: "R Tutorial for Social Sciences"
output:
  html_document:
    df_print: paged
    toc: yes
    toc_float: yes
editor_options:
  chunk_output_type: console

There are a ton of options, and it is typically easier to google what you want to achieve than for me to provide a list of all possible options.

While you will only have one YAML header, the other three components will occur more frequently in your document.

Code chunks are essentially boxes where your R code is executed. The code chunks need to be specified exactly as shown in the figure above. The “smallest” code chunk looks like this:

```{r}
some code-looking code
```

This can be a pain to type over and over again. Luckily, there is a shortcut: CTRL+ALT+i (or CMD+OPTION+i on Mac).

Typically, your code chunks will include more information. You can name your chunks like this:

```{r this_is_my_name}
 some code-looking code
```

And you can add options to each chunk which define how it behaves like this:

```{r this_is_my_name, doIputoptionshere=TRUE}
 some code-looking code
```

Again, there are many different options (see here for a general cheat sheet), but let’s look at two important ones - echo and eval. For example, sometimes you don’t want the code to be shown in your output document. Let’s say you are just showing a figure and it is not really important how you made it. Then setting echo=FALSE will prevent the code from being shown. Conversely, sometimes you may want to show only the code without running it. In this case, eval=FALSE is your friend.
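
Two sketches of what that looks like in a chunk header (the chunk names are arbitrary):

```{r my_figure, echo=FALSE}
# the figure appears in the output document, the code does not
plot(height ~ mass, data = starwars)
```

```{r my_setup, eval=FALSE}
# the code appears in the output document but is never run
install.packages("tidyverse")
```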

Formatted text is quite self-explanatory: you essentially just write your text like you would in any other word processor. Instead of having buttons which enable you to format the text or insert tables, you use special syntax. For example, to write italics or bold, you simply write *italics* or **bold**. Again, consult the cheat sheet for the plethora of formatting options.

Inline code is inserted using ` r some code-looking code `. Inline code may look redundant given that code chunks provide similar capabilities but don’t be fooled. While code chunks present code and the resulting R output, inline code only presents the result - which is 98-times more useful.

That’s funny! Why? Because I don’t even know how many times more useful it is because the number shown to you is based on a random draw. Ok, to see what I did, here is the sentence again, but this time I will show you what I wrote: While code chunks present code and the resulting R output, inline code only presents the result - which is `r floor(rnorm(1,100,5))`-times more useful.

floor(rnorm(1,100,5)) draws one random number from a normal distribution with mean 100 and standard deviation 5 and rounds it down to the nearest integer. So every time I generate this document, there will be a different number.

But when should this be useful to your work? In every single project where you describe your results! Just think about writing a Methods and Results section for any paper. How often did you need to go back and change a number because something in your data changed (e.g., maybe you are analysing a panel and the new wave just came in, increasing the number of cases you can analyse) or because reviewers wanted you to add another variable and now all your coefficients need to be updated in the running text of your results section? Well, with inline code you essentially do not need to worry about this anymore. You write your sentence once, and when the data changes, you simply compile the document again with the new numbers showing up automatically. I hope you see how tremendously powerful this is! And you can even compile this to a Word document and integrate it into your full paper or report using the themes and customizations you prefer.

Troubleshooting

Yes, you will likely run into trouble, because instead of having a single source of errors - your R code - you now also have to deal with RMarkdown errors, and the error messages are often not helpful to non-experts. Luckily, the problem will often still be your R code. One way of finding errors more quickly is to actually name your R chunks, see which chunk was the last one to run properly, and work your way from there trying to figure out potential errors.


Resources

Here is a reference sheet for common data transformation tasks and here is one for plotting data. Those will come in very handy in your adventures.

You may also want to check out this book which also covers the basics and a lot more of the tidy approach.


Exercise solutions

Some basics

# 1. Generate a vector with numbers from 1 to 100 named ID.
ID <- 1:100

# 2. Generate a vector of 100 random draws from a normal distribution with mean 0 and standard deviation of 5 (**?rnorm**). Name it x. Before you run the command, set the seed for the random number generator to 1234 to generate reproducible results (**set.seed(1234)**)
set.seed(1234)
x <- rnorm(100,0,5)

# 3. Make a data set out of the vectors called myfirstdataset.
myfirstdataset <- tibble(ID,x)
str(myfirstdataset)

#4. Generate a new variable for myfirstdataset which transforms x as (500+60*x) and call it y.
myfirstdataset$y <- (500+60*x)
                     
#5. What are the means of x and y?
summary(myfirstdataset[,c("x","y")])

#6. What is the value of x for row 14?
myfirstdataset[14,"x"]

#7. Store row 99 in a vector called row99.
row99 <- myfirstdataset[99,]
row99

#8. Convert ID to character.
myfirstdataset$ID <- as.character(myfirstdataset$ID)
is.character(myfirstdataset$ID)

#9. Generate a new variable for myfirstdataset which takes the value 0 for x<0 and 1 for x>=0. Name it dummy.
myfirstdataset$dummy[myfirstdataset$x<0] <- 0 
myfirstdataset$dummy[myfirstdataset$x>=0] <- 1

max(myfirstdataset$x[myfirstdataset$dummy==0])
min(myfirstdataset$x[myfirstdataset$dummy==1])

Transforming data

### PART 1:
#1.Download wave 8 of the European Social Survey in Stata format, load and inspect it.
library(foreign)
library(tidyverse)

ess <- read.dta("ESS8e02_1.dta", convert.factors = FALSE)
str(ess)

#2. We ultimately want to estimate a regression model predicting ethnic income differences across European societies. Hence, we need information on income, a majority-minority indicator, education, age and gender. Find the corresponding variables, check how they are stored and change any variables stored in a format you think is inappropriate (Hint: we want to treat income and age as continuous variables). 

# income: hinctnta
# majority-minority indicator: brncntr
# education: eisced
# age: agea
# gender: gndr

summary(ess[, c("hinctnta","brncntr","eisced","agea","gndr")])
str(ess[, c("hinctnta","brncntr","eisced","agea","gndr")])
# missing values for income, education, age and gender present
# education has a maximum of 55 which is worth inspecting
# all variables are measured as integers (numeric) except country of birth
# Alternatively, we could have just checked how income and age are stored:
is.numeric(ess$hinctnta)
is.numeric(ess$agea)

#3. Clean the data, generate a majority-minority as well as female dummy and transform education into a factor
ess.clean <- ess %>%
  mutate(female=gndr-1,
         isced=case_when(
           eisced==1 ~ "I",
           eisced==2 ~ "II",
           eisced %in% c(3,4) ~ "III",
           eisced==5 ~ "IV",
           eisced==6 ~ "V",
           eisced==7 ~ "VI"),
         immigrant=case_when(
           brncntr==1 ~ 0,
           brncntr==2 ~ 1
         ))

#4. Generate country-level variables: percent foreign-born population and the percent agreeing that most people can be trusted (values 7-10). Merge it back to the ESS data.

ess.aggr <- ess.clean %>% 
  mutate(high_trust=case_when(
    ppltrst<7 ~ 0,
    ppltrst>=7 ~ 1)) %>%
  group_by(cntry) %>%
  summarise(perc_foreign=mean(immigrant, na.rm=T),
            avg_trust=mean(high_trust, na.rm=T)) %>%
  select(cntry, perc_foreign, avg_trust)

summary(ess.aggr)

ess.joined <- ess.clean %>%
  left_join(ess.aggr, by="cntry")

#5. Run a linear multilevel model with individuals clustered in countries where income is regressed on the variables of interest (**lmer()** function of the *lme4* package). Let the slope of immigrant (i.e., the ethnic differences) vary across countries. Store the results in an object called ESS.model
library(lme4)

ESS.model <- lmer(hinctnta ~ immigrant+perc_foreign+avg_trust+isced+female+agea+(1+immigrant|cntry), data=ess.joined)
summary(ESS.model)

#6. Expand the model by including a cross-level interaction between the avg_trust and immigrant

ESS.model.inter <- lmer(hinctnta ~ I(immigrant*avg_trust)+immigrant+perc_foreign+avg_trust+isced+female+agea+(1+immigrant|cntry), data=ess.joined)
summary(ESS.model.inter)


### PART 2:
#Here is your shopping list for this week:
#* 1x Eggs
#* 3x Milk
#* 1x Bread
#* 20x Beer
#* 2x M&Ms

library(tidyverse)
#1. Store the shopping list in an R object. Create a data frame where the item is stored in one column and the quantifier in another column.
shopping <- c("1x Eggs", "3x Milk", "1x Bread", "20x Beer", "2x M&Ms")

#there are easier ways to do this but to show you some more regex examples, I will do this the hard way:
shop_dat <- as_tibble(shopping) %>% 
  mutate(item=str_extract(value, pattern="[:upper:]{1}([:lower:]{3,4}|&{1}[:alpha:]{2})")) %>%
  mutate(quantity_want=as.numeric(str_extract(value, pattern = "[:digit:]{1,2}")))
#one easy way would be to simply split the string at the whitespace

#Here is a scan of your receipt (just imagine it) detailing what you bought:
#* 1 Freiland Eggs
#* 2 Frankenland Milk
#* 2 Sappel Bread
#* 19 Huppendorfer Beer (drank one on your way home, shame on you)
#* 2 M&Ms

#2. Also store this in an object and make a data frame out of it.
receipt <- c("1 Freiland Eggs", "2 Frankenland Milk", "2 Sappel Bread", "19 Huppendorfer Beer", "2 M&Ms")

receipt_dat <- as_tibble(receipt) %>%
  mutate(item_long=str_extract(value, pattern="([:alpha:]+ [:alpha:]+)|([:upper:]{1}&{1}[:alpha:]{2})")) %>%
  mutate(quantity_bought=as.numeric(str_extract(value, pattern = "[:digit:]{1,2}")))
#now, the issue is that brand names are still part of the items, we need to get rid of them somehow
receipt_dat <- receipt_dat %>%
  mutate(item=str_match(item_long, shop_dat$item))

#3. Now update the shopping list by "subtracting" from it what you bought.
shop_dat_new <- shop_dat %>% left_join(receipt_dat, by="item") %>%
  mutate(quantity_want=quantity_want-quantity_bought) %>%
  filter(quantity_want>0) %>%
  select(-value.y, -item_long, -quantity_bought)

#4. Open and drink one of the remaining 19 beers!
me <- c("beer")
print("Ahhhhhhhh")