2 Packages and the Tidyverse

R’s open-source culture has encouraged a rich ecosystem of custom functions designed by scientists and researchers in the R userbase. These come in the form of ‘packages’, which are suites of several related functions. For example, there are packages for conducting statistical tests, producing data visualizations, generating publication-ready tables, and all manner of other tasks.

2.1 Loading Packages

Let’s try this out with one of the better-known R packages: ‘tidyverse’. This is actually a collection of several packages with a variety of interrelated functions for ‘tidying’, visualizing, and analyzing data. We will focus on what we need from ‘tidyverse’, but, if you’re curious, you can read more here: https://www.tidyverse.org/

If you’re on a lab computer, this package may already be installed. Let’s check by running the following command:

library(tidyverse)

If you receive an error when you run this, you likely do not have the package installed on your system. This is also probably the case if you are on your personal device and only recently acquired R.

If you got an error, run the following command:

install.packages("tidyverse")

With a few exceptions, you will always install new packages in this fashion: install.packages("package_name")

After it’s done installing, go back and run the library(tidyverse) command again. Installing is a one-time step, but loading is not: whether you’ve had a package for a while or just installed it, you need to load it into each new R session by placing its name in the library() function.

library(tidyverse)
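If you would rather check for the package without triggering an error, something like the following sketch may help. It uses base R's requireNamespace(), which returns TRUE or FALSE instead of stopping; the has_tidyverse name is just one I made up for illustration.

```r
# Check whether tidyverse is installed without raising an error
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (has_tidyverse) {
  message("tidyverse is installed; load it with library(tidyverse)")
} else {
  message("tidyverse is missing; run install.packages(\"tidyverse\") first")
}
```

Either way, you still need to call library(tidyverse) yourself before using its functions.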

2.2 Bringing in our Data

Let’s try bringing in a data frame to play with a few tidyverse functions. We’ll use the load() function to bring in a subset of the General Social Survey, which contains a few variables from the 2022 survey wave. Run the following command and select the file “our_gss.rda”.

load(file.choose())

The file.choose() function will open up a file-explorer window that allows you to manually select an R data file to load in. We’ll talk about some other ways to import data files using R syntax next time.
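To get a feel for what load() does behind the scenes, here is a minimal sketch that saves a made-up data frame to a temporary .rda file and loads it back. The toy object and the temporary path are assumptions for illustration; they have nothing to do with our_gss.rda.

```r
# A small made-up data frame (not the GSS data)
toy <- data.frame(id = 1:3, age = c(25, 40, 61))

# Save it to a temporary .rda file, then remove it from the session
path <- tempfile(fileext = ".rda")
save(toy, file = path)
rm(toy)

# load() restores the object under its original name and
# (invisibly) returns the names of everything it loaded
loaded_names <- load(path)
loaded_names
toy
```

Notice that we never assign the result of load() to a name of our choosing. The object comes back with whatever name it had when it was saved, which is why our data frame shows up as our_gss.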

Go ahead and take a look at the data frame. Each GSS survey wave has about 600-700 variables in total, so I’ve plucked several and done a little pre-processing to get us a subset to work with. All the variables here have pretty straightforward names, but I’ll note that realrinc is a clear outlier there. This is short for ‘Real respondent’s income’ and reflects the respondent’s income reported in exact dollar amounts. I’ll put a summary here so you can take a look if you’re not following along with your own script.

summary(our_gss)

      year            id            age           race          sex      
 Min.   :2022   1      :   1   Min.   :18.00   white:2514   female:1897  
 1st Qu.:2022   2      :   1   1st Qu.:34.00   other: 412   male  :1627  
 Median :2022   3      :   1   Median :48.00   black: 565   NA's  :  20  
 Mean   :2022   4      :   1   Mean   :49.18   NA's :  53                
 3rd Qu.:2022   5      :   1   3rd Qu.:64.00                             
 Max.   :2022   6      :   1   Max.   :89.00                             
                (Other):3538   NA's   :208                               
    realrinc                                      partyid   
 Min.   :   204.5   independent (neither, no response):835  
 1st Qu.:  8691.2   strong democrat                   :595  
 Median : 18405.0   not very strong democrat          :451  
 Mean   : 27835.3   strong republican                 :431  
 3rd Qu.: 33742.5   independent, close to democrat    :400  
 Max.   :141848.3   (Other)                           :797  
 NA's   :1554       NA's                              : 35  
           happy               marital                            attend    
 not too happy: 799   divorced     : 608   never                     :1149  
 pretty happy :1942   married      :1468   about once or twice a year: 464  
 very happy   : 779   never married:1095   every week                : 441  
 NA's         :  24   separated    : 103   less than once a year     : 416  
                      widowed      : 255   several times a year      : 346  
                      NA's         :  15   (Other)                   : 693  
                                           NA's                      :  35  
    cappun    
 favor :2013  
 oppose:1327  
 NA's  : 204

2.3 Data Wrangling with Tidyverse

Let’s use this subset to explore some tidyverse functionality. One of the packages included in the tidyverse is dplyr, which includes several functions for efficiently manipulating data frames in preparation for analyses. We will encounter a number of these throughout our time with R, but I want to briefly introduce a few key dplyr functions and operations that we will dig into more next time.

2.3.1 select()

It happens quite often that we have a data frame containing far more variables than we need for a given analysis. The select() function allows us to quickly subset data frames according to the variable columns we specify.

This function takes a data frame as its first input, and all following inputs are the variable columns that you want to keep:

sex_inc <- select(our_gss, id, sex, realrinc)

You should now have an object that contains all 3,544 observations, but includes only the 3 columns that we specified with select().

summary(sex_inc)

       id           sex          realrinc       
 1      :   1   female:1897   Min.   :   204.5  
 2      :   1   male  :1627   1st Qu.:  8691.2  
 3      :   1   NA's  :  20   Median : 18405.0  
 4      :   1                 Mean   : 27835.3  
 5      :   1                 3rd Qu.: 33742.5  
 6      :   1                 Max.   :141848.3  
 (Other):3538                 NA's   :1554
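select() can also drop columns rather than keep them, by putting a minus sign in front of a column name. Here is a quick sketch on a made-up data frame called toy (an assumption for illustration, so you can run it without the GSS file). I load dplyr directly here; it is one of the packages that library(tidyverse) loads for you.

```r
library(dplyr)

# A small made-up data frame (not the GSS data)
toy <- data.frame(
  id       = 1:3,
  sex      = c("female", "male", "female"),
  age      = c(25, 40, 61),
  realrinc = c(12000, 30000, 18000)
)

# Keep only the named columns
keep_two <- select(toy, id, realrinc)
names(keep_two)

# Keep everything except age
drop_one <- select(toy, -age)
names(drop_one)
```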

2.3.2 filter()

filter() functions similarly except that, instead of sub-setting by specific variables, it allows you to subset by specific values. So, let’s take the sex_inc object we just created above. We now have this subset of three variables—id, sex, and income—but let’s imagine we want to answer a question that’s specific to women.

In order to do that, we need to filter the data to include only observations where the value of the variable sex is ‘female’.

fem_inc <- filter(sex_inc, sex=="female")

Note that the fem_inc object still has 3 variables, but only 1,897 observations, roughly half of the original 3,544. filter() kept only the rows where sex is ‘female’, dropping both the male observations and the 20 rows where sex is NA.

summary(fem_inc)

       id           sex          realrinc       
 1      :   1   female:1897   Min.   :   204.5  
 3      :   1   male  :   0   1st Qu.:  7668.8  
 4      :   1                 Median : 15337.5  
 7      :   1                 Mean   : 22702.1  
 9      :   1                 3rd Qu.: 27607.5  
 10     :   1                 Max.   :141848.3  
 (Other):1891                 NA's   :883
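You can also give filter() more than one condition, separated by commas, and it will keep only the rows that satisfy all of them. The sketch below uses a made-up data frame called toy (an assumption for illustration) and also shows that rows where a condition evaluates to NA are dropped.

```r
library(dplyr)

# A small made-up data frame (not the GSS data)
toy <- data.frame(
  id       = 1:5,
  sex      = factor(c("female", "male", NA, "female", "male")),
  realrinc = c(12000, 30000, 18000, NA, 45000)
)

# One condition: the row with a missing sex is dropped too
fem <- filter(toy, sex == "female")
nrow(fem)

# Two conditions: both must be TRUE for a row to be kept,
# so the woman with a missing income is dropped as well
fem_high <- filter(toy, sex == "female", realrinc > 10000)
fem_high
```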

2.3.3 summarize()

As the name suggests, summarize() allows us to quickly summarize information across variables. It will give us a new data frame that reflects the summaries that we ask for, which can be very useful for quickly generating descriptive statistics. We will use this to get the mean income value for our data frame.

mean_inc <- summarize(our_gss, "mean_inc"=mean(realrinc, na.rm=TRUE))

Note

You probably noticed the na.rm = TRUE input that I supplied for the above function. This is short for ‘remove NAs’, which we need to do when a variable has any NA values. If we don’t, mean() will simply return NA, because R does not disregard NA values when calculating a mean unless we tell it to.
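Here is a tiny base R sketch of that behavior, using a made-up vector called incomes (an assumption for illustration):

```r
# A made-up vector with one missing value
incomes <- c(20000, NA, 40000)

# Without na.rm, the missing value makes the result NA
with_na <- mean(incomes)
with_na

# With na.rm = TRUE, the NA is removed before averaging
without_na <- mean(incomes, na.rm = TRUE)
without_na
```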

The summarize() call above gives us a new data frame that we called mean_inc. It should have 1 row and 1 column, and it just gives us the average income of a person in our GSS subset: about $28,000/year.

mean_inc

  mean_inc
1 27835.33

Now, this is not really all that impressive when we are asking for a broad summary like this. In fact, if all we wanted was to see the average income, we could get that more easily, e.g.

mean(our_gss$realrinc, na.rm = TRUE)

[1] 27835.33
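One thing summarize() does offer, even on its own, is the ability to ask for several statistics in a single call, with each one becoming its own column. Here is a minimal sketch on a made-up data frame called toy (an assumption for illustration); n() is a dplyr helper that counts rows.

```r
library(dplyr)

# A small made-up data frame (not the GSS data)
toy <- data.frame(realrinc = c(10000, 20000, NA, 60000))

# Each name = expression pair becomes one column in the result
toy_stats <- summarize(
  toy,
  mean_inc   = mean(realrinc, na.rm = TRUE),
  median_inc = median(realrinc, na.rm = TRUE),
  n_obs      = n()
)

toy_stats
```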

The true power of summarize() comes from chaining it together with other tidyverse functions. However, in order to do that, we will need to learn about one more new R operation. I’ll show you that in a moment, but let’s take a look at one more helpful tidyverse function.

2.3.4 group_by()

Often when we’re using a function like summarize(), we want to get summaries for all kinds of different subgroups within our data set. For example, we may want the mean for each value of sex or partyid, rather than for all people in the data frame. We can do this with group_by().

This function may seem a little unusual when used in isolation, because it does not seem to do much on the surface.

our_gss <- group_by(our_gss, partyid)

When you run that function, you will not generate any new objects (we saved the result back over our_gss), and the values in the data frame will look the same as before.

What it does is overlay a grouping structure on the data frame, which will in turn affect how other tidyverse functions operate.

Compare the output of summarize() run on this grouped version of our data frame with the use of summarize() above.

mean_inc <- summarize(
  our_gss,
  "mean_inc" = mean(realrinc, na.rm = TRUE)
  )

mean_inc

# A tibble: 9 × 2
  partyid                            mean_inc
  <fct>                                 <dbl>
1 strong democrat                      30677.
2 independent (neither, no response)   21570.
3 not very strong republican           29101.
4 not very strong democrat             31743.
5 independent, close to democrat       27916.
6 other party                          23891.
7 independent, close to republican     26825.
8 strong republican                    32376.
9 <NA>                                 16922.

We ran the same summarize() command as before, but now it reflects the grouping structure that we imposed.
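Because we saved the grouped version back into our_gss, the grouping stays attached to that object for later commands. Here is a small sketch, on a made-up data frame called toy (an assumption for illustration), of how you might check for a grouping structure with group_vars() and remove it with ungroup().

```r
library(dplyr)

# A small made-up data frame (not the GSS data)
toy <- data.frame(
  partyid  = c("dem", "dem", "rep", "rep"),
  realrinc = c(10000, 20000, 30000, 50000)
)

# Add a grouping structure and inspect it
grouped <- group_by(toy, partyid)
group_vars(grouped)

# summarize() on the grouped version returns one row per group
by_party <- summarize(grouped, mean_inc = mean(realrinc))
by_party

# ungroup() removes the grouping, so summarize() returns one row again
ungrouped <- ungroup(grouped)
overall <- summarize(ungrouped, mean_inc = mean(realrinc))
overall
```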

2.4 The Pipe

This one might be a little unintuitive, so don’t worry if it doesn’t immediately click. We will continue to get plenty of practice with it over the next couple of sessions.

The pipe operator looks like this: |>. What it does is take whatever is to the left of the symbol and ‘pipe’ it into the function on the right-hand side. That probably sounds a little strange, so let’s see some examples.

We’ll refer back to our summarize() command from above.

mean_inc <- summarize(our_gss, "mean_inc"=mean(realrinc, na.rm=TRUE))

This is equivalent to…

mean_inc <- our_gss |>
  summarize("mean_inc" = mean(realrinc, na.rm=TRUE))

Notice that, in the first command, the first input that we give summarize() is the data frame that we want it to work with.

In the command featuring the pipe operator, we supply the data frame and then pipe it into summarize(). The real magic comes from chaining multiple pipes together. This will likely take a little practice to get used to, but it can become a very powerful tool in our R arsenal.
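If it helps, here is the same idea stripped down to base R functions on a made-up vector called nums (an assumption for illustration). Note that the |> operator requires R version 4.1.0 or newer.

```r
# A made-up numeric vector
nums <- c(1, 4, 9)

# Nested version: read from the inside out
nested <- sum(sqrt(nums))

# Piped version: read from left to right
# (take nums, then take square roots, then add them up)
piped <- nums |> sqrt() |> sum()

# Both approaches give the same answer
identical(nested, piped)
```

The piped version does exactly the same work; it just lets us write the steps in the order we would say them out loud.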

2.4.1 Putting It All Together

Let’s illustrate with an example. I’ll let you know what I want to do in plain English, and then I will carry it out with multiple piped commands.

Ultimately, I want to see the mean income, but I want to see the mean broken down by sex and partyid.

So, I want to take a selection of variables from our_gss. I want these variables to be grouped by sex and partyid. Finally, I want to see a summary of the mean according to this variable grouping.

sexpol_means <- our_gss |>
  select(id, sex, realrinc, partyid) |>
  group_by(sex, partyid) |>
  summarize("mean_inc" = mean(realrinc, na.rm=TRUE)) |>
  drop_na(sex, partyid)

`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.

Note

We can use drop_na() to do as the function’s name suggests. When we learned about group_by() above, you may have noticed that a mean was reported for an NA category within the partyid variable. Any time you notice this and want your summaries to exclude these NA categories, just include that variable as an input to drop_na().
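Here is a quick sketch of drop_na() on a made-up data frame called toy (an assumption for illustration). drop_na() comes from the tidyr package, which is also loaded by library(tidyverse).

```r
library(tidyr)

# A small made-up data frame with some missing categories
toy <- data.frame(
  sex      = c("female", NA, "male", "male"),
  partyid  = c("dem", "rep", NA, "rep"),
  mean_inc = c(21000, 18000, 25000, 40000)
)

# Drop rows where sex is NA
no_na_sex <- drop_na(toy, sex)
nrow(no_na_sex)

# Drop rows where either sex or partyid is NA
no_na_both <- drop_na(toy, sex, partyid)
no_na_both
```

If you call drop_na() without naming any columns, it drops every row that has an NA in any column, so it is usually safer to name the columns you care about.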

sexpol_means

# A tibble: 16 × 3
# Groups:   sex [2]
   sex    partyid                            mean_inc
   <fct>  <fct>                                 <dbl>
 1 female strong democrat                      30111.
 2 female independent (neither, no response)   17923.
 3 female not very strong republican           22448.
 4 female not very strong democrat             24300.
 5 female independent, close to democrat       19961.
 6 female other party                          26595.
 7 female independent, close to republican     20095.
 8 female strong republican                    21584.
 9 male   strong democrat                      31600.
10 male   independent (neither, no response)   25474.
11 male   not very strong republican           34544.
12 male   not very strong democrat             41233.
13 male   independent, close to democrat       36947.
14 male   other party                          22450.
15 male   independent, close to republican     31393.
16 male   strong republican                    40210.
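You may have noticed the message about `.groups` that summarize() prints when we group by more than one variable, and the "Groups: sex [2]" line in the output. By default, the result stays grouped by the first grouping variable. If you would rather get back a plain, ungrouped result (and silence the message), one option is to set .groups = "drop". Here is a sketch on a made-up data frame called toy (an assumption for illustration):

```r
library(dplyr)

# A small made-up data frame (not the GSS data)
toy <- data.frame(
  sex      = c("female", "female", "male", "male"),
  partyid  = c("dem", "rep", "dem", "rep"),
  realrinc = c(10000, 20000, 30000, 40000)
)

# .groups = "drop" returns an ungrouped result and prints no message
toy_means <- toy |>
  group_by(sex, partyid) |>
  summarize(mean_inc = mean(realrinc, na.rm = TRUE), .groups = "drop")

toy_means
group_vars(toy_means)
```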

So, using dplyr, we can quickly subset and manipulate data frames in just a few lines of relatively straightforward code. Here we have all the means for each value of sex and partyid, which would have been a tedious task had we calculated them all manually.

We will see plenty more on the tidyverse, so don’t fret if you don’t feel completely confident with these yet. It takes practice getting used to R’s peculiarities. We will keep building with these in the next unit and hopefully accumulate some muscle memory.