2 Packages and the Tidyverse
R’s open-source culture has encouraged a rich ecosystem of custom functions designed by scientists and researchers in the R userbase. These come in the form of ‘packages’, which are suites of several related functions. For example, there are packages for conducting statistical tests, producing data visualizations, generating publication-ready tables, and all manner of other tasks.
2.1 Loading Packages
Let’s try this out with one of the better-known R packages: ‘tidyverse’. This is actually a collection of several packages with a variety of interrelated functions for ‘tidying’, visualizing, and analyzing data. We will focus on what we need from ‘tidyverse’, but, if you’re curious, you can read more here: https://www.tidyverse.org/
If you’re on a lab computer, this package may already be installed. Let’s check by running the following command:
library(tidyverse)
If you receive an error when you run this, you likely do not have the package installed on your system. This is also probably the case if you are on your personal device and only recently installed R.
If you got an error, run the following command:
install.packages("tidyverse")
With a few exceptions, you will always install new packages in this fashion: install.packages("package_name")
After it’s done installing, go back and run the library(tidyverse) command again. Note that you always need to do this for an added package. Whether you’ve had it for a while or just installed it, you need to load any outside package into your current session by placing its name in the library() function.
library(tidyverse)
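If you would rather check for a package without triggering an error, here is a minimal sketch. The helper name is_installed() is my own invention for illustration (it is not part of R or the tidyverse); it wraps base R’s requireNamespace(), which reports whether a package can be found.

```r
# Hypothetical helper (not part of R or the tidyverse):
# returns TRUE if a package is installed, FALSE otherwise.
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with every R installation, so this should be TRUE
is_installed("stats")

# Load tidyverse if it is there; otherwise print a reminder
if (is_installed("tidyverse")) {
  library(tidyverse)
} else {
  message('tidyverse not found; run install.packages("tidyverse") first')
}
```

Either way, the rule from above still holds: install once per computer, load once per session.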
2.2 Bringing in our Data
Let’s try bringing in a data frame to play with a few tidyverse functions. We’ll use the load() function to bring in a subset of the General Social Survey, which contains a few variables from the 2022 survey wave. Run the following command and select the file “our_gss.rda”.
load(file.choose())
The file.choose() function will open up a file-explorer window that allows you to manually select an R data file to load in. We’ll talk about some other ways to import data files using R syntax next time.
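It may help to see what load() actually does with a file. The sketch below uses a made-up data frame called toy and a temporary file (both are stand-ins, not the GSS data), so you can run it without picking a file by hand.

```r
# A tiny stand-in data frame (hypothetical; not the GSS data)
toy <- data.frame(id = 1:3, age = c(25, 40, 61))

# Save it to a temporary .rda file, then remove it from the session
path <- tempfile(fileext = ".rda")
save(toy, file = path)
rm(toy)

# load() restores the object under the name it was saved with,
# and returns the names of everything it restored
restored <- load(path)
restored
toy
```

This is why the data frame shows up in your environment as our_gss without you assigning it to anything: the name is stored inside the .rda file.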
Go ahead and take a look at the data frame. Each GSS survey wave has about 600-700 variables in total, so I’ve plucked several and done a little pre-processing to get us a subset to work with. All the variables here have pretty straightforward names, but I’ll note that realrinc is the clear exception. This is short for ‘Real respondent’s income’ and reflects the respondent’s income reported in exact dollar amounts. I’ll put a summary here so you can take a look if you’re not following along with your own script.
summary(our_gss)
      year            id            age           race          sex
 Min.   :2022   1      :   1   Min.   :18.00   white:2514   female:1897
 1st Qu.:2022   2      :   1   1st Qu.:34.00   other: 412   male  :1627
 Median :2022   3      :   1   Median :48.00   black: 565   NA's  :  20
 Mean   :2022   4      :   1   Mean   :49.18   NA's :  53
 3rd Qu.:2022   5      :   1   3rd Qu.:64.00
 Max.   :2022   6      :   1   Max.   :89.00
                (Other):3538   NA's   :208
    realrinc                                      partyid
 Min.   :   204.5   independent (neither, no response):835
 1st Qu.:  8691.2   strong democrat                   :595
 Median : 18405.0   not very strong democrat          :451
 Mean   : 27835.3   strong republican                 :431
 3rd Qu.: 33742.5   independent, close to democrat    :400
 Max.   :141848.3   (Other)                           :797
 NA's   :1554       NA's                              : 35
           happy               marital                           attend
 not too happy: 799   divorced     : 608   never                     :1149
 pretty happy :1942   married      :1468   about once or twice a year: 464
 very happy   : 779   never married:1095   every week                : 441
 NA's         :  24   separated    : 103   less than once a year     : 416
                      widowed      : 255   several times a year      : 346
                      NA's         :  15   (Other)                   : 693
                                           NA's                      :  35
    cappun
 favor :2013
 oppose:1327
 NA's  : 204
2.3 Data Wrangling with Tidyverse
Let’s use this subset to explore some tidyverse functionality. One of the packages included in the tidyverse is dplyr, which includes several functions for efficiently manipulating data frames in preparation for analyses. We will encounter a number of these throughout our time with R, but I want to briefly introduce a few key dplyr functions and operations that we will dig into more next time.
2.3.1 select()
It happens quite often that we have a data frame containing far more variables than we need for a given analysis. The select() function allows us to quickly subset data frames according to the variable columns we specify.
This function takes a data frame as its first input, and all following inputs are the variable columns that you want to keep:
sex_inc <- select(our_gss, id, sex, realrinc)
You should now have an object that contains all 3,544 observations, but includes only the 3 columns that we specified with select().
summary(sex_inc)
       id           sex          realrinc
 1      :   1   female:1897   Min.   :   204.5
 2      :   1   male  :1627   1st Qu.:  8691.2
 3      :   1   NA's  :  20   Median : 18405.0
 4      :   1                 Mean   : 27835.3
 5      :   1                 3rd Qu.: 33742.5
 6      :   1                 Max.   :141848.3
 (Other):3538                 NA's   :1554
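If you want to experiment with select() without touching the GSS data, here is a small sketch on a made-up data frame. The object toy and its column names are hypothetical stand-ins. It also shows one variation not used above: a minus sign in front of a column name drops that column instead of keeping it.

```r
library(dplyr)

# Hypothetical toy data, standing in for our_gss
toy <- data.frame(
  id = 1:4,
  sex = c("female", "male", "female", NA),
  income = c(20000, 35000, NA, 18000),
  age = c(31, 52, 44, 27)
)

# Keep only the named columns, in the order listed
kept <- select(toy, id, income)

# A minus sign drops a column and keeps everything else
dropped <- select(toy, -age)

names(kept)
names(dropped)
```

In both cases the number of rows is unchanged; select() only ever touches columns.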
2.3.2 filter()
filter() functions similarly except that, instead of subsetting by specific variables, it allows you to subset by specific values. So, let’s take the sex_inc object we just created above. We now have this subset of three variables (id, sex, and income), but let’s imagine we want to answer a question that’s specific to women.
In order to do that, we need to filter the data to include only observations where the value of the variable sex is ‘female’.
fem_inc <- filter(sex_inc, sex == "female")
Note that the fem_inc object still has 3 variables, but there are now roughly half as many observations, which suggests that we have successfully filtered out the male observations (along with the 20 observations where sex was NA).
summary(fem_inc)
       id           sex          realrinc
 1      :   1   female:1897   Min.   :   204.5
 3      :   1   male  :   0   1st Qu.:  7668.8
 4      :   1                 Median : 15337.5
 7      :   1                 Mean   : 22702.1
 9      :   1                 3rd Qu.: 27607.5
 10     :   1                 Max.   :141848.3
 (Other):1891                 NA's   :883
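Two details in that summary are worth a quick illustration: the NA rows disappeared, and the ‘male’ level is still listed with a count of 0. Here is a sketch on a hypothetical toy data frame (not the GSS data). It also shows that you can give filter() several conditions at once.

```r
library(dplyr)

# Hypothetical toy data; sex is stored as a factor, as in our_gss
toy <- data.frame(
  id = 1:5,
  sex = factor(c("female", "male", "female", NA, "female")),
  income = c(20000, 35000, NA, 18000, 52000)
)

# Rows where sex is NA fail the test, so they are dropped along with the men
women <- filter(toy, sex == "female")

# Conditions separated by commas must all be true
women_30k <- filter(toy, sex == "female", income > 30000)

# The unused "male" level is still attached to the factor;
# droplevels() from base R removes it
levels(women$sex)
levels(droplevels(women)$sex)
```

Whether you bother with droplevels() is a matter of taste; the empty level is harmless, but it will keep appearing in summaries and tables.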
2.3.3 summarize()
As the name suggests, summarize() allows us to quickly summarize information across variables. It will give us a new data frame that reflects the summaries that we ask for, which can be very useful for quickly generating descriptive statistics. We will use this to get the mean income value for our data frame.
mean_inc <- summarize(our_gss, "mean_inc" = mean(realrinc, na.rm = TRUE))
You probably noticed the na.rm = TRUE input that I supplied for the above function. This is short for ‘remove NAs’, which we need to do when a variable has any NA values. If we don’t, mean() will return NA instead of a number, because R does not know to disregard NA values when calculating a column mean unless we tell it to.
The command above gives us a new data frame that we called mean_inc. It should have 1 row and 1 column, and it just gives us the average income of a person in our GSS subset: about $28,000/year.
mean_inc
mean_inc
1 27835.33
Now, this is not really all that impressive when we are asking for a broad summary like this. In fact, if all we wanted was to see the average income, we could get that more easily, e.g.:
mean(our_gss$realrinc, na.rm = TRUE)
[1] 27835.33
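To see what na.rm = TRUE does in isolation, here is a short sketch with three made-up income values (hypothetical numbers, not from the GSS). The second half shows that summarize() can return several summaries in one call, which is where it starts to pull ahead of a bare mean().

```r
library(dplyr)

# Three made-up incomes, one of them missing
incomes <- c(20000, NA, 40000)

mean(incomes)                # NA, because one value is missing
mean(incomes, na.rm = TRUE)  # 30000, the mean of the two known values

# summarize() can compute several summaries in one call
toy <- data.frame(income = incomes)
stats <- summarize(
  toy,
  mean_inc = mean(income, na.rm = TRUE),
  n_missing = sum(is.na(income)),
  n_rows = n()
)
stats
```

Reporting the number of missing values next to a mean is a good habit, since na.rm = TRUE silently shrinks the sample the mean is based on.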
The true power of summarize() comes from chaining it together with other tidyverse functions. However, in order to do that, we will need to learn about one more new R operation. I’ll show you that in a moment, but let’s take a look at one more helpful tidyverse function.
2.3.4 group_by()
Often when we’re using a function like summarize(), we want to get summaries for all kinds of different subgroups within our data set. For example, we may want the mean for each value of sex or partyid, rather than for all people in the data frame. We can do this with group_by().
This function may seem a little unusual when used in isolation, because it does not seem to do much on the surface.
our_gss <- group_by(our_gss, partyid)
When you run that function, you will not generate any new objects, and you will not notice anything different about the data frame.
What it does is overlay a grouping structure on the data frame, which will in turn affect how other tidyverse functions operate.
Compare the output of summarize() run on this grouped version of our data frame with the use of summarize() above.
mean_inc <- summarize(
  our_gss, "mean_inc" = mean(realrinc, na.rm = TRUE)
)
mean_inc
# A tibble: 9 × 2
partyid mean_inc
<fct> <dbl>
1 strong democrat 30677.
2 independent (neither, no response) 21570.
3 not very strong republican 29101.
4 not very strong democrat 31743.
5 independent, close to democrat 27916.
6 other party 23891.
7 independent, close to republican 26825.
8 strong republican 32376.
9 <NA> 16922.
We ran the same summarize() command as before, but now it reflects the grouping structure that we imposed.
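Because the grouping is invisible, it can be handy to know how to inspect it and how to remove it. The sketch below uses a hypothetical toy data frame (not the GSS data) along with two dplyr functions we have not covered: group_vars(), which reports the grouping variables, and ungroup(), which removes them.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  party = c("dem", "dem", "rep", "rep", "ind"),
  income = c(30000, 34000, 28000, NA, 21000)
)

grouped <- group_by(toy, party)
group_vars(grouped)   # which variables define the groups

# One row per group instead of one row overall
by_party <- summarize(grouped, mean_inc = mean(income, na.rm = TRUE))
by_party

# ungroup() removes the grouping, so summarize() gives one row again
overall <- summarize(ungroup(grouped), mean_inc = mean(income, na.rm = TRUE))
overall
```

Note that I stored the grouped version under a new name here rather than overwriting the original. That is just one way to keep an ungrouped copy around.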
2.4 The Pipe
This one might be a little unintuitive, so don’t worry if it doesn’t immediately click. We will continue to get plenty of practice with it over the next couple of sessions.
The pipe operator looks like this: |>. What it does is take whatever is to the left of the symbol and ‘pipe’ it into the function on the right-hand side. That probably sounds a little strange, so let’s see some examples.
We’ll refer back to our summarize() command from above.
mean_inc <- summarize(our_gss, "mean_inc" = mean(realrinc, na.rm = TRUE))
This is equivalent to…
mean_inc <- our_gss |>
  summarize("mean_inc" = mean(realrinc, na.rm = TRUE))
Notice that, in the first command, the first input that we give summarize() is the data frame that we want it to work with.
In the command featuring the pipe operator, we supply the data frame and then pipe it into summarize(). The real magic comes from chaining multiple pipes together. This will likely take a little practice to get used to, but it can become a very powerful tool in our R arsenal.
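The pipe is not specific to the tidyverse, so here is a minimal sketch using only base R functions and a made-up vector. It assumes R 4.1 or newer, which is when the |> operator was added.

```r
x <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- sum(sqrt(x))

# Piped call: read from left to right
piped <- x |> sqrt() |> sum()

identical(nested, piped)

# The piped value becomes the FIRST argument; others are written as usual
rounded <- 3.14159 |> round(2)   # same as round(3.14159, 2)
rounded
```

Both versions compute exactly the same thing. The pipe only changes the order in which you read and write the steps.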
2.4.1 Putting It All Together
Let’s illustrate with an example. I’ll describe what I want to do in plain English, and then I will carry it out with multiple piped commands.
Ultimately, I want to see the mean income, but I want to see the mean broken down by sex and partyid.
So, I want to take a selection of variables from our_gss. I want these variables to be grouped by sex and partyid. Finally, I want to see a summary of the mean according to this variable grouping.
sexpol_means <- our_gss |>
  select(id, sex, realrinc, partyid) |>
  group_by(sex, partyid) |>
  summarize("mean_inc" = mean(realrinc, na.rm = TRUE)) |>
  drop_na(sex, partyid)
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
We can use drop_na() to do as the function’s name suggests: it drops every row that has an NA in any of the columns we name. When we learned about group_by() above, you may have noticed that a mean was reported for an NA category within the partyid variable. Any time you notice this and want your summaries to exclude these NA categories, just include that variable as an input to drop_na().
sexpol_means
# A tibble: 16 × 3
# Groups: sex [2]
sex partyid mean_inc
<fct> <fct> <dbl>
1 female strong democrat 30111.
2 female independent (neither, no response) 17923.
3 female not very strong republican 22448.
4 female not very strong democrat 24300.
5 female independent, close to democrat 19961.
6 female other party 26595.
7 female independent, close to republican 20095.
8 female strong republican 21584.
9 male strong democrat 31600.
10 male independent (neither, no response) 25474.
11 male not very strong republican 34544.
12 male not very strong democrat 41233.
13 male independent, close to democrat 36947.
14 male other party 22450.
15 male independent, close to republican 31393.
16 male strong republican 40210.
So, using dplyr, we can quickly subset and manipulate data frames in just a few lines of relatively straightforward code. Here we have all the means for each value of sex and partyid, which would have been a tedious task had we calculated them all manually.
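Here is a compact variation on the same kind of pipeline that you can run on a hypothetical toy data frame (the data and column names are made up). It differs from the version above in two ways, both of which are my own choices rather than requirements: it drops the incomplete rows before grouping, and it passes .groups = "drop" to summarize(), which returns an ungrouped result and silences the message about grouped output.

```r
library(dplyr)
library(tidyr)   # drop_na() lives in tidyr, which library(tidyverse) also loads

# Hypothetical toy data
toy <- data.frame(
  sex = c("female", "female", "male", "male", NA, "male"),
  party = c("dem", "rep", "dem", "dem", "rep", NA),
  income = c(30000, 22000, 40000, 36000, 25000, 31000)
)

toy_means <- toy |>
  drop_na(sex, party) |>                # remove incomplete rows first
  group_by(sex, party) |>
  summarize(
    mean_inc = mean(income, na.rm = TRUE),
    .groups = "drop"                    # ungrouped result, no message
  )

toy_means
```

Dropping the NA rows before or after summarizing gives the same group means here, so the order is mostly a matter of readability.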
We will see plenty more on the tidyverse, so don’t fret if you don’t feel completely confident with these yet. It takes practice getting used to R’s peculiarities. We will keep building with these in the next unit and hopefully accumulate some muscle memory.