library(tidyverse)
load("our_gss.rda")
3 Recoding Variables
Today’s venture concerns univariate analysis, i.e. the quantitative description of a single variable. Before we do that, however, we need to familiarize ourselves with some data-cleaning procedures.
3.1 Background
Recoding a variable involves manipulating the underlying structure of our variable such that we can use it for analysis. We did a little recoding during the last unit when we converted character vectors into factor variables. This allowed us to align R data types with the appropriate level of measurement.
There are also occasions when we need a variable to be translated from one level of measurement to another. For example, we may want to convert a ratio variable for “number of years of education” into an ordinal variable reflecting categories like “less than high school”, “high school diploma”, “Associate’s degree”, and so on.
We may also want to collapse the categories of ordinal variables for some analyses. Consider a variable with a Likert-scale agreement rating, where you have responses like “strongly agree,” “moderately agree,” “slightly agree,” and so forth. You may decide to collapse these into just two categories, “Agree” and “Disagree”.
We will get some practice doing this sort of thing, which is an essential component of responsible analysis. Additionally, our next unit on bivariate analysis will require us to work with categorical variables in particular, so we need to be capable of converting any numeric variables.
3.2 Converting Numeric to Categorical
We will start by recoding age, a ratio variable, into an ordinal variable reflecting age groupings. The same strategies we use here will work for any numeric variable.
3.2.1 Setting up our workspace
As usual, let’s make sure we load in tidyverse along with our GSS data.
Let’s double check the structure of our data frame.
str(our_gss)
'data.frame': 3544 obs. of 11 variables:
$ year : num 2022 2022 2022 2022 2022 ...
$ id : Factor w/ 4149 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ age : num 72 80 57 23 62 27 20 47 31 72 ...
$ race : Factor w/ 3 levels "white","other",..: 1 1 1 1 1 1 2 1 1 NA ...
$ sex : Factor w/ 2 levels "female","male": 1 2 1 1 2 2 1 2 1 1 ...
$ realrinc: num 40900 NA 18405 2250 NA ...
$ partyid : Factor w/ 8 levels "strong democrat",..: 1 2 3 1 2 4 5 1 5 1 ...
$ happy : Ord.factor w/ 3 levels "not too happy"<..: 1 1 1 1 2 2 2 2 2 2 ...
$ marital : Factor w/ 5 levels "divorced","married",..: 1 2 1 3 3 3 3 2 3 3 ...
$ attend : Factor w/ 9 levels "never","less than once a year",..: 3 3 1 5 3 1 2 2 1 8 ...
$ cappun : Factor w/ 2 levels "favor","oppose": 2 1 1 2 2 2 2 2 1 NA ...
In str() output like this, you might sometimes notice the ‘int’ category, which is short for ‘integer’. (Our numeric columns above happen to be listed as ‘num’.) This is a subtype of numeric data in R. Variables that are exclusively whole numbers are often recorded in this way, but we can work with them in R just like we can other sorts of numbers.
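As a quick illustration of the difference, here is a small sketch using made-up values (they are not taken from the GSS data). It shows that integers and ordinary numbers behave the same way in calculations, even though R labels them differently.

```r
# Toy values for illustration only
whole <- c(18L, 35L, 89L)  # the L suffix asks R to store integers ("int" in str())
plain <- c(18, 35, 89)     # without it, R stores doubles ("num" in str())

class(whole)       # "integer"
class(plain)       # "numeric"
is.numeric(whole)  # TRUE: integers still count as numeric data

# Arithmetic gives the same answer either way
sum(whole) == sum(plain)  # TRUE
```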
3.2.2 The ‘age’ variable
We can take a look at all the values of age (along with the number of respondents in each age category) using the count() function.
count(our_gss, age)
However, I’m not going to display those results here, as it will be a particularly long table of values (with 70+ different ages). It’s fine to run it (it won’t crash R or anything), but it will clutter up this page. Let’s take this as a good opportunity to take advantage of the fact that we are using one of the better funded and most well-organized surveys in all of the social sciences. As such, there’s extensive documentation about all of the variables measured for the GSS. Go ahead and take a look at the age variable via the GSS Data Explorer, which allows us to search for individual variables and view their response values, the specific question(s) asked on the survey, and several other variable characteristics.
The responses range from 18 - 89 (in addition to a few categories for non-response). However, note that there’s something unique about value 89. It’s not just 89 years of age, but 89 & older. This isn’t a real issue for our purposes, but take this as encouragement to interface with the codebook of any publicly available data you use. There’s some imprecision at the upper end of this variable, and that might not be obvious without referencing the codebook.
For the purposes of this exercise, let’s go ahead and turn age into a simple categorical variable with 3 levels—older, middle age, and younger. I’m going to choose the range somewhat arbitrarily for now. We can use univariate analysis to inform our decision about how to break up a numeric variable, so we will revisit this idea again later on.
- Younger = 18 - 35
- Middle Age = 36 - 49
- Older = 50 and up
At the end of the day, what we need to do is (1) create a new variable column and (2) populate that column with ordinal labels that correspond to each respondent’s numeric age interval.
3.2.3 New columns with mutate()
First, let’s consider the mutate() function. This function takes a data frame and appends a new variable column. This new variable is the result of some calculation applied to an existing variable column.
Let’s look at an application of mutate() to get a feel for it. Now, this wouldn’t be the best idea for a couple reasons, but, as an illustration, let’s say we wanted to convert our yearly income values to an hourly wage (assuming 40 hrs/week).
mutate() takes a data frame as its input, and then we provide the name of our new variable column(s) along with the calculation for this new variable. Below, I use mutate() to create a new column called hr_wage. Then, I tell R that the hr_wage variable should be calculated by taking each person’s income value and dividing that by 52 weeks in a year, and then by 40 hours in a week.
# Without the pipe operator
our_gss <- mutate(
  our_gss,
  hr_wage = (realrinc / 52) / 40
)
# With the pipe operator (equivalent to the version above)
our_gss <- our_gss |>
  mutate(
    hr_wage = (realrinc / 52) / 40
  )
Take a look at your new data frame object. I’ll show a summary here to verify that we got our new column.
summary(our_gss)
year id age race sex
Min. :2022 1 : 1 Min. :18.00 white:2514 female:1897
1st Qu.:2022 2 : 1 1st Qu.:34.00 other: 412 male :1627
Median :2022 3 : 1 Median :48.00 black: 565 NA's : 20
Mean :2022 4 : 1 Mean :49.18 NA's : 53
3rd Qu.:2022 5 : 1 3rd Qu.:64.00
Max. :2022 6 : 1 Max. :89.00
(Other):3538 NA's :208
realrinc partyid
Min. : 204.5 independent (neither, no response):835
1st Qu.: 8691.2 strong democrat :595
Median : 18405.0 not very strong democrat :451
Mean : 27835.3 strong republican :431
3rd Qu.: 33742.5 independent, close to democrat :400
Max. :141848.3 (Other) :797
NA's :1554 NA's : 35
happy marital attend
not too happy: 799 divorced : 608 never :1149
pretty happy :1942 married :1468 about once or twice a year: 464
very happy : 779 never married:1095 every week : 441
NA's : 24 separated : 103 less than once a year : 416
widowed : 255 several times a year : 346
NA's : 15 (Other) : 693
NA's : 35
cappun hr_wage
favor :2013 Min. : 0.09832
oppose:1327 1st Qu.: 4.17849
NA's : 204 Median : 8.84856
Mean :13.38237
3rd Qu.:16.22236
Max. :68.19631
NA's :1554
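If you want to see exactly what mutate() is doing row by row, it can help to try it on a tiny data frame first. The sketch below uses a hypothetical three-person data frame called toy (the incomes are made up), and it assumes the dplyr package, which is loaded as part of tidyverse. Notice that a missing income simply produces a missing hourly wage, which is why hr_wage has the same number of NAs as realrinc in the summary above.

```r
library(dplyr)

# Hypothetical incomes; the NA stands in for a respondent with no reported income
toy <- data.frame(
  id = 1:3,
  realrinc = c(41600, NA, 20800)
)

# Same calculation as above: yearly income / 52 weeks / 40 hours
toy <- toy |>
  mutate(hr_wage = (realrinc / 52) / 40)

toy$hr_wage  # 20 NA 10
```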
So, we can use mutate() to add a new column that contains our recoded variable. We just need a way to specify a calculation that takes into account the specific intervals we want for our ordinal labels. For this, we need one more function.
3.2.4 Custom intervals with cut()
We can use the cut() function to specify the intervals we want for our age groupings, and then we will combine it with mutate() to generate our recoded age variable. Specifically, cut() takes our intervals and turns each of them into a level in a new factor variable.
But first, I want to give a little context on interval notation in mathematics. It will help us all be a little more precise when we talk about ranges, and it will also help us understand an input we need to provide for cut().
3.2.4.1 An aside on intervals
In mathematical terms, an interval is the set of all numbers in between two specified endpoints. In formal notation, these are indicated with the two endpoints placed in brackets, e.g., [1,5] as the interval of 1 through 5.
There are some technical terms to describe whether or not we want to include the endpoints when we are talking about a particular interval.
- Closed interval: This is when both endpoints are included in the interval. Closedness is indicated with square brackets, so the closed interval of 1 through 5 would be written just like I have above: [1,5]. This indicates any number \(x\) where \(1 \leq x \leq 5\).
- Open interval: This is when neither endpoint is included in the interval. In interval notation, openness is indicated with parentheses rather than square brackets, so the open interval of 1 through 5 would be written as (1,5). This interval would indicate any number \(x\) where \(1 < x < 5\).
- Left-open interval: This is when the left-hand endpoint is not included, but the right-hand endpoint is. This would be written as (1,5] in interval notation, and that interval would indicate any number \(x\) where \(1 < x \leq 5\).
- Right-open interval: When the right-hand endpoint isn’t included but the left endpoint is, you have a right-open interval. This would be written as [1,5) in interval notation and would indicate any number \(x\) where \(1 \leq x < 5\).
We will be working with the right-open interval format, and we can specify that in cut().
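To connect the notation to R, here is a minimal sketch with three made-up numbers. It uses only base R’s cut() and its right argument; the object names left_open and right_open are just for illustration. Values that fall outside the interval come back as NA.

```r
x <- c(1, 3, 5)

# right = TRUE (R's default): the interval is left-open, (1,5]
left_open <- cut(x, breaks = c(1, 5))

# right = FALSE: the interval is right-open, [1,5)
right_open <- cut(x, breaks = c(1, 5), right = FALSE)

levels(left_open)   # "(1,5]"
levels(right_open)  # "[1,5)"

is.na(left_open)    # TRUE FALSE FALSE: 1 is not inside (1,5]
is.na(right_open)   # FALSE FALSE TRUE: 5 is not inside [1,5)
```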
Now, let’s return to our task at hand.
3.2.4.2 Inputs for cut()
We’ll go ahead and work with cut() directly in mutate(), as it’s going to be the calculation that we are providing for the new column we generate with mutate().
As a reminder, here are the intervals we need:
- Younger = 18 - 35
- Middle Age = 36 - 49
- Older = 50 and up
The following code may look a little chaotic at first glance, but it’s really the same sort of thing that we did with mutate() above. It’s just that cut() is a little bit more involved of a calculation than the simple division we did in our example.
Just like above, we are applying mutate() to our_gss, and we are giving the name of our new variable column (age_ord, in this case) as the first input. Then, for the calculation, we give the cut() function and the arguments it needs.
cut() and its inputs
- The variable for which we want to specify intervals (age).
- breaks: This is where we indicate the intervals. The first number we give is the low end of our lowest age group (18). The second number is the low end of our middle age group (36). The third number, as you probably guessed, is the low end of our highest age group (50). The last value should reflect our upper limit. In this case, I use the value Inf, which is short for ‘infinity’. This essentially tells R that the last category can include any values higher than the previous number we entered.
- include.lowest: This takes a logical value and controls whether a value sitting exactly on the outermost break is kept. With R’s default of right-closed intervals, the lowest break (18) would be left out of the first interval unless we set this to TRUE, and our ‘younger’ age grouping would effectively be 19 - 35 rather than 18 - 35. Because we set right to FALSE below, every interval already includes its lower endpoint, so 18 is captured either way; in that case, TRUE instead closes the top interval at the highest break. Our highest break is Inf, so this makes no practical difference here, but setting it to TRUE does no harm.
- right: This is the input for specifying whether we want right-closed intervals, and it takes a logical value (TRUE or FALSE). We want right-open intervals, so we will set this to FALSE.
- labels: R will actually default to formal interval notation for the names of each level of our new factor variable, so with the settings below it would be [18,36), [36,50), [50,Inf]. However, we can provide a character vector to specify custom labels for these new factor levels. In the event that you have an ordinal variable, make sure that you always specify these labels in ascending order. In this case, that would be Younger < Middle Age < Older.
- ordered_result: This takes a logical value and, as the name suggests, indicates whether we want the factor to be ordered or not. In our case, there is a clear progression in terms of age, so we need to set this to TRUE.
our_gss <- our_gss |>
  mutate(
    age_ord = cut(
      age,
      breaks = c(18, 36, 50, Inf),
      include.lowest = TRUE,
      right = FALSE,
      labels = c("Younger", "Middle Age", "Older"),
      ordered_result = TRUE
    )
  )
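If you are curious how right and include.lowest interact, the following base R sketch may help. It reflects my reading of the cut() help page (see ?cut), and the numbers are made up for illustration. The main point is that include.lowest only affects a value sitting exactly on the outermost break: the lowest break when right = TRUE, and the highest break when right = FALSE.

```r
# Default labels for right-open intervals (no custom labels supplied)
default_labels <- levels(cut(c(18, 40, 89), breaks = c(18, 36, 50, Inf), right = FALSE))
default_labels  # "[18,36)" "[36,50)" "[50,Inf)"

# With right = TRUE, intervals look like (18,36], so an 18-year-old is dropped...
dropped <- cut(c(18, 19), breaks = c(18, 36), right = TRUE)
dropped  # NA, then (18,36]

# ...unless include.lowest = TRUE closes the first interval on the left
kept <- cut(c(18, 19), breaks = c(18, 36), right = TRUE, include.lowest = TRUE)
kept     # both values fall in [18,36]
```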
Go ahead and take a look at our_gss, and we should now have an additional variable column. I’ll use the str() function here so we can confirm that our factor was ordered and added to the data frame.
str(our_gss)
'data.frame': 3544 obs. of 13 variables:
$ year : num 2022 2022 2022 2022 2022 ...
$ id : Factor w/ 4149 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ age : num 72 80 57 23 62 27 20 47 31 72 ...
$ race : Factor w/ 3 levels "white","other",..: 1 1 1 1 1 1 2 1 1 NA ...
$ sex : Factor w/ 2 levels "female","male": 1 2 1 1 2 2 1 2 1 1 ...
$ realrinc: num 40900 NA 18405 2250 NA ...
$ partyid : Factor w/ 8 levels "strong democrat",..: 1 2 3 1 2 4 5 1 5 1 ...
$ happy : Ord.factor w/ 3 levels "not too happy"<..: 1 1 1 1 2 2 2 2 2 2 ...
$ marital : Factor w/ 5 levels "divorced","married",..: 1 2 1 3 3 3 3 2 3 3 ...
$ attend : Factor w/ 9 levels "never","less than once a year",..: 3 3 1 5 3 1 2 2 1 8 ...
$ cappun : Factor w/ 2 levels "favor","oppose": 2 1 1 2 2 2 2 2 1 NA ...
$ hr_wage : num 19.66 NA 8.85 1.08 NA ...
$ age_ord : Ord.factor w/ 3 levels "Younger"<"Middle Age"<..: 3 3 3 1 3 1 1 2 1 3 ...
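It is good practice to confirm that values right at the edges of each group landed where we intended. One way to sketch that check is to run the same cut() call on a handful of hypothetical ages that sit on the boundaries (the toy data frame below is made up, and dplyr is assumed to be available via tidyverse).

```r
library(dplyr)

# Hypothetical ages chosen to sit on the edges of each group, plus one missing value
toy <- data.frame(age = c(18, 35, 36, 49, 50, 89, NA))

toy <- toy |>
  mutate(
    age_ord = cut(
      age,
      breaks = c(18, 36, 50, Inf),
      include.lowest = TRUE,
      right = FALSE,
      labels = c("Younger", "Middle Age", "Older"),
      ordered_result = TRUE
    )
  )

# Cross-tabulate the original ages against the new groups
table(toy$age, toy$age_ord, useNA = "ifany")

# Because the factor is ordered, comparisons between levels are meaningful
toy$age_ord[1] < toy$age_ord[5]  # TRUE: Younger < Older
```

You could run the same kind of cross-tabulation on our_gss itself, though the table will be much longer.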
3.3 Restructuring Categorical Variables
Now, let’s go ahead and tackle the other recoding situation I mentioned up at the top: collapsing the levels of categorical variables. We’ll work with partyid here, but the logic of this process will apply to any categorical variable.
3.3.1 Collapsing categories
I encourage you to take a look at partyid in the GSS Data Explorer, like we did for ‘age’.
Because partyid is a categorical variable, it has far fewer unique values than a ratio variable like age, so we can go ahead and take a look at all of its levels with the count() function.
Observe that count(our_gss, partyid) basically provides the same information as summary(our_gss$partyid) when the variable is a factor. The only real difference is that count() organizes the info into a tidy data frame. In other words, you can use either function to take a look at a factor variable like this.
count(our_gss, partyid)
partyid n
1 strong democrat 595
2 independent (neither, no response) 835
3 not very strong republican 361
4 not very strong democrat 451
5 independent, close to democrat 400
6 other party 106
7 independent, close to republican 330
8 strong republican 431
9 <NA> 35
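To see the two functions side by side, here is a small sketch on a made-up factor (the toy data frame and its values are hypothetical, and dplyr is assumed to be loaded through tidyverse). Both report the same counts; they just package them differently.

```r
library(dplyr)

# Hypothetical party labels, including one missing value
toy <- data.frame(
  partyid = factor(c("democrat", "republican", "democrat", NA, "independent"))
)

# summary() returns a named vector of counts, with NA's listed last
party_summary <- summary(toy$partyid)
party_summary

# count() returns the same counts as a data frame with columns partyid and n
party_counts <- count(toy, partyid)
party_counts
```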
Let’s go ahead and collapse these into categories of “Democrat”, “Independent”, “Other Party”, and “Republican”. We’ll use mutate() to make a new variable column called partyid_recoded. For the calculation of our new variable, we can use the fct_collapse() function. This function allows us to first give the name of a new factor level, and then we can give a character vector of the names of the levels that we want to collapse into that single category.
So, to collapse our 2 Democrat levels into a single factor level, we would give the following input for fct_collapse():
"Democrat" = c("strong democrat", "not very strong democrat")
And then repeat for each new factor level.
our_gss <- our_gss |>
  mutate(
    partyid_recoded = fct_collapse(
      partyid,
      "Democrat" = c("strong democrat", "not very strong democrat"),
      "Republican" = c("strong republican", "not very strong republican"),
      "Independent" = c("independent, close to democrat", "independent (neither, no response)", "independent, close to republican"),
      "Other Party" = c("other party")
    )
  )
Now, let’s take a look at our new variable.
count(our_gss, partyid_recoded)
partyid_recoded n
1 Democrat 1046
2 Independent 1565
3 Republican 792
4 Other Party 106
5 <NA> 35
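A handy way to double check a collapse like this is to cross-tabulate the old levels against the new ones. The sketch below does that on a small made-up factor rather than the full GSS data (the party and party_recoded objects are hypothetical, and forcats is assumed to be loaded through tidyverse).

```r
library(forcats)

# Hypothetical responses using a few of the original level names
party <- factor(c(
  "strong democrat", "not very strong democrat",
  "strong republican", "other party", NA
))

party_recoded <- fct_collapse(
  party,
  "Democrat" = c("strong democrat", "not very strong democrat"),
  "Republican" = "strong republican",
  "Other Party" = "other party"
)

# Each old level should appear under exactly one new level
table(party, party_recoded, useNA = "ifany")
```

Missing values stay missing after collapsing, which matches the 35 NAs that carry over in the output above.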
3.3.2 Excluding levels
You may also want to work with an existing categorical variable, but only focus on certain values. In this case, we can create a recoded variable and simply code the levels we are uninterested in as NA.
We can use the fct_recode() function for this. For the first input, we will give it the factor-variable column that we want to recode. Then, we will let it know that we want to set the levels of “Other Party” and “Independent” to NULL. This will convert them to NAs and allow us to easily exclude them from our analyses. We will use fct_recode() within mutate() so we can create another recoded version of partyid_recoded. We’ll call this one dem_rep to distinguish it from the original partyid and our first recoded version.
our_gss <- our_gss |>
  mutate(
    dem_rep = fct_recode(
      partyid_recoded,
      NULL = "Other Party",
      NULL = "Independent"
    )
  )
Let’s check to see that it worked.
count(our_gss, dem_rep)
dem_rep n
1 Democrat 1046
2 Republican 792
3 <NA> 1706
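Here is the same idea on a small made-up factor, in case you want to see what happens to the levels themselves (the party object is hypothetical, and forcats is assumed to be loaded through tidyverse). The excluded levels disappear from the factor entirely, and the observations that held them become NA.

```r
library(forcats)

# Hypothetical responses that already use the collapsed level names
party <- factor(c("Democrat", "Independent", "Republican", "Other Party", "Democrat"))

dem_rep <- fct_recode(party, NULL = "Other Party", NULL = "Independent")

levels(dem_rep)      # "Democrat" "Republican"
sum(is.na(dem_rep))  # 2: the Independent and Other Party responses
```

This is also why the NA count jumps in the output above: it now combines the original missing values with everyone we chose to exclude.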
Now that we have gotten all this pre-processing stuff out of the way, let’s go ahead and dig into some univariate analysis.