3  Recoding Variables

Today’s venture concerns univariate analysis, i.e. the quantitative description of a single variable. Before we do that, however, we need to familiarize ourselves with some data-cleaning procedures.

3.1 Background

Recoding a variable involves manipulating the underlying structure of our variable such that we can use it for analysis. We did a little recoding during the last unit when we converted character vectors into factor variables. This allowed us to align R data types with the appropriate level of measurement.

There are also occasions when we need a variable to be translated from one level of measurement to another. For example, we may want to convert a ratio variable for “number of years of education” into an ordinal variable reflecting categories like “less than high school”, “high school diploma”, “Associate’s degree”, and so on.

We may also want to collapse the categories of ordinal variables for some analyses. Consider a variable with a Likert-scale agreement rating, where you have responses like “strongly agree,” “moderately agree,” “slightly agree,” and so forth. You may decide to collapse these into just two categories, “Agree” and “Disagree”.

We will get some practice doing this sort of thing, which is an essential component of responsible analysis. Additionally, our next unit on bivariate analysis will require us to work with categorical variables in particular, so we need to be capable of converting any numeric variables.

3.2 Converting Numeric to Categorical

We will start by recoding age—a ratio variable—into an ordinal variable reflecting age groupings. The same strategies we use here will work for any numeric variable.

3.2.1 Setting up our workspace

As usual, let’s make sure we load in tidyverse along with our GSS data.

library(tidyverse)
load("our_gss.rda")

Let’s double check the structure of our data frame.

str(our_gss)
'data.frame':   3544 obs. of  11 variables:
 $ year    : num  2022 2022 2022 2022 2022 ...
 $ id      : Factor w/ 4149 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ age     : num  72 80 57 23 62 27 20 47 31 72 ...
 $ race    : Factor w/ 3 levels "white","other",..: 1 1 1 1 1 1 2 1 1 NA ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 1 1 2 2 1 2 1 1 ...
 $ realrinc: num  40900 NA 18405 2250 NA ...
 $ partyid : Factor w/ 8 levels "strong democrat",..: 1 2 3 1 2 4 5 1 5 1 ...
 $ happy   : Ord.factor w/ 3 levels "not too happy"<..: 1 1 1 1 2 2 2 2 2 2 ...
 $ marital : Factor w/ 5 levels "divorced","married",..: 1 2 1 3 3 3 3 2 3 3 ...
 $ attend  : Factor w/ 9 levels "never","less than once a year",..: 3 3 1 5 3 1 2 2 1 8 ...
 $ cappun  : Factor w/ 2 levels "favor","oppose": 2 1 1 2 2 2 2 2 1 NA ...
Note

In str() output for other data frames, you might notice the ‘int’ category, which is short for ‘integer’ (the numeric columns in our_gss all show up as ‘num’). This is a subtype of numeric data in R. Variables that are exclusively whole numbers are often recorded in this way, but we can work with them in R just like we can other sorts of numbers.

3.2.2 The ‘age’ variable

We can take a look at all the values of age (along with the # of respondents in each age category) using the count() function.

count(our_gss, age)

However, I’m not going to display those results here, as it would be a particularly long table of values (with 70+ different ages). It’s fine to run it (it won’t crash R or anything), but it would clutter up this page. Let’s take this as a good opportunity to take advantage of the fact that we are using one of the better-funded and better-organized surveys in all of the social sciences. As such, there’s extensive documentation about all of the variables measured for the GSS. Go ahead and take a look at the age variable via the GSS Data Explorer, which allows us to search for individual variables and view their response values, the specific question(s) asked on the survey, and several other variable characteristics.

The responses range from 18 - 89 (in addition to a few categories for non-response). However, note that there’s something unique about value 89. It’s not just 89 years of age, but 89 & older. This isn’t a real issue for our purposes, but take this as encouragement to interface with the codebook of any publicly available data you use. There’s some imprecision at the upper end of this variable, and that might not be obvious without referencing the codebook.

For the purposes of this exercise, let’s go ahead and turn age into a simple categorical variable with 3 levels—older, middle age, and younger. I’m going to choose the range somewhat arbitrarily for now. We can use univariate analysis to inform our decision about how to break up a numeric variable, so we will revisit this idea again later on.

  • Younger = 18 - 35
  • Middle Age = 36 - 49
  • Older = 50 and up

At the end of the day, we need to do two things: (1) create a new variable column, and (2) populate that column with ordinal labels that correspond to each respondent’s numeric age interval.

3.2.3 New columns with mutate()

First, let’s consider the mutate() function. This function takes a data frame and appends a new variable column. This new variable is the result of some calculation applied to an existing variable column.

Let’s look at an application of mutate() to get a feel for it. Now, this wouldn’t be the best idea for a couple of reasons, but, as an illustration, let’s say we wanted to convert our yearly income values to an hourly wage (assuming 40 hrs/week).

mutate() takes a data frame as its input, and then we provide the name of our new variable column(s) along with the calculation for this new variable. Below, I use mutate() to create a new column called hr_wage. Then, I tell R that the hr_wage variable should be calculated by taking each person’s income value and dividing that by 52 weeks in a year, and then by 40 hours in a week.

# Without the pipe operator
our_gss <- mutate(
  our_gss,
  hr_wage = (realrinc/52)/40
  )
# With the pipe operator
our_gss <- our_gss |>
  mutate(
    hr_wage = (realrinc/52)/40
  )

Take a look at your new data frame object. I’ll show a summary here to verify that we got our new column.

summary(our_gss)
      year            id            age           race          sex      
 Min.   :2022   1      :   1   Min.   :18.00   white:2514   female:1897  
 1st Qu.:2022   2      :   1   1st Qu.:34.00   other: 412   male  :1627  
 Median :2022   3      :   1   Median :48.00   black: 565   NA's  :  20  
 Mean   :2022   4      :   1   Mean   :49.18   NA's :  53                
 3rd Qu.:2022   5      :   1   3rd Qu.:64.00                             
 Max.   :2022   6      :   1   Max.   :89.00                             
                (Other):3538   NA's   :208                               
    realrinc                                      partyid   
 Min.   :   204.5   independent (neither, no response):835  
 1st Qu.:  8691.2   strong democrat                   :595  
 Median : 18405.0   not very strong democrat          :451  
 Mean   : 27835.3   strong republican                 :431  
 3rd Qu.: 33742.5   independent, close to democrat    :400  
 Max.   :141848.3   (Other)                           :797  
 NA's   :1554       NA's                              : 35  
           happy               marital                            attend    
 not too happy: 799   divorced     : 608   never                     :1149  
 pretty happy :1942   married      :1468   about once or twice a year: 464  
 very happy   : 779   never married:1095   every week                : 441  
 NA's         :  24   separated    : 103   less than once a year     : 416  
                      widowed      : 255   several times a year      : 346  
                      NA's         :  15   (Other)                   : 693  
                                           NA's                      :  35  
    cappun        hr_wage        
 favor :2013   Min.   : 0.09832  
 oppose:1327   1st Qu.: 4.17849  
 NA's  : 204   Median : 8.84856  
               Mean   :13.38237  
               3rd Qu.:16.22236  
               Max.   :68.19631  
               NA's   :1554      
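
One detail worth noticing in the summary: hr_wage has exactly as many NAs as realrinc (1554). That’s because arithmetic on a missing value returns a missing value. Here is a minimal sketch of that behavior, assuming a small made-up data frame (toy is hypothetical; the three income values are the first three shown in the str() output earlier).

```r
library(dplyr)

# Hypothetical toy data: the first three realrinc values from str(our_gss)
toy <- data.frame(realrinc = c(40900, NA, 18405))

# Same calculation as above: yearly income -> hourly wage
toy <- toy |>
  mutate(
    hr_wage = (realrinc / 52) / 40
  )

# The NA in realrinc carries over to hr_wage
toy$hr_wage
# roughly 19.66, NA, 8.85
```

So we don’t need to do anything special for respondents with missing income; they simply stay missing in the new column.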

So, we can use mutate() to add a new column that contains our recoded variable. We just need a way to specify a calculation that takes into account the specific intervals we want for our ordinal labels. For this, we need one more function.

3.2.4 Custom intervals with cut()

We can use the cut() function to specify the intervals we want for our age groupings, and then we will combine it with mutate() to generate our recoded age variable. Specifically, cut() takes our intervals and turns each of them into a level in a new factor variable.

But first, I want to give a little context on interval notation in mathematics. It will help us all be a little more precise when we talk about ranges, and it will also help us understand an input we need to provide for cut().

3.2.4.1 An aside on intervals

In mathematical terms, an interval is the set of all numbers in between two specified endpoints. In formal notation, these are indicated with the two endpoints placed in brackets, e.g., [1,5] as the interval of 1 through 5.

There are some technical terms to describe whether or not we want to include the endpoints when we are talking about a particular interval.

  • Closed interval: This is when both end points are included in the interval. Closedness is indicated with square brackets, so, the closed interval of 1 through 5 would be written just like I have above—[1,5]. This indicates any number \(x\) where \(1 \leq x \leq 5\)

  • Open interval: This is when neither endpoint is included in the interval. In interval notation, openness is indicated with parentheses rather than square brackets, so the open interval of 1 through 5 would be written as (1,5). This interval would indicate any number \(x\) where \(1 < x < 5\)

  • Left-open interval: This is when the left-hand endpoint is not included, but the right-hand endpoint is. This would be written as (1,5] in interval notation, and that interval would indicate any number \(x\) where \(1 < x \leq 5\)

  • Right-open interval: When the right-hand endpoint isn’t included but the left endpoint is, you have a right-open interval. This would be written as [1,5) in interval notation and would indicate any number \(x\) where \(1 \leq x < 5\).

We will be working with the right-open interval format, and we can specify that in cut().
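
To make the notation concrete, here is a small sketch using base R’s cut() on a made-up vector (x is hypothetical and not part of our GSS data). It compares cut()’s default behavior, which is left-open intervals, with the right-open format we will be using.

```r
# Hypothetical values that sit exactly on the endpoints
x <- c(1, 3, 5)

# Default (right = TRUE): left-open intervals (1,3] and (3,5]
left_open <- cut(x, breaks = c(1, 3, 5))

# right = FALSE: right-open intervals [1,3) and [3,5)
right_open <- cut(x, breaks = c(1, 3, 5), right = FALSE)

levels(left_open)   # "(1,3]" "(3,5]"
levels(right_open)  # "[1,3)" "[3,5)"

# Side-by-side view of where each value lands
data.frame(x, left_open, right_open)
```

Notice that the value 3 lands in a different interval depending on the format, and that the value sitting on the excluded outer endpoint (1 in the left-open version, 5 in the right-open version) comes back as NA.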

Now, let’s return to our task at hand.

3.2.4.2 Inputs for cut()

We’ll go ahead and work with cut() directly in mutate(), as it’s going to be the calculation that we are providing for the new column we generate with mutate().

As a reminder, here are the intervals we need:

  • Younger = 18 - 35
  • Middle Age = 36 - 49
  • Older = 50 and up

The following code may look a little chaotic at first glance, but it’s really the same sort of thing that we did with mutate() above. It’s just that cut() is a little bit more involved of a calculation than the simple division we did in our example.

Just like above, we are applying mutate() to our_gss, and we are giving the name of our new variable column (age_ord, in this case) as the first input. Then, for the calculation, we give the cut() function and the arguments it needs.

cut() and its inputs

  • The variable for which we want to specify intervals (age)

  • breaks: This is where we indicate the intervals. The first number we give is the low end of our lowest age group (18). The second number is the low end of our middle age group (36). The third number, as you probably guessed, is the low end of our highest age group. The last value should reflect our upper limit. In this case, I use the value Inf, which is short for ‘infinity’. This essentially tells R that the last category can include any values higher than the previous number we entered.

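  • include.lowest: This one is a little counterintuitive. Whichever way the intervals are closed, one of the two outermost break values would otherwise fall outside every interval, and setting this to TRUE pulls it in. With the default right = TRUE, that value is the lowest break; with right = FALSE (our case), it is the highest break. Our lower bound of 18 is already included because right = FALSE makes every interval left-closed, and our highest break is Inf, so include.lowest = TRUE makes no practical difference here. It is still a safe habit whenever the outermost breaks are actual data values.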

  • right: This is the input for specifying whether we want each interval to be right-closed, and it takes a logical value (TRUE or FALSE). We want right-open (and therefore left-closed) intervals such as [18,36), so we will set this to FALSE.

  • labels: R will actually default to formal interval notation for the names of each level of our new factor variable, so it would be [18,36), [36,50), [50,Inf] (the last interval is closed on the right because we set include.lowest = TRUE). However, we can provide a character vector to specify custom labels for these new factor levels. In the event that you have an ordinal variable, make sure that you always specify these labels in ascending order. In this case, that would be Younger < Middle Age < Older.

  • ordered_result: This takes a logical value and, as the name suggests, indicates whether we want the factor to be ordered or not. In our case, there is a clear progression in terms of age, so we need to set this to TRUE.

our_gss <- our_gss |>
  mutate(
    age_ord = cut(
      age,
      breaks = c(18, 36, 50, Inf),
      include.lowest = TRUE,
      right = FALSE,
      labels = c("Younger", "Middle Age", "Older"),
      ordered_result = TRUE
    )
  )

Go ahead and take a look at our_gss, and we should now have an additional variable column. I’ll use the str() function here so we can confirm that our factor was ordered and added to the data frame.

str(our_gss)
'data.frame':   3544 obs. of  13 variables:
 $ year    : num  2022 2022 2022 2022 2022 ...
 $ id      : Factor w/ 4149 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ age     : num  72 80 57 23 62 27 20 47 31 72 ...
 $ race    : Factor w/ 3 levels "white","other",..: 1 1 1 1 1 1 2 1 1 NA ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 1 1 2 2 1 2 1 1 ...
 $ realrinc: num  40900 NA 18405 2250 NA ...
 $ partyid : Factor w/ 8 levels "strong democrat",..: 1 2 3 1 2 4 5 1 5 1 ...
 $ happy   : Ord.factor w/ 3 levels "not too happy"<..: 1 1 1 1 2 2 2 2 2 2 ...
 $ marital : Factor w/ 5 levels "divorced","married",..: 1 2 1 3 3 3 3 2 3 3 ...
 $ attend  : Factor w/ 9 levels "never","less than once a year",..: 3 3 1 5 3 1 2 2 1 8 ...
 $ cappun  : Factor w/ 2 levels "favor","oppose": 2 1 1 2 2 2 2 2 1 NA ...
 $ hr_wage : num  19.66 NA 8.85 1.08 NA ...
 $ age_ord : Ord.factor w/ 3 levels "Younger"<"Middle Age"<..: 3 3 3 1 3 1 1 2 1 3 ...
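
It’s also worth confirming that the cut points behave the way we intended at the boundaries. The following is a quick sketch on a made-up data frame (toy and its ages are hypothetical, chosen to sit right on the edges of each group), using the same cut() call as above.

```r
library(dplyr)

# Hypothetical ages on the edges of each group, plus a missing value
toy <- data.frame(age = c(18, 35, 36, 49, 50, 89, NA))

toy <- toy |>
  mutate(
    age_ord = cut(
      age,
      breaks = c(18, 36, 50, Inf),
      include.lowest = TRUE,
      right = FALSE,
      labels = c("Younger", "Middle Age", "Older"),
      ordered_result = TRUE
    )
  )

toy
# 18 and 35 -> Younger; 36 and 49 -> Middle Age; 50 and 89 -> Older; NA stays NA
```

For a similar check on the real data, one option is to group our_gss by age_ord and summarize the minimum and maximum age within each group.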

3.3 Restructuring Categorical Variables

Now, let’s go ahead and tackle the other recoding situation I mentioned up at the top—collapsing the levels of categorical variables. We’ll work with partyid here, but the logic of this process will apply to any categorical variable.

3.3.1 Collapsing categories

I encourage you to take a look at partyid in the GSS Data Explorer, like we did for ‘age’.

Because partyid is a categorical variable, it has far fewer unique values than a ratio variable like age, so we can go ahead and take a look at all of its levels with the count() function.

Note

Observe that count(our_gss, partyid) basically provides the same information as summary(our_gss$partyid) when the variable is a factor. The only real difference is that count() organizes the info into a tidy data frame. In other words, you can use either function to take a look at a factor variable like this.

count(our_gss, partyid)
                             partyid   n
1                    strong democrat 595
2 independent (neither, no response) 835
3         not very strong republican 361
4           not very strong democrat 451
5     independent, close to democrat 400
6                        other party 106
7   independent, close to republican 330
8                  strong republican 431
9                               <NA>  35

Let’s go ahead and collapse these into categories of “Democrat”, “Independent”, “Other Party”, and “Republican”. We’ll use mutate() to make a new variable column called partyid_recoded. For the calculation of our new variable, we can use the fct_collapse() function. This function allows us to first give the name of a new factor level, and then give a character vector of the names of the existing levels that we want to collapse into that single category.

So, to collapse our 2 Democrat levels into a single factor level, we would give the following input for fct_collapse():

"Democrat" = c("strong democrat", "not very strong democrat")

And then repeat for each new factor level.

our_gss <- our_gss |>
  mutate(
    partyid_recoded = fct_collapse(
      partyid,
      "Democrat" = c("strong democrat", "not very strong democrat"),
      "Republican" = c("strong republican", "not very strong republican"),
      "Independent" = c(
        "independent, close to democrat",
        "independent (neither, no response)",
        "independent, close to republican"
      ),
      "Other Party" = c("other party")
    )
  )

Now, let’s take a look at our new variable.

count(our_gss, partyid_recoded)
  partyid_recoded    n
1        Democrat 1046
2     Independent 1565
3      Republican  792
4     Other Party  106
5            <NA>   35
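
A handy way to double check a collapse is to count the original and recoded variables together, so each old level shows up next to the new level it was assigned to. Here is a minimal sketch of that idea on a made-up data frame (toy is hypothetical and only includes a few of the partyid levels); I load dplyr and forcats directly, though both come along with tidyverse.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data with a handful of partyid values
toy <- data.frame(
  partyid = factor(c(
    "strong democrat", "not very strong democrat",
    "strong republican", "other party", NA
  ))
)

toy <- toy |>
  mutate(
    partyid_recoded = fct_collapse(
      partyid,
      "Democrat" = c("strong democrat", "not very strong democrat"),
      "Republican" = c("strong republican"),
      "Other Party" = c("other party")
    )
  )

# Each original level appears alongside its new, collapsed level
count(toy, partyid, partyid_recoded)
```

Running the same count(our_gss, partyid, partyid_recoded) on the real data should show all eight original levels mapped onto the four new ones, with the NAs left as NAs.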

3.3.2 Excluding levels

You may also want to work with an existing categorical variable, but only focus on certain values. In this case, we can create a recoded variable and simply code the levels we are uninterested in as NA.

We can use the fct_recode() function for this. For the first input, we will give it the factor-variable column that we want to recode. Then, we will let it know that we want to set the levels of “Other Party” and “Independent” to NULL. This will convert them to NAs and allow us to easily exclude them from our analyses. We will use fct_recode() within mutate() so we can create another recoded version of partyid_recoded. We’ll call this one dem_rep to distinguish it from the original partyid and our first recoded version.

our_gss <- our_gss |>
  mutate(dem_rep = fct_recode(
    partyid_recoded, 
    NULL = "Other Party", 
    NULL = "Independent"))

Let’s check to see that it worked.

count(our_gss, dem_rep)
     dem_rep    n
1   Democrat 1046
2 Republican  792
3       <NA> 1706
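
Once the unwanted levels are NAs, excluding them from an analysis is just a matter of filtering out the missing values. The following is a small sketch of that step on a made-up data frame (toy is hypothetical); it assumes we want to drop those rows entirely, which may or may not suit a given analysis.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data using our collapsed party labels
toy <- data.frame(
  partyid_recoded = factor(c(
    "Democrat", "Independent", "Republican", "Other Party", "Democrat"
  ))
)

# Same recode as above: the two unwanted levels become NA
toy <- toy |>
  mutate(dem_rep = fct_recode(
    partyid_recoded,
    NULL = "Other Party",
    NULL = "Independent"))

levels(toy$dem_rep)  # "Democrat" "Republican"

# Keep only respondents with a non-missing dem_rep, then count
toy |>
  filter(!is.na(dem_rep)) |>
  count(dem_rep)
```

Note that this only filters for the one count; the toy data frame itself still has all five rows unless we assign the filtered result to an object.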

Now that we have gotten all this pre-processing stuff out of the way, let’s go ahead and dig into some univariate analysis.