8 Transforming Your Data

How to use dplyr

Author

Mara Estreich

Abstract

This chapter introduces students to dplyr, a core package within the tidyverse designed for efficient data manipulation and transformation. Students learn essential data cleaning techniques including filtering observations, recoding variables, and creating new categorical and binary variables. Through practical examples using survey data, the chapter demonstrates how to prepare datasets for analysis by grouping response categories and creating dummy variables for regression modeling. By the end of this chapter, students will understand that clean, well-organized data is foundational for accurate analysis and will be equipped with core dplyr functions to transform raw survey data into analysis-ready datasets.

Keywords

data manipulation, dplyr, recoding variables, creating new variables, dummy coding

library(haven)
library(naniar)
library(dplyr)
library(knitr)

alldata <- read_sav("data/10.31.2024.oralhealth.sav", user_na = FALSE)

## Recode the values -99 and -50 as NA in every column
alldata <- alldata %>%
  replace_with_na_all(condition = ~.x %in% c(-99, -50))
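To see what this cleaning step does without the survey file, here is a minimal sketch on made-up data. The tibble `toy` and its columns are hypothetical. The sketch uses dplyr's `mutate()` and `across()` with base R's `replace()` instead of naniar, so it is an equivalent illustration rather than the code used in this chapter.

```r
library(dplyr)

# Hypothetical toy data: -99 and -50 stand in for the codes recoded above
toy <- tibble(
  age    = c(34, -99, 51),
  visits = c(2, 1, -50)
)

# In every column, swap -99 and -50 for NA
toy_clean <- toy %>%
  mutate(across(everything(), ~ replace(.x, .x %in% c(-99, -50), NA)))

toy_clean
```

Printing `toy_clean` should show NA wherever one of the two codes appeared, with all other values unchanged.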

8.1 Introduction to dplyr

dplyr is part of the tidyverse and provides intuitive functions for manipulating data. Think of it as a toolkit for cleaning and preparing your data for analysis.

Why use dplyr?

Makes data cleaning code readable and logical
Uses the pipe operator (%>%) to chain operations together
Essential for preparing data before creating visualizations or running analyses

Remember: You cannot create meaningful visualizations or run valid statistical analyses without clean data!
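As a small illustration of how the pipe chains operations, the sketch below writes the same two steps twice: once as nested function calls and once with `%>%`. The tibble `toy` and its columns are invented for this example and are not part of the survey data.

```r
library(dplyr)

# Hypothetical toy data, not the chapter's survey
toy <- tibble(
  id    = 1:5,
  age   = c(25, 41, 17, 63, 38),
  state = c("GA", "FL", "GA", "AL", "GA")
)

# Without the pipe: nested calls have to be read from the inside out
nested <- arrange(filter(toy, age >= 18), desc(age))

# With the pipe: the same steps read from top to bottom
piped <- toy %>%
  filter(age >= 18) %>%    # keep adults only
  arrange(desc(age))       # sort from oldest to youngest

# Both versions should contain the same rows in the same order
all.equal(nested, piped)
```

Both versions do the same work. The piped version is usually easier to read and extend because each line is one step.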

8.2 Example 1: Recoding Education into Categories

When working with survey data, we often need to collapse detailed response categories into broader groups. This is useful for:

Simplifying analysis
Creating groups with adequate sample sizes
Making results easier to interpret

8.2.1 Original Education Variable

Our survey asked about education level with seven response options, stored in the variable Education_Level:

Less than high school
High school diploma or GED
Some college, but no degree
Associate’s or technical degree
Bachelor’s degree
Master’s degree
Doctoral or professional degree

8.2.2 Recoding into Three Categories

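Let’s group these into three broader categories: less than a bachelor’s degree (codes 1 to 4), a bachelor’s degree (code 5), and a graduate degree (codes 6 and 7):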

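# Create a new 3-category education variable.
# case_when() assigns a label to each code; factor() with explicit levels
# keeps the categories in their logical order instead of alphabetical order.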
alldata <- alldata %>%
  mutate(
    education_3cat = factor(
      case_when(
        Education_Level %in% c(1, 2, 3, 4) ~ "Less than Bachelor's",
        Education_Level == 5 ~ "Bachelor's Degree",
        Education_Level %in% c(6, 7) ~ "Graduate Degree"
      ),
      levels = c("Less than Bachelor's", "Bachelor's Degree", "Graduate Degree")
    )
  )

# Check the recoding with a cross-tabulation:
# rows are the original codes, columns are the new categories
table(alldata$Education_Level, alldata$education_3cat, useNA = "ifany")

   
    Less than Bachelor's Bachelor's Degree Graduate Degree
  1                   11                 0               0
  2                   17                 0               0
  3                    8                 0               0
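One reason this check matters: `case_when()` returns NA for any value that matches none of its conditions. The sketch below shows this on made-up data, using `count()` as a dplyr alternative to `table()`. The tibble `toy` and the code 9 are hypothetical and were chosen only to show an unmatched value.

```r
library(dplyr)

# Hypothetical codes: 9 is a value the recoding rules do not cover
toy <- tibble(Education_Level = c(1, 4, 5, 7, 9, NA))

toy <- toy %>%
  mutate(
    education_3cat = case_when(
      Education_Level %in% c(1, 2, 3, 4) ~ "Less than Bachelor's",
      Education_Level == 5 ~ "Bachelor's Degree",
      Education_Level %in% c(6, 7) ~ "Graduate Degree"
    )
  )

# count() tallies each original code against its new category,
# so codes that matched no rule appear with NA in education_3cat
check <- toy %>% count(Education_Level, education_3cat)
check
```

In this toy example, the rows for code 9 and for the missing value both end up with NA in `education_3cat`. That is the kind of pattern to look for when checking a recode.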

Warning

This page is under construction. It will include more information soon!