library(haven)
library(naniar)
library(dplyr)
library(knitr)
8 Transforming Your Data
How to use dplyr
This chapter introduces students to dplyr, a core package within the tidyverse designed for efficient data manipulation and transformation. Students learn essential data cleaning techniques including filtering observations, recoding variables, and creating new categorical and binary variables. Through practical examples using survey data, the chapter demonstrates how to prepare datasets for analysis by grouping response categories and creating dummy variables for regression modeling. By the end of this chapter, students will understand that clean, well-organized data is foundational for accurate analysis and will be equipped with core dplyr functions to transform raw survey data into analysis-ready datasets.
Keywords: data manipulation, dplyr, recoding variables, creating new variables, dummy coding
dplyr resources
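# Read the SPSS file; user-defined missing values are converted to NA
alldata <- read_sav("data/10.31.2024.oralhealth.sav", user_na = FALSE)
# Replace the remaining missing-value codes (-99 and -50) with NA in every column
alldata <- alldata %>%
  replace_with_na_all(condition = ~.x %in% c(-99, -50))
8.1 Introduction to dplyr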
dplyr is part of the tidyverse and provides intuitive functions for manipulating data. Think of it as a toolkit for cleaning and preparing your data for analysis.
Why use dplyr?
Makes data cleaning code readable and logical
Uses the pipe operator (%>%) to chain operations together
Essential for preparing data before creating visualizations or running analyses
Remember: You cannot create meaningful visualizations or run valid statistical analyses without clean data!
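To make the pipe concrete, here is a minimal sketch on a small made-up data frame. The `toy` data and its columns `age` and `group` are hypothetical and not part of the survey. Both versions should give the same result; the piped version simply reads top to bottom in the order the steps happen.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  age   = c(25, 41, 17, 63),
  group = c("a", "b", "a", "b")
)

# Nested calls: read from the inside out
nested <- select(filter(toy, age >= 18), group)

# Piped calls: read from top to bottom
piped <- toy %>%
  filter(age >= 18) %>%
  select(group)

# Compare the two results
identical(nested, piped)
```

As more steps are added, the nested form becomes harder to follow, while the piped form just grows by one line per step.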
8.2 Example 1: Recoding Education into Categories
When working with survey data, we often need to collapse detailed response categories into broader groups. This is useful for:
Simplifying analysis
Creating groups with adequate sample sizes
Making results easier to interpret
8.2.1 Original Education Variable
Our survey asked about education level with 7 response options:
Less than high school
High school diploma or GED
Some college, but no degree
Associate’s or technical degree
Bachelor’s degree
Master’s degree
Doctoral or professional degree
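Because the file is read with haven, a variable like this one is typically stored as numeric codes with value labels attached. The sketch below builds a hypothetical labelled vector (`edu_demo`), assuming the codes 1 to 7 follow the order listed above. That mapping is an assumption made for illustration, so check the labels in your own data before recoding.

```r
library(haven)

# Hypothetical example vector; the code-to-label mapping is an assumption
edu_demo <- labelled(
  c(1, 5, 7, 3),
  labels = c(
    "Less than high school"           = 1,
    "High school diploma or GED"      = 2,
    "Some college, but no degree"     = 3,
    "Associate's or technical degree" = 4,
    "Bachelor's degree"               = 5,
    "Master's degree"                 = 6,
    "Doctoral or professional degree" = 7
  )
)

# Show the numeric codes together with their labels
print_labels(edu_demo)

# Convert the codes to their text labels for a quick readable check
edu_text <- as_factor(edu_demo)
table(edu_text)
```

On the survey data, the same two calls could be applied to `alldata$Education_Level` to confirm which code belongs to which response option.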
8.2.2 Recoding into Three Categories
Let’s group these into three meaningful categories:
# Create a new 3-category education variable
alldata <- alldata %>%
  mutate(
    education_3cat = factor(
      case_when(
        Education_Level %in% c(1, 2, 3, 4) ~ "Less than Bachelor's",
        Education_Level == 5 ~ "Bachelor's Degree",
        Education_Level %in% c(6, 7) ~ "Graduate Degree"
      ),
      levels = c("Less than Bachelor's", "Bachelor's Degree", "Graduate Degree")
    )
  )
# Check the recoding with a cross-tabulation
table(alldata$Education_Level, alldata$education_3cat, useNA = "ifany")
    Less than Bachelor's Bachelor's Degree Graduate Degree
  1                   11                 0               0
  2                   17                 0               0
  3                    8                 0               0
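The cross-tabulation is worth reading closely because `case_when()` returns `NA` for any value that matches none of its conditions. The following sketch uses a hypothetical data frame (`demo`) rather than the survey file to show that behavior. The code 9 is an invented out-of-range value; it and the missing value both end up as `NA` in the new variable, and `useNA = "ifany"` makes them visible.

```r
library(dplyr)

# Hypothetical data: codes 1-7 are valid, 9 is an invented unexpected code
demo <- data.frame(Education_Level = c(1, 4, 5, 6, 7, 9, NA))

demo <- demo %>%
  mutate(
    education_3cat = factor(
      case_when(
        Education_Level %in% c(1, 2, 3, 4) ~ "Less than Bachelor's",
        Education_Level == 5               ~ "Bachelor's Degree",
        Education_Level %in% c(6, 7)       ~ "Graduate Degree"
      ),
      levels = c("Less than Bachelor's", "Bachelor's Degree", "Graduate Degree")
    )
  )

# Unmatched codes and missing values both appear in the NA column
check <- table(demo$Education_Level, demo$education_3cat, useNA = "ifany")
check

# Count how many rows were left uncoded
n_uncoded <- sum(is.na(demo$education_3cat))
n_uncoded
```

Supplying `levels` to `factor()` fixes the category order, so tables and plots list the groups from lowest to highest education rather than alphabetically.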
This page is under construction. It will include more information soon!