📘 Unit 2.5: String, Date, and Factor Handling

Introduction

In the real world, data rarely arrives perfectly structured. Much of a data scientist’s work involves cleaning, transforming, and preparing data for analysis. Three data types that require special attention are:

  • Strings: names, addresses, comments, misspelled categories.
  • Dates and Times: inconsistent formats, time zones, duration calculations.
  • Factors (categorical variables): unordered levels, infrequent categories, confusing labels.

In this unit, you will master these three data types using tidyverse packages: stringr for strings, lubridate for dates, and forcats for factors. These tools let you turn messy data into clear, consistent information ready for analysis.


1. String Manipulation with stringr

The stringr package provides a coherent, intuitive, and efficient interface for working with text strings in R. All its functions start with str_, making them easy to discover and use.

1.1. Basic Functions

library(stringr)
library(dplyr)

# Sample data
names <- c("Ana García", "Carlos Ruiz", "María López", "JUAN PEREZ")

# Detect patterns
str_detect(names, "a")          # TRUE if contains "a" (case-sensitive)
str_detect(names, regex("a", ignore_case = TRUE))  # Case-insensitive

# Count occurrences
str_count(names, "a")           # Number of "a"s in each string

# Locate position
str_locate(names, "a")          # Position of first "a"
str_locate_all(names, "a")      # Positions of all "a"s

# Extract substrings
str_extract(names, "[A-Z]+")    # Extracts first uppercase sequence
str_extract_all(names, "[A-Z]+") # Extracts all uppercase sequences

# Replace text
str_replace(names, "a", "X")    # Replaces first "a" with "X"
str_replace_all(names, "a", "X") # Replaces all "a"s with "X"

# Split strings
str_split(names, " ", simplify = TRUE) # Splits by space
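
A few other everyday stringr helpers are worth keeping at hand. This brief sketch reuses the names vector from above; str_trim() reappears in the data-frame example in section 1.3.

# Other common helpers
str_length(names)               # number of characters in each string
str_to_lower(names)             # lowercase (str_to_upper and str_to_title also exist)
str_c(names, collapse = "; ")   # combine a vector into a single string
str_trim("  padded text  ")     # strip surrounding whitespace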

1.2. Patterns with Regular Expressions (Regex)

Regular expressions are patterns that describe sets of strings. stringr fully supports them.

emails <- c("ana@gmail.com", "carlos@outlook.es", "invalido", "maria@empresa.org")

# Validate emails (basic)
str_detect(emails, ".+@.+\\..+")

# Extract domain
str_extract(emails, "@(.+)$") %>% str_replace("@", "")

# Clean text: only letters and spaces
dirty_texts <- c("Hola123!", "¿Qué tal?", "Precio: $50")
str_replace_all(dirty_texts, "[^A-Za-zÁÉÍÓÚáéíóúñÑ ]", "")
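
As an alternative to extracting a match and then stripping the "@" (as done above for the domain), str_match() exposes regex capture groups directly; a minimal sketch:

# Column 1 is the full match, column 2 the first capture group
str_match(emails, "@(.+)$")[, 2]   # "gmail.com", "outlook.es", NA, "empresa.org"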

1.3. Practical Cases in Data Frames

# Sample dataset
customers <- tibble(
  full_name = c("García, Ana", "Ruiz, Carlos", "López, María"),
  email = c("ana@gmail.com", "carlos@outlook.es", "maria@empresa.org"),
  phone = c("(555) 123-4567", "555-987-6543", "555 321 7890")
)

# Separate first and last name
customers <- customers %>%
  mutate(
    last_name = str_extract(full_name, "^[^,]+"),
    first_name = str_extract(full_name, "[^,]+$") %>% str_trim(),
    email_domain = str_extract(email, "@(.+)$") %>% str_replace("@", ""),
    clean_phone = str_replace_all(phone, "[^0-9]", "")
  )

customers

2. Date and Time Handling with lubridate

lubridate simplifies working with dates and times in R. It provides intuitive functions to parse, manipulate, and format dates.

2.1. Date Parsing

library(lubridate)

# Functions by order: y=year, m=month, d=day
text_dates <- c("2023-12-01", "01/12/2023", "Dec 1, 2023", "20231201")

ymd(text_dates[1])   # 2023-12-01
dmy(text_dates[2])   # 2023-12-01
mdy(text_dates[3])   # 2023-12-01
ymd(text_dates[4])   # 2023-12-01

# Flexible parsing
parse_date_time(text_dates, orders = c("ymd", "dmy", "mdy", "Ymd"))
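
One caution: when a string does not fit the order you request, lubridate returns NA and emits a warning rather than guessing, for example:

# A year cannot be read as a month, so this parse fails
mdy("2023-12-01")   # NA, with a "failed to parse" warning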

2.2. Components and Operations

today <- ymd("2024-06-15")

# Extract components
year(today)    # 2024
month(today)   # 6
day(today)     # 15
wday(today, label = TRUE)  # "Sat"

# Modify components
today %>% 
  update(year = 2025, month = 12)  # 2025-12-15

# Add/subtract time
today + days(10)        # 2024-06-25
today + months(1)       # 2024-07-15
today + years(1)        # 2025-06-15

# Differences
difftime(ymd("2025-01-01"), today, units = "days")

2.3. Intervals, Durations, and Periods

start <- ymd_hms("2024-06-01 08:00:00")
end <- ymd_hms("2024-06-01 17:30:00")

# Duration (physical time)
duration <- end - start
as.duration(duration)  # 34200s (~9.5 hours)

# Period (calendar time)
period <- months(1)
start + period        # 2024-07-01 08:00:00

# Intervals
interval <- interval(start, end)
int_start(interval)
int_end(interval)
int_length(interval)  # in seconds

# Is a date within an interval?
ymd("2024-06-02") %within% interval  # FALSE

2.4. Time Zones

# Create a date-time with a time zone
ny_time <- ymd_hms("2024-06-15 12:00:00", tz = "America/New_York")

# Convert to another zone: same instant, different clock time
london_time <- with_tz(ny_time, "Europe/London")   # 17:00:00

# Replace the zone, keeping the clock time: a different instant
ny_forced <- force_tz(ny_time, "Europe/London")    # still 12:00, but in London
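
A quick way to see the distinction with the objects just created: with_tz() leaves the underlying instant untouched, while force_tz() shifts it by the offset between the two zones.

difftime(london_time, ny_time)   # 0: same instant, different label
difftime(ny_forced, ny_time)     # -5 hours (EDT vs. BST in June)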

2.5. Practical Case: Sales Analysis by Time

sales <- tibble(
  sale_date = c("2024-01-15", "2024-02-20", "2024-03-10", "2024-04-05"),
  sale_time = c("14:30:00", "09:15:00", "16:45:00", "11:20:00"),
  amount = c(150.50, 200.00, 75.25, 300.75)
)

clean_sales <- sales %>%
  mutate(
    datetime = ymd_hms(paste(sale_date, sale_time)),
    year = year(datetime),
    month = month(datetime, label = TRUE),
    weekday = wday(datetime, label = TRUE),
    hour = hour(datetime)
  ) %>%
  select(-sale_date, -sale_time)

clean_sales

3. Factor Handling with forcats

Factors are R's data type for categorical variables. forcats provides tools to reorder, recode, and manipulate factor levels intuitively.

3.1. Reordering Levels

library(forcats)

# Sample data
countries <- c("México", "Argentina", "Brasil", "Chile", "Argentina", "México", "Brasil")

# Convert to factor
countries_f <- factor(countries)

# Reorder by frequency
fct_infreq(countries_f)  # levels reordered from most to least frequent (Chile, with one observation, goes last)

# Reorder manually
fct_relevel(countries_f, "Brasil", "Argentina", "México", "Chile")

# Reorder by another variable (e.g., GDP)
gdp <- c(México = 1.5, Argentina = 0.6, Brasil = 2.1, Chile = 0.3)
fct_reorder(countries_f, gdp[countries])  # levels ordered by GDP (the default summary, median, is fine here)
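
Reordering matters most when plotting. A hypothetical sketch (assuming ggplot2 is installed) that orders the bars of a bar chart by frequency:

library(ggplot2)

tibble(country = countries_f) %>%
  ggplot(aes(x = fct_infreq(country))) +
  geom_bar() +
  labs(x = "Country", y = "Count")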

3.2. Recoding and Collapsing Levels

# Recode manually
regions <- fct_recode(countries_f,
  "North America" = "México",
  "South America" = "Argentina",
  "South America" = "Brasil",
  "South America" = "Chile"
)

# Collapse infrequent levels
set.seed(123)
categories <- sample(c("A", "B", "C", "D", "E", "F"), 100, replace = TRUE)
categories_f <- factor(categories)

# Keep only top 3 most frequent, rest as "Other"
fct_lump_n(categories_f, n = 3)

# Collapse by minimum proportion
fct_lump_prop(categories_f, prop = 0.15)

# Collapse by minimum number of observations
fct_lump_min(categories_f, min = 15)
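
To check what a lumping function actually did, fct_count() tabulates the levels before and after; a small sketch:

fct_count(categories_f)                      # original level counts
fct_count(fct_lump_n(categories_f, n = 3))   # top 3 kept, the rest in "Other"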

3.3. Other Useful Functions

# Reverse levels
fct_rev(fct_infreq(countries_f))

# Expand levels (useful for plots)
fct_expand(countries_f, "Perú", "Colombia")

# Remove unused levels
countries_sub <- countries_f[countries_f != "Chile"]
countries_sub  # Chile still in levels
fct_drop(countries_sub)  # Chile removed from levels

# Anonymize levels
fct_anon(countries_f)  # Replaces levels with arbitrary numeric labels ("1", "2", ...) in random order

3.4. Practical Case: Data Preparation for Models

# Survey dataset
survey <- tibble(
  satisfaction = c("Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"),
  frequency = c(5, 15, 25, 40, 15)
)

# Convert to factor with logical order
survey <- survey %>%
  mutate(
    satisfaction_f = factor(satisfaction, 
                           levels = c("Very Dissatisfied", "Dissatisfied", 
                                     "Neutral", "Satisfied", "Very Satisfied"))
  )

# For models, sometimes we want to order by frequency
survey %>%
  mutate(
    satisfaction_freq = fct_infreq(satisfaction_f)
  )

# Or collapse extreme categories
survey %>%
  mutate(
    sat_collapsed = fct_collapse(satisfaction_f,
      "Dissatisfied" = c("Very Dissatisfied", "Dissatisfied"),
      "Satisfied" = c("Satisfied", "Very Satisfied"),
      "Neutral" = "Neutral"
    )
  )

4. Integrated Project: Complete Dataset Cleaning

We will apply everything learned to a real dataset: customer satisfaction surveys.

library(tidyverse)
library(lubridate)
library(forcats)

# Simulate a messy dataset
set.seed(123)
dirty_data <- tibble(
  id = 1:100,
  name = paste(sample(c("Juan", "María", "Carlos", "Ana", "Luis"), 100, replace = TRUE), 
               sample(c("Gómez", "Pérez", "López", "Ruiz", "Díaz"), 100, replace = TRUE)),
  email = paste0(tolower(sample(letters, 100, replace = TRUE)), 
                 sample(100:999, 100, replace = TRUE), "@", 
                 sample(c("gmail.com", "hotmail.com", "empresa.org", "univ.edu"), 100, replace = TRUE)),
  survey_date = sample(seq(ymd("2023-01-01"), ymd("2024-06-15"), by = "day"), 100, replace = TRUE),
  survey_time = sprintf("%02d:%02d", sample(9:18, 100, replace = TRUE), sample(0:59, 100, replace = TRUE)),
  satisfaction = sample(c("Very Dissatisfied", "dissatisfied", "NEUTRAL", "Satisfied ", "Very Satisfied!"), 100, replace = TRUE),
  comments = c(
    rep("Excellent service, very fast", 20),
    rep("Slow and unfriendly", 15),
    rep("Good, but can be improved", 25),
    rep("Horrible, never again", 10),
    rep("Very good, I'll be back", 30)
  )
)

# Complete cleaning
clean_data <- dirty_data %>%
  # Clean satisfaction
  mutate(
    satisfaction = str_to_title(str_trim(str_replace_all(satisfaction, "[^A-Za-z ]", ""))),
    satisfaction_f = factor(satisfaction, 
                           levels = c("Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"))
  ) %>%
  # Extract email domain
  mutate(
    domain = str_extract(email, "@(.+)$") %>% str_replace("@", "")
  ) %>%
  # Combine date and time
  mutate(
    datetime = ymd_hm(paste(survey_date, survey_time)),
    month = month(datetime, label = TRUE),
    weekday = wday(datetime, label = TRUE)
  ) %>%
  # Clean comments and create sentiment variable
  mutate(
    comments = str_to_sentence(str_trim(comments)),
    sentiment = case_when(
      str_detect(comments, regex("excellent|very good|I'll be back", ignore_case = TRUE)) ~ "Positive",
      str_detect(comments, regex("slow|unfriendly|horrible|never", ignore_case = TRUE)) ~ "Negative",
      TRUE ~ "Neutral"
    ),
    sentiment_f = fct_relevel(factor(sentiment), "Negative", "Neutral", "Positive")
  ) %>%
  # Select final columns
  select(id, name, email, domain, datetime, month, weekday, satisfaction_f, sentiment_f, comments)

# View result
glimpse(clean_data)
head(clean_data)

# Exploratory analysis
clean_data %>%
  count(satisfaction_f) %>%
  mutate(pct = n / sum(n))

clean_data %>%
  count(month, satisfaction_f) %>%
  ggplot(aes(x = month, y = n, fill = satisfaction_f)) +
  geom_col(position = "dodge") +
  labs(title = "Satisfaction by Month", x = "Month", y = "Number of Surveys") +
  theme_minimal()

Key Commands Summary

Type      Function                        Description
Strings   str_detect()                    Detects whether a pattern matches
          str_replace_all()               Replaces all matches of a pattern
          str_extract()                   Extracts the first match
          str_split()                     Splits a string into parts
Dates     ymd(), dmy(), mdy()             Parse dates by component order
          year(), month(), day()          Extract date components
          + days(), months(), years()     Add days, months, or years to a date
          interval(), %within%            Create and test time intervals
Factors   fct_relevel()                   Reorders levels manually
          fct_infreq()                    Orders levels by frequency
          fct_lump_n()                    Collapses the least frequent levels into "Other"
          fct_collapse()                  Groups specific levels into a new one

Practical Exercises

  1. Name Cleaning: Given a vector of names with inconsistent formats (UPPERCASE, lowercase, with titles like "Dr."), normalize them to "First Last" with initial capitalization.

  2. Domain Extraction: From a list of emails, extract the domain and count how many there are per domain. Then, collapse all domains with fewer than 5 occurrences into "Other".

  3. Date Conversion: You have dates in "DD-MM-YYYY" and "MM/DD/YYYY" formats mixed together. Convert them all to R’s standard Date format.

  4. Category Reordering: In a sales dataset by product, reorder the products in the bar chart according to total sales amount (from highest to lowest).

  5. Mini Project: Load a real dataset (e.g., from Kaggle) containing at least one text column, one date column, and one categorical column. Apply all techniques from this unit to clean it and prepare it for analysis.


✅ With this unit, you now have the essential tools to clean and transform the most challenging data types. You are ready to tackle real-world datasets with confidence.

Course Info

Course: R-zero-to-hero

Language: EN

Lesson: Module09