📘 Unit 2.3: Cleaning and Transformation with tidyr

Unit Objective

By the end of this unit, the student will be able to:

  • Identify and correct non-“tidy” data structures.
  • Transform data between wide and long formats using pivot_longer() and pivot_wider().
  • Split and combine columns with separate(), unite(), and extract().
  • Handle missing values in a controlled and efficient manner.
  • Reorder, group, and collapse factor levels to improve analysis quality.

1. Introduction to tidyr and the Concept of “Tidy Data”

The tidyr package is a fundamental part of the tidyverse ecosystem and is specifically designed to clean and restructure data so that it adheres to the principles of tidy data (rectangular and orderly), as defined by Hadley Wickham:

Tidy data follows three rules:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each value is a cell.

Many real-world datasets do not comply with these rules. For example:

  • A single column contains multiple variables (e.g., “first_last_name”).
  • Variables are in columns when they should be in rows (wide vs. long format).
  • There are implicit or combined values within a single cell.

tidyr provides intuitive and consistent functions to solve these problems.


2. Transformation between Formats: Long ↔ Wide

2.1. pivot_longer(): From Wide to Long

Converts columns into rows. Useful when multiple columns represent measurements of the same variable under different conditions.

library(tidyr)
library(dplyr)

# Example: survey dataset with columns per year
encuesta_ancho <- tibble(
  pais = c("España", "México", "Argentina"),
  `2020` = c(75, 70, 78),
  `2021` = c(77, 72, 80),
  `2022` = c(79, 74, 81)
)

# Convert to long format
encuesta_largo <- encuesta_ancho %>%
  pivot_longer(
    cols = c(`2020`, `2021`, `2022`),  # columns to pivot
    names_to = "año",                  # name of the new column for names
    values_to = "esperanza_vida"       # name of the new column for values
  )

# Result:
# pais      año   esperanza_vida
# España    2020  75
# España    2021  77
# ...

🔹 Key Parameters:

  • cols: columns to transform (you can use starts_with(), ends_with(), matches(), or ranges like año_2020:año_2022).
  • names_to: name of the column that will contain the former column names.
  • values_to: name of the column that will contain the values.

2.2. pivot_wider(): From Long to Wide

Converts rows into columns. Useful for summaries, cross-tabulations, or visualizations.

# Starting from the previous long dataset
encuesta_largo %>%
  pivot_wider(
    names_from = año,           # column whose values will become column names
    values_from = esperanza_vida # column whose values will fill the new columns
  )

# Result:
# pais      `2020` `2021` `2022`
# España    75     77     79
# México    70     72     74
# ...

🔹 Key Parameters:

  • names_from: column that defines the new column names.
  • values_from: column that provides the values.
  • values_fill: value to fill empty cells (default is NA).
  • names_prefix, names_sep: to customize resulting names.

3. Splitting and Combining Columns

3.1. separate(): Split One Column into Multiple

Splits a column into several using a separator (e.g., hyphen, space, semicolon).

datos <- tibble(
  nombre_completo = c("Ana García", "Luis Pérez", "María López")
)

datos_separado <- datos %>%
  separate(
    col = nombre_completo,
    into = c("nombre", "apellido"),
    sep = " "  # separator: space
  )

# Result:
# nombre   apellido
# Ana      García
# Luis     Pérez
# María    López

🔹 Key Parameters:

  • col: column to split.
  • into: vector of names for the new columns.
  • sep: separator (can be a string, numeric position, or regex).
  • remove: whether to remove the original column (TRUE by default).
  • convert: attempts to automatically convert to numeric or logical type (FALSE by default).

3.2. unite(): Combine Multiple Columns into One

Combines several columns into one, with an optional separator.

datos_separado %>%
  unite(
    col = nombre_completo,
    nombre, apellido,
    sep = " - "
  )

# Result:
# nombre_completo
# Ana - García
# Luis - Pérez
# ...

🔹 Key Parameters:

  • col: name of the new combined column.
  • ...: columns to unite (you can use :, c(), or individual names).
  • sep: separator between values.
  • remove: whether to remove the original columns.

3.3. extract(): Split Using Regular Expressions

Similar to separate(), but uses regex to extract groups defined by parentheses.

datos_fecha <- tibble(
  fecha_texto = c("2023-05-15", "2024-12-01", "2022-07-30")
)

datos_fecha %>%
  extract(
    fecha_texto,
    into = c("año", "mes", "día"),
    regex = "(\\d{4})-(\\d{2})-(\\d{2})"
  )

# Result:
# año    mes   día
# 2023   05    15
# 2024   12    01
# ...

🔹 Key Parameters:

  • regex: pattern with capture groups ( ).
  • remove, convert: same as in separate().

4. Handling Missing Values (NA)

4.1. drop_na(): Remove Rows with Missing Values

Removes rows containing NA in the specified columns.

datos_con_na <- tibble(
  x = c(1, 2, NA, 4),
  y = c("a", NA, "c", "d"),
  z = c(10, 20, 30, NA)
)

# Remove rows with NA in any column
datos_con_na %>% drop_na()

# Remove rows with NA only in column 'x'
datos_con_na %>% drop_na(x)

# Remove rows with NA in columns 'x' or 'y'
datos_con_na %>% drop_na(x, y)

4.2. replace_na(): Replace Missing Values

Replaces NA with a specific value, column by column.

datos_con_na %>%
  replace_na(
    list(
      x = 0,
      y = "desconocido",
      z = mean(datos_con_na$z, na.rm = TRUE)
    )
  )

🔹 Important: You must pass a list with replacements per column.


4.3. fill(): Fill Forward or Backward

Useful for data where a value applies to several consecutive rows (e.g., group headers).

ventas <- tibble(
  region = c("Norte", NA, NA, "Sur", NA, NA),
  ventas = c(100, 150, 200, 300, 400, 500)
)

ventas %>%
  fill(region, .direction = "down")

# Result:
# region   ventas
# Norte    100
# Norte    150
# Norte    200
# Sur      300
# Sur      400
# Sur      500

🔹 Parameters:

  • .direction: "down" (default), "up", "downup", "updown".

5. Factor Manipulation with forcats (integrated into tidyverse)

Although technically belonging to forcats, its use is essential for cleaning categorical data.

5.1. fct_reorder(): Reorder Levels by Another Variable

Orders the levels of a factor according to a function applied to another variable (e.g., mean, sum).

library(ggplot2)  # for mpg

mpg %>%
  mutate(class = fct_reorder(class, hwy, .fun = median)) %>%
  ggplot(aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()

5.2. fct_infreq() and fct_inorder()

  • fct_infreq(): orders by descending frequency.
  • fct_inorder(): orders by order of appearance.
ejemplo <- tibble(
  categoria = c("B", "A", "C", "A", "B", "A")
)

ejemplo %>%
  mutate(
    por_frecuencia = fct_infreq(categoria),
    por_orden = fct_inorder(categoria)
  )

5.3. fct_collapse() and fct_lump(): Group Infrequent Levels

# Group manually
ejemplo %>%
  mutate(categoria_agrupada = fct_collapse(categoria,
    Grupo1 = c("A", "B"),
    Otros = "C"
  ))

# Group automatically: levels with frequency < n or % become "Other"
ejemplo %>%
  mutate(categoria_lump = fct_lump(categoria, n = 2))  # keeps the 2 most frequent

6. Integrated Practical Case

Dataset: Satisfaction survey data by region and quarter (wide format, with combined columns and missing values).

encuesta_sucia <- tibble(
  "Region-Trimestre" = c("Norte-Q1", "Sur-Q2", "Este-Q1", "Oeste-Q3"),
  Satisfaccion_2023 = c(85, NA, 90, 78),
  Satisfaccion_2024 = c(88, 82, NA, 80),
  Comentarios = c("Muy bueno", "Mejorable", "Excelente", NA)
)

# Step 1: Split combined column
encuesta_limpia <- encuesta_sucia %>%
  separate(`Region-Trimestre`, into = c("region", "trimestre"), sep = "-", remove = TRUE)

# Step 2: Convert to long format
encuesta_limpia <- encuesta_limpia %>%
  pivot_longer(
    cols = starts_with("Satisfaccion"),
    names_to = "año",
    values_to = "puntuacion",
    names_prefix = "Satisfaccion_"
  )

# Step 3: Handle missing values
encuesta_limpia <- encuesta_limpia %>%
  replace_na(list(puntuacion = mean(encuesta_limpia$puntuacion, na.rm = TRUE)))

# Step 4: Reorder regions by average score
encuesta_limpia <- encuesta_limpia %>%
  mutate(region = fct_reorder(region, puntuacion, .fun = mean))

# Step 5: Visualize
library(ggplot2)
encuesta_limpia %>%
  ggplot(aes(x = region, y = puntuacion, fill = año)) +
  geom_col(position = "dodge") +
  labs(title = "Satisfaction by Region and Year") +
  theme_minimal()

7. Proposed Exercises

  1. Exercise 1: Take a dataset with a “first_last_name” column and split it into two columns. Then, join them with an underscore.
  2. Exercise 2: Convert a monthly sales dataset (columns Jan, Feb, Mar...) to long format. Then, convert it back to wide.
  3. Exercise 3: Use fill() to complete a “category” column that only has a value in the first row of each group.
  4. Exercise 4: Use fct_lump() to group all categories with fewer than 5 observations into “Others”.
  5. Exercise 5 (Mini Project): Download a dataset from Kaggle in a non-tidy format. Clean it completely using tidyr and generate a plot with ggplot2.

8. Best Practices and Tips

  • Always check the structure before and after transforming: glimpse(), count(), summary().
  • Use descriptive names in names_to and values_to.
  • Combine tidyr with dplyr: filter before pivoting, summarize after.
  • Use convert = TRUE in separate() if you expect numbers.
  • Save intermediate versions if the transformation is complex.
  • Document your transformations with comments or in an R Markdown.

9. Additional Resources


✅ With this unit, the student masters the essential tools to prepare any dataset for analysis, visualization, or modeling, following modern standards of the R ecosystem.

Course Info

Course: R-zero-to-hero

Language: EN

Lesson: Module07