📘 Unit 2.2: Data Manipulation with dplyr

1. Introduction to dplyr

dplyr is one of the most popular and powerful packages in the tidyverse ecosystem, specifically designed for efficient and readable data frame manipulation. It was created by Hadley Wickham and the RStudio (now Posit) team with the goal of simplifying the most common operations in data analysis through an intuitive and consistent syntax.

Key features of dplyr:

  • Intuitive verbs: Each function has a name that clearly describes its action (filter, select, mutate, etc.).
  • Consistency: All functions follow the same pattern: the first argument is always the data frame, followed by columns or conditions.
  • Speed: Written in C++, it is significantly faster than equivalent operations in base R.
  • Pipe integration: Works perfectly with the %>% operator to chain operations in a readable way.
  • Compatibility: Works with data frames, tibbles, SQL databases, and other backends via dbplyr.

Prerequisite: Make sure you have loaded the tidyverse package or at least dplyr.

library(tidyverse)
# or
library(dplyr)

2. The Five Main Verbs of dplyr

2.1 filter() — Select rows by condition

The filter() function allows you to select rows from a data frame that meet one or more logical conditions.

Basic syntax:

filter(.data, ..., .preserve = FALSE)

Examples:

# We'll use the mpg dataset from the ggplot2 package
data(mpg, package = "ggplot2")

# Cars with more than 30 mpg in the city
filter(mpg, cty > 30)

# Compact class cars with 4 cylinders
filter(mpg, class == "compact", cyl == 4)

# Cars with 4 or 8 cylinders
filter(mpg, cyl %in% c(4, 8))

# Cars with automatic transmission (starting with "auto")
filter(mpg, grepl("^auto", trans))

# Negation: cars that are NOT suv
filter(mpg, class != "suv")

Useful logical operators:

  • ==, !=, >, <, >=, <=
  • & (and), | (or), ! (not)
  • %in% (belongs to a vector)
  • is.na(), !is.na() (for missing values)
  • between(x, left, right) — shortcut for x >= left & x <= right
  • near(x, y) — for comparing floating-point numbers with tolerance

2.2 select() — Select columns by name

select() allows you to choose specific columns from a data frame, useful for reducing dimensionality or reordering variables.

Basic syntax:

select(.data, ...)

Examples:

# Select specific columns
select(mpg, manufacturer, model, cty, hwy)

# Select a range of columns
select(mpg, cty:hwy)  # all columns between cty and hwy

# Exclude columns
select(mpg, -year, -cyl)  # all except year and cyl

# Select columns matching a pattern
select(mpg, starts_with("c"))    # starting with "c"
select(mpg, ends_with("y"))      # ending with "y"
select(mpg, contains("man"))     # containing "man"
select(mpg, matches("^[mh]"))    # starting with m or h (regex)

# Rename and select at the same time
select(mpg, brand = manufacturer, model = model, city = cty)

# Reorder columns
select(mpg, model, manufacturer, everything())  # model and manufacturer first

Selection helper functions:

  • starts_with(), ends_with(), contains(), matches()
  • num_range("x", 1:5) — selects x1, x2, ..., x5
  • everything() — all remaining columns
  • last_col() — last column

2.3 mutate() — Create or modify columns

mutate() adds new columns or modifies existing ones, preserving the originals (unless overwritten).

Basic syntax:

mutate(.data, ..., .keep = "all", .before = NULL, .after = NULL)

Examples:

# Create a new column: average mpg
mpg <- mutate(mpg, avg_mpg = (cty + hwy) / 2)

# Modify an existing column
mpg <- mutate(mpg, cty = cty * 1.609)  # convert to km/l

# Create multiple columns
mpg <- mutate(mpg,
              ratio_hwy_cty = hwy / cty,
              efficiency = case_when(
                avg_mpg > 25 ~ "High",
                avg_mpg > 20 ~ "Medium",
                TRUE ~ "Low"
              )
)

# Use window functions
mpg <- mutate(mpg,
              row_id = row_number(),
              cty_rank = min_rank(cty),
              cty_dense = dense_rank(cty),
              cty_cumsum = cumsum(cty)
)

# Remove original columns after mutating (R 1.0+)
mpg <- mutate(mpg, avg_mpg = (cty + hwy)/2, .keep = "used")  # keeps only cty, hwy, avg_mpg

Useful functions within mutate:

  • Arithmetic operations: +, -, *, /, ^
  • Logical functions: if_else(), case_when()
  • Text functions: str_to_upper(), str_c()
  • Date functions: year(), month(), ymd()
  • Window functions: lag(), lead(), cumsum(), row_number(), min_rank()

2.4 arrange() — Sort rows

arrange() reorders the rows of a data frame based on one or more columns.

Basic syntax:

arrange(.data, ..., .by_group = FALSE)

Examples:

# Sort by one column (ascending by default)
arrange(mpg, cty)

# Descending order
arrange(mpg, desc(cty))

# Sort by multiple columns
arrange(mpg, class, desc(hwy))  # first by class, then by hwy descending

# Sort considering NA at the end
arrange(mpg, cty, .na_last = TRUE)

# Use expressions
arrange(mpg, manufacturer == "audi")  # audi at the end (FALSE=0, TRUE=1)

2.5 summarise() — Summarize data (with group_by())

summarise() collapses multiple values into a single summary. It is especially useful with group_by() for calculations by group.

Basic syntax:

summarise(.data, ..., .groups = NULL)

Examples:

# Global summary
summarise(mpg,
          avg_cty = mean(cty),
          max_hwy = max(hwy),
          n = n()
)

# Summary by group
mpg %>%
  group_by(manufacturer) %>%
  summarise(
    avg_cty = mean(cty),
    total_cars = n(),
    best_hwy = max(hwy),
    .groups = "drop"  # important: releases grouping
  )

# Multiple levels of grouping
mpg %>%
  group_by(manufacturer, class) %>%
  summarise(
    avg_mpg = mean((cty + hwy)/2),
    n = n(),
    .groups = "drop"
  )

# Common summary functions:
# - mean(), median(), sd(), var(), min(), max()
# - sum(), n(), n_distinct()
# - first(), last(), nth()
# - quantile(), IQR()

⚠️ Important: Always use .groups = "drop" at the end of summarise() to avoid warnings and unexpected behavior in recent versions of dplyr.


3. Helper and Advanced Functions

3.1 distinct() — Unique rows

Removes duplicate rows.

# All columns
distinct(mpg)

# Only certain columns
distinct(mpg, manufacturer, model)

# Keep first occurrence
distinct(mpg, manufacturer, .keep_all = TRUE)  # keeps entire row

3.2 slice() — Select rows by position

slice(mpg, 1:10)           # first 10 rows
slice(mpg, n())            # last row
slice(mpg, n() - 5:n())    # last 6 rows
slice_sample(mpg, n = 5)   # random sample of 5 rows
slice_min(mpg, cty, n = 3) # 3 rows with lowest cty
slice_max(mpg, hwy, prop = 0.1) # top 10% with highest hwy

3.3 rename() and relocate() — Rename and relocate columns

# Rename
rename(mpg, brand = manufacturer, model = model)

# Relocate (move columns)
relocate(mpg, model, manufacturer, .before = cty)
relocate(mpg, everything(), .after = last_col())  # move everything to the end

4. Joins: Combining Data Frames

dplyr offers functions to combine data frames based on key columns.

Types of joins:

  • inner_join(): only rows with matches in both tables.
  • left_join(): all rows from the left table, plus matches from the right.
  • right_join(): all rows from the right table, plus matches from the left.
  • full_join(): all rows from both tables.
  • semi_join(): rows from the left that have a match in the right (does not bring columns from the right).
  • anti_join(): rows from the left that do NOT have a match in the right.

Syntax:

inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)

Example:

# Create manufacturer table with countries
manufacturers <- tibble(
  manufacturer = c("audi", "chevrolet", "ford", "honda"),
  country = c("Germany", "USA", "USA", "Japan")
)

# Join with mpg
mpg_enriched <- left_join(mpg, manufacturers, by = "manufacturer")

# If key column names are different:
# left_join(mpg, manufacturers, by = c("manufacturer" = "brand"))

5. The Power of the Pipe %>%

The pipe operator (%>%) allows you to chain operations in a readable way, passing the result of one function as the first argument to the next.

Without pipe:

summarise(
  group_by(
    filter(mpg, cty > 20),
    manufacturer
  ),
  avg_hwy = mean(hwy)
)

With pipe:

mpg %>%
  filter(cty > 20) %>%
  group_by(manufacturer) %>%
  summarise(avg_hwy = mean(hwy), .groups = "drop")

Advantages:

  • Left-to-right, top-to-bottom reading.
  • Fewer nested parentheses.
  • Easy to debug (comment out intermediate lines).
  • Compatible with functions from other packages.

💡 RStudio Shortcut: Ctrl + Shift + M (Windows/Linux) or Cmd + Shift + M (Mac) to insert %>%.


6. Comprehensive Practical Case

Objective: Analyze the mpg dataset to find manufacturers with the best average efficiency, excluding low-displacement models.

library(tidyverse)
data(mpg, package = "ggplot2")

results <- mpg %>%
  # Filter: only cars with 4 or more cylinders
  filter(cyl >= 4) %>%

  # Create average efficiency column
  mutate(efficiency = (cty + hwy) / 2) %>%

  # Group by manufacturer
  group_by(manufacturer) %>%

  # Summarize metrics
  summarise(
    avg_efficiency = mean(efficiency),
    total_models = n(),
    max_hwy = max(hwy),
    .groups = "drop"
  ) %>%

  # Sort by efficiency descending
  arrange(desc(avg_efficiency)) %>%

  # Select relevant columns
  select(manufacturer, avg_efficiency, total_models)

# Display results
print(results)

7. Best Practices and Tips

  1. Use pipes to chain operations: Improves readability.
  2. Name your columns well: Use snake_case for consistency.
  3. Avoid overwriting the original dataset: Use descriptive names or work with copies.
  4. Use glimpse() or View() to inspect: Before and after complex manipulations.
  5. Group with purpose: Always release grouping with .groups = "drop" in summarise().
  6. Comment your code: Explain why you do each step, not just what you do.
  7. Test with subsets: Before running on the full dataset, test with head() or slice_sample().
  8. Handle NA explicitly: Use na.rm = TRUE in functions like mean(), or filter with !is.na().

8. Proposed Exercises

  1. From the starwars dataset (included in dplyr), filter characters with known mass and height > 170, select name, height, mass, and species, and sort them by mass descending.
  2. Create a new column in mpg indicating whether the car is “efficient” (avg_mpg > 25) or not, using mutate() and if_else().
  3. Calculate the average hwy by car class (class) and number of cylinders (cyl). Use group_by() and summarise().
  4. Join the band_members dataset with band_instruments (both in dplyr) using left_join(). What happens if you use inner_join()?
  5. Use slice_max() to get the 3 cars with the highest cty for each manufacturer.

9. Additional Resources


You have completed Unit 2.2! You can now fluently manipulate, transform, and summarize any data frame using dplyr. In the next unit, you will learn to clean and restructure data with tidyr.

Course Info

Course: R-zero-to-hero

Language: EN

Lesson: Module06