dplyr is one of the most popular and powerful packages in the tidyverse ecosystem, specifically designed for efficient and readable data frame manipulation. It was created by Hadley Wickham and the RStudio (now Posit) team with the goal of simplifying the most common operations in data analysis through an intuitive and consistent syntax.
%>% operator to chain operations in a readable way.dbplyr.✅ Prerequisite: Make sure you have loaded the
tidyversepackage or at leastdplyr.library(tidyverse) # or library(dplyr)
filter() — Select rows by conditionThe filter() function allows you to select rows from a data frame that meet one or more logical conditions.
Basic syntax:
filter(.data, ..., .preserve = FALSE)
Examples:
# We'll use the mpg dataset from the ggplot2 package
data(mpg, package = "ggplot2")
# Cars with more than 30 mpg in the city
filter(mpg, cty > 30)
# Compact class cars with 4 cylinders
filter(mpg, class == "compact", cyl == 4)
# Cars with 4 or 8 cylinders
filter(mpg, cyl %in% c(4, 8))
# Cars with automatic transmission (starting with "auto")
filter(mpg, grepl("^auto", trans))
# Negation: cars that are NOT suv
filter(mpg, class != "suv")
Useful logical operators:
==, !=, >, <, >=, <=& (and), | (or), ! (not)%in% (belongs to a vector)is.na(), !is.na() (for missing values)between(x, left, right) — shortcut for x >= left & x <= rightnear(x, y) — for comparing floating-point numbers with toleranceselect() — Select columns by nameselect() allows you to choose specific columns from a data frame, useful for reducing dimensionality or reordering variables.
Basic syntax:
select(.data, ...)
Examples:
# Select specific columns
select(mpg, manufacturer, model, cty, hwy)
# Select a range of columns
select(mpg, cty:hwy) # all columns between cty and hwy
# Exclude columns
select(mpg, -year, -cyl) # all except year and cyl
# Select columns matching a pattern
select(mpg, starts_with("c")) # starting with "c"
select(mpg, ends_with("y")) # ending with "y"
select(mpg, contains("man")) # containing "man"
select(mpg, matches("^[mh]")) # starting with m or h (regex)
# Rename and select at the same time
select(mpg, brand = manufacturer, model = model, city = cty)
# Reorder columns
select(mpg, model, manufacturer, everything()) # model and manufacturer first
Selection helper functions:
starts_with(), ends_with(), contains(), matches()num_range("x", 1:5) — selects x1, x2, ..., x5everything() — all remaining columnslast_col() — last columnmutate() — Create or modify columnsmutate() adds new columns or modifies existing ones, preserving the originals (unless overwritten).
Basic syntax:
mutate(.data, ..., .keep = "all", .before = NULL, .after = NULL)
Examples:
# Create a new column: average mpg
mpg <- mutate(mpg, avg_mpg = (cty + hwy) / 2)
# Modify an existing column
mpg <- mutate(mpg, cty = cty * 1.609) # convert to km/l
# Create multiple columns
mpg <- mutate(mpg,
ratio_hwy_cty = hwy / cty,
efficiency = case_when(
avg_mpg > 25 ~ "High",
avg_mpg > 20 ~ "Medium",
TRUE ~ "Low"
)
)
# Use window functions
mpg <- mutate(mpg,
row_id = row_number(),
cty_rank = min_rank(cty),
cty_dense = dense_rank(cty),
cty_cumsum = cumsum(cty)
)
# Remove original columns after mutating (R 1.0+)
mpg <- mutate(mpg, avg_mpg = (cty + hwy)/2, .keep = "used") # keeps only cty, hwy, avg_mpg
Useful functions within mutate:
+, -, *, /, ^if_else(), case_when()str_to_upper(), str_c()year(), month(), ymd()lag(), lead(), cumsum(), row_number(), min_rank()arrange() — Sort rowsarrange() reorders the rows of a data frame based on one or more columns.
Basic syntax:
arrange(.data, ..., .by_group = FALSE)
Examples:
# Sort by one column (ascending by default)
arrange(mpg, cty)
# Descending order
arrange(mpg, desc(cty))
# Sort by multiple columns
arrange(mpg, class, desc(hwy)) # first by class, then by hwy descending
# Sort considering NA at the end
arrange(mpg, cty, .na_last = TRUE)
# Use expressions
arrange(mpg, manufacturer == "audi") # audi at the end (FALSE=0, TRUE=1)
summarise() — Summarize data (with group_by())summarise() collapses multiple values into a single summary. It is especially useful with group_by() for calculations by group.
Basic syntax:
summarise(.data, ..., .groups = NULL)
Examples:
# Global summary
summarise(mpg,
avg_cty = mean(cty),
max_hwy = max(hwy),
n = n()
)
# Summary by group
mpg %>%
group_by(manufacturer) %>%
summarise(
avg_cty = mean(cty),
total_cars = n(),
best_hwy = max(hwy),
.groups = "drop" # important: releases grouping
)
# Multiple levels of grouping
mpg %>%
group_by(manufacturer, class) %>%
summarise(
avg_mpg = mean((cty + hwy)/2),
n = n(),
.groups = "drop"
)
# Common summary functions:
# - mean(), median(), sd(), var(), min(), max()
# - sum(), n(), n_distinct()
# - first(), last(), nth()
# - quantile(), IQR()
⚠️ Important: Always use
.groups = "drop"at the end ofsummarise()to avoid warnings and unexpected behavior in recent versions of dplyr.
distinct() — Unique rowsRemoves duplicate rows.
# All columns
distinct(mpg)
# Only certain columns
distinct(mpg, manufacturer, model)
# Keep first occurrence
distinct(mpg, manufacturer, .keep_all = TRUE) # keeps entire row
slice() — Select rows by positionslice(mpg, 1:10) # first 10 rows
slice(mpg, n()) # last row
slice(mpg, n() - 5:n()) # last 6 rows
slice_sample(mpg, n = 5) # random sample of 5 rows
slice_min(mpg, cty, n = 3) # 3 rows with lowest cty
slice_max(mpg, hwy, prop = 0.1) # top 10% with highest hwy
rename() and relocate() — Rename and relocate columns# Rename
rename(mpg, brand = manufacturer, model = model)
# Relocate (move columns)
relocate(mpg, model, manufacturer, .before = cty)
relocate(mpg, everything(), .after = last_col()) # move everything to the end
dplyr offers functions to combine data frames based on key columns.
inner_join(): only rows with matches in both tables.left_join(): all rows from the left table, plus matches from the right.right_join(): all rows from the right table, plus matches from the left.full_join(): all rows from both tables.semi_join(): rows from the left that have a match in the right (does not bring columns from the right).anti_join(): rows from the left that do NOT have a match in the right.Syntax:
inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
Example:
# Create manufacturer table with countries
manufacturers <- tibble(
manufacturer = c("audi", "chevrolet", "ford", "honda"),
country = c("Germany", "USA", "USA", "Japan")
)
# Join with mpg
mpg_enriched <- left_join(mpg, manufacturers, by = "manufacturer")
# If key column names are different:
# left_join(mpg, manufacturers, by = c("manufacturer" = "brand"))
%>%The pipe operator (%>%) allows you to chain operations in a readable way, passing the result of one function as the first argument to the next.
Without pipe:
summarise(
group_by(
filter(mpg, cty > 20),
manufacturer
),
avg_hwy = mean(hwy)
)
With pipe:
mpg %>%
filter(cty > 20) %>%
group_by(manufacturer) %>%
summarise(avg_hwy = mean(hwy), .groups = "drop")
Advantages:
💡 RStudio Shortcut:
Ctrl + Shift + M(Windows/Linux) orCmd + Shift + M(Mac) to insert%>%.
Objective: Analyze the mpg dataset to find manufacturers with the best average efficiency, excluding low-displacement models.
library(tidyverse)
data(mpg, package = "ggplot2")
results <- mpg %>%
# Filter: only cars with 4 or more cylinders
filter(cyl >= 4) %>%
# Create average efficiency column
mutate(efficiency = (cty + hwy) / 2) %>%
# Group by manufacturer
group_by(manufacturer) %>%
# Summarize metrics
summarise(
avg_efficiency = mean(efficiency),
total_models = n(),
max_hwy = max(hwy),
.groups = "drop"
) %>%
# Sort by efficiency descending
arrange(desc(avg_efficiency)) %>%
# Select relevant columns
select(manufacturer, avg_efficiency, total_models)
# Display results
print(results)
snake_case for consistency.glimpse() or View() to inspect: Before and after complex manipulations..groups = "drop" in summarise().head() or slice_sample().na.rm = TRUE in functions like mean(), or filter with !is.na().starwars dataset (included in dplyr), filter characters with known mass and height > 170, select name, height, mass, and species, and sort them by mass descending.mpg indicating whether the car is “efficient” (avg_mpg > 25) or not, using mutate() and if_else().hwy by car class (class) and number of cylinders (cyl). Use group_by() and summarise().band_members dataset with band_instruments (both in dplyr) using left_join(). What happens if you use inner_join()?slice_max() to get the 3 cars with the highest cty for each manufacturer.✅ You have completed Unit 2.2! You can now fluently manipulate, transform, and summarize any data frame using dplyr. In the next unit, you will learn to clean and restructure data with tidyr.