Estimated Duration: 1 week (4–6 hours)
Objective: Master modern techniques for importing, managing, and exporting data in multiple formats using the tidyverse ecosystem and complementary tools, ensuring reproducibility and efficiency.
In real-world data science practice, up to 80% of time is spent preparing and managing data. The ability to read and write data reliably, quickly, and reproducibly is a fundamental skill.
R offers multiple ways to import data, but in this course we prioritize tidyverse tools for their speed, consistent syntax, and predictable behavior.
readr
The readr package, part of the tidyverse, replaces base functions like read.csv() with faster and more predictable versions.
library(readr)
# Standard CSV (comma as separator)
data <- read_csv("path/to/file.csv")
# TSV (tab as separator)
data <- read_tsv("path/to/file.tsv")
# Custom delimiter (semicolon, pipe, etc.)
data <- read_delim("path/to/file.txt", delim = ";")
# Read directly from URL
data <- read_csv("https://example.com/data.csv ")
Specifying Column Types with col_types
By default, readr guesses each column's type from the first rows. For predictable, reproducible imports, declare the types explicitly with col_types and list the strings that mark missing values (e.g. na = c("", "NA")).
data <- read_csv("file.csv",
  na = c("", "NA"),
  col_types = cols(
    name = col_character(),
    age = col_integer(),
    salary = col_double(),
    date = col_date(format = "%d/%m/%Y")
  )
)
# View parsing warnings
problems(data)
# Read while silencing parsing warnings (not recommended for production)
data <- suppressWarnings(read_csv("file.csv"))
Files with .xls and .xlsx extensions are common in business environments. We use the readxl package.
library(readxl)
# Read sheet by name
data <- read_excel("file.xlsx", sheet = "Sheet1")
# Read sheet by index
data <- read_excel("file.xlsx", sheet = 1)
# Specify cell range
data <- read_excel("file.xlsx", range = "A1:D100")
# Skip initial rows (misplaced headers)
data <- read_excel("file.xlsx", skip = 3)
# Read only specific columns (by spreadsheet column letters)
data <- read_excel("file.xlsx", range = cell_cols("A:D"))
✅ Advantages: readxl reads both .xls and .xlsx, requires no external dependencies (no Java or Excel installation), and works identically across platforms.
For files from other statistical software (SPSS, Stata, SAS), we use the haven package, also part of the tidyverse.
library(haven)
# SPSS (.sav)
data <- read_sav("file.sav")
# Stata (.dta)
data <- read_dta("file.dta")
# SAS (.sas7bdat, .xpt)
data <- read_sas("file.sas7bdat")
data_xpt <- read_xpt("file.xpt")
📝 Note: haven preserves variable and value labels, stored as attributes. To use them as factors:
data$variable <- as_factor(data$variable)
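To convert every labelled column in one step, a minimal sketch assuming dplyr is available:
library(dplyr)
# Turn all labelled columns into factors at once
data <- data %>%
  mutate(across(where(haven::is.labelled), haven::as_factor))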
For large datasets or production environments, it's common to connect directly to databases using SQL.
DBI + RSQLite (for local databases)
library(DBI)
# Connect to SQLite database
con <- dbConnect(RSQLite::SQLite(), "my_database.db")
# List tables
dbListTables(con)
# Read entire table
data <- dbReadTable(con, "table_name")
# Execute SQL query
data <- dbGetQuery(con, "SELECT * FROM table WHERE age > 30")
# Close connection
dbDisconnect(con)
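When a query depends on runtime values, prefer a parameterized query over pasting the value into the SQL string; a minimal sketch using DBI's params argument:
# The driver substitutes the placeholder safely (prevents SQL injection)
min_age <- 30
data <- dbGetQuery(con, "SELECT * FROM table_name WHERE age > ?", params = list(min_age))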
odbc (for SQL Server, PostgreSQL, MySQL, etc.)
library(odbc)
# Connect (requires installed driver)
con <- dbConnect(odbc(),
  driver = "SQL Server",
  server = "my-server",
  database = "my_db",
  uid = "user",
  pwd = "password"
)
# Read data
data <- dbGetQuery(con, "SELECT TOP 100 * FROM sales")
dbDisconnect(con)
💡 Tip: Use dbplyr to write queries with dplyr syntax that are automatically translated to SQL.
library(dplyr)
library(dbplyr)
remote_table <- tbl(con, "sales")
result <- remote_table %>%
  filter(sales > 1000) %>%
  group_by(region) %>%
  summarise(total = sum(sales)) %>%
  collect() # Brings data into R
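You can inspect the SQL that dbplyr generates before pulling anything with show_query():
# Show the translated SQL without executing collect()
remote_table %>%
  filter(sales > 1000) %>%
  group_by(region) %>%
  summarise(total = sum(sales)) %>%
  show_query()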
httr and jsonlite
library(httr)
library(jsonlite)
# Make GET request
response <- GET("https://api.example.com/data ")
# Check status
response$status_code
# Convert body to JSON
json_data <- content(response, "text")
data <- fromJSON(json_data, simplifyDataFrame = TRUE)
# Or directly:
data <- fromJSON("https://api.example.com/data ")
rvest
library(rvest)
# Read HTML page
page <- read_html("https://example.com/table ")
# Extract HTML table
table <- page %>%
  html_node("table") %>%
  html_table()
# Extract text by CSS selectors
titles <- page %>%
  html_nodes(".title") %>%
  html_text()
⚠️ Important: Respect robots.txt and the site's terms of use. Do not perform aggressive scraping.
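A simple way to stay polite is to pause between requests; a minimal sketch with hypothetical URLs:
# Scrape several pages with a fixed delay between requests
urls <- c("https://example.com/page1", "https://example.com/page2")
pages <- lapply(urls, function(u) {
  Sys.sleep(2) # wait 2 seconds so we don't hammer the server
  read_html(u)
})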
readr
# Export to CSV
write_csv(data, "clean_data.csv")
# Export to TSV
write_tsv(data, "clean_data.tsv")
# With compression
write_csv(data, "clean_data.csv.gz")
The .rds format is ideal for saving R objects exactly as they are (structure, classes, attributes).
# Save
saveRDS(data, "processed_data.rds")
# Load
data <- readRDS("processed_data.rds")
✅ Advantages: preserves classes and attributes exactly, loads faster than re-parsing text, and files are compressed by default.
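A quick check of that claim, as a minimal sketch (assumes readr is loaded for the CSV comparison):
# Classes survive an .rds round trip but not a CSV one
df <- data.frame(x = factor(c("a", "b")), d = as.Date("2024-01-15"))
saveRDS(df, "tmp.rds")
str(readRDS("tmp.rds")) # x is still a factor, d is still a Date
write_csv(df, "tmp.csv")
str(read_csv("tmp.csv")) # x comes back as plain character; the factor is lost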
With the writexl package (lightweight, no dependencies):
library(writexl)
write_xlsx(data, "report.xlsx")
# Multiple sheets
write_xlsx(list(Sheet1 = data1, Sheet2 = data2), "report.xlsx")
With haven:
# To SPSS
write_sav(data, "data.sav")
# To Stata
write_dta(data, "data.dta")
here::here() for Relative Paths
Avoid absolute paths that break when sharing the project.
library(here)
# Path relative to project directory
data <- read_csv(here("data", "raw", "sales.csv"))
# Save
write_csv(clean_data, here("data", "clean", "sales_clean.csv"))
✅ Recommended Project Structure:
my_project/
├── data/
│   ├── raw/
│   └── clean/
├── scripts/
├── output/
├── reports/
└── my_project.Rproj
Creating an .Rproj file allows RStudio to automatically set the working directory when opening the project, eliminating the need for setwd().
✅ Best Practices:
Use readr, readxl, and haven instead of base functions.
Specify col_types when you know the data structure (avoids surprises).
Save intermediate results as .rds to speed up future loads.
Build paths with here::here().
If a file has encoding problems, declare the encoding explicitly: read_csv("file.csv", locale = locale(encoding = "latin1"))
❌ Avoid:
setwd(): it breaks reproducibility.
Reading a file with read.csv() after saving it as CSV from Excel: it introduces formatting errors.
Practical Exercise
Objective: Create a reproducible script that:
1. Imports data with read_csv(), specifying column types.
2. Cleans it with dplyr (filter, rename, mutate).
3. Exports the results to .rds and .csv.
4. Builds all paths with here::here().
# process_data.R
library(tidyverse)
library(here)
# 1. Import
url <- "https://raw.githubusercontent.com/data-example/sales/master/sales_2023.csv "
sales <- read_csv(url,
col_types = cols(
id = col_integer(),
date = col_date(format = "%Y-%m-%d"),
product = col_factor(),
units = col_integer(),
price = col_double()
)
)
# 2. Clean
clean_sales <- sales %>%
  filter(!is.na(price), units > 0) %>%
  mutate(revenue = units * price) %>%
  rename(Product = product, Units = units, Unit_Price = price)
# 3. Export
write_rds(clean_sales, here("data", "clean", "sales_2023_clean.rds"))
write_csv(clean_sales, here("data", "clean", "sales_2023_clean.csv"))
# Success message
cat("โ
Data processed and saved to:", here("data", "clean"), "\n")
✅ Upon completing this unit, you will be able to import and export data to and from multiple sources with confidence, speed, and reproducibility, ready to integrate them into analysis and modeling workflows.