Data wrangling refers to all the steps that need to be taken before the actual analysis can be conducted. As such, it is often considered a nuisance that however usually takes a considerable share of the overall effort and code in a research project.
Data wrangling comprises, e.g., filtering out bad data, selecting a subset of available data, reshaping data into a more useful format, or computing derivative measures and summaries. In the following, we will look at some of the most frequently needed operations and how to apply them using the tidyverse.
This tutorial only gives a glimpse at the possibilities of data wrangling in R. For a much more comprehensive introduction see the great and free R for Data Science online book.
In the following, we will again make use of the gapminder dataset as well as the tidyverse:
library(gapminder)library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
gapminder
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
Filtering rows
Oftentimes, we’re only interested in a subset of the observations contained in our data. Say, for example, we only want data on Germany since the 1990s. We can use the filter function to achieve this:
filter(gapminder, country =="Germany", year >=1990)
# A tibble: 4 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Germany Europe 1992 76.1 80597764 26505.
2 Germany Europe 1997 77.3 82011073 27789.
3 Germany Europe 2002 78.7 82350671 30036.
4 Germany Europe 2007 79.4 82400996 32170.
Selecting and renaming variables
Similarly, we’re often only interested in a subset of the variables. We can use the select command to select only some variables and also to rename them:
Often, we want to group our data according to one variable (e.g., the continent) and then compute a summary for each of the groups based on another variable (e.g., the mean life expectancy). We can do this by first creating a grouped version of our data using group_by and then using summarize to compute a summary for each group:
# A tibble: 5 × 2
continent lifeExp_mean
<fct> <dbl>
1 Africa 48.9
2 Americas 64.7
3 Asia 60.1
4 Europe 71.9
5 Oceania 74.3
Note that this drops all other variables except for the grouping variable and the summary variable.
Data pipelines
Typically, data wrangling implies combining multiple of these steps. R allows us to build data pipelines with the pipe operator |> which pipes the output of the operation to the left into the first argument of the operation to the right. Instead of writing filter(data, year > 1990) we could for example write data |> filter(year > 1990).
filter(gapminder, year >1990)
# A tibble: 568 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1992 41.7 16317921 649.
2 Afghanistan Asia 1997 41.8 22227415 635.
3 Afghanistan Asia 2002 42.1 25268405 727.
4 Afghanistan Asia 2007 43.8 31889923 975.
5 Albania Europe 1992 71.6 3326498 2497.
6 Albania Europe 1997 73.0 3428038 3193.
7 Albania Europe 2002 75.7 3508512 4604.
8 Albania Europe 2007 76.4 3600523 5937.
9 Algeria Africa 1992 67.7 26298373 5023.
10 Algeria Africa 1997 69.2 29072015 4797.
# ℹ 558 more rows
On its own, this is not very useful. However, this syntax allows us to chain multiple operations from left to right or from top to bottom, which is easier to read than nested functions. consider for example the following:
select(filter(gapminder, year >1990, country =="Germany"), year, lifeExp)
Reshape the gapminder data into a country-by-year matrix, i.e., a data frame where each row is a country, each column is a year and each cell contains the GDP for that country-year pair.