R basics III

Data wrangling

Data wrangling refers to all the steps that need to be taken before the actual analysis can be conducted. As such, it is often considered a nuisance that however usually takes a considerable share of the overall effort and code in a research project.

Data wrangling comprises, e.g., filtering out bad data, selecting a subset of available data, reshaping data into a more useful format, or computing derivative measures and summaries. In the following, we will look at some of the most frequently needed operations and how to apply them using the tidyverse.

This tutorial only gives a glimpse at the possibilities of data wrangling in R. For a much more comprehensive introduction see the great and free R for Data Science online book.

In the following, we will again make use of the gapminder dataset as well as the tidyverse:

library(gapminder)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

gapminder

# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

Filtering rows

Oftentimes, we’re only interested in a subset of the observations contained in our data. Say, for example, we only want data on Germany since the 1990s. We can use the filter function to achieve this:

filter(gapminder, country == "Germany", year >= 1990)

# A tibble: 4 × 6
  country continent  year lifeExp      pop gdpPercap
  <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
1 Germany Europe     1992    76.1 80597764    26505.
2 Germany Europe     1997    77.3 82011073    27789.
3 Germany Europe     2002    78.7 82350671    30036.
4 Germany Europe     2007    79.4 82400996    32170.

Selecting and renaming variables

Similarly, we’re often only interested in a subset of the variables. We can use the select command to select only some variables and also to rename them:

select(gapminder, country, year, life_expectancy=lifeExp)

# A tibble: 1,704 × 3
   country      year life_expectancy
   <fct>       <int>           <dbl>
 1 Afghanistan  1952            28.8
 2 Afghanistan  1957            30.3
 3 Afghanistan  1962            32.0
 4 Afghanistan  1967            34.0
 5 Afghanistan  1972            36.1
 6 Afghanistan  1977            38.4
 7 Afghanistan  1982            39.9
 8 Afghanistan  1987            40.8
 9 Afghanistan  1992            41.7
10 Afghanistan  1997            41.8
# ℹ 1,694 more rows

Group summaries

Often, we want to group our data according to one variable (e.g., the continent) and then compute a summary for each of the groups based on another variable (e.g., the mean life expectancy). We can do this by first creating a grouped version of our data using group_by and then using summarize to compute a summary for each group:

data_grouped <- group_by(gapminder, continent)
summarize(data_grouped, lifeExp_mean = mean(lifeExp))

# A tibble: 5 × 2
  continent lifeExp_mean
  <fct>            <dbl>
1 Africa            48.9
2 Americas          64.7
3 Asia              60.1
4 Europe            71.9
5 Oceania           74.3

Note that this drops all other variables except for the grouping variable and the summary variable.

Data pipelines

Typically, data wrangling implies combining multiple of these steps. R allows us to build data pipelines with the pipe operator |> which pipes the output of the operation to the left into the first argument of the operation to the right. Instead of writing filter(data, year > 1990) we could for example write data |> filter(year > 1990).

filter(gapminder, year > 1990)

# A tibble: 568 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1992    41.7 16317921      649.
 2 Afghanistan Asia       1997    41.8 22227415      635.
 3 Afghanistan Asia       2002    42.1 25268405      727.
 4 Afghanistan Asia       2007    43.8 31889923      975.
 5 Albania     Europe     1992    71.6  3326498     2497.
 6 Albania     Europe     1997    73.0  3428038     3193.
 7 Albania     Europe     2002    75.7  3508512     4604.
 8 Albania     Europe     2007    76.4  3600523     5937.
 9 Algeria     Africa     1992    67.7 26298373     5023.
10 Algeria     Africa     1997    69.2 29072015     4797.
# ℹ 558 more rows

On its own, this is not very useful. However, this syntax allows us to chain multiple operations from left to right or from top to bottom, which is easier to read than nested functions. consider for example the following:

select(filter(gapminder, year > 1990, country == "Germany"), year, lifeExp)

# A tibble: 4 × 2
   year lifeExp
  <int>   <dbl>
1  1992    76.1
2  1997    77.3
3  2002    78.7
4  2007    79.4

# is equivalent to

gapminder |>
  filter(year > 1990, country == "Germany") |>
  select(year, lifeExp)

# A tibble: 4 × 2
   year lifeExp
  <int>   <dbl>
1  1992    76.1
2  1997    77.3
3  2002    78.7
4  2007    79.4

Exercises

Build a data pipeline that computes the mean life expectancy and the mean per-capita GDP in Europe for each 5-year period since the 1980s.

Solution

gapminder |>
  filter(year >= 1980) |>
  group_by(country, year) |>
  summary(
    lifeExp_mean=mean(lifeExp),
    gdpPercap_mean=mean(gdpPercap))

Reshape the gapminder data into a country-by-year matrix, i.e., a data frame where each row is a country, each column is a year and each cell contains the GDP for that country-year pair.

Solution

gapminder |>
  select(country, year, gdpPercap) |>
  pivot_wider(names_from=year, values_from=gdpPercap)

How many entries in gapminder are from after 1960 and come from OECD countries?

Here’s a list of all countries in the OECD: “Australia”, “Austria”, “Belgium”, “Canada”, “Chile”, “Colombia”, “Costa Rica”, “Czech Republic”, “Denmark”, “Estonia”, “Finland”, “France”, “Germany”, “Greece”, “Hungary”, “Iceland”, “Ireland”, “Israel”, “Italy”, “Japan”, “Korea”, “Latvia”, “Lithuania”, “Luxembourg”, “Mexico”, “Netherlands”, “New Zealand”, “Norway”, “Poland”, “Portugal”, “Slovak Republic”, “Slovenia”, “Spain”, “Sweden”, “Switzerland”, “Turkey”, “United Kingdom”, “United States”

Solution

oecd <- c("Australia", "Austria", "Belgium", "Canada", "Chile", "Colombia", "Costa Rica", "Czech Republic", "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary", "Iceland", "Ireland", "Israel", "Italy", "Japan", "Korea", "Latvia", "Lithuania", "Luxembourg", "Mexico", "Netherlands", "New Zealand", "Norway", "Poland", "Portugal", "Slovak Republic", "Slovenia", "Spain", "Sweden", "Switzerland", "Turkey", "United Kingdom", "United States")

gapminder |> filter(year > 1960, country %in% oecd) |> nrow()