The first step for any R analysis is getting data into an R session. While it sounds like this should be easy, data loading can be quite a hassle due to an unending variety of data formats, which are often only weakly specified. Due to this, data loading can be quite frustrating, especially when one is just getting started.
Nevertheless, there is a variety of packages which can help us with this step. Let’s try two of the most common formats:
Writing to and loading data from an excel spreadsheet
Sometimes we want to send our data to a colleague, who might only be familiar with excel. So as a first step, we’re going to write the gapminder data to an excel spreadsheet. Let’s have a look at our data first:
library(gapminder)gapminder
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
We can see that there’s 1704 rows (observations) and 6 columns (variables). Let’s write it to an excel file:
In our working directory, there should now be the gapminder.xlsx file, containing a sheet with our data frame. There’s a variety of formatting options, but usually you should regard excel as a mere transport format and not as the final styled output.
Having created an excel file, we can also go the other way. Let’s read our data back into an R data.frame. For reading excel files, I would usually recommend the readxl package, which is part of the extended tidyverse:
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
Writing to and loading data from a csv file
One of the most wide-spread formats is .csv (for comma-separated values). It is a very simple and lightweight but also somewhat restrictive and underspecified format (e.g. regarding delimiter, headers, data types, sentinel values, etc.).
While there’s a built-in function to read from csv, we’re again going to rely on the tidyverse here, which contains the readr package, providing this functionality.
library(readr)
We’ll follow the same procedure as before, first writing the gapminder data to a csv file and then reading it back again:
write_csv(gapminder, file ="data/gapminder.csv")
There should now be a gapminder.csv file in your working directory, which you can also view with RStudio (click it in the file viewer).
Let’s read it back in again:
df2 <-read_csv("data/gapminder.csv")
Rows: 1704 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
dbl (4): year, lifeExp, pop, gdpPercap
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df2
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
When loading a csv file with read_csv, we get some information about the automatic detection of parsing specifications (e.g. the detected delimiter and column types).
Exercises
Identify the data format of the file data.csv and find a function/package to load your data into R.
Identify the correct specification and arguments for loading your data and load your data into a data.frame.