R Basics II

Loading data

The first step for any R analysis is getting data into an R session. While it sounds like this should be easy, data loading can be quite a hassle due to an unending variety of data formats, which are often only weakly specified. Due to this, data loading can be quite frustrating, especially when one is just getting started.

Nevertheless, there is a variety of packages which can help us with this step. Let’s try two of the most common formats:

Writing to and loading data from an excel spreadsheet

Sometimes we want to send our data to a colleague, who might only be familiar with excel. So as a first step, we’re going to write the gapminder data to an excel spreadsheet. Let’s have a look at our data first:

library(gapminder)
gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

We can see that there’s 1704 rows (observations) and 6 columns (variables). Let’s write it to an excel file:

library(openxlsx)

write.xlsx(gapminder,
           file = "data/gapminder.xlsx",
           colNames=TRUE)

In our working directory, there should now be the gapminder.xlsx file, containing a sheet with our data frame. There’s a variety of formatting options, but usually you should regard excel as a mere transport format and not as the final styled output.

Having created an excel file, we can also go the other way. Let’s read our data back into an R data.frame. For reading excel files, I would usually recommend the readxl package, which is part of the extended tidyverse:

library(readxl)

df <- read_excel("data/gapminder.xlsx")
df
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

Writing to and loading data from a csv file

One of the most wide-spread formats is .csv (for comma-separated values). It is a very simple and lightweight but also somewhat restrictive and underspecified format (e.g. regarding delimiter, headers, data types, sentinel values, etc.).

While there’s a built-in function to read from csv, we’re again going to rely on the tidyverse here, which contains the readr package, providing this functionality.

library(readr)

We’ll follow the same procedure as before, first writing the gapminder data to a csv file and then reading it back again:

write_csv(gapminder, file = "data/gapminder.csv")

There should now be a gapminder.csv file in your working directory, which you can also view with RStudio (click it in the file viewer).

Let’s read it back in again:

df2 <- read_csv("data/gapminder.csv")
Rows: 1704 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
dbl (4): year, lifeExp, pop, gdpPercap

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df2
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

When loading a csv file with read_csv, we get some information about the automatic detection of parsing specifications (e.g. the detected delimiter and column types).

Exercises

  1. Identify the data format of the file data.csv and find a function/package to load your data into R.
  2. Identify the correct specification and arguments for loading your data and load your data into a data.frame.
Solution
library(readr)
data <- read_csv2("data/data.csv")