11 Data Exploration with R

This chapter is among the “stubs” I’m slowly working on. There isn’t much in this one, but still enough that I figured it was worth making it public.

11.1 Common Exploration Commands

https://www.r-bloggers.com/2018/11/explore-your-dataset-in-r/

Nearly the opposite of SPSS, R is a “quiet” language that only talks back to you when you explicitly ask it to. This isn’t always desirable when you’re exploring data since there may be aspects of the exploration you would have forgotten or not thought of doing if you hadn’t seen the output first. At least R makes up a bit for it with some pretty pithy commands—and even a few lengthy outputs—that we’ll cover here.

We’ll use some pre-existing data packaged with R for our examples here. Running data() with empty parentheses lists which data sets are available; below, we load the infert data set by name and look at its first rows:

data(infert)
head(infert)

  education age parity induced case spontaneous stratum pooled.stratum
1    0-5yrs  26      6       1    1           2       1              3
2    0-5yrs  42      1       1    1           0       2              1
3    0-5yrs  39      6       2    1           0       3              4
4    0-5yrs  34      4       2    1           0       4              2
5   6-11yrs  35      3       1    1           1       5             32
6   6-11yrs  36      4       2    1           1       6             36

summary(infert)

   education        age            parity         induced      
 0-5yrs : 12   Min.   :21.00   Min.   :1.000   Min.   :0.0000  
 6-11yrs:120   1st Qu.:28.00   1st Qu.:1.000   1st Qu.:0.0000  
 12+ yrs:116   Median :31.00   Median :2.000   Median :0.0000  
               Mean   :31.50   Mean   :2.093   Mean   :0.5726  
               3rd Qu.:35.25   3rd Qu.:3.000   3rd Qu.:1.0000  
               Max.   :44.00   Max.   :6.000   Max.   :2.0000  
      case         spontaneous        stratum      pooled.stratum 
 Min.   :0.0000   Min.   :0.0000   Min.   : 1.00   Min.   : 1.00  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:21.00   1st Qu.:19.00  
 Median :0.0000   Median :0.0000   Median :42.00   Median :36.00  
 Mean   :0.3347   Mean   :0.5766   Mean   :41.87   Mean   :33.58  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:62.25   3rd Qu.:48.25  
 Max.   :1.0000   Max.   :2.0000   Max.   :83.00   Max.   :63.00

dplyr::glimpse(infert)

Rows: 248
Columns: 8
$ education      <fct> 0-5yrs, 0-5yrs, 0-5yrs, 0-5yrs, 6-11yrs, 6-11yrs, 6-11y…
$ age            <dbl> 26, 42, 39, 34, 35, 36, 23, 32, 21, 28, 29, 37, 31, 29,…
$ parity         <dbl> 6, 1, 6, 4, 3, 4, 1, 2, 1, 2, 2, 4, 1, 3, 2, 2, 5, 1, 3…
$ induced        <dbl> 1, 1, 2, 2, 1, 2, 0, 0, 0, 0, 1, 2, 1, 2, 1, 2, 2, 0, 2…
$ case           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ spontaneous    <dbl> 2, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1…
$ stratum        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ pooled.stratum <dbl> 3, 1, 4, 2, 32, 36, 6, 22, 5, 19, 20, 37, 9, 29, 21, 18…

# skimr::skim(infert)
# DataExplorer::create_report(infert)

The data() command can be used both to load data into R and, with the parentheses left blank, to list whatever data sets are currently available. Please note that this list covers the data sets that came pre-installed with R as well as those in any packages you’ve attached with library().
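As a small sketch of both uses of data(), the snippet below restricts the listing to the datasets package that ships with R and reads it as a matrix rather than opening the usual pager. The variable name listing is just an illustrative choice.

```r
# data() without a data set name returns a listing object; its $results
# component is a character matrix with one row per data set.
listing <- data(package = "datasets")$results

# Show the name and short description of the first few entries
head(listing[, c("Item", "Title")])

# Check that infert is in the listing, then load it by name
"infert" %in% listing[, "Item"]
data(infert)
dim(infert)
```

Dropping the package argument (plain data()) widens the listing to every attached package.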

This Statistics Globe page provides several R commands that are handy for data exploration. More methods for visualizing the data while you explore it are given in this Towards Data Science page.
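In the same spirit, here is a short sketch of a few more base R commands that are often useful on a first pass through a data set. These are my own picks for illustration, not necessarily the ones covered on the pages mentioned above, and edu_counts is just an illustrative name.

```r
data(infert)

str(infert)                 # compact structure: type and first values of each column
dim(infert)                 # number of rows and columns
colSums(is.na(infert))      # count of missing values in each column

# Frequency table for a factor, then a two-way table against case status
edu_counts <- table(infert$education)
edu_counts
table(infert$education, infert$case)

# Summary of one numeric column, split by a grouping factor
tapply(infert$age, infert$education, summary)
```

None of these needs a package beyond what ships with R, so they work in a fresh session.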

11.2 Using SQL and tidyverse

This R Views newsletter post by Vachharajani presents a nice overview of using SQL and the tidyverse “ecosystem” of packages for R.

SQL is a venerable programming language used to manage and manipulate data, especially very large sets of data. R holds the data it works on in RAM, and so can flounder with very large data sets; using SQL can thus help. There are several ways to use SQL and R together. The two most common are to prepare the data in SQL first and then export it (or parts of it) into R, or to query the SQL-managed data from within R.
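To give a feel for the second approach (querying from within R), here is a minimal sketch that assumes the DBI and RSQLite packages are installed. It uses a throwaway in-memory SQLite database as a stand-in for a real database server, and the table and variable names are only illustrative. With a real, large database the data would already live there; copying infert in is just so the example is self-contained.

```r
library(DBI)

data(infert)

# Open a temporary in-memory SQLite database (an assumption for this sketch;
# a real project would connect to an existing database instead)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy the data frame into the database as a table named "infert"
dbWriteTable(con, "infert", infert)

# Let the database do the aggregation, and pull only the small result into R
by_edu_sql <- dbGetQuery(con, "
  SELECT education, COUNT(*) AS n, AVG(age) AS mean_age
  FROM infert
  GROUP BY education
")
by_edu_sql

dbDisconnect(con)
```

The point of the pattern is that only the three-row summary crosses into R's memory, not the full table.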

tidyverse is a set of packages designed to make common tasks, especially the manipulation and presentation of data, both more flexible and more intuitive. Intuition is a relative thing, though: as much as the tidyverse grammar and even vocabulary do make sense, they take some learning to understand, and that learning comes on top of learning core R syntax and grammar.
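To illustrate the difference in "grammar", here is a small sketch that assumes the dplyr package (part of the tidyverse) is installed. It computes the number of rows and the mean age per education level, first in tidyverse style and then with a base R near-equivalent. The name by_education is just an illustrative choice.

```r
library(dplyr)

data(infert)

# tidyverse style: read the pipeline top to bottom as a series of verbs
by_education <- infert %>%
  group_by(education) %>%
  summarise(n = n(), mean_age = mean(age))
by_education

# A base R near-equivalent for the mean (formula interface of aggregate)
aggregate(age ~ education, data = infert, FUN = mean)
```

Both give the same means; which one reads as more intuitive depends largely on which style you learned first.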