Descriptive statistics in R, part II

Moving on from frequencies and tables, which were covered in part I, let’s now focus on other ways to summarize our data (e.g., mean, standard deviation). There are a lot of ways to divide a topic like descriptive statistics, and R can further complicate this seemingly simple task. It’s been said that R does a great job of making complex procedures simple, but it also has a tendency to make simple tasks complex. Because there are many methods of calculating descriptive statistics in R, and because some of these methods are more applicable in certain situations than others, there is a lot of material to cover. To keep the post length manageable, part II will cover some ways to calculate descriptive statistics in base R and some common functions in popular packages. In part III, we’ll focus on some other common approaches for obtaining descriptive statistics, and I’ll finish by discussing my overall thoughts on the packages covered in parts II and III.

For this post and the next, I will be using the gapminder dataset in the gapminder package. The dataframe contains data on life expectancy, GDP per capita, and population by country, continent, and year. It also contains 1704 observations and 6 variables. The data come from Gapminder, which is an independent Swedish foundation with no political, religious, or economic affiliation.

library(gapminder)
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Again, if you’re familiar with SPSS, you can get means, SDs, and pretty much all of your general descriptive statistics through the FREQUENCIES, DESCRIPTIVES (formerly CONDESCRIPTIVE), or EXAMINE commands. There are some functions in R whose output is pretty similar to what SPSS displays, but there are also a few cool functions that are exclusive to R.

Options in base R

R’s built-in function for univariate statistics is summary.

summary(gapminder)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

This output is pretty sparse. It lacks percentages for factors (i.e., country and continent), frequencies and percentages for numeric variables (e.g., lifeExp and gdpPercap), the number of nonmissing values, and so on. However, it is very compact. This compactness allows summary to work with a wide range of objects, whereas other functions (e.g., describe, covered below) only work with dataframes, vectors, matrices, or formulas. The numbers labeled “1st Qu.” and “3rd Qu.” are the first and third quartiles, or the 25th and 75th percentiles, respectively.

If the dataset contains any variable labels, those will be ignored. It also doesn’t show the overall count, but you could manually compute this by summing up the counts in the categorical variables. Then again, what’s the point of using a powerful program like R if you just end up manually computing things?! summary also doesn’t display standard deviation, and it’s not tidy.
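The gaps noted above (no count, no standard deviation) can be filled with a few lines of base R. This is only a minimal sketch, assuming the gapminder package is installed; the names num_cols and basic_stats are arbitrary choices for illustration.

```r
library(gapminder)

# Flag the numeric columns (year, lifeExp, pop, gdpPercap)
num_cols <- sapply(gapminder, is.numeric)

# Build one column of statistics per numeric variable
basic_stats <- sapply(gapminder[num_cols], function(x) {
  c(n       = sum(!is.na(x)),          # nonmissing count
    missing = sum(is.na(x)),           # missing count
    mean    = mean(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE),
    quantile(x, c(0.25, 0.75), na.rm = TRUE))  # 1st and 3rd quartiles
})

round(basic_stats, 2)
```

The result is a plain numeric matrix with statistics as rows and variables as columns, so it can be transposed with t() or converted with as.data.frame() if a different layout is preferred.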

If you need to summarize data by group, which often needs to be done, you can do this with base R as well. However, the output isn’t nicely formatted, which can make it difficult to compare statistics across groups, especially if there are a large number of categories (as there are in the country variable). Since there are so many categories in the example below, I wrapped the command in the head function. This allows me to conserve space by displaying only the first six groups of the output.

head(by(gapminder, gapminder$country, summary))
## $Afghanistan
##         country      continent       year         lifeExp     
##  Afghanistan:12   Africa  : 0   Min.   :1952   Min.   :28.80  
##  Albania    : 0   Americas: 0   1st Qu.:1966   1st Qu.:33.51  
##  Algeria    : 0   Asia    :12   Median :1980   Median :39.15  
##  Angola     : 0   Europe  : 0   Mean   :1980   Mean   :37.48  
##  Argentina  : 0   Oceania : 0   3rd Qu.:1993   3rd Qu.:41.70  
##  Australia  : 0                 Max.   :2007   Max.   :43.83  
##  (Other)    : 0                                               
##       pop             gdpPercap    
##  Min.   : 8425333   Min.   :635.3  
##  1st Qu.:11220245   1st Qu.:736.7  
##  Median :13473708   Median :803.5  
##  Mean   :15823715   Mean   :802.7  
##  3rd Qu.:17795294   3rd Qu.:852.6  
##  Max.   :31889923   Max.   :978.0  
##                                    
## 
## $Albania
##         country      continent       year         lifeExp     
##  Albania    :12   Africa  : 0   Min.   :1952   Min.   :55.23  
##  Afghanistan: 0   Americas: 0   1st Qu.:1966   1st Qu.:65.87  
##  Algeria    : 0   Asia    : 0   Median :1980   Median :69.67  
##  Angola     : 0   Europe  :12   Mean   :1980   Mean   :68.43  
##  Argentina  : 0   Oceania : 0   3rd Qu.:1993   3rd Qu.:72.24  
##  Australia  : 0                 Max.   :2007   Max.   :76.42  
##  (Other)    : 0                                               
##       pop            gdpPercap   
##  Min.   :1282697   Min.   :1601  
##  1st Qu.:1920079   1st Qu.:2451  
##  Median :2644572   Median :3253  
##  Mean   :2580249   Mean   :3255  
##  3rd Qu.:3351883   3rd Qu.:3658  
##  Max.   :3600523   Max.   :5937  
##                                  
## 
## $Algeria
##         country      continent       year         lifeExp     
##  Algeria    :12   Africa  :12   Min.   :1952   Min.   :43.08  
##  Afghanistan: 0   Americas: 0   1st Qu.:1966   1st Qu.:50.63  
##  Albania    : 0   Asia    : 0   Median :1980   Median :59.69  
##  Angola     : 0   Europe  : 0   Mean   :1980   Mean   :59.03  
##  Argentina  : 0   Oceania : 0   3rd Qu.:1993   3rd Qu.:68.10  
##  Australia  : 0                 Max.   :2007   Max.   :72.30  
##  (Other)    : 0                                               
##       pop             gdpPercap   
##  Min.   : 9279525   Min.   :2449  
##  1st Qu.:12320611   1st Qu.:3189  
##  Median :18593278   Median :4854  
##  Mean   :19875406   Mean   :4426  
##  3rd Qu.:26991784   3rd Qu.:5386  
##  Max.   :33333216   Max.   :6223  
##                                   
## 
## $Angola
##         country      continent       year         lifeExp     
##  Angola     :12   Africa  :12   Min.   :1952   Min.   :30.02  
##  Afghanistan: 0   Americas: 0   1st Qu.:1966   1st Qu.:35.49  
##  Albania    : 0   Asia    : 0   Median :1980   Median :39.69  
##  Algeria    : 0   Europe  : 0   Mean   :1980   Mean   :37.88  
##  Argentina  : 0   Oceania : 0   3rd Qu.:1993   3rd Qu.:40.73  
##  Australia  : 0                 Max.   :2007   Max.   :42.73  
##  (Other)    : 0                                               
##       pop             gdpPercap   
##  Min.   : 4232095   Min.   :2277  
##  1st Qu.: 5142106   1st Qu.:2725  
##  Median : 6589530   Median :3265  
##  Mean   : 7309390   Mean   :3607  
##  3rd Qu.: 9020747   3rd Qu.:4401  
##  Max.   :12420476   Max.   :5523  
##                                   
## 
## $Argentina
##         country      continent       year         lifeExp     
##  Argentina  :12   Africa  : 0   Min.   :1952   Min.   :62.48  
##  Afghanistan: 0   Americas:12   1st Qu.:1966   1st Qu.:65.51  
##  Albania    : 0   Asia    : 0   Median :1980   Median :69.21  
##  Algeria    : 0   Europe  : 0   Mean   :1980   Mean   :69.06  
##  Angola     : 0   Oceania : 0   3rd Qu.:1993   3rd Qu.:72.22  
##  Australia  : 0                 Max.   :2007   Max.   :75.32  
##  (Other)    : 0                                               
##       pop             gdpPercap    
##  Min.   :17876956   Min.   : 5911  
##  1st Qu.:22521614   1st Qu.: 7823  
##  Median :28162601   Median : 9069  
##  Mean   :28602240   Mean   : 8956  
##  3rd Qu.:34520076   3rd Qu.: 9602  
##  Max.   :40301927   Max.   :12779  
##                                    
## 
## $Australia
##         country      continent       year         lifeExp     
##  Australia  :12   Africa  : 0   Min.   :1952   Min.   :69.12  
##  Afghanistan: 0   Americas: 0   1st Qu.:1966   1st Qu.:71.06  
##  Albania    : 0   Asia    : 0   Median :1980   Median :74.11  
##  Algeria    : 0   Europe  : 0   Mean   :1980   Mean   :74.66  
##  Angola     : 0   Oceania :12   3rd Qu.:1993   3rd Qu.:77.88  
##  Argentina  : 0                 Max.   :2007   Max.   :81.23  
##  (Other)    : 0                                               
##       pop             gdpPercap    
##  Min.   : 8691212   Min.   :10040  
##  1st Qu.:11602940   1st Qu.:13949  
##  Median :14629150   Median :18906  
##  Mean   :14649312   Mean   :19981  
##  3rd Qu.:17752794   3rd Qu.:24318  
##  Max.   :20434176   Max.   :34435  
## 

These functions are certainly useful, but there are other more informative and flexible options for summarizing our data outside of base R.
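Before leaving base R, it is worth noting that aggregate returns one row per group, which is easier to compare across groups than the by() output above. A minimal sketch, assuming the gapminder package is installed; the object name life_by_continent is arbitrary.

```r
library(gapminder)

# Mean and SD of life expectancy for each continent, one row per group
life_by_continent <- aggregate(
  lifeExp ~ continent,
  data = gapminder,
  FUN  = function(x) c(mean = mean(x), sd = sd(x))
)

life_by_continent
```

Because the function returns two values, the lifeExp column of the result is itself a two-column matrix (mean and sd) rather than two separate columns, which is one more way this approach falls short of tidy output.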

describe, from the Hmisc package

The Hmisc package offers a wide selection of functions that have a more comprehensive output than the base R functions. One of these is the describe function. It’s similar to the summary described above, but one nice feature of the describe function is that it provides frequencies on nonfactors as well as factors, as long as they do not have too many values.

library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
describe(gapminder)
## gapminder 
## 
##  6  Variables      1704  Observations
## ---------------------------------------------------------------------------
## country 
##        n  missing distinct 
##     1704        0      142 
## 
## lowest : Afghanistan        Albania            Algeria            Angola             Argentina         
## highest: Vietnam            West Bank and Gaza Yemen, Rep.        Zambia             Zimbabwe          
## ---------------------------------------------------------------------------
## continent 
##        n  missing distinct 
##     1704        0        5 
##                                                        
## Value        Africa Americas     Asia   Europe  Oceania
## Frequency       624      300      396      360       24
## Proportion    0.366    0.176    0.232    0.211    0.014
## ---------------------------------------------------------------------------
## year 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1704        0       12    0.993     1980    19.87     1952     1957 
##      .25      .50      .75      .90      .95 
##     1966     1980     1993     2002     2007 
##                                                                       
## Value       1952  1957  1962  1967  1972  1977  1982  1987  1992  1997
## Frequency    142   142   142   142   142   142   142   142   142   142
## Proportion 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083
##                       
## Value       2002  2007
## Frequency    142   142
## Proportion 0.083 0.083
## ---------------------------------------------------------------------------
## lifeExp 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1704        0     1626        1    59.47    14.82    38.49    41.51 
##      .25      .50      .75      .90      .95 
##    48.20    60.71    70.85    75.10    77.44 
## 
## lowest : 23.599 28.801 30.000 30.015 30.331, highest: 81.701 81.757 82.000 82.208 82.603
## ---------------------------------------------------------------------------
## pop 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1704        0     1704        1 29601212 46384459   475459   946367 
##      .25      .50      .75      .90      .95 
##  2793664  7023596 19585222 54801370 89822054 
## 
## lowest :      60011      61325      63149      65345      70787
## highest: 1110396331 1164970000 1230075000 1280400000 1318683096
## ---------------------------------------------------------------------------
## gdpPercap 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1704        0     1704        1     7215     8573    548.0    687.7 
##      .25      .50      .75      .90      .95 
##   1202.1   3531.8   9325.5  19449.1  26608.3 
## 
## lowest :    241.1659    277.5519    298.8462    299.8503    312.1884
## highest:  80894.8833  95458.1118 108382.3529 109347.8670 113523.1329
## ---------------------------------------------------------------------------

Unlike SPSS, the describe function does not provide percentages that include missing values; the proportions it reports are based on the nonmissing values only. Note that the exclude.missing argument does not change this: according to the Hmisc documentation, it only controls how variables that consist entirely of missing values are listed. describe will automatically provide a table of frequencies whenever a variable has no more than 20 unique values. Beyond that, it will report the five largest and five smallest values, just like SPSS’s EXPLORE procedure. Also, for some items (not pictured in this example), the describe function will display the variable prompts themselves (e.g., the survey question participants responded to) in the output. This can be accomplished by adding variable labels (e.g., through the label function in the Hmisc package).
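To illustrate the variable labels mentioned above, here is a small sketch. It assumes the gapminder and Hmisc packages are installed, and the label text is made up for the example. The functions are called with the Hmisc:: prefix because other packages loaded in this post define functions with the same names.

```r
library(gapminder)

# Work on a plain data.frame copy so the original data stay untouched
gm <- as.data.frame(gapminder)

# Attach a variable label (hypothetical wording, for illustration only)
Hmisc::label(gm$lifeExp) <- "Life expectancy at birth, in years"

# The label should now appear alongside the variable name in the output
Hmisc::describe(gm[, c("continent", "lifeExp")])
```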

SPSS users may find the output user-friendly, but it isn’t tidy and can be hard to work with at times. There is also no summary by group function, but you could get around this by wrapping describe in the by() function mentioned earlier. One downside of this approach, though, is that the output becomes lengthy and comparing statistics between groups becomes challenging.
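The by() workaround mentioned above might look like the following sketch. It assumes the gapminder and Hmisc packages are installed; restricting the data to two columns is my own choice to keep the printout short.

```r
library(gapminder)

# Run Hmisc's describe() separately within each continent
by_continent <- by(
  gapminder[, c("lifeExp", "gdpPercap")],
  gapminder$continent,
  Hmisc::describe
)

# Look at a single group rather than printing all five at once
by_continent[["Oceania"]]
```

Printing by_continent on its own shows every group in sequence, which is where the length and the difficulty of comparing groups become apparent.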

describe and describeBy, from the psych package

Not to be confused with describe from the Hmisc package, the describe function from the psych package has a strikingly different output!

library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
## 
##     describe
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(gapminder)
##            vars    n        mean           sd     median     trimmed
## country*      1 1704       71.50        41.00      71.50       71.50
## continent*    2 1704        2.33         1.21       2.00        2.27
## year          3 1704     1979.50        17.27    1979.50     1979.50
## lifeExp       4 1704       59.47        12.92      60.71       59.92
## pop           5 1704 29601212.32 106157896.74 7023595.50 11399459.45
## gdpPercap     6 1704     7215.33      9857.45    3531.85     5221.44
##                   mad      min          max      range  skew kurtosis
## country*        52.63     1.00        142.0        141  0.00    -1.20
## continent*       1.48     1.00          5.0          4  0.25    -1.34
## year            22.24  1952.00       2007.0         55  0.00    -1.22
## lifeExp         16.10    23.60         82.6         59 -0.25    -1.13
## pop        7841473.62 60011.00 1318683096.0 1318623085  8.33    77.62
## gdpPercap     4007.61   241.17     113523.1     113282  3.84    27.40
##                    se
## country*         0.99
## continent*       0.03
## year             0.42
## lifeExp          0.31
## pop        2571683.45
## gdpPercap      238.80

This output contains all six of the variables in the dataset; however, the categorical variables (i.e., country and continent) display summary stats that you’d typically associate with numeric fields. This is because the function’s default is to recode categories as numbers. When the function recodes a categorical variable into a numeric variable, it denotes this by adding an * at the end of the variable name. Thus, statistics for variables marked with * should be interpreted cautiously (if at all).

If you’re dealing with categorical data that you do not want converted into numerical data, psych::describe may not be the best function to use. For genuinely numeric data, though, it provides most of the key statistics and a few bonus stats (e.g., skew, kurtosis, SE). It does not, however, indicate where data are missing. The output is fairly tidy, except that it uses row names to identify the variable being summarized, but this is a pretty common feature across packages.
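One way to sidestep the recoded factors is to hand psych::describe only the numeric columns and to count missing values separately. A minimal sketch, assuming the gapminder and psych packages are installed; num_cols and num_stats are arbitrary names.

```r
library(gapminder)

# Pass only the numeric columns so that no factor is recoded to integers
num_cols  <- sapply(gapminder, is.numeric)
num_stats <- psych::describe(gapminder[, num_cols])
num_stats

# describe() reports n (nonmissing) but not the number missing,
# so count the NAs per column with base R
colSums(is.na(gapminder))
```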

Another nice feature of the psych package is that it has a dedicated summary-by-group function: describeBy. If we wanted the statistics for each continent in the gapminder dataset, we would simply use the following code:

describeBy(gapminder, gapminder$continent)
## 
##  Descriptive statistics by group 
## group: Africa
##            vars   n       mean          sd     median    trimmed
## country*      1 624      70.50       40.78      75.50      70.00
## continent*    2 624       1.00        0.00       1.00       1.00
## year          3 624    1979.50       17.27    1979.50    1979.50
## lifeExp       4 624      48.87        9.15      47.79      48.30
## pop           5 624 9916003.14 15490923.32 4579311.00 6443586.30
## gdpPercap     6 624    2193.75     2827.93    1192.14    1557.04
##                   mad      min          max        range skew kurtosis
## country*        53.37     3.00       142.00       139.00 0.07    -1.25
## continent*       0.00     1.00         1.00         0.00  NaN      NaN
## year            22.24  1952.00      2007.00        55.00 0.00    -1.22
## lifeExp          8.58    23.60        76.44        52.84 0.56     0.13
## pop        5540860.93 60011.00 135031164.00 134971153.00 3.55    17.02
## gdpPercap      775.32   241.17     21951.21     21710.05 3.53    15.78
##                   se
## country*        1.63
## continent*      0.00
## year            0.69
## lifeExp         0.37
## pop        620133.24
## gdpPercap     113.21
## -------------------------------------------------------- 
## group: Americas
##            vars   n        mean          sd     median     trimmed
## country*      1 300       65.04       42.30      54.00       63.05
## continent*    2 300        2.00        0.00       2.00        2.00
## year          3 300     1979.50       17.29    1979.50     1979.50
## lifeExp       4 300       64.66        9.35      67.05       65.50
## pop           5 300 24504795.00 50979430.20 6227510.00 10642390.82
## gdpPercap     6 300     7136.11     6396.76    5465.51     5823.60
##                   mad       min          max        range  skew kurtosis
## country*        48.93      5.00       137.00       132.00  0.38    -1.23
## continent*       0.00      2.00         2.00         0.00   NaN      NaN
## year            22.24   1952.00      2007.00        55.00  0.00    -1.23
## lifeExp          8.47     37.58        80.65        43.07 -0.73    -0.21
## pop        5972602.95 662850.00 301139947.00 300477097.00  3.37    11.43
## gdpPercap     3269.33   1201.64     42951.65     41750.02  2.83     9.42
##                    se
## country*         2.44
## continent*       0.00
## year             1.00
## lifeExp          0.54
## pop        2943298.78
## gdpPercap      369.32
## -------------------------------------------------------- 
## group: Asia
##            vars   n        mean           sd      median     trimmed
## country*      1 396       79.48        38.18       73.00       81.18
## continent*    2 396        3.00         0.00        3.00        3.00
## year          3 396     1979.50        17.28     1979.50     1979.50
## lifeExp       4 396       60.06        11.86       61.79       60.62
## pop           5 396 77038721.97 206885204.62 14530830.50 25678311.84
## gdpPercap     6 396     7902.15     14045.37     2646.79     4915.85
##                    mad      min          max        range  skew kurtosis
## country*         35.58      1.0        140.0        139.0 -0.29    -0.58
## continent*        0.00      3.0          3.0          0.0   NaN      NaN
## year             22.24   1952.0       2007.0         55.0  0.00    -1.23
## lifeExp          13.00     28.8         82.6         53.8 -0.40    -0.67
## pop        18326690.66 120447.0 1318683096.0 1318562649.0  4.11    16.93
## gdpPercap      2820.83    331.0     113523.1     113192.1  4.42    25.56
##                     se
## country*          1.92
## continent*        0.00
## year              0.87
## lifeExp           0.60
## pop        10396372.70
## gdpPercap       705.81
## -------------------------------------------------------- 
## group: Europe
##            vars   n        mean          sd     median     trimmed
## country*      1 360       71.33       41.60      64.00       72.12
## continent*    2 360        4.00        0.00       4.00        4.00
## year          3 360     1979.50       17.28    1979.50     1979.50
## lifeExp       4 360       71.90        5.43      72.24       72.39
## pop           5 360 17169764.73 20519437.65 8551125.00 13215590.75
## gdpPercap     6 360    14469.48     9355.21   12081.75    13532.97
##                   mad       min         max       range  skew kurtosis
## country*        58.56      2.00      134.00      132.00 -0.09    -1.37
## continent*       0.00      4.00        4.00        0.00   NaN      NaN
## year            22.24   1952.00     2007.00       55.00  0.00    -1.23
## lifeExp          4.43     43.59       81.76       38.17 -1.25     3.29
## pop        6626994.42 147962.00 82400996.00 82253034.00  1.57     1.30
## gdpPercap     8846.05    973.53    49357.19    48383.66  0.85     0.14
##                    se
## country*         2.19
## continent*       0.00
## year             0.91
## lifeExp          0.29
## pop        1081469.32
## gdpPercap      493.06
## -------------------------------------------------------- 
## group: Oceania
##            vars  n       mean         sd     median    trimmed        mad
## country*      1 24      49.00      43.92      49.00      49.00      63.75
## continent*    2 24       5.00       0.00       5.00       5.00       0.00
## year          3 24    1979.50      17.63    1979.50    1979.50      22.24
## lifeExp       4 24      74.33       3.80      73.66      74.19       4.00
## pop           5 24 8874672.33 6506342.47 6403491.50 8439348.35 5996394.97
## gdpPercap     6 24   18621.61    6358.98   17983.30   18059.97    6459.10
##                   min         max       range skew kurtosis         se
## country*         6.00       92.00       86.00 0.00    -2.08       8.97
## continent*       5.00        5.00        0.00  NaN      NaN       0.00
## year          1952.00     2007.00       55.00 0.00    -1.36       3.60
## lifeExp         69.12       81.23       12.11 0.37    -1.33       0.77
## pop        1994794.00 20434176.00 18439382.00 0.43    -1.48 1328101.59
## gdpPercap    10039.60    34435.37    24395.77 0.71    -0.21    1298.02

All the essential statistics are there! This output is more condensed than the output from the by() function, but it’s still difficult to compare the same statistic across groups. It’s also not tidy. However, the mat argument lets you produce a matrix from the above output.

describeBy(gapminder, gapminder$continent, mat = TRUE)
##             item   group1 vars   n         mean           sd       median
## country*1      1   Africa    1 624 7.050000e+01 4.077980e+01 7.550000e+01
## country*2      2 Americas    1 300 6.504000e+01 4.230181e+01 5.400000e+01
## country*3      3     Asia    1 396 7.948485e+01 3.818366e+01 7.300000e+01
## country*4      4   Europe    1 360 7.133333e+01 4.159843e+01 6.400000e+01
## country*5      5  Oceania    1  24 4.900000e+01 4.392484e+01 4.900000e+01
## continent*1    6   Africa    2 624 1.000000e+00 0.000000e+00 1.000000e+00
## continent*2    7 Americas    2 300 2.000000e+00 0.000000e+00 2.000000e+00
## continent*3    8     Asia    2 396 3.000000e+00 0.000000e+00 3.000000e+00
## continent*4    9   Europe    2 360 4.000000e+00 0.000000e+00 4.000000e+00
## continent*5   10  Oceania    2  24 5.000000e+00 0.000000e+00 5.000000e+00
## year1         11   Africa    3 624 1.979500e+03 1.727411e+01 1.979500e+03
## year2         12 Americas    3 300 1.979500e+03 1.728910e+01 1.979500e+03
## year3         13     Asia    3 396 1.979500e+03 1.728210e+01 1.979500e+03
## year4         14   Europe    3 360 1.979500e+03 1.728429e+01 1.979500e+03
## year5         15  Oceania    3  24 1.979500e+03 1.763149e+01 1.979500e+03
## lifeExp1      16   Africa    4 624 4.886533e+01 9.150210e+00 4.779200e+01
## lifeExp2      17 Americas    4 300 6.465874e+01 9.345088e+00 6.704800e+01
## lifeExp3      18     Asia    4 396 6.006490e+01 1.186453e+01 6.179150e+01
## lifeExp4      19   Europe    4 360 7.190369e+01 5.433178e+00 7.224100e+01
## lifeExp5      20  Oceania    4  24 7.432621e+01 3.795611e+00 7.366500e+01
## pop1          21   Africa    5 624 9.916003e+06 1.549092e+07 4.579311e+06
## pop2          22 Americas    5 300 2.450479e+07 5.097943e+07 6.227510e+06
## pop3          23     Asia    5 396 7.703872e+07 2.068852e+08 1.453083e+07
## pop4          24   Europe    5 360 1.716976e+07 2.051944e+07 8.551125e+06
## pop5          25  Oceania    5  24 8.874672e+06 6.506342e+06 6.403492e+06
## gdpPercap1    26   Africa    6 624 2.193755e+03 2.827930e+03 1.192138e+03
## gdpPercap2    27 Americas    6 300 7.136110e+03 6.396764e+03 5.465510e+03
## gdpPercap3    28     Asia    6 396 7.902150e+03 1.404537e+04 2.646787e+03
## gdpPercap4    29   Europe    6 360 1.446948e+04 9.355213e+03 1.208175e+04
## gdpPercap5    30  Oceania    6  24 1.862161e+04 6.358983e+03 1.798330e+04
##                  trimmed          mad          min          max
## country*1   7.000400e+01 5.337360e+01       3.0000 1.420000e+02
## country*2   6.305000e+01 4.892580e+01       5.0000 1.370000e+02
## country*3   8.117925e+01 3.558240e+01       1.0000 1.400000e+02
## country*4   7.212500e+01 5.856270e+01       2.0000 1.340000e+02
## country*5   4.900000e+01 6.375180e+01       6.0000 9.200000e+01
## continent*1 1.000000e+00 0.000000e+00       1.0000 1.000000e+00
## continent*2 2.000000e+00 0.000000e+00       2.0000 2.000000e+00
## continent*3 3.000000e+00 0.000000e+00       3.0000 3.000000e+00
## continent*4 4.000000e+00 0.000000e+00       4.0000 4.000000e+00
## continent*5 5.000000e+00 0.000000e+00       5.0000 5.000000e+00
## year1       1.979500e+03 2.223900e+01    1952.0000 2.007000e+03
## year2       1.979500e+03 2.223900e+01    1952.0000 2.007000e+03
## year3       1.979500e+03 2.223900e+01    1952.0000 2.007000e+03
## year4       1.979500e+03 2.223900e+01    1952.0000 2.007000e+03
## year5       1.979500e+03 2.223900e+01    1952.0000 2.007000e+03
## lifeExp1    4.829597e+01 8.580547e+00      23.5990 7.644200e+01
## lifeExp2    6.550320e+01 8.467129e+00      37.5790 8.065300e+01
## lifeExp3    6.062415e+01 1.300240e+01      28.8010 8.260300e+01
## lifeExp4    7.238535e+01 4.432974e+00      43.5850 8.175700e+01
## lifeExp5    7.418570e+01 4.003020e+00      69.1200 8.123500e+01
## pop1        6.443586e+06 5.540861e+06   60011.0000 1.350312e+08
## pop2        1.064239e+07 5.972603e+06  662850.0000 3.011399e+08
## pop3        2.567831e+07 1.832669e+07  120447.0000 1.318683e+09
## pop4        1.321559e+07 6.626994e+06  147962.0000 8.240100e+07
## pop5        8.439348e+06 5.996395e+06 1994794.0000 2.043418e+07
## gdpPercap1  1.557042e+03 7.753226e+02     241.1659 2.195121e+04
## gdpPercap2  5.823602e+03 3.269332e+03    1201.6372 4.295165e+04
## gdpPercap3  4.915853e+03 2.820834e+03     331.0000 1.135231e+05
## gdpPercap4  1.353297e+04 8.846051e+03     973.5332 4.935719e+04
## gdpPercap5  1.805997e+04 6.459103e+03   10039.5956 3.443537e+04
##                    range        skew   kurtosis           se
## country*1   1.390000e+02  0.07251641 -1.2542325 1.632498e+00
## country*2   1.320000e+02  0.38193193 -1.2266294 2.442296e+00
## country*3   1.390000e+02 -0.28791448 -0.5830326 1.918801e+00
## country*4   1.320000e+02 -0.08696936 -1.3673609 2.192430e+00
## country*5   8.600000e+01  0.00000000 -2.0815972 8.966120e+00
## continent*1 0.000000e+00         NaN        NaN 0.000000e+00
## continent*2 0.000000e+00         NaN        NaN 0.000000e+00
## continent*3 0.000000e+00         NaN        NaN 0.000000e+00
## continent*4 0.000000e+00         NaN        NaN 0.000000e+00
## continent*5 0.000000e+00         NaN        NaN 0.000000e+00
## year1       5.500000e+01  0.00000000 -1.2224941 6.915178e-01
## year2       5.500000e+01  0.00000000 -1.2286515 9.981868e-01
## year3       5.500000e+01  0.00000000 -1.2257780 8.684581e-01
## year4       5.500000e+01  0.00000000 -1.2266762 9.109618e-01
## year5       5.500000e+01  0.00000000 -1.3622888 3.599014e+00
## lifeExp1    5.284300e+01  0.56316642  0.1335922 3.663016e-01
## lifeExp2    4.307400e+01 -0.73494971 -0.2072983 5.395389e-01
## lifeExp3    5.380200e+01 -0.40106859 -0.6664275 5.962151e-01
## lifeExp4    3.817200e+01 -1.24610371  3.2857630 2.863536e-01
## lifeExp5    1.211500e+01  0.36792145 -1.3277465 7.747759e-01
## pop1        1.349712e+08  3.55098227 17.0211253 6.201332e+05
## pop2        3.004771e+08  3.37106426 11.4323045 2.943299e+06
## pop3        1.318563e+09  4.10911316 16.9262005 1.039637e+07
## pop4        8.225303e+07  1.57401173  1.2976452 1.081469e+06
## pop5        1.843938e+07  0.42709154 -1.4849722 1.328102e+06
## gdpPercap1  2.171005e+04  3.52959852 15.7820804 1.132078e+02
## gdpPercap2  4.175002e+04  2.83121560  9.4240088 3.693173e+02
## gdpPercap3  1.131921e+05  4.42266933 25.5582081 7.058066e+02
## gdpPercap4  4.838366e+04  0.85400706  0.1360489 4.930630e+02
## gdpPercap5  2.439577e+04  0.70779453 -0.2063427 1.298022e+03

It’s not pretty, but it does have the added benefit of producing a potentially useful dataframe that you could tidy up.
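As a rough idea of what that tidying could look like, the sketch below keeps only the lifeExp rows and a handful of columns so that one variable can be compared across continents. It assumes the gapminder and psych packages are installed; the object names by_cont and life are arbitrary.

```r
library(gapminder)

by_cont <- psych::describeBy(gapminder, gapminder$continent, mat = TRUE)

# Keep the lifeExp rows (row names lifeExp1 ... lifeExp5) and a few columns
life <- by_cont[grepl("^lifeExp", rownames(by_cont)),
                c("group1", "n", "mean", "sd", "median")]

# Drop the row names, since group1 already identifies each continent
rownames(life) <- NULL
life
```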

descr and dfSummary, from the summarytools package

Starting with descr, the first thing to note is that it (like psych::describe) only works with numerical data. However, unlike psych::describe’s asterisk, we’re alerted by an actual message displayed at the top of the output showing which columns have been ignored. There’s also a record count and a nice variety of summary statistics. The output also displays some information on missing data (e.g., Pct.Valid is the percentage of data that is not missing).

library(summarytools)
## 
## Attaching package: 'summarytools'
## The following objects are masked from 'package:Hmisc':
## 
##     label, label<-
descr(gapminder)
## Non-numerical variable(s) ignored: country, continent
## Descriptive Statistics     
## Data Frame: gapminder     
## N: 1704   
## 
##                        year   lifeExp             pop   gdpPercap
## ----------------- --------- --------- --------------- -----------
##              Mean   1979.50     59.47     29601212.32     7215.33
##           Std.Dev     17.27     12.92    106157896.74     9857.45
##               Min   1952.00     23.60        60011.00      241.17
##                Q1   1964.50     48.19      2792776.00     1201.92
##            Median   1979.50     60.71      7023595.50     3531.85
##                Q3   1994.50     70.85     19593660.50     9325.86
##               Max   2007.00     82.60   1318683096.00   113523.13
##               MAD     22.24     16.10      7841473.62     4007.61
##               IQR     27.50     22.65     16791557.75     8123.40
##                CV    114.65      4.60            0.28        0.73
##          Skewness      0.00     -0.25            8.33        3.84
##       SE.Skewness      0.06      0.06            0.06        0.06
##          Kurtosis     -1.22     -1.13           77.62       27.40
##           N.Valid   1704.00   1704.00         1704.00     1704.00
##         Pct.Valid    100.00    100.00          100.00      100.00
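
One thing to watch in the output above is the CV row. The coefficient of variation is conventionally the standard deviation divided by the mean, yet the values printed here (114.65 for year, 4.60 for lifeExp) look like the reciprocal, mean divided by SD. The base R sketch below checks this for year. It assumes, based on the str and dfSummary output in this post, that year takes the values 1952 to 2007 in steps of five, once per country for 142 countries. The names year and cv are mine and are not part of summarytools.

```r
# Rebuild the year column: 12 survey years, each repeated for 142 countries.
# (An assumption based on the str() output; no packages needed.)
year <- rep(seq(1952, 2007, by = 5), each = 142)

# Conventional coefficient of variation: SD relative to the mean.
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

round(sd(year), 2)               # 17.27, matches Std.Dev
round(mad(year), 2)              # 22.24, matches MAD (R scales by 1.4826 by default)
round(cv(year), 2)               # 0.01
round(mean(year) / sd(year), 2)  # 114.65, the value shown in the CV row
```

By the conventional definition, year varies by less than 1% of its mean, which agrees with the 0.01 that dfSummary reports for the same column later in this post.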

There’s a transpose argument for those who prefer to arrange their variables by row and summary statistics as columns.

descr(gapminder, transpose = TRUE)
## Non-numerical variable(s) ignored: country, continent
## Descriptive Statistics     
## Data Frame: gapminder     
## N: 1704   
## 
##                          Mean        Std.Dev        Min           Q1       Median            Q3
## --------------- ------------- -------------- ---------- ------------ ------------ -------------
##            year       1979.50          17.27    1952.00      1964.50      1979.50       1994.50
##         lifeExp         59.47          12.92      23.60        48.19        60.71         70.85
##             pop   29601212.32   106157896.74   60011.00   2792776.00   7023595.50   19593660.50
##       gdpPercap       7215.33        9857.45     241.17      1201.92      3531.85       9325.86
## 
## Table: Table continues below
## 
##  
## 
##                             Max          MAD           IQR       CV   Skewness   SE.Skewness
## --------------- --------------- ------------ ------------- -------- ---------- -------------
##            year         2007.00        22.24         27.50   114.65       0.00          0.06
##         lifeExp           82.60        16.10         22.65     4.60      -0.25          0.06
##             pop   1318683096.00   7841473.62   16791557.75     0.28       8.33          0.06
##       gdpPercap       113523.13      4007.61       8123.40     0.73       3.84          0.06
## 
## Table: Table continues below
## 
##  
## 
##                   Kurtosis   N.Valid   Pct.Valid
## --------------- ---------- --------- -----------
##            year      -1.22   1704.00      100.00
##         lifeExp      -1.13   1704.00      100.00
##             pop      77.62   1704.00      100.00
##       gdpPercap      27.40   1704.00      100.00

summarytools also includes a more comprehensive summarizing function called dfSummary that is intended to summarize a whole dataframe.

dfSummary(gapminder)
## Data Frame Summary   
## gapminder     
## N: 1704   
## ------------------------------------------------------------------------------------------------------------------------------------------
## No   Variable     Stats / Values                            Freqs (% of Valid)   Text Graph                             Valid    Missing  
## ---- ------------ ----------------------------------------- -------------------- -------------------------------------- -------- ---------
## 1    country      1. Afghanistan                              12 ( 0.7%)         IIIIIIIIIIIIIIII                       1704     0        
##      [factor]     2. Albania                                  12 ( 0.7%)                                                (100%)   (0%)     
##                   3. Algeria                                  12 ( 0.7%)                                                                  
##                   4. Angola                                   12 ( 0.7%)                                                                  
##                   5. Argentina                                12 ( 0.7%)                                                                  
##                   6. Australia                                12 ( 0.7%)                                                                  
##                   7. Austria                                  12 ( 0.7%)                                                                  
##                   8. Bahrain                                  12 ( 0.7%)                                                                  
##                   9. Bangladesh                               12 ( 0.7%)                                                                  
##                   10. Belgium                                 12 ( 0.7%)                                                                  
##                   [ 132 others ]                            1584 (92.4%)                                                                  
## 
## 2    continent    1. Africa                                 624 (36.6%)          IIIIIIIIIIIIIIII                       1704     0        
##      [factor]     2. Americas                               300 (17.6%)          IIIIIII                                (100%)   (0%)     
##                   3. Asia                                   396 (23.2%)          IIIIIIIIII                                               
##                   4. Europe                                 360 (21.1%)          IIIIIIIII                                                
##                   5. Oceania                                 24 ( 1.4%)                                                                   
## 
## 3    year         mean (sd) : 1979.5 (17.27)                12 distinct val.     :                 :                    1704     0        
##      [integer]    min < med < max :                                              :                 :                    (100%)   (0%)     
##                   1952 < 1979.5 < 2007                                           : . . . . . . . . :                                      
##                   IQR (CV) : 27.5 (0.01)                                         : : : : : : : : : :                                      
##                                                                                  : : : : : : : : : :                                      
## 
## 4    lifeExp      mean (sd) : 59.47 (12.92)                 1626 distinct val.                 . :                      1704     0        
##      [numeric]    min < med < max :                                                    . .     : :                      (100%)   (0%)     
##                   23.6 < 60.71 < 82.6                                                  : : : : : :                                        
##                   IQR (CV) : 22.65 (0.22)                                            : : : : : : : .                                      
##                                                                                    . : : : : : : : :                                      
## 
## 5    pop          mean (sd) : 29601212.32 (106157896.74)    1704 distinct val.   :                                      1704     0        
##      [integer]    min < med < max :                                              :                                      (100%)   (0%)     
##                   60011 < 7023595.5 < 1318683096                                 :                                                        
##                   IQR (CV) : 16791557.75 (3.59)                                  :                                                        
##                                                                                  :                                                        
## 
## 6    gdpPercap    mean (sd) : 7215.33 (9857.45)             1704 distinct val.   :                                      1704     0        
##      [numeric]    min < med < max :                                              :                                      (100%)   (0%)     
##                   241.17 < 3531.85 < 113523.13                                   :                                                        
##                   IQR (CV) : 8123.4 (1.37)                                       :                                                        
##                                                                                  : : .                                                    
## ------------------------------------------------------------------------------------------------------------------------------------------

The output from dfSummary is much more detailed than that of many of the functions we’ve looked at so far, and it can deal with both categorical and numeric variables. It displays commonly used summary statistics, information on missing data, sample sizes, and variable type. It also uses a text graph to show rough distributions (though these pale in comparison to the histograms from the skimr package described below). One limitation, however, is that there is no built-in grouping feature.
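
Without a grouping argument, one workaround is to split the data yourself and summarize each piece. The sketch below uses only base R and a tiny made-up data frame (toy, which does not contain gapminder values) so that it runs without extra packages. Swapping in gapminder for toy and dfSummary for summary should follow the same pattern, but I have not shown that here, so treat it as an assumption.

```r
# Made-up illustration data; these are NOT gapminder values.
toy <- data.frame(
  continent = c("Africa", "Africa", "Asia", "Asia", "Europe", "Europe"),
  lifeExp   = c(50, 54, 60, 66, 72, 76),
  gdpPercap = c(900, 1100, 2500, 3500, 14000, 16000)
)

# split() makes one data frame per continent; lapply() summarizes each one.
numeric_cols <- c("lifeExp", "gdpPercap")
by_continent <- lapply(split(toy[numeric_cols], toy$continent), summary)

# Each list element is the usual summary() table for that group.
by_continent$Asia
```

The result is a named list, so individual groups can be pulled out by name or printed all at once.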

skim, from the skimr package

The skim function reports the record and variable counts. It is fantastic for handling both categorical and numeric data, and displays accurate summary stats for each. For readability, it separates variables into sections depending on data type, which makes for quick interpretation given that different summary stats are relevant for different data types. It also clearly reports missing data, and has all the most common summary stats. However, it doesn’t display some of the bonus content that psych::describe does, such as SE and range.

library(skimr)
skim(gapminder)
## Skim summary statistics
##  n obs: 1704 
##  n variables: 6 
## 
## Variable type: factor 
##   variable missing complete    n n_unique
##  continent       0     1704 1704        5
##    country       0     1704 1704      142
##                              top_counts ordered
##  Afr: 624, Asi: 396, Eur: 360, Ame: 300   FALSE
##      Afg: 12, Alb: 12, Alg: 12, Ang: 12   FALSE
## 
## Variable type: integer 
##  variable missing complete    n    mean       sd    p0        p25     p50
##       pop       0     1704 1704 3e+07    1.1e+08 60011 2793664    7e+06  
##      year       0     1704 1704  1979.5 17.27     1952    1965.75  1979.5
##       p75       p100     hist
##  2e+07       1.3e+09 ▇▁▁▁▁▁▁▁
##   1993.25 2007       ▇▃▇▃▃▇▃▇
## 
## Variable type: numeric 
##   variable missing complete    n    mean      sd     p0     p25     p50
##  gdpPercap       0     1704 1704 7215.33 9857.45 241.17 1202.06 3531.85
##    lifeExp       0     1704 1704   59.47   12.92  23.6    48.2    60.71
##      p75      p100     hist
##  9325.46 113523.13 ▇▁▁▁▁▁▁▁
##    70.85     82.6  ▁▂▅▅▅▅▇▃

skim’s output also works very well with group_by from the dplyr package, which means you can easily produce summaries by group:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:skimr':
## 
##     contains, ends_with, everything, matches, num_range, one_of,
##     starts_with
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
group_by(gapminder, continent) %>% 
  skim()
## Skim summary statistics
##  n obs: 1704 
##  n variables: 6 
##  group variables: continent 
## 
## Variable type: factor 
##  continent variable missing complete   n n_unique
##     Africa  country       0      624 624       52
##   Americas  country       0      300 300       25
##       Asia  country       0      396 396       33
##     Europe  country       0      360 360       30
##    Oceania  country       0       24  24        2
##                          top_counts ordered
##  Alg: 12, Ang: 12, Ben: 12, Bot: 12   FALSE
##  Arg: 12, Bol: 12, Bra: 12, Can: 12   FALSE
##  Afg: 12, Bah: 12, Ban: 12, Cam: 12   FALSE
##  Alb: 12, Aus: 12, Bel: 12, Bos: 12   FALSE
##    Aus: 12, New: 12, Afg: 0, Alb: 0   FALSE
## 
## Variable type: integer 
##  continent variable missing complete   n          mean            sd
##     Africa      pop       0      624 624 9916003.14          1.5e+07
##     Africa     year       0      624 624    1979.5          17.27   
##   Americas      pop       0      300 300       2.5e+07       5.1e+07
##   Americas     year       0      300 300    1979.5          17.29   
##       Asia      pop       0      396 396       7.7e+07       2.1e+08
##       Asia     year       0      396 396    1979.5          17.28   
##     Europe      pop       0      360 360       1.7e+07       2.1e+07
##     Europe     year       0      360 360    1979.5          17.28   
##    Oceania      pop       0       24  24 8874672.33    6506342.47   
##    Oceania     year       0       24  24    1979.5          17.63   
##      p0        p25           p50        p75        p100     hist
##   60011 1342075    4579311          1.1e+07     1.4e+08 ▇▁▁▁▁▁▁▁
##    1952    1965.75    1979.5     1993.25     2007       ▇▃▇▃▃▇▃▇
##  662850   3e+06    6227510          1.8e+07 3e+08       ▇▁▁▁▁▁▁▁
##    1952    1965.75    1979.5     1993.25     2007       ▇▃▇▃▃▇▃▇
##  120447 3844393          1.5e+07    4.6e+07     1.3e+09 ▇▁▁▁▁▁▁▁
##    1952    1965.75    1979.5     1993.25     2007       ▇▃▇▃▃▇▃▇
##  147962 4331500    8551125          2.2e+07     8.2e+07 ▇▁▁▁▁▁▁▁
##    1952    1965.75    1979.5     1993.25     2007       ▇▃▇▃▃▇▃▇
##   2e+06 3199212.5  6403491.5        1.4e+07 2e+07       ▇▁▁▁▁▁▁▂
##    1952    1965.75    1979.5     1993.25     2007       ▇▃▇▃▃▇▃▇
## 
## Variable type: numeric 
##  continent  variable missing complete   n     mean       sd       p0
##     Africa gdpPercap       0      624 624  2193.75  2827.93   241.17
##     Africa   lifeExp       0      624 624    48.87     9.15    23.6 
##   Americas gdpPercap       0      300 300  7136.11  6396.76  1201.64
##   Americas   lifeExp       0      300 300    64.66     9.35    37.58
##       Asia gdpPercap       0      396 396  7902.15 14045.37   331   
##       Asia   lifeExp       0      396 396    60.06    11.86    28.8 
##     Europe gdpPercap       0      360 360 14469.48  9355.21   973.53
##     Europe   lifeExp       0      360 360    71.9      5.43    43.59
##    Oceania gdpPercap       0       24  24 18621.61  6358.98 10039.6 
##    Oceania   lifeExp       0       24  24    74.33     3.8     69.12
##       p25      p50      p75      p100     hist
##    761.25  1192.14  2377.42  21951.21 ▇▁▁▁▁▁▁▁
##     42.37    47.79    54.41     76.44 ▁▂▆▇▆▃▁▁
##   3427.78  5465.51  7830.21  42951.65 ▇▃▁▁▁▁▁▁
##     58.41    67.05    71.7      80.65 ▁▂▂▃▅▇▇▃
##   1056.99  2646.79  8549.26 113523.13 ▇▁▁▁▁▁▁▁
##     51.43    61.79    69.51     82.6  ▁▃▅▅▇▇▇▂
##   7213.09 12081.75 20461.39  49357.19 ▆▇▅▃▂▂▁▁
##     69.57    72.24    75.45     81.76 ▁▁▁▁▂▇▇▃
##  14141.86 17983.3  22214.12  34435.37 ▇▅▇▃▃▁▁▁
##     71.2     73.66    77.55     81.23 ▅▇▂▃▂▂▂▃

Because the output is arranged by the grouping variable (i.e., continent in the current example), it can sometimes be difficult to compare summary statistics across groups. Still, this is probably one of the easier/nicer ways to view these results at a glance, and with a little more manipulation, it’s actually pretty easy to compare statistics across groups directly. Although the skim output above doesn’t look tidy, there is a tidy, dataframe-like representation of each combination of variable and statistic behind the scenes. For example:

data <- group_by(gapminder, continent) %>% 
          skim()
head(data)
## # A tibble: 6 x 7
## # Groups:   continent [1]
##   continent variable type   stat       level   value formatted
##   <fct>     <chr>    <chr>  <chr>      <chr>   <dbl> <chr>    
## 1 Africa    country  factor missing    .all       0. 0        
## 2 Africa    country  factor complete   .all     624. 624      
## 3 Africa    country  factor n          .all     624. 624      
## 4 Africa    country  factor n_unique   .all      52. 52       
## 5 Africa    country  factor top_counts Algeria   12. Alg: 12  
## 6 Africa    country  factor top_counts Angola    12. Ang: 12

It’s not exactly human-reader friendly, but after some deciphering, you might be able to see it produces a long table that contains one row for every combination of group, variable, and summary statistic. This means you can use standard methods of dataframe manipulation to process your summary statistics.

For instance, using the tidyverse, we can examine the mean and SD of the lifeExp variable and compare the results between each continent in the gapminder dataset.

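library(ggplot2)  # provides ggplot(), aes(), facet_grid(), and geom_col()

# Keep only the mean and SD rows for lifeExp, then draw one bar per continent
data %>% 
  filter(variable == "lifeExp", stat %in% c("mean", "sd")) %>% 
  ggplot(aes(x = continent, y = value)) +
  facet_grid(stat ~ ., scales = "free") +  # one panel per statistic, each with its own y-axis
  geom_col()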
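
If you would rather compare the numbers in a table than in a plot, the same long layout can be pivoted so that each continent gets one row. The sketch below is a base R illustration, not skimr functionality: the data frame long is typed in by hand from the lifeExp rows of the grouped skim output above, and the object names long and wide are my own.

```r
# Mean and SD of lifeExp by continent, copied from the grouped skim output.
long <- data.frame(
  continent = rep(c("Africa", "Americas", "Asia", "Europe", "Oceania"), each = 2),
  stat      = rep(c("mean", "sd"), times = 5),
  value     = c(48.87, 9.15, 64.66, 9.35, 60.06, 11.86, 71.90, 5.43, 74.33, 3.80)
)

# Pivot from long (one row per continent/statistic) to wide (one row per continent).
wide <- reshape(long, idvar = "continent", timevar = "stat", direction = "wide")
names(wide) <- c("continent", "mean", "sd")
rownames(wide) <- NULL

# Sort so the continent with the highest mean life expectancy comes first.
wide[order(-wide$mean), ]
```

In this layout it is easy to see, for example, that Oceania has both the highest mean life expectancy and the smallest spread.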

Overall, skimr is a simple-to-use summary package whose output can be used in pipes or displayed for the reader. You can modify the default summary statistics, as well as the default formatting. Support for data frames and vectors is included, and you can implement your own skim methods for specific object types. I encourage you to check out the skimr documentation for further details on these topics.

Summary

That about wraps it up for part II. In this post, I covered five approaches to summarizing your data in R:

  1. summary and by in base R
  2. describe in the Hmisc package
  3. describe in the psych package
  4. descr and dfSummary in the summarytools package
  5. skim in the skimr package

I tend to use skimr as my go-to package for descriptive statistics, mostly because I can seamlessly integrate it with my tidy workflow. I used psych::describe as my main data summary function for a very long time because it provided me with descriptive statistics that many other functions neglect. However, since skimr allows you to modify the default summary statistics it displays, I find myself using it more and more. That being said, there are many times I’ll use base R’s summary function because of its simplicity and flexibility with other functions.

However, these five approaches are just the tip of R’s descriptive statistics iceberg. Stay tuned for more descriptive statistics in R part III, coming soon!

Jeremy R. Winget
Graduate Research Assistant & Lecturer
