R offers many tools to analyse the univariate or bivariate distribution of series. This includes table and prop.table on base R, group_by/summarise and count in dplyr. However, these functions are somehow frustrating as some very common tasks, like:

  • adding a total,
  • computing relative frequencies or percentage instead of counts,
  • merging high values of integer series in a >=N category,
  • computing cumulative distributions,
  • computing descriptive statistics.

are tedious. Moreover, to our knowledge, R offers weak support for numerical series for which the numerical value is not known at the individual level, but only the fact that this value belongs to a certain range. descstat is intended to provide user-friendly tools to perform these kind of operations. More specifically, descstat provides:

  • a bin class which makes computations on series that contain bins easy,
  • a freq_table function to construct a frequency tables, and some methods to print, plot and compute some descriptive statistics,
  • a cont_table to compute contingency tables from two series.

These function are writen in the tidyverse style, which means that the pipe operator can be used and that the series can be selected without quotes.

library("descstat")
library("ggplot2")
library("dplyr")

Bins

From continuous numerical values to bins

bin is a new class intended to deal easily with series that indicates in which range of numerical values an observation belongs to. It is basically a factor or a character, with values having a very strict format, like [10,20), (20,30], [50,Inf), ie :

  • an oppening bracket that should be either ( or [ for respectively a bin open or closed on the left,
  • a first numerical value (the lower bound of the bin),
  • a comma,
  • a second numerical value (the upper bound of the bin),
  • a closing bracket that should be either ) or ] for respectively a bin open or closed on the right.

This format is returned by the base::cut function which default method is relevant for numeric series and has a breaks argument (a numeric containing the breaks) and a right argument (if TRUE the bin is closed on the right).

z <- c(1, 5, 10, 12, 4, 9, 8)
bin1 <- cut(z, breaks = c(1, 8, 12), right = FALSE)
bin2 <- cut(z, breaks = c(1, 8, 12), right = TRUE)
bin3 <- cut(z, breaks = c(1, 8, 12, Inf), right = FALSE)
tibble(z, bin1, bin2, bin3)
## # A tibble: 7 x 4
##       z bin1   bin2   bin3    
##   <dbl> <fct>  <fct>  <fct>   
## 1     1 [1,8)  <NA>   [1,8)   
## 2     5 [1,8)  (1,8]  [1,8)   
## 3    10 [8,12) (8,12] [8,12)  
## 4    12 <NA>   (8,12] [12,Inf)
## 5     4 [1,8)  (1,8]  [1,8)   
## # … with 2 more rows

Note that :

  • the value of 8 is included in the [8,12) bin when right = FALSE and in the (1,8] bin when right = TRUE,
  • the value of 1 results in a NA when right = TRUE as the lowest value of breaks is 1,
  • the value of 12 results in a NA when right = FALSE as 12 is the highest value of breaks.

cut returns a factor, but sometimes series containing bins can be provided as a character. In this case, there is a problem with the ordering of the distinct values, as illustrated below:

bin3chr <- as.character(bin3)
bin3chr
## [1] "[1,8)"    "[1,8)"    "[8,12)"   "[12,Inf)"
## [5] "[1,8)"    "[8,12)"   "[8,12)"
factor(bin3chr)
## [1] [1,8)    [1,8)    [8,12)   [12,Inf) [1,8)   
## [6] [8,12)   [8,12)  
## Levels: [1,8) [12,Inf) [8,12)
sort(unique(bin3chr))
## [1] "[1,8)"    "[12,Inf)" "[8,12)"

The problem with bins stored in characters is that the distinct values don't appear in the correct order while sorted (ie [12,Inf) before [8,12)).

The bin class

To create bin objects, an extract method is provided for characters and factors which create a tibble containing the different elements of the unique bins:

bin3chr  %>% extract
## # A tibble: 3 x 5
##   bin      left  first  last right
##   <chr>    <chr> <int> <dbl> <chr>
## 1 [1,8)    [         1     8 )    
## 2 [8,12)   [         8    12 )    
## 3 [12,Inf) [        12   Inf )

The relevance of the values to represent bins is then checked and the as_bin function returns an object of class bin which is a factor with the levels in the relevant order.

bin3chr %>% as_bin
## [1] [1,8)    [1,8)    [8,12)   [12,Inf) [1,8)   
## [6] [8,12)   [8,12)  
## Levels: [1,8) [8,12) [12,Inf)

In case of incorrect values in the series, NA's are returned.

bin4 <- c("[1,8)", "[1, 8)", "[8,12", "[12,inf)", "[1,8)",
          "[8,12)", "[8,12)")
bin4 %>% as_bin
## [1] [1,8)  <NA>   <NA>   <NA>   [1,8)  [8,12) [8,12)
## Levels: [1,8) [8,12)

From bins to numerical values

The real strength of the bin class is to allow calculus on the underlying numerical values. To perform this task, an as_numeric function is provided which returns a numeric series. The basic use of this function returns simply the lower bound:

bin3 %>% as_numeric
## [1]  1  1  8 12  1  8  8

but a pos argument can be used to return any value in the range of the bin, by defining the relative position, ie:

  • pos = 0 (the default) for the lower bound,
  • pos = 1 for the upper bound,
  • pos = 0.5 for the center of the bin.
bin3 %>% as_numeric(pos = 1)
## [1]  8  8 12 16  8 12 12
bin3 %>% as_numeric(pos = 0.5)
## [1]  4.5  4.5 10.0 14.0  4.5 10.0 10.0

While computing some statistics on bins, it is custumory to consider that, for all the individuals belonging to a specific bin, the value is just the center (ie all the individual values are equal to 10 for individuals belonging to the [8,12) bin), or the values are uniformaly distributed in the bin. This value is returned by as_numeric with pos = 0.5, but there is a specific problem for the last bin if, as it is the case here, the last bin is open to infinity on the right. In this case, the default behaviour of as_numeric is to set the width of the last bin to the width of the just before last one. As the width ot the [8,12) bin is 4, the width of the last one is also set to 4, which mean an upper bound of 16 and a center value of 14. This behaviour can be changed either by setting the wlast argument to a different value, which is interpreted as a multiple of the width of the just before last bin. For example, wlast = 4 means that the width of the last bin is set to 4 times the one of the before last bin, which means 16, and the resulting bin is [12,28) with a center value of 20.

bin3 %>% as_numeric(pos = 0.5, wlast = 4)
## [1]  4.5  4.5 10.0 20.0  4.5 10.0 10.0

The same result can be obtained by directly setting the center of the last bin using the xlast argument:

bin3 %>% as_numeric(pos = 0.5, xlast = 20)

Finally, a specific center for the first bin can also be set using the xfirst argument:

bin3 %>% as_numeric(pos = 0.5, xlast = 20, xfirst = 6)
## [1]  6  6 10 20  6 10 10

Frequency tables for discrete series

Frequency tables summarise the univariate distribution of a categorical or an integer series. In tidyverse, this task can be performed using the dplyr::count function. For example, the rgp data set, which is an extract of the French cencus, contains the number of children in households:

rgp %>% count(children)
## # A tibble: 8 x 2
##   children     n
## *    <int> <int>
## 1        0   317
## 2        1   272
## 3        2   260
## 4        3   112
## 5        4    31
## # … with 3 more rows

Computing a frequency table

The descstat::freq_table function performs the same task and returns by default exactly the same tibble as the one returned by dplyr::count:

rgp %>% freq_table(children)

but it has several further arguments which can improve the result:

  • f is a character containing one or several letters and indicates what kind of frequencies should be computed:

    • n for the count or absolute frequencies (the default),
    • f for the (relative) frequencies,
    • p for the percentage (ie \(f\times 100\)),
    • N, F and P for the cumulative values of n, f and p.
  • total a boolean, if TRUE, a total is returned,
  • max, suitable for an integer series only, is an integer which is the maximum value presented in the frequency table, eg max = 3 creates a last line which is >=3.

The following command use all the possible letters.

rgp %>% freq_table(children, "nfpNFP")
## # A tibble: 8 x 7
##   children     n     f     p     N     F     P
## *    <int> <int> <dbl> <dbl> <int> <dbl> <dbl>
## 1        0   317 0.317  31.7   317 0.317  31.7
## 2        1   272 0.272  27.2   589 0.589  58.9
## 3        2   260 0.26   26     849 0.849  84.9
## 4        3   112 0.112  11.2   961 0.961  96.1
## 5        4    31 0.031   3.1   992 0.992  99.2
## # … with 3 more rows

As there are few occurrences of families with more than 3 children, we set max = 3 and add a total by setting total to TRUE.

rgp %>% freq_table(children, max = 3, total = TRUE)
## # A tibble: 5 x 2
##   children     n
##   <fct>    <int>
## 1 0          317
## 2 1          272
## 3 2          260
## 4 >=3        151
## 5 Total     1000

Printing a frequency table

Note that in the printed table, the children series is now a character as the last two values are >=3 and Total. Actually freq_table returns an object of class freq_table which inherits from the tbl_df class. A look at the structure of the object:

rgp %>% freq_table(children, max = 3, total = TRUE) %>% str
## freq_table [5 × 2] (S3: freq_table/tbl_df/tbl/data.frame)
##  $ children: num [1:5] 0 1 2 3.34 NA
##  $ n       : int [1:5] 317 272 260 151 1000

indicates that we still have a tibble, with a numeric children series for which the last two values equal to 3.34 (which is the average of the values greater or equal to 3) and NA (for the total).

A pre_print function is provided with a method for freq_table objects. It turns the children series in a character, with 3.34 and NA replaced by >=3 and Total. This pre_print method is included in the format method for freq_table objects, which is then passed to the tbl_df method:

descstat:::format.freq_table
## function (x, ..., n = NULL, width = NULL, n_extra = NULL) 
## {
##     x <- pre_print(x)
##     class(x) <- setdiff(class(x), "freq_table")
##     format(x, ..., n = n, width = width, n_extra = n_extra)
## }
## <bytecode: 0x561e8e906ca8>
## <environment: namespace:descstat>

The pre_print function should be used explicitly while using knitr::kable, as this function doesn't use any format method:

rgp %>% freq_table(children, max = 3, total = TRUE) %>%
    pre_print %>% knitr::kable()
children n
0 317
1 272
2 260
>=3 151
Total 1000

Ploting a frequency table

The most natural way to plot a frequency table with ggplot is to use geom_col (or equivalently geom_bar with stat = 'identity').

cld <- rgp %>% freq_table(children, f = "nf", max = 3)
cld %>% pre_print %>% ggplot(aes(children, f)) +
    geom_col(fill = "white", color = "black")

Note the use of the pre_print method which turns the 3.34 numerical value in >=3.

To get more enhanced graphics, a pre_plot method is provided, with a plot argument equal to stacked or cumulative. With plot = "stacked":

cld %>% pre_print %>% pre_plot("f", plot = "stacked")
## # A tibble: 4 x 3
##   children     f  ypos
## * <fct>    <dbl> <dbl>
## 1 >=3      0.151 0.924
## 2 2        0.26  0.719
## 3 1        0.272 0.453
## 4 0        0.317 0.158

pre_plot returns an ypos series which indicates the coordinates where to write labels (here the values of the series).

bnp <- cld %>% pre_print %>% pre_plot("f", plot = "stacked") %>%
    ggplot(aes(x = 2, y = f, fill = children)) +
    geom_col() +
    geom_text(aes(y = ypos, label = children)) +
    scale_x_continuous(label = NULL) +
    scale_fill_brewer(palette = "Set3") +
    guides(fill = FALSE)
bnp

using polar coordinates, we get a pie chart:

bnp + coord_polar(theta = "y") + theme_void()

changing the range of the x values, we get a hole in the pie chart which result in the so called donut chart:

bnp + scale_x_continuous(limits = c(1, 2.5)) +
    coord_polar(theta = "y") + theme_void()

The other possible value for the plot argument of pre_plot is cumulative:

cld <- rgp %>% freq_table(children, "F", max = 5, total = TRUE)
cld %>% pre_plot(plot = "cumulative") %>% print(n = 5)
## # A tibble: 12 x 5
##   pos       x  xend     y   yend
## * <chr> <dbl> <dbl> <dbl>  <dbl>
## 1 hor       0    NA 0.317  0.317
## 2 vert     NA    NA 0.317 NA    
## 3 hor       1     0 0.589  0.589
## 4 vert      0     0 0.589  0.317
## 5 hor       2     1 0.849  0.849
## # … with 7 more rows

this returns four series which have the names of the aesthetics that are mandatory for geom_segment; x, xend, y and yend. It also returns a pos series with which one can draw differently the horizontal (hor) and the vertical (vert) segments, using for example the linetype aesthetic.

cld %>% pre_plot(plot = "cumulative") %>% ggplot() +
    geom_segment(aes(x = x, xend = xend, y = y, yend = yend,
                     linetype = pos)) +
    guides(linetype = FALSE) +
    labs(x = "number of children", y = "cumulative frequency")

Frequency tables for bin series

As an example, consider the wages data set that contains two series called wage and size which respectively indicate bins for wage and firm size.

wages %>% print(n = 3)
## # A tibble: 1,000 x 6
##   sector           age hours sex    wage    size    
##   <fct>          <int> <int> <fct>  <fct>   <fct>   
## 1 industry          37  1712 male   [14,16) [1,10)  
## 2 administration    57   598 female [4,6)   [50,100)
## 3 business          30  1820 male   [40,50) [20,50) 
## # … with 997 more rows

Creating bins tables

Applying descstat::freq_table for the size series, we get:

wages %>% freq_table(size) %>% print(n = Inf)
## # A tibble: 6 x 3
##   size          x     n
## * <chr>     <dbl> <int>
## 1 [1,10)      5.5   207
## 2 [10,20)    15      90
## 3 [20,50)    35     124
## 4 [50,100)   75     113
## 5 [100,250) 175     124
## 6 [250,Inf) 325     342

A breaks argument is provided, which is a numerical vector that can be used to reduce the number of bins:

wages %>% freq_table(size, breaks = c(20, 250))
## # A tibble: 3 x 3
##   size          x     n
## * <chr>     <dbl> <int>
## 1 [1,20)     10.5   297
## 2 [20,250)  135     361
## 3 [250,Inf) 365     342

The breaks argument should only contain values that are bounds of the initial set of bins. If the breaks argument is of length 1, all the bins containing values greater than the one specified are merged:

wages %>% freq_table(size, breaks = 50)
## # A tibble: 4 x 3
##   size         x     n
## * <chr>    <dbl> <int>
## 1 [1,10)     5.5   207
## 2 [10,20)   15      90
## 3 [20,50)   35     124
## 4 [50,Inf)  65     579

Internally, the breaks argument is passed to the cut methods for bin objects.

A frequency table with bins can also be created from a continuous numerical series, as price in the padova data set:

padova %>% pull(price) %>% range
## [1]  35 950
padova %>% freq_table(price, breaks = c(250, 500, 750))
## # A tibble: 4 x 3
##   price         x     n
## * <chr>     <dbl> <int>
## 1 [0,250)     125   646
## 2 [250,500)   375   333
## 3 [500,750)   625    49
## 4 [750,Inf)   875    14
padova %>% freq_table(price, breaks = c(30, 250, 500, 750, 1000))
## # A tibble: 4 x 3
##   price           x     n
## * <chr>       <dbl> <int>
## 1 [30,250)      140   646
## 2 [250,500)     375   333
## 3 [500,750)     625    49
## 4 [750,1e+03)   875    14

Note that in this case, the first (last) values of breaks can be either:

  • outside the range of the series; in this case the lower bound of the first bin and the upper bound of the last bin are these values,
  • inside the range of the series; in this case the lower bound of the first bin is 0 and the upper bound of the last bin is Inf.

The f argument may contain further letters than for the discrete series case:

  • d for density, which is the relative frequency divided by the bin's width,
  • m for the mass frequencies (the mass is the product of the absolute frequency and the value of the variable),
  • and M for cumulated mass frequencies.
wages %>% freq_table(size, "dmM", breaks = c(20, 100, 250))
## # A tibble: 4 x 5
##   size          x        d      m      M
## * <chr>     <dbl>    <dbl>  <dbl>  <dbl>
## 1 [1,20)     10.5 0.0156   0.0208 0.0208
## 2 [20,100)   60   0.00296  0.0947 0.115 
## 3 [100,250) 175   0.000827 0.144  0.260 
## 4 [250,Inf) 325   0.00228  0.740  1

Other values of the variable can be included in the table using the vals argument, which is a character including some of the following letters:

  • l for the lower bound of the bin,
  • u for the upper bound of the bin,
  • a for the width of the bins.
wages %>% freq_table(size, "p", vals = "xlua", breaks = c(20, 100, 250), wlast = 2)
## # A tibble: 4 x 6
##   size          x     p     l     u     a
## * <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 [1,20)     10.5  29.7     1    20    19
## 2 [20,100)   60    23.7    20   100    80
## 3 [100,250) 175    12.4   100   250   150
## 4 [250,Inf) 400    34.2   250   550   300

Ploting bins tables

Relevant plots for numerical continuous series are very different from those suitable for discrete or categorial series. ggplot provides three geoms to plot the distribution of a numerical series: geom_histogram, geom_density and geom_freqpoly. These three geoms use the bin stat, which means that they consider a raw vector of numerical values, create bins, count the number of observations in each bin and then plot the result:

padova %>% ggplot(aes(price)) +
    geom_histogram(aes(y = ..density..), color = "black", fill = "white") +
    geom_freqpoly(aes(y = ..density..), color = "red") +
    geom_density(color = "blue")

These geoms can be used when individual numerical data are available, but not when only bins are observed. A pre_plot method is provided for freq_table objects, with the plot argument either equal to histogram (the default) or freqpoly. The resulting table contains columns called x and y and can be ploted using geom_polygon for an histogram and geom_line for a frequency polygon.

ftwage <- wages %>% freq_table(wage, "d", breaks = c(10, 20, 30, 40, 50))
ftwage %>% pre_plot(plot = "histogram") %>%
    ggplot(aes(x, y)) + geom_polygon(fill = "white", color = "black")

ftwage %>% pre_plot(plot = "freqpoly") %>%
    ggplot(aes(x, y)) + geom_line()

Another popular plot for continuous numerical series is the Lorenz curve, which indicates the relation between the cumulative distributions of the frequencies and the masses of the series. The data necessary to draw this curve are obtained using plot = "lorenz". Not that in this case, the table should contain F and M.

lzc <- wages %>% freq_table(wage, "MF", breaks = c(10, 20, 30, 40, 50)) %>%
    pre_plot(plot = "lorenz")
lzc
## # A tibble: 24 x 4
##   cls      pts       F      M
## * <chr>    <lgl> <dbl>  <dbl>
## 1 [0.2,10) FALSE 0     0     
## 2 [0.2,10) TRUE  0     0     
## 3 [0.2,10) TRUE  0.184 0.0391
## 4 [0.2,10) FALSE 0.184 0     
## 5 [10,20)  FALSE 0.184 0     
## # … with 19 more rows

Each line in the resulting tibble indicate the coordinates (F for x and M for y) of the points that are necessary to plot the polygons under the Lorenz curve. pts is a logical which indicates on which lines of the tibble there are points that are part of the Lorenz curve:

lzc %>% ggplot(aes(F, M)) +
    geom_polygon(fill = "lightyellow", color = "black") +
    geom_point(data = filter(lzc, pts)) +
    geom_line(data = tibble(F = c(0, 1), M = c(0, 1)), color = "blue") +
    geom_line(data = tibble(F = c(0, 1, 1), M = c(0, 0, 1)), color = "red")

The basic plot is obtained with a call to geom_polygon and geom_point and we added two calls to geom_line to draw the two extreme cases of a Lorenz curve:

  • the blue curve for a completely equal distribution,
  • the red curve for a completely unequal distribution.

Dealing with frequency tables

freq_table is not only designed for data sets containing individual observations but also for frequency tables, like the income data set:

income
## # A tibble: 25 x 3
##   inc_class number tot_inc
##   <chr>      <dbl>   <dbl>
## 1 [0,10)      8.76    36.3
## 2 [10,12)     2.11    23.2
## 3 [12,15)     3.40    46.2
## 4 [15,20)     6.00   104. 
## 5 [20,30)     6.98   172. 
## # … with 20 more rows

which presents the distribution of income in France with 25 bins; inc_class contains range of yearly income (in thousands of euros) number is the number of households (in millions) and tot_inc is the mass of income in the bin (in thousands of million). To use freq_table to read this frequency table, a further argument freq is required and should indicate which column of the tibble contains the frequencies:

income %>% freq_table(inc_class, freq = number)
## # A tibble: 25 x 3
##   inc_class     x     n
## * <chr>     <dbl> <dbl>
## 1 [0,10)      5    8.76
## 2 [10,12)    11    2.11
## 3 [12,15)    13.5  3.40
## 4 [15,20)    17.5  6.00
## 5 [20,30)    25    6.98
## # … with 20 more rows

If one column of the tibble contains the mass of the variables (the tot_inc series in the income tibble), it can be indicated using the mass argument:

income %>% freq_table(inc_class, freq = number, mass = tot_inc)
## # A tibble: 25 x 3
##   inc_class     x     n
## * <chr>     <dbl> <dbl>
## 1 [0,10)     4.15  8.76
## 2 [10,12)   11.0   2.11
## 3 [12,15)   13.6   3.40
## 4 [15,20)   17.4   6.00
## 5 [20,30)   24.6   6.98
## # … with 20 more rows

the center of the bins are then calculated by dividing the mass by the frequency, which result, by definition, to the exact mean value for every bin.

Univariate descriptive statistics

Descriptive statistics can easily be computed applying functions to freq_table objects. The problem is that only a few statistical functions of the base and the stats packages are generic. For these, methods where written for freq_table objects, for the other ones, we had to create new functions. The following table indicates the R functions and the corresponding descstat functions.

R descstat
mean mean
median median
quantile quantile
var variance
sd stdev
mad madev
modval
medial
gini
skewness
kurtosis

To compute the central values statistics of the distribution of wages we use:

z <- wages %>% freq_table(wage)
z %>% mean
## [1] 24.12
z %>% median
## [1] 21.94
z %>% modval
## # A tibble: 1 x 3
##   wage        x     n
##   <chr>   <dbl> <int>
## 1 [24,26)    25    69

median returns a value computed using a linear interpolation. modval returns the mode, which is a one line tibble containing the bin, the center of the bin and the modal value.

For the dispersion statistics:

z %>% stdev
## [1] 14.68
z %>% variance
## [1] 215.4
z %>% madev
## [1] 11.57

For the quantiles, the argument y can be used to compute the quantiles using the values of the variable (y = "value", the default) or the masses (y = "mass"):

z %>% quantile(probs = c(0.25, 0.5, 0.75))
## [1] 13.45 21.94 32.31
z %>% quantile(y = "mass", probs = c(0.25, 0.5, 0.75))
## [1] 13.45 21.94 32.31

The quantile of level 0.5 is the median in the first case, the medial in the second case:

z %>% median
## [1] 21.94
z %>% medial
## [1] 21.94

gini computes the Gini coefficient of the series:

z %>% gini
## [1] 0.3403

skewness and kurtosis compute Fisher's shape statistics:

z %>% skewness
## [1] 1.898
z %>% kurtosis
## [1] 1.48

All these functions also work for counts. Of course, for categorical series, all these functions are irrelevant except the one which computes the mode:

wages %>% freq_table(sector) %>% modval
## # A tibble: 1 x 2
##   sector       n
##   <chr>    <int>
## 1 services   362

Contingency tables

With dplyr, a contingency table can be computed using count with two categorical variables. Let's first reduce the number of classes of size and wage in the wages table.

wages2 <- wages %>%
    mutate(size = cut(size, c(20, 50, 100)),
           wage = cut(wage, c(10, 30, 50)))
wages2 %>% count(size, wage)
## # A tibble: 16 x 3
##   size    wage         n
##   <fct>   <fct>    <int>
## 1 [1,20)  [0.2,10)    74
## 2 [1,20)  [10,30)    169
## 3 [1,20)  [30,50)     41
## 4 [1,20)  [50,Inf)    13
## 5 [20,50) [0.2,10)    17
## # … with 11 more rows

To get a "wide" table, with the values of one of the two variables being the columns, we can use tidyr::pivot_wider:

wages2 %>% count(size, wage) %>%
    tidyr::pivot_wider(values_from = n, names_from = size)
## # A tibble: 4 x 5
##   wage     `[1,20)` `[20,50)` `[50,100)` `[100,Inf)`
##   <fct>       <int>     <int>      <int>       <int>
## 1 [0.2,10)       74        17         30          63
## 2 [10,30)       169        75         55         236
## 3 [30,50)        41        26         21         103
## 4 [50,Inf)       13         6          7          64

Creating contigency tables using cont_table

The same contingency table can be obtained using descstat::cont_table:

wages2 %>% cont_table(wage, size)
## # A tibble: 4 x 5
##   `wage|size` `[1,20)` `[20,50)` `[50,100)` `[100,Inf)`
##   <fct>          <int>     <int>      <int>       <int>
## 1 [0.2,10)          74        17         30          63
## 2 [10,30)          169        75         55         236
## 3 [30,50)           41        26         21         103
## 4 [50,Inf)          13         6          7          64

The result is a cont_table object, which is a tibble in "long" format, as the result of the dplyr::count function. The printing of the table in "wide" format is performed by the pre_print method, which is included in the format method, but should be used explicitely while using knitr::kable.

wages2 %>% cont_table(wage, size) %>%
    pre_print %>% knitr::kable()
wage|size [1,20) [20,50) [50,100) [100,Inf)
[0.2,10) 74 17 30 63
[10,30) 169 75 55 236
[30,50) 41 26 21 103
[50,Inf) 13 6 7 64

The total argument can be set to TRUE to get a row and a column of totals for the two series and, to save place, the row_name can be set to FALSE so that the first column, which contains the modalities of the first series is unnamed.

wages2 %>% cont_table(wage, size, total = TRUE) %>% print(row_name = FALSE)
## # A tibble: 5 x 6
##   ` `   `[1,20)` `[20,50)` `[50,100)` `[100,Inf)` Total
##   <chr>    <int>     <int>      <int>       <int> <int>
## 1 1           74        17         30          63   184
## 2 2          169        75         55         236   535
## 3 3           41        26         21         103   191
## 4 4           13         6          7          64    90
## 5 Total      297       124        113         466  1000

A weights argument is used to mimic the population. For example, the employment table contains a series of weights called weights:

employment %>% cont_table(age, sex, weights = weights, total = TRUE)
## # A tibble: 6 x 4
##   `age|sex`   male female  Total
##   <chr>      <dbl>  <dbl>  <dbl>
## 1 1         12676. 14050. 26726.
## 2 2          8482. 12563. 21045.
## 3 3          9254. 12355. 21610.
## 4 4         10090. 10456. 20546.
## 5 5         15246. 21409. 36654.
## # … with 1 more row

as for freq_table, central values for the first and the last class can be set using arguments xfirst#, xlast# and wlast#, where # is equal to 1 or 2 for the first and the second variable indicated in the cont_table function.

Plotting a contingency table

A contingency table can be ploted using geom_point, with the size of the points being proportional to the count of the cells. The pre_plot method replaces classes by values.

wages2 %>% cont_table(size, wage) %>% pre_plot %>%
    ggplot() + geom_point(aes(size, wage, size = n))

Computing the distributions from a contingency table

\(n_{ij}\) is the count of the cell corresponding to the \(i\)th modality of the first variable and the \(j\)th modality of the second one.

  • the joint distribution is obtained by dividing \(n_{ij}\) by the sample size,
  • the marginal distribution is obtained by summing the counts column-wise \(n_{i.}=\sum_{j}n_{ij}\) for the first variable and row-wise \(n_{.j} = \sum_{i}n_{ij}\) for the second one,
  • the conditional distribution is obtained by dividing the joint by the marginal frequencies.

The joint, marginal and conditional functions return these three distributions. The last two require an argument y which is one of the two series of the cont_table object.

wht <- wages2 %>% cont_table(size, wage)
wht %>% joint
## # A tibble: 4 x 5
##   `size|wage` `[0.2,10)` `[10,30)` `[30,50)` `[50,Inf)`
##   <fct>            <dbl>     <dbl>     <dbl>      <dbl>
## 1 [1,20)           0.074     0.169     0.041      0.013
## 2 [20,50)          0.017     0.075     0.026      0.006
## 3 [50,100)         0.03      0.055     0.021      0.007
## 4 [100,Inf)        0.063     0.236     0.103      0.064
wht %>% marginal(size)
## # A tibble: 4 x 3
##   size          x     f
## * <chr>     <dbl> <dbl>
## 1 [1,20)     10.5 0.297
## 2 [20,50)    35   0.124
## 3 [50,100)   75   0.113
## 4 [100,Inf) 125   0.466
wht %>% conditional(size)
## # A tibble: 4 x 5
##   `size|wage` `[0.2,10)` `[10,30)` `[30,50)` `[50,Inf)`
##   <fct>            <dbl>     <dbl>     <dbl>      <dbl>
## 1 [1,20)          0.402      0.316     0.215     0.144 
## 2 [20,50)         0.0924     0.140     0.136     0.0667
## 3 [50,100)        0.163      0.103     0.110     0.0778
## 4 [100,Inf)       0.342      0.441     0.539     0.711

Note that marginal, as it returns an univariate distribution, is coerced to a freq_table object.

Computing descriptive statistics

Descriptive statistics can be computed using any of the three distributions. Using the joint distribution, we get a tibble containing two columns for the two series.

wht %>% joint %>% mean
## # A tibble: 1 x 2
##    size  wage
##   <dbl> <dbl>
## 1  74.2  24.7
wht %>% joint %>% stdev
## # A tibble: 1 x 2
##    size  wage
##   <dbl> <dbl>
## 1  51.0  15.5
wht %>% joint %>% variance
## # A tibble: 1 x 2
##    size  wage
##   <dbl> <dbl>
## 1 2598.  239.
wht %>% joint %>% modval
## # A tibble: 2 x 4
##   series value       x     f
##   <chr>  <chr>   <dbl> <dbl>
## 1 size   [1,20)   10.5 0.297
## 2 wage   [10,30)  20   0.535

The same (univariate) statistics can be obtained using the marginal distribution:

wht %>% marginal(size) %>% mean
## [1] 74.18

or even more simply considering the univariate distribution computed by freq_table:

wages2 %>% freq_table(size) %>% mean
## [1] 74.18

The mean, stdev and variance methods are actually only usefull when applied to a conditional distribution; in this case, considering for example the conditional distribution of the first variable, there are as many values returned that the number of modalities of the second (conditioning) variable.

wht %>% conditional(wage) %>% mean
## # A tibble: 4 x 2
##   size       mean
##   <chr>     <dbl>
## 1 [1,20)     20.8
## 2 [20,50)    24.1
## 3 [50,100)   22.2
## 4 [100,Inf)  27.9
wht %>% conditional(wage) %>% variance
## # A tibble: 4 x 2
##   size      variance
##   <chr>        <dbl>
## 1 [1,20)        180.
## 2 [20,50)       175.
## 3 [50,100)      227.
## 4 [100,Inf)     276.

Regression curve

The total variance of \(X\) can be writen as the sum of:

  • the explained variance, ie the variance of the conditional means,
  • the residual variance, ie the mean of the conditional variances.

\[ s_{x}^2 = \sum_j f_{.j} s^2_{x_j} + \sum_j f_{.j} (\bar{x}_j - \bar{\bar{x}}) ^ 2 \]

The decomposition of the variance can be computed by joining tables containing the conditional moments and the marginal distribution of the conditioning variable and then applying the formula:

cm <- wht %>% conditional(wage) %>% mean# %>% rename(mean = wage)
cv <- wht %>% conditional(wage) %>% variance# %>% rename(variance = wage)
md <- wht %>% marginal(size)
md %>% left_join(cm) %>% left_join(cv) %>%
    summarise(om = sum(f * mean),
              ev = sum(f * (mean - om) ^ 2),
              rv = sum(f * variance),
              tv = ev + rv) ->  ra

Or more simply using the anova method for cont_table objects:

wht_wage <- wht %>% anova("wage")
wht_wage
## # A tibble: 4 x 5
##   size          x     f  mean variance
## * <chr>     <dbl> <dbl> <dbl>    <dbl>
## 1 [1,20)     10.5 0.297  20.8     180.
## 2 [20,50)    35   0.124  24.1     175.
## 3 [50,100)   75   0.113  22.2     227.
## 4 [100,Inf) 125   0.466  27.9     276.

which has a summary method which computes the different elements of the decomposition:

wht_wage %>% summary
## # A tibble: 1 x 4
##   inter intra total  ratio
##   <dbl> <dbl> <dbl>  <dbl>
## 1  10.0  229.  239. 0.0419

and especially the correlation ratio, which is obtained by dividing the explained variance by the total variance.

The regression curve of wage on size can be plotted using wht_wage, together with error bars:

wht_wage %>% ggplot(aes(x, mean)) + geom_point() +
    geom_line(lty = "dotted") +
    geom_errorbar(aes(ymin = mean - sqrt(variance), ymax = mean + sqrt(variance))) +
    labs(x = "size", y = "wage")    

Linear regression

For the joint distribution, two other functions are provided, covariance and correlation (which are the equivalent of the non-generic stats::cov and stats::cor functions) to compute the covariance and the linear coefficient of correlation.

wht %>% joint %>% covariance
## [1] 67.81
wht %>% joint %>% correlation
## [1] 0.08599

The regression line can be computed using the regline function:

rl <- regline(wage ~ size, wht)
rl
## [1] -43.1337   0.9141

which returns the intercept and the slope of the regression of wage on size. We can then draw the points and the regression line:

wht %>% pre_plot %>% ggplot() + geom_point(aes(size, wage, size = n)) +
    geom_abline(intercept = rl[1], slope = rl[2])