Usage of eiPartialID

2019-04-10

The R package eiPartialID implements the approach for generating bounds for 2x2 ecological inference proposed in Jiang et al. 2019 (“Ecological Regression with Partial Identification”). Instructions for replicating the results in the paper with this package are provided in the README.md file in the paper’s repo. This vignette gives an overview of applying the approach to additional datasets. The two main user-facing functions are generateBounds() and evaluateBounds(), whose usage we illustrate below. The wrapper bounds() calls generateBounds() followed by evaluateBounds() and returns the CI_0.5 bounds, illustrating how the two functions fit together. Additional options are described in the function documentation (accessible, for example, via ?functionName).

Generate bounds: generateBounds()

By way of example, we use the ‘census’ dataset from the ‘eco’ package.

library("eco")
data("census")
inputDataSet <- census
x <- inputDataSet$X
t <- inputDataSet$Y
n <- inputDataSet$N
trueBetaB <- inputDataSet$W1

The vectors x and t contain the proportions (marginals) of each of the variables across each of the geographic units (in this case, counties), and the vector n contains the total number of residents in each county. The vector trueBetaB contains the true conditional (in this case, the Black literacy rate) for each county. We can call generateBounds(), with printSummary=TRUE, to print out an overview of the proposed bounds:

outputList <- generateBounds(x, t, n, trueBetaB=trueBetaB, useXRangeOffset=TRUE, returnAdditionalStats=FALSE, printSummary=TRUE)

# True B: 0.674809
# Duncan-Davis bounds: [0.535618, 0.974010]
# [l,u]=[min(X_i),max(X_i)]: [0.050810, 0.939290]
# CI_0=[Bl_hat, Bu_hat]: [0.606101, 0.810082]
# CI_1: [0.572566, 0.842403]
# Width-ratio: |CI_0|/|DD|: 0.465295

This information is also saved in the outputList object.
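
To make the deterministic baseline in the summary more concrete, here is a minimal base-R sketch of the classical Duncan-Davis (method of bounds) calculation on a small made-up dataset. The helper name duncanDavisBounds() and the toy vectors are hypothetical and are not part of eiPartialID. Aggregating the unit-level bounds with weights proportional to n * x is the usual choice for a district-level rate and is assumed here, not taken from the package source.

```r
# Hypothetical helper (not part of eiPartialID): classical Duncan-Davis bounds.
# x: proportion of the group of interest in each unit
# t: proportion with the outcome in each unit
# n: total population of each unit
# Assumes 0 < x < 1 in every unit.
duncanDavisBounds <- function(x, t, n) {
  # Unit-level bounds implied by the accounting identity
  # t = betaB * x + betaW * (1 - x), with betaB and betaW both in [0, 1].
  lowerUnit <- pmax(0, (t - (1 - x)) / x)
  upperUnit <- pmin(1, t / x)
  # Aggregate to the district level, weighting by the group count n * x.
  w <- n * x / sum(n * x)
  c(lower = sum(w * lowerUnit), upper = sum(w * upperUnit))
}

# Made-up toy data for three units (illustration only).
xToy <- c(0.2, 0.5, 0.8)
tToy <- c(0.9, 0.6, 0.7)
nToy <- c(1000, 2000, 1500)

ddToy <- duncanDavisBounds(xToy, tToy, nToy)
print(ddToy)
```

The unit-level bounds use only the marginals x and t, which is why they hold with certainty and why they are typically wide.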

Evaluate bounds: evaluateBounds()

Using the outputList object generated by generateBounds(), we can then use evaluateBounds() to apply the selection heuristic proposed in Jiang et al. 2019 and to generate the bounds across confidence levels. Continuing the example above,

summaryOutputList <- evaluateBounds(outputList)

# $x$ & Nominal coverage (\Phi(x)) & True B in CI_x & Width-ratio: |Proposed width|/|DD| & Reverted to DD & Proposed Lower & Proposed Upper \\
# 0.00 & 0.5000 & TRUE & 0.4653 & FALSE & 0.6061 & 0.8101\\
# 0.25 & 0.5987 & TRUE & 0.5028 & FALSE & 0.5977 & 0.8182\\
# 0.50 & 0.6915 & TRUE & 0.5404 & FALSE & 0.5893 & 0.8262\\
# 0.75 & 0.7734 & TRUE & 0.5780 & FALSE & 0.5809 & 0.8343\\
# 1.00 & 0.8413 & TRUE & 0.6155 & FALSE & 0.5726 & 0.8424\\
# 1.25 & 0.8944 & TRUE & 0.6531 & FALSE & 0.5642 & 0.8505\\
# 1.50 & 0.9332 & TRUE & 0.6906 & FALSE & 0.5558 & 0.8586\\
# 1.75 & 0.9599 & TRUE & 0.7282 & FALSE & 0.5474 & 0.8666\\
# 2.00 & 0.9772 & TRUE & 0.7657 & FALSE & 0.5390 & 0.8747\\

In this case, the bounds never revert to the deterministic DD bounds and the true district-level value is always captured.
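
The nominal coverage column is the standard normal CDF evaluated at x, which can be reproduced with pnorm(). The printed endpoints are also consistent, up to rounding, with each endpoint moving linearly in x from CI_0 through CI_1. The sketch below checks this using only the numbers printed above; it is an observation about this output, not a description of how evaluateBounds() computes the bounds internally, and the variable names are illustrative.

```r
# Values of x used in the table above.
xGrid <- seq(0, 2, by = 0.25)

# Nominal coverage Phi(x): the standard normal CDF.
nominalCoverage <- pnorm(xGrid)

# Endpoints of CI_0 and CI_1 as printed by generateBounds() above.
ci0 <- c(0.606101, 0.810082)
ci1 <- c(0.572566, 0.842403)

# Assumption for illustration: each endpoint moves linearly in x.
lowerX <- ci0[1] - xGrid * (ci0[1] - ci1[1])
upperX <- ci0[2] + xGrid * (ci1[2] - ci0[2])

# Width of the Duncan-Davis bounds, from the summary above.
ddWidth <- 0.974010 - 0.535618

checkTable <- data.frame(x = xGrid,
                         nominalCoverage = nominalCoverage,
                         widthRatio = (upperX - lowerX) / ddWidth,
                         lower = lowerX,
                         upper = upperX)
print(round(checkTable, 4))
```

Because the inputs are copied from six-digit printed values, the last digit of a recomputed entry can differ from the table.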

The output object summaryOutputList contains the information printed to standard output in the table above. For example, CI_0.5 (0.5893336, 0.8262426), the third row of the table, corresponds to c(summaryOutputList$CI_x_lower[3], summaryOutputList$CI_x_upper[3]).

Generate and evaluate bounds: bounds()

The function bounds() calls generateBounds() and then evaluateBounds(), saving the CI_0.5 bounds in the returned list. Continuing the example above,

library("eco")
data("census")
inputDataSet <- census
x <- inputDataSet$X
t <- inputDataSet$Y
n <- inputDataSet$N
trueBetaB <- inputDataSet$W1
outputList <- bounds(x, t, n, trueBetaB=trueBetaB)
print(outputList)
# $CI_0.5_lower
# [1] 0.5893336
#
# $CI_0.5_upper
# [1] 0.8262426
#
# $CI_0.5_isSelected
# [1] TRUE
#
# $CI_0.5_widthRatio
# [1] 0.5404046
#
# $CI_0.5_nominalCoverage
# [1] 0.6914625
#
# $CI_0.5_truthCaptured
# [1] TRUE

Here, the CI_0.5 bounds (0.5893336, 0.8262426) correspond to c(outputList$CI_0.5_lower, outputList$CI_0.5_upper). The bounds were not rejected by the heuristic (outputList$CI_0.5_isSelected), the width-ratio relative to the deterministic DD bounds is 0.5404046 (outputList$CI_0.5_widthRatio), the nominal coverage probability is 0.6914625 (outputList$CI_0.5_nominalCoverage), and the true district-level value was captured within the bounds (outputList$CI_0.5_truthCaptured).
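
As a quick sanity check on how these fields relate to one another, the width-ratio and the truth-captured flag can be recomputed by hand from numbers printed earlier in this vignette. The sketch below uses a mock list holding the printed values instead of calling the package, so the object resultMock and the other variable names are illustrative only.

```r
# Mock of the two endpoint fields, copied from the printed output above.
resultMock <- list(CI_0.5_lower = 0.5893336, CI_0.5_upper = 0.8262426)

# Values copied from the generateBounds() summary printed earlier.
ddBounds <- c(0.535618, 0.974010)  # Duncan-Davis bounds
trueB <- 0.674809                  # true district-level value

# Width-ratio: width of the proposed bounds over the width of the DD bounds.
widthRatio <- (resultMock$CI_0.5_upper - resultMock$CI_0.5_lower) / diff(ddBounds)

# Truth captured: does the true value lie inside the proposed bounds?
truthCaptured <- trueB >= resultMock$CI_0.5_lower &&
  trueB <= resultMock$CI_0.5_upper

print(widthRatio)
print(truthCaptured)
```

A width-ratio below 1 means the proposed bounds are narrower than the deterministic bounds; here they are roughly half as wide.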