---
title: "Censored Data Analysis"
author: "Dennis Helsel and Paul Julian"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Censored Data Analysis}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
editor_options:
chunk_output_type: console
---
```{r, include = FALSE}
# see https://r-pkgs.org/vignettes.html
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.width=6.5,fig.height=4,fig.align='center'
)
```
## Start RStudio
***Caveat:** recently there has been debate on the use and practices surrounding setting a working directory and workflow critiques (see [link](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/)). For purposes of this vignette/training, we will use `setwd()` for simplicity. But honestly, do what works and is comfortable for you. Go forth and code!*
### Set the working directory:
Set the working directory to one in which you will save any worksheets or output. In RStudio you can easily do this using RStudio's pull-down menu: `Session > Set working directory > Choose directory`
or
```{r,eval=FALSE}
setwd(".../NADA2")
```
if you are using `setwd()`, for reproducibility it is recommend to use the code above.
### Load Packages
Load the packages needed. Install 15 packages: You can either do this manually via the packages tab by checking the boxes next to the package names or via console/R-script using `install.packages(...)` and `library(...)`.
*Here is a trick to use a list of packages combined with a function that will check if the package is installed, if not install it and then load it.*
```{r,warning=F,message=F,eval=F,results="hide"}
check.packages <- function(pkg){
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
install.packages(new.pkg, dependencies = TRUE)
sapply(pkg, require, character.only = TRUE)
}
pkg <- c("bestglm","car","cenGAM","EnvStats","fitdistrplus","Kendall",
"mgcv","multcomp","NADA","nlme","perm","rms","survminer",
"vegan","NADA2","bestglm")
check.packages(pkg)
```
*Loading libraries using console/R-script*
```{r setup,warning=F,message=F,results="hide"}
# Load Package
library(cenGAM)
library(EnvStats)
library(fitdistrplus)
library(Kendall)
library(mgcv)
library(multcomp)
library(NADA)
library(perm)
library(rms)
library(survminer)
library(vegan)
library(NADA2)
library(bestglm)
library(car)
library(nlme)
library(rms)
```
## Loading R Format Databases From Packages
Some packages have datasets contained within them. When you load the packages you have access to those datasets. You can type data() to list all datasets available within the R base packages and in the packages you've loaded. Once you know the names of the datasets you want to load, use the data (datasetname) command to load the dataset.
```{r load data example}
data(Golden); # From NADA package
head(Golden,5L)
```
## Loading External Datasets of Various Formats
For purposes of this vignette, we won't be loading external dataset, but here is a description of how to load data of several formats from external files.
### Read in an R-format (`.rda`) file.
In the Environment tab, click the open folder icon. Go to the directory where the data are located and choose the file name.
**console/R-script**
```{r,eval=F}
dat<-load(".../path/to/data/data.rda")
```
### Read in an excel format (`.xlsx`/`.xls`) file.
In the environment tab, click on the Import Data button. Choose the "From Excel" option. Go to the folder where the data file is located and choose the file name. If there are variable names stored as column names, make sure the box next to First Row as Names is checked, and click Import.
**console/R-script**
There are several R-packages that will read `xlsx` or `xls` files into R. Beware some packages have recod limitations.
```{r,eval=F}
library(readxl)
dat <- read_excel(".../path/to/data/data.xlsx",sheet=4)
# or
library(openxlsx)
dat <- read.xlsx(".../path/to/data/data.xlsx",sheet=4)
```
### Read in an csv format (`.csv`) file.
In the environment tab, click on the Import Data button. Choose the "From Text (base)..." option. Go to the folder containing the file and choose the file name. Make sure the Heading button YES is selected if the first row in the dataset are the variable names (text). Change the na.strings entry to whatever in the dataset represents a missing value (often a blank in Excel). Click the Import button.
**console/R-script**
```{r,eval=F}
dat <- read.csv(".../path/to/data/data.csv")
```
### Read in an txt format (`.txt`) file.
In the environment tab, click on the Import Data button. Choose the "From Text (base)..." option. Go to the folder containing the data file and choose the file name. Make sure the Heading button YES is selected if the first row in the dataset are the variable names (text). If necessary, change the na.strings entry to whatever in the dataset represents a missing value (often a blank in Excel). Click the Import button.
**console/R-script**
```{r,eval=F}
dat <- read.table(".../path/to/data/data.txt")
```
For other data with specific delimiters (i.e. tab delimited) run `?read.table` or `?read.delim` for more info.
## Plotting Censored Data
### Boxplots
*Data: Zinc dataset*
```{r bxplot1,fig.width=6.5,fig.height=4,fig.align='center'}
data(CuZn); # Data from the NADA package
cboxplot(CuZn$Zn,CuZn$ZnCen,CuZn$Zone,minmax = TRUE,Xlab="Zone",Ylab="Zn")
```
```{r bxplot2,fig.width=6.5,fig.height=4,fig.align='center'}
cboxplot(CuZn$Zn,CuZn$ZnCen,CuZn$Zone,LOG=TRUE,Xlab="Zone",Ylab="Zn")
```
Note that without the minmax option, outlier observations such as the one in the Alluvial Fan data, are shown individually.
```{r bxplot3,fig.width=6.5,fig.height=4,fig.align='center'}
cboxplot(CuZn$Zn,CuZn$ZnCen,CuZn$Zone,LOG=TRUE,show = TRUE, minmax = TRUE,
Xlab="Zone",Ylab="Zn")
```
The show=TRUE option models the portion of each group's data below the highest detection limit (the lines in gray) using ROS.
### Scatterplots
*Data: TCE concentrations in ground water*
```{r scatplot1,fig.width=6.5,fig.height=4,fig.align='center'}
data(TCEReg); # Data from the NADA package
cenxyplot(TCEReg$PopDensity, 1-TCEReg$PopAbv1, TCEReg$TCEConc, TCEReg$TCECen)
```
```{r scatplot2,fig.width=6.5,fig.height=4,fig.align='center'}
cenxyplot(TCEReg$PopDensity, 1-TCEReg$PopAbv1, TCEReg$TCEConc, TCEReg$TCECen,
xlab="Population Denisty",ylab="TCE Concentration, in ug/L")
```
```{r scatplot3,fig.width=6.5,fig.height=4,fig.align='center'}
cenxyplot(TCEReg$PopDensity, 1-TCEReg$PopAbv1, TCEReg$TCEConc, TCEReg$TCECen,
xlab="Population Denisty",ylab="TCE Concentration, in ug/L",
main = "Your Title Here", log ="y")
```
### Cumulative distribution functions (CDFs)
*Data: Zinc, Pyrene*
```{r cdf1,fig.width=6.5,fig.height=4,fig.align='center'}
# Data already loaded
cen_ecdf(CuZn$Zn, CuZn$ZnCen)
```
```{r cdf2,fig.width=6.5,fig.height=4,fig.align='center'}
cen_ecdf(CuZn$Zn, CuZn$ZnCen,CuZn$Zone,
Ylab="Zinc Concentration, in ug/L")
```
#### Compare CDFs
```{r cdf comp1,fig.width=6.5,fig.height=4,fig.align='center'}
data(ShePyrene); # From the NADA package
cenCompareCdfs(ShePyrene$Pyrene,ShePyrene$PyreneCen)
```
```{r cdf comp2,fig.width=6.5,fig.height=4,fig.align='center'}
cenCompareCdfs(ShePyrene$Pyrene,ShePyrene$PyreneCen,dist3 = "weibull")
```
### Probability (Q-Q) Plots: Pyrene data
*Data: Pyrene*
```{r QQ1,fig.width=6.5,fig.height=4,fig.align='center'}
cenQQ(ShePyrene$Pyrene,ShePyrene$PyreneCen)
```
#### Compare QQs
```{r QQ2,fig.width=7,fig.height=6,fig.align='center'}
cenCompareQQ(ShePyrene$Pyrene,ShePyrene$PyreneCen,Yname="Pyrene",cex=0.75)
```
## Estimate Descriptive Statistics
### Exploring the data
In R, the summary command is used to briefly describe the characteristics of the data. In
the NADA for R package, the censummary command fulfills the same role for censored
data:
```{r sum}
censummary(ShePyrene$Pyrene,ShePyrene$PyreneCen)
```
There are 11 nondetects located at 8 different detection limits. The probabilities of being less than or equal to the detection limit value is (1-pexceed), one minus the exceedance probability. So the limit at a concentration of 28 is at the (1-0.964), or the 3.6th percentile of the data. And (1-0.179) or 82.1% of the observations are below the highest detection limit of 174.
I’ll demonstrate how to compute MLE, K-M and ROS statistics using both the NADA and EnvStats packages.
### Maximum Likelihood Estimation (MLE)
The cenmle command in the NADA package assumes by default that data follow a *lognormal* distribution. Other distributions may be specified as options. We will use the *lognormal* because it was the best-fitting distribution, as seen in the Plotting Data exercise. The results have been stored as an object `(Pyr.mle.nada`, below) and by typing the object name you get the output.
```{r}
Pyr.mle.nada <- with(ShePyrene,
cenmle(Pyrene,PyreneCen))
Pyr.mle.nada
```
The EnvStats package provides different commands for each distribution chosen. As with the plots, *“lnorm”* indicates a *lognormal* distribution, “norm” a normal distribution, and “gamma” a gamma distribution. These come after the *“e”* in the command name. The *“Alt”* in the command tells `EnvStats` to report back the *lognormal* results not in log units, but transformed back into original units. The output is much more detailed than in the `NADA` package. In this example, options for computing two-sided confidence intervals of the mean are specified, which we’ll discuss in the next section of the vignette.
```{r}
Pyr.mle <- with(ShePyrene,
elnormAltCensored(Pyrene, PyreneCen,
ci=TRUE, ci.method ="bootstrap",
n.bootstraps = 5000))
print(Pyr.mle)
```
Using the print statement after storing the output in an object (`Pyr.mle` was used here) produces the table type output shown above. Without the print statement, just typing the object name, the output is generic and not ready to be pasted into a results document.
### Kaplan-Meier
The cenfit function in the `NADA` package has a slightly incorrect detail in its computation of the mean. Here it is, but remember that this issue generally makes the computed mean slightly too high.
```{r}
Pyr.km.nada <- with(ShePyrene,
cenfit(Pyrene, PyreneCen))
Pyr.km.nada
```
You should use the `EnvStats` command `enparCensored` instead for Kaplan-Meier, until
this issue in the `NADA` package is corrected. The `EnvStats` command uses “npar” for nonparametric to produce the Kaplan-Meier estimates.
```{r}
Pyr.km <- with(ShePyrene,
enparCensored(Pyrene, PyreneCen,
ci=TRUE, ci.method ="bootstrap",
n.bootstraps = 5000))
print(Pyr.km)
```
Note that as with all bootstrap estimates the confidence intervals above will differ slightly from your results.
### Regression on Order Statistics (ROS)
The `cenros` command in the `NADA` package constructs ROS models. The default model fits the data to a *lognormal* distribution. A Q-Q plot is drawn by the plot command using the ROS model. The cenros function will not take data with embedded NA values – manually delete them first or use the `elnormAltCensored` function as in the next section.
```{r, fig.width=6.5,fig.height=4,fig.align='center',fig.cap="Lognormal probability of pyrene data"}
Pyr.ROS.nada <- with(ShePyrene,
cenros(Pyrene, PyreneCen))
mean(Pyr.ROS.nada)
sd(Pyr.ROS.nada)
quantile(Pyr.ROS.nada)
plot(Pyr.ROS.nada)
```
The `EnvStats` command is again `elnormAltCensored`, but here with the “rROS” option to compute ROS. In that case the *lognormal* assumption is only for the nondetect data. It also produces confidence intervals for the ROS mean by bootstrapping, making it very useful.
```{r}
Pyr.ROS <- with(ShePyrene,
elnormAltCensored(Pyrene, PyreneCen, method="rROS",
ci=TRUE, ci.method ="bootstrap",
n.bootstraps = 5000))
print(Pyr.ROS)
```
### All at once
Descriptive stats for all three methods, again for the default lognormal distribution, can quickly be produced using the censtats command of the NADA package: Unfortunately this `NADA` package command also cannot currently incorporate `NA` values, so remove them prior to running the command.
```{r}
with(ShePyrene,censtats(Pyrene, PyreneCen))
```
K-M and ROS use the high outlier data value to estimate the mean. MLE uses the lognormal model, whose value at that percentile is lower and therefore the MLE estimate of the mean for this dataset is lower. And again, the K-M mean computed in this `NADA` package function is slightly biased high.
## Interval Estimates
Several of the commands to obtain confidence intervals are identical to what we was done in the Estimating Descriptive Statistics section Prediction. and tolerance intervals are new.
### Confidene Intervals
#### Kaplan-Meier
A confidence interval around the KM mean is computed using the enparCensored command. Since K-M is a nonparametric method, the bootstrap method for computing a CI is recommended, as it too requires no assumed distribution. Note that the default CI method is a t-interval, which requires that the distribution of possible estimates of the mean is a normal distribution in order for this confidence interval to be valid. When the
sample size is around 70+ this may be a reasonable assumption. For this example it is not. Bootstrap intervals work fine with large and smaller data, say 20 observations and above. First the bootstrap:
```{r}
## from above
print(Pyr.km)
```
Then the default normal assumption (basically, a t-interval using the K-M estimates of mean and standard deviation):
```{r}
Pyr.km2 <- with(ShePyrene,enparCensored(Pyrene,PyreneCen, ci=TRUE))
print(Pyr.km2)
```
This t-interval (Normal Approximation) LCL goes down considerably lower (66.5) than the BCa bootstrap interval (98.3) because the t-interval must be symmetric, and the upper end is approx. 100 ug/L above the mean, so the LCL must be 100 below the mean. The data don't warrant that low of an interval as they are asymmetric, and the bootstrap LCL picks up on that information.
#### MLE
Computing the mean of an cenmle object also gives its confidence interval:
```{r}
pymle <- with(ShePyrene,cenmle(Pyrene, PyreneCen,conf.int=0.95))
mean(pymle)
```
These assume the default *lognormal* distribution. Change the `conf.int=` value to get an interval with something other than the default 0.95 confidence coefficient. To get the more typical normal distribution interval, use the `dist="gaussian"` option.
```{r}
pymlenorm <- with(ShePyrene,cenmle(Pyrene, PyreneCen, dist="gaussian"))
mean(pymlenorm)
```
A better method for computing confidence intervals and bounds for skewed data would be bootstrapping. This is the option we used in the Descriptive Statistics exercise above. For the *lognormal* MLE method:
```{r}
pyr.lnorm <- with(ShePyrene,
elnormAltCensored(Pyrene, PyreneCen,
ci=TRUE, ci.method ="bootstrap",
n.bootstraps = 5000))
print(pyr.lnorm)
```
#### ROS
The `cenros` command in `NADA` does not compute confidence intervals for the mean. Use the `EnvStats` command `elnormAltCensored` as done previously in the Descriptive Statistics exercise to bootstrap a confidence interval for the ROS method.
```{r}
# from above
print(Pyr.ROS)
```
Generally, I recommend using a bootstrap estimate when there is sufficient data, which there are here, as theoretical methods such as Cox are strongly dependent on the *lognormal* shape that often does not fit exactly. Remember, ROS assumes a distribution but only for the censored observations.
### Prediction Intervals
Intervals for computing the range of probable values for new observations when the data distribution has not changed can be quickly performed using MLE for three assumed distributions using the cenPredInt command:
```{r}
with(ShePyrene,cenPredInt(Pyrene, PyreneCen))
```
The default intervals here are for 1 new observation. That can be changed with the `newobs = `option. See NADA2 package. You can ignore the warnings about NAs in the dataset, they are deleted prior to computing the intervals, just as you would by hand if necessary.
The same function can be used to compute PIs using ROS, here for 2 new observations, which will make them wider than the intervals for 1 new observation above:
```{r}
with(ShePyrene,cenPredInt(Pyrene, PyreneCen,newobs =2, method = "rROS"))
```
The normal distribution is this example is not a good fit, as shown by the negative value of the lower 95% prediction intervals when assuming a normal distribution.
### Tolerance Intervals
Intervals for computing an upper bound on the true X% percentile, to state that we are 95% confident that no more than (1-X%) of data will exceed it, are computed using MLE by:
(Here for the 90th percentile – no more than 10% exceedances).
To compute a tolerance interval for three distributions, plus a graph showing BIC stats to determine which is best (lowest BIC is best), use the `cenTolInt` function in the `NADA2` package:
```{r, fig.width=6.5,fig.height=4,fig.align='center'}
with(ShePyrene,cenTolInt(Pyrene, PyreneCen, cover=0.9))
```
What’s inside this function? If you would like info on the commands this function uses, its below. If that’s not your thing, just use the function! Here’s how you would get the lognormal tolerance interval:
```{r}
example <- with(ShePyrene,
eqlnormCensored (Pyrene, PyreneCen, p=0.9,
ci=TRUE, ci.type ="upper"))
print(example)
```
Here’s how you would compute a gamma tolerance interval by first taking cube roots, then using those in a censored normal routine to get a tolerance interval on a percentile, then retransforming back to the original data scale by cubing the result:
```{r}
dat.gamma <- ShePyrene$Pyrene^(1/3)
obj.gamma <- eqnormCensored(dat.gamma, ShePyrene$PyreneCen, p=0.9,
ci=TRUE, ci.type ="upper")
pct.gamma <- obj.gamma$quantiles^3 # the 90th percentile in orig units
pct.gamma
ti.gamma <- (obj.gamma$interval$limits[2])^3 # the upper tol limit in orig units
ti.gamma
```
This agrees with the output of the `cenTolInt` command used above, where the results for a gamma distribution are printed.
## Matched Pair Tests and Comparing Data to Standards
### Compare Data to a Standard Using a Matched Pair Test
**Example 1:** Use the `cen_paired` function to determine if arsenic concentrations in
groundwater exceed the drinking water standard of 10 ug/L standard for the Example1 dataset.
```{r, fig.width=6.5,fig.height=5,fig.align='center'}
data(Example1) # From NADA2 package
head(Example1,5L)
with(Example1,cen_paired(Arsenic, NDisTRUE, 10, alt = "greater"))
```
The mean arsenic concentration does not exceed 10 ug/L.
### Test for Differences in Two Paired Columns of Data
**Example 2:** Test whether atrazine concentrations were the same in June versus September groundwaters on the same dates in a variety of wells (rows – paired data). Test both for differences in the mean as well as differences in the cdfs and the medians.
```{r, fig.width=6.5,fig.height=5,fig.align='center'}
data(Atra); # From NADA package
head(Atra,5L)
with(Atra,cen_paired(June, JuneCen, Sept, SeptCen))
```
The p-value is well above 0.05. Do not reject that the mean difference in concentration for the two months could be 0.
```{r}
# test for the median difference = 0 using the sign test.
with(Atra,cen_signtest(June, JuneCen, Sept, SeptCen))
```
Because it is important to correct for the numbers of tied values within a pair, the p-value of 0.089 results in the conclusion to not reject that the median difference in concentration between the two months could be 0.
```{r}
# test for a difference in the cdfs of the two months using the signed-rank
with(Atra,cen_signedranktest(June, JuneCen, Sept, SeptCen))
```
The signed-rank test has more power to see differences than did the sign test. It also is comparing the cdfs, the entire set of percentiles, between the two months. It finds a difference because the upper end of the distribution is quite a bit higher in the Sept data.
### Comparing Data to Standards Using an Upper Confidence Limit
Using the Example 1 data, compute the UCL95 for censored data.
**Step 1.** Sample size. There are 21 observations. Since it is on the borderline for deciding whether to use a distributional or nonparametric method, both will be demonstrated here.
**Step 2. Distributional Method**
Draw the boxplot for "censored data" (data with nondetects).
```{r,fig.width=6.5,fig.height=4,fig.align='center'}
with(Example1,
cboxplot(Arsenic, NDisTRUE, Ylab="Arsenic Conc", show = TRUE))
```
Note that the highest detection limit is drawn as the horizontal dashed line at 4 ug/L. Everything below that includes values estimated using a lognormal ROS. Three "outliers" (not 'bad data') lie above the estimated whisker, showing that the data are skewed.
Decide which of three distributions best fits the data using the `cenCompareCdfs` command. Choose the distribution with the smallest BIC.
```{r,fig.width=6.5,fig.height=4,fig.align='center'}
with(Example1,
cenCompareCdfs (Arsenic, NDisTRUE, Yname = "Arsenic concentration in ug/L"))
```
The gamma distribution has the smallest BIC.
Note that the curve representing the normal distribution dips below zero (x=0) at about the 10th percentile. A distribution of concentrations with 10% negative numbers is not realistic, which results in a higher BIC statistic.
Use the best-fit distribution (gamma) from 2b to compute the UCL95.
```{r}
egam <- with(Example1,
egammaAltCensored(Arsenic, NDisTRUE,
ci=TRUE, ci.type = "upper",
ci.method = "normal.approx"))
print(egam)
```
Use the print statement to get the “table format” for the output from this EnvStats function. The UCL95 equals 2.57 assuming a gamma distribution. Because this is lower than the 10 ug/L standard, the null hypothesis of non-compliance is rejected, and the site from which these data came is found to be in compliance.
**Step 3. Nonparametric Method**
**Example 1:**
There are multiple detection limits for this arsenic data. Compute the Kaplan-Meier estimate of the mean and percentile bootstrap UCL95, the latter because of the high percent of nondetects (66.67%) in the data.
```{r}
arsenic.out <- with(Example1,
enparCensored(Arsenic, NDisTRUE,
ci=TRUE, ci.method="bootstrap", ci.type="upper",
n.bootstraps=5000))
print(arsenic.out)
```
The percentile bootstrap estimate of the UCL95 will be near to 2.53, with the output varying slightly each time a bootstrap estimate is run. This is essentially the same estimate as that for the gamma distribution, with the identical result – the site is found to be in compliance.
**Example 2:** Computation of a UCL95 for data with both detected and non-detected values, DL unknown.
Data: Methyl Isobutyl Ketone (MIBK) in air above a medium-sized US city.
```{r}
data(Example2)
```
*A: Computation of the mean and UCL95*
The MIBK concentrations are given as reported in column 1 -- no detection limit was provided. Nondetects were designated only as ND. The lowest detected value in the data equals 0.1229. Assuming all ND values are lower than this, all NDs were changed to <0.1229 as shown in the MIBK and MIBKcen columns.
This results in only one reporting limit in the data, so the Kaplan-Meier estimate will be biased a bit high. Instead, use the robust ROS method with bootstrapping:
```{r}
mibk.ucl95 <- with(Example2,
elnormAltCensored (MIBK, MIBKcen, method = "rROS",
ci=TRUE, ci.method = "bootstrap",
ci.type = "upper", n.bootstraps = 5000))
print(mibk.ucl95)
```
The percentile bootstrap UCL95 based on the robust ROS mean equals 0.290 (the Kaplan-Meier estimate with the slight bias would have equaled 0.293). Remember that your bootstrap result will slightly differ from the one here. To decrease differences between runs, increase the number of bootstraps, say to 10,000.
*B: What if the detection limit had been known?*
If a reporting limit of 0.029 had been provided by the laboratory, the data would be as given in the MIBK2 and MIBK2cen columns. Using the same procedure gives slightly lower results for both mean and UCL95:
```{r}
mibk2.out <- with(Example2,
elnormAltCensored (MIBK2, MIBK2cen, method = "rROS",
ci=TRUE, ci.method = "bootstrap",
ci.type = "upper", n.bootstraps = 5000))
print(mibk2.out)
```
The percentile bootstrap UCL95 based on the robust ROS mean will be near to 0.290 with this known detection limit. It is always better to use the laboratory documented limit, but not having one should not stop the user from computing estimates using the lowest detected observation as the limit.
**Example 3:** Computation of the expected percent of observations exceeding a health advisory when all data are NDs. More details of this method are found in Chapter 8 of Statistics for Censored Environmental Data Using Minitab and R (Helsel, 2012).
```{r}
data(Example3)
```
All detection limits used are below the 10 ppb drinking water MCL for arsenic. Therefore we know that 0 out of 14 current observations exceed the MCL of 10 ppb. What is the range of percent of observations in the aquifer that might exceed the MCL (with 95% probability)? Use the binomial test command, entering the number of observations in the dataset that exceed the MCL (0) and the number of total observations (14). The `alternative =”less”` option states that this is a one-sided confidence interval – we are looking only for possible exceedances, nothing on the low end.
```{r}
binom.test(0,14,alternative="less")
```
Most of what is returned concerns a test for whether the proportion of observations above the MCL differs from 50%, but this test is of no interest here. What is of interest is the confidence interval on the proportion of observations in the population that could be above the MCL, based on the 14 samples observed. The UCL95 of the proportion equals 0.192. Therefore we can say with 95% probability that there are no more than 19.2% of concentrations in the aquifer exceeding the MCL – we expect that there are fewer because the MCL of 10 is considerably above the highest detection limit of 4 ppb, and this interval is actually the probability of exceeding 4 ppb. Taking this conservative approach that the probability of values falling above 4 ppb is the same probability of falling above 10 ppb, the expected percent of samples at this location above the MCL of 10 ppb is no more than 19.2%. This range could be tightened by taking more samples, of course. For other questions that can be answered when all values are nondetects, see Chapter 8 in Helsel (2012).
## Two-Group Tests
The MLE version of a "t-test" for censored data uses an MLE regression with one X variable, a 0/1 group indicator. Assuming a normal distribution:
```{r,fig.width=6.5,fig.height=4,fig.align='center'}
with(CuZn,cen2means(Zn,ZnCen,Zone,LOG=FALSE))
```
No difference between group means can be seen. But do the residuals follow a normal distribution, as required? The corresponding Q-Q plot of regression residuals and corresponding Shapiro-Francia test show that the data do not follow a normal distribution.
MLE also builds a model of the two groups after estimating their mean and standard deviations. When assuming a normal distribution with data close to zero it is easy for the model to project data down below zero. Besides being unrealistic, this can lead to a false separation between the groups and p-values that are too low. To avoid this, either assume a lognormal distribution (the default, or use LOG=TRUE) or use a permutation test instead.
Use the cenperm2 function to perform a two-group permutation test. This avoids an assumption of a normal distribution while testing for differences in means:
```{r,fig.width=6.5,fig.height=4,fig.align='center'}
with(CuZn,cenperm2(Zn,ZnCen,Zone))
```
The two groups do not have significantly different means. These are p-values that you can believe, as they do not depend on the normal assumption and do not project data values below zero.
Using cen2means to perform the test using the default lognormal distribution, the difference in geometric means (not arithmetic means) is performed below.
```{r,fig.width=6.5,fig.height=4,fig.align='center'}
with(CuZn,cen2means(Zn,ZnCen,Zone))
```
No significant difference between geometric means. And the corresponding Q-Q plot of regression residuals show that the data do not follow a lognormal distribution either (one large outlier), but it is the better fit of the two distributions. A permutation test using cenperm2 after computing the logarithms would be a better test for differences in geometric means.
"A nonparametric test between the two groups will test differences between their cdfs (often considered as a test for difference in medians) instead of testing the difference between means. The Peto-Peto test is run using the cen1way function. It reports the Kaplan-Meier medians in each of the groups:
```{r,fig.width=6.5,fig.height=4,fig.align='center'}
with(CuZn,cen1way(Zn,ZnCen,Zone))
```
The two group medians (10 vs 17) are found different at p = 0.0228, without assuming normality or substituting anything for the nondetects censored at multiple (in this case, two) reporting limits. The cdfs for the two groups are also shown, drawn using the cen_ecdf function.
```{r,fig.width=6.5,fig.height=4,fig.align='center'}
with(CuZn,cen_ecdf (Zn,ZnCen,Zone,
Ylab="Zinc Concentration, in ug/L"))
```
Above approximately the 40th percentile or so, the Basin Trough (dashed line) data are higher than (to the right of) the Alluvial Fan data. Boxplots might show the group differences better than do the cdfs, especially for non-statisticians.
```{r,fig.width=6.5,fig.height=4,fig.align='center'}
with(CuZn,cboxplot(Zn,ZnCen,Zone,
Ylab="Zinc Concentration, in ug/L"))
```
Now for the two simpler tests, where all values below the highest DL must be re-censored.
Contingency tables are a test to determine whether the proportions of data in categories are the same in two or more groups. With nondetects the cutoff level is again the highest reporting limit. Use the `TCE2` dataset with the highest of 4 DLs at 5 ug/L. The test determines whether the proportions of data above versus below the cutoff are the same in each group. It differs from a rank-sum in that all data above the limit are simply in the same ‘Above’ group rather than ranked individually. This loses some information as compared to the rank-sum test. Density is a text variable (a factor) representing the groups, while Below5Cens has a 1 for data below 5 ug/L and a 0 for data at or above 5. First we combine them into a matrix using the ftable command, then compute a ‘cross-tabulation’ with the `xtabs(...)` command that the `chisq.test(...)` command expects. Finally the chisquare test is computed. Its null hypothesis is that there is no difference in the percent of data above the cutoff of 5 ug/L in the two groups.
```{r}
data(TCE2)
with(TCE2,ftable(Density~Below5Cens))
tab=with(TCE2,xtabs(~Below5Cens+Density))
chisq.test(tab)
```
The contingency table finds a difference in the proportions. It is less powerful than the rank-sum test (which will also find a difference) when there are data that can be ranked above the highest reporting limit.
The nonparametric Wilcoxon rank-sum test can be calculated using the `wilcox.test(...)` command. First you must have or create a column that contains the concentrations for all detected values at the maxDL and above, plus a single number (I use -1) below the max DL for all values below the maxDL.
```{r}
TCE2$Below5[TCE2$Below5Cens== 1] <- -1 # all <5s are now a -1
wilcox.test(Below5~Density,TCE2)
```
The two-sided p-value is significant at 0.027. The smaller p-value than the contingency table test reflects the additional information in the individual values at and above 5 that the rank-sum tests uses. The slightly smaller p-value for the Peto-Peto test shows that it is the most appropriate test when there are multiple detection limits.
Is re-censoring at the highest DL and running the rank-sum test really better than the typical method of running a t-test on data with one-half DL subbed for nondetects?
```{r}
t.test(Half.DL~Density,TCE2)
```
No significant difference found. The arbitrariness of the substitution process, not to mention that this puts the same number in many times, and so likely decreases the standard deviation artificially, should get you to quickly use a nonparametric test instead.
## Three or more groups
The `Golden` data present lead concentrations in organs of herons after exposing them to lead. There are four dosage groups (“Dosage” or “Group” columns), zero plus three amounts of lead. The objective was to determine if feathers or another non-destructive part of the birds could be used as an environmental indicator, so it would not be necessary to sacrifice a bird in order to measure their lead concentrations. A censored boxplot of the Golden liver lead data shows that the third and fourth groups have generally higher values.
```{r,fig.width=6.5,fig.height=4,fig.align='center'}
data(Golden)
with(Golden,cboxplot(Liver, LiverCen, Dosage,
Ylab = "Lead concentrations in liver, in ppb"))
```
The skewness of the 0.05 group and the outlier of the 0.25 group indicates that logs should end up being the better set of units to use.
### Kruskal-Wallis test
First the groups can be compared using a Kruskal-Wallis test, setting all values below the highest detection limit of 0.04 as tied. Note that there are detected observations below 0.04, so either the data had a second and lower detection limit with no nondetects below it, or more likely were reported using “insider censoring” (see *Statistics for Censored Environmental Data Using Minitab and R* to find out what that is and the problem it causes).
**Step 1** - Create a variable -- call it Below04 -- with zeros (or -1, or any value below the highest DL) for all data below the highest DL of 0.04. Be careful not just to assign all 0.04s as nondetects, as some of these could be detected 0.04s. Instead, use two steps, the first to set all values BELOW 0.04 as a 0 (or -1), and the second to set all data marked as nondetects (which will include the <0.04 values) as a 0 (or -1). The result is a variable with an indicator (-1 recommended) for all data below the highest reporting limit, and original values for all detected data at and above the highest reporting limit. The logical operators < (less-than) and == (equal to) are used here.
```{r}
Golden$Below04 <- Golden$Liver
Golden$Below04[Golden$Liver<0.04] <- -1
Golden$Below04[Golden$LiverCen==TRUE] <- -1
```
**Step 2** - run the Kruskal-Wallis test
```{r}
kruskal.test(Below04~Dosage,Golden)
```
The result shows that there is a difference (p = 0.049) between group medians using this simple nonparametric test. An ANOVA on data after substituting one-half DL will not find a difference (trust me on this).
### Peto-Peto test
The nonparametric Peto-Peto test, the multi-DL nonparametric test, is computed using the `cen1way(...)` command:
```{r}
with(Golden,cen1way(Liver,LiverCen,Dosage))
```
The cdfs show that the higher two groups appear to differ in their percentiles as compared to the lower two groups.
```{r,fig.width=6.5,fig.height=4,fig.align='center'}
with(Golden,cen_ecdf(Liver,LiverCen,Dosage))
```
This is more easily seen by plotting the empirical cdfs in log units:
```{r,fig.width=6.5,fig.height=4,fig.align='center'}
Golden$lnLiver=log(Golden$Liver)
with(Golden,cen_ecdf(lnLiver,LiverCen,Dosage,
xlim = c(min(lnLiver), max(lnLiver)),
Ylab = "Natural Logs of Lead Concentrations in Liver"))
```
The 0.05 and 0.025 groups appear to have the higher liver lead concentrations (are further to the right) than the other two groups.
### The MLE "ANOVA"
For the parametric approach, use the `cenanova(...)` command to run a censored regression with the groups as ‘factor’ explanatory variables. By default, `cenanova` assumes the residuals follow a lognormal distribution, so use the associated Q-Q plot to see if the residuals in log units appear approximately like a normal distribution.
```{r}
with(Golden,cenanova(Liver,LiverCen,Dosage))
```
The overall test has a p-value of 0.014. Therefore the four group mean logarithms (geometric means) differ. p-values for the individual pairwise tests of differences show differences in two pairs of groups. The p-values of 0.0150 (0.05 versus 0 groups) and 0.0323 (0.05 versus 0.01 groups) show that the 0.05 group differs from the lowest two groups, but not from the 0.25 group. The residuals plot shows that log are a very good set of units to use, as their residuals are close to a normal distribution:
If instead you had wanted to test differences in the arithmetic means, use a permutation test as a normal distribution will not fit these data very well.
```{r}
with(Golden,cenpermanova(Liver,LiverCen,Dosage))
```
No significant difference in the means was found. This test did not assume a normal distribution, though it is still influenced by outliers because it evaluates means, which are influenced by outliers. The permutation test will not extrapolate data to values below zero as would MLE.
Note that the means of two groups, and so much of the data in the two groups, falls below zero when estimated by MLE assuming a normal distribution:
```{r}
with(Golden,cenanova(Liver,LiverCen,Dosage,LOG=FALSE))
```
The p-value of 0.0755 is too small because the group differences are exaggerated by pushing data down below 0. Given that the actual data cannot go below zero, the cenpermanova p-value of 0.14 is a much more realistic test result.
### Two-factor "ANOVA" by MLE
The above commands cenpermanova, cenanova and cen1way are all "one way" or one factor tests. They test for the effect of one and only one factor (group rather than continuous) explanatory variable. A two factor test is also available with the `cen2way` command in NADA2. It can test for the effect of each of the two factors and the interaction between the two factors, incorporating censored data by maximum likelihood. The function may also be used on data without nondetects, resulting in tests that are similar to but not necessarily identical with the more common least-squares analysis of variance software. A normal distribution of residuals is assumed by cen2way and so the default is to log-transform the Y variable. If you decide that other transformations are better than the log, transform (say using the square root of Y) and then set the `LOG=` option to `FALSE.` Here I use chromium concentrations in a stream and test for the effect of a trend with the year (Yr) variable and a seasonal effect with the (Season) variable. A better trend analysis procedure will be performed in the Trend Analysis section later. Note that Season is a text variable and Yr is numeric, so Yr will first be set to a factor using the as.factor command.
```{r}
# Load data
data(Gales_Creek)
Gales_Creek <- as.data.frame(Gales_Creek)
# set Year to factor by making a new variable.
Gales_Creek$Yr.f <- as.factor(Gales_Creek$Yr)
with(Gales_Creek,cen2way(TCr, CrND, Yr.f, Season))
```
The residuals plot and Shapiro-Francia test show that the log(Y) are reasonable units. Both Yr and Season are significant with small p-values. Their interaction is also significant with a p-value of 0.011. The best model, judged by its lowest AIC of 145.54, is that all three effects including the interaction, are significant. As mentioned in the next section on regression, the BIC statistic tends to not find variables significant that should be, based on simulation studies. Here BIC would toss out the interaction tern and state that only the two main effects are to be kept. BIC 'best models' tend to be too small and I generally recommend using AIC instead. As with regression, models should not be compared using AIC or R2 unless the Y scales are identical. Here we have used only log(Y) so numerical measures of quality are appropriate.
## Correlation and Regression
```{r}
data(Recon)
```
First test for high vifs by computing a standard regression equation that ignores the censoring indicator column, ignoring all results except for the vifs:
```{r}
vif(lm(AtraConc ~ Area + Applic + PctCorn + SoilGp + Temp + Precip + Dyplant + Pctl,Recon))
```
All of the variables appear uncorrelated with the others (all VIFs well below 10). Therefore the p-values obtained in regression should be reliable.
**Step 1** - Create the 8-X variable `data.frame` and run the regression with all variables. Decide which scale the Y variable should be used (no transformation, log, cube-root).
```{r}
recon.8 <- with(Recon,data.frame(Area, Applic, PctCorn, SoilGp, Temp, Precip, Dyplant, Pctl))
reg.recon.8 <- with(Recon,cencorreg(AtraConc, AtraCen, recon.8))
summary(reg.recon.8)
```
The Rescaled likelihood R is fairly high (0.82) and the AIC equals 804.4. The Q-Q plot (below) shows a fairly straight pattern of data and W = 0.966, so it would be difficult to find a better transformation of the Y variable than the log. Use log Y.
**Step 2** - Decide whether to transform the scale of each X variable.
Run the partplots procedure to see whether curvature in the Y-X relationship means that a transformation of the X variable should be taken.
```{r,fig.height=10}
layout(matrix(1:8,4,2))
with(Recon,partplots(AtraConc, AtraCen, recon.8,multiplot = F))
```
Only PctCorn showed an appreciable drop in AIC with a transformation. As a percentage it is bounded by 0 and 100 so it’s a little odd to do a transform. However the partial plot shows an increase in the percent of "filled circle" detected observations and so a general increase in atrazine with increasing PctCorn. The relationship may be nonlinear so I'll take the cube root of PctCorn.
```{r}
Recon$cbrtPctCorn <- recon.8$PctCorn^(1/3)
recon.8onecube <- cbind(recon.8[, -3], Recon$cbrtPctCorn)
reg.recon.8onecube = with(Recon,cencorreg(AtraConc, AtraCen, recon.8onecube))
```
```{r,fig.height=10}
layout(matrix(1:8,4,2))
with(Recon,partplots(AtraConc, AtraCen, recon.8onecube,multiplot = F))
```
No other variables indicate a further transformation is necessary after running partplots -- the cbrtPctCorn variable is already transformed so taking the log or cube root of the cube root would not make much sense.
**Step 3a** - Can we lower the AIC by dropping unimportant variables?
```{r}
summary(reg.recon.8onecube)
```
Both Precip and Soil Gp are very non-significant. Soil Group was a number looked up in a book at the county scale and so wasn’t really expected to provide much information (your insight as a scientist is needed when using regression). Delete either Precip or Soil Group and run the 7-variable model.
```{r}
# an alternative way to make a data.frame without using data.frame(...)
recon.7 <- Recon[,c("Area", "Applic", "cbrtPctCorn", "Temp", "Precip", "Dyplant", "Pctl")]
reg.recon.7 <- with(Recon,cencorreg(AtraConc, AtraCen, recon.7))
```
AIC has decreased so this is better than the 8-variable model. The residuals plot looks much the same -- this is expected in regression. The normality of residuals is primarily determined by the scale of the Y variable. Going to a six-variable model is an easy choice: Precip has a high p-value.
```{r}
recon.6 <- Recon[,c("Area", "Applic", "cbrtPctCorn", "Temp","Dyplant", "Pctl")]
reg.recon.6 <- with(Recon,cencorreg(AtraConc, AtraCen, recon.6))
summary(reg.recon.6)
```
AIC has decreased by 1.5. The next variable with a high p-value is Area. The order of deleting these 3 variables likely wouldn't matter and you'd get to this 5 variable model even if you dropped them in a different order.
```{r}
recon.5 <- Recon[,c("Applic", "cbrtPctCorn", "Temp","Dyplant", "Pctl")]
reg.recon.5 <- with(Recon,cencorreg(AtraConc, AtraCen, recon.5))
summary(reg.recon.5)
```
AIC has decreased by 0.7, so this is a very slightly better model. The only other insignificant variable is Applic, the application amounts of ag chemicals. These are known only on a county level so are estimated by cutting and pasting county boundaries with watershed boundaries. They also are voluntary amounts, and may not always provide accurate information to the Federal government. But probably the main evidence against the variable is its negative slope – we would expect more atrazine to wash off with more applications. So we drop this variable to see its effect.
```{r}
recon.4 <- Recon[,c( "cbrtPctCorn", "Temp","Dyplant", "Pctl")]
reg.recon.4 <- with(Recon,cencorreg(AtraConc, AtraCen, recon.4))
summary(reg.recon.4)
```
The AIC goes up but only a little. This would be the scientist’s choice to use either the 5-variable or the 4-variable model. I usually choose the larger model if all p-values are under 0.10 because AIC and similar metrics are known to choose too few variables. Here however the Applic p-value in the 5-varable model is 0.124 and so I'd drop it, choosing the 4-variable model. What is also behind my decision to drop Applic is that it is a crude measure of amount of pesticide applied (county level data cut and pasted) and cbrtPctCorn essentially measures the same thing. Use your knowledge of the data to make your decision.
**Step 3b** - Use the bestaic function to lower the AIC?
Starting with the full 8 variables, though PctCorn has been transformed to become cbrtPctCorn, run the bestaic function to see what models the computer selects:
```{r}
with(Recon,bestaic(AtraConc, AtraCen, recon.8onecube))
```
Many models are very similar in AIC, but the 'best' is the 5-variable model that we named `recon.5`, above. Second best was the four variable model we called `recon.4.` So by deleting sequentially we did get to the "best" models, but this is a lot quicker. As you see in the list there are several other models around an AIC of 798, and if it were less expensive to use the variables in one of these, it would be an excellent substitute for the mathematically lowest AIC model. For example the fourth model down uses Area instead of Applic, with all other variables the same. If `Applic` were expensive to collect, this model has an AIC only 0.9 units higher.
**Finding the best one-variable model.** To find the best 1-variable model (just to compare to the ATS equation), run the four possible 1-variable models using the variables from the 4-variable model.
```{r}
reg.recon.cbrtPctCorn <- with(Recon,cencorreg(AtraConc, AtraCen, cbrtPctCorn))
```
```{r}
reg.recon.Temp <- with(Recon,cencorreg(AtraConc, AtraCen, Temp))
```
```{r}
reg.recon.Dyplant <- with(Recon,cencorreg(AtraConc, AtraCen, Dyplant ))
```
```{r}
reg.recon.Pctl <- with(Recon,cencorreg(AtraConc, AtraCen, Pctl))
```
The Dyplant (days since planting) variable has the lowest AIC. Its residuals plot (above) shows a linear pattern pulled away from the normal theory line because of six high outliers. The negative slope is reasonable: as there are more days since planting of corn, and atrazine is only applied before planting, the more time it sits on the ground the lower the amounts available to be washed off into the stream. The outliers are probably high flows that would be accounted for by the Pctl variable if it were in the model. Here are the details for the Dyplant 1-variable model:
```{r}
summary(reg.recon.Dyplant)
```
### Plots in Multiple Regression
To plot the predicted line from a regression model, set `cencorreg`'s `pred.plot` option = `TRUE`.
```{r}
with(Recon,cencorreg(AtraConc, AtraCen, Dyplant,pred.plot=TRUE))
```
For the Dyplant only model the plot shows the relationship between Y and Dyplant, but ignores the influence of all the other variables that should be significant and in the model. It is too simple given that we know there are 4 more variables in the 'best' model that affect predictions which are not shown here. This isn't much help when a multiple regression model is the best model.
Plotting the partial plot of a multiple regression model provides a better picture of the effect of one explanatory variable on Y after 'subtracting off' the effect of all the other variables in the model. The partial plots of five pages earlier are an example. They are the best way to picture the effect of one X variable within the context of the multiple regression model. However they do not show what is actually predicted for the Y variable from the model.
A prediction plot shows predicted values for Y for the full multiple regression model using one X variable as the horizontal scale. The predictions will not look like a nice straight line or smooth curve unless there is only one variable in the model, as effects of the other explanatory variables will cause the predictions to move up or down in relation to the effect of the X variable chosen for the horizontal scale. The cencorreg function uses the first variable in the X data.frame as the one chosen for the horizontal axis on the plot. If Dyplant as the 'best' X variable is to be used for the horizontal axis, a new data frame with Dyplant as the first in the list must be created:
```{r}
recon.5.alt <- Recon[,c("Dyplant","Applic", "cbrtPctCorn", "Temp", "Pctl")]
plot.reg.recon.5 <- with(Recon,cencorreg(AtraConc, AtraCen, recon.5.alt,pred.plot = TRUE))
```
The pattern of predictions is much more jagged than the partial plot of `Dyplant`. But the predicted values also show the effects of nondetects and the other 4 variables in the model in addition to the effect of the one X variable used in the plot.
**The Nonparametric ATS line:**
Using Dyplant as the X variable,
```{r}
with(Recon,ATS(AtraConc, AtraCen, Dyplant))
```
Seeing this transformed back into the original units will look much better.
```{r}
with(Recon,ATS(AtraConc, AtraCen, Dyplant,retrans=TRUE))
```
For this dataset, the maximum likelihood and ATS slopes for Dyplant are very similar (different by 0.006). The intercepts are similar as well when you realize that a difference of 2.0 is small when concentrations go up to 100.
The plot of the relationship of atrazine to flow percentile (Pctl) shows a clear washoff effect at higher flows:
```{r}
with(Recon,ATS(AtraConc,AtraCen, Pctl,retrans=TRUE))
```
There is not yet (June 2022) a good nonparametric “multiple regression” method for censored data. There are ‘robust regression’ methods that perform nonparametric regression but I’ve never seen them applied to censored data.
## Trend Analysis for Censored Data
Load the Gales_Creek dataset:
```{r}
data(Gales_Creek)
```
### Nonparametric Methods
*ATS (no covariate or seasonal variation)*
We choose to use the original units (`LOG=FALSE`) because the data appear linear over time with one large outlier, and a nonparametric test will not be overly influenced by one outlier. Running the ATS function on concentration versus decimal time, we find strong evidence for a downtrend (p = 0.006):
```{r}
with(Gales_Creek,ATS(TCr,CrND,dectime,LOG=FALSE))
```
It isn’t easy to see on the plot, but the detection limits shown as dashed lines are higher before 2012 as opposed to after 2012. The methods of this section of the course work well with multiple detection limits in the data record.
*ATS on residuals from a smooth with a covariate*
Using the centrend function, we first smooth the chromium – streamflow relationship, and then test the residuals for trend:
```{r}
with(Gales_Creek,centrend(TCr,CrND,discharge,dectime))
```
There is no trend in chromium concentration once the effect of streamflow has been subtracted out. It appears that the evidence for a downtrend was due to a change in the flow regime over the time period. There is a strong relationship between flow and chromium concentrations.
*Seasonal Kendall test*
Perhaps there is a trend in either the dry season alone, ignoring the effects of high flows on the trend test? Perform the Seasonal Kendall test using the censeaken function and pay attention to the individual season results by plotting them using the `seaplots = TRUE` option.
```{r}
with(Gales_Creek,censeaken(dectime,TCr,CrND,Season,seaplots=TRUE))
```
There is an overall trend once the Seasonal Kendall test has removed all comparisons between values in different seasons. Also, the dry season has a p-value of 0.069. The prevailing wisdom in statistics in 2019 is to not get too rigid about an alpha of 0.05. A value of 0.069 is close to 0.05 and the trend in the dry season graph appears strong. I would report in this case that there is an overall downtrend and a downtrend in the dry season. The high flows in the wet season were preventing the non-seasonal `centrend` function from seeing the trend.
*Seasonal Kendall test with a covariate*
The centrendsea function will perform the seasonal Kendall test on the residuals from a GAM smooth of the original Y variable verses a covariate, like flow. This first computes the same covariate adjustment of the centrend function and using the residuals from the smooth, tests the "covariate-adjusted trend" using the Seasonal Kendall test. The variables to be entered are, in order, the Y variable to be tested for trend, the T/F indicator of censoring for the Y variable, the covariate, the time variable, and the season category variable.
```{r}
with(Gales_Creek,centrendsea(TCr,CrND,discharge,dectime,Season))
```
There is no trend in flow-adjusted chromium found for this site and dates. The slight decrease of the ATS line above, which includes the influence of the below detection limit observations as well as detected observations, is not significantly different from a zero slope.
### Parametric Method
*Simple Censored Regression*
Using the default log transformation of chromium because we know there is one large outlier lurking, the `cencorreg` function shows that the residuals are not a normal distribution, but the data appear quite straight except for the one high outlier. There is likely no better scale to work in – untransformed concentrations would be far worse. Without deleting the outlier (you should check it to see if there’s been an error, but you can’t because this isn’t your data!), do not delete the outlier without cause and work in the log units.
```{r}
with(Gales_Creek,cencorreg(TCr,CrND,dectime))
```
The regression p-value of 0.03 says that there is a trend. The slope of – 0.095 log units per year will be approximately a 10% decrease in chromium per year. But is this slope a good estimate, given that there appear to be a confounding effect of streamflow? So perform a censored multiple regression.
*Censored Multiple Regression*
Create a data frame of the two X variables, dectime and flow, and try again. This is a better model if flow explains a lot of the variation in concentration. If that’s the case the model’s AIC will be lower than the previous AIC of 163.48.
```{r}
timeflow <- with(Gales_Creek,data.frame(dectime, discharge))
with(Gales_Creek,cencorreg(TCr,CrND,timeflow))
```
The QQ plot looks great, and the residuals do not differ from a normal distribution. The AIC is considerably lower for the 2-variable model, so this model that accounts for flow variation should be used instead of the original model.
*Censored Multiple Regression with Seasonal Variables*
Sounds like a menu option (‘seasonal vegetables’), doesn’t it? Create the sin and cos function variables using `2*pi*dectime`, and add it to the stew. See if they add anything.
```{r}
sinT <- with(Gales_Creek, sin(2*pi*dectime))
cosT <- with(Gales_Creek, cos(2*pi*dectime))
timeflowseas <- with(Gales_Creek,data.frame(dectime, discharge))
timeflowseas <- cbind(timeflowseas,sinT,cosT)
with(Gales_Creek, cencorreg(TCr, CrND, timeflowseas))
```
The QQ plot looks good. The sin and cos model has a lower AIC (106.3 versus the 2-variable model’s 113.5) so this is the best model of the three. The slope of -0.098 per year still maps to around a 10% decrease in concentration per year.
## Logistic Regression
The ReconLogistic dataset presents atrazine concentrations at streams across the midwestern United States. There were multiple detection limits, and a health advisory of 1 ug/L. Several characteristics of the basin at the time of sampling, including streamflow, are also recorded.
```{r}
data(ReconLogistic)
head(ReconLogistic,3)
```
We will model the above/below 1 ug/L pattern using the `GT_1` variable. The primary assumption is that there is a linear relationship between the X variables and the log(odds). Start by checking VIFs for all 6 candidate variables. The `glm` command using the `family=binomial(logit)` link function produces the equation:
```{r}
glm.1 <- glm(GT_1~
APPLIC+
CORNpct+
SOILGP+
PRECIP+
DYPLANT+
FPCTL,
ReconLogistic,family=binomial(logit))
vif(glm.1)
```
There is no multicollinearity between the variables, as indicated by VIF values of less than 2. Therefore the reported p-values should be trustworthy.
```{r}
summary(glm.1)
```
To compute the overall test of whether this model is better than no model at all, the test that all slopes are zero, first compute the null model by using a 1 instead of any X variables in the right-hand side of the equation:
```{r}
glm.null <- glm(GT_1~1,ReconLogistic,family=binomial(logit))
anova(glm.null,glm.1,test="Chisq")
```
The test statistic often named G equals 312.1. Compared to a chi-squared distribution with 6 degrees of freedom, the difference in the number of X variables between the two models, this statistic has a p-value of 2 x 10^-16, and so is very significant. We conclude that there is information in this model for predicting atrazine occurrence above 1 and proceed to try and find the best model. The simplest way to have the computer tell you the best logistic regression model is to use the bestglm command. But let’s try manually first and see how we do. We need to see if any of the X variables need to be transformed.
```{r}
residualPlots(glm.1,type="deviance")
```
It is hard to judge residuals plots with logistic regression because there are separate groups of residuals for the Y = 0 and 1 data. The smooth for DYPLANT appears curved, but is not an improvement. If `log(Dyplant)` is substituted for `Dyplant` the AIC increases, so is not a better model.
Using `glm.1`, `Applic` had the highest p-value, so we’ll drop it and see if AIC goes below 259.9.
```{r}
glm.3 <- glm(GT_1~
CORNpct+
SOILGP+
PRECIP+
DYPLANT+
FPCTL,
ReconLogistic,family=binomial(logit))
summary(glm.3)
```
This 5-variable model is improved over `glm.1`. `SoilGp` remains insignificant so let’s drop that and see the effect on AIC.
```{r}
glm.4 <- glm(GT_1~
CORNpct+
PRECIP+
DYPLANT+
FPCTL,
ReconLogistic,family=binomial(logit))
summary(glm.4)
```
AIC is lower (improved). All terms are significant. We’ll settle on this as our final model. This is the model found ‘best’ using `bestglm` (note that `bestglm` requires the `data.frame` to have the 0/1 Y variable in the last column, which `Recon` does).
```{r,warning=FALSE}
bestglm(ReconLogistic,family = binomial(logit), IC = "AIC")
```
glm.4 has slopes with algebraic signs that make scientific sense, and all explanatory variables are significant at alpha = 0.05. The model can be compared to the original 6 variable model using either a partial test, or with the AIC. The partial test determines whether the two variables that were dropped add significantly to the explanatory power of the model, just as in multiple linear regression. The null hypothesis is that they do not; not rejecting the null hypothesis says to keep the simpler model.
```{r}
anova(glm.4, glm.1, test="Chisq")
```
We do not reject that the smaller model is just as good. Go with `glm.4`. Finally, if we want to get an r-squared or a Brier score, and get some plots of the relation between the log-odds and each X variable, compute the same 4-variable model using the `lrm` command from the `rms` package:
```{r}
Recon.frame = with(ReconLogistic, datadist(CORNpct, DYPLANT, FPCTL, TEMP, GT_1))
options(datadist = "Recon.frame")
lrm4 <- lrm(GT_1 ~ CORNpct + DYPLANT + FPCTL + TEMP, data = ReconLogistic)
lrm4
```
## Multivariate Methods for Censored Data
Symonds et al (2016) used microbial source tracking (MST) markers to detect fecal pollution in waters along the coast of Florida. Six MST markers are in the dataset `Markers` in interval-censored format, where (0 to MDL) indicate values below a limit of detection. Nonzero lower ends of the interval indicate either (MDL to QL) data or detected values above the QL. Also included is the US EPA total enterococci marker `Entero1A`, a general fecal pollution indicator.
1. Test whether the pattern of the six MST markers plus the `Entero1A` indicator differs among the five sites using `ANOSIM`.
2. Test whether there is a ‘trend’ (correlation) between the six MST markers versus the general fecal pollution indicator using the Mantel test for trend analysis using Kendall’s tau correlation between the matrices (multivariate nonparametric correlation).
***
*Solution*
1. Load the data and compute the ranks of the uscores; then compute the anosim test for group differences and illustrate the results with an MDS:
```{r,message = FALSE,results="hide"}
data(Markers)
head(Markers,3)
Mdat <- Markers[, -15] # remove the Site Name column
M.usc <- uscoresi(Mdat) # uscoresi drops rows with NAs (row 13 here)
M.euclid <- dist(M.usc)
Site <- Markers$Site_Name[-13] # delete the site entry for row 13 with NAs
M.anosim <- anosim(M.euclid, Site)
M.anosim
anosimPlot(M.anosim)
```
There is a difference between the five sites. To illustrate which sites appear different than others, draw an NMDS plot:
```{r,message = FALSE,results="hide"}
uMDS(M.usc, group = Site,
legend.pos = "topright",
title = "NMDS of rank(uscores) for markers + entero")
```
From the left-right axis (NMDS1) we see that all three inlets (Port of Miami, Port Everglades and Baker’s Haulover) are on the left side, while Miami Central, an ocean outfall site, is on the right side. That is the main contrast between sites. From the second axis (NMDS2) there are one or two samples within Sites that are ‘outliers’ towards the top as compared to others in that site. Some characteristic differs in those samples. Sample 11 compared to the rest of Miami_North and site 14 compared to the rest of Baker’s Haulover, for example. Second axes for NMDS aren’t always interpretable, but the first axis should be expected to show the main difference detected by the ANOSIM result.
If you’d like to draw the NMDS plot manually, perhaps to change some options from what is in the function, here is the R code that will draw the same plot as the `uMDS` function:
```{r,message = FALSE,results="hide"}
M.nmds <- metaMDS(M.euclid)
Site <- as.factor(Site)
gp.color <- as.integer(Site)
Mplot <- ordiplot(M.nmds, type="none", display = "sites",
main="NMDS of rank(uscores) for markers + entero")
points(Mplot, "sites", pch=19, col=gp.color)
text(Mplot, "sites", pos=4, cex = 0.8)
leg.col <- c(1: length(levels(Site)))
legend("topright", legend=levels(Site), bty="n", col = leg.col, text.col = leg.col, pch = 19)
```
2. Create two triangular distance matrices, one for the 6 MST markers and the second for the general fecal pollution indicator (entero1A) data. Then correlate the two matrices using the Mantel command. This is to see if there is a ‘trend’ in the MST marker pattern with increasing entero1A.
```{r}
Mdata <- Markers[-13,] # delete row with NAs
M6 <- Mdata[, -(13:15)] # only keep columns for the 6 markers
M6.usc <- uscoresi(M6)
M6.euclid <- dist(M6.usc) # matrix for the 6 MST markers
ent <- Mdata[, 13:14] # the entero1A data
ent.usc<-uscoresi(ent)
ent.euclid<-dist(ent.usc) # matrix for the entero1A data
M6.Ktau <- mantel(ent.euclid, M6.euclid, method="kendall", permutations = 9999)
M6.Ktau
```
There is a significant positive correlation between the MST marker values and the entero1A values. One way to picture the correlation is to plot their distance matrix entries against one another. The x,y pairs are those in the triangular matrices that are being correlated using Kendall’s tau in the mantel test. If there was a positive correlation, for example, larger distances (relating small to large entero1A data) would also have larger M6 distances between MST marker concentrations. This is the pattern seen in the plot.
```{r}
Site <- as.factor(Markers$Site_Name)
gp.color <- as.numeric(Site) # assigns numbers to group names in Site_Name
plot(ent.euclid, M6.euclid, pch = 19,
col = gp.color,
main = "Correlation of distance matrix of rank(uscores)")
lws <- lowess(ent.euclid, M6.euclid)
lines(lws)
legend("bottomright", legend=levels(Site), bty="n", col = 1:nlevels(Site), pch = 19)
```
The plot, in addition to being colorful, shows the general increase of M6 distances as a function of entero1A distances, as shown by the lowess smooth. As entero1A values increase, the pattern of 6 marker concentrations also increases.
*Extra Credit:*
Which MST marker(s) have the highest correlation with the Entero1A values? This can be determined with the bioenv command in the vegan package. This command performs iterative mantel tests with subsets of the marker data. The highest correlation coefficient is the set of best predictors, which may be 1 up to all 6 of the markers.
```{r}
bioenv(ent.euclid, M6.usc, method = "kendall")
```
The highest correlation model is with one marker, HF183. This can be illustrated by plotting the entero1A uscores against the HF183 uscores. HF183 was the 4th of six columns within M6.
```{r}
plot(ent.usc, M6.usc[,4], pch = 19)
```
You can plot ent.usc against the other five MST markers if you wish to see which markers appear to be related to the Entero1A concentrations. My look at the plots – BacHum is also correlated, and HPyV has a binary style correlation – as Entero1A increases the probability of a high rather than low PHyV increases -- rather than a linear relationship. This is because HPyV values occur in just two categories, (0 to 249) and (250 to 499). Had all data been censored to <500 this relationship would not have been visible. See the plot below. HpyV was the 5th of the 6 marker columns. The other three markers (all animal MST markers) show little correlation with Entero1A.
```{r}
plot(ent.usc, M6.usc[,5], ylab = "HPyv rank of uscores")
```
## References
* Helsel, D.R., 2012. Statistics for censored environmental data using Minitab and R, 2nd edition. John Wiley and Sons, New York. 344 p.
* Millard, S.P., 2013. EnvStats: An R Package for Environmental Statistics (2nd Edition). Springer, New York.
* Singh, A., Maichle, R. and S. Lee, 2006. On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. Office of Research and Development, USEPA. EPA/600/R-06/022. 123 p.
* Symonds et al., 2016. Faecal pollution along the southeastern coast of Florida and insight into the use of pepper mild mottle virus as an indicator, Journ Applied Microbio 121, p. 1469-1481.
* Thode, H.C., 2002. Testing for Normality. Marcel Dekker, New York. 479 p.
## Appendix
### A "Flowchart" for Computation of UCL / EPC for Data with Nondetects
The following steps can guide your choice of a method to compare a UCL to a legal standard or health advisory. Methods depend on the number of observations (detects and nondetects) available.
1. Are there at least 20 observations?
NO: Assume the best fitting distribution to estimate the UCL. Go to step 2.
YES: Use a bootstrap (nonparametric estimation) method. Go to step 3.
2. Distributional Methods
a) Use a boxplot (the cboxplot command) to take a first look at the data. Decide whether or not outliers are retained or not based on the sampling strategy that was used and the objectives of the study. If data were collected using a probabilistic sample, an equal-area sample, or other representative sampling, keep all observations unless the portions of the area the outliers represent are to be excluded from the estimation study. If it is unclear what area each observation represents, investigate why the outliers occur and decide accordingly. Note that outliers will strongly affect the estimate of the UCL95, so this decision is critical. If they are part of what people have been exposed to, keep them. If they are mistakes or represent an area that is not to be considered, delete them. A statistical test cannot be used to make this decision for you.
b) Decide which of three distributions best fits the data using either the cenCompareQQ or cenCompareCdfs function. Of these three, select the distribution with either the highest PPCC or lowest BIC statistic. I prefer the BIC statistic because it better measures the misfit caused by the normal distribution going negative and not matching the 0 lower limit of the data.
* If the normal distribution was selected, check the low (left) end of the plot to see when the projected values drop below zero, indicating negative concentrations are being estimated. If this percentage is more than a trace, the normal distribution is not a good fit, even if the PPCC was high. You should choose the next-highest PPCC distribution instead, or the lowest BIC statistic, instead of the normal distribution.
c) Use the best-fit distribution from 2b to compute the UCL. The three commands, one for each of the three distributions, are:
```{r,eval=F}
enormCensored(Data, Cen,ci=TRUE, ci.type="upper", ci.method="normal.approx")
elnormAltCensored(Data, Cen, ci=TRUE, ci.type="upper", ci.method="bootstrap")
egammaAltCensored(Data, Cen, ci=TRUE, ci.type="upper", ci.method="bootstrap")
```
In each of these commands, the input column of concentrations plus detection limits is shown as “Data”, and the censoring indicator column (0/1 or FALSE/TRUE) as “Cen”. Use the appropriate variable names in your dataset instead.
3. Nonparametric Methods
a) If there are multiple detection limits, use the Kaplan-Meier (KM) estimate, computing a UCL95 with a BCA or percentile bootstrap estimate. Report the BCA UCL95 estimate for up to 40% NDs and the percentile bootstrap for greater than 40% censoring (Singh et al., 2006, page 114). Use 5000 bootstrap repetitions so that the estimate is stable from one time to the next.
```{r,eval=F}
enparCensored(Data, Cen, ci=TRUE, ci.method="bootstrap", ci.type="upper", n.bootstraps=5000)
```
b) If there is only one detection limit the KM method in essence substitutes the detection limit for all NDs. It will not project values below the lowest DL as that would require a distribution to show how the values are arranged below the lowest DL. This will bias upward the estimate of the mean. I recommend you bootstrap the lognormal ROS method (elnormAltCensored command) instead.
Singh et al. (2006) state that the UCL95 is better estimated using KM than by ROS methods for censored data, and based on this overall statement, recommend that KM be used in any situation with nondetects. I believe they haven't split out the one-DL situations separately and looked at the resulting bias. They simply state that they've shown that KM is always better. Statisticians disagree.