Welcome to the CCI Package

Thanks for checking out the development version of the CCI (Computational Conditional Independence) package in R. The CCI package is made for computational testing of conditional independence. Conditional independence is written as Y _||_ X | Z, meaning that the random variable Y is independent of X given Z. Computational testing of conditional independence is a machine learning based test in which permutation and Monte Carlo Cross-Validation (MCCV) are used to build an empirical null distribution and estimate a test statistic.
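To make the idea concrete, here is a toy sketch of the logic in plain R. This is our own illustration using a linear model and RMSE, not the package's internals:

set.seed(1)
n <- 400
Z <- rnorm(n); X <- Z + rnorm(n); Y <- Z + rnorm(n) # Y _||_ X | Z holds here
d <- data.frame(Y, X, Z)
# Test-set RMSE from one random train/test split (one MCCV draw)
rmse <- function(f, dd) {
  tr <- sample(n, n / 2)
  fit <- lm(f, data = dd[tr, ])
  sqrt(mean((dd[-tr, "Y"] - predict(fit, dd[-tr, ]))^2))
}
# Null distribution: performance with X permuted; test statistic: X intact
null_dist <- replicate(60, rmse(Y ~ X + Z, transform(d, X = sample(X))))
stat <- rmse(Y ~ X + Z, d)
mean(null_dist <= stat) # empirical left-tail p-value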


In this README we will go through the available features of the CCI package and provide heuristics for how to use it. The CCI package is designed to be flexible and hopefully user-friendly, allowing you to test conditional independence in various data types and scenarios. Please report any bugs or issues you encounter, and feel free to contribute to the package on GitHub.

Installation

You can install the development version of CCI from GitHub:

install.packages("devtools")
devtools::install_github("khliland/CCI")
library(CCI)

1. Basic Usage

Data

The CCI-package comes with several built-in data generating functions to help you get started. These functions generate synthetic data sets that can be used to test conditional independence.
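For example, NormalData() generates the data set used in the first test below; we assume, as the formulas below suggest, that it contains the columns Y, X, Z1 and Z2:

set.seed(1985)
head(NormalData(400)) # the examples below use the columns Y, X, Z1 and Z2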

Testing Conditional Independence

Below is an example of the simplest line of code needed to test the hypothesis Y _||_ X | Z1, Z2 using the CCI package. The code generates a dataset with 400 observations in which Y is conditionally independent of X given Z1 and Z2. The main user interface function for CCI is CCI.test(), which needs at least a formula and a data frame as input to perform a test.

set.seed(1985)
data <- NormalData(400)
t <- CCI.test(formula = Y ~ X | Z1 + Z2, data = data)
summary(t)

Running summary(t) reports that we have performed a computational test of conditional independence using the random forest method (rf), testing the condition Y ~ X | Z1 + Z2. The performance metric used to build the null distribution and the test statistic is Root Mean Square Error (RMSE). The test statistic is 1.143737 and the p-value is 0.9344262, so we fail to reject the null hypothesis at a significance level of 0.05, which is correct. tail is set to "left", meaning that the further left the test statistic lies in the null distribution, the lower the p-value. The summary of the null distribution shows a mean of 1.064956 and a standard deviation of 0.04709121.

The CCI package automatically generates a histogram of the null distribution and the corresponding test statistic, which is useful for visualizing the result. You can display the plot with:

plot(t)

In the previous example we tested a true null hypothesis. For completeness, we will now test a false null hypothesis, where Y is not conditionally independent of X given Z1 alone.

summary(CCI.test(formula = Y ~ X | Z1, data = data, parametric = TRUE))

The last test rejects the null hypothesis of Y ~ X | Z1 at significance level 0.05, since the p-value is less than 0.05. We also added the argument parametric = TRUE, which means that the p-value is calculated assuming that the null distribution is Gaussian. The default is to calculate the p-value empirically, by comparing the test statistic directly to the null distribution. Using a parametric p-value is usually a good idea, since the empirically derived p-value is limited in how small it can be (roughly bounded below by 1/nperm).
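Reusing null_dist and stat from the toy sketch in the introduction (stand-ins for the package's internal values), the two p-value flavours can be sketched as:

p_emp <- mean(null_dist <= stat) # empirical: granularity limited by nperm
p_par <- pnorm(stat, mean = mean(null_dist), sd = sd(null_dist)) # Gaussian approximation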

2. Handling Different Data Types

One of the main motivations for having a computational test of conditional independence is to be able to test conditional independence on different data types, such as continuous, binary, and categorical data. Depending on the data type of Y in the formula Y ~ X | Z1 + Z2, you can set the metric argument to "RMSE" (default) or "Kappa".

In the example below, both Y and X are binary variables, meaning they only take on values 0 and 1.

set.seed(1985)
data <- BinaryData(500)
cci_test <- CCI.test(formula = Y ~ X | Z1 + Z2, data = data, metric = "Kappa")
summary(cci_test)
plot(cci_test)

If you look at the plot from the example above, the null distribution looks odd: it consists of two peaks, one at around -0.5 and one at 0.5. This is in general not a good sign, and indicates that you should try another method. The two main method options in CCI.test() are 'rf' and 'xgboost'. First, we change the method to XGBoost, like this:

set.seed(1985)
data <- BinaryData(500)
test_w_xgb <- CCI.test(formula = Y ~ X | Z1 + Z2, data = data, metric = "Kappa", method = "xgboost")
summary(test_w_xgb)
plot(test_w_xgb)

Voilà, now the null distribution looks much better, which is what we want. The p-value is 0.72, which is what we expect since Y and X are conditionally independent given Z1 and Z2.

The third available method is the support vector machine (svm). svm is fast, but not as robust as rf and xgboost.

set.seed(1985)
data <- BinaryData(500)
test_w_svm <- CCI.test(formula = Y ~ X | Z1 + Z2, data = data, metric = "Kappa", method = "svm")
summary(test_w_svm)
plot(test_w_svm)

When you are dealing with character variables representing categories, you must convert these variables into factor variables. CCI.test() will then automatically judge which performance metric to use, "RMSE" or "Kappa".

set.seed(1945)
data <- TrigData(500)
data$Y <- as.factor(data$Y) # Make sure Y is a factor variable
data$X <- as.factor(data$X) # Make sure X is a factor variable
factor_test <- CCI.test(formula = Y ~ X | Z2, data = data)
summary(factor_test)

Since Y is a factor variable, CCI.test() automatically selects "Kappa" as the performance metric in the test above.

For completeness, we also see how svm performs:

set.seed(2)
test_svm_factor <- CCI.test(formula = Y ~ X | Z2, data = data, method = "svm", nperm = 250)
summary(test_svm_factor)
plot(test_svm_factor)

All three methods successfully reject the null hypothesis, which is what we want. In the last example we set the argument nperm = 250, which means that the null distribution is generated using 250 Monte Carlo samples. The default nperm value is 60, which will usually be enough, but it does not hurt to increase the number of samples to get a more precise null distribution.

The last important argument in CCI.test() is the p argument, which controls the proportion of data used for training the model in each Monte Carlo sample; the default value is 0.5. In large datasets, you can set p to a lower value, like 0.1, to speed up the process and increase precision.

set.seed(1984)
data <- SineGaussian(10000, d = 0.3) # d = 0.3 breaks conditional independence Y _||_ X | Z
large_N_test <- CCI.test(formula = Y ~ X | Z, data = data, parametric = T, p = 0.05)
summary(large_N_test)

There is also a subsample argument, which can be set to a value between 0 and 1. This reduces the sample size used for testing in each iteration, speeding up testing. In the example below, even with 2 million observations testing is "fast", because we only use 0.1% of the data in each iteration.

set.seed(1984)
data <- SineGaussian(2000000, d = 0) # d = 0 means conditional independence Y _||_ X | Z
even_larger_N <- CCI.test(formula = Y ~ X | Z, data = data, parametric = T, nperm = 100, p = 0.25, subsample = 0.001)
summary(even_larger_N)

Finally, we show that you can pass arguments on to the machine learning algorithm used in CCI.test(). Here is an example of using the xgboost method with custom parameters.

set.seed(1066)
dat <- UniformNoise(N = 2000)
complex_test <- CCI.test(formula = Y ~ X | Z2,
                         data = dat,
                         method = "xgboost",
                         parametric = TRUE,
                         p = 0.25,
                         nthread = 5,
                         subsample = 0.7,
                         alpha = 0.3,
                         eta = 0.1,
                         max_depth = 6,
                         min_child_weight = 3,
                         colsample_bytree = 0.8,
                         gamma = 0.1,
                         nrounds = 100,
                         objective = "reg:pseudohubererror")
summary(complex_test)

The formula

As you might have guessed, the formula argument in CCI.test() gives the conditional independence statement to test. The condition Y _||_ X | Z1, Z2 is written as Y ~ X | Z1 + Z2 (alternatively as Y ~ X + Z1 + Z2; both are accepted). When Y is the dependent variable, Y is predicted while X is permuted, to see whether X improves prediction performance beyond Z1 and Z2. Naturally, the condition Y _||_ X | Z1, Z2 can also be tested with the formula X ~ Y | Z1 + Z2. So which way to choose? Generally it is best to predict the variable which can be predicted best. CCI.test() will automatically choose the direction with the best predictive performance if you set the argument choose_direction = TRUE, as shown below.
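For instance, these two calls test the same statement:

set.seed(1985)
dat <- NormalData(400)
summary(CCI.test(formula = Y ~ X | Z1 + Z2, data = dat))
summary(CCI.test(formula = Y ~ X + Z1 + Z2, data = dat))

And here is choose_direction in action: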

set.seed(1814)
data <- SineGaussianBiv(N = 1000, a = 1, d = 0.5)
summary(CCI.test(formula = Y ~ X | Z1 + Z2, method = 'rf', data = data, nrounds = 800, choose_direction = TRUE, parametric = T))
set.seed(1814)
data <- SineGaussianBiv(N = 1000, a = 1, d = 0)
summary(CCI.test(formula = Y ~ X | Z1 + Z2, method = 'rf', data = data, parametric = T, nrounds = 1000, choose_direction = T))

In any statistical test, it might be the case that we have insufficient power and therefore cannot rely on a single p-value. An extra level of analysis in CCI, as in any other statistical test, is to create quantile-quantile (qq) plots over p-values, to see if they approximately follow a uniform distribution. Here is how to do it.

set.seed(1985)
data <- NormalData(80)
CCI_obj <- CCI.test(formula = Y ~ X | Z1 + Z2, data = data, nperm = 200, parametric = T)
QQplot(CCI_obj) # P-values roughly follow the diagonal line
# Testing a false Null
CCI_obj <- CCI.test(formula = Y ~ X | Z2, data = data, nperm = 200, parametric = T)
QQplot(CCI_obj) # P-values do not follow diagonal line

Note that assessing a qq plot is not a statistical test, and each plot must be judged subjectively.

These examples show the basics of computational testing of conditional independence with the CCI-package.

Next, we will go through more of the functionality in CCI.test().

3. Testing CI in dagitty DAGs

As we have seen above, the absolute bare minimum arguments which need to be provided are formula and data. Instead of a formula, we can also test conditional independence statements directly from a dagitty DAG. For that we must use the dag argument in CCI.test(). The specific statement to test is determined by the dag_n argument (default 1). When dag is defined, we do not need to specify formula; if you do, the formula will overwrite the dag_n argument.

Here’s an example:

library(dagitty)

set.seed(1984)
data <- NonLinNormal(500)

dag <- dagitty('dag {
  Z1 -> X
  Z1 -> Y
  Z2 -> X
  Z2 -> Y
}')

result <- CCI.test(data = data,
                   dag = dag,
                   dag_n = 1,
                   parametric = T)
summary(result)

6. Using the mlfunc and metricfunc arguments

Advanced users can pass custom machine learning wrapper functions and performance metrics to CCI.test() through the mlfunc and metricfunc arguments, respectively. This lets you use any machine learning algorithm or performance metric of your choice, allowing for greater flexibility and customization in your conditional independence testing. The function passed through the mlfunc argument should take the following inputs:

- formula: The formula for the model.
- data: The dataset used for training and testing.
- train_indices: Indices for the training data.
- test_indices: Indices for the test data.
- ...: Additional arguments passed on to the underlying learner.

The function should return a numeric value representing the model’s performance.

Here's the general structure of a wrapper function in pseudocode:

my_CCI_wrapper <- function(formula, data, train_indices, test_indices,...) {
  train_data <- data[train_indices, ]
  test_data <- data[test_indices, ]
  model <- train_model(formula = formula, data = train_data, ...)
  predictions <- predict_model(model, newdata = test_data)
  actual <- get_true_values(formula, data = test_data)
  performance <- compute_metric(actual, predictions)
  return(performance)
}

Let's take a look at two examples. The first uses the nnet package to create a neural network wrapper function.

neuralnet_wrapper <- function(formula,
                              data,
                              train_indices,
                              test_indices,
                              ...) {
  # Fit a neural network on the training split
  model <- nnet::nnet(formula, data = data[train_indices, ], linout = TRUE, trace = FALSE, ...)
  # Predict on the test split and compute RMSE
  predictions <- predict(model, newdata = data[test_indices, ])
  actual <- data[test_indices, ][[all.vars(formula)[1]]]
  metric <- sqrt(mean((predictions - actual)^2))
  return(metric)
}

dat <- NonLinNormal(2000)
nnet_test <- CCI.test(formula = Y ~ X | Z1 + Z2, data = dat, tail = 'left', mlfunc = neuralnet_wrapper, nperm = 200, size = 10, decay = 0.1, maxit = 200)
summary(nnet_test)

In the second example, we build a more general wrapper around the caret package, so that any model supported by caret can be plugged in through a caret_method argument:

caret_wrapper <- function(formula,
                          data,
                          train_indices,
                          test_indices,
                          caret_method,
                          caret_data_type,
                          ...) {
  training_data <- data[train_indices, ]
  test_data <- data[test_indices, ]
  # No resampling inside caret: CCI.test() already handles the train/test splits
  ctrl <- caret::trainControl(method = "none")
  # Method-specific options (e.g. verbose) can be passed through ...
  model <- caret::train(formula,
                        data = training_data,
                        method = caret_method,
                        trControl = ctrl,
                        ...)
  predictions <- predict(model, newdata = test_data)
  actual <- test_data[[all.vars(formula)[1]]]
  if (caret_data_type == "continuous") {
    metric <- sqrt(mean((predictions - actual)^2)) # RMSE: lower is better
  } else if (caret_data_type %in% c("categorical", "binary")) {
    metric <- sum(predictions == actual) / length(actual) # accuracy: higher is better
  } else {
    stop("Unsupported data type for wrapper")
  }
  return(metric)
}
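The wrapper is not run above, so here is a minimal usage sketch. It assumes, as in the nnet example, that extra arguments to CCI.test() are forwarded to the wrapper; the choices caret_method = "glm" and tail = "left" (RMSE: lower is better) are illustrative, not package defaults.

set.seed(1)
dat <- NonLinNormal(1000)
caret_test <- CCI.test(formula = Y ~ X | Z1 + Z2,
                       data = dat,
                       mlfunc = caret_wrapper,
                       caret_method = "glm",       # illustrative caret model choice
                       caret_data_type = "continuous",
                       tail = "left")
summary(caret_test)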

7. Custom Performance Metric with metricfunc

The CCI package provides default performance metrics, such as RMSE for continuous outcomes and Kappa scores for binary and categorical outcomes. However, if you have a specific performance metric in mind, you can easily define your own using the metricfunc argument.

Your custom function should take the following inputs:

- data: The dataset used for training and testing.
- model: The fitted model object.
- test_indices: Indices for the test data.
- test_matrix: The test data in matrix form (as used by xgboost).

The output should be a numeric value representing the performance metric. Here's an example that calculates the R^2 metric using xgboost:

Rsquare_metric <- function(data, model, test_indices, test_matrix) {
  actual <- data[test_indices, ][['Y']] # 'Y' is hard-coded; change if your formula differs
  pred <- predict(model, newdata = test_matrix)
  sst <- sum((actual - mean(actual))^2)
  ssr <- sum((actual - pred)^2)
  metric <- 1 - (ssr / sst) # R^2: higher is better, hence tail = "right" below
  return(metric)
}
                  
set.seed(1914)
dat <- NonLinNormal(500)
CCI.test(formula = Y ~ X | Z2, data = dat, method = "xgboost", metricfunc = Rsquare_metric, tail = "right")

Important: When using a custom performance metric, you should also specify the tail argument:

- tail = "right": Use if higher metric values indicate better model performance.
- tail = "left": Use if lower metric values indicate better model performance.

10. More Examples

Example 1 (Boosted trees)

Here’s an example of using the CCI test with custom data:

PolyData <- function(N) {
  Z1 <- rnorm(N)
  Z2 <- rnorm(N)
  X <- numeric(N)
  Y <- numeric(N)

  for (i in 1:N) {
    X[i] <- ifelse(Z1[i]^2 + Z2[i]^2 > 2, 3,
                   ifelse(Z1[i]^2 + Z2[i] > 0.5, 2,
                          ifelse(Z1[i] + Z2[i]^2 > 0, 1, 0)))

    Y[i] <- ifelse(Z1[i]^3 + Z2[i] > 1, 3,
                   ifelse(Z1[i]^2 - Z2[i]^2 > 0, 2,
                          ifelse(Z1[i] - Z2[i]^3 > -1, 1, 0)))
  }

  return(data.frame(Z1, Z2, X, Y))
}

set.seed(100)
dat <- PolyData(931)
CCI.test(formula = X ~ Y + Z1, 
         data = dat, 
         method = "xgboost", 
         booster = "gbtree", 
         num_class = 4,
         max_depth = 6, 
         eta = 0.3, 
         subsample = 0.7,  
         colsample_bytree = 0.7)

In this example we use gradient boosted trees (booster = "gbtree") as implemented in the xgboost package.

Example 2 (Custom multi-class log loss with xgboost)

First we define a new data generating function.

Multinominal <- function(N, zeta = 1.5) {
  Z1 <- rnorm(N)
  Z2 <- rnorm(N)
  # X: three-level categorical variable depending on Z1 and Z2
  xb1 <- Z2 + zeta*Z1*Z2 + zeta*Z1
  xb2 <- Z2 - zeta*Z1
  xp1 <- 1/(1 + exp(xb1) + exp(xb2))
  xp2 <- exp(xb1)/(1 + exp(xb1) + exp(xb2))
  random <- runif(N, 0, 1)
  X <- ifelse(random < xp1, "C", ifelse(random < xp1 + xp2, "A", "B"))
  # Y: three-level categorical variable depending on Z1 and Z2
  yb1 <- zeta*Z1*Z2
  yb2 <- exp(Z2) + zeta*Z1
  yp1 <- 1/(1 + exp(yb1) + exp(yb2))
  yp2 <- exp(yb1)/(1 + exp(yb1) + exp(yb2))
  random <- runif(N, 0, 1)
  Y <- ifelse(random < yp1, "X", ifelse(random < yp1 + yp2, "Y", "Z"))
  return(data.frame(Z1, Z2, X, Y))
}

multi_class_log_loss <- function(data, model, test_indices, test_matrix) {
  eps <- 0.001
  pred <- predict(model, newdata = test_matrix)
  actual <- data[test_indices, ][['Y']] # 'Y' is hard-coded; change if you use a different formula
  actual_matrix <- model.matrix(~ factor(actual) - 1)
  num_classes <- length(unique(actual))
  # xgboost returns class probabilities as one long vector; reshape to a matrix
  pred_matrix <- matrix(pred, ncol = num_classes, byrow = TRUE)
  # Clip probabilities away from 0 and 1 to avoid infinite log loss
  pred_matrix <- pmax(pmin(pred_matrix, 1 - eps), eps)
  log_loss <- -sum(actual_matrix * log(pred_matrix)) / nrow(pred_matrix)
  return(log_loss)
}

data <- Multinominal(1000)
data$Y <- as.numeric(as.factor(data$Y)) - 1 # xgboost expects integer class labels 0, 1, 2
data$X <- as.numeric(as.factor(data$X)) - 1

log_loss_test <- CCI.test(formula = Y ~ X + Z1 + Z2,
                          data = data,
                          method = "xgboost",
                          nrounds = 120,
                          num_class = 3,
                          eta = 0.1,
                          lambda = 0.5,
                          alpha = 0.5,
                          metricfunc = multi_class_log_loss,
                          tail = "left")
summary(log_loss_test)

Example 3 (Time series data)

CCI can also be used to test whether two time series variables are conditionally independent. However, analyzing time series data with CCI requires a relatively large data set due to the complexity and dependencies inherent in time series analysis. In this example we test whether two time series X and Y, whose trends diverge, are conditionally independent given previous lags of X. We then do the same given lags of Y, to show that this second test is rejected (at significance level 0.05).

time_series <- function(n, phi1, phi2) {
  phi <- c(phi1, phi2)
  # Generate X as an AR(2) process
  X <- arima.sim(n = n, list(ar = phi))
  Y <- numeric(n)
  # Y depends on a time trend and on the first two lags of X
  for (t in 3:n) {
    Y[t] <- 0.01 * t + 1.2 * X[t-1] + 0.7 * X[t-2] + 0.5 * X[t-2]*X[t-1] + rnorm(1, sd = 1)
  }
  return(data.frame(Time = 1:n, X = X, Y = Y))
}
set.seed(1993) 
data <- time_series(n = 1500, phi1 = 0.9, phi2 = -0.5)

data$X_lag1 <- c(NA, data$X[-length(data$X)])  
data$X_lag2 <- c(NA, NA, data$X[-(length(data$X)-1):-(length(data$X))])

cor(data$Y, data$X) # Showing the correlation between Y and X, indicating that they are not independent

# Including Time as a conditioning variable to adjust for general trends.
data <- na.omit(data)
CCI.test(formula = X ~ Y | X_lag1 + Time, data = data, p = 0.5, nperm = 200, method = "xgboost")
CCI.test(formula = Y ~ X | X_lag1 + X_lag2 + Time, p = 0.5, nperm = 200, data = data, method = "xgboost")

data$Y_lag1 <- c(NA, data$Y[-length(data$Y)])  
data$Y_lag2 <- c(NA, NA, data$Y[-(length(data$Y)-1):-(length(data$Y))])
CCI.test(formula = Y ~ X | Y_lag1 + Y_lag2 + Time, p = 0.5, nperm = 200, data = data, method = "xgboost", parametric = T)