# Working with HydeNetwork Objects

## Introduction

Setting up Bayesian network models with HydeNet generally involves two components – specifying the network structure and specifying the (conditional) probability distribution of each node (given any parent nodes). Network structure is specified within the HydeNetwork() function, while node distributions can be set using either HydeNetwork(), setNode(), or setNodeModels(). Generally, HydeNetwork() would be used to simultaneously define the distributions for all the nodes in the network in a single function call, while setNode() and setNodeModels() are used to define the distribution of a specific node in an existing HydeNetwork object. Also, HydeNetwork() offers a relatively limited set of options in terms of the nature of the specified distributions, while the other two functions offer more flexibility.

HydeNetwork() can be called in three different ways. The first involves explicit specification by the user of the network structure (according to the formula syntax implemented in gRbase::dag()) but no specification of a training dataset or models to populate node distributions. This results in a “skeleton” HydeNetwork object. The second technique involves the same explicit specification of the network structure, but also passing a training dataset. In this case, conditional probability distributions for all the nodes in the network are estimated, using frequency tabulation, linear regression, logistic regression, or multinomial logistic regression, depending on the classes (and number of levels, for factors) of the variables in the data frame and the user-specified network structure.

The third way to invoke HydeNetwork() is to pass a “bag of models”, or more specifically a list argument containing one or more model objects as elements. In this method, the network structure is automatically built using the names of the response and explanatory variables within each of the models included in the list argument. Permissible model classes include xtabs, cpt, lm, glm, and multinom. Note that, in the HydeNet package, we have included the cpt model class. This stands for conditional probablity table, and is intended to facilitate the specification of categorical node distributions for which all parent nodes are also categorical. See help("cpt") for details and see below for examples. Note also that at this time the glm class only works with family="binomial"; defining a node’s distribution using other families is possible, however, using setNode().

In any of the above three cases, but especially in the first case (i.e., when HydeNetwork() is used with neither a training data nor a list of model objects to populate node distributions), the distributions for each node in the network can be manually specified, one-by-one. This is accomplished with either the setNode() function or the setNodeModels() function. As we discuss below, we have implemented a multitude of techniques for specifying node distributions with these functions.

We start by loading the package:

library(HydeNet)

In the above output, the required packages for HydeNet are listed. In addition, this package uses JAGS, which is stand-alone software for implementing Markov Chain Monte Carlo simulation. JAGS is called from R via a package called rjags. See help("rjags-package") for details.

[Link to top]

## Example – Pulmonary Embolism

The network we will study involves the diagnosis and treatment of pulmonary embolism, or PE (node pe). PE is a condition where the arteries carrying blood to the lungs get blocked, typically by a blood clot that dislodged from a vein in the leg. There are two commonly-used tests for diagnosing PE. One is a blood test called D-dimer (node d.dimer), and the other is pulmonary angiography (node angio). For each, the probability of positive and negative test values depends on the status of PE. In other words, the conditional distribution function for each test node can be defined using the sensitivity and specificity of each test. The D-dimer test also is affected by pregnancy (node pregnant), with higher false positive rates. Clinicians prior beliefs about the likelihood of PE are captured in a score (node wells). Since PE cannot directly be observed, the likelihood of a patient receiving treatment (node treat) depends on the test results. And the likelihood of survival through hospital discharge (node death) depends on both the status of the disease and whether or not the patient received treatment.

[Link to top]

## Creating “Skeleton” HydeNetwork Objects

A graphical representation of the PE network can be constructed based on an unpopulated HydeNetwork object (i.e., a “base” object for which node distributions have not yet been specified):

net <- HydeNetwork(~ wells
+ pe | wells
+ d.dimer | pregnant*pe
+ angio | pe
+ treat | d.dimer*angio
+ death | pe*treat)
##
## plot(net)

The HydeNetwork object we created, called net, is worth exploring:

## A Probabilistic Graphical Network
## Has data attached: No
##
## wells
## dnorm(mu = Unspecified, tau = Unspecified)
## : wells ~ 1
##
## pe | wells
## dnorm(mu = Unspecified, tau = Unspecified)
## : pe ~ wells
##
## d.dimer | pe * pregnant
## dnorm(mu = Unspecified, tau = Unspecified)
## : d.dimer ~ pe + pregnant
##
## pregnant
## dnorm(mu = Unspecified, tau = Unspecified)
## : pregnant ~ 1
##
## angio | pe
## dnorm(mu = Unspecified, tau = Unspecified)
## : angio ~ pe
##
## treat | d.dimer * angio
## dnorm(mu = Unspecified, tau = Unspecified)
## : treat ~ d.dimer + angio
##
## death | pe * treat
## dnorm(mu = Unspecified, tau = Unspecified)
## : death ~ pe + treat

Since we haven’t given HydeNetwork() information on conditional probability distributions for the nodes in the network, we have a skeleton object where each node is distributed as normal given its parent nodes. The parameters mu and tau are to this point unspecified (note: in JAGS, the mean and precision are specified for the normal distribution - the precision parameter $$\tau$$ is equal to $$1/\sigma^2$$).

[Link to top]

## Creating HydeNetwork Objects With a Training Dataset

Using HydeNetwork() with a training dataset implements the following default node-specific model classes, depending on the class of the node (in other words, the class of the variable in the training dataset), whether or not the node has parent nodes, and if so, the classes of the parent nodes:

Node Class Parents Model Type Function
factor with any number of levels None Tabulation stats::xtabs()
factor with any number of levels All of class factor Conditional Probabilility Table HydeNet::cpt()
Binary factor At least 1 numeric or integer Logistic Regression stats::glm(..., family="binomial")
factor with 3+ levels At least 1 numeric or integer Multinomial Logistic Regression nnet::multinom()
numeric or integer Ordinary Least Squares stats::lm()

The syntax for building a Bayesian network using training data is rather simple:

data(PE, package='HydeNet')
autoNet <- HydeNetwork(~ wells
+ pe | wells
+ d.dimer | pregnant*pe
+ angio | pe
+ treat | d.dimer*angio
+ death | pe*treat,
data = PE)
writeNetworkModel(autoNet, pretty=TRUE)
## model{
##    wells ~ dnorm(3.79, 0.4)
##    pe ~ dbern(ilogit(-3.9 + 0.58 * wells))
##    d.dimer ~ dnorm(210.24 + 68.38 * (pe == 1) + 29.29 * (pregnant == 2), 0)
##    pi.pregnant[1] <- 0.9; pi.pregnant[2] <- 0.1
## pregnant ~ dcat(pi.pregnant)
##    pi.angio <- cpt.angio[(pe+1), ]
##    angio ~ dcat(pi.angio)
##    treat ~ dbern(ilogit(-5.89 + 0.02 * d.dimer + 1.73 * angio))
##    pi.death <- cpt.death[(pe+1), (treat+1), ]
##    death ~ dcat(pi.death)
## }

We can see by the output that the models have all been populated, and verify that these are indeed the coefficients we obtain from the functions in the above table:

glm(treat ~ d.dimer+angio, data=PE, family="binomial")$coef ## (Intercept) d.dimer angioPositive ## -5.89315742 0.01993752 1.73353613 xtabs(~PE$pregnant) / nrow(PE)
## [1] 5
##
## $sd ## [1] 1.5 #### Multicategory Root Nodes Suppose instead that the Wells score was categorical in nature, with three values (e.g., low, medium and high). We can specify categorical distributions as follows: net <- setNode(net, wells, nodeType = "dcat", pi = vectorProbs(p = c(.3, .6, .1), wells) ) ## Validation has been ignored for parameters defined with character strings net$nodeType$wells ## [1] "dcat" net$nodeParams$wells ##$pi
## [1] "pi.wells[1] <- 0.3; pi.wells[2] <- 0.6; pi.wells[3] <- 0.1"

Note here that we have overwritten the node distribution within the object net to be categorical in nature.

The vectorProbs() function converts a probability vector into JAGS code, as seen above in the list element net$nodeParams$wells. This function will by default normalize probability vectors, so that counts can be directly fed into the model:

net <- setNode(net, wells,
nodeType = "dcat",
pi = vectorProbs(p = c(37, 162, 48), wells) )
## Validation has been ignored for parameters defined with character strings
net$nodeType$wells
## [1] "dcat"
net$nodeParams$wells
## $pi ## [1] "pi.wells[1] <- 0.149797570850202; pi.wells[2] <- 0.65587044534413; pi.wells[3] <- 0.194331983805668" We could have achieved the same by directly inserting the JAGS code into the pi parameter: net <- setNode(net, wells, nodeType = "dcat", pi = "pi.wells[1] <- 0.15; pi.wells[2] <- 0.66; pi.wells[3] <- 0.19") ## Validation has been ignored for parameters defined with character strings #### Other Univariate Distributions HydeNet supports all the statistical distributions supported by JAGS. A table of these distributions is stored in the jagsDists dataset: data(jagsDists, package='HydeNet') jagsDists[,c(1:3, 6:8)] DistName FnName FnNameR Parameters RParameter paramLimit Beta dbeta dbeta a shape1 > 0 Beta dbeta dbeta b shape2 > 0 Chi-square dchisqr dchisq k df > 0 Double exponential ddexp mu Double exponential ddexp tau > 0 Exponential dexp dexp lambda rate > 0 F df df n df1 > 0 F df df mu df2 > 0 Gamma dgamma dgamma r shape > 0 Gamma dgamma dgamma lambda rate > 0 Generalized gamma dgen.gamma r > 0 Generalized gamma dgen.gamma b > 0 Generalized gamma dgen.gamma lambda > 0 Logistic dlogis dlogis mu location Logistic dlogis dlogis tau scale > 0 Log-normal dlnorm dlnorm mu meanlog Log-normal dlnorm dlnorm tau sdlog > 0 Noncentral chi-square dnchisqr k > 0 Noncentral chi-square dnchisqr delta >= 0 Normal dnorm dnorm mu mean Normal dnorm dnorm tau sd >= 0 Pareto dpar alpha > 0 Pareto dpar alpha > c Student t dt dt mu Student t dt dt tau > 0 Student t dt dt k df > 0 Uniform dunif dunif a min < b Uniform dunif dunif b max > a Weibull dweib dweib nu shape > 0 Weibull dweib dweib lambda scale > 0 Beta Binomial dbetabin a > 0 Beta Binomial dbetabin b > 0 Beta Binomial dbetabin n > 0 Bernoulli dbern p prob 0 < p < 1 Binomial dbin dbin p prob 0 < p < 1 Binomial dbin dbin n size > 0 Categorical dcat pi pi > 0 Noncentral hypergeometric dhyper dhyper n1 > 0 Noncentral hypergeometric dhyper dhyper n2 > 0 Noncentral hypergeometric dhyper dhyper m1 0 < m1 < (n1 + n2) Noncentral hypergeometric dhyper dhyper psi Negative Binomial dnegbin p 0 < p < 1 Negative Binomial dnegbin r > 0 Poisson dpois dpois lambda lambda > 0 Deterministic determ define define So, to assign a Weibull distribution to a node XYZ, we would use the following code: net <- setNode(net, XYZ, nodeType = "dweib", shape=2, scale=5) Finally, note that there is built-in error handling when parameters are outside allowable limits: net <- setNode(net, d.dimer, nodeType = "dpois", lambda=-10) ## Error in tools::buildVignettes(dir = ".", tangle = TRUE): 1 assertions failed: ## * Please define lambda such that lambda > 0 (or use ## * validate=FALSE). [Link to top] ### Regression Equations #### Ordinary Least Squares (OLS) For OLS models, nodeType="dnorm" can be used. We use a regression equation to characterize the dependency of the node on its parents. We note again that normal distributions are specified using the mean and precision parameters, where the precision parameter is the inverse of the variance. setNode() supports the use of formula syntax to define a regression equation for a given node. This is achieved using the fromFormula() function with the nodeFormula parameter, as follows: net <- setNode(net, d.dimer, nodeType="dnorm", mean=fromFormula(), sd=sqrt(30), #sigma^2 = 30 nodeFormula = d.dimer ~ 210 + 29*pregnant + 68*pe) net$nodeType$d.dimer ## [1] "dnorm" net$nodeParams$d.dimer ##$mean
## [1] "fromFormula"
##
## $sd ## [1] 5.477226 net$nodeFormula$d.dimer ## d.dimer ~ 210 + 29 * pregnant + 68 * pe ## <environment: 0x10dfff8> Or, alternatively, one may directly specify JAGS code for the parameters as character strings. Below, we do this for mu: net <- setNode(net, d.dimer, nodeType="dnorm", mean="210 + 29*pregnant + 68*pe", sd = sqrt(30)) However, the model syntax is flexible, allowing for alternative distributions to be used if desired. For example, maybe the distribution of the residuals has heavy tails; here, the (non-standardized) Student’s t distribution could be used: net <- setNode(net, d.dimer, nodeType="dt", mean="210 + 29*pregnant + 68*pe", sd=sqrt(20), df=2) The decision of whether to give an R-style formula or JAGS code is a matter of preference. But when using R code, one needs to ensure that any functions used in the formula can be translated to JAGS code. A list of functions that can be translated between R and JAGS can be viewed by calling data(jagsFunctions, package='HydeNet') jagsFunctions jags_function r_function r_package abs abs base arccos acos base arccosh acosh base arcsin asin base arcsinh asinh base arctan atan base arctanh atanh base cos cos base cosh cosh base cloglog cloglog VGAM equals == base exp exp base icloglog ifelse ifelse base ilogit plogis base log log base logfact loggam logit qlogis base phi pnorm base pow ^ base probit probit VGAM round ceiling base sin sin base sinh sinh base sqrt sqrt base step >= 0 base tan tan base tanh tanh base trunc floor base #### Logistic Regression If the intercept and slope coefficients of a logistic regression model are known, one may define a Bernoulli-distributed node using the ilogit function in JAGS (inverse logit): equation <- "-6.3 + 0.02*d.dimer + 2.9*angio - 0.005*d.dimer*angio" net <- setNode(net, treat, nodeType="dbern", prob=paste("ilogit(", equation, ")"), validate=FALSE) [Link to top] ### Using R Model Objects Above, we showed how HydeNetwork() can be used with a list of model objects to populate both the graph and the corresponding node distributions. In a similar fashion, certain R model classes can be used to populate the distribution for individual nodes in an existing HydeNetwork object. This is achieved using the setNodeModels() function. Currently, setNodeModels() is compatible with the following model classes: xtabs, cpt, lm, glm, (family="binomial" only) and multinom. Above, we constructed a HydeNetwork object called bagNet for the PE network by passing a list of model objects. Suppose we wanted to modify one of the models and repopulate the network, e.g., by introducing an interaction term. This is achieved with the following code: bagNet$nodeType$d.dimer ## [1] "dnorm" bagNet$nodeParams$d.dimer ##$mu
## [1] " 210.24 + 68.38*(pe==1) + 29.29*(pregnant==1)"
##
## $tau ## [1] 0.0334725 bagNet$nodeFormula$d.dimer ## d.dimer ~ pe + pregnant new.DDimer.Model <- lm(d.dimer ~ pe * pregnant, data=PE) bagNet <- setNodeModels(bagNet, new.DDimer.Model) writeNetworkModel(bagNet, pretty=TRUE) ## model{ ## wells ~ dnorm( 3.79, 0.630504251834359) ## pe ~ dbern( ilogit(-3.9 + 0.58*wells)) ## d.dimer ~ dnorm( 210.03 + 69.52*(pe==1) + 31.42*(pregnant==1) + -11.51*(pe==1)*(pregnant==1), 0.0335041033258058) ## pi.pregnant[1] <- 0.9; pi.pregnant[2] <- 0.1 ## pregnant ~ dcat(pi.pregnant) ## pi.angio <- cpt.angio[(pe+1), ] ## angio ~ dcat(pi.angio) ## treat ~ dbern( ilogit(-5.89 + 0.02*d.dimer + 1.73*angio)) ## pi.death <- cpt.death[(pe+1), (treat+1), ] ## death ~ dcat(pi.death) ## } #### Warning About Limited Scope of writeJagsModel Methods Passing model objects to HydeNetwork objects, either using HydeNetwork.list() or setNodeModels(), is handled by invoking the writeJagsModel() methods. These methods accept the model object (e.g., an lm object) as input and populate a variety of list elements within the HydeNetwork object (e.g., $nodeFormula, $nodeFitter, $nodeFitterArgs, \$nodeParams, etc.). The core functionality of these methods is to use the R model object to write JAGS code implementing the probability distribution described by the model. This is a difficult feature to standardize.

Currently, only a limited set of model parameterizations are supported by the convenience functions HydeNetwork.list() and setNodeModels(). In situations where more complex model equations are to be specified for certain node(s), setNode() should be used instead of these functions as it allows more flexibility. Future versions of the package will allow for more flexibility in directly passing R model objects.

The supported parameterizations include the following:

• Main effects
• Two-way interactions
• Polynomial terms involving continuous/integer predictors

[Link to top]

#### Conditional Probability Tables (CPTs)

When a given node as well as all of its parent nodes are categorical (or binary) in nature, the conditional distribution of that node is also fully categorical. We have included two functions — cpt() and inputCPT() — which facilitate the process of populating the conditional distributions for such nodes.

Each of these two functions produce an object of class cpt, which is a k-dimensional array (with k equal to the number of parent nodes) with a specific structure: the last dimension corresponds to the child node and the array, when summed across this dimension, is equal to a (k-1)-dimensional array of ones. It therefore is an array containing conditional distributions of the child node for each combination of parent nodes.

The function cpt() will compute this array given an input dataset and a formula which represents the conditional probability structure. In the code below, the variable death is the child node and the variables pe and treat are the parent nodes.

h <- cpt(death ~ pe + treat, data=PE)

inputCPT() is similar, although instead of using an input dataset to estimate the conditional distributions, it runs through a dialogue to manually specify the conditional densities. This can be useful for small conditional probability tables, such as the conditional probability of a diagnostic test being positive given disease status:

h <- inputCPT(test ~ disease)
------------------------------------------------------------------
Enter Factor Levels for node 'test':

If this is a binary variable, enter '<yn>' as a shortcut.
When finished, enter '<z>'.
To repeat entry of the last inputted factor level, enter '<b>'.
To start over entirely, enter '<s>'
------------------------------------------------------------------
Level 1 of 'test':   ---
Level 2 of 'test':   +++
Level 3 of 'test':   <z>
------------------------------------------------------------------
Enter Factor Levels for node 'disease':

If this is a binary variable, enter '<yn>' as a shortcut.
When finished, enter '<z>'.
To repeat entry of the last inputted factor level, enter '<b>'.
To start over entirely, enter '<s>'
------------------------------------------------------------------
Level 1 of 'disease':   Healthy
Level 2 of 'disease':   Diseased
Level 3 of 'disease':   <z>
------------------------------------------------------------------
NOTE: parameter 'reduce' is set to TRUE in inputCPT().
Conditional probabilities Pr(test=--- | disease)
will be calculated as the complement of the
inputted probabilities Pr(test != --- | disease).
------------------------------------------------------------------
Enter the following conditional probabilities:
Use '<q>' to halt execution.
To go back one step and re-enter, enter '<b>'.
------------------------------------------------------------------
Pr(test=+++ | Healthy ):   0.23
Pr(test=+++ | Diseased):   0.85
print(h)
          test
disease     ---  +++
Healthy  0.77 0.23
Diseased 0.15 0.85
attr(,"model")
disease test   wt
1  Healthy  +++ 0.23
2 Diseased  +++ 0.85
3  Healthy  --- 0.77
4 Diseased  --- 0.15
attr(,"class")
[1] "cpt"   "array"

[Link to top]

## Deterministic Nodes

In many cases, the user may desire to specify nodes that are non-random in nature. For example, we might construct a network for the first roll of dice within a game of craps. In craps, if the “shooter” (the person rolling the dice) rolls a 2, 3, or 12, you immediately lose. If the “shooter” rolls a 7 or 11, you immediately win. Anything else and the “point” gets set (and then the shooter rolls again).

craps <- HydeNetwork(~ d1 + d2 + diceSum | d1*d2
+ firstRollOutcome | diceSum)

craps <- setNode(craps, d1, nodeType="dcat",
pi = vectorProbs(p = rep(1/6,6), d1),
validate = FALSE)
craps <- setNode(craps, d2, nodeType="dcat",
pi = vectorProbs(p = rep(1/6,6), d2),
validate = FALSE)

craps <- setNode(craps, diceSum, nodeType = "determ",
define = fromFormula(),
nodeFormula = diceSum ~ di1 + di2)

craps <- setNode(craps, firstRollOutcome, nodeType = "determ",
define = fromFormula(),
nodeFormula = firstRollOutcome ~
ifelse(diceSum < 4 | diceSum > 11, -1,
ifelse(diceSum == 7 | diceSum == 11, 1,0)))

## plot(craps)
writeNetworkModel(craps, pretty=TRUE)
## model{
##    pi.d1[1] <- 0.166666666666667; pi.d1[2] <- 0.166666666666667; pi.d1[3] <- 0.166666666666667; pi.d1[4] <- 0.166666666666667; pi.d1[5] <- 0.166666666666667; pi.d1[6] <- 0.166666666666667
##    d1 ~ dcat(pi.d1)
##    pi.d2[1] <- 0.166666666666667; pi.d2[2] <- 0.166666666666667; pi.d2[3] <- 0.166666666666667; pi.d2[4] <- 0.166666666666667; pi.d2[5] <- 0.166666666666667; pi.d2[6] <- 0.166666666666667
##    d2 ~ dcat(pi.d2)
##    diceSum <- di1 + di2
##    firstRollOutcome <- ifelse(diceSum < 4 | diceSum > 11, -1, ifelse(diceSum == 7 | diceSum == 11, 1, 0))
## }

The formulas follow the same rules as described above in the Regression Equations section.

[Link to top]

## Modifying the Graph Structure

Nodes and/or links may be added or removed from an existing HydeNetwork object, using an update method we have implemented for HydeNetwork objects. Syntactically, this function acts in a similar fashion to update.lm(), in that you add or subtract terms from the model equation. Suppose that a new diagnostic test for PE was invented and we wish to incorporate it into the PE network. We can achieve this by the following code:

net2 <- update(net, . ~ . + newTest | pe
+ treat | newTest
- pregnant)
## Warning in update.HydeNetwork(net, . ~ . + newTest | pe + treat | newTest - : The following nodes lost parents in the update--please redefine the node formula:
##     d.dimer: pregnant
##
##
## plot(net2)

The update() method for HydeNetwork objects processes terms in the given model equation sequentially. In the above example, the original object net did not contain a node called newTest. But there were nodes called pregnant, pe, and treat. The first term within the model equation (+ newTest | pe) specifies the addition of the node newTest as a child of node pe. The second term (+ treat | newTest) specifies the addition of a link from the now-existing node newTest into node treat. The third term (- pregnant) specifies the removal of node pregnant. Examining the network object, two important points are worth mentioning:

net2
## A Probabilistic Graphical Network
## Has data attached: No
##
## newTest | pe
## dnorm(mu = Unspecified, tau = Unspecified)
## : newTest ~ pe
##
## pe | wells
## dnorm(mu = Unspecified, tau = Unspecified)
## : pe ~ wells
##
## treat | newTest * angio * d.dimer
## dbern(prob = ilogit( -6.3 + 0.02*d.dimer + 2.9*angio - 0.005*d.dimer*angio ))
## : treat ~ d.dimer + angio
##
## angio | pe
## dnorm(mu = Unspecified, tau = Unspecified)
## : angio ~ pe
##
## d.dimer | pe
## dnorm(mean = fromFormula, sd = 5.47722557505166)
## : d.dimer ~ 210 + 29 * pregnant + 68 * pe
##
## death | pe * treat
## dnorm(mu = Unspecified, tau = Unspecified)
## : death ~ pe + treat
##
## wells
## dcat(pi = pi.wells[1] <- 0.15; pi.wells[2] <- 0.66; pi.wells[3] <- 0.19)
## : wells ~ 1

First, while the graph has changed – and now node treat has three parents – the model for node treat has not changed. The user must specify a new model (with either setNode() or setNodeModels()) to account for this new dependency if desired.

Second, a warning message indicates that a parent node (pregnant) has been removed. Since that node was involved in characterizing the distribution of its child node(s) (d.dimer), the function by default removes the distribution from all child nodes still existing in the network. The user then is required to use either setNode() or setNodeModels() to repopulate the distribution(s) for the affected node(s).

[Link to top]