Parent adjustment distance
In this article, we present an implementation for computing the parent adjustment distance between two CPDAGs as proposed by Henckel et al. (2024b). CPDAGs are causal graphs made up of directed as well as undirected edges that are learned by causal discovery algorithms. Such algorithms try to recover the graphical structure of the underlying causal model based on, typically, observational data. In the following, we will usually denote the estimated CPDAG as and the CPDAG representing the true underlying model as .
The parent adjustment distance measures the distance of from . Broadly speaking (we will define this more precisely below), this distance measures the number of node pairs and for which the validity status the parents of in differs between the two graphs. For example, if the parents of in are an adjustment set for but not for the distance increases by one. To make this more precise, let us recall the CPDAG adjustment criterion by Perković et al. (2018) which we discussed previously in this article.
Theorem 1 [Perković et al. (2018)]
Consider disjoint node sets , and in a CPDAG . Then, is a valid adjustment set relative to in if, and only if,
- all proper possibly directed paths from to begin with a directed edge,
- and
- all proper definite-status non-causal paths from to are blocked by in .
This criterion gives a precise graphical characterization of adjustment sets relative to and in a CPDAG , which enables us to define the parent adjustment distance formally. As we have discussed this criterion and its conditions in detail before, we will not go into more details here. Moreover, in the algorithm below, we build on an equivalent characterization introduced by Henckel et al. (2024b). Before that, let us first properly introduce the parent adjustment distance.
Definition 1 (Parent adjustment distance)
Consider two graphs and . Then, the parent adjustment distance corresponds to the number of node pairs and such that
- Condition 1 of Theorem 1 is violated relative to in but not in ,
- is a parent of in which implies a causal effect of zero but is a descendant of in contradicting this, or
- the parents of in are a valid adjustment set in but not a valid adjustment set relative to and in .
Directly evaluating this definition, by iterating over all node pairs and checking the relevant conditions of the adjustment criterion as discussed, e.g., in our earlier article, would yield a time algorithm for graphs with nodes and edges. As shown by Henckel et al. (2024b), this can be improved further by checking, for a given , the status of all in linear-time at once.
Efficient algorithm for the parent adjustment distance
The algorithm by Henckel et al. (2024b) for computing the parent adjustment distance in time is based on the following lemma.
Lemma 1 [Henckel et al. (2024b)]
Consider a CPDAG and sets of nodes , and . Then, satisfies the adjustment criterion if, and only if,
- every proper possibly directed walk from to begins with a directed edge out of ,
- no proper possibly directed walk from to contains a node in and
- every proper definite-status walk from to that contains a backwards-facing edge is blocked by .
Algorithmically, it is easier to check the complements of the first and the third condition, that is, to check that no proper possibly directed walk from to begins with an undirected edge and that no proper definite-status walk from to that contains a backwards-facing edge is open given .
To implement the above lemma, it is first necessary to recall the definitions of proper, possibly directed, definite-status as well as blocked and open walks. A walk is conceptually similar to a path with the difference that the same nodes and edges can be traversed multiply times. It is called
- proper if it contains at most one node from and ,
- possibly directed if all edges are either undirected or point towards on the path,
- definite-status if every non-endpoint node is a collider or a definite-status non-collider (a definite-status non-collider is a node with left neighbor and right neighbor such that , or ),
- open if every collider is in and every non-collider is not in , and
- blocked if it is not open.
We are now able to formulate rule tables for the three conditions of Lemma 1 above. As we have already given the table for the first condition (non-amenability) in the article on checking the CPDAG adjustment criterion, we focus on the other two conditions here.
For the second condition, we need to track proper possibly directed walks from that contain a node in .
EDGES --> <--, ---
SETS X, W
COLORS pass, yield
START --> [pass] AT X
OUTPUT ... [yield]
... [pass] | ---, --> [pass] | next not in X and next not in W
... [pass] | ---, --> [yield] | next not in X and next in W
... [yield] | ---, --> [yield] | next not in X
To ensure properness, we always check that the next
node is not in . Moreover, we switch to color yield
once we find a node that is in while only tracking possibly directed paths. Hence, the set of nodes of color yield
that is returned contains all nodes reachable from via walks prescribed by Condition 2 of Lemma 1.
Tracking the paths in Condition 3 is a bit more complex. We leave the init
color with the first edge. Afterwards, open paths are tracked, which may become non-causal if a backwards facing edge is traversed. Again we ensure properness by checking that next
is not in in every rule. Other than that the rules ensure that the walks we track are open and of definite-status. Finally, we return all nodes reached with color non-causal
.
EDGES --> <--, ---
SETS X, W
COLORS init, poss-causal, non-causal
START ... [init] AT X
OUTPUT ... [non-causal]
... [init] | ---, --> [poss-causal] | next not in X
... [init] | <-- [non-causal] | next not in X
--> [...] | <-- [non-causal] | next not in X and current in W
--- [poss-causal] | --- [poss-causal] | next not in X and current not in W
... [poss-causal] | --> [poss-causal] | next not in X and current not in W
--- [non-causal] | --- [non-causal] | next not in X and current not in W
<-- [non-causal] | ... [non-causal] | next not in X and current not in W
... [non-causal] | --> [non-causal] | next not in X and current not in W
CIfly-based computation of the parent adjustment distance
The general idea of the algorithm is to go through all , look up its parents in , denoted by , and then compute the following sets in linear-time:
- all such that the pair and is not amenable (violates Condition 1 of the adjustment criterion) in ,
- all such that is a proper possible descendant of ,
- all such that the pair and is not amenable in , and
- all such that is not a valid adjustment set relative to and in .
With these sets we can compote the parent adjustment distance directly as defined in Definition 1. For this, we iterate over all and check whether the parent adjustment distances increases for pair and in constant time per . Overall, this yields run-time .
The only functionality we have not discussed yet is how to compute the set of parents for each node in the CIfly graph. Again, this can be done easily in Python and R directly, based on the CIfly graph representation. Of course, you can also rely on your favorite graph library or format for this task (and translate it into CIfly format only for performing the reach
calls).
As in our R example, we defer this code to the utils.R
source file, we show it here for completeness. The code avoids loops for better performance.
# p is the number of nodes and g the CIfly graph
parents <- function(p, g) {
dirEdges <- g[["-->"]]
grouped <- split(dirEdges[, 1], dirEdges[, 2])
result <- replicate(p, integer(0), simplify = FALSE)
names(result) <- as.character(1:p)
result[names(grouped)] <- grouped
return (result)
}
Generally, in this application we put more focus than usual on practical efficiency. More precisely,
- we parse the graphs into CIfly objects outside of the loop over all ,
- we have a single function which returns not-amenable and all for which the adjustment set is not valid in to avoid computing non-amenable twice, and
- we make sure that testing membership in the sets of nodes by using the
set
data structure in Python and Boolean indicator vectors in R (there is no native set data structure in R).
We do this to achieve comparable performance with the optimized gadjid library written in Rust and accessible from Python. In particular, parsing the graphs into CIfly objects is an effective trick for better practical performance if reach
is called often with the same graph. Finally, we present the overall code for computing the parent adjustment distance.
import ciflypy as cifly
import ciflypy_examples.utils as utils
ruletables = utils.get_ruletable_path()
poss_desc_table = cifly.Ruletable(str(ruletables / "possible_descendants_cpdag.txt"))
not_amenable_table = cifly.Ruletable(str(ruletables / "not_amenable_cpdag.txt"))
forb_path_conn_table = cifly.Ruletable(str(ruletables / "forbidden_path_connected_cpdag.txt"))
non_causal_conn_table = cifly.Ruletable(str(ruletables / "non_causal_connected_cpdag.txt"))
def poss_desc(g, x):
return set(cifly.reach(g, {"X": x, "W": []}, poss_desc_table))
def not_amenable(g, x):
return set(cifly.reach(g, {"X": x}, not_amenable_table))
def forbidden(g, x, W):
return set(cifly.reach(g, {"X": x, "W": W}, forb_path_conn_table))
def non_causal_connected(g, x, W):
return set(cifly.reach(g, {"X": x, "W": W}, non_causal_conn_table))
def not_amenable_not_adjustment(g, x, W):
nam = not_amenable(g, x)
return nam, nam.union(forbidden(g, x, W)).union(non_causal_connected(g, x, W))
def parent_aid(p, g_true, g_guess):
parents = [[] for _ in range(p)]
for u, v in g_guess["-->"]:
parents[v].append(u)
g_guess_parsed = cifly.Graph(g_guess, poss_desc_table)
g_true_parsed = cifly.Graph(g_true, poss_desc_table)
mistakes = 0
for x in range(p):
pa = set(parents[x])
not_amenable_guess = not_amenable(g_guess_parsed, x)
desc_true = poss_desc(g_true_parsed, x)
not_amenable_true, not_adjustment_true = not_amenable_not_adjustment(g_true_parsed, x, pa)
for y in range(p):
if y == x:
continue
if y in pa:
if y in desc_true:
mistakes += 1
elif y in not_amenable_guess:
if y not in not_amenable_true:
mistakes += 1
else:
if y in not_adjustment_true:
mistakes += 1
return mistakes
library("ciflyr")
library("here")
source(here("R", "utils.R"))
ruletables <- getRuletablePath()
possDescendants <- parseRuletable(file.path(ruletables, "possible_descendants_cpdag.txt"))
notAmenable <- parseRuletable(file.path(ruletables, "not_amenable_cpdag.txt"))
forbiddenPath <- parseRuletable(file.path(ruletables, "forbidden_path_connected_cpdag.txt"))
nonCausal <- parseRuletable(file.path(ruletables, "non_causal_connected_cpdag.txt"))
vectorToIndicator <- function(p, l) {
b <- rep(FALSE, p)
b[l] <- TRUE
return (b)
}
reachPossDescendants <- function(p, g, x) {
return (vectorToIndicator(p, reach(g, list("X" = x, "W" = c()), possDescendants)))
}
reachNotAmenable <- function(p, g, x) {
return (vectorToIndicator(p, reach(g, list("X" = x), notAmenable)))
}
reachForbiddenPath <- function(p, g, x, W) {
return (vectorToIndicator(p, reach(g, list("X" = x, "W" = W), forbiddenPath)))
}
reachNonCausal <- function(p, g, x, W) {
return (vectorToIndicator(p, reach(g, list("X" = x, "W" = W), nonCausal)))
}
notAmenablenotAdjustment <- function(p, g, x, W) {
nam <- reachNotAmenable(p, g, x)
forb <- reachForbiddenPath(p, g, x, W)
ncau <- reachNonCausal(p, g, x, W)
nad <- rep(FALSE, p)
nad[nam] <- TRUE
nad[forb] <- TRUE
nad[ncau] <- TRUE
return (list("nam" = nam, "nad" = nad))
}
parentAid <- function(p, gTrue, gGuess) {
gTrueParsed <- parseGraph(gTrue, possDescendants)
gGuessParsed <- parseGraph(gGuess, possDescendants)
paList <- parents(p, gGuess)
mistakes <- 0
for (x in seq(p)) {
pa <- vectorToIndicator(p, paList[[x]])
namGuess <- reachNotAmenable(p, gGuessParsed, x)
desTrue <- reachPossDescendants(p, gTrueParsed, x)
namnadTrue <- notAmenablenotAdjustment(p, gTrueParsed, x, paList[[x]])
for (y in seq(p)) {
if (y == x) {
next
}
if (pa[y]) {
if (desTrue[y]) {
mistakes <- mistakes + 1
}
} else if (namGuess[y]) {
if (!namnadTrue$nam[y]) {
mistakes <- mistakes + 1
}
} else {
if (namnadTrue$nad[y]) {
mistakes <- mistakes + 1
}
}
}
}
return (mistakes)
}