Parent adjustment distance

In this article, we present an implementation for computing the parent adjustment distance between two CPDAGs as proposed by Henckel et al. (2024b). CPDAGs are causal graphs made up of directed as well as undirected edges that are learned by causal discovery algorithms. Such algorithms try to recover the graphical structure of the underlying causal model based on, typically, observational data. In the following, we will usually denote the estimated CPDAG as GguessG_{\text{guess}} and the CPDAG representing the true underlying model as GtrueG_{\text{true}}.

The parent adjustment distance measures the distance of GguessG_{\text{guess}} from GtrueG_{\text{true}}. Broadly speaking (we will define this more precisely below), this distance measures the number of node pairs xx and yy for which the validity status the parents of xx in GguessG_{\text{guess}} differs between the two graphs. For example, if the parents of xx in GguessG_{\text{guess}} are an adjustment set for GguessG_{\text{guess}} but not for GtrueG_{\text{true}} the distance increases by one. To make this more precise, let us recall the CPDAG adjustment criterion by Perković et al. (2018) which we discussed previously in this article.

Theorem 1 [Perković et al. (2018)]
Consider disjoint node sets XX, YY and WW in a CPDAG GG. Then, WW is a valid adjustment set relative to (X,Y)(X, Y) in GG if, and only if,

  1. all proper possibly directed paths from XX to YY begin with a directed edge,
  2. forbG(X,Y)W=\text{forb}_G(X, Y) \cap W = \emptyset and
  3. all proper definite-status non-causal paths from XX to YY are blocked by WW in GG.

This criterion gives a precise graphical characterization of adjustment sets WW relative to XX and YY in a CPDAG GG, which enables us to define the parent adjustment distance formally. As we have discussed this criterion and its conditions in detail before, we will not go into more details here. Moreover, in the algorithm below, we build on an equivalent characterization introduced by Henckel et al. (2024b). Before that, let us first properly introduce the parent adjustment distance.

Definition 1 (Parent adjustment distance)
Consider two graphs Gguess=(V,Eguess)G_{\text{guess}} = (V, E_{\text{guess}}) and Gtrue=(V,Etrue)G_{\text{true}} = (V, E_{\text{true}}). Then, the parent adjustment distance corresponds to the number of node pairs xx and yy such that

  1. Condition 1 of Theorem 1 is violated relative to (x,y)(x,y) in GguessG_{\text{guess}} but not in GtrueG_{\text{true}},
  2. yy is a parent of xx in GguessG_{\text{guess}} which implies a causal effect of zero but yy is a descendant of xx in GtrueG_{\text{true}} contradicting this, or
  3. the parents of xx in GguessG_{\text{guess}} are a valid adjustment set in GguessG_{\text{guess}} but not a valid adjustment set relative to xx and yy in GtrueG_{\text{true}}.

Directly evaluating this definition, by iterating over all node pairs x,yx, y and checking the relevant conditions of the adjustment criterion as discussed, e.g., in our earlier article, would yield a p2(p+m)p^2 \cdot (p + m) time algorithm for graphs with pp nodes and mm edges. As shown by Henckel et al. (2024b), this can be improved further by checking, for a given xx, the status of all yy in linear-time at once.

Efficient algorithm for the parent adjustment distance

The algorithm by Henckel et al. (2024b) for computing the parent adjustment distance in time O(p(p+m))O(p \cdot (p + m)) is based on the following lemma.

Lemma 1 [Henckel et al. (2024b)]
Consider a CPDAG GG and sets of nodes XX, YY and WW. Then, WW satisfies the adjustment criterion if, and only if,

  1. every proper possibly directed walk from XX to YY begins with a directed edge out of XX,
  2. no proper possibly directed walk from XX to YY contains a node in WW and
  3. every proper definite-status walk from XX to YY that contains a backwards-facing edge is blocked by WW.

Algorithmically, it is easier to check the complements of the first and the third condition, that is, to check that no proper possibly directed walk from XX to YY begins with an undirected edge and that no proper definite-status walk from XX to YY that contains a backwards-facing edge is open given WW.

To implement the above lemma, it is first necessary to recall the definitions of proper, possibly directed, definite-status as well as blocked and open walks. A walk is conceptually similar to a path with the difference that the same nodes and edges can be traversed multiply times. It is called

  • proper if it contains at most one node from XX and YY,
  • possibly directed if all edges are either undirected or point towards YY on the path,
  • definite-status if every non-endpoint node is a collider or a definite-status non-collider (a definite-status non-collider is a node vv with left neighbor uu and right neighbor ww such that uvu \leftarrow v, vwv \rightarrow w or uvwu - v - w),
  • open if every collider is in WW and every non-collider is not in WW, and
  • blocked if it is not open.

We are now able to formulate rule tables for the three conditions of Lemma 1 above. As we have already given the table for the first condition (non-amenability) in the article on checking the CPDAG adjustment criterion, we focus on the other two conditions here.

For the second condition, we need to track proper possibly directed walks from XX that contain a node in WW.

forbidden_path_connected_cpdag.txt
EDGES --> <--, ---
SETS X, W
COLORS pass, yield
START --> [pass] AT X
OUTPUT ... [yield]

... [pass]  | ---, --> [pass]  | next not in X and next not in W
... [pass]  | ---, --> [yield] | next not in X and next in W
... [yield] | ---, --> [yield] | next not in X

To ensure properness, we always check that the next node is not in XX. Moreover, we switch to color yield once we find a node that is in WW while only tracking possibly directed paths. Hence, the set of nodes of color yield that is returned contains all nodes reachable from XX via walks prescribed by Condition 2 of Lemma 1.

Tracking the paths in Condition 3 is a bit more complex. We leave the init color with the first edge. Afterwards, open paths are tracked, which may become non-causal if a backwards facing edge is traversed. Again we ensure properness by checking that next is not in XX in every rule. Other than that the rules ensure that the walks we track are open and of definite-status. Finally, we return all nodes reached with color non-causal.

non_causal_connected_cpdag.txt
EDGES --> <--, ---
SETS X, W
COLORS init, poss-causal, non-causal
START ... [init] AT X
OUTPUT ... [non-causal]

... [init]        | ---, --> [poss-causal] | next not in X
... [init]        | <--      [non-causal]  | next not in X
--> [...]         | <--      [non-causal]  | next not in X and current in W
--- [poss-causal] | ---      [poss-causal] | next not in X and current not in W
... [poss-causal] | -->      [poss-causal] | next not in X and current not in W
--- [non-causal]  | ---      [non-causal]  | next not in X and current not in W
<-- [non-causal]  | ...      [non-causal]  | next not in X and current not in W
... [non-causal]  | -->      [non-causal]  | next not in X and current not in W

CIfly-based computation of the parent adjustment distance

The general idea of the algorithm is to go through all xx, look up its parents in GguessG_{\text{guess}}, denoted by paGguess(x)\text{pa}_{G_{\text{guess}}}(x), and then compute the following sets in linear-time:

  • all yy such that the pair xx and yy is not amenable (violates Condition 1 of the adjustment criterion) in GguessG_{\text{guess}},
  • all yy such that yy is a proper possible descendant of xx,
  • all yy such that the pair xx and yy is not amenable in GtrueG_{\text{true}}, and
  • all yy such that paGguess(x)\text{pa}_{G_{\text{guess}}}(x) is not a valid adjustment set relative to xx and yy in GtrueG_{\text{true}}.

With these sets we can compote the parent adjustment distance directly as defined in Definition 1. For this, we iterate over all yy and check whether the parent adjustment distances increases for pair xx and yy in constant time per yy. Overall, this yields run-time O(p(p+m))O(p \cdot (p + m)).

The only functionality we have not discussed yet is how to compute the set of parents for each node in the CIfly graph. Again, this can be done easily in Python and R directly, based on the CIfly graph representation. Of course, you can also rely on your favorite graph library or format for this task (and translate it into CIfly format only for performing the reach calls).

As in our R example, we defer this code to the utils.R source file, we show it here for completeness. The code avoids loops for better performance.

# p is the number of nodes and g the CIfly graph
parents <- function(p, g) {
	dirEdges <- g[["-->"]]
	grouped <- split(dirEdges[, 1], dirEdges[, 2])
	result <- replicate(p, integer(0), simplify = FALSE)
	names(result) <- as.character(1:p)
	result[names(grouped)] <- grouped
	return (result)
}

Generally, in this application we put more focus than usual on practical efficiency. More precisely,

  • we parse the graphs into CIfly objects outside of the loop over all xx,
  • we have a single function which returns not-amenable yy and all yy for which the adjustment set is not valid in GguessG_{\text{guess}} to avoid computing non-amenable yy twice, and
  • we make sure that testing membership in the sets of yy nodes by using the set data structure in Python and Boolean indicator vectors in R (there is no native set data structure in R).

We do this to achieve comparable performance with the optimized gadjid library written in Rust and accessible from Python. In particular, parsing the graphs into CIfly objects is an effective trick for better practical performance if reach is called often with the same graph. Finally, we present the overall code for computing the parent adjustment distance.

parent_aid.py
import ciflypy as cifly
import ciflypy_examples.utils as utils

ruletables = utils.get_ruletable_path()

poss_desc_table = cifly.Ruletable(str(ruletables / "possible_descendants_cpdag.txt"))
not_amenable_table = cifly.Ruletable(str(ruletables / "not_amenable_cpdag.txt"))
forb_path_conn_table = cifly.Ruletable(str(ruletables / "forbidden_path_connected_cpdag.txt"))
non_causal_conn_table = cifly.Ruletable(str(ruletables / "non_causal_connected_cpdag.txt"))

def poss_desc(g, x):
    return set(cifly.reach(g, {"X": x, "W": []}, poss_desc_table))

def not_amenable(g, x):
    return set(cifly.reach(g, {"X": x}, not_amenable_table))

def forbidden(g, x, W):
    return set(cifly.reach(g, {"X": x, "W": W}, forb_path_conn_table))

def non_causal_connected(g, x, W):
    return set(cifly.reach(g, {"X": x, "W": W}, non_causal_conn_table))

def not_amenable_not_adjustment(g, x, W):
    nam = not_amenable(g, x)
    return nam, nam.union(forbidden(g, x, W)).union(non_causal_connected(g, x, W))

def parent_aid(p, g_true, g_guess):
    parents = [[] for _ in range(p)]
    for u, v in g_guess["-->"]:
        parents[v].append(u)
    g_guess_parsed = cifly.Graph(g_guess, poss_desc_table)
    g_true_parsed = cifly.Graph(g_true, poss_desc_table)
    mistakes = 0

    for x in range(p):
        pa = set(parents[x])
        not_amenable_guess = not_amenable(g_guess_parsed, x)
        desc_true = poss_desc(g_true_parsed, x)
        not_amenable_true, not_adjustment_true = not_amenable_not_adjustment(g_true_parsed, x, pa)

        for y in range(p):
            if y == x:
                continue
            if y in pa:
                if y in desc_true:
                    mistakes += 1
            elif y in not_amenable_guess:
                if y not in not_amenable_true:
                    mistakes += 1
            else:
                if y in not_adjustment_true:
                    mistakes += 1

    return mistakes

References