Optimal IV · CIfly

In this article, we show how to implement an algorithm for constructing a graphically optimal conditional instrumental set. Instrumental variables are a popular tool for identifying and estimating causal effects in the presence of unmeasured confounding and conditional instrumental sets generalize this notion further. As there may be many different conditional instruments in a causal graph, it is an important question which intruments are preferable over others. Henckel et al. (2024a) provide a graphical criterion for a conditonal instrumental set with a statistical efficiency guarantees they call graphical optimality. The main result in this regard is stated in the following theorem. Note that we focus here purely on the graphical aspects of this problem, for more statistical and causal background, we refer the reader to the original paper.

Theorem 1 [Henckel et al. (2024a)]
Let $x$ and $y$ be distinct nodes in an acyclic directed mixed graph (ADMG) $G$ such that $\text{de}(x)_G = \{x, y\}$ . Then, it holds for the sets $W^{\text{opt}} = \text{dis}^+_{G, \{x\}}(y) \setminus \{x, y\}$ and $Z^{\text{opt}} = \text{dis}^+_{G, \{y\}}(x) \setminus (W^{\text{opt}} \cup \{x, y\})$ that

if $Z^\text{opt} \neq \emptyset$ , then $(Z^\text{opt}, W^\text{opt})$ is a valid conditional instrumental set relative to $(x, y)$ in $G$ and
if $Z^\text{opt} \cap (\text{pa}_G(x) \cup \text{sib}_G(x)) \neq \emptyset$ , then $(Z^\text{opt}, W^\text{opt})$ is graphically optimal relative to $(x, y)$ in $G$ .

The criterion is stated for acyclic directed mixed graph, from now on abbreviated as ADMGs, which are graphs with both directed and bidirected edges that do not contain a directed cycle. $\text{pa}_G(x)$ are the parents of $x$ in $G$ , i.e., nodes $p$ with $p \rightarrow x$ , and $\text{sib}_G(x)$ are the siblings of $x$ in $G$ , i.e., the nodes with $p \leftrightarrow x$ . Further, $\text{dis}_{G, W}(x)$ contains all nodes connected to $x$ via a path of bidirected edges, with all nodes on the path not being in $W$ , and $\text{dis}_{G, W}^+(x) = (\text{dis}_{G, W}(x) \cup \text{pa}_G(\text{dis}_{G, W}(x))) \setminus W$ .

The criterion assumes that $y$ is the only descendant of $x$ , which may appear like a strong restriction. However, one may consider the latent projection over all other descendants of $x$ to satisfy this condition (for more details on the latent projection, we refer to the CIfly paper). The only consequence of this, is that no descendants of $x$ in the original graph can be part of the instrumental set. However, this is a mild assumption because if such variables could be used as part of the instrument, then the causal effect would also be identifiable using adjustment. One way to deal with it, is to first search for an adjustment set (a task that can also be solved using CIfly) and if none exists proceed with the methodology described in this article. Hence, in the following, we will always assume that the causal effect cannot be identified by adjustment.

While using the latent projection to remove all descendants of $x$ (apart from $y$ ) is methodologically reasonable under this assumption, algorithmically, computing the latent projection is more expensive than CIfly reachability algorithms. Hence, we would like to avoid this operation here. To this aim, we use the following graphical characterization of $W^\text{opt}$ and $Z^\text{opt}$ developed in the CIfly paper.

Theorem 2
Consider distinct nodes $x$ and $y$ in an ADMG $G$ and $D = \text{de}_G(x)$ such that $y \in D$ and let $A = V \setminus D$ . Then, $W^{\text{opt}}$ and $Z^{\text{opt}}$ for the graph obtained by latent projection over $D \setminus \{x, y\}$ are equivalent to the following constructions in $G$ :

$W^{\text{opt}} = \{ v \in V \; \mid \; v \text{ is on a path } p \rightarrow c_1 \leftrightarrow \cdots \leftrightarrow \cdots c_k \leftrightarrow d_1 \rightarrow \cdots \rightarrow \cdots d_l \rightarrow y \} \cap A$ , where $p, c_1, \dots, c_k \in A$ and $d_1, \dots, d_l \in (D \setminus \{ x \})$ possibly with $k=0$ or $l=0$ and
$Z^{\text{opt}} = \{ v \in V \; \mid \; v \text{ is on a path } p \rightarrow c_1 \leftrightarrow \cdots \leftrightarrow c_k \leftrightarrow x\} \setminus W^{\text{opt}}$ , where $p, c_1, \dots, c_k \in A$ possibly with $k=0$ .

Using this theorem, we proceed as follows:

Check that $y \in \text{de}_G(x)$ .
Compute $W^{\text{opt}}$ and $Z^{\text{opt}}$ as defined in Theorem 2.
Check Condition 1. and 2. of Theorem 1.

This will guarantee that the resulting sets $W^{\text{opt}}$ and $Z^{\text{opt}}$ constitute graphical optimal instrumental sets, assuming that there is no adjustment set relative to $x$ and $y$ in $G$ . Now, let us consider the three steps one-by-one.

Descendant check

Of course, the task of checking whether $y$ is a descendant of $x$ in $G$ is obvious and available in many off-the-shelf graph packages. Still, we show how it can be easily done in CIfly as well. For this, we use the following rule table.

descendants_admg.txt

EDGES --> <--, <->
SETS X
START --> AT X
OUTPUT ...

... | --> | true

Effectively, this table describes to start at the nodes in $X$ as if the last edge was $\rightarrow$ and from there only follow directed edges. All reached nodes are returned as the descendants of $X$ . We can use this in as follows to check whether $y \in \text{de}_G(x)$ .

descendants = cf.reach(g, {"X": x}, "descendants_admg.txt")
isDescendant = y in descendants

Compute $W^{\text{opt}}$ and $Z^{\text{opt}}$

We will compute both $W^{\text{opt}}$ and $Z^{\text{opt}}$ using the same underlying ruletable. Let us begin with $W^{\text{opt}}$ . Here, we want to start at $y$ and track paths that, first, follow edges $\leftarrow$ (assuming the respective nodes are in $D \setminus \{x\}$ ) until at some point we switch to bidirected edges for edges in $A$ . The path may end with final directed edge $\leftarrow$ to a node in $A$ . To do this consider the following rule table.

optimal_iv_admg.txt

EDGES  --> <--, <->
SETS   S, D, F
COLORS pass, yield
START  <-- [pass] AT S
OUTPUT ... [yield]

...         | -->              | false
<-- [pass]  | <--      [pass]  | next in D     and next not in F
<-- [pass]  | <--, <-> [yield] | next not in D and next not in F
<-> [yield] | <--, <-> [yield] | next not in D and next not in F

Here, we use the fact that $A = V \setminus D$ , hence it suffices to pass $D$ . The set $S$ forms the starting nodes and $F$ allows us to restrict which nodes are visited. This table tracks exactly the kind of paths we have described above. We ensure that it only returns nodes reached in $A$ by using the colors pass and yield. We can compute both $W^{\text{opt}}$ and $Z^{\text{opt}}$ using this table as follows. For the latter, observe that descendants of $x$ can never be connected to it via a path of the form $x \leftarrow \cdots$ due to the acyclicity of the underlying graph.

W = set(cf.reach(g, {"S": y, "D": descendants, "F": x}, "./optimal_iv_admg.txt"))
Z = set(cf.reach(g, {"S": x, "D": descendants, "F": []}, "./optimal_iv_admg.txt")) - W

Checking the optimality condition

The resulting sets $Z^{\text{opt}}$ and $W^{\text{opt}}$ are valid, provided $Z^{\text{opt}}$ is non-empty, which is trivial to check outside of CIfly. The second condition is that $Z^{\text{opt}}$ contains a parent or sibling of $x$ . While this could be checked using a CIfly algorithm, clearly it is easier to do this directly in Python or R as well. Here, we do this based on the CIfly graph representation, but you may prefer to do this using your favorite graph library or format instead.

Recall that a CIfly graph is stored as one edge list per edge type. In Python, each edge is given as a pair of node IDs. Hence, we may use the following helper functions to check whether $Z$ contains a parent or sibling of $x$ .

def contains_parent(g, x, Z):
    return any(map(lambda uv: uv[0] in Z and uv[1] == x, g["-->"]))

def contains_sibling(g, x, Z):
    return any(map(lambda uv: (uv[0] in Z and uv[1] == x) or (uv[0] == x and uv[1] in Z), g["<->"]))

To do the parent check, the lambda function tests for a pair uv from the edge list of directed edges whether the first node is in $Z$ and the second node is $x$ . The sibling check is a bit more verbose because both orderings have to be tested. Provided that testing membership in $Z$ is constant-time (as is the case here because we pass $Z$ as a Python set), these functions run in linear-time in the size of the graph as a whole edge list is searched. Of course, better run-times are possible if graph representations more suitable to this task are used. While this does not matter for the time complexity of this procedure, it can be useful to maintain such a representation in case the parent and sibling checks are needed repeatedly.

In R, the edge lists are given as $r \times 2$ matrices in case there are $r$ edges of that type. In order to check whether some node in $Z$ is a parent of $x$ we can filter the edge list matrix for such rows, quite similarly to the Python code above. However, because R has no native set data structure, we take some extra steps to ensure efficient look-up in the filtering (even for large sets $Z$ ). Namely, we first store a Boolean vector indicating membership in $Z$ . Afterwards, we filter the edges in g[["-->"]] that have a node in $Z$ as first and $x$ as second element and return TRUE or FALSE based on the length of the filtered matrix. Let us consider this for the parent check first.

containsParent <- function(g, x, Z) {
	# enable efficient lookup of membership in Z
	p <- highestNodeId(g)
	zLogical <- rep(FALSE, p)
	zLogical[Z] <- TRUE

	edgeList <- g[["-->"]]
	return (length(edgeList[zLogical[edgeList[,1]] & edgeList[,2] == x]) > 0)
}

This code uses a small helper function highestNodeId to find the highest node ID that is part of an edge in a CIfly graph. Note that the number of nodes is not stored in a CIfly graph object and, hence, has to be managed outside of it if needed.

highestNodeId <- function(g) {
	nodeIds <- unlist(g)
	if (length(nodeIds) == 0) {
		return (0)
	} else {
		return (max(nodeIds))
	}
}

Again, the siblings can be retrieved similarly, by checking for both possible orders of the endpoints in the edge list matrix.

containsSibling <- function(g, x, Z) {
	p <- highestNodeId(g)
	zLogical <- rep(FALSE, p)
	zLogical[Z] <- TRUE

	edgeList <- g[["<->"]]
	return (length(edgeList[(zLogical[edgeList[,1]] & edgeList[,2] == x) | (zLogical[edgeList[,2]] & edgeList[,1] == x)]) > 0)
}

We note that in case this functionality is used more often, a helper function

containsNeighborCifly(g, x, Z, edgeType, orderedEdge=TRUE)

may prove useful, which abstracts this code for different edge types including an argument that indicates whether the edge type is ordered.

Full implementation

Hence, we obtain the following implementation for finding graphically optimal instrumental sets $Z$ and $W$ assuming that the causal effect of $x$ on $y$ cannot be identified with adjustment.

The functions for checking parents and siblings are loaded from utils. As before, we also rely on a simple function get_ruletable_path(), respectively getRuletablePath(), that returns a path to the rule-table folder.

optimal_iv.py

import ciflypy as cf
import ciflypy_examples.utils as utils

ruletables = utils.get_ruletable_path()
descendants_table = cf.Ruletable(ruletables / "descendants_admg.txt")
optimal_iv_table = cf.Ruletable(ruletables / "optimal_iv_admg.txt")

# find Z, W according to the criterion given by Henckel et al. (2024)
# NOTE: the returned instrument is only optimal if there is no adjustment set between x and y
# we do NOT check this here and thus if this assumption is violated
# the function alone only ensures soundness, but not optimality
def optimal_instrument(g, x, y):
    descendants = cf.reach(g, {"X": x}, descendants_table)
    if y not in descendants:
        return None

    W = set(cf.reach(g, {"S": y, "D": descendants, "F": x}, optimal_iv_table))
    Z = set(cf.reach(g, {"S": x, "D": descendants, "F": []}, optimal_iv_table)) - W

    if Z and (utils.contains_parent(g, x, Z) or utils.contains_sibling(g, x, Z)):
        return (list(Z), list(W))
    else:
        return None

optimalIV.R

library("ciflyr")
library("here")

source(here("R", "utils.R"))

ruletables <- getRuletablePath()
descendantsTable <- parseRuletable(file.path(ruletables, "descendants_admg.txt"))
optimalIVTable <- parseRuletable(file.path(ruletables, "optimal_iv_admg.txt"))

# find Z, W according to the criterion given by Henckel et al. (2024)
# NOTE: the returned instrument is only optimal if there is no adjustment set between x and y
# we do NOT check this here and thus if this assumption is violated
# the function alone only ensures soundness, but not optimality
optimalInstrument <- function(g, x, y) {
	descendants <- reach(g, list("X" = x), descendantsTable)
	if (!(y %in% descendants)) {
		return (NULL)
	} 
	W <- reach(g, list("S" = y, "D" = descendants, "F" = x), optimalIVTable)
	Z <- setdiff(reach(g, list("S" = x, "D" = descendants, "F" = c()), optimalIVTable), W)

	if (length(Z) > 0 && (containsParent(g, x, Z) || containsSibling(g, x, Z))) {
		return (list("Z" = Z, "W" = W))
	} else {
		return (NULL)
	}
}

References

Leonard Henckel, Martin Buttenschoen, Marloes H. Maathuis.

Graphical tools for selecting conditional instrumental sets. Biometrika, 2024.

Optimal conditional instruments

Descendant check

Compute WoptW^{\text{opt}}Wopt and ZoptZ^{\text{opt}}Zopt

Checking the optimality condition

Full implementation

References

Compute $W^{\text{opt}}$ and $Z^{\text{opt}}$