Optimal conditional instruments
In this article, we show how to implement an algorithm for constructing a graphically optimal conditional instrumental set. Instrumental variables are a popular tool for identifying and estimating causal effects in the presence of unmeasured confounding and conditional instrumental sets generalize this notion further. As there may be many different conditional instruments in a causal graph, it is an important question which intruments are preferable over others. Henckel et al. (2024a) provide a graphical criterion for a conditonal instrumental set with a statistical efficiency guarantees they call graphical optimality. The main result in this regard is stated in the following theorem. Note that we focus here purely on the graphical aspects of this problem, for more statistical and causal background, we refer the reader to the original paper.
Theorem 1 [Henckel et al. (2024a)]
Let and be distinct nodes in an acyclic directed mixed graph (ADMG) such that . Then, it holds for the sets and that
- if , then is a valid conditional instrumental set relative to in and
- if , then is graphically optimal relative to in .
The criterion is stated for acyclic directed mixed graph, from now on abbreviated as ADMGs, which are graphs with both directed and bidirected edges that do not contain a directed cycle. are the parents of in , i.e., nodes with , and are the siblings of in , i.e., the nodes with . Further, contains all nodes connected to via a path of bidirected edges, with all nodes on the path not being in , and .
The criterion assumes that is the only descendant of , which may appear like a strong restriction. However, one may consider the latent projection over all other descendants of to satisfy this condition (for more details on the latent projection, we refer to the CIfly paper). The only consequence of this, is that no descendants of in the original graph can be part of the instrumental set. However, this is a mild assumption because if such variables could be used as part of the instrument, then the causal effect would also be identifiable using adjustment. One way to deal with it, is to first search for an adjustment set (a task that can also be solved using CIfly) and if none exists proceed with the methodology described in this article. Hence, in the following, we will always assume that the causal effect cannot be identified by adjustment.
While using the latent projection to remove all descendants of (apart from ) is methodologically reasonable under this assumption, algorithmically, computing the latent projection is more expensive than CIfly reachability algorithms. Hence, we would like to avoid this operation here. To this aim, we use the following graphical characterization of and developed in the CIfly paper.
Theorem 2
Consider distinct nodes and in an ADMG and such that and let . Then, and for the graph obtained by latent projection over are equivalent to the following constructions in :
- , where and possibly with or and
- , where possibly with .
Using this theorem, we proceed as follows:
- Check that .
- Compute and as defined in Theorem 2.
- Check Condition 1. and 2. of Theorem 1.
This will guarantee that the resulting sets and constitute graphical optimal instrumental sets, assuming that there is no adjustment set relative to and in . Now, let us consider the three steps one-by-one.
Descendant check
Of course, the task of checking whether is a descendant of in is obvious and available in many off-the-shelf graph packages. Still, we show how it can be easily done in CIfly as well. For this, we use the following rule table.
EDGES --> <--, <->
SETS X
START --> AT X
OUTPUT ...
... | --> | true
Effectively, this table describes to start at the nodes in as if the last edge was and from there only follow directed edges. All reached nodes are returned as the descendants of . We can use this in as follows to check whether .
descendants = cf.reach(g, {"X": x}, "descendants_admg.txt")
isDescendant = y in descendants
descendants <- reach(g, list("X" = x), "descendants_admg.txt")
isDescendant <- y %in% descendants
Compute and
We will compute both and using the same underlying ruletable. Let us begin with . Here, we want to start at and track paths that, first, follow edges (assuming the respective nodes are in ) until at some point we switch to bidirected edges for edges in . The path may end with final directed edge to a node in . To do this consider the following rule table.
EDGES --> <--, <->
SETS S, D, F
COLORS pass, yield
START <-- [pass] AT S
OUTPUT ... [yield]
... | --> | false
<-- [pass] | <-- [pass] | next in D and next not in F
<-- [pass] | <--, <-> [yield] | next not in D and next not in F
<-> [yield] | <--, <-> [yield] | next not in D and next not in F
Here, we use the fact that , hence it suffices to pass . The set forms the starting nodes and allows us to restrict which nodes are visited. This table tracks exactly the kind of paths we have described above. We ensure that it only returns nodes reached in by using the colors pass
and yield
. We can compute both and using this table as follows. For the latter, observe that descendants of can never be connected to it via a path of the form due to the acyclicity of the underlying graph.
W = set(cf.reach(g, {"S": y, "D": descendants, "F": x}, "./optimal_iv_admg.txt"))
Z = set(cf.reach(g, {"S": x, "D": descendants, "F": []}, "./optimal_iv_admg.txt")) - W
W <- reach(g, list("S" = y, "D" = descendants, "F" = x), "./optimal_iv_admg.txt")
Z <- setdiff(reach(g, list("S" = x, "D" = des, "F" = c()), "./optimal_iv_admg.txt"), W)
Checking the optimality condition
The resulting sets and are valid, provided is non-empty, which is trivial to check outside of CIfly. The second condition is that contains a parent or sibling of . While this could be checked using a CIfly algorithm, clearly it is easier to do this directly in Python or R as well. Here, we do this based on the CIfly graph representation, but you may prefer to do this using your favorite graph library or format instead.
Recall that a CIfly graph is stored as one edge list per edge type. In Python, each edge is given as a pair of node IDs. Hence, we may use the following helper functions to check whether contains a parent or sibling of .
def contains_parent(g, x, Z):
return any(map(lambda uv: uv[0] in Z and uv[1] == x, g["-->"]))
def contains_sibling(g, x, Z):
return any(map(lambda uv: (uv[0] in Z and uv[1] == x) or (uv[0] == x and uv[1] in Z), g["<->"]))
To do the parent check, the lambda function tests for a pair uv
from the edge list of directed edges whether the first node is in and the second node is . The sibling check is a bit more verbose because both orderings have to be tested. Provided that testing membership in is constant-time (as is the case here because we pass as a Python set), these functions run in linear-time in the size of the graph as a whole edge list is searched. Of course, better run-times are possible if graph representations more suitable to this task are used. While this does not matter for the time complexity of this procedure, it can be useful to maintain such a representation in case the parent and sibling checks are needed repeatedly.
In R, the edge lists are given as matrices in case there are edges of that type. In order to check whether some node in is a parent of we can filter the edge list matrix for such rows, quite similarly to the Python code above. However, because R has no native set data structure, we take some extra steps to ensure efficient look-up in the filtering (even for large sets ). Namely, we first store a Boolean vector indicating membership in . Afterwards, we filter the edges in g[["-->"]]
that have a node in as first and as second element and return TRUE
or FALSE
based on the length of the filtered matrix. Let us consider this for the parent check first.
containsParent <- function(g, x, Z) {
# enable efficient lookup of membership in Z
p <- highestNodeId(g)
zLogical <- rep(FALSE, p)
zLogical[Z] <- TRUE
edgeList <- g[["-->"]]
return (length(edgeList[zLogical[edgeList[,1]] & edgeList[,2] == x]) > 0)
}
This code uses a small helper function highestNodeId
to find the highest node ID that is part of an edge in a CIfly graph. Note that the number of nodes is not stored in a CIfly graph object and, hence, has to be managed outside of it if needed.
highestNodeId <- function(g) {
nodeIds <- unlist(g)
if (length(nodeIds) == 0) {
return (0)
} else {
return (max(nodeIds))
}
}
Again, the siblings can be retrieved similarly, by checking for both possible orders of the endpoints in the edge list matrix.
containsSibling <- function(g, x, Z) {
p <- highestNodeId(g)
zLogical <- rep(FALSE, p)
zLogical[Z] <- TRUE
edgeList <- g[["<->"]]
return (length(edgeList[(zLogical[edgeList[,1]] & edgeList[,2] == x) | (zLogical[edgeList[,2]] & edgeList[,1] == x)]) > 0)
}
We note that in case this functionality is used more often, a helper function
containsNeighborCifly(g, x, Z, edgeType, orderedEdge=TRUE)
may prove useful, which abstracts this code for different edge types including an argument that indicates whether the edge type is ordered.
Full implementation
Hence, we obtain the following implementation for finding graphically optimal instrumental sets and assuming that the causal effect of on cannot be identified with adjustment.
The functions for checking parents and siblings are loaded from utils
. As before, we also rely on a simple function get_ruletable_path()
, respectively getRuletablePath()
, that returns a path to the rule-table folder.
import ciflypy as cf
import ciflypy_examples.utils as utils
ruletables = utils.get_ruletable_path()
descendants_table = cf.Ruletable(ruletables / "descendants_admg.txt")
optimal_iv_table = cf.Ruletable(ruletables / "optimal_iv_admg.txt")
# find Z, W according to the criterion given by Henckel et al. (2024)
# NOTE: the returned instrument is only optimal if there is no adjustment set between x and y
# we do NOT check this here and thus if this assumption is violated
# the function alone only ensures soundness, but not optimality
def optimal_instrument(g, x, y):
descendants = cf.reach(g, {"X": x}, descendants_table)
if y not in descendants:
return None
W = set(cf.reach(g, {"S": y, "D": descendants, "F": x}, optimal_iv_table))
Z = set(cf.reach(g, {"S": x, "D": descendants, "F": []}, optimal_iv_table)) - W
if Z and (utils.contains_parent(g, x, Z) or utils.contains_sibling(g, x, Z)):
return (list(Z), list(W))
else:
return None
library("ciflyr")
library("here")
source(here("R", "utils.R"))
ruletables <- getRuletablePath()
descendantsTable <- parseRuletable(file.path(ruletables, "descendants_admg.txt"))
optimalIVTable <- parseRuletable(file.path(ruletables, "optimal_iv_admg.txt"))
# find Z, W according to the criterion given by Henckel et al. (2024)
# NOTE: the returned instrument is only optimal if there is no adjustment set between x and y
# we do NOT check this here and thus if this assumption is violated
# the function alone only ensures soundness, but not optimality
optimalInstrument <- function(g, x, y) {
descendants <- reach(g, list("X" = x), descendantsTable)
if (!(y %in% descendants)) {
return (NULL)
}
W <- reach(g, list("S" = y, "D" = descendants, "F" = x), optimalIVTable)
Z <- setdiff(reach(g, list("S" = x, "D" = descendants, "F" = c()), optimalIVTable), W)
if (length(Z) > 0 && (containsParent(g, x, Z) || containsSibling(g, x, Z))) {
return (list("Z" = Z, "W" = W))
} else {
return (NULL)
}
}