Theoretical foundations

In this article, we explore the main ideas underlying the CIfly framework. This will lay the groundwork for introducing the precise CIfly syntax afterwards. For a more detailed exposition, we refer to the paper which thoroughly develops these theoretical foundations. The central principle of CIfly is to solve tasks from graphical causal inference with reachability algorithms. The following illustration gives a overview of what we call a causal-to-reach reduction.

We assume that the causal input consists of a graph G=(V,(E1,,Ek))G=(V, (E_1, \dots, E_k)) and a sequence of node sets (L1,,L)(L_1, \dots, L_{\ell}). Note that there are multiple edge sets each containing the edges of a specific type. E.g., E1E_1 may contain directed edges and E2E_2 undirected edges. The output of the causal task is a set of vertices RVR \subseteq V.

Of course, many tasks in causal inference do not directly fit into this framework, however, often they can be solved by multiple causal primitives that do. The most basic example of a CIfly primitive is to find all nodes d-connected to a set of vertices XX given a conditioning set ZZ in a DAG GG. Here, a set RR of d-connected nodes is returned, which can be subsequently used to test d-separation of XX and YY by checking wether RYR \cap Y is empty.

The causal primitive is solved by constructing a reachability input, consisting of a directed graph Gdir=(Vdir,Edir)G_{\text{dir}} = (V_{\text{dir}}, E_{\text{dir}}) and a node set SVdirS \subseteq V_{\text{dir}}, then finding all nodes reachable from SS in GdirG_{\text{dir}} and mapping this set TT back to the causal domain to the result RR. Note that the graph GdirG_{\text{dir}} has a different node and edge set compared to original graph GG.

The CIfly framework offers a concise way of specifying the mappings I\mathcal{I} and O\mathcal{O} between the causal and reachability task. Based on this, it can automatically perform the reduction shown in the figure above. Crucially, the mappings are defined in what is called a rule table. We define rule tables in such a way that they can express quite general mappings I\mathcal{I} and O\mathcal{O} while still guaranteeing an overall linear run-time in the size of the causal input. We call the reductions expressed in this way CIfly reductions.

To explain CIfly reductions, let us first introduce the structure of the reachability input graph GdirG_{\text{dir}}. We define Vdir=V×NE×CV_{\text{dir}} = V \times N_{\mathcal{E}} \times C where NEN_{\mathcal{E}} is the set of neighbor-types and CC is a finite set of colors. A node sVdirs \in V_{\text{dir}} is also called a state. The neighbor-types correspond to the different types of neighbors a node might have. E.g., in a causal graph with both directed and undirected edges, a node vv can have three types of neighbors: parents uvu \rightarrow v, children uvu \leftarrow v and undirected neighbors uvu - v. The colors can be used to further distinguish between states, more on that later.

The directed edges in GdirG_{\text{dir}} follow from a set of rules that given in a rule table. Consider two states (v1,n1,c1)(v_1, n_1, c_1) and (v2,n2,c2)(v_2, n_2, c_2) in VdirV_{\text{dir}}. A rule might say: if n1=  n_1 = \; \rightarrow and n2=  n_2 = \; \leftarrow, and moreover v2L2v_2 \in L_2, then there should be an edge (v1,n1,c1)(v2,n2,c2)(v_1, n_1, c_1) \rightarrow (v_2, n_2, c_2) in GdirG_{\text{dir}}. To connect this to the original graph GG, this edge will only be present if v2v_2 is an neighbor of type n2n_2 of v1v_1 in GG.

This construction may appear quite complicated at first, but essentially this connects walks in the graph GG to paths in GdirG_{\text{dir}} where information about the type of the last traversed edge is stored in the states. By choosing the rules and using colors for added complexity we can control what walks in GG correspond to directed paths in GdirG_{\text{dir}}. Through this we reduce complex causal walk condition in GG to algorithmically simple reachability problems in GdirG_{\text{dir}}. We now illustrate this on the example of d-connectivity.

Definition (d-separation)
Let XX and ZZ be two disjoint node sets in a DAG G=(V,E)G = (V, E). Then, YV(XZ)Y \subseteq V \setminus (X \cup Z) is d-connected to XX in GG if there exists xXx \in X, yYy \in Y and a walk ww from xx to yy such that

  1. for any collider vv on ww, it holds that vZv \in Z and
  2. for any non-collider vv on ww, it holds that v∉Zv \not\in Z.

Conversely, if XX and YY are not d-connected, they are called d-separated.

Note that this definition is slightly different from the more common one that, instead of walks (which may visit nodes multiple times), talk about paths between xx and yy. However, this has the drawback that the condition for colliders is more complicated. The definition above can be translated directly into the following CIfly rule table that returns all nodes d-connected to XX given ZZ.

EDGES --> <--
SETS X, Z
START <-- AT X
OUTPUT ...

--> | <-- | current in Z
... | ... | current not in Z

The precise syntax of rule tables is discussed in the next article. As intuitive explanation of the rules at the bottom, observe that the first line concerns colliders for which it is checked that current node is in set ZZ, exactly as in the definition of d-connectivity. The condition in the second line is only evaluated in case the first one was skipped. Hence, the second line is only reached for non-colliders. Here, ... matches any edge, thus this is a catch-all case for which it is checked that the current node is not in set ZZ.

In terms of states of the reachability input graph, the rule

--> | <-- | current in Z

effectively prescribes that there is a transition (current,,)(\text{current}, \rightarrow, \cdot) to (next,,)(\text{next}, \leftarrow, \cdot) if currentZ\text{current} \in Z and there is an edge currentnext\text{current} \leftarrow \text{next} in GG (this is checked implicitly and does not have to be stated in the rule). Note that we just use \cdot as color because colors are not needed for this example. Let us consider the constructed reachability input for a concrete input graph.

In the DAG on the left, the nodes v1v_1 and v4v_4 are d-connected given v5v_5 due to the walk v1v2v5v2v3v4v_1 \rightarrow v_2 \rightarrow v_5 \leftarrow v_2 \leftarrow v_3 \rightarrow v_4. On the right, the path corresponding to this walk is highlighted in color. There are two copies of every node from the original graph, one for each neighbor-type (this is marked at the top left of the boxes). This neighbor-type encodes the type (and direction) of the last edge on the walk. Hence, the states that are visited on the path are, in this order, (v1,,)(v_1, \leftarrow, \cdot), (v2,,)(v_2, \rightarrow, \cdot), (v5,,)(v_5, \rightarrow, \cdot), (v2,,)(v_2, \leftarrow, \cdot), (v3,,)(v_3, \leftarrow, \cdot), (v4,,)(v_4, \rightarrow, \cdot). Here, the edge in the first state (v1,,)(v_1, \leftarrow, \cdot) does not correspond to an edge in GG and can be seen as a notational artifact but the edges in all other states on this path by construction do.

In the next articles, we will focus on the syntax of CIfly rule tables and the ciflypy and ciflyr packages for using CIfly in your Python or R code.