Theoretical foundations
In this article, we explore the main ideas underlying the CIfly framework. This will lay the groundwork for introducing the precise CIfly syntax afterwards. For a more detailed exposition, we refer to the paper which thoroughly develops these theoretical foundations. The central principle of CIfly is to solve tasks from graphical causal inference with reachability algorithms. The following illustration gives a overview of what we call a causal-to-reach reduction.
We assume that the causal input consists of a graph and a sequence of node sets . Note that there are multiple edge sets each containing the edges of a specific type. E.g., may contain directed edges and undirected edges. The output of the causal task is a set of vertices .
Of course, many tasks in causal inference do not directly fit into this framework, however, often they can be solved by multiple causal primitives that do. The most basic example of a CIfly primitive is to find all nodes d-connected to a set of vertices given a conditioning set in a DAG . Here, a set of d-connected nodes is returned, which can be subsequently used to test d-separation of and by checking wether is empty.
The causal primitive is solved by constructing a reachability input, consisting of a directed graph and a node set , then finding all nodes reachable from in and mapping this set back to the causal domain to the result . Note that the graph has a different node and edge set compared to original graph .
The CIfly framework offers a concise way of specifying the mappings and between the causal and reachability task. Based on this, it can automatically perform the reduction shown in the figure above. Crucially, the mappings are defined in what is called a rule table. We define rule tables in such a way that they can express quite general mappings and while still guaranteeing an overall linear run-time in the size of the causal input. We call the reductions expressed in this way CIfly reductions.
To explain CIfly reductions, let us first introduce the structure of the reachability input graph . We define where is the set of neighbor-types and is a finite set of colors. A node is also called a state. The neighbor-types correspond to the different types of neighbors a node might have. E.g., in a causal graph with both directed and undirected edges, a node can have three types of neighbors: parents , children and undirected neighbors . The colors can be used to further distinguish between states, more on that later.
The directed edges in follow from a set of rules that given in a rule table. Consider two states and in . A rule might say: if and , and moreover , then there should be an edge in . To connect this to the original graph , this edge will only be present if is an neighbor of type of in .
This construction may appear quite complicated at first, but essentially this connects walks in the graph to paths in where information about the type of the last traversed edge is stored in the states. By choosing the rules and using colors for added complexity we can control what walks in correspond to directed paths in . Through this we reduce complex causal walk condition in to algorithmically simple reachability problems in . We now illustrate this on the example of d-connectivity.
Definition (d-separation)
Let and be two disjoint node sets in a DAG . Then, is d-connected to in if there exists , and a walk from to such that
- for any collider on , it holds that and
- for any non-collider on , it holds that .
Conversely, if and are not d-connected, they are called d-separated.
Note that this definition is slightly different from the more common one that, instead of walks (which may visit nodes multiple times), talk about paths between and . However, this has the drawback that the condition for colliders is more complicated. The definition above can be translated directly into the following CIfly rule table that returns all nodes d-connected to given .
EDGES --> <--
SETS X, Z
START <-- AT X
OUTPUT ...
--> | <-- | current in Z
... | ... | current not in Z
The precise syntax of rule tables is discussed in the next article. As intuitive explanation of the rules at the bottom, observe that the first line concerns colliders for which it is checked that current node is in set , exactly as in the definition of d-connectivity. The condition in the second line is only evaluated in case the first one was skipped. Hence, the second line is only reached for non-colliders. Here, ...
matches any edge, thus this is a catch-all case for which it is checked that the current node is not in set .
In terms of states of the reachability input graph, the rule
--> | <-- | current in Z
effectively prescribes that there is a transition to if and there is an edge in (this is checked implicitly and does not have to be stated in the rule). Note that we just use as color because colors are not needed for this example. Let us consider the constructed reachability input for a concrete input graph.
In the DAG on the left, the nodes and are d-connected given due to the walk . On the right, the path corresponding to this walk is highlighted in color. There are two copies of every node from the original graph, one for each neighbor-type (this is marked at the top left of the boxes). This neighbor-type encodes the type (and direction) of the last edge on the walk. Hence, the states that are visited on the path are, in this order, , , , , , . Here, the edge in the first state does not correspond to an edge in and can be seen as a notational artifact but the edges in all other states on this path by construction do.
In the next articles, we will focus on the syntax of CIfly rule tables and the ciflypy
and ciflyr
packages for using CIfly in your Python or R code.