| SOCR ≫ | DSPA ≫ | DSPA2 Topics ≫ |
This DSPA Appendix presents the foundations of causal inference in data science.
## [1] '3.9.0.2'
In previous DPSA chapters, we introduced many alternative model-based and model-free methods for supervised and unsupervised regression, classification, clustering and forecasting.
In model-based supervised problems, linear regression and correlation analyses play important roles. However, establishing a correlation relation between a pair of features does not necessarily indicate that changes in one of the variables drives the behavior of the other, i.e., causation is not necessarily guaranteed by establishing correlation.
Let’s look at two simple examples.
Example 1: In children, shoe-size is heavily correlated with outcomes on math tests, however, an increase of neither variable causes the rise (or fall) of the other. Shoe-size and cognition move synchronously up and down without a direct causal effect of one on the other.
Example 2: Let’s show a simulation of an explicit causal relation that is undetectable by correlation analysis.
n <- 1024
t <- seq(from=1, to=n, by=1)
x1 <- t; x2 <- t;
for (i in 2:n) {
x1[i] <- 0.5*x1[i-1] + rnorm(1, mean = 0, sd = 1)
x2[i] <- 0.4*x1[i-1]*x1[i-1] + rnorm(1, mean = 0, sd = 1)
}
x1_ord <- seq(from=min(x1), to=max(x1), length.out=n)
x2_ord <- x1_ord
x2_ord[1] <- (2/5)*x1_ord[1]*x1_ord[1]
for (i in 2:n) {
x2_ord[i] <- (2/5)*x1_ord[i-1]*x1_ord[i-1]
}
# plot(x1,x2)
fig <- plot_ly(x=~x1, y=~x2, type="scatter",
name = 'data', mode='markers')
fig <- fig %>% add_trace(
x=~x1_ord, y = ~x2_ord, name = 'model',
mode='lines', line = list(color = 'red',
width = 2, dash = 'dash'))
figThe graph does not illustrate a clear \(x_1\) to \(x_2\) relationship - correlational or causal. At the same time, the apparent stochasticity in system hides the simple (quadratic) causal relationship defined by:
\[x_1(n) = \frac{1}{2}x_1(n-1) + \epsilon_1; \ x_1(1)=1, \ 1\leq n \leq 1024,\]
\[x_2(n) = \frac{2}{5}x_1^2(n-1) + \epsilon_2; \ \epsilon_1,\epsilon_2 \sim N(0,1).\]
Using any (multivariate) observational data, we will showcase how to use computational statistics, data science, information theory, and causal inference to uncover such hidden causal relationships.
The study of causality has been shaped by several foundational frameworks. Granger and Wiener developed a predictive notion of causality for time series, while Pearl developed an interventional framework based on structural models and graphs. A process \(X\) is (Granger-)causally related to another process \(Y\), when predicting prospective \(Y\) realizations solely based on past \(Y\) records can be improved based on the past information from the first process \(X\) along with past knowledge about \(Y\).
It is important to distinguish the two major paradigms of causal reasoning explored in modern statistics:
Predictive (Granger-Wiener) Causality: \(X\) causes \(Y\) if past values of \(X\) help predict future values of \(Y\) above and beyond past values of \(Y\) itself. This is an observational criterion — it relies on statistical associations in the data and tests whether one time series carries incremental information about another. It does not require a mechanistic model of how \(X\) influences \(Y\).
Interventional (Pearl) Causality: Judea Pearl’s framework is built on Structural Causal Models (SCMs), Directed Acyclic Graphs (DAGs), and the do-calculus. Here, \(X\) causes \(Y\) if intervening on \(X\) (setting \(X = x\) by external manipulation) changes the distribution of \(Y\). This is captured by the do-operator: \(P(Y \mid do(X = x))\), which represents the distribution of \(Y\) when \(X\) is forcibly set to \(x\). This is fundamentally different from the observational conditional \(P(Y \mid X = x)\), which may be distorted by confounders.
The critical distinction is that Granger causality asks “Does \(X\) help predict \(Y\)?”, whereas Pearl’s framework asks “What happens to \(Y\) if we manipulate \(X\)?” These are different questions, and they can yield different answers — especially when unmeasured confounders are present. In this module, we focus primarily on Granger and transfer-entropy methods, but students should be aware that these detect predictive causal relationships, not interventional ones.
Granger causality tests whether past values of \(X\) improve the prediction of \(Y\). However, this does not guarantee a direct mechanistic causal link from \(X\) to \(Y\). A critical threat is the unmeasured confounder: if an unobserved variable \(W\) drives both \(X\) and \(Y\), but \(X\) reacts faster than \(Y\), then \(X\) will appear to “Granger-cause” \(Y\) even though there is no direct causal link between them.
For example, suppose a central bank policy change \(W\) simultaneously affects interest rates \(X\) and stock prices \(Y\). If interest rates adjust within a day but stock prices take a week to fully respond, past interest rates will help predict future stock prices — yielding a spurious Granger-causal finding. Transfer entropy, while model-free and sensitive to nonlinear relationships, is equally susceptible to this confounding issue because it too operates on observational data. Only methods that explicitly model the confounders — such as Pearl’s do-calculus with appropriate DAGs — can properly address this problem. When interpreting Granger-causality or transfer entropy results, always consider whether an unmeasured common driver might explain the observed predictive relationship.
Assume we have three time-varying processes indexed by time \(t\geq 0\):
The complete information up to time \(t\) for a random process will be denoted by \(X^t=\cup_{s<t}{ \{X_s\}}\). Also, for time \(t+1\), let’s denote by:
\[e_{t+1}= \underbrace{{Y}_{t+1}}_{actual}- \underbrace{\hat{Y}_{t+1}}_{predicted},\]
\[\underbrace{R(Y^{t+1}| Y^t,Z^t)}_{given\ only\ Y^t} =E(g(e_{t+1}))=E(g({Y}_{t+1}-\hat{Y}_{t+1}))=E(g({Y}_{t+1}-f(Y^t,Z^t))),\] \[\underbrace{R(Y^{t+1}| X^t,Y^t,Z^t)}_{given\ both\ X^t\ and\ Y^t} =E(g(e_{t+1}))=E(g({Y}_{t+1}-\hat{Y}_{t+1}))=E(g({Y}_{t+1}-f(X^t,Y^t,Z^t))).\]
As with other model-based techniques, e.g., linear models, the estimate \(\hat{Y}_{t+1}=f(X^t,Y^t,Z^t)\) is computed by estimating (fitting) the function \(f\) that optimizes (minimizes) the expected loss, \(R(Y^{t+1}| Y^t,Z^t)\) or \(R(Y^{t+1}| X^t,Y^t,Z^t)\).
Granger-causality: The process \(X\) is not Granger-causal to the process \(Y\), relative to the side information \(Z\), if and only if: \[\underbrace{R(Y^{t+1}| Y^t,Z^t)}_{independent\ of\ X^t} = R(Y^{t+1}| X^t,Y^t,Z^t).\]
For example, a linear vector-autoregressive model function \(f\) with a number of lags \(l\) may look like:
\[\hat{Y}(t)=\underbrace{f(Y^t)}_{independent\ of\ X^t}=\beta_o + \sum_{s=1}^l{\beta_s Y(t-s)+\epsilon_t}, \ \ \ \ (1)\] \[\tilde{Y}(t)=\underbrace{f(X^t,Y^t)}_{dependent\ on\ X^t}=\tilde{\beta}_o + \sum_{s=1}^l{\tilde{\beta}_s Y(t-s)+ \sum_{s=1}^l{\tilde\gamma}_s X(t-s) + \tilde{\epsilon}_t}. \ \ \ \ (2)\]
Similarly, one can choose alternative models to estimate the expected prediction errors based on deep learning, neural networks, support vector machines, random forest, or any other supervised technique to fit the model.
In each situation, the variable \(X\) does not Granger-cause \(Y\) if and only if the expected prediction errors of \(X\) in the restricted \(\hat{Y}(t)\) and unrestricted \(\tilde{Y}(t)\) models are equal. In other words, the two models are statistically indistinguishable and any differences in prediction are only due to random factors (noise).
One parametric test for assessing a Granger-causality test is the one-way analysis of variance (one-way ANOVA) test to compare the residuals of the restricted and unrestricted models, i.e., compare the statistical differences of the residuals. When performing multiple such tests, e.g., in contrasting multiple \(l\) lags, one needs to adjust the resulting p-values by correcting for multiple hypothesis tests, using false discovery rate (FDR), family-wise error rate (FWER), or Bonferroni correction.
We can also formulate a more general probabilistic definition of causality using conditional probability, \(p(. | .)\). This approach avoids the need to explicitly define a priori prediction function models in advance.
Recall that \(Y_{t+1}\) and \(X^t\) are (statistically) independent given the past information \((X^t,Y^t)\) if and only if \[p(Y_{t+1}|X^t,Y^t, Z^t)=p(Y_{t+1}|Y^t, Z^t).\] In other words, knowing or ignoring past information, \(X^t\), about the process \(X\) does not affect the probability distribution of the future outcome, \(Y_{t+1}\).
Probability-based Granger causality: The process \(X\) does not Granger-cause the process \(Y\), relative to some side information \(Z\), if and only if \(Y_{t+1} \perp X^t\ |\ Y^t,Z^t\). That is, given the past information about \(Y^t\) and \(Z^t\), \(Y_{t+1}\) is independent of (orthogonal to) \(X^t\).
Probabilistic Granger causality does not require an explicit function model coupling the two processes \(X\) and \(Y\). However, it still requires a strategy to compute the conditional dependence of \(Y_{t+1}\) on the other previously observed variables, \(p(Y_{t+1}|X^t,Y^t, Z^t)\).
Now, we will connect Granger causality with information theory.
Reviewing some of the information theory details in the DSPA Information Theory and Statistical Learning Appendix may be useful before proceeding further in this section.
For a bivariate process \(W=(X,Y)\), let’s denote the pair of marginal and the joint distribution functions by \(p_X(x)\), \(p_Y(y)\), and \(p_{X,Y}(x,y)\). In the discrete and continuous cases, the joint (Shannon/Differential) entropy between the univariate processes \(X\) and \(Y\) is defined by:
\[\underbrace{H(X,Y)}_{Shannon\ entropy}=-\sum_{x\in X}{\sum_{y\in Y}{p_{X,Y}(x,y)\log ( p_{X,Y}(x,y)) }},\]
\[\underbrace{h(X,Y)}_{Differential\ entropy}=-\int_{x\in \Omega_X}{\int_{y\in \Omega_Y}{p_{X,Y}(x,y)\log ( p_{X,Y}(x,y)) }}dxdy,\]
The uncertainty of \(Y\) given a specific realization \(X\) is captured by the conditional entropy :
\[H(Y|X) = H(X,Y) - H(X),\]
where the entropy of the single variable \(X\) is:
\[H(X)=-\sum_{x\in X}{p_{X}(x)\log ( p_{X}(x)) }.\]
The Granger causality can be computed using transfer entropy, which detects directional and dynamical information without assuming any particular function modeling process interactions.
Transfer entropy quantifies the amount of directed information transfer between the random processes \(X\) and \(Y\). As a non-parametric statistic, the transfer entropy measures the reduction of uncertainty in prospective values of \(Y\) given the past values of \(X\) and \(Y\).
The transfer entropy is the following difference of conditional entropies:
\[T_{X\rightarrow Y\ |\ Z}=H\left (\underbrace{Y_{t+1}}_{future}\ |\ \underbrace{Y^t, Z^t}_{past}\right ) - H\left (\underbrace{Y_{t+1}}_{future}\ |\ \underbrace{X^t, Y^t, Z^t}_{past}\right ),\] \[T_{X\rightarrow Y\ |\ Z}=\left (H(Y^t, X^t) - H(Y_{t+1},Y^t, X^t)\right )- \left (H(Y_{t+1}) - H(Y_{t+1},Y^t)\right ).\]
where \(H(X)\) is Shannon entropy of \(X\). Transfer entropy can also be considered as conditional mutual information:
\[T_{X\rightarrow Y\ |\ Z}=I(Y_{t+1}\ ;\ X^t\ |\ Y^t,Z^t),\]
where
\[I(Y_{t+1}\ ;\ X^t\ |\ Y^t,Z^t)=\] \[\sum_{(y^t,z^t)\in \Omega_Y^t\times \Omega_Z^t}{p_{(Y^t,Z^t)}(y^t,z^t)\sum_{x^t\in \Omega_X^t}{\sum_{y_{t+1}\in \Omega_Y^{t+1}}{ \left (p_{(Y_{t+1},X^t|Y^t,Z^t)}(y_{t+1},x^t|y^t,z^t) \times \log\left ( \frac{p_{(Y_{t+1},X^t| Y^t,Z^t)}(y_{t+1},x^t|y^t,z^t)} {p_{(Y_{t+1}| Y^t,Z^t)}(y_{t+1}|y^t,z^t)\times p_{(X^t| Y^t,Z^t)}(x^t|y^t,z^t)} \right ) \right ) }}}=\]
\[\sum_{(y^t,z^t)\in \Omega_Y^t\times \Omega_Z^t}{\sum_{x^t\in \Omega_X^t}{\sum_{y_{t+1}\in \Omega_Y^{t+1}}{ \left (p_{(Y_{t+1},X^t,Y^t,Z^t)}(y_{t+1},x^t,y^t,z^t) \times \log\left ( \frac{p_{(Y^t,Z^t)}(y^t,z^t)\times p_{(Y_{t+1},X^t,Y^t,Z^t)}(y_{t+1},x^t,y^t,z^t)} {p_{(Y_{t+1}, Y^t,Z^t)}(y_{t+1},y^t,z^t)\times p_{(X^t,Y^t,Z^t)}(x^t,y^t,z^t)} \right ) \right ) }}}.\]
In terms of transfer entropy, \(X\) does not cause \(Y\), relative to side information \(Z\), if and only if \(T_{X\rightarrow Y, Z^t}=0.\), i.e.,
\[H(Y_{t+1}\ |\ Y^t, Z^t)=H(Y_{t+1}\ |\ X^t,Y^t, Z^t).\]
As the transfer entropy is not a symmetric function, \(T_{X\rightarrow Y\ |\ Z}\not=T_{Y\rightarrow X\ |\ Z}\), it allows us to infer directionality of the information flow, i.e., causality. More specifically, we can define the net information flow measure, \(F_{X\rightarrow Y\ |\ Z}\), as:
\[F_{X\rightarrow Y\ |\ Z}=T_{X\rightarrow Y\ |\ Z}-T_{Y\rightarrow X\ |\ Z}.\]
The net information flow measure, \(F_{X\rightarrow Y\ |\ Z}\), quantifies the dominant direction of the information flow; positive values indicate a dominant information flow from \(X\longrightarrow Y\), rather than the opposite direction, and similarly, negative values indicate a reversed dominant information flow from \(Y\longrightarrow X\). Hence, it suggests which process yields more predictive information about the other.
In the special case of exploring vector auto-regressive processes, such as the ones we showed in equations (1) and (2) above, the transfer entropy reduces to Granger causality, see this reference. Using the classical \(g=||.||_{2}\) norm as the loss function and based on the linear vector-autoregressive model function \(f\), see equations (1) and (2), Granger-causality of a bivariate process can be expressed as:
\[GrangerCausality_{\{X\rightarrow Y\}}=\log \left ( \frac{Var(\epsilon_t)}{Var(\hat{\epsilon}_t)}\right ).\]
This explicit connection between the transfer entropy and the linear Granger-causality facilitates the estimation of the transfer entropy, \(T_{X\rightarrow Y\ |\ Z}\), and the net information flow metric, \(F_{X\rightarrow Y\ |\ Z}\).
Let’s look at a pair of linear and non-linear synthetic data examples.
set.seed(1234)
# initialization (Gaussian noise)
n <- 10000
x1 <- x2 <- x3 <- x4 <- x5 <- x6 <- rep(0, n + 1)
# instantiation (linear relationships)
for (i in 2:(n + 1)) {
x1[i] <- 0.8 * sqrt(3)* x1[i-1] - 0.8*x1[i-1] + rnorm(1, mean=0, sd=1)
x2[i] <- 0.5*x1[i-1] + rnorm(1, mean=0, sd=1)
x3[i] <- -0.3*x1[i-1] + rnorm(1, mean=0, sd=1)
x4[i] <- -0.6*x1[i-1] + 0.3*sqrt(3)*x4[i-1] + 0.2*sqrt(3)*x5[i-1] + rnorm(1, mean=0, sd=1)
x5[i] <- -0.2*sqrt(3)*x4[i-1] + 0.2*sqrt(3)*x5[i-1] + rnorm(1, mean=0, sd=1)
x6[i] <- 0.3*sqrt(5)*x3[i-1] + 0.4*sqrt(3)*x4[i-1] + rnorm(1, mean=0, sd=1)
}
x1 <- x1[-1]; x2 <- x2[-1]; x3 <- x3[-1]
x4 <- x4[-1]; x5 <- x5[-1]; x6 <- x6[-1]
lin.system <- data.frame(x1, x2, x3, x4, x5, x6)
# Ground-truth causal graph (known from the simulation equations)
library(igraph)
g_ground_truth <- graph(c(1,1, 1,2, 1,3, 1,4, 4,4, 5,4, 4,5, 5,5, 3,6, 4,6), n=6)
V(g_ground_truth)$label <- c("x1","x2","x3","x4","x5","x6")
plot(g_ground_truth, main="Ground-Truth Causal Graph (Linear Simulation)",
vertex.label.color="black", vertex.size=15, edge.arrow.size=0.5)According to the definition above, \(linearGC_{\{X\rightarrow Y\}}=\log \left (
\frac{Var(\epsilon_t)}{Var(\hat{\epsilon}_t)}\right )\), we
define a function, linearGC(), to compute the linear
Granger-causality for a bivariate process \((X,Y)\).
linearGC <- function(X, Y){
n<-length(X)
X.past <- X[1:(n-1)]
Y.past <- Y[1:(n-1)]
Y.future <- Y[2:n]
regression.uni <- lm(Y.future ~ Y.past)
regression.mult <- lm(Y.future ~ Y.past + X.past)
var.eps.uni <- (summary(regression.uni)$sigma)^2
var.eps.mult <- (summary(regression.mult)$sigma)^2
linGC <- log(var.eps.uni/var.eps.mult)
return(linGC)
}Granger causality hinges on the idea that including past values of \(X\) reduces the prediction error for \(Y\). Let’s make this tangible by comparing the residuals of the restricted model (using only \(Y\)’s own past) with those of the unrestricted model (using both \(X\)’s and \(Y\)’s pasts) for a pair where we know a causal link exists: \(x_1 \rightarrow x_2\).
library(plotly)
# Extract the known causal pair: x1 -> x2
X <- lin.system$x1
Y <- lin.system$x2
n_len <- length(X)
X.past <- X[1:(n_len-1)]
Y.past <- Y[1:(n_len-1)]
Y.future <- Y[2:n_len]
# Fit restricted (Y past only) and unrestricted (X + Y past) models
reg_restricted <- lm(Y.future ~ Y.past)
reg_unrestricted <- lm(Y.future ~ Y.past + X.past)
resid_restricted <- residuals(reg_restricted)
resid_unrestricted <- residuals(reg_unrestricted)
# Show a subset of residuals for visual clarity (first 500 time steps)
idx <- 1:500
fig_resid <- plot_ly() %>%
add_lines(x = idx, y = resid_restricted[idx], name = "Restricted (Y past only)",
line = list(color = 'red', width = 1)) %>%
add_lines(x = idx, y = resid_unrestricted[idx], name = "Unrestricted (X+Y past)",
line = list(color = 'blue', width = 1)) %>%
layout(title = "Residual Comparison: x1 → x2 (first 500 steps)<br>Red = Restricted, Blue = Unrestricted",
xaxis = list(title = "Time"), yaxis = list(title = "Residual"))
fig_resid# Compare residual distributions
var_restricted <- var(resid_restricted)
var_unrestricted <- var(resid_unrestricted)
cat("Restricted model residual variance: ", round(var_restricted, 4), "\n")## Restricted model residual variance: 1.3333
## Unrestricted model residual variance: 1.0023
cat("Variance reduction ratio (restricted/unrestricted):",
round(var_restricted / var_unrestricted, 4), "\n")## Variance reduction ratio (restricted/unrestricted): 1.3302
## Linear GC (log ratio): 0.2853
# Overlay histograms of the residual distributions
fig_hist <- plot_ly() %>%
add_histogram(x = resid_restricted, name = "Restricted",
marker = list(color = 'rgba(255,0,0,0.4)'),
nbinsx = 80) %>%
add_histogram(x = resid_unrestricted, name = "Unrestricted",
marker = list(color = 'rgba(0,0,255,0.4)'),
nbinsx = 80) %>%
layout(title = "Residual Distributions: x1 → x2",
xaxis = list(title = "Residual Value"),
yaxis = list(title = "Count"),
barmode = "overlay")
fig_hist