Symmetry as Intervention

An Analysis of Causal Effect Estimation using
Outcome Invariant Data Augmentation

arXiv:2510.25128

Uzair Akbar

Georgia Tech

Niki Kilbertus

TU Munich

Hao Shen

TU Munich

Krikamol Muandet

CISPA

Bo Dai

Georgia Tech

November 16, 2025

Motivation

Correlation vs. causation

xkcd.com/925 by Randall Munroe.

  • Can we recover causal effects from observational data?
  • Yes—but only with untestable assumptions and domain knowledge!

This work




We try to answer the fundamental question:

Can knowledge of symmetries in data generation—often used implicitly in certain regularizers—be repurposed to
improve causal effect estimation given
only observational \((X, Y)\) data?

Statistical vs. Causal Estimation

Empirical risk minimization (ERM)

For treatment \(X\) and outcome \(Y\) samples generated as \[ Y = \rfunc{X} + \xi , \qquad \E{\xi} = 0 , \]

statistical inference entails recovering the optimal predictor \(\E{ Y \mid X = \vx }\) by minimizing the risk

\[ R_{\ERM}( \bh ) := \E{ \sqNorm{ Y - \bhyp{X} } } , \] over hypotheses \(\bh\in\H\), for a rich enough class \(\H\).

For uncorrelated \(X\) and noise \(\xi\), the ERM minimizer
\(\bh_{\ERM}(\vx)\) coincides with the true causal function \(\rfunc{\vx}\).
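A minimal numerical sketch of this claim (a toy linear setting of my own, not code from the paper): when the treatment is independent of the noise, the ERM minimizer over linear hypotheses is ordinary least squares and recovers the causal coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50_000, 5

f_star = rng.normal(size=m)        # true causal function r(x) = f_star @ x
X = rng.normal(size=(n, m))        # treatment, here independent of the noise
xi = rng.normal(size=n)            # zero-mean noise
Y = X @ f_star + xi

# ERM over linear hypotheses = ordinary least squares
h_erm, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.linalg.norm(h_erm - f_star))   # ≈ 0: unbiased when X and xi are uncorrelated
```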

Data augmentation

For a finite number \(n\) of samples \(\D := \{ (\vx_i, \vy_i) \}_{i=1}^n\), regularization
techniques are used to mitigate estimation variance.


E.g., data augmentation (DA) achieves this via multiple random augmentations \((\gG \vx_i, \vy_i)\) per sample in the risk


\[ R_{\DA+\ERM}( \bh ) := \E{ \sqNorm{ Y - \bhyp{ \gG X } } } , \quad \gG\sim \P_{ \gG } . \]
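In practice this risk is usually estimated by pairing each sample with several randomly transformed inputs while keeping the label fixed. Below is a minimal sketch of such an estimator; `augment` is a hypothetical callable supplying the random transformations \(\gG\), not an interface from the paper.

```python
import numpy as np

def da_erm_risk(h, X, Y, augment, num_aug=8, rng=None):
    """Monte Carlo estimate of E || Y - h(G X) ||^2: the label Y is kept,
    while the input is replaced by a random augmentation G X on every pass."""
    if rng is None:
        rng = np.random.default_rng()
    losses = []
    for _ in range(num_aug):
        X_g = augment(X, rng)              # e.g. random rotations / null-space shifts
        losses.append(np.mean((Y - h(X_g)) ** 2))
    return float(np.mean(losses))
```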

Confounding bias and spurious correlation

Problem with ERM: Generally, \(X\) and \(\xi\) are correlated.

This makes the ERM minimizer a biased estimator of \(\rf\): \[ \begin{align*} Y &= \rfunc{X} + \xi ,\\ \nonumber \Rightarrow \underbrace{\E{Y \mid X = \vx} }_{\text{ERM minimizer}} &= \rfunc{\vx} + \underbrace{\E{ \xi \mid X = \vx}}_{\text{confounding bias $\neq 0$}} . \end{align*} \]

This spurious correlation between \(X\) and \(\xi\) arises from their unobserved common parents, called confounders.
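To make the bias concrete, here is a hedged toy construction (mine, not the paper's): a shared confounder \(U\) drives both the treatment and the noise, so ordinary least squares on the observational data is pulled away from the causal coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50_000, 5
f_star = rng.normal(size=m)

U = rng.normal(size=n)                                   # unobserved confounder
X = rng.normal(size=(n, m)) + np.outer(U, np.ones(m))    # treatment depends on U
xi = 2.0 * U + rng.normal(size=n)                        # noise depends on U too
Y = X @ f_star + xi                                      # => E[xi | X] != 0

h_erm, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.linalg.norm(h_erm - f_star))                    # clearly > 0: confounding bias
```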

Intervention for causal estimation

Removing the correlation between \(X\) and \(\xi\) requires an intervention: explicitly assigning \(X\) some independently sampled \(\Xtilde\) during data generation, a.k.a. a randomized controlled trial: \[ Y = \rfunc{\Xtilde} + \xi \]

Now, doing ERM on samples of \((Y, \Xtilde)\) recovers \(\rf\).
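A hedged continuation of the same toy construction: re-assigning the treatment independently of the confounder before generating the outcome removes the bias, and least squares on the intervened data is consistent again.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50_000, 5
f_star = rng.normal(size=m)
U = rng.normal(size=n)
xi = 2.0 * U + rng.normal(size=n)              # confounded noise, as before

# Intervention / RCT: treatment assigned independently of U (and hence of xi).
X_tilde = rng.normal(size=(n, m))
Y = X_tilde @ f_star + xi

h_rct, *_ = np.linalg.lstsq(X_tilde, Y, rcond=None)
print(np.linalg.norm(h_rct - f_star))          # ≈ 0: confounding bias removed
```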

Problem: Often not possible to intervene on real systems.
We only have access to pre-collected observational data.

Instrumental variables (IVs)

To work around the need for interventions, use an instrument \(\gZ\) satisfying

  1. treatment relevance \(\gZ \nindep X\),
  2. exclusion \(\gZ\indep Y \mid X\),
  3. unconfoundedness \(\gZ \indep \xi\),
  4. outcome relevance \(Y \nindep \gZ\).

Conditioning the model on \(\gZ\) gives us \(\E{ Y \mid \gZ } = \E{ \rfunc{X} \mid \gZ }\),
which can be solved for \(\rf\) by minimizing the risk \[ R_{\IV}( \bh ) := \E{ \sqNorm{ Y - \E{ \bhyp{X} \mid \gZ } } } . \]
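In the linear case this risk can be minimized by two-stage least squares. Here is a hedged sketch with a hypothetical instrument \(Z\) that shifts the treatment but is independent of the confounder and noise; the construction is mine, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50_000, 5
f_star = rng.normal(size=m)

U = rng.normal(size=n)                               # confounder
Z = rng.normal(size=(n, m))                          # instrument: affects X, not xi
X = Z + np.outer(U, np.ones(m)) + 0.5 * rng.normal(size=(n, m))
xi = 2.0 * U + rng.normal(size=n)
Y = X @ f_star + xi

# Two-stage least squares: project X onto Z, then regress Y on the projection.
B, *_ = np.linalg.lstsq(Z, X, rcond=None)            # stage 1: E[X | Z] ≈ Z @ B
X_hat = Z @ B
h_iv, *_ = np.linalg.lstsq(X_hat, Y, rcond=None)     # stage 2
print(np.linalg.norm(h_iv - f_star))                 # ≈ 0: the instrument removes the bias
```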

Problem: Instruments are scarce in most application domains.

Causal Estimation with Data Augmentation

Data augmentation = model symmetries

We restrict ourselves to DA transformations with respect to which \(\rf\) is invariant. Specifically, \(\gG\) takes values in \(\G\) such that \(\rf\) is \(\G\)-invariant: \[ \rfunc{\vx} = \rfunc{\vg \vx}, \qquad \forall \;\; (\vx, \vg)\in \X\times\G . \]

Of course, constructing such DA requires knowledge of symmetries of \(\rf\). E.g., when classifying images \(\vx\in\X\) of cats vs. dogs, the true labeling function would certainly be invariant to random image rotations \(\gG\vx\).
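A tiny illustration of what outcome invariance means (a toy example of mine, not from the paper): a norm-based target function is invariant under orthogonal transformations, so such augmentations can never change the label.

```python
import numpy as np

rng = np.random.default_rng(0)

def r(x):
    return np.linalg.norm(x)          # toy "causal" function, rotation-invariant

# A random orthogonal transformation g as an outcome-invariant augmentation.
A = rng.normal(size=(3, 3))
g, _ = np.linalg.qr(A)                # QR factorization yields an orthogonal matrix

x = rng.normal(size=3)
print(np.isclose(r(x), r(g @ x)))     # True: r(g x) = r(x) for all x
```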

Data augmentation = soft intervention

Key insight: When \(\rf\) is \(\G\)-invariant, \((Y, \gG X)\) follows the data generation: \[ Y = \rfunc{ \gG X } + \xi . \]

Therefore, DA is equivalent to a soft intervention on the treatment \(X\).

\(\Rightarrow\) DA+ERM dominates vanilla ERM on causal estimation error (CER):

\[ \CER(\bh) := \E{ \sqNorm{ \rfunc{X} - \bhyp{X} } } , \qquad \boxed{ \CER(\bh_{\DA+\ERM}) \leq \CER(\bh_{\ERM}) . } \]

  • Strictly better when DA perturbs spurious features correlated with \(\xi\).
  • But otherwise performs no worse than ERM (see the numerical sketch below).
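A hedged end-to-end sketch of this comparison (a toy linear setting in the spirit of the paper's simulation; the parameters, variable names, and null-space construction are my own choices): additive augmentations drawn from the null space of the causal coefficients leave \(\rf\) unchanged, and averaging the loss over them shrinks the spurious, confounded component of the fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k, gamma = 20_000, 5, 4, 10.0

f = rng.normal(size=m)                         # true causal coefficients
f_unit = f / np.linalg.norm(f)

# Spurious direction v, orthogonal to f (so it never affects r(x) = f @ x).
v = rng.normal(size=m)
v -= (v @ f_unit) * f_unit
v /= np.linalg.norm(v)

U = rng.normal(size=n)                         # confounder
X = rng.normal(size=(n, m)) + np.outer(U, v)   # confounder enters along v
xi = 2.0 * U + 0.1 * rng.normal(size=n)
Y = X @ f + xi

def cer(h):
    """Causal estimation error E || r(X) - h(X) ||^2 on the observational X."""
    return np.mean((X @ (h - f)) ** 2)

# Plain ERM (OLS): biased along the spurious direction v.
h_erm, *_ = np.linalg.lstsq(X, Y, rcond=None)

# DA+ERM: k outcome-invariant augmentations per sample, drawn from null(f).
X_aug, Y_aug = [], []
for _ in range(k):
    G = rng.normal(size=(n, m))
    G -= np.outer(G @ f_unit, f_unit)          # project onto null(f): f @ g_i = 0
    X_aug.append(X + gamma * G)
    Y_aug.append(Y)
X_aug, Y_aug = np.vstack(X_aug), np.concatenate(Y_aug)
h_da, *_ = np.linalg.lstsq(X_aug, Y_aug, rcond=None)

print(f"CER(ERM)    = {cer(h_erm):.4f}")
print(f"CER(DA+ERM) = {cer(h_da):.4f}")        # noticeably smaller
```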

Data augmentation = relaxed IVs

Key insight: DA parameters \(\gG\) are IV-like (IVL): they satisfy IV properties (i)–(iii) by design, though not necessarily outcome relevance (iv).


Such a relaxation renders IV regression ill-posed, so we suggest IVL regression:

\[ R_{\IVL}(\bh) := R_{\IV}(\bh) + \underbrace{\boxed{\alpha \cdot R_{\ERM}(\bh)}}_{\text{ERM regularizer for the ill-posed IV reg.}} \]

\(\Rightarrow\) The composition DA+IVL simulates a worst-case/adversarial DA.

\(\Rightarrow\) DA+IVL dominates DA+ERM, and is strictly better iff spurious features are perturbed: \[ \boxed{\CER(\bh_{\DA+\IVL}) \leq \CER(\bh_{\DA+\ERM}).} \]
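For linear hypotheses, one finite-sample rendering of this objective (my own sketch, not the authors' implementation; `ivl_solve` is a hypothetical name) estimates the conditional expectation \(\E{\bhyp{\gG X} \mid \gG}\) by regressing the augmented design on the DA parameters, then solves the combined least-squares problem.

```python
import numpy as np

def ivl_solve(X_aug, Y_aug, G, alpha=1.0):
    """Closed-form IVL regression for linear hypotheses: a sketch of
    R_IVL = R_IV + alpha * R_ERM with the DA parameters G as the IV-like variable.

    X_aug : (N, m) augmented treatments (e.g. rows X_i + gamma * G_i)
    Y_aug : (N,)   outcomes, repeated alongside their augmented copies
    G     : (N, d) the DA parameters used for each augmented row
    """
    # Stage-1 style projection: estimate E[X_aug | G] by linear regression on G
    # (with an intercept), so that E[h(G X) | G] ≈ X_proj @ h for linear h.
    Z = np.column_stack([np.ones(len(G)), G])
    B, *_ = np.linalg.lstsq(Z, X_aug, rcond=None)
    X_proj = Z @ B

    # Minimizing ||X_proj h - Y||^2 + alpha * ||X_aug h - Y||^2 is a single
    # stacked least-squares problem in h.
    A = np.vstack([X_proj, np.sqrt(alpha) * X_aug])
    b = np.concatenate([Y_aug, np.sqrt(alpha) * Y_aug])
    h_ivl, *_ = np.linalg.lstsq(A, b, rcond=None)
    return h_ivl
```

Fed the augmented data from the DA+ERM sketch above (with the stacked null-space perturbations supplied as `G` and a moderate `alpha`), this solution should shrink spurious components at least as aggressively as DA+ERM, in line with the boxed inequality.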

Data augmentation = causal regularization

Causal regularization: methods that aim to improve causal estimation of \(\rf\) even when full identification is not possible.

But why bother?

  • “No-regret” improvement: Under our symmetry-based DA construction, DA dominates on causal estimation \(\Rightarrow\) sometimes better, never worse.
  • Robust prediction: Mitigating confounding bias reduces spurious correlations \(\Rightarrow\) predictors generalize better to distribution shifts.

\(\Rightarrow\) Causal regularization is a principled approach for downstream tasks such as

    • out-of-distribution (OOD) generalization
    • domain generalization

Experiments

Simulation ablations

Simulation experiment with a linear, centered Gaussian model with \(\rbf\in\R^m\), confounding strength \(\kappa > 0\), and DA strength \(\gamma > 0\), s.t.
\[ \gG X := X + \gamma\cdot \gG , \qquad \gG\in\operatorname{null}(\rbf) . \]

Normalized CER (nCER) \(=0\) for true \(\rf\) and \(1\) for pure confounding.

Baseline comparison

Comparison with select causal regularization methods and common domain generalization baselines. All methods are provided only \((X, Y)\) data along with DA transformations \(\gG\): Gaussian noise for the optical device dataset, and hue, saturation, contrast, and translation perturbations for colored-MNIST.

Conclusion

We provide a unifying framework connecting symmetry transformations and causal interventions, allowing us to repurpose the ubiquitous statistical regularization tool of data augmentation for causal regularization.

Thank You

in/uzair25

@uzairakbar25

arXiv:2510.25128

Project Page