UAI 2026

Not Just How Much, But Where

Decomposing Epistemic Uncertainty into Per-Class Contributions

Mutual information says how uncertain a model is. The per-class vector $\mathbf{C}(x)$ reveals which classes drive that uncertainty.

Mame Diarra Touré and David A. Stephens
Department of Mathematics and Statistics, McGill University

34.7% lower selective risk than MI on diabetic retinopathy
56.2% lower selective risk than the per-class variance baseline
2 / 2 OoD benchmarks where $\sum_k C_k$ achieves the highest AUROC

The central problem

A single uncertainty score can hide the failure mode that matters most.

In asymmetric classification, ignorance involving a benign class is not equivalent to ignorance involving a safety-critical class. Scalar MI collapses both cases into the same number.

Scalar summary: MI = 0.024

Catastrophic miss: Grade 3 predicted as Grade 0

versus

Nearly identical scalar summary: MI = 0.027

Severity underestimate: Grade 3 predicted as Grade 2

Their scalar MI is nearly identical, but their per-class epistemic signatures are not.

01 / Method

Decompose MI into class-level contributions.

From $S$ stochastic predictions, estimate each class mean $\mu_k$ and variance $\operatorname{Var}[p_k]$. A second-order Taylor expansion of entropy gives a simple, additive decomposition.

Per-class epistemic contribution
$$C_k(x) = \frac{1}{2}\frac{\operatorname{Var}[p_k](x)}{\mu_k(x)}$$
Additive connection
$$\sum_{k=1}^{K} C_k(x) \approx \operatorname{MI}(y;\boldsymbol{\omega}\mid x)$$

1. Sample predictions: $S$ stochastic forward passes produce probability vectors on the simplex.

2. Estimate moments: compute the class-wise mean, variance, covariance, and third central moment.

3. Attribute uncertainty: each $C_k$ identifies the share of epistemic uncertainty associated with class $k$.
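
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `per_class_contributions`, the `eps` guard, and the Dirichlet toy posterior are assumptions introduced here, and MI is taken in nats as the entropy of the mean prediction minus the mean entropy of the sampled predictions.

```python
import numpy as np

def per_class_contributions(probs, eps=1e-12):
    """probs: array of shape (S, K); each row is one sampled probability vector.

    Returns (C, mi): C has shape (K,), mi is the sampled MI estimate in nats.
    """
    probs = np.asarray(probs, dtype=float)
    mu = probs.mean(axis=0)   # class-wise mean over the S samples
    var = probs.var(axis=0)   # class-wise (population) variance over the S samples
    C = 0.5 * var / np.maximum(mu, eps)

    def entropy(p):
        # Shannon entropy along the class axis, guarded against log(0)
        return -np.sum(p * np.log(np.maximum(p, eps)), axis=-1)

    mi = entropy(mu) - entropy(probs).mean()
    return C, mi

# Hypothetical posterior: S = 2000 Dirichlet draws over K = 3 classes
rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=[40.0, 8.0, 2.0], size=2000)
C, mi = per_class_contributions(probs)
print("C_k:", C, "sum:", C.sum(), "MI:", mi)
```

For a toy posterior like this one, the sum of the $C_k$ should land close to the sampled MI; the gap is expected to grow as low-probability classes become more skewed.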

Why the normalization matters

Raw variance is suppressed exactly where rare classes live.

On the probability simplex, $\operatorname{Var}[p_k]\leq\mu_k(1-\mu_k)$. As $\mu_k\rightarrow0$, raw variance must vanish even when posterior disagreement remains important.

For a rare class ($\mu_k \rightarrow 0$):
Raw variance: upper bound $\rightarrow 0$
$C_k$: upper bound $\rightarrow \tfrac{1}{2}$

The $1/\mu_k$ term is not an ad hoc correction.

It arises from the entropy Hessian. A given amount of probability variance carries more information-theoretic weight for a low-probability class.

This makes $C_k$ comparable across rare and common classes, while preserving the additive connection to MI.
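
One way to see where the $1/\mu_k$ factor comes from is to expand $h(p) = -p\log p$ around $\mu_k$. The sketch below assumes MI is measured in nats and written as the entropy of the mean prediction minus the mean entropy; it is a reconstruction consistent with the formulas above, not a quotation from the paper. The first-order term vanishes because $\mathbb{E}[p_k - \mu_k] = 0$, and the second derivative supplies the weight:

$$h''(p) = -\frac{1}{p}, \qquad \mathbb{E}[h(p_k)] \approx h(\mu_k) + \tfrac{1}{2}h''(\mu_k)\operatorname{Var}[p_k] = h(\mu_k) - \frac{\operatorname{Var}[p_k]}{2\mu_k}$$

$$\operatorname{MI}(y;\boldsymbol{\omega}\mid x) = \sum_{k=1}^{K}\big(h(\mu_k) - \mathbb{E}[h(p_k)]\big) \approx \sum_{k=1}^{K}\frac{1}{2}\frac{\operatorname{Var}[p_k]}{\mu_k} = \sum_{k=1}^{K} C_k(x)$$

Combining the definition of $C_k$ with the simplex bound $\operatorname{Var}[p_k]\leq\mu_k(1-\mu_k)$ gives the limit quoted for rare classes:

$$C_k(x) \leq \frac{1}{2}\frac{\mu_k(1-\mu_k)}{\mu_k} = \frac{1-\mu_k}{2} \rightarrow \frac{1}{2} \quad \text{as } \mu_k \rightarrow 0$$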

02 / Reliability

The method says when its approximation should be trusted.

Boundary correction improves sensitivity to rare classes, but also makes the Taylor approximation more sensitive to higher-order skewness. The paper pairs $C_k$ with a diagnostic.

Skewness diagnostic
$$\rho_k(x)=\frac{|m_{3,k}|}{3\mu_k\operatorname{Var}[p_k]}$$
$\rho_k \ll 1$: quadratic approximation is reliable
$\rho_k > 0.3$: interpret $C_k$ with caution
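
The diagnostic can be computed from the same $S$ samples used for $C_k$. The sketch below is illustrative only: the function name `skewness_diagnostic`, the `eps` guard, and the two-class toy samples are assumptions introduced here, and $m_{3,k}$ is taken to be the third central moment of $p_k$ across samples.

```python
import numpy as np

def skewness_diagnostic(probs, eps=1e-12):
    """probs: (S, K) sampled probability vectors. Returns rho of shape (K,)."""
    probs = np.asarray(probs, dtype=float)
    mu = probs.mean(axis=0)
    centered = probs - mu
    var = (centered ** 2).mean(axis=0)   # second central moment
    m3 = (centered ** 3).mean(axis=0)    # third central moment
    return np.abs(m3) / np.maximum(3.0 * mu * var, eps)

# Hypothetical two-class posterior: the rare class sits near 0.01 in nine
# samples and spikes to 0.51 in one, so its mean is 0.06.
p_rare = np.array([0.01] * 9 + [0.51])
probs = np.stack([1.0 - p_rare, p_rare], axis=1)

rho = skewness_diagnostic(probs)
flagged = rho > 0.3   # threshold quoted above
print("rho_k:", rho, "flagged:", flagged)
```

Both classes have the same variance in this toy case, but only the rare class exceeds the 0.3 threshold, because the $\mu_k$ in the denominator is much smaller for it.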
Correlation-aware fallback

CBEC detects active confusion across a safety boundary.

$$\operatorname{CBEC}(x)=\sum_{i\in\mathcal{S}}\sum_{j\in\mathcal{C}} \sqrt{C_iC_j}\max(0,-\rho_{ij})$$

Use $C_{\text{crit\_max}}$ when critical-class $C_k$ values are reliable. CBEC is the robust alternative when skewness degrades the Taylor approximation.
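
A rough sketch of the CBEC sum follows. Several things here are assumptions rather than details confirmed by this page: $\rho_{ij}$ is read as the Pearson correlation between $p_i$ and $p_j$ across the $S$ samples (distinct from the skewness diagnostic $\rho_k$), and the two index sets are passed as plain lists `group_a` and `group_b` standing in for the classes on either side of the safety boundary. The function name `cbec` is hypothetical.

```python
import numpy as np

def cbec(probs, group_a, group_b, eps=1e-12):
    """probs: (S, K) sampled probability vectors.

    group_a, group_b: class indices on either side of the safety boundary
    (assumed stand-ins for the two index sets in the CBEC formula).
    """
    probs = np.asarray(probs, dtype=float)
    mu = probs.mean(axis=0)
    cov = np.cov(probs, rowvar=False, bias=True)   # (K, K) population covariance
    var = np.diag(cov)
    C = 0.5 * var / np.maximum(mu, eps)
    std = np.sqrt(np.maximum(var, eps))
    corr = cov / np.outer(std, std)                # assumed meaning of rho_ij

    total = 0.0
    for i in group_a:
        for j in group_b:
            # Only negative correlation (probability mass traded between i and j) counts
            total += np.sqrt(C[i] * C[j]) * max(0.0, -corr[i, j])
    return float(total)

# Hypothetical samples: mass moves between class 0 and class 2, class 1 is fixed
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]] * 5)
across = cbec(probs, group_a=[0], group_b=[2])
inert = cbec(probs, group_a=[0], group_b=[1])
print("trading pair:", across, "inert pair:", inert)
```

In this toy case the pair that trades probability mass receives the full geometric-mean weight, while the pair involving the constant class contributes nothing.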

Axiomatic profile of $\sum_k C_k$
A0: non-negativity
A1: zero iff no disagreement
A3: increases under mean-preserving spread
A2 and A5: inherited trade-offs

The A5 violation is the mechanism that counteracts boundary suppression: sensitivity to $\mu_k$ prevents rare-class contributions from being forced to zero.

03 / Evidence

Class-aware uncertainty changes the decisions a model can support.

The paper evaluates selective prediction, OoD detection, and sensitivity to controlled label noise.

A

Selective prediction / diabetic retinopathy

Defer when uncertainty involves a clinically critical class.

Lowest AUSC: 0.285. $C_{\text{crit\_max}}$ dominates across the full coverage range.
Figure: Selective prediction results on diabetic retinopathy. Critical FNR across coverage levels and bootstrap AUSC distributions. Lower AUSC means safer selective prediction.

| Family | Policy | AUSC ↓ | Critical FNR @80% ↓ |
| --- | --- | --- | --- |
| Scalar | Entropy | 0.604 ± 0.022 | 0.401 ± 0.016 |
| Scalar | MI | 0.436 ± 0.019 | 0.339 ± 0.014 |
| Per-class variance | Sale_EU_crit | 0.650 ± 0.013 | 0.409 ± 0.016 |
| Per-class $C_k$ | $C_{\text{crit\_sum}}$ | 0.327 ± 0.017 | 0.321 ± 0.014 |
| Per-class $C_k$ | $C_{\text{crit\_max}}$ | 0.285 ± 0.016 | 0.302 ± 0.013 |
| Correlation-aware | CBEC | 0.416 ± 0.020 | 0.335 ± 0.014 |

Grade 3 has mean probability near $0.06$. Raw variance is therefore heavily suppressed; the entropy-derived $1/\mu_k$ weighting recovers the critical-class signal.

Across inference methods, the preferred policy follows the diagnostic: direct $C_k$ targeting under reliable posteriors, CBEC when skewness is high.
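
A deferral policy of this kind can be sketched as follows. The helper names (`critical_scores`, `retain_mask`, `critical_fnr`) and the toy arrays are hypothetical, and the definition of critical FNR used here (the fraction of retained critical-class cases predicted as non-critical) is an assumption; the paper's exact evaluation protocol is not given on this page.

```python
import numpy as np

def critical_scores(C, critical):
    """C: (N, K) per-class contributions. Returns (C_crit_max, C_crit_sum), each (N,)."""
    C_crit = np.asarray(C, dtype=float)[:, critical]
    return C_crit.max(axis=1), C_crit.sum(axis=1)

def retain_mask(scores, coverage):
    """Keep the `coverage` fraction of inputs with the lowest score; defer the rest."""
    scores = np.asarray(scores, dtype=float)
    n_keep = int(round(coverage * len(scores)))
    order = np.argsort(scores, kind="stable")
    mask = np.zeros(len(scores), dtype=bool)
    mask[order[:n_keep]] = True
    return mask

def critical_fnr(y_true, y_pred, critical, mask):
    """Assumed definition: share of retained critical cases predicted as non-critical."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    retained_crit = np.isin(y_true, critical) & mask
    if not retained_crit.any():
        return 0.0
    missed = retained_crit & ~np.isin(y_pred, critical)
    return float(missed.sum() / retained_crit.sum())

# Hypothetical batch: N = 5 inputs, K = 3 classes, class 2 is critical
C = np.array([[0.01, 0.01, 0.00],
              [0.02, 0.00, 0.30],
              [0.00, 0.01, 0.02],
              [0.05, 0.05, 0.01],
              [0.00, 0.00, 0.25]])
y_true = np.array([0, 2, 2, 1, 2])
y_pred = np.array([0, 0, 2, 1, 0])

crit_max, crit_sum = critical_scores(C, critical=[2])
keep = retain_mask(crit_max, coverage=0.6)
fnr_all = critical_fnr(y_true, y_pred, [2], np.ones(5, dtype=bool))
fnr_kept = critical_fnr(y_true, y_pred, [2], keep)
print("FNR without deferral:", fnr_all, "FNR at 60% coverage:", fnr_kept)
```

In this toy batch the two missed critical cases carry the highest critical-class contribution, so deferring the top 40% by $C_{\text{crit\_max}}$ removes both of them from the retained set.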

B

Interpretability / error signatures

Nearly identical MI can conceal different confusion pathways.

Figure: Per-class epistemic signatures for diabetic retinopathy errors. The per-class profiles distinguish a catastrophic miss from a severity underestimate, despite nearly identical scalar MI.

Cross-boundary epistemic confusion is 2.7× stronger than within-group confusion.

C

Out-of-distribution detection

The per-class view shows where distributional shift concentrates.

Best AUROC on both tasks. All-class aggregation captures shifts whose locus is not known in advance.
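
| Method | FashionMNIST → KMNIST: AUROC ↑ | FashionMNIST → KMNIST: OoD / ID ratio | MIMIC-III ICU → Newborn: AUROC ↑ | MIMIC-III ICU → Newborn: OoD / ID ratio |
| --- | --- | --- | --- | --- |
| Neg. MSP | 0.665 ± 0.013 | 2.07 | 0.688 ± 0.030 | 1.24 |
| MI | 0.724 ± 0.009 | 5.92 | 0.802 ± 0.004 | 1.61 |
| $EU_{\text{var}}$ | 0.710 ± 0.010 | 5.92 | 0.778 ± 0.015 | 1.71 |
| $\sum_k C_k$ | 0.735 ± 0.009 | 6.43 | 0.815 ± 0.017 | 1.62 |
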
Figure: Per-class epistemic contributions on FashionMNIST and KMNIST. FashionMNIST → KMNIST: the shift raises all ten class contributions.
Figure: OoD uncertainty score distributions on MIMIC-III. MIMIC-III: cleaner score separation matters more than the largest mean ratio.

On MIMIC-III, the OoD/ID contribution ratio is $2.15\times$ for survival but $1.30\times$ for mortality. The larger shift signal resides in the non-critical class.

This is why critical-class targeting helps selective prediction, while all-class aggregation is necessary for general OoD detection.
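
The two quantities reported in this section, an AUROC for the aggregate score and a per-class OoD/ID ratio, can be sketched as below. The function names `auroc` and `per_class_ratio` and the toy arrays are hypothetical; AUROC is computed by direct pairwise comparison, which is simple but uses memory proportional to the product of the two sample sizes.

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """Probability that a random OoD input scores above a random ID input (ties count 1/2)."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    diff = ood_scores[:, None] - id_scores[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def per_class_ratio(C_id, C_ood, eps=1e-12):
    """Mean per-class contribution on OoD inputs divided by the ID mean, shape (K,)."""
    mean_id = np.asarray(C_id, dtype=float).mean(axis=0)
    mean_ood = np.asarray(C_ood, dtype=float).mean(axis=0)
    return mean_ood / np.maximum(mean_id, eps)

# Hypothetical per-class contributions for two ID and two OoD inputs (K = 2)
C_id = np.array([[0.01, 0.02], [0.02, 0.02]])
C_ood = np.array([[0.04, 0.03], [0.03, 0.02]])

score = auroc(C_id.sum(axis=1), C_ood.sum(axis=1))   # all-class aggregation
ratio = per_class_ratio(C_id, C_ood)                 # where the shift concentrates
print("AUROC:", score, "per-class OoD/ID ratio:", ratio)
```

Reading the ratio vector class by class is what makes it possible to say, as above, that the larger shift signal sits in one class rather than another.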

D

Sensitivity to data quality

Posterior and training regime shape uncertainty at least as strongly as the metric.

19 / 20 End-to-end settings where $\sum_k C_k$ is less entangled than MI.
Figure: Disentanglement results under label noise. Relative disentanglement ratio under controlled label noise. Values near zero indicate cleaner separation of epistemic and aleatoric uncertainty.
End-to-end Bayesian training: $|R_{\mathrm{rel}}| \ll 0.3$

Both metrics remain near-perfectly disentangled.

Frozen-backbone transfer learning: $0.74$ to $1.97$

Entanglement increases by over an order of magnitude.

04 / What the paper changes

The value of $\mathbf{C}(x)$ is not that it replaces MI as a scalar.

It enables class-specific questions that scalar uncertainty cannot express: which classes drive uncertainty, whether confusion crosses a safety boundary, and where a distribution shift enters the label space.

Selective prediction

Target the uncertainty connected to the costly failure.

OoD diagnosis

Inspect whether shift is uniform or concentrated in particular classes.

Posterior scrutiny

Use $\rho_k$ and attribution behavior to examine the quality of uncertainty propagation.

Across all tasks, how uncertainty is propagated through the network matters as much as how it is measured.

Limitations and next steps

A diagnostic decomposition, with a clear frontier.

The paper is explicit about where the approximation loosens and what must be studied next.

High skewness

The additive approximation loosens for low-probability classes with high skewness. $\rho_k$ diagnoses this but does not correct it.

High-cardinality tasks

The $1/\mu_k$ normalization introduces $O(K^2)$ aggregate scaling, motivating truncation or reweighting.

Broader posteriors

Future evaluation should include Laplace approximations, SWAG, structured prediction, and low-rank ensembles.

Class-targeted decisions

Per-class attribution can support active learning, structured deferral, and richer safety-aware uncertainty profiles.

Citation

@inproceedings{
Toure2026not,
title={Not Just How Much, But Where: Decomposing Epistemic Uncertainty into Per-Class Contributions},
author={Mame Diarra Toure and David A. Stephens},
booktitle={Forty-Second Annual Conference on Uncertainty in Artificial Intelligence},
year={2026},
url={https://openreview.net/forum?id=cxuWscJmAr}
}