UAI 2026

Not Just How Much, But Where

Decomposing Epistemic Uncertainty into Per-Class Contributions

Mutual information says how uncertain a model is. The per-class vector $\mathbf{C}(x)$ reveals which classes drive that uncertainty.

Mame Diarra Touré and David A. Stephens
Department of Mathematics and Statistics, McGill University


34.7% lower selective risk than MI on diabetic retinopathy

56.2% lower selective risk than the per-class variance baseline

2 / 2 OoD benchmarks where $\sum_k C_k$ achieves the highest AUROC

The central problem

A single uncertainty score can hide the failure mode that matters most.

In asymmetric classification, ignorance involving a benign class is not equivalent to ignorance involving a safety-critical class. Scalar MI collapses both cases into the same number.

Scalar summary: MI = 0.024

Catastrophic miss: Grade 3 predicted as Grade 0

versus

Nearly identical scalar summary: MI = 0.027

Severity underestimate: Grade 3 predicted as Grade 2

Their scalar MI is nearly identical, but their per-class epistemic signatures are not.

01 / Method

Decompose MI into class-level contributions.

From $S$ stochastic predictions, estimate each class mean $\mu_k$ and variance $\operatorname{Var}[p_k]$. A second-order Taylor expansion of entropy gives a simple, additive decomposition.

Per-class epistemic contribution

$$C_k(x) = \frac{1}{2}\frac{\operatorname{Var}[p_k](x)}{\mu_k(x)}$$

Additive connection

$$\sum_{k=1}^{K} C_k(x) \approx \operatorname{MI}(y;\boldsymbol{\omega}\mid x)$$
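
For readers who want the intermediate step, here is a sketch of the expansion. It assumes the usual definition of MI as the entropy of the mean prediction minus the mean entropy, and it introduces the shorthand $h(p) = -p\log p$, which is notation added here rather than taken from the poster. The first-order term drops out because $\mathbb{E}[p_k - \mu_k] = 0$.

$$
\begin{aligned}
\operatorname{MI}(y;\boldsymbol{\omega}\mid x)
  &= \sum_{k=1}^{K}\Bigl(h(\mu_k) - \mathbb{E}\bigl[h(p_k)\bigr]\Bigr),
  \qquad h(p) = -p\log p, \\
\mathbb{E}\bigl[h(p_k)\bigr]
  &\approx h(\mu_k) + \tfrac{1}{2}\,h''(\mu_k)\operatorname{Var}[p_k],
  \qquad h''(\mu_k) = -\frac{1}{\mu_k}, \\
h(\mu_k) - \mathbb{E}\bigl[h(p_k)\bigr]
  &\approx \frac{1}{2}\frac{\operatorname{Var}[p_k]}{\mu_k} = C_k(x).
\end{aligned}
$$

Summing the last line over $k$ gives the additive connection stated above.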

Sample predictions

$S$ stochastic forward passes produce probability vectors on the simplex.

Estimate moments

Compute the class-wise mean, variance, covariance, and third central moment.

Attribute uncertainty

Each $C_k$ identifies the share of epistemic uncertainty associated with class $k$.
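
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the official implementation: the function names, the `eps` guard, and the synthetic Dirichlet samples standing in for stochastic forward passes are all assumptions made here.

```python
import numpy as np

def per_class_contributions(probs, eps=1e-12):
    """C_k = 0.5 * Var[p_k] / mu_k for one input.

    probs: array of shape (S, K), one probability vector per stochastic pass.
    eps is a hypothetical numerical guard against division by zero.
    """
    probs = np.asarray(probs, dtype=float)
    mu = probs.mean(axis=0)    # class-wise mean
    var = probs.var(axis=0)    # class-wise (population) variance
    return 0.5 * var / np.maximum(mu, eps)

def mutual_information(probs, eps=1e-12):
    """Sample estimate of MI: entropy of the mean minus mean entropy."""
    probs = np.asarray(probs, dtype=float)
    mu = probs.mean(axis=0)
    entropy_of_mean = -np.sum(mu * np.log(mu + eps))
    mean_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    return entropy_of_mean - mean_entropy

# Synthetic stand-in for S = 200 stochastic forward passes over K = 4 classes.
rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=[50.0, 30.0, 15.0, 5.0], size=200)

C = per_class_contributions(probs)
mi = mutual_information(probs)
print("C_k:", C)
print("sum_k C_k:", C.sum(), "MI:", mi)
```

On well-concentrated samples like these, the sum of the contributions should track the MI estimate closely. The gap widens as the sampled probabilities become more skewed.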

Why the normalization matters

Raw variance is suppressed exactly where rare classes live.

On the probability simplex, $\operatorname{Var}[p_k]\leq\mu_k(1-\mu_k)$. As $\mu_k\rightarrow0$, raw variance must vanish even when posterior disagreement remains important.

As $\mu_k \rightarrow 0$ (rare class):

Raw variance: upper bound $\rightarrow 0$

$C_k$: upper bound $\rightarrow \tfrac{1}{2}$
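
Both limits follow from a short argument, sketched here using only the fact that $p_k \in [0,1]$:

$$
\begin{aligned}
p_k^2 \le p_k \;&\Rightarrow\; \mathbb{E}[p_k^2] \le \mu_k
  \;\Rightarrow\; \operatorname{Var}[p_k] = \mathbb{E}[p_k^2] - \mu_k^2 \le \mu_k(1-\mu_k), \\
C_k &= \frac{1}{2}\frac{\operatorname{Var}[p_k]}{\mu_k} \le \frac{1-\mu_k}{2}
  \;\xrightarrow{\;\mu_k \to 0\;}\; \frac{1}{2}.
\end{aligned}
$$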

The $1/\mu_k$ term is not an ad hoc correction.

It arises from the entropy Hessian. A given amount of probability variance carries more information-theoretic weight for a low-probability class.

This makes $C_k$ comparable across rare and common classes, while preserving the additive connection to MI.

02 / Reliability

The method says when its approximation should be trusted.

Boundary correction improves sensitivity to rare classes, but also makes the Taylor approximation more sensitive to higher-order skewness. The paper pairs $C_k$ with a diagnostic.

Skewness diagnostic

$$\rho_k(x)=\frac{|m_{3,k}|}{3\mu_k\operatorname{Var}[p_k]}$$

$\rho_k \ll 1$: quadratic approximation is reliable

$\rho_k > 0.3$: interpret $C_k$ with caution
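
As a rough illustration, the diagnostic can be computed from the same sampled probabilities as $C_k$. The sketch below assumes $m_{3,k}$ is the third central moment of $p_k$ over the $S$ samples. The function name, the `eps` guard, and the toy samples are hypothetical.

```python
import numpy as np

def skewness_diagnostic(probs, eps=1e-12):
    """rho_k = |m_{3,k}| / (3 * mu_k * Var[p_k]) for one input.

    probs: array of shape (S, K). eps is a hypothetical guard for classes
    with zero variance.
    """
    probs = np.asarray(probs, dtype=float)
    mu = probs.mean(axis=0)
    centered = probs - mu
    var = np.mean(centered ** 2, axis=0)   # second central moment
    m3 = np.mean(centered ** 3, axis=0)    # third central moment
    return np.abs(m3) / np.maximum(3.0 * mu * var, eps)

# Toy two-class example: the rare class is usually near 0.01 but one
# pass assigns it 0.37, which makes its sampled distribution right-skewed.
probs = np.array([
    [0.01, 0.99],
    [0.01, 0.99],
    [0.01, 0.99],
    [0.37, 0.63],
])
rho = skewness_diagnostic(probs)
flagged = rho > 0.3   # threshold quoted above
print(rho, flagged)   # rho is approximately [0.6, 0.067]
```

Both classes have the same variance and the same absolute third moment here, yet only the rare class is flagged. This is the effect of dividing by $\mu_k$.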

Correlation-aware fallback

CBEC detects active confusion across a safety boundary.

$$\operatorname{CBEC}(x)=\sum_{i\in\mathcal{S}}\sum_{j\in\mathcal{C}} \sqrt{C_iC_j}\max(0,-\rho_{ij})$$

Use $C_{\text{crit\_max}}$ when critical-class $C_k$ values are reliable. CBEC is the robust alternative when skewness degrades the Taylor approximation.
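
A possible NumPy reading of the CBEC formula is sketched below. Several points are assumptions, since the poster does not spell them out: $\rho_{ij}$ is taken to be the Pearson correlation between sampled $p_i$ and $p_j$ (distinct from the skewness ratio $\rho_k$), $\mathcal{S}$ is taken to be the non-critical classes, and $\mathcal{C}$ the critical classes. The function and argument names are hypothetical.

```python
import numpy as np

def cbec(probs, safe_idx, critical_idx, eps=1e-12):
    """Sum over (i in safe, j in critical) of sqrt(C_i C_j) * max(0, -rho_ij).

    probs: array of shape (S, K). rho_ij is ASSUMED to be the Pearson
    correlation between sampled p_i and p_j.
    """
    probs = np.asarray(probs, dtype=float)
    mu = probs.mean(axis=0)
    var = probs.var(axis=0)
    C = 0.5 * var / np.maximum(mu, eps)

    cov = np.cov(probs, rowvar=False, bias=True)   # (K, K) population covariance
    std = np.sqrt(np.maximum(var, eps))            # guard constant classes
    corr = cov / np.outer(std, std)

    neg_corr = np.maximum(0.0, -corr[np.ix_(safe_idx, critical_idx)])
    weights = np.sqrt(np.outer(C[safe_idx], C[critical_idx]))
    return float(np.sum(weights * neg_corr))

# Toy example: probability mass moves back and forth between class 0
# (treated as safe) and class 2 (treated as critical); class 1 is constant.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.5, 0.2, 0.3],
    [0.7, 0.2, 0.1],
    [0.5, 0.2, 0.3],
])
score = cbec(probs, safe_idx=[0, 1], critical_idx=[2])
print(score)
```

Only negatively correlated pairs contribute, so the score is nonzero when probability mass is traded across the boundary and zero when the two sides move together.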

Axiomatic profile of $\sum_k C_k$

A0: non-negativity

A1: zero iff no disagreement

A3: increases under mean-preserving spread

A2 and A5: inherited trade-offs

The A5 violation is the mechanism that counteracts boundary suppression: sensitivity to $\mu_k$ prevents rare-class contributions from being forced to zero.

03 / Evidence

Class-aware uncertainty changes the decisions a model can support.

The paper evaluates selective prediction, OoD detection, and sensitivity to controlled label noise.

Selective prediction / diabetic retinopathy

Defer when uncertainty involves a clinically critical class.

Lowest AUSC: 0.285

$C_{\text{crit\_max}}$ dominates across the full coverage range.

Selective prediction results on diabetic retinopathy — Critical FNR across coverage levels and bootstrap AUSC distributions. Lower AUSC means safer selective prediction.

Family	Policy	AUSC ↓	Critical FNR @80% ↓
Scalar	Entropy	0.604 ± 0.022	0.401 ± 0.016
Scalar	MI	0.436 ± 0.019	0.339 ± 0.014
Per-class variance	Sale_EU_crit	0.650 ± 0.013	0.409 ± 0.016
Per-class $C_k$	$C_{\text{crit\_sum}}$	0.327 ± 0.017	0.321 ± 0.014
Per-class $C_k$	$C_{\text{crit\_max}}$	0.285 ± 0.016	0.302 ± 0.013
Correlation-aware	CBEC	0.416 ± 0.020	0.335 ± 0.014

Grade 3 has mean probability near $0.06$. Raw variance is therefore heavily suppressed; the entropy-derived $1/\mu_k$ weighting recovers the critical-class signal.

Across inference methods, the preferred policy follows the diagnostic: direct $C_k$ targeting under reliable posteriors, CBEC when skewness is high.
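
A deferral rule of this kind might be sketched as follows. The sketch assumes $C_{\text{crit\_max}}$ and $C_{\text{crit\_sum}}$ are the maximum and the sum of $C_k$ over the critical classes, as their names suggest, and that deferral removes the highest-scoring inputs until the target coverage is reached. All function names and the toy data are hypothetical, and the paper's evaluation protocol may differ.

```python
import numpy as np

def critical_scores(probs, critical_idx, eps=1e-12):
    """Return (crit_max, crit_sum), each of shape (N,).

    probs: array of shape (N, S, K): N inputs, S stochastic passes, K classes.
    ASSUMED definitions: max and sum of C_k over the critical classes.
    """
    probs = np.asarray(probs, dtype=float)
    mu = probs.mean(axis=1)
    var = probs.var(axis=1)
    C = 0.5 * var / np.maximum(mu, eps)    # (N, K)
    crit = C[:, critical_idx]
    return crit.max(axis=1), crit.sum(axis=1)

def retain_mask(scores, coverage):
    """Keep the `coverage` fraction of inputs with the lowest scores."""
    scores = np.asarray(scores)
    n_keep = int(round(coverage * len(scores)))
    order = np.argsort(scores, kind="stable")
    mask = np.zeros(len(scores), dtype=bool)
    mask[order[:n_keep]] = True
    return mask

# Toy data: 5 inputs, 2 passes, 3 classes. Disagreement about class 2
# (treated as critical) grows with the input index.
d = np.array([0.0, 0.01, 0.02, 0.03, 0.04])
probs = np.stack([
    np.stack([0.7 + d, np.full(5, 0.2), 0.1 - d], axis=1),
    np.stack([0.7 - d, np.full(5, 0.2), 0.1 + d], axis=1),
], axis=1)                                 # shape (5, 2, 3)

crit_max, crit_sum = critical_scores(probs, critical_idx=[1, 2])
keep = retain_mask(crit_max, coverage=0.8)
print(crit_max, keep)
```

At 80% coverage the rule defers only the input whose critical-class disagreement is largest and predicts on the rest.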

Interpretability / error signatures

Nearly identical MI can conceal different confusion pathways.

Per-class epistemic signatures for diabetic retinopathy errors — The per-class profiles distinguish a catastrophic miss from a severity underestimate, despite nearly identical scalar MI.

Cross-boundary epistemic confusion is 2.7× stronger than within-group confusion.

Out-of-distribution detection

The per-class view shows where distributional shift concentrates.

Best AUROC on both tasks

All-class aggregation captures shifts whose locus is not known in advance.

Method	FashionMNIST → KMNIST: AUROC ↑	FashionMNIST → KMNIST: OoD / ID ratio	MIMIC-III ICU → Newborn: AUROC ↑	MIMIC-III ICU → Newborn: OoD / ID ratio
Neg. MSP	0.665 ± 0.013	2.07	0.688 ± 0.030	1.24
MI	0.724 ± 0.009	5.92	0.802 ± 0.004	1.61
$EU_{\text{var}}$	0.710 ± 0.010	5.92	0.778 ± 0.015	1.71
$\sum_k C_k$	0.735 ± 0.009	6.43	0.815 ± 0.017	1.62

Per-class epistemic contributions on FashionMNIST and KMNIST — FashionMNIST → KMNIST: the shift raises all ten class contributions.

OoD uncertainty score distributions on MIMIC-III: cleaner score separation matters more than the largest mean ratio.

On MIMIC-III, the OoD/ID contribution ratio is $2.15\times$ for survival but $1.30\times$ for mortality. The larger shift signal resides in the non-critical class.

This is why critical-class targeting helps selective prediction, while all-class aggregation is necessary for general OoD detection.
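
The two quantities discussed here, a scalar OoD score and a per-class shift profile, can be sketched as below. The AUROC is computed with the standard pairwise (Mann-Whitney) formulation. The per-class ratio is assumed to be the mean OoD contribution divided by the mean ID contribution, which the poster does not define explicitly. Function names and the toy numbers are hypothetical.

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """Probability that a random OoD score exceeds a random ID score.

    Ties count one half. Uses an O(N * M) pairwise comparison, which is
    fine for a sketch but not for very large evaluation sets.
    """
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    greater = (ood_scores[:, None] > id_scores[None, :]).mean()
    ties = (ood_scores[:, None] == id_scores[None, :]).mean()
    return float(greater + 0.5 * ties)

def contribution_ratio(C_id, C_ood):
    """Per-class OoD/ID ratio, ASSUMED to be a ratio of mean C_k values.

    C_id, C_ood: arrays of shape (N_id, K) and (N_ood, K).
    """
    return np.asarray(C_ood).mean(axis=0) / np.asarray(C_id).mean(axis=0)

# Toy per-class contributions for two classes.
C_id = np.array([[0.01, 0.02], [0.03, 0.02]])
C_ood = np.array([[0.04, 0.03], [0.04, 0.03]])

# All-class aggregation: score each input by sum_k C_k.
score = auroc(C_id.sum(axis=1), C_ood.sum(axis=1))
ratio = contribution_ratio(C_id, C_ood)
print(score, ratio)
```

The scalar score answers whether an input looks shifted, while the ratio vector indicates which classes carry the shift.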

Sensitivity to data quality

Posterior and training regime shape uncertainty at least as strongly as the metric.

19 / 20 End-to-end settings where $\sum_k C_k$ is less entangled than MI.

Disentanglement results under label noise — Relative disentanglement ratio under controlled label noise. Values near zero indicate cleaner separation of epistemic and aleatoric uncertainty.

End-to-end Bayesian training $|R_{\mathrm{rel}}| \ll 0.3$

Both metrics remain near-perfectly disentangled.

Frozen-backbone transfer learning $0.74$ to $1.97$

Entanglement increases by over an order of magnitude.

04 / What the paper changes

The value of $\mathbf{C}(x)$ is not that it replaces MI as a scalar.

It enables class-specific questions that scalar uncertainty cannot express: which classes drive uncertainty, whether confusion crosses a safety boundary, and where a distribution shift enters the label space.

Selective prediction

Target the uncertainty connected to the costly failure.

OoD diagnosis

Inspect whether shift is uniform or concentrated in particular classes.

Posterior scrutiny

Use $\rho_k$ and attribution behavior to examine the quality of uncertainty propagation.

Across all tasks, how uncertainty is propagated through the network matters as much as how it is measured.

Limitations and next steps

A diagnostic decomposition, with a clear frontier.

The paper is explicit about where the approximation loosens and what must be studied next.

High skewness

The additive approximation loosens for low-probability classes with high skewness. $\rho_k$ diagnoses this but does not correct it.

High-cardinality tasks

The $1/\mu_k$ normalization introduces $O(K^2)$ aggregate scaling, motivating truncation or reweighting.

Broader posteriors

Future evaluation should include Laplace approximations, SWAG, structured prediction, and low-rank ensembles.

Class-targeted decisions

Per-class attribution can support active learning, structured deferral, and richer safety-aware uncertainty profiles.

Resources

Paper materials

Paper, poster, citation, implementation, and project links.

OpenReview: Official UAI paper page

arXiv: Paper abstract and PDF

Poster PDF: UAI 2026 conference poster

Code: Repository reserved for the official implementation

Project URL: Permanent page for sharing and QR codes

Homepage: Research and publications

Citation

@inproceedings{
Toure2026not,
title={Not Just How Much, But Where: Decomposing Epistemic Uncertainty into Per-Class Contributions},
author={Mame Diarra Toure and David A. Stephens},
booktitle={Forty-Second Annual Conference on Uncertainty in Artificial Intelligence},
year={2026},
url={https://openreview.net/forum?id=cxuWscJmAr}
}
