ICML 2026 Accepted Paper

Singular Bayesian Neural Networks

Mame Diarra Touré · David A. Stephens

W = ABᵀ → singular posterior geometry · theory-backed, scalable uncertainty

A low-rank variational BNN whose induced posterior has singular geometric support.

arradiat.github.io/projects/singular-bnn · github.com/arradiat/SBNN · arXiv:2602.00387
Contribution Map

This paper shows that low-rank variational BNNs induce singular posterior geometry.

Method: End-to-end variational learning over low-rank factors W = ABᵀ.
Geometry: The induced qW is a pushforward posterior supported on the rank-r manifold.
Singularity: qW is singular with respect to Lebesgue measure when r < min(m, n).
Correlation: Mean-field factors still induce structured weight correlations in W.
Theory: Approximation, PAC-Bayes, and Gaussian-complexity guarantees.
Evidence: MLPs, LSTMs, Transformers; MIMIC, Beijing PM2.5, SST-2.

Core thesis: low-rank is not merely parameter economy; it changes posterior support, covariance, and capacity.

Problem

Bayesian neural networks provide uncertainty, but full weight-space inference is expensive.

Full-rank BBB and Deep Ensembles scale with the full weight matrix. Mean-field VI assigns independent uncertainty to each weight entry, so weight correlations are not represented.

O(mn): coordinates for a single weight matrix
mean + scale per weight in BBB
model copies in Deep Ensembles
MFVI over W: every cell gets its own posterior.
SBNN factor space: shared factors induce correlations in W.

But modern weight matrices often have fast spectral decay.

What Existing Low-Rank Methods Miss

SBNN is not LoRA with uncertainty pasted on top.

Approach | Backbone | Posterior object | Missing piece
Post-hoc perturbations | pretrained | noise around fixed weights | not end-to-end uncertainty
Low-rank covariance | full W means | covariance approximation | still O(mn) weight means
Bayesian LoRA | pretrained/fine-tuning | adapter uncertainty | not from-scratch BNN training
SBNN | from scratch | singular qW on rank-r support | pushforward posterior

The novelty is the induced posterior measure over W: its support, geometry, covariance, and complexity all change.

The Move

Learn uncertainty in factor space, then map it into weight space.

W = A·Bᵀ

Variational posterior: q(A,B) = qA(A) qB(B), mean-field Gaussians over the factor entries.
Parameter count: O(mn) becomes O(r(m+n)); the rank controls expressiveness.
Drop-in layers: Implemented for dense layers, LSTM gates, and Transformer components.
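The parameter-count claim is easy to check by arithmetic. A minimal sketch (the layer sizes below are illustrative, not taken from the paper):

```python
def full_rank_params(m, n):
    # Full-rank BBB: one (mean, scale) pair per entry of W, so 2mn.
    return 2 * m * n

def low_rank_params(m, n, r):
    # SBNN: (mean, scale) pairs over the factors A (m x r) and B (n x r).
    return 2 * r * (m + n)

m, n, r = 1024, 1024, 16
print(full_rank_params(m, n))    # 2097152
print(low_rank_params(m, n, r))  # 65536, a 32x reduction here
```

For a square layer, the ratio is mn / (r(m+n)) = m / (2r), so savings grow with layer width at fixed rank.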
Pushforward Posterior

The posterior over W is induced, not independently assigned.

q(A,B) (factor posterior) → T(A,B) = ABᵀ → qW (induced weight posterior)

This is the formal turn: the Bayesian object is a pushforward measure on weight space.

Central Theorem

qW is singular with respect to Lebesgue measure.

If r < min(m, n) and W = ABᵀ with A ∈ R^{m×r}, B ∈ R^{n×r}, then

qW(Rr) = 1    and    λ(Rr) = 0,    therefore    qW ⟂ λ,

where Rr = {W ∈ R^{m×n} : rank(W) ≤ r} and λ is Lebesgue measure on R^{m×n}. The posterior cannot be represented as a full-dimensional density over R^{m×n}: it lives on the zero-volume rank-r support.
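The zero-volume support is visible numerically: every pushforward sample of W = ABᵀ has rank at most r. A minimal numpy sketch with illustrative shapes and stand-in Gaussian factors:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 20, 30, 4

# Sample factors from stand-in mean-field Gaussians and push forward.
A = rng.normal(size=(m, r))
B = rng.normal(size=(n, r))
W = A @ B.T

# Every sample lies on the rank-<=r set, which has Lebesgue
# measure zero in R^(m x n) whenever r < min(m, n).
print(np.linalg.matrix_rank(W))  # 4
```

No matter how many samples are drawn, none can leave the rank-r set, which is exactly what qW ⟂ λ encodes.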

Geometric Distinction

Mean-field fills volume. Low-rank mass lives on a manifold.

[Figure: geometric distinction between mean-field and low-rank posteriors]

This is why “singular BNN” is the right phrase: the posterior support itself has changed.

Structured Latent-Factor Uncertainty

Mean-field factors do not imply mean-field weights.

[Figure: weight correlation heatmap]
Covariance lemma: Even when A and B have independent entries, Wij and Wi′j′ can be correlated through shared latent dimensions.
Inductive bias: The model trades independence bias for correlation bias: row/column-level uncertainty propagates coherently.

Rank r controls how expressive these correlations can be.
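The covariance lemma can be illustrated by Monte Carlo. With independent Gaussian factor entries of nonzero mean, two entries of W in the same row are correlated because they share that row of A (a sketch; the sizes, means, and scales below are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, S = 4, 4, 2, 100_000

# Mean-field factor "posteriors": independent Gaussians per entry.
mu, sig = 1.0, 0.5
A = mu + sig * rng.normal(size=(S, m, r))
B = mu + sig * rng.normal(size=(S, n, r))
W = A @ np.swapaxes(B, 1, 2)  # S pushforward samples of W = A B^T

# W[0,0] and W[0,1] share row 0 of A, hence are correlated
# even though every factor entry is independent.
rho = np.corrcoef(W[:, 0, 0], W[:, 0, 1])[0, 1]
print(rho)  # clearly positive (analytically ~0.44 for these values)
```

Analytically, Cov(Wij, Wij′) = Σk σ²_Aik μ_Bjk μ_Bj′k, which vanishes only when the shared factors have zero mean; rank r sets how many latent dimensions can carry such shared structure.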

Theory Stack

The paper gives three theory certificates, not one.

Approximation: Eckart–Young–Mirsky controls rank-r approximation by the tail singular values of W*.
Learned factors: Error decomposes into learning error ||W − W*_r||_F plus an unavoidable rank bias σ_{>r}.
Generalization: PAC-Bayes complexity scales as √(r(m+n)), and Gaussian complexity transfers to Bayesian predictive means.
[Figure: PAC-Bayes and Gaussian complexity bounds]

Figure result: PAC-Bayes shows a critical rank transition; Gaussian complexity decreases with rank reduction.
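The Eckart–Young–Mirsky ingredient can be verified directly: the best rank-r Frobenius approximation is the truncated SVD, and its error equals the root-sum-of-squares of the tail singular values (a numpy sketch; the matrix here is random, not a trained layer):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 30, 20, 5
W = rng.normal(size=(m, n))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_r = (U[:, :r] * s[:r]) @ Vt[:r]      # best rank-r approximation (EYM)

err = np.linalg.norm(W - W_r)          # Frobenius norm by default
tail = np.sqrt(np.sum(s[r:] ** 2))     # tail singular values sigma_{>r}
print(np.isclose(err, tail))           # True
```

This is the σ_{>r} term above: fast spectral decay makes the unavoidable rank bias small.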

Implementation

The framework is not architecture-specific.

MLP: standard dense layers, W = ABᵀ
LSTM: input-to-hidden + hidden-to-hidden, factors sampled once per batch
Transformer: position-wise factorization, rank r = 16 in the SST-2 experiments

Drop-in variational layers make the geometry portable across model families.
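A minimal drop-in dense layer along these lines, as a numpy sketch of the reparameterized forward pass (this is my illustration under the paper's stated setup, not the authors' code; all names and initial values are mine):

```python
import numpy as np

class LowRankVariationalDense:
    """Dense layer with W = A B^T and mean-field Gaussians over A, B."""

    def __init__(self, m, n, r, seed=0):
        self.rng = np.random.default_rng(seed)
        # Variational means and log-scales for the factor entries.
        self.mu_A = 0.1 * self.rng.normal(size=(m, r))
        self.mu_B = 0.1 * self.rng.normal(size=(n, r))
        self.log_sig_A = np.full((m, r), -3.0)
        self.log_sig_B = np.full((n, r), -3.0)

    def sample_W(self):
        # Reparameterization: factor sample -> pushforward weight sample.
        A = self.mu_A + np.exp(self.log_sig_A) * self.rng.normal(size=self.mu_A.shape)
        B = self.mu_B + np.exp(self.log_sig_B) * self.rng.normal(size=self.mu_B.shape)
        return A @ B.T

    def __call__(self, x):
        # One weight sample per forward pass (per batch, as in the LSTM case).
        return x @ self.sample_W()

layer = LowRankVariationalDense(m=8, n=4, r=2)
out = layer(np.ones((5, 8)))
print(out.shape)  # (5, 4)
```

Every sampled W has rank at most r by construction, so the singular-support geometry is enforced by the layer itself rather than by a penalty.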

Experimental Design

Three architectures. Three data regimes. Six uncertainty baselines.

Dataset | Architecture | Shift | Rank
MIMIC-III ICU mortality | 2-layer MLP | adult ICU → newborn ICU | r = 15
Beijing PM2.5 forecasting | 2-layer LSTM | Beijing → Guangzhou | r = 14/20
SST-2 sentiment | 4-layer Transformer | movie reviews → AGNews | r = 16

Baselines: deterministic, Deep Ensemble, Full-Rank BBB, Low-Rank random init, Low-Rank SVD init, Rank-1 multiplicative; SWAG added as supplementary comparator.

Result: MIMIC-III

On clinical shift, low-rank gives the strongest OOD uncertainty.

[Figure: MIMIC radar plot]
0.802 AUC-OOD (best overall)
0.788 AUPR-OOD (best overall)
0.824 AUPR-In (best overall)
Nuance: Deep Ensemble keeps stronger in-domain AUROC and NLL; low-rank prioritizes epistemic separation under shift.

Parameter reduction: 70% fewer than Full-Rank BBB; 88% fewer than Deep Ensemble.

Result: Beijing PM2.5

For time-series forecasting, uncertainty quality shows up in coverage and abstention.

[Figure: Beijing LSTM radar plot; selective prediction]
0.790 PICP (best coverage)
17.4% MAE reduction at 80% retention
64% fewer parameters than Full-Rank BBB
Result: SST-2 Transformer

At Transformer scale, the parameter story becomes decisive.

[Figure: SST-2 Transformer radar plot]
1.5M Low-Rank BBB parameters
13× fewer than Full-Rank BBB
33× fewer than Deep Ensemble
Performance profile: Low-Rank BBB reaches 0.806 accuracy, best AUPR-In, and second-best AUROC-OOD. Deep Ensemble remains strongest overall but costs 49.6M parameters.
Training time: Low-Rank trains in 8.2 min, vs 23.1 min for Full-Rank BBB and 64.7 min for Deep Ensemble.
Efficiency Evidence

The quality-efficiency frontier changes.

[Figure: model parameter comparison]
[Figure: efficiency-adjusted Transformer radar]

The result is not “low-rank always wins every metric”; it is that useful Bayesian uncertainty becomes plausible at modern parameter scales.

Why Rank Is Plausible

Observed singular value decay supports the rank-r constraint.

When layer spectra decay quickly, the manifold is not an arbitrary bottleneck. It is a structured approximation class.

Theory connection: EYM ties the rank-approximation error to the tail singular values; rank ablations and layer spectra guide practical rank selection.
[Figure: singular value analysis]
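One practical rank heuristic consistent with the EYM connection: pick the smallest r whose leading singular values capture a target fraction of the spectral energy (a sketch; the 90% threshold and the synthetic spectra are illustrative, not the paper's selection rule):

```python
import numpy as np

def rank_for_energy(s, frac=0.9):
    """Smallest r capturing `frac` of the squared-singular-value energy."""
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(energy, frac) + 1)

# Fast-decaying spectrum: a tiny r already explains most of the energy.
s_fast = 2.0 ** -np.arange(20)
print(rank_for_energy(s_fast))  # 2

# Flat spectrum: no low-rank structure to exploit.
s_flat = np.ones(20)
print(rank_for_energy(s_flat))  # 18
```

When layer spectra look like `s_fast`, the rank-r manifold is a structured approximation class rather than an arbitrary bottleneck; when they look like `s_flat`, the rank constraint genuinely discards capacity.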
Comparator: SWAG

A strong posterior baseline does not overturn the central tradeoff.

Setting | SWAG strength | Low-rank advantage
SST-2 | Accuracy tied at 0.808 vs 0.806 | Better NLL/OOD; 1.47M params vs 208.37M
MIMIC | Higher in-domain AUROC/NLL | OOD MI metrics: 0.802 / 0.788 vs 0.634 / 0.680
Beijing | Higher coverage | Much narrower/costlier tradeoff; stronger OOD for low-rank

SWAG is a credible baseline. However, SBNN’s geometry gives a better quality-efficiency path in the paper’s target regimes.

Honest Takeaway

The empirical message is not compression alone.

Where low-rank shines: OOD separation on MIMIC; coverage/selective prediction on Beijing; efficiency-adjusted uncertainty on SST-2.
Where ensembles remain strong: In-distribution likelihood and sharpness can favor Deep Ensembles; calibration gaps remain in some settings.
Why that matters: Trustworthy AI often needs useful uncertainty under shift, abstention, and deployment constraints, not only marginal NLL.

SBNNs make that goal more practical: structured uncertainty, lower cost, and scalable to modern architectures.

Research Agenda

A platform for scalable Bayesian posteriors in modern architectures.

Adaptive rank selection: Ranks chosen by spectra, validation tradeoffs, or sparse priors rather than fixed grids.
Bayesian Transformers: Structured uncertainty for large sequence models without ensemble-scale cost.
Beyond Gaussian factors: Laplace, spike-and-slab, and richer factor distributions while keeping singular support.
Trustworthy AI: Uncertainty that supports abstention, coverage, OOD awareness, and constrained deployment.
Closing

Singular posterior geometry for scalable Bayesian deep learning.
