W = ABᵀ → singular posterior geometry: theory-backed, scalable uncertainty
A low-rank variational BNN whose induced posterior has singular geometric support.
Contribution Map
This paper shows that low-rank variational BNNs induce singular posterior geometry.
Method: end-to-end variational learning over low-rank factors W = ABᵀ.
Geometry: the induced qW is a pushforward posterior supported on the rank-r manifold.
Singularity: qW is singular with respect to Lebesgue measure when r < min(m, n).
Correlation: mean-field factors still induce structured weight correlations in W.
Theory: approximation, PAC-Bayes, and Gaussian-complexity guarantees.
Evidence: MLPs, LSTMs, Transformers; MIMIC, Beijing PM2.5, SST-2.
Core thesis: low-rank is not merely parameter economy; it changes posterior support, covariance, and capacity.
Problem
Bayesian neural networks provide uncertainty, but full weight-space inference is expensive.
Full-rank BBB and Deep Ensembles scale with the full weight matrix. Mean-field VI assigns independent uncertainty to each weight entry, so weight correlations are not represented.
But modern weight matrices often have fast spectral decay.
What Existing Low-Rank Methods Miss
The paper is not LoRA with uncertainty pasted on top.
Approach | Backbone | Posterior object | Missing piece
Post-hoc perturbations | pretrained | noise around fixed weights | not end-to-end uncertainty
Low-rank covariance | full W means | covariance approximation | still O(mn) weight means
Bayesian LoRA | pretrained/fine-tuning | adapter uncertainty | not from-scratch BNN training
SBNN | from scratch | singular qW on rank-r support | pushforward posterior
The novelty is the induced posterior measure over W: its support, geometry, covariance, and complexity all change.
The Move
Learn uncertainty in factor space, then map it into weight space.
W = A · Bᵀ
Variational posterior: q(A, B) = qA(A) qB(B), mean-field Gaussians over factor entries.
Parameter count: O(mn) becomes O(r(m+n)); rank controls expressiveness.
Drop-in layers: implemented for dense layers, LSTM gates, and Transformer components.
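To make the factor-space construction concrete, here is a minimal sketch of such a drop-in variational dense layer, assuming PyTorch and the standard reparameterization trick. The class name, initialization scales, and the omission of the prior/KL term are illustrative choices, not the paper's reference implementation.

```python
# Hypothetical sketch of a low-rank variational dense layer (illustrative, not
# the paper's code). Uncertainty lives on the factors A (out x r) and B (in x r);
# W = A @ B.T is only ever materialized as a sample, so the posterior over W is
# the pushforward of the factor posterior. The ELBO's KL term is omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankVariationalLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        # Mean-field Gaussian over factor entries: O(r(m+n)) means and scales.
        self.A_mu = nn.Parameter(0.1 * torch.randn(out_features, rank))
        self.A_rho = nn.Parameter(torch.full((out_features, rank), -5.0))
        self.B_mu = nn.Parameter(0.1 * torch.randn(in_features, rank))
        self.B_rho = nn.Parameter(torch.full((in_features, rank), -5.0))

    @staticmethod
    def _sample(mu, rho):
        # Reparameterization trick; softplus keeps the scale positive.
        return mu + F.softplus(rho) * torch.randn_like(mu)

    def forward(self, x):
        A = self._sample(self.A_mu, self.A_rho)   # (out, r)
        B = self._sample(self.B_mu, self.B_rho)   # (in, r)
        W = A @ B.T                               # induced weight sample, rank <= r
        return x @ W.T

layer = LowRankVariationalLinear(64, 32, rank=8)
y = layer(torch.randn(16, 64))   # each forward pass draws a fresh W from the pushforward
```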
Pushforward Posterior
The posterior over W is induced, not independently assigned.
q(A, B) (factor posterior) → T(A, B) = ABᵀ → qW (induced weight posterior)
This is the formal turn: the Bayesian object is a pushforward measure on weight space.
Central Theorem
qW is singular with respect to Lebesgue measure.
If r < min(m, n) and W = ABᵀ with A ∈ R^{m×r}, B ∈ R^{n×r}, then
qW(R_r) = 1 and λ(R_r) = 0, therefore qW ⟂ λ,
where R_r denotes the set of m×n matrices with rank at most r and λ is Lebesgue measure on R^{m×n}.
The posterior cannot be represented as a full-dimensional density over R^{m×n}; it lives on zero-volume rank-r support.
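The support claim is easy to check numerically. The sketch below uses toy Gaussian factors and illustrative dimensions: every pushforward sample W = ABᵀ has rank at most r, so all posterior mass sits on the zero-volume rank-r set.

```python
# Toy check of the singular-support claim: with r < min(m, n), every sampled
# W = A @ B.T has rank <= r, so samples never leave the measure-zero rank-r set.
# The Gaussian factors and dimensions here are illustrative.
import numpy as np

m, n, r = 20, 30, 4
rng = np.random.default_rng(0)

ranks = []
for _ in range(1000):
    A = rng.normal(size=(m, r))      # a sample from some factor posterior
    B = rng.normal(size=(n, r))
    W = A @ B.T                      # pushforward sample in weight space
    ranks.append(np.linalg.matrix_rank(W))

print(max(ranks))   # 4: every sample lies on the rank-<=r set, which has Lebesgue measure zero
```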
Geometric Distinction
Mean-field fills volume. Low-rank mass lives on a manifold.
This is why “singular BNN” is the right phrase: the posterior support itself has changed.
Structured Latent-Factor Uncertainty
Mean-field factors do not imply mean-field weights.
Covariance lemma: even when A and B have independent entries, W_ij and W_i'j' can be correlated through shared latent dimensions.
Inductive bias: the model trades independence bias for correlation bias; row- and column-level uncertainty propagates coherently.
Rank r controls how expressive these correlations can be.
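A small Monte Carlo check of the covariance lemma, with toy dimensions and noise scales as assumptions: even with fully factorized Gaussian entries in A and B, weight entries that share a row or column of W reuse the same latent factor rows and come out correlated.

```python
# Minimal numerical check of the covariance lemma (toy setup, not the paper's):
# mean-field Gaussian entries in A and B still induce correlated weight entries
# wherever two entries of W share a row or column, because those entries reuse
# the same sampled factor rows.
import numpy as np

m, n, r = 5, 6, 2
rng = np.random.default_rng(1)
A_mu, B_mu = rng.normal(size=(m, r)), rng.normal(size=(n, r))

samples = []
for _ in range(20000):
    A = A_mu + 0.3 * rng.normal(size=(m, r))   # independent entry-wise noise
    B = B_mu + 0.3 * rng.normal(size=(n, r))
    samples.append((A @ B.T).ravel())          # flatten W into a vector of m*n entries

C = np.cov(np.stack(samples), rowvar=False)    # empirical (m*n) x (m*n) weight covariance
off_diag = C - np.diag(np.diag(C))
print(np.abs(off_diag).max())                  # clearly non-zero: W is not mean-field
```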
Theory Stack
The paper gives three theory certificates, not one.
Approximation: Eckart–Young–Mirsky controls rank-r approximation by the tail singular values of W*.
Learned factors: error decomposes into learning error ||W − W*_r||_F plus the unavoidable rank bias σ_{>r}.
Generalization: PAC-Bayes complexity scales as √(r(m+n)), and Gaussian complexity transfers to Bayesian predictive means.
Figure result: PAC-Bayes shows a critical rank transition; Gaussian complexity decreases with rank reduction.
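The approximation certificate can be verified numerically. The sketch below constructs a matrix with fast spectral decay (the construction itself is illustrative) and confirms the Eckart–Young–Mirsky identity: the best rank-r Frobenius error equals the tail singular-value energy σ_{>r}.

```python
# Numerical check of Eckart-Young-Mirsky on a toy matrix with decaying spectrum:
# the best rank-r approximation error in Frobenius norm equals the energy in the
# singular values beyond r.
import numpy as np

rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.normal(size=(50, 50)))
V, _ = np.linalg.qr(rng.normal(size=(80, 80)))
s = np.exp(-0.2 * np.arange(50))                     # fast spectral decay
W_star = U @ np.diag(s) @ V[:, :50].T                # toy "true" weights W*

r = 10
U_r, s_all, Vt = np.linalg.svd(W_star, full_matrices=False)
W_r = U_r[:, :r] @ np.diag(s_all[:r]) @ Vt[:r]       # best rank-r approximation
eym_error = np.linalg.norm(W_star - W_r, "fro")
tail_energy = np.sqrt(np.sum(s_all[r:] ** 2))        # sigma_{>r}

print(np.isclose(eym_error, tail_energy))            # True
```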
Implementation
The framework is not architecture-specific.
MLP: standard dense layers Wℓ = Aℓ Bℓᵀ.
LSTM: input-to-hidden and hidden-to-hidden factors, sampled once per batch.
Transformer: position-wise factorization, rank r=16 in the SST-2 experiments.
Drop-in variational layers make the geometry portable across model families.
Experimental Design
Three architectures. Three data regimes. Six uncertainty baselines.
Dataset | Architecture | Shift | Rank
MIMIC-III ICU mortality | 2-layer MLP | adult ICU → newborn ICU | r=15
Beijing PM2.5 forecasting | 2-layer LSTM | Beijing → Guangzhou | r=14/20
SST-2 sentiment | 4-layer Transformer | movie reviews → AGNews | r=16
Baselines: deterministic, Deep Ensemble, Full-Rank BBB, Low-Rank random init, Low-Rank SVD init, Rank-1 multiplicative; SWAG is added as a supplementary comparator.
Result: MIMIC-III
On clinical shift, low-rank gives the strongest OOD uncertainty.
AUC-OOD: 0.802 (best overall)
AUPR-OOD: 0.788 (best overall)
AUPR-In: 0.824 (best overall)
Nuance: Deep Ensemble keeps stronger in-domain AUROC and NLL; low-rank prioritizes epistemic separation under shift.
Parameter reduction: 70% fewer than Full-Rank BBB; 88% fewer than Deep Ensemble.
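The OOD numbers above score how well an uncertainty signal separates shifted inputs from in-domain ones; the SWAG comparison later refers to these as MI metrics, so the sketch below assumes mutual information as the epistemic score and substitutes toy Monte Carlo predictive probabilities for actual model output.

```python
# Sketch of MI-based OOD separation (assumed protocol, toy probabilities):
# epistemic uncertainty = predictive entropy minus expected per-sample entropy,
# then AUROC of that score against an in-domain / shifted-domain label.
import numpy as np
from sklearn.metrics import roc_auc_score

def mutual_information(probs):
    """probs: (S, N) array of p(y=1 | x, W_s) over S weight samples, N inputs."""
    p = np.stack([probs, 1.0 - probs], axis=-1)             # (S, N, 2)
    mean_p = p.mean(axis=0)                                  # predictive distribution
    total = -np.sum(mean_p * np.log(mean_p + 1e-12), axis=-1)
    aleatoric = -np.sum(p * np.log(p + 1e-12), axis=-1).mean(axis=0)
    return total - aleatoric                                 # epistemic part

rng = np.random.default_rng(3)
probs_in = np.clip(0.8 + 0.02 * rng.normal(size=(30, 200)), 1e-3, 1 - 1e-3)   # samples agree
probs_out = np.clip(0.5 + 0.30 * rng.normal(size=(30, 200)), 1e-3, 1 - 1e-3)  # samples disagree

score = np.concatenate([mutual_information(probs_in), mutual_information(probs_out)])
is_ood = np.concatenate([np.zeros(200), np.ones(200)])
print(roc_auc_score(is_ood, score))   # high AUROC: MI separates the shifted inputs
```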
Result: Beijing PM2.5
For time-series forecasting, uncertainty quality shows up in coverage and abstention.
PICP: 0.790 (best coverage)
MAE reduction at 80% retention: 17.4%
Parameters: 64% fewer than Full-Rank BBB
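For reference, here is how PICP and retention-based MAE reduction are typically computed; the Gaussian ~80% interval, the toy data, and the confidence-ordering rule are assumptions rather than the paper's exact evaluation code.

```python
# Sketch of the two forecasting metrics (illustrative data and interval rule):
# PICP = fraction of targets inside the predictive interval; selective prediction
# drops the most uncertain 20% of points and measures the MAE improvement.
import numpy as np

rng = np.random.default_rng(4)
y_true = rng.normal(size=500)
sigma = 0.2 + 0.6 * rng.random(500)              # predictive std per point (toy)
y_pred = y_true + sigma * rng.normal(size=500)   # toy errors that scale with uncertainty

# PICP over an ~80% central Gaussian interval (z = 1.28).
lower, upper = y_pred - 1.28 * sigma, y_pred + 1.28 * sigma
picp = np.mean((y_true >= lower) & (y_true <= upper))

# Selective prediction at 80% retention: keep the most confident points.
keep = np.argsort(sigma)[: int(0.8 * len(y_true))]
mae_all = np.mean(np.abs(y_true - y_pred))
mae_kept = np.mean(np.abs(y_true[keep] - y_pred[keep]))

print(picp, 1 - mae_kept / mae_all)              # coverage and relative MAE reduction
```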
Result: SST-2 Transformer
At Transformer scale, the parameter story becomes decisive.
Low-Rank BBB parameters: 1.5M
13× fewer than Full-Rank BBB
33× fewer than Deep Ensemble
Performance profile: Low-Rank BBB reaches 0.806 accuracy, the best AUPR-In, and the second-best AUROC-OOD; Deep Ensemble remains strongest overall but costs 49.6M parameters.
Training time: Low-Rank trains in 8.2 min, vs 23.1 min for Full-Rank BBB and 64.7 min for Deep Ensemble.
Efficiency Evidence
The quality-efficiency frontier changes.
The result is not “low-rank always wins every metric”; it is that useful Bayesian uncertainty becomes plausible at modern parameter scales.
Why Rank Is Plausible
Observed singular value decay supports the rank-r constraint.
When layer spectra decay quickly, the manifold is not an arbitrary bottleneck. It is a structured approximation class.
Theory connection: EYM tail singular values determine the rank-approximation error; rank ablations and layer spectra guide practical rank selection.
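One plausible way to turn "spectra guide rank selection" into a procedure is an energy threshold on a layer's singular values; the rule and the 99% cutoff below are illustrative assumptions, not the paper's selection method.

```python
# Illustrative spectrum-guided rank selection: choose the smallest r whose leading
# singular values capture a target fraction of the layer's spectral energy.
import numpy as np

def rank_from_spectrum(W, energy=0.99):
    s = np.linalg.svd(W, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cumulative, energy) + 1)

rng = np.random.default_rng(5)
# A toy layer with fast spectral decay: most energy sits in a few directions.
W = rng.normal(size=(256, 64)) @ np.diag(np.exp(-0.3 * np.arange(64))) @ rng.normal(size=(64, 512))
print(rank_from_spectrum(W))   # small r suffices when the spectrum decays quickly
```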
Comparator: SWAG
A strong posterior baseline does not overturn the central tradeoff.
Setting | SWAG strength | Low-rank advantage
SST-2 | accuracy near-tied (0.808 vs 0.806) | better NLL/OOD; 1.47M params vs 208.37M
MIMIC | higher in-domain AUROC/NLL | OOD MI metrics: 0.802 / 0.788 vs 0.634 / 0.680
Beijing | higher coverage | much narrower/costlier tradeoff; stronger OOD for low-rank
SWAG is a credible baseline. However, SBNN’s geometry gives a better quality-efficiency path in the paper’s target regimes.
Honest Takeaway
The empirical message is not compression alone.
Where low-rank shines: OOD separation on MIMIC; coverage and selective prediction on Beijing; efficiency-adjusted uncertainty on SST-2.
Where ensembles remain strong: in-distribution likelihood and sharpness can favor Deep Ensembles; calibration gaps remain in some settings.
Why that matters: trustworthy AI often needs useful uncertainty under shift, abstention, and deployment constraints, not only marginal NLL.
SBNNs make that goal more practical: structured uncertainty, lower cost, and scalable to modern architectures.
Research Agenda
A platform for scalable Bayesian posteriors in modern architectures.
Adaptive rank selection: ranks chosen by spectra, validation tradeoffs, or sparse priors rather than fixed grids.
Bayesian Transformers: structured uncertainty for large sequence models without ensemble-scale cost.
Beyond Gaussian factors: Laplace, spike-and-slab, and richer factor distributions while keeping singular support.
Trustworthy AI: uncertainty that supports abstention, coverage, OOD awareness, and constrained deployment.
Closing
Singular posterior geometry for scalable Bayesian deep learning.