Visual odometry (VO) is fundamental to autonomous navigation, robotics, and augmented reality. While self-supervised learning has eliminated the need for expensive ground-truth labels in monocular VO, dynamic objects and occlusions that violate the static-scene assumption lead to erroneous pose estimates. Existing uncertainty-based methods filter unreliable regions but rely solely on single-frame information, neglecting temporal consistency across consecutive frames. We present Combined Projected Uncertainty (CoProU), a principled probabilistic formulation that propagates and fuses uncertainties across temporal frames. Our key insight is that robust uncertainty estimation requires combining target-frame uncertainty with projected uncertainty from reference frames, enabling effective identification of dynamic regions and temporal inconsistencies. We demonstrate CoProU's versatility through two complementary frameworks. CoProU-VO-2F employs a decoupled architecture with a CNN-based pose encoder and a vision transformer-based depth encoder for two-frame visual odometry. CoProU-VO-MF extends our approach to multi-frame scenarios using a unified transformer architecture with coupled encoders that produce shared representations for ego-motion and geometry estimation. This demonstrates that CoProU, though originally formulated for frame pairs, generalizes naturally to multi-frame settings through pairwise application. Comprehensive experiments validate our contributions. CoProU-VO-2F achieves substantial improvements over state-of-the-art two-frame methods, reducing ATE by up to 63% on KITTI and 33% on nuScenes. CoProU-VO-MF achieves 45% lower average ATE across KITTI, nuScenes, and Waymo compared to the large-scale pretrained VGGT baseline. Extensive ablation studies confirm the effectiveness of temporal uncertainty propagation and CoProU's adaptability across different architectural paradigms.
The original two-frame decoupled architecture (presented in our GCPR 2025 work) employs a CNN-based pose encoder alongside a Vision Transformer-based depth encoder.
Our extended multi-frame framework utilizes a unified transformer architecture with coupled encoders, producing shared representations for both ego-motion and geometry estimation, allowing uncertainty to be propagated seamlessly across multiple frames.
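As a rough illustration of what combining target-frame uncertainty with projected reference-frame uncertainty could look like, the sketch below works on per-pixel uncertainty maps in NumPy. It is not the formulation from the paper: the combination rule (adding variances as if the two estimates were independent), the Laplacian-style loss, and all function names are assumptions made for this example. The reference uncertainty is also assumed to have already been warped into the target view with the predicted depth and pose.

```python
import numpy as np

def combine_projected_uncertainty(sigma_target, sigma_ref_projected):
    """Fuse two per-pixel uncertainty maps (hypothetical rule).

    Assumption: both maps are standard deviations of independent errors,
    so their variances add. The paper's actual rule may differ.
    """
    return np.sqrt(sigma_target ** 2 + sigma_ref_projected ** 2)

def uncertainty_weighted_loss(residual, sigma, eps=1e-6):
    """Laplacian-style negative log-likelihood (assumed form).

    Pixels with large sigma contribute less to the residual term but pay
    a log(sigma) penalty, so sigma cannot grow without cost.
    """
    sigma = np.maximum(sigma, eps)
    return float(np.mean(np.abs(residual) / sigma + np.log(sigma)))

def multi_frame_loss(sigma_target, references):
    """Apply the pairwise combination to each reference frame and average.

    `references` is a list of (photometric_residual, projected_sigma) pairs,
    one per reference frame, all already aligned to the target view.
    """
    losses = []
    for residual, sigma_proj in references:
        combined = combine_projected_uncertainty(sigma_target, sigma_proj)
        losses.append(uncertainty_weighted_loss(residual, combined))
    return sum(losses) / len(losses)

# Toy 2x2 maps: the bottom-right pixel has a large projected uncertainty,
# standing in for a region that is dynamic in the reference frame.
sigma_t = np.array([[0.3, 0.4], [0.1, 0.2]])
sigma_r = np.array([[0.4, 0.3], [0.1, 2.0]])
residual = np.array([[0.05, 0.05], [0.02, 0.90]])

combined = combine_projected_uncertainty(sigma_t, sigma_r)
loss = multi_frame_loss(sigma_t, [(residual, sigma_r), (residual, sigma_r)])
print(combined)
print(loss)
```

In this toy case the bottom-right pixel ends up with the largest combined uncertainty even though the target frame alone rates it as fairly reliable, which is the behaviour the single-frame methods mentioned above cannot reproduce.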
We evaluate our coupled multi-frame formulation (CoProU-VO-MF) against our decoupled two-frame baseline (CoProU-VO-2F) and the large-scale pretrained VGGT model. Our method achieves state-of-the-art results purely through self-supervised training, without any ground-truth labels.
Best results are in bold, second best are underlined.
| Method | Trainable | Training sets (K, N, W) | KITTI ATE ↓ | KITTI t_err ↓ | KITTI r_err ↓ | nuScenes ATE ↓ | nuScenes RPE_trans ↓ | nuScenes RPE_rot ↓ | Waymo ATE ↓ | Waymo RPE_trans ↓ | Waymo RPE_rot ↓ | Avg. ATE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VGGT | -- | | 48.99 | 18.28 | 2.51 | 2.498 | 0.094 | 0.054 | 1.606 | 0.064 | 0.057 | 17.70 |
| VGGT* | D | ✓ ✓ ✓ | 41.87 | 17.15 | 2.65 | 2.126 | 0.078 | 0.055 | 1.484 | 0.063 | 0.057 | 15.16 |
| CoProU-VO-2F | D | ✓ | 30.40 | 10.65 | 2.37 | 3.548 | 0.169 | 0.204 | 6.842 | 0.321 | 0.297 | 13.60 |
| CoProU-VO-2F | D | ✓ ✓ | 70.11 | 20.38 | 6.13 | 0.859 | 0.031 | 0.085 | 0.960 | 0.046 | 0.057 | 23.98 |
| CoProU-VO-2F | D | ✓ ✓ ✓ | 59.20 | 16.59 | 5.01 | 0.489 | 0.024 | 0.054 | 0.909 | 0.046 | 0.056 | 20.20 |
| CoProU-VO-MF | E+D | ✓ | 35.91 | 12.19 | 3.78 | 2.012 | 0.091 | 0.123 | 3.150 | 0.146 | 0.166 | 13.69 |
| CoProU-VO-MF | E+D | ✓ ✓ | 36.89 | 11.28 | 3.35 | 1.044 | 0.039 | 0.079 | 0.896 | 0.037 | 0.041 | 12.94 |
| CoProU-VO-MF | E+D | ✓ ✓ ✓ | 27.97 | 9.49 | 2.94 | 0.557 | 0.025 | 0.044 | 0.830 | 0.036 | 0.040 | 9.79 |
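The Avg. ATE column appears consistent with a plain arithmetic mean of the three per-dataset ATE values, and the 45% figure quoted in the abstract matches the relative reduction from the pretrained VGGT row to the CoProU-VO-MF row with three training checkmarks. The short check below recomputes both from the table; the averaging rule is inferred from the numbers, not stated in the source, and the dictionary keys are labels chosen for this example.

```python
# Per-dataset ATE values (KITTI, nuScenes, Waymo) copied from the table.
ate = {
    "VGGT": (48.99, 2.498, 1.606),
    "CoProU-VO-MF (three training sets)": (27.97, 0.557, 0.830),
}

def average_ate(values):
    """Unweighted mean over datasets (inferred rule, see lead-in)."""
    return sum(values) / len(values)

def relative_reduction(baseline, ours):
    """Fraction by which `ours` is lower than `baseline`."""
    return 1.0 - ours / baseline

avg_vggt = average_ate(ate["VGGT"])
avg_mf = average_ate(ate["CoProU-VO-MF (three training sets)"])
reduction = relative_reduction(avg_vggt, avg_mf)

print(f"VGGT avg ATE:         {avg_vggt:.2f}")
print(f"CoProU-VO-MF avg ATE: {avg_mf:.2f}")
print(f"Relative reduction:   {100 * reduction:.0f}%")
```

Because KITTI ATE values are an order of magnitude larger than those on nuScenes and Waymo, an unweighted mean like this is dominated by the KITTI column.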
Our multi-frame approach yields robust global trajectories even in challenging dynamic environments where the baseline two-frame method exhibits drift.
Our approach enables high-quality multi-frame 3D reconstruction by accurately filtering dynamic objects and effectively leveraging geometric consistency across frames. The sequence shows the point cloud alongside camera poses learned entirely in a self-supervised fashion.
@article{xie2026combining,
title = {Combining Projected Uncertainty for Self-Supervised Visual Odometry: From Two-Frame to Multi-Frame},
author = {Xie, Jingchao and Dhaouadi, Oussema and Chen, Weirong and Meier, Johannes and Bauer, Zuria and Pollefeys, Marc and Cremers, Daniel},
journal = {International Journal of Computer Vision},
volume = {134},
number = {7},
pages = {330},
year = {2026},
month = {Jun},
day = {30},
issn = {1573-1405},
doi = {10.1007/s11263-026-02915-y},
url = {https://doi.org/10.1007/s11263-026-02915-y}
}