Combined Projected Uncertainty for Visual Odometry

Combining Projected Uncertainty for Self-Supervised Visual Odometry:
From Two-Frame to Multi-Frame

1 TUM     2 ETH Zurich     3 MCML     4 Microsoft
* shared first authorship
International Journal of Computer Vision (IJCV) 2026
We propose CoProU-VO, an unsupervised visual odometry method that improves pose estimation in dynamic scenes by propagating and combining uncertainty across consecutive frames.

Abstract

Visual odometry (VO) is fundamental to autonomous navigation, robotics, and augmented reality. While self-supervised learning has eliminated the need for expensive ground-truth labels in monocular VO, dynamic objects and occlusions that violate the static scene assumption lead to erroneous pose estimates. Existing uncertainty-based methods filter unreliable regions but rely solely on single-frame information, neglecting temporal consistency across consecutive frames. We present Combined Projected Uncertainty (CoProU), a principled probabilistic formulation that propagates and fuses uncertainties across temporal frames. Our key insight is that robust uncertainty estimation requires combining target frame uncertainty with projected uncertainty from reference frames, enabling effective identification of dynamic regions and temporal inconsistencies. We demonstrate CoProU's versatility through two complementary frameworks. CoProU-VO-2F employs a decoupled architecture with a CNN-based pose encoder and a Vision Transformer-based depth encoder for two-frame visual odometry. CoProU-VO-MF extends our approach to multi-frame scenarios using a unified transformer architecture with coupled encoders that produce shared representations for ego-motion and geometry estimation. This demonstrates that CoProU, though originally formulated for frame pairs, generalizes naturally to multi-frame settings through pairwise application. Comprehensive experiments validate our contributions. CoProU-VO-2F achieves substantial improvements over state-of-the-art two-frame methods, reducing ATE by up to 63% on KITTI and 33% on nuScenes. CoProU-VO-MF achieves 45% lower average ATE across KITTI, nuScenes, and Waymo compared to the large-scale pretrained VGGT baseline. Extensive ablation studies confirm the effectiveness of temporal uncertainty propagation and CoProU's adaptability across different architectural paradigms.
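
The core idea of combining the target-frame uncertainty with the uncertainty projected from a reference frame can be illustrated with a minimal NumPy sketch. It assumes the two per-pixel errors are independent, so their variances add, and it uses a Gaussian negative log-likelihood as the uncertainty-weighted loss. Both choices, along with the names `combine_uncertainty` and `gaussian_nll`, are illustrative assumptions and not necessarily the paper's exact formulation. The reference uncertainty is taken to be already warped into the target view.

```python
import numpy as np

def combine_uncertainty(sigma_target, sigma_ref_projected):
    """Combine per-pixel standard deviations.

    Assumption: target and projected-reference errors are independent,
    so the variance of their difference is the sum of the variances.
    """
    return np.sqrt(sigma_target ** 2 + sigma_ref_projected ** 2)

def gaussian_nll(residual, sigma, eps=1e-6):
    """Gaussian negative log-likelihood (up to a constant), averaged over pixels.

    Assumption: this loss form is a common heteroscedastic choice,
    used here only for illustration.
    """
    sigma = np.maximum(sigma, eps)
    return float(np.mean(0.5 * (residual / sigma) ** 2 + np.log(sigma)))

# Toy 2x2 example: the bottom-right pixel is "dynamic" in the reference frame,
# so its projected uncertainty and its photometric residual are both large.
sigma_t = np.array([[0.1, 0.1], [0.1, 0.1]])
sigma_r = np.array([[0.1, 0.1], [0.1, 2.0]])
residual = np.array([[0.05, 0.05], [0.05, 1.5]])

sigma_c = combine_uncertainty(sigma_t, sigma_r)
weights = 1.0 / sigma_c ** 2          # effective per-pixel weight on the squared residual
loss = gaussian_nll(residual, sigma_c)
```

In this toy case the target frame alone reports low uncertainty everywhere, yet the combined map still down-weights the bottom-right pixel because the reference frame flags it as unreliable.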

Architecture Overviews

CoProU-VO-2F (Baseline Two-Frame Architecture)

The original two-frame decoupled architecture (presented in our GCPR 2025 work) employs a CNN-based pose encoder alongside a Vision Transformer-based depth encoder.

CoProU-VO-MF (New Multi-Frame Architecture)

Our extended multi-frame framework utilizes a unified transformer architecture with coupled encoders, producing shared representations for both ego-motion and geometry estimation, allowing uncertainty to be propagated seamlessly across multiple frames.
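
How a pairwise formulation can extend to a window of frames is sketched below. This is only one plausible reading: which (target, reference) pairs are actually used is not specified here, so the sketch enumerates all ordered pairs in the window. The `project` callable, which would warp a reference uncertainty map into the target view, is a hypothetical placeholder, and the variance-sum rule assumes independent per-frame errors.

```python
from itertools import permutations
import numpy as np

def frame_pairs(num_frames):
    """All ordered (target, reference) index pairs with target != reference."""
    return list(permutations(range(num_frames), 2))

def pairwise_combined_uncertainty(sigmas, project):
    """Apply a two-frame combination rule to every pair in a window.

    sigmas:  list of per-frame uncertainty maps (standard deviations).
    project: hypothetical callable (sigma_ref, ref_idx, tgt_idx) -> sigma_ref
             warped into the target view.
    """
    combined = {}
    for tgt, ref in frame_pairs(len(sigmas)):
        sigma_proj = project(sigmas[ref], ref, tgt)
        # Assumption: independent errors, so variances add.
        combined[(tgt, ref)] = np.sqrt(sigmas[tgt] ** 2 + sigma_proj ** 2)
    return combined

def identity_warp(sigma_ref, ref_idx, tgt_idx):
    """Placeholder warp: returns the map unchanged (no real geometry)."""
    return sigma_ref

# Three-frame window with constant uncertainty maps.
sigmas = [np.full((2, 2), s) for s in (0.1, 0.2, 0.3)]
combined = pairwise_combined_uncertainty(sigmas, identity_warp)
```

A real system would replace `identity_warp` with a depth- and pose-based reprojection; the point of the sketch is only that the two-frame rule needs no modification to be reused across a window.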

Multi-Frame Visual Odometry Results

We evaluate our coupled multi-frame formulation (CoProU-VO-MF) against our decoupled two-frame baseline (CoProU-VO-2F) and the large-scale pretrained VGGT model. The best CoProU-VO-MF configuration achieves the lowest average ATE in the table (9.79, compared with 17.70 for VGGT), a state-of-the-art result obtained purely through self-supervised training without any ground truth.

Best results are in bold, second best are underlined.

Method        Trainable  Training  KITTI                 nuScenes                    Waymo                       Avg. ATE
                         K N W     ATE ↓   t_err  r_err  ATE ↓   RPE_trans  RPE_rot  ATE ↓   RPE_trans  RPE_rot
VGGT          --                   48.99   18.28  2.51   2.498   0.094      0.054    1.606   0.064      0.057    17.70
VGGT*         D                    41.87   17.15  2.65   2.126   0.078      0.055    1.484   0.063      0.057    15.16
CoProU-VO-2F  D                    30.40   10.65  2.37   3.548   0.169      0.204    6.842   0.321      0.297    13.60
CoProU-VO-2F  D                    70.11   20.38  6.13   0.859   0.031      0.085    0.960   0.046      0.057    23.98
CoProU-VO-2F  D                    59.20   16.59  5.01   0.489   0.024      0.054    0.909   0.046      0.056    20.20
CoProU-VO-MF  E+D                  35.91   12.19  3.78   2.012   0.091      0.123    3.150   0.146      0.166    13.69
CoProU-VO-MF  E+D                  36.89   11.28  3.35   1.044   0.039      0.079    0.896   0.037      0.041    12.94
CoProU-VO-MF  E+D                  27.97   9.49   2.94   0.557   0.025      0.044    0.830   0.036      0.040    9.79
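
The Avg. ATE column and the 45% reduction quoted in the abstract can be checked with a few lines of arithmetic. The sketch below assumes that Avg. ATE is the unweighted mean of the three per-dataset ATE values, which is consistent with the numbers in the table. The row labels and the helper `mean_ate` are introduced here for illustration only.

```python
# Per-dataset ATE values (KITTI, nuScenes, Waymo) copied from the table.
ate = {
    "VGGT": (48.99, 2.498, 1.606),
    "CoProU-VO-MF (last row)": (27.97, 0.557, 0.830),
}

def mean_ate(values):
    """Unweighted mean over datasets (assumed definition of Avg. ATE)."""
    return sum(values) / len(values)

avg = {name: mean_ate(values) for name, values in ate.items()}

# Relative reduction of the multi-frame model with respect to VGGT.
reduction = 1.0 - avg["CoProU-VO-MF (last row)"] / avg["VGGT"]

for name, value in avg.items():
    print(f"{name}: {value:.2f}")
print(f"relative reduction: {reduction:.1%}")
```

The means reproduce the tabulated 17.70 and 9.79, and the relative reduction comes out just under 45%, matching the rounded figure in the abstract.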

VO qualitative results

Our multi-frame approach yields robust global trajectories even in challenging dynamic environments where the baseline two-frame method exhibits drift.

3D reconstruction results

Our approach enables high-quality multi-frame 3D reconstruction by accurately filtering dynamic objects and effectively leveraging geometric consistency across frames. The sequence demonstrates the point cloud alongside camera poses learned entirely in a self-supervised fashion.

BibTeX

@article{xie2026combining,
  title     = {Combining Projected Uncertainty for Self-Supervised Visual Odometry: From Two-Frame to Multi-Frame},
  author    = {Xie, Jingchao and Dhaouadi, Oussema and Chen, Weirong and Meier, Johannes and Bauer, Zuria and Pollefeys, Marc and Cremers, Daniel},
  journal   = {International Journal of Computer Vision},
  volume    = {134},
  number    = {7},
  pages     = {330},
  year      = {2026},
  month     = {Jun},
  day       = {30},
  issn      = {1573-1405},
  doi       = {10.1007/s11263-026-02915-y},
  url       = {https://doi.org/10.1007/s11263-026-02915-y}
}