Combined Projected Uncertainty for Visual Odometry

IJCV 2026
GCPR 2025

Combining Projected Uncertainty for Self-Supervised Visual Odometry:
From Two-Frame to Multi-Frame

Jingchao Xie^*,1,3, Oussema Dhaouadi^*,1,2,3, Weirong Chen^1,3, Johannes Meier^1,3, Zuria Bauer^2, Marc Pollefeys^2,4, Daniel Cremers^1,3

¹ TUM ² ETH Zurich ³ MCML ⁴ Microsoft

* shared first authorship

International Journal of Computer Vision (IJCV) 2026

We propose CoProU-VO, an unsupervised visual odometry method that improves pose estimation in dynamic scenes by propagating and combining uncertainty across consecutive frames.

Abstract

Visual odometry (VO) is fundamental to autonomous navigation, robotics, and augmented reality. While self-supervised learning has eliminated the need for expensive ground-truth labels in monocular VO, dynamic objects and occlusions that violate the static scene assumption lead to erroneous pose estimates. Existing uncertainty-based methods filter unreliable regions but rely solely on single-frame information, neglecting temporal consistency across consecutive frames. We present Combined Projected Uncertainty (CoProU), a principled probabilistic formulation that propagates and fuses uncertainties across temporal frames. Our key insight is that robust uncertainty estimation requires combining target frame uncertainty with projected uncertainty from reference frames, enabling effective identification of dynamic regions and temporal inconsistencies. We demonstrate CoProU's versatility through two complementary frameworks. CoProU-VO-2F employs a decoupled architecture with a CNN-based pose encoder and a vision transformer-based depth encoder for two-frame visual odometry. CoProU-VO-MF extends our approach to multi-frame scenarios using a unified transformer architecture with coupled encoders that produce shared representations for ego-motion and geometry estimation. This demonstrates that CoProU, though originally formulated for frame pairs, generalizes naturally to multi-frame settings through pairwise application. Comprehensive experiments validate our contributions. CoProU-VO-2F achieves substantial improvements over state-of-the-art two-frame methods, reducing ATE by up to 63% on KITTI and 33% on nuScenes. CoProU-VO-MF achieves 45% lower average ATE across KITTI, nuScenes, and Waymo compared to the large-scale pretrained VGGT baseline. Extensive ablation studies confirm the effectiveness of temporal uncertainty propagation and CoProU's adaptability across different architectural paradigms.

Architecture Overviews

CoProU-VO-2F (Baseline Two-Frame Architecture)

The original two-frame decoupled architecture (presented in our GCPR 2025 work) employs a CNN-based pose encoder alongside a Vision Transformer-based depth encoder.

CoProU-VO-MF (New Multi-Frame Architecture)

Our extended multi-frame framework utilizes a unified transformer architecture with coupled encoders, producing shared representations for both ego-motion and geometry estimation, allowing uncertainty to be propagated seamlessly across multiple frames.
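The pairwise extension to multiple frames can be illustrated with a minimal Python sketch. It assumes a root-sum-of-squares fusion of per-pixel standard deviations (a standard result for independent Gaussian errors, not necessarily the exact formulation used in the paper), visits every ordered (target, reference) pair in a window, and uses a hypothetical `project` callback standing in for depth- and pose-based warping. The names `pairwise_coprou` and `project`, and the choice of all ordered pairs, are assumptions made for illustration.

```python
from itertools import permutations

import numpy as np


def pairwise_coprou(sigmas, project):
    """Apply a two-frame uncertainty fusion to every ordered frame pair.

    sigmas  : list of per-pixel uncertainty maps (H x W arrays), one per frame.
    project : hypothetical callback (sigma_ref, ref_idx, tgt_idx) -> H x W array
              that warps the reference uncertainty into the target view.
    """
    combined = {}
    for tgt, ref in permutations(range(len(sigmas)), 2):
        sigma_proj = project(sigmas[ref], ref, tgt)
        # Assumed fusion rule: variances of independent errors add.
        combined[(tgt, ref)] = np.sqrt(sigmas[tgt] ** 2 + sigma_proj ** 2)
    return combined


# Toy window of three 2x2 uncertainty maps; the identity "warp" below is a
# placeholder for real depth- and pose-based projection.
window = [np.full((2, 2), s) for s in (0.3, 0.4, 1.2)]
fused = pairwise_coprou(window, project=lambda sigma, ref, tgt: sigma)
print(len(fused))            # number of ordered pairs in the window
print(fused[(0, 1)][0, 0])   # fused uncertainty for target 0, reference 1
```

A window of N frames yields N(N-1) ordered pairs under this scheme; restricting the loop to neighbouring frames would be an equally plausible design.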

Multi-Frame Visual Odometry Results

We evaluate our coupled multi-frame formulation (CoProU-VO-MF) against our decoupled two-frame baseline (CoProU-VO-2F) and the large-scale pretrained VGGT model. Trained on all three datasets, CoProU-VO-MF achieves the lowest average ATE (9.79) as well as the lowest ATE on KITTI and Waymo, purely through self-supervised training without any ground-truth labels.

K, N, and W mark training on KITTI, nuScenes, and Waymo, respectively. Lower is better for all metrics (↓).

Method	Trainable	Training: K	Training: N	Training: W	KITTI ATE ↓	KITTI t_err ↓	KITTI r_err ↓	nuScenes ATE ↓	nuScenes RPE_trans ↓	nuScenes RPE_rot ↓	Waymo ATE ↓	Waymo RPE_trans ↓	Waymo RPE_rot ↓	Avg. ATE
VGGT	--				48.99	18.28	2.51	2.498	0.094	0.054	1.606	0.064	0.057	17.70
VGGT*	D	✓	✓	✓	41.87	17.15	2.65	2.126	0.078	0.055	1.484	0.063	0.057	15.16
CoProU-VO-2F	D	✓			30.40	10.65	2.37	3.548	0.169	0.204	6.842	0.321	0.297	13.60
CoProU-VO-2F	D	✓		✓	70.11	20.38	6.13	0.859	0.031	0.085	0.960	0.046	0.057	23.98
CoProU-VO-2F	D	✓	✓	✓	59.20	16.59	5.01	0.489	0.024	0.054	0.909	0.046	0.056	20.20
CoProU-VO-MF	E+D	✓			35.91	12.19	3.78	2.012	0.091	0.123	3.150	0.146	0.166	13.69
CoProU-VO-MF	E+D	✓		✓	36.89	11.28	3.35	1.044	0.039	0.079	0.896	0.037	0.041	12.94
CoProU-VO-MF	E+D	✓	✓	✓	27.97	9.49	2.94	0.557	0.025	0.044	0.830	0.036	0.040	9.79
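The Avg. ATE column is consistent with an unweighted mean of the three per-dataset ATE values, and the 45% figure quoted in the abstract follows from the VGGT row and the last CoProU-VO-MF row. The short Python check below reproduces both numbers from the table; the dictionary keys are labels chosen here for illustration.

```python
# Per-dataset ATE (KITTI, nuScenes, Waymo) copied from the table above.
ate = {
    "VGGT": (48.99, 2.498, 1.606),
    "CoProU-VO-MF (K+N+W)": (27.97, 0.557, 0.830),
}

# Unweighted mean over the three datasets.
avg_ate = {name: sum(vals) / len(vals) for name, vals in ate.items()}

# Relative reduction of CoProU-VO-MF with respect to VGGT.
reduction = 1.0 - avg_ate["CoProU-VO-MF (K+N+W)"] / avg_ate["VGGT"]

for name, value in avg_ate.items():
    print(f"{name}: {value:.2f}")
print(f"relative reduction: {reduction:.1%}")
```

The averages come out at 17.70 and 9.79, and the relative reduction at 44.7%, which rounds to the reported 45%. Because the KITTI ATE values are an order of magnitude larger than those on nuScenes and Waymo, the average is dominated by the KITTI column.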

VO qualitative results

Our multi-frame approach yields robust global trajectories even in challenging dynamic environments where the baseline two-frame method exhibits drift.

3D reconstruction results

Our approach enables high-quality multi-frame 3D reconstruction by accurately filtering dynamic objects and effectively leveraging geometric consistency across frames. The sequence demonstrates the point cloud alongside camera poses learned entirely in a self-supervised fashion.

BibTeX

@article{xie2026combining,
  title     = {Combining Projected Uncertainty for Self-Supervised Visual Odometry: From Two-Frame to Multi-Frame},
  author    = {Xie, Jingchao and Dhaouadi, Oussema and Chen, Weirong and Meier, Johannes and Bauer, Zuria and Pollefeys, Marc and Cremers, Daniel},
  journal   = {International Journal of Computer Vision},
  volume    = {134},
  number    = {7},
  pages     = {330},
  year      = {2026},
  month     = {Jun},
  day       = {30},
  issn      = {1573-1405},
  doi       = {10.1007/s11263-026-02915-y},
  url       = {https://doi.org/10.1007/s11263-026-02915-y}
}

CoProU-VO: Combining Projected Uncertainty for End-to-End
Unsupervised Monocular Visual Odometry

Jingchao Xie^*,1,3, Oussema Dhaouadi^*,1,2,3, Weirong Chen^1,3, Johannes Meier^1,3, Jacques Kaiser^2, Daniel Cremers^1,3

¹ TUM ² DeepScenario ³ MCML

* shared first authorship

German Conference on Pattern Recognition (GCPR) 2025
Oral · Best Paper Award


We propose CoProU-VO, an unsupervised visual odometry method that improves pose estimation in dynamic scenes by propagating and combining uncertainty across consecutive frames.

Abstract

Visual Odometry (VO) is fundamental to autonomous navigation, robotics, and augmented reality, with unsupervised approaches eliminating the need for expensive ground-truth labels. However, these methods struggle when dynamic objects violate the static scene assumption, leading to erroneous pose estimates. We tackle this problem with uncertainty modeling, a commonly used technique that creates robust masks to filter out dynamic objects and occlusions without requiring explicit motion segmentation. Traditional uncertainty modeling considers only single-frame information, overlooking the uncertainties across consecutive frames. Our key insight is that uncertainty must be propagated and combined across temporal frames to effectively identify unreliable regions, particularly in dynamic scenes. To address this challenge, we introduce Combined Projected Uncertainty VO (CoProU-VO), a novel end-to-end approach that combines target frame uncertainty with projected reference frame uncertainty using a principled probabilistic formulation. Built upon vision transformer backbones, our model simultaneously learns depth, uncertainty estimation, and camera poses. Experiments on the KITTI and nuScenes datasets demonstrate significant improvements over previous unsupervised monocular end-to-end two-frame-based methods and strong performance in challenging highway scenes where other approaches often fail. Additionally, comprehensive ablation studies validate the effectiveness of cross-frame uncertainty propagation.

Contribution

We compare the self-discovered dynamic mask from SC-Depth, the single uncertainty approach used in D3VO and KPDepth, and our proposed CoProU. The masks are predicted from both target and reference images. Our CoProU accurately identifies and masks regions affected by dynamic objects with sharp boundaries. In contrast, the single uncertainty and self-discovered masks fail to fully cover dynamic regions and incorrectly mask static areas, producing blurry boundaries around object contours.

Method

Given two consecutive frames (target I_t and reference I_t′), (1) features are extracted using a pre-trained vision transformer backbone, (2) depth maps and uncertainty estimates are produced through a decoder network for both frames, (3) relative camera pose is predicted by a PoseNet module, (4) projection and warping operations synthesize views between frames, and (5) our novel CoProU-VO module integrates uncertainty information from both target and reference frames, which is used to (6) compute the uncertainty-aware loss.
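Steps (5) and (6) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the reference uncertainty has already been warped into the target view, fuses the two maps by adding variances (valid for independent Gaussian errors), and uses a loss of the form |r|/σ + log σ, which is common in uncertainty-aware self-supervised depth and VO. The function names and toy values are hypothetical.

```python
import numpy as np


def combine_projected_uncertainty(sigma_target, sigma_ref_projected):
    """Fuse per-pixel standard deviations, assuming independent Gaussian errors."""
    return np.sqrt(sigma_target ** 2 + sigma_ref_projected ** 2)


def uncertainty_aware_loss(residual, sigma, eps=1e-6):
    """Mean of |r| / sigma + log(sigma).

    Large sigma down-weights the residual; the log term penalizes
    inflating sigma everywhere.
    """
    sigma = np.maximum(sigma, eps)  # guard against division by zero
    return float(np.mean(np.abs(residual) / sigma + np.log(sigma)))


# Toy 2x2 example: the bottom-right pixel is covered by a moving object.
residual = np.array([[0.1, 0.1], [0.1, 0.9]])  # photometric residual
sigma_t = np.full((2, 2), 0.3)                 # target uncertainty (uniform)
sigma_r = np.array([[0.4, 0.4], [0.4, 1.2]])   # projected reference uncertainty

sigma_c = combine_projected_uncertainty(sigma_t, sigma_r)
weights = 1.0 / sigma_c                        # effective per-pixel weight
loss = uncertainty_aware_loss(residual, sigma_c)

print(sigma_c)
print(weights)
print(loss)
```

In this toy case the pixel covered by the moving object receives a smaller weight 1/σ than the static pixels, even though the target-frame uncertainty alone is uniform, which is the effect that combining the two uncertainties is meant to achieve.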

Results

Visual Odometry Results

In sequence 01, the hybrid method DF-VO fails completely and SC-Depth exhibits larger deviations than our method, which produces a significantly more accurate trajectory. In sequence 09, CoProU-VO with DepthAnythingV2 performs comparably to established hybrid methods, including DF-VO and KP-Depth-VO, without the computational overhead of their multi-stage processing pipelines.

Visual Comparisons of Learned Uncertainty

[Figure: four side-by-side comparisons of learned uncertainty maps, CoProU (Ours) vs. D3VO - Single Uncertainty]

Visualizations in Sequences

KITTI

Sequence 4: Depth, RGB, Target View Uncertainty
Sequence 5: Depth, RGB, Target View Uncertainty

nuScenes

Scene 0685 (Front Camera): Depth, RGB, Target View Uncertainty
Scene 0733 (Front Camera): Depth, RGB, Target View Uncertainty

BibTeX

@inproceedings{xie2025coprou,
  title        = {CoProU-VO: Combining Projected Uncertainty for End-to-End Unsupervised Monocular Visual Odometry},
  author       = {Xie, Jingchao and Dhaouadi, Oussema and Chen, Weirong and Meier, Johannes and Kaiser, Jacques and Cremers, Daniel},
  booktitle    = {DAGM German Conference on Pattern Recognition},
  year         = {2025}
}
