



Visual Odometry (VO) is fundamental to autonomous navigation, robotics, and augmented reality, with unsupervised approaches eliminating the need for expensive ground-truth labels. However, these methods struggle when dynamic objects violate the static scene assumption, leading to erroneous pose estimates. We tackle this problem with uncertainty modeling, a widely used technique that produces robust masks to filter out dynamic objects and occlusions without requiring explicit motion segmentation. Traditional uncertainty modeling considers only single-frame information, overlooking the uncertainties that accumulate across consecutive frames. Our key insight is that uncertainty must be propagated and combined across temporal frames to effectively identify unreliable regions, particularly in dynamic scenes. To address this challenge, we introduce Combined Projected Uncertainty VO (CoProU-VO), a novel end-to-end approach that combines target frame uncertainty with projected reference frame uncertainty using a principled probabilistic formulation. Built upon vision transformer backbones, our model simultaneously learns depth, uncertainty estimation, and camera poses. Experiments on the KITTI and nuScenes datasets demonstrate significant improvements over previous unsupervised monocular end-to-end two-frame-based methods, with strong performance in challenging highway scenes where other approaches often fail. Comprehensive ablation studies further validate the effectiveness of cross-frame uncertainty propagation.
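The core idea can be illustrated with a short sketch: the reference frame's uncertainty map is warped into the target view with the same projection used for view synthesis, and the two per-pixel uncertainties are fused into a single robustness weight for the photometric loss. The PyTorch snippet below is a minimal illustration only; the function names, the independence assumption behind the sum-of-variances fusion, and the Laplacian-style loss are our own simplifications, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def combined_projected_uncertainty(sigma_tgt, sigma_ref, grid):
    """Fuse target-frame uncertainty with reference-frame uncertainty
    projected (warped) into the target view.

    sigma_tgt: (B, 1, H, W) per-pixel uncertainty predicted for the target frame
    sigma_ref: (B, 1, H, W) per-pixel uncertainty predicted for the reference frame
    grid:      (B, H, W, 2) normalized sampling grid obtained from projecting
               target pixels into the reference frame (depth + pose + intrinsics)

    Assumption: independent per-pixel noise, so variances add; the exact
    probabilistic combination used in CoProU-VO may differ.
    """
    # Warp the reference uncertainty map into the target view
    sigma_ref_proj = F.grid_sample(
        sigma_ref, grid, padding_mode="border", align_corners=True
    )
    # Combine under an independence assumption (sum of variances)
    return torch.sqrt(sigma_tgt ** 2 + sigma_ref_proj ** 2)


def uncertainty_aware_photometric_loss(photo_residual, sigma_comb, eps=1e-6):
    """Laplacian-style negative log-likelihood: pixels with large combined
    uncertainty (e.g. dynamic objects, occlusions) are down-weighted."""
    return (photo_residual / (sigma_comb + eps) + torch.log(sigma_comb + eps)).mean()
```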
We compare the self-discovered dynamic mask from SC-Depth, the single-frame uncertainty used in D3VO and KPDepth, and our proposed CoProU. The masks are predicted from both the target and reference images. Our CoProU accurately identifies and masks regions affected by dynamic objects with sharp boundaries. In contrast, the single-frame uncertainty and self-discovered masks fail to fully cover dynamic regions and incorrectly mask static areas, producing blurry boundaries around object contours.
Given two consecutive frames (target I_t and reference I_t'), (1) features are extracted using a pre-trained vision transformer backbone, (2) depth maps and uncertainty estimates are produced through a decoder network for both frames, (3) relative camera pose is predicted by a PoseNet module, (4) projection and warping operations synthesize views between frames, and (5) our novel CoProU-VO module integrates uncertainty information from both target and reference frames, which is used to (6) compute the uncertainty-aware loss.
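For concreteness, the projection and warping step (4) can be sketched as follows: target pixels are back-projected with the predicted depth, transformed by the predicted relative pose, re-projected with the camera intrinsics, and the resulting grid is used to sample the reference frame. This is a generic self-supervised warping sketch; the tensor shapes, variable names, and interface are assumptions rather than the released CoProU-VO code.

```python
import torch
import torch.nn.functional as F


def warp_reference_to_target(img_ref, depth_tgt, T_tgt_to_ref, K, K_inv):
    """Synthesize the target view from the reference image (step 4 above).

    img_ref:      (B, 3, H, W) reference frame I_t'
    depth_tgt:    (B, 1, H, W) predicted target depth
    T_tgt_to_ref: (B, 4, 4) predicted relative pose (target -> reference)
    K, K_inv:     (B, 3, 3) camera intrinsics and their inverse

    Returns the warped reference image and the normalized sampling grid
    (which can also be reused to project the reference uncertainty map).
    """
    B, _, H, W = depth_tgt.shape
    device = depth_tgt.device

    # Pixel grid in homogeneous coordinates: (B, 3, H*W)
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(1, 3, -1).expand(B, -1, -1)

    # Back-project with predicted depth, move to the reference frame, re-project
    cam_points = depth_tgt.view(B, 1, -1) * (K_inv @ pix)                    # (B, 3, H*W)
    cam_points = torch.cat([cam_points, torch.ones(B, 1, H * W, device=device)], dim=1)
    ref_points = (T_tgt_to_ref @ cam_points)[:, :3]                          # (B, 3, H*W)
    proj = K @ ref_points
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)                          # pixel coordinates

    # Normalize to [-1, 1] for grid_sample
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)

    img_ref_warped = F.grid_sample(img_ref, grid, padding_mode="border", align_corners=True)
    return img_ref_warped, grid
```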
On sequence 01, the hybrid method DF-VO fails completely and SC-Depth drifts noticeably, while our method produces significantly more accurate trajectories. On sequence 09, CoProU-VO with DepthAnythingV2 matches established hybrid methods such as DF-VO and KP-Depth-VO, showing that our approach reaches comparable accuracy without the computational overhead of their multi-stage processing pipelines.
@inproceedings{xie2025coprou,
title = {CoProU-VO: Combining Projected Uncertainty for End-to-End Unsupervised Monocular Visual Odometry},
author = {Xie, Jingchao and Dhaouadi, Oussema and Chen, Weirong and Meier, Johannes and Kaiser, Jacques and Cremers, Daniel},
booktitle = {DAGM German Conference on Pattern Recognition},
year = {2025}
}