Pablo Vela (@pablovelagomez1)
2025-04-17 | ❤️ 124 | 🔁 10
Visualized with @rerundotio, I’ve integrated video-based depth estimation into my robot-training pipeline to make data collection as accessible as possible—without requiring specialized hardware. Traditionally, achieving accurate depth requires multiple calibrated, time‑synchronized cameras or expensive sensors like Intel RealSense, which are cost‑prohibitive and tricky to set up. With this new addition, I’m one step closer to supporting just a common phone or webcam.
Originally, I ran VGGT over multiview video sequences, which delivers good consistency across views but suffers from:
- Low throughput. On an RTX 5090, processing eight 40-second streams (640×480 at 30 FPS) takes about 15 minutes, so a 5-minute recording requires nearly 2 hours of compute.
- Occasional catastrophic failures, where VGGT’s predictions collapse and consistency is lost.
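The throughput figure above can be checked with a little arithmetic. This is a minimal sketch that assumes compute time scales linearly with recording length (the post does not state this, though its 2-hour estimate implies it); the function name is hypothetical.

```python
def extrapolate_compute_minutes(sample_recording_s, sample_compute_min, target_recording_s):
    """Scale a measured compute time to a longer recording.

    Assumes compute time grows linearly with recording length,
    which is an assumption, not a measured property of VGGT.
    """
    return sample_compute_min * target_recording_s / sample_recording_s


# Figures quoted in the post: a 40-second recording takes about 15 minutes.
vggt_minutes = extrapolate_compute_minutes(40, 15, 5 * 60)

# Compute seconds spent per second of recording.
slowdown_factor = (15 * 60) / 40

print(f"5-minute recording: ~{vggt_minutes} min of compute")
print(f"Slower than real time by a factor of {slowdown_factor}")
```

Under that assumption, a 5-minute recording needs about 112.5 minutes, which matches the "nearly 2 hours" estimate.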
To speed things up, I evaluated Video Depth Anything (VDA), which trades off some accuracy for performance:
- Roughly 5× faster. VDA processes the same 5-minute video in ~20 minutes instead of nearly 2 hours.
- Greater robustness. I’ve seen no large-scale failures.
The downside is noticeably poorer multiview consistency and temporal stability. To bridge the gap, I’m exploring a splatting refinement step: using DN‑Splatter to optimize the initial depth maps with a photometric-rendering loss and a Pearson-depth loss, which should improve consistency. But that adds runtime again, so we’re back to trading speed for quality. I’m still not sure what the right solution to this is.
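For intuition about the Pearson-depth term, here is a minimal pure-Python sketch of one common formulation: one minus the Pearson correlation between rendered depth and the predicted depth prior. It is an illustration under that assumption, not DN‑Splatter's actual implementation, and the function name is hypothetical. A real pipeline would compute this on tensors so that gradients flow.

```python
import math


def pearson_depth_loss(rendered, prior):
    """Return 1 - Pearson correlation between two flattened depth maps.

    The loss is 0 when the depths are perfectly positively correlated,
    so it ignores any positive scale and shift between the two maps.
    """
    if len(rendered) != len(prior) or len(rendered) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(rendered)
    mean_r = sum(rendered) / n
    mean_p = sum(prior) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(rendered, prior))
    var_r = sum((r - mean_r) ** 2 for r in rendered)
    var_p = sum((p - mean_p) ** 2 for p in prior)
    denom = math.sqrt(var_r * var_p)
    if denom == 0.0:
        raise ValueError("constant depth map: correlation is undefined")
    return 1.0 - cov / denom


# Hypothetical depth values: the prior differs from the render by scale and shift only.
prior_depth = [1.0, 2.0, 3.0, 4.0]
rendered_depth = [2.0 * d + 0.5 for d in prior_depth]
aligned_loss = pearson_depth_loss(rendered_depth, prior_depth)

# Reversing the depth ordering gives the worst case.
flipped_loss = pearson_depth_loss(rendered_depth[::-1], prior_depth)
```

Because the correlation is unaffected by positive scale and shift, a term like this suits monocular or video depth predictions whose absolute scale is unreliable, while the photometric loss ties the geometry to the images.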
Overall, this experiment is promising, but not yet a substitute for dedicated depth sensors. A more reliable interim solution might be using an iPhone or iPad with LiDAR—still more accessible than multiple RealSense units, though less universal than plain webcams. I’ll release this code alongside the rest of our Gradio annotation pipeline.
From here, I’ll be testing on some self-collected data instead of using the wonderful baseline HOCAP dataset!
Tags
domain-vision-3d domain-rendering domain-robotics domain-ai-ml domain-dev-tools domain-crypto domain-visionos