Pablo Vela (@pablovelagomez1)
2025-09-29 | ❤️ 26 | 🔁 4
It's finally done: I've finished ripping out my full-body pipeline and replaced it with a hands-only version, which is critical for making it work in a lot more scenarios! I've visualized the final predictions with @rerundotio!
I want to emphasize that these are not the ground-truth values provided by the wonderful HOCap dataset, but rather predictions from my pipeline, written from the ground up!
For context, it consists of 4 parts:
- Exo/ego camera estimation
- Hand shape calibration
- Per-view 2D keypoint estimation
- Hand pose optimization
At the end of it all, I have a pipeline where you input synchronized videos and it outputs fully tracked per-view 2D keypoints, bounding boxes, 3D keypoints, and MANO joint angles + hand shape!
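To make that output concrete, here's a rough sketch of what a per-sequence result could look like; all the names and shapes below are my own illustration, not the actual pipeline code:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class HandTrackingResult:
    """Illustrative shape of the pipeline's outputs (my guess, not the real code)."""

    keypoints_2d: np.ndarray  # (num_frames, num_views, 21, 2) tracked per-view 2D keypoints
    boxes_2d: np.ndarray      # (num_frames, num_views, 4) per-view hand bounding boxes
    keypoints_3d: np.ndarray  # (num_frames, 21, 3) 3D joints in world space
    mano_pose: np.ndarray     # (num_frames, 48) global orientation + 15 joint axis-angles
    mano_betas: np.ndarray    # (10,) MANO shape coefficients, fixed once calibrated
```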
Really happy with how it looks so far, but it's far from ideal:
- Not even close to real time: this 30-second, 8-view sequence took nearly 5 minutes to process on my 5090 GPU
- 8 views is WAY too many and doesn't scale; I'm convinced this can be done with far fewer (2 exo + 1 stereo ego)
- Interacting hands cause lots of issues, and the pipeline is very fragile when there's no clear delineation between the hands
Still, I'm quite happy with how it's going so far. Currently, I have a reasonable set of datasets to validate against, a performant baseline, and an annotation app to correct inaccurate predictions.
From here, the focus will be more on the egocentric side!
Quoted tweet
Pablo Vela (@pablovelagomez1)
If you're not labeling your own data, you're NGMI. I take this seriously, so I finished building the first version of my hand-tracking annotation app using @rerundotio and @Gradio.
The combination of Rerun's callback system and Gradio integration enables a highly customizable and powerful labeling app. It supports multiple views, 2D and 3D, and maintains time synchronization!
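For anyone curious how the pieces might fit together, here's a minimal sketch of embedding a Rerun viewer in a Gradio app, assuming the gradio_rerun custom component that Rerun publishes; the wiring, file paths, and function names here are assumptions on my part, not code from the actual app:

```python
import gradio as gr
from gradio_rerun import Rerun  # Rerun's Gradio custom component (assumed API)


def show_recording(rrd_path: str) -> str:
    # The viewer component accepts .rrd file paths (or URLs) to display.
    return rrd_path


with gr.Blocks() as demo:
    rrd_input = gr.Textbox(label="Path to .rrd recording", value="session.rrd")
    viewer = Rerun(height=600)  # embedded viewer: multiple views, 2D + 3D, shared timeline
    rrd_input.submit(show_recording, inputs=rrd_input, outputs=viewer)

demo.launch()
```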
The only input required is a zip file containing two or more multiview MP4 files. I handle everything else automatically. This application works with both egocentric (first-person) and exocentric (third-person) videos.
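A plausible reconstruction of that ingestion step (not the actual app code): unpack the zip, then log each view's frames on a shared timeline so the viewer stays synchronized across cameras:

```python
import zipfile
from pathlib import Path

import cv2
import rerun as rr


def ingest_multiview_zip(zip_path: str, out_rrd: str = "session.rrd") -> str:
    """Unpack a zip of synchronized multiview MP4s and log them to one .rrd file."""
    work_dir = Path("unzipped")
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(work_dir)

    rr.init("hand_annotation", spawn=False)
    rr.save(out_rrd)  # from here on, logged data is written straight to the .rrd

    for video in sorted(work_dir.glob("**/*.mp4")):
        cap = cv2.VideoCapture(str(video))
        frame_idx = 0
        while True:
            ok, frame_bgr = cap.read()
            if not ok:
                break
            # One shared "frame" timeline keeps all views time-synchronized,
            # assuming the videos are frame-aligned to begin with.
            rr.set_time_sequence("frame", frame_idx)
            rr.log(f"views/{video.stem}/image",
                   rr.Image(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)))
            frame_idx += 1
        cap.release()
    return out_rrd
```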
Networks will occasionally make mistakes, so having the ability to correct them manually is crucial. This is a significant step towards robust and powerful hand tracking, which will provide excellent training data for dexterous robot manipulation.
The next step involves leveraging Rerun's recent updates, particularly the multi-sink support. Changes are saved directly to a file in .rrd format, which is easy to extract since the underlying representation is PyArrow; it can be converted to Pandas, Polars, or DuckDB.
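As a sketch of what that extraction could look like with Rerun's dataframe API (the recording path and timeline name here are assumptions carried over from the sketch above):

```python
import duckdb
import polars as pl
import rerun as rr

# Load the saved recording and view all entities, indexed by the "frame" timeline.
recording = rr.dataframe.load_recording("session.rrd")
view = recording.view(index="frame", contents="/**")

arrow_table = view.select().read_all()  # pyarrow.Table via a RecordBatchReader

df_pandas = arrow_table.to_pandas()                        # -> Pandas
df_polars = pl.from_arrow(arrow_table)                     # -> Polars
preview = duckdb.sql("SELECT * FROM arrow_table LIMIT 5")  # DuckDB scans Arrow directly
```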
This tight integration between visuals, predictions, and data is crucial to ensure your data is precisely what you expect it to be.