Daisuke Okanohara / 岡野原 大輔 (@hillbig)

2025-06-22 | ❤️ 14 | 🔁 2


VGGT directly outputs all essential 3D attributes—including camera parameters, point cloud maps, depth maps, and 3D tracking—from an image collection in a single forward pass. It is significantly faster and more accurate than existing methods. Best Paper of CVPR 2025.

Traditionally, 3D reconstruction has been addressed with visual-geometry methods that rely on iterative optimization, such as bundle adjustment, which are complex and computationally expensive. VGGT predicts a scene's 3D attributes from one to several hundred images in a single forward pass, completing the entire process within seconds. Unlike recent approaches such as DUSt3R, MASt3R, and VGGSfM, which still depend on iterative optimization or post-processing, VGGT surpasses optimization-based techniques without any post-processing. Furthermore, the feature representations produced by VGGT yield substantial performance improvements in downstream tasks such as non-rigid tracking and novel view synthesis.

Method

Input: a sequence of N RGB images capturing a scene.

Output: For each image, the system generates the following information:

  • Camera parameters (intrinsics and extrinsics), represented as a 9-dimensional vector: a rotation quaternion (4D), a translation vector (3D), and field-of-view angles (2D). The principal point is assumed to be at the image center.
  • Depth map
  • Point map (per-pixel 3D coordinates expressed in the coordinate frame of the first camera, which defines the world frame)
  • C-dimensional feature grid for tracking. An additional module uses these features to output the corresponding coordinates of each queried point across all images.

The network is permutation-equivariant with respect to the order of the input images, except for the first frame: the first image serves as the reference frame and defines the world coordinates of the point map.
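As a rough illustration, the per-frame outputs could be collected in a structure like the following (a minimal sketch; the names, shapes, and numpy representation are assumptions, not the authors' actual API):

```python
# Minimal sketch of the per-frame predictions described above (illustrative only).
from dataclasses import dataclass
import numpy as np

@dataclass
class FramePrediction:
    # 9D camera encoding: rotation quaternion (4) + translation (3) + field of view (2),
    # with the principal point assumed to lie at the image center.
    camera: np.ndarray          # shape (9,)
    depth: np.ndarray           # shape (H, W), per-pixel depth
    point_map: np.ndarray       # shape (H, W, 3), 3D points in the first camera's frame
    track_features: np.ndarray  # shape (H, W, C), fed to the separate tracking module
```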

Network Architecture

Each image is converted into tokens using DINO, and the tokens from all images are concatenated into a single sequence. The network consists of alternating attention layers: image-level (frame-wise) self-attention layers that operate within each image, and global self-attention layers that operate across all tokens.

This alternating attention mechanism enables the integration of information from multiple images while processing each image, thereby maintaining permutation equivariance to image order.
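A minimal PyTorch sketch of this alternating pattern, assuming a simplified block (layer sizes, normalization, and MLPs are omitted; this is not the authors' implementation):

```python
# Sketch of one alternating-attention block: frame-wise then global self-attention.
import torch
import torch.nn as nn

class AlternatingBlock(nn.Module):
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (N_frames, T_tokens_per_frame, dim)
        n, t, d = tokens.shape
        # 1) Frame-wise self-attention: each image attends only to its own tokens.
        x, _ = self.frame_attn(tokens, tokens, tokens)
        tokens = tokens + x
        # 2) Global self-attention: all tokens from all frames attend to each other,
        #    integrating information across views without imposing an order.
        flat = tokens.reshape(1, n * t, d)
        y, _ = self.global_attn(flat, flat, flat)
        return tokens + y.reshape(n, t, d)
```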

Each image is additionally given a camera token and register tokens. The first frame uses its own special learnable tokens, while all subsequent frames share a different set of learnable tokens; this allows the network to treat the first frame differently from the others.
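One way this could look in code (a sketch assuming one camera token plus a few register tokens per frame, with separate learnable parameters for the first frame and for all later frames; names are hypothetical):

```python
# Sketch of attaching per-frame camera/register tokens (illustrative only).
import torch
import torch.nn as nn

class SpecialTokens(nn.Module):
    def __init__(self, dim: int = 1024, n_register: int = 4):
        super().__init__()
        # Separate parameters for the reference (first) frame and for all other frames.
        self.first = nn.Parameter(torch.randn(1 + n_register, dim) * 0.02)
        self.rest = nn.Parameter(torch.randn(1 + n_register, dim) * 0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (N_frames, T_patches, dim)
        n = patch_tokens.shape[0]
        special = torch.cat(
            [self.first.unsqueeze(0), self.rest.unsqueeze(0).expand(n - 1, -1, -1)],
            dim=0,
        )
        # Prepend the camera/register tokens to each frame's patch tokens.
        return torch.cat([special, patch_tokens], dim=1)
```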

The network also predicts the aleatoric uncertainty of the point map and the depth map, and these uncertainties are incorporated into the training loss.
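A hedged sketch of what such an uncertainty-weighted regression term might look like (the exact weighting and the regularization coefficient are assumptions, not the paper's values):

```python
# Sketch of an aleatoric-uncertainty-weighted regression loss (illustrative only).
import torch

def uncertainty_weighted_loss(pred: torch.Tensor, gt: torch.Tensor,
                              conf: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    # pred, gt: (..., 3) predicted and ground-truth 3D points (or depth values)
    # conf:     (...)    per-pixel predicted confidence (higher = more certain)
    err = (pred - gt).norm(dim=-1)
    # Errors on confident pixels are penalized more strongly; the log term keeps the
    # network from trivially driving all confidences toward zero.
    return (conf * err - alpha * torch.log(conf)).mean()
```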

Training

Training is performed end-to-end using a multi-task loss, which minimizes the error between the ground truth and predicted values for each task (the aforementioned 3D tasks).
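Schematically, the objective is a weighted sum of the per-task errors, along the lines of the following sketch (task names and weights are illustrative assumptions):

```python
# Minimal sketch of the end-to-end multi-task objective (illustrative weights).
import torch

def total_loss(camera_loss: torch.Tensor, depth_loss: torch.Tensor,
               point_loss: torch.Tensor, track_loss: torch.Tensor,
               w_track: float = 0.05) -> torch.Tensor:
    # Camera, depth, and point-map terms are summed directly; tracking gets its own weight.
    return camera_loss + depth_loss + point_loss + w_track * track_loss
```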

The training data combines 16 diverse public datasets with newly synthesized data, covering a wide range of domains, including indoor/outdoor environments and real/synthetic scenes. 3D annotations were obtained in various ways, including sensor data, physics simulators, and SfM. The dataset's scale and diversity are comparable to those of MASt3R.

Experimental Evaluation

In camera pose estimation, VGGT significantly outperforms conventional iterative approaches in both accuracy and speed, producing results in just 0.2 seconds (compared to 15 seconds for COLMAP and 7-9 seconds for DUSt3R and MASt3R).

In multi-view depth estimation, VGGT is overwhelmingly the strongest among methods that do not use ground-truth camera information, and it matches or even exceeds methods that triangulate using ground-truth cameras.

Even without specialized training for image matching tasks, VGGT achieves performance that surpasses existing dedicated matching approaches.

Comments

This method represents a significant advancement in the 3D reconstruction problem. 3D reconstruction has not only been slow but also required specialized expertise. Their work demonstrates that this challenging 3D problem can be solved both rapidly and with high accuracy.

Their presentation slides note: “You no longer need to be ‘Zisserman’ for 3D reconstruction”, presumably meaning that this methodology makes the knowledge in the canonical 3D reconstruction textbook co-authored by Zisserman unnecessary (a joke that only a paper from Zisserman’s own lab could make).

Moreover, the approach itself can be characterized as a surrogate model for problems where accurate answers are available and the function mapping from inputs to outputs is neither chaotic nor overly complex.

The network architecture itself is straightforward and contains almost no 3D-geometry-specific machinery. Its only built-in priors are the alternating attention between per-image and global processing and the special tokens that distinguish the first camera from the others, which together give the representation the equivariance structure the problem requires; no other priors are provided. Relative poses and correspondences are resolved internally by the model itself.

The generalization ability of the method as a surrogate model depends entirely on the scale and diversity of the training data, something that only became feasible once the current level of data availability was reached.

Since images and video streams can be acquired in large quantities, unsupervised learning (such as using differentiable bundle adjustment as the training signal) is promising; the authors have already validated this experimentally at small scale. If developed further, this could enable very large-scale training and on-the-fly adaptation to new scenes.



Tags

type-paper domain-vision-3d domain-simulation domain-llm domain-dev-tools