Rohan Paul (@rohanpaul_ai)

2025-11-11 | โค๏ธ 297 | ๐Ÿ” 77


Dr. Fei-Fei Li's latest article, "From Words to Worlds: Spatial Intelligence Is AI's Next Frontier".

The next big step for AI is spatial intelligence, built by world models that create consistent 3D worlds, fuse many signals, and predict how those worlds evolve.

LLMs are great with words but weak at grounded reasoning, so we need models that can represent and interact with space.

Current multimodal models still fail at basics like estimating distance, maintaining object identity across views, predicting simple physics, and keeping video coherent for more than a few seconds.

The proposed answer is world models with three core abilities (generative, multimodal, and interactive) so the model can build consistent worlds, take many input types, and roll the state forward after actions.

These models have to keep physics and space consistent so objects act like they would in real life, not just sound correct like text from a chatbot. They need to work with many data types at once (images, sounds, text, and actions) and then decide what to do next in a fast perception-to-action loop.
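The three abilities above can be sketched as a toy interface. This is my own illustrative construction (not World Labs' design): state is a dict of object heights and velocities, "step" rolls simple gravity forward (generative), "observe" fuses an outside measurement (multimodal), and "act" closes the perception-to-action loop.

```python
# Toy world-model sketch: generative, multimodal, interactive (assumed structure).
GRAVITY = -9.8
DT = 0.1

def step_physics(state):
    """Generative: predict the next world state under simple gravity."""
    new_state = {}
    for name, (pos, vel) in state.items():
        new_vel = vel + GRAVITY * DT
        new_pos = max(0.0, pos + new_vel * DT)   # floor at height 0
        if new_pos == 0.0:
            new_vel = 0.0                        # object has landed
        new_state[name] = (new_pos, new_vel)
    return new_state

def observe(state, name, measured_pos):
    """Multimodal-ish: fuse an outside measurement into the state."""
    pos, vel = state.get(name, (measured_pos, 0.0))
    fused = 0.5 * pos + 0.5 * measured_pos       # naive averaging fusion
    return {**state, name: (fused, vel)}

def act(state, name, push_vel):
    """Interactive: an action changes the state, then the world rolls forward."""
    pos, vel = state[name]
    return step_physics({**state, name: (pos, vel + push_vel)})

world = {"ball": (10.0, 0.0)}
world = step_physics(world)        # ball starts falling
world = observe(world, "ball", 9.0)
world = act(world, "ball", 5.0)    # push it upward, then step again
```

The point of the sketch is only the shape of the loop: predict, fuse new signals, act, predict again.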

They also need to understand how time flows, so they can predict what will happen when things move, like when a robot reaches for something or a person walks through a room. World Labs, the startup Li co-leads, is building these Large World Models, and their Marble tool is incredible. It's a creator tool that maintains consistent 3D environments that can be explored and edited, lowering the cost of world building.

Training such models needs a universal task function that plays a role like next token prediction but bakes in geometry and physics.
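One hypothetical form such a task function could take (my assumption, not the article's proposal): next-state prediction error plus a penalty for violating a known physical constraint, here conservation of momentum for two colliding masses. The weighting `lam` and the constraint choice are illustrative.

```python
# Hypothetical "universal task function": prediction error + physics penalty.
def prediction_loss(predicted, target):
    """Squared error between predicted and observed next velocities."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

def physics_penalty(masses, before, after):
    """Penalize predictions that change total momentum."""
    p_before = sum(m * v for m, v in zip(masses, before))
    p_after = sum(m * v for m, v in zip(masses, after))
    return (p_after - p_before) ** 2

def world_model_loss(masses, current, predicted, target, lam=1.0):
    return prediction_loss(predicted, target) + lam * physics_penalty(masses, current, predicted)

masses = [1.0, 2.0]
current = [3.0, 0.0]    # velocities before an elastic collision
target = [-1.0, 2.0]    # true outcome; total momentum stays 3.0
good = [-1.0, 2.0]      # physically consistent prediction
bad = [1.0, 2.0]        # prediction that invents extra momentum
```

Unlike plain next-token prediction, the second term grades the prediction against the world's rules, not just against the data.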

Data must scale beyond text using internet images and videos, plus synthetic data, depth, and tactile signals, with algorithms that recover 3D from 2D frames at scale.

Architectures need 3D/4D aware tokenization, context, and memory, since flattening everything to 1D or 2D makes counting objects or long-term scene memory brittle.
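A toy illustration of why 3D-aware tokens help (my construction, not any real architecture): tokenize a 3D occupancy grid into 2x2x2 blocks while keeping each token's block coordinate, so spatial neighbor queries stay trivial, whereas a flat 1D token stream discards that adjacency.

```python
# 3D-aware tokenization sketch: tokens keep their (x, y, z) block coordinate.
def tokenize_3d(grid, block=2):
    """grid: dict {(x, y, z): value}. Returns {block_coord: sorted cell list}."""
    tokens = {}
    for (x, y, z), v in grid.items():
        key = (x // block, y // block, z // block)
        tokens.setdefault(key, []).append(((x, y, z), v))
    for cells in tokens.values():
        cells.sort()
    return tokens

def neighbors(token_key):
    """6-connected neighboring block coordinates: trivial with 3D keys."""
    x, y, z = token_key
    return [(x + dx, y + dy, z + dz)
            for dx, dy, dz in [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                               (0, -1, 0), (0, 0, 1), (0, 0, -1)]]

# Two occupied cells in adjacent blocks along z; a 1D flattening would
# place them far apart in the sequence and lose the adjacency.
grid = {(0, 0, 0): 1, (0, 0, 3): 1}
tokens = tokenize_3d(grid)
```

Counting objects or recalling a far corner of the scene then becomes a keyed lookup rather than a long-range attention problem.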

World Labs points to a real-time generator called RTFM, which uses spatially grounded frames as memory to keep persistence while generating quickly.
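A hedged toy of "spatially grounded frames as memory" (the structure below is assumed for illustration, not RTFM's actual mechanism): frames are cached keyed by a quantized camera pose, and when the camera revisits a pose cell the cached frame is reused, so the scene stays persistent instead of being re-imagined.

```python
# Pose-keyed frame cache sketch: persistence via spatially grounded memory.
import math

class FrameMemory:
    def __init__(self, cell=1.0):
        self.cell = cell
        self.frames = {}          # quantized pose -> frame payload

    def _key(self, pose):
        return tuple(math.floor(c / self.cell) for c in pose)

    def store(self, pose, frame):
        self.frames[self._key(pose)] = frame

    def recall(self, pose):
        """Return the cached frame for this pose cell, if any."""
        return self.frames.get(self._key(pose))

def render(memory, pose, generator):
    cached = memory.recall(pose)
    if cached is not None:        # persistence: reuse the grounded frame
        return cached
    frame = generator(pose)       # otherwise generate fresh and remember it
    memory.store(pose, frame)
    return frame

mem = FrameMemory(cell=1.0)
gen_calls = []
def toy_generator(pose):
    gen_calls.append(pose)
    return f"frame@{pose}"

first = render(mem, (0.2, 0.3, 0.0), toy_generator)
revisit = render(mem, (0.4, 0.6, 0.0), toy_generator)  # same 1.0-unit cell
```

The cache is what buys persistence cheaply: only genuinely new viewpoints pay the cost of generation.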

The near term use is creative tooling where directors, game designers, and YouTubers can block scenes, change layouts, and iterate with natural prompts inside physically coherent sets.

The mid-term path is robotics, where the same models give robots a predictive map to plan grasps and navigation in homes and warehouses, cutting trial and error. The long-term bet is science and simulation, where controllable worlds accelerate testing of designs in materials, biology, and climate by running fast "what if" experiments before real trials.



Tags

3D GenAI Robotics