Yunzhu Li (@YunzhuLiYZ)

2025-06-19 | ❤️ 132 | 🔁 26


Steerability remains one of the key issues for current vision-language-action models (VLAs). Natural language is often ambiguous and vague: compare “Hang a mug on a branch” with “Hang the left mug on the right branch.”

Many works claim to handle language input, yet their tasks are often obvious even without the language. We need systematic benchmarks that evaluate how well different instructions actually steer a model.

In our RSS 2025 paper, we study whether scaling alone can solve the steerability issue and propose CodeDiffuser, which generates task-specific code: an interpretable, executable interface that calls perception APIs and produces effective conditioning for imitation learning policies (e.g., hang mug, pack battery, stow books). 🔗 https://robopil.github.io/code-diffuser/

Key takeaways:

  1. Scaling data alone—without careful design of intermediate conditioning—is unlikely to solve steerability.
  2. As in VoxPoser and ReKep, LLM/VLM-generated code is highly versatile for conditioning (e.g., via visual heatmaps) and effective in steering policies.
  3. The code is foundation model (FM)-generatable, human-interpretable, and robot-executable, which makes it an ideal bridge between FMs and robot actions.
  4. The framework is scalable, since improvements in VLMs/LLMs/VLAs can enhance performance. More importantly, it is interpretable and debuggable: when things go wrong, we have a good chance of pinpointing the failure.
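
To make takeaways 2 and 3 concrete, here is a rough sketch of what a generated program for “Hang the left mug on the right branch” might look like. It is not CodeDiffuser's actual API: `Detection`, `detect`, `leftmost`, `rightmost`, `heatmap`, and `generated_program` are hypothetical names, the scene is a hand-written stand-in for real perception output, and the sketch assumes the x axis increases to the right.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    center: tuple[float, float, float]  # (x, y, z) in metres; x grows to the right (assumption)


def detect(scene, label):
    """Hypothetical perception API: all detections in the scene with this label."""
    return [d for d in scene if d.label == label]


def leftmost(dets):
    return min(dets, key=lambda d: d.center[0])


def rightmost(dets):
    return max(dets, key=lambda d: d.center[0])


def heatmap(points, targets, sigma=0.05):
    """One value in (0, 1] per point: Gaussian falloff from the nearest target centre."""
    values = []
    for p in points:
        nearest = min(math.dist(p, t.center) for t in targets)
        values.append(math.exp(-nearest**2 / (2 * sigma**2)))
    return values


def generated_program(scene, points):
    """The kind of code an LLM/VLM might emit for this one instruction (illustrative)."""
    mug = leftmost(detect(scene, "mug"))
    branch = rightmost(detect(scene, "branch"))
    return heatmap(points, [mug, branch])


# Toy scene: two mugs and two branches (made-up coordinates).
scene = [
    Detection("mug", (-0.2, 0.0, 0.0)),
    Detection("mug", (0.2, 0.0, 0.0)),
    Detection("branch", (-0.1, 0.0, 0.3)),
    Detection("branch", (0.3, 0.0, 0.3)),
]
points = [d.center for d in scene]
values = generated_program(scene, points)
```

In this toy setup the heatmap peaks on the left mug and the right branch and is near zero on the distractors. A per-point signal of that kind is one plausible form of conditioning for a policy, and because the program is only a few readable lines, a wrong selection is easy to spot.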

Check out the interactive demos on our project page and @YXWangBot’s thread for more!



Quoted tweet

Yixuan Wang (@YXWangBot)

🤖 Do VLA models really listen to language instructions? Maybe not 👀 🚀 Introducing our RSS paper: CodeDiffuser, which uses VLM-generated code to bridge the gap between high-level language and low-level visuomotor policy 🎮 Try the live demo: https://t.co/sLlTIyFu19 (1/9) https://t.co/ba91DbcXzc

Original tweet

🎬 Video

Tags

type-paper, domain-robotics, domain-llm, domain-vlm, domain-dev-tools