The Battle Between VLA and World Model Approaches Intensifies | EV Briefing

[Technical Roadmap] At GTC 2026, the debate between the VLA (Vision-Language-Action) approach and the world model approach came to a head, with the industry's attention fixed on which path can deliver a breakthrough in generalization.

Key Developments: Li Auto Champions VLA, While Former Executive Bets on a Hybrid Approach

Li Auto unveiled its MindVLA-o1 architecture at GTC 2026, asserting that a single VLA model can control both vehicles and robots. Zhan Kun, head of foundation models at Li Auto, said the framework will power the next generation of embodied intelligence.

Jia Peng, former head of Li Auto’s intelligent driving division and current CEO of Zhijian Power, criticized pure VLA approaches as having “almost zero generalization capability” and proposed an alternative path: a unified foundation model integrating world models with VLA.

Strategic Underpinnings: Competing for Influence at the Inflection Point of Physical AI

At its core, the dispute is not about technical minutiae but about how to achieve robust generalization in the real world. The VLA camp, represented by NVIDIA and Li Auto, advocates end-to-end unified architectures, while companies like Unitree insist that a simulatable model of the physical world must be built first. Behind this divergence in technical roadmaps lies an intense battle for ecosystem dominance.