NextFin — As billions of dollars pour into humanoid robots and embodied artificial intelligence, SenseTime co-founder Wang Xiaogang says the industry is heading toward a dead end unless it abandons a machine-centred mindset and returns to a more fundamental source of intelligence: human interaction with the physical world.
“Robots will never understand the real physical world by reading articles or looking at pictures,” Wang told Jany Hejuan Zhao, founder and CEO of NextFin.AI and publisher of Barron's China, at the 2025 T-EDGE conference, which kicked off on Monday, December 8 and runs through December 21.
“We need to shift from a machine-centred paradigm to a human-centred one, and learn from how people actually interact with their environment,” he added.
Wang, who is also chairman of robotics company ACE Robotics, argued that current embodied AI research relies too heavily on large vision-language-action (VLA) models trained on internet data and data the robots generate themselves — an approach he believes cannot deliver general, transferable intelligence for machines operating in the real world.
Instead, Wang is advocating a new research framework he calls ACE — short for “anthropocentric, context-driven, embodied” intelligence — which places human behaviour, physical interaction and environmental context at the core of how machines learn.
His comments come at a moment of extraordinary enthusiasm and investment in humanoid robotics. According to industry estimates cited at the conference, global investment in humanoid robots reached roughly $7 billion in the first nine months of 2025, up about 250% year-on-year, driven largely by China. Yet most products remain at an early demonstration stage, limited to walking, dancing or performing scripted tasks.
“There is a gap between the capital frenzy and the technological reality,” Wang said. “That gap is rooted in the research paradigm itself.”
The embodied AI sector has become one of the hottest frontiers in global technology, with companies racing to build robots capable of operating in factories, hospitals, warehouses and eventually homes.
Yet despite rapid progress in sensors, actuators and computing, Wang argues that current systems lack a deep understanding of physics, causality and long-horizon planning — capabilities humans acquire naturally through everyday interaction.
Today’s dominant approach trains robots either by imitation learning from human demonstrations or by reinforcement learning through trial and error in simulated or controlled environments. These systems are often combined with large language or multimodal models, producing machines that can follow instructions or replicate tasks but struggle when faced with novel situations.
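To make the imitation-learning half of that recipe concrete, here is a minimal behaviour-cloning sketch in Python (the network shape, dimensions and data are illustrative stand-ins, not drawn from any system Wang described): the policy is regressed directly onto recorded expert actions, which is why such systems can replicate demonstrated tasks yet have no mechanism for handling situations outside their demonstrations.

```python
# Minimal behaviour-cloning sketch: a policy network regressed onto
# expert demonstrations. All names and dimensions are illustrative.
import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256),
            nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def bc_update(policy, optimiser, obs, expert_actions) -> float:
    """One supervised step: minimise MSE between policy and demonstration."""
    loss = nn.functional.mse_loss(policy(obs), expert_actions)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

# Toy usage: random tensors stand in for recorded demonstrations.
policy = Policy(obs_dim=32, act_dim=7)
optimiser = torch.optim.Adam(policy.parameters(), lr=1e-3)
obs = torch.randn(64, 32)            # batch of observations
expert_actions = torch.randn(64, 7)  # matching expert actions
bc_update(policy, optimiser, obs, expert_actions)
```

The reinforcement-learning variant replaces the demonstration loss with a reward signal gathered by trial and error, but it shares the limitation Wang points to: the policy only ever learns from the situations its data source happens to contain.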
“The industry hopes that if we just scale up models and data, general intelligence will emerge,” Wang said. “But scaling is reaching diminishing returns because the data source — the internet — is being exhausted.”
That exhaustion, he said, marks the end of the AI 2.0 era driven by massive text, image and code corpora, and the beginning of a new phase in which intelligence must be drawn from interaction with the physical world.
Wang’s proposed shift is conceptual as much as technical.
In the machine-centred paradigm, researchers start with a robot, collect data from its sensors, and train a model tailored to that specific physical form. But robots differ widely — humanoids, wheeled robots, robotic arms, drones — each with different constraints and capabilities.
“You cannot expect all these different bodies to share the same brain,” Wang said. “Just as humans and dogs do not share the same brain, machines with different physical structures cannot share one universal model.”
The human-centred approach reverses the process. Instead of starting with the robot, it starts with humans: observing how people perceive, plan and act in physical environments.
Using wearable devices, first-person cameras, motion capture systems and force and tactile sensors, Wang’s team records human activity across everyday tasks — cooking, cleaning, handling objects, navigating spaces — and uses this data to build what he calls a “world model”: a structured representation of physical laws, object dynamics and goal-driven behaviour.
This world model can then be transferred across different robotic embodiments.
“The intelligence should be in the world model, not in a specific robot,” Wang said. “Once you understand the world, you can adapt that understanding to different bodies.”
At the core of Wang’s vision is the ACE framework developed by ACE Robotics, which integrates three layers:
- Environmental data acquisition — collecting rich, multimodal data from human-environment interactions, including vision, motion, force, and spatial context.
- World model 3.0 — a generative and predictive model that encodes physical laws, human behaviour and long-term causal relationships, capable of simulating alternative scenarios and predicting outcomes.
- Embodied interaction — mapping the world model’s outputs onto specific robotic bodies through control and planning systems.
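Read as an architecture, the key property is that the world model sits between data collection and any particular robot. The following Python interfaces are a hypothetical sketch of how the three layers might compose; the names and signatures are illustrative, not ACE Robotics' actual API.

```python
# Hypothetical composition of the three ACE layers described above.
# Interfaces and names are illustrative, not ACE Robotics' actual API.
from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class Interaction:
    """Layer 1: one multimodal record of a human-environment interaction."""
    vision: Any   # first-person camera frames
    motion: Any   # motion-capture trajectories
    force: Any    # force and tactile readings
    context: Any  # spatial / scene context

class WorldModel(Protocol):
    """Layer 2: learned from Interaction data, shared across robot bodies."""
    def predict(self, state: Any, action: Any) -> Any:
        """Roll the physical state forward under a candidate action."""
        ...

class Embodiment(Protocol):
    """Layer 3: maps world-model plans onto one specific robot body."""
    def execute(self, action: Any) -> None: ...

def act(world_model: WorldModel, body: Embodiment,
        state: Any, candidate_actions: list, goal_score) -> None:
    """Trivial one-step planner: pick the candidate action whose
    predicted outcome scores best against the goal, then execute it."""
    best = max(candidate_actions,
               key=lambda a: goal_score(world_model.predict(state, a)))
    body.execute(best)
```

On this reading, only the Embodiment layer changes when the same intelligence moves between a humanoid, a wheeled robot or an arm, which is the transfer property Wang describes.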
Unlike purely generative video or simulation models, Wang said, ACE emphasises physical consistency and causal structure.
“It is not enough to generate realistic images of a world,” he said. “The model must know what happens when you push an object, open a door, or apply force.”
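A toy example (entirely my own, with deliberately simplified dynamics) illustrates the distinction: a physically consistent model predicts the object's state under an applied force rather than rendering a plausible-looking frame, so a push weaker than friction moves nothing.

```python
# Toy illustration of "physical consistency": predict what happens to an
# object's state when force is applied, instead of generating an image.
# The dynamics are deliberately simplistic and purely illustrative.
from dataclasses import dataclass

GRAVITY = 9.81  # m/s^2

@dataclass
class Block:
    mass_kg: float
    position_m: float
    friction_coeff: float  # coefficient of friction against the surface

def push(block: Block, force_n: float, duration_s: float) -> Block:
    """Predict the block's new position after a horizontal push from rest."""
    friction_n = block.friction_coeff * block.mass_kg * GRAVITY
    if force_n <= friction_n:
        return block  # push too weak: friction wins, nothing moves
    accel = (force_n - friction_n) / block.mass_kg
    displacement = 0.5 * accel * duration_s ** 2
    return Block(block.mass_kg, block.position_m + displacement,
                 block.friction_coeff)

block = Block(mass_kg=2.0, position_m=0.0, friction_coeff=0.4)
print(push(block, force_n=5.0, duration_s=1.0).position_m)   # 0.0: no slip
print(push(block, force_n=20.0, duration_s=1.0).position_m)  # ~3.04 m
```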
One of Wang’s strongest criticisms of the current AI trajectory is its dependence on historical internet data.
Large language models have extracted enormous value from centuries of human writing and knowledge, but Wang argues that this reservoir is finite.
“Once you’ve mined all the intelligence embedded in text and images, there is no more to extract,” he said. “The next intelligence must come from the physical world.”
That requires new data pipelines, new sensors and new ethical and technical frameworks for collecting and using human behavioural data.
It also requires interdisciplinary collaboration, bringing together AI researchers, roboticists, mechanical engineers, cognitive scientists and ergonomics experts.
“Fields that were once disconnected — biomechanics, human factors, robotics, AI — now need to converge,” Wang said.
The push toward embodied intelligence has implications beyond technology.
Robotics is increasingly seen as a strategic industry, affecting manufacturing competitiveness, labour markets, ageing societies and national security.
China, with its strong manufacturing base, dense urban environments and large domestic market, has become a major testing ground for embodied AI.
Wang believes this gives China a structural advantage — not just in production, but in data.
“China has enormous diversity in physical environments, industrial scenarios and daily life settings,” he said. “That diversity is a rich training ground for embodied intelligence.”
At the same time, he stressed the importance of open ecosystems and international collaboration, warning against closed technological silos.
Despite the ambition of his vision, Wang acknowledged that truly general-purpose robots remain years away.
Safety, reliability, cost, regulation and public acceptance all pose major challenges, especially for household robots.
Before robots can become everyday companions, he said, society must resolve issues such as physical safety, accountability for accidents, and trust.
Still, he believes that a correct research paradigm can dramatically accelerate progress, just as end-to-end learning transformed autonomous driving once the right conceptual framework emerged.
“If the direction is right, data and engineering will follow,” Wang said.
Wang’s own career mirrors the evolution of artificial intelligence in China, from early computer vision research to large-scale models and now embodied intelligence.
He began his academic path at the University of Science and Technology of China, later earning a PhD and conducting research at MIT before co-founding SenseTime in 2014, a company that became one of China’s leading AI firms.
Now, through ACE Robotics, he is trying to shape what he calls the “AI 3.0” era — one that reconnects intelligence with the physical world.
“The biggest breakthroughs do not come from scaling what already exists,” he said. “They come from changing how we think about the problem.”
For Wang, the question is no longer whether embodied AI will succeed, but how it will be built.
A future dominated by machines trained only on virtual data, he warned, risks producing systems that are brittle, unsafe and incapable of true autonomy.
A future grounded in human experience and physical reality, he argues, offers a more robust and ultimately more human-aligned path.
“Intelligence did not come from the internet,” Wang said. “It came from humans living in the world. If machines are to become intelligent, that is where they must learn as well.”