TMTPOST--Seeing is not just for understanding the world, but for doing, and one day it will be AGI and spatial intelligence that close the loop between seeing and doing, said Fei-Fei Li, Stanford University's artificial intelligence pioneer, at the first Asian American Pioneer Medal Symposium and Ceremony, held at Stanford on Friday.
"Nature created perceptual animals like us, starting with the trilobites half a billion years ago, because there is an imperative in evolution that seeing and doing form a closed loop," she remarked. This spatial intelligence involves not just recognizing objects but understanding their relationships and planning actions in 3D space.
To illustrate this, Li provided examples of AI algorithms capable of reconstructing 3D scenes from 2D images, showcasing the early signs of robust spatial intelligence. These advancements have profound implications for fields like robotics, where machines need to navigate and manipulate their environment.
In the rapidly evolving field of artificial intelligence (AI), a significant divide is emerging between proponents of open source technology and advocates of proprietary solutions. Industry experts and stakeholders from academia, the public sector, venture capital, and entrepreneurial circles are rallying to support open source initiatives, highlighting the critical need for collaborative development in AI.
California Senate Bill 1047 poses a significant threat to the open source community, Li pointed out. "It's actually wrong that this legislation is coming out of California," said Li, who stressed that many are actively working to amend or repeal the bill to protect the interests of the open source community.
In a speech, Li outlined how modern AI has been driven by three converging forces in the past decades: neural networks (or deep learning), advanced chips like Nvidia's GPUs, and big data. These elements have collectively propelled significant advancements in AI, particularly in the realm of computer vision.
Li highlighted the remarkable progress in visual recognition, saying "Machines quickly became able to recognize visual objects on par with human performance." However, this achievement is just the beginning. The past decade has seen tremendous strides in areas such as object segmentation, dynamic tracking, and understanding complex, multi-object scenarios.
Current AI models, such as GPT-4 and Gemini 1.5, have demonstrated impressive capabilities in processing and generating language from multimodal inputs. These models can interpret text and images and generate language outputs, Li said in response to a question from Zhao Hejuan, the CEO of TMTPost.
Yet, despite their advancements, these models are still largely confined to two-dimensional representations of the world. For example, the AI-generated video of a Japanese woman walking down a street in Tokyo or Kyoto is limited to a single perspective and lacks the ability to understand and manipulate the scene in three dimensions, she further explained.
The limitation lies in the AI's lack of spatial intelligence—a fundamental aspect of human cognition that enables us to understand depth, shape, and spatial relationships, Li elaborated. "Natural evolution made animals able to understand, live, plan and interact in this 3D world. And this is as ancient as 540 million years ago, when the first trilobites started to see light in the water: they needed to navigate. If they didn't navigate the 3D world, they became someone's dinner very quickly. So as evolution went on, animals gained more spatial intelligence capability."
The integration of spatial intelligence in AI would unlock new possibilities, Li envisioned. In AR and VR, it would enhance the realism and interactivity of virtual environments. For robotics, it would enable machines to better navigate and manipulate objects in the real world. This advancement would also benefit design and creative industries by allowing AI to generate and understand complex three-dimensional designs, Li noted.
She also discussed the creation of image captioning algorithms. "We gave the computer one picture, and through the neural network, it was able to describe the scene in natural language," Li explained. This milestone was followed by the development of algorithms capable of generating images from textual descriptions, showcasing the rapid evolution of generative AI, she added.
In recent years, the generative AI field has expanded beyond static images to include video generation. Companies like OpenAI and various startups have developed algorithms that can generate videos from single sentences, pushing the boundaries of what AI can achieve. However, Li posed the question: what's next?
Looking to the future, Li envisioned AI that can perform complex tasks through thought alone. A pilot study from her lab demonstrated a subject wearing an EEG cap who controlled a robot to make a meal using only brain signals. While this technology is not yet ready for commercialization, it represents the cutting-edge potential of AI.
While large language models continue to dominate the AI landscape, Li argued that spatial intelligence will be crucial for the next wave of AI advancements. "It's nature's way of closing the loop between seeing and doing, and it will be AI's way of understanding and interacting with the world," Li elaborated.
Li's team has been collaborating with Nvidia to create dynamic environments that benchmark everyday household activities for robots. Additionally, they have been integrating large language models with visual models to instruct robots in performing tasks, such as opening doors or making sandwiches, based purely on natural language instructions.
When it comes to understanding specific technical issues or details, the consensus is to rely on credible experts in the field, Li said when addressing the question of trust. Experts bring specialized knowledge and are often engaged in ongoing debates and discussions, and peer review and expert forums are key mechanisms for ensuring the reliability of technical information, she explained.
However, the situation differs when evaluating broader aspects of technology, such as its safety and societal impact. Historically, government agencies and industry bodies have played significant roles in these evaluations. For instance, the FDA has been crucial in regulating drugs and food products. Yet, there are instances where these institutions have been criticized, and their actions scrutinized, as seen with high-profile cases of wrongdoing and inefficiencies, Li further illustrated.
Technology, whether it is AI, CRISPR, or any other advancement, is not the property of any single entity or group; it is a collective responsibility, she added. As these technologies become increasingly integrated into society, it is essential for all stakeholders—governments, industries, and the public—to engage in continuous dialogue and oversight, Li emphasized.