
Quang-Tien Dam

(Call me: /tee-EN/)
M.Sc. Student
Ritsumeikan University
damtien440 (at) gmail.com



Note: This is my sound diary. Transcribed by Gemini.

My Vision for the Future of Robotic Intelligence

My vision for the future of robotic intelligence is rooted in the feeling that AI is currently confined to the software world. The software world is predictable and deterministic; nothing that happens there has real consequences, at least not in a physical sense.

The idea is that future AI systems will have real effects: they must interact with the real world through limbs or a robotic body. Embodied intelligence requires a special way of reasoning about the world, because each body has its own unique experience.

The fact is, the real world is highly stochastic; it cannot be predicted as reliably as the software world. In the software world, the action a robot intends is executed perfectly. In the real world, however, the action a robot attempts will not be executed exactly as it would be in the digital realm. Everything the AI does digitally can be reflected perfectly in a simulation, but in reality there are uncertainties, and there is always the possibility that things won't happen as expected.

Therefore, the new generation of AI systems must have the capability to figure out how to recover from mistakes and handle that uncertainty.
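The gap between the two worlds can be sketched in a few lines. This is a toy illustration, not the author's method: the Gaussian noise, the 1-D "position", and the function names are all assumptions standing in for real actuation error, and the recovery loop is simply a closed-loop retry that observes the outcome and re-issues a corrective command.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

def execute_in_sim(position, command):
    """In simulation, the commanded motion is applied exactly."""
    return position + command

def execute_in_real_world(position, command, noise=0.2):
    """In the real world, the same command lands with some error.
    Gaussian noise is a stand-in for slippage, latency, backlash, etc."""
    return position + command + random.gauss(0, noise)

def move_until_close(position, target, tolerance=0.05, max_steps=50):
    """A naive recovery loop: observe the error after each step and
    issue a corrective command, instead of trusting an open-loop plan."""
    for _ in range(max_steps):
        error = target - position
        if abs(error) < tolerance:
            return position, True
        position = execute_in_real_world(position, error)
    return position, False

final, reached = move_until_close(0.0, 1.0)
```

An open-loop plan would stop after the first command and simply be wrong by the noise term; the closed loop converges because each step is conditioned on what actually happened, which is the kind of recovery-from-mistakes capability described above.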

Aggregating Experience

The second point is that the AI has to learn from its unique experience, whether "off-grid" or "on-grid", to judge the uncertainty of new events as they happen. I struggle to find the exact words to describe what is currently in my mind, but I feel that by aggregating experience from different bodies, from multiple robotic "ancestors", the new intelligent model will be able to figure out what is real.

The data here is very important. This kind of model must rely on data from the real world. Experience can be aggregated through simulation in a “dream world,” but that isn’t going to be real or perfectly reflect reality. Of course, we can always use simulation as a medium to train skills—that is perfectly okay—but to handle the complexity of reality, we have to figure out a way to aggregate knowledge from a big corpus of real data. That is the vision.

The Human Analogy

I would like to use an analogy: humans have all of their ancestors' experience encoded in our genes and body configurations, and our social structures have been evolving forever. These two things make humans intelligent enough to handle the uncertainty of the world, and both are enabled by the power of evolution. Evolution encodes and aggregates experience "depth-wise", through ancestors (vertically), and through communication within the same generation (horizontally).

So, we have two kinds of knowledge that need to be aggregated to let the model adapt to the environment. I call this the “evolution of data.” By having a really big corpus of experience from multiple robots, we can achieve two things: first, we gather data from multiple robotic embodiments; second, we gather knowledge from different sensory inputs.

Resource Constraints and Selection Criteria

By aggregating this pool of knowledge, we can pre-train a "brain" that possesses knowledge accumulated from ancestors and peers (vertically and horizontally). However, we do not have unlimited resources to store all of that knowledge. We have to figure out a way to select data and update model parameters based on specific criteria.
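One possible shape for such a selection criterion is a novelty score under a fixed storage budget. This is only a sketch of the idea, under heavy assumptions: episodes are reduced to small feature tuples, "novelty" is the squared distance to the closest episode already kept, and the greedy loop is one of many possible selection strategies.

```python
def novelty(candidate, kept):
    """Score a candidate episode by its distance to the closest
    episode already kept; far-away episodes carry more new information."""
    if not kept:
        return float("inf")
    return min(sum((a - b) ** 2 for a, b in zip(candidate, episode))
               for episode in kept)

def select_experience(episodes, budget):
    """Greedily keep the most novel episodes until the budget is full."""
    kept = []
    pool = list(episodes)
    while pool and len(kept) < budget:
        best = max(pool, key=lambda ep: novelty(ep, kept))
        kept.append(best)
        pool.remove(best)
    return kept

# Hypothetical 2-D episode features; near-duplicates get dropped first.
episodes = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.1), (-4.0, 3.0)]
kept = select_experience(episodes, budget=3)
```

With a budget of three, the two near-duplicates of the first episode are the ones sacrificed, which is the intuition behind "select data based on specific criteria": storage goes to experience the model has not seen yet.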

Diversity in Configuration

Finally, regarding the checkpoints trained on this corpus: we need to figure out how to configure them differently so that every single robot is slightly different from the others. This ensures the overall process actually benefits from aggregating experience. Imagine if every robot were identical: their experiences would be identical, and there would be nothing new to observe.

Although we want robustness and consistency, a system that works everywhere, we can always define a diversity threshold to ensure the final system has observed enough varied data through different lived experiences.


Miss but an hour's fated meeting, and lifetimes pass before next greeting.