How Large Language Models Are Enhancing Robots’ Abilities in Human-Robot Interaction

If you walked into the bustling office of Google DeepMind in Mountain View, California, you might encounter a curious sight: a slender robot politely guiding employees around the office. This isn’t science fiction. Thanks to an upgrade with Google’s Gemini large language model, these robots are transforming how we interact with machines.

A New Kind of Office Helper

These robots, powered by Gemini, can understand complex commands such as "Find me somewhere to write." They then trundle off, guiding the person to a whiteboard or a free desk. What makes these robots remarkable? They combine a large language model, which interprets the request, with algorithms that convert that interpretation into physical actions, making them adept helpers in offices and similar environments.

The robots’ navigation skills were impressively demonstrated in a recent paper by DeepMind researchers. The robots utilized Gemini’s capability to process video and text data, absorbing a wealth of information from recorded office tours. This understanding allows the robots to navigate and perform tasks with up to 90 percent accuracy, even in response to challenging prompts like “Where did I leave my coaster?”
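To make that pipeline a little more concrete, here is a minimal Python sketch of the general idea: a multimodal model grounds a spoken request against remembered tour footage, and a separate planner turns the result into motion. Every name below (MultimodalAssistant, NavigationGoal, ground_request) is a hypothetical stand-in invented for illustration, not DeepMind's or Gemini's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NavigationGoal:
    """A place the robot should go, named after something seen in the tour."""
    landmark: str
    confidence: float


class MultimodalAssistant:
    """Hypothetical wrapper around a video-and-text model (not a real API)."""

    def __init__(self, tour_frames: List[str]):
        # Frames from a recorded office walkthrough serve as the robot's
        # memory of the space it is asked to navigate.
        self.tour_frames = tour_frames

    def ground_request(self, request: str) -> NavigationGoal:
        # A real system would send the frames plus the request to the model;
        # this stub returns canned answers so the example runs on its own.
        if "write" in request.lower():
            return NavigationGoal(landmark="whiteboard", confidence=0.9)
        return NavigationGoal(landmark="front desk", confidence=0.5)


def handle_request(assistant: MultimodalAssistant,
                   drive_to: Callable[[str], None],
                   request: str) -> str:
    """Turn a natural-language request into a navigation action."""
    goal = assistant.ground_request(request)
    if goal.confidence < 0.6:
        return f"I'm not sure where to take you for: {request!r}"
    drive_to(goal.landmark)  # the low-level planner handles the actual driving
    return f"Guiding you to the {goal.landmark}."


if __name__ == "__main__":
    assistant = MultimodalAssistant(tour_frames=["office_tour_frame_001.jpg"])
    print(handle_request(assistant, drive_to=lambda landmark: None,
                         request="Find me somewhere to write"))
```

The point of the split is that the language model only has to name a place it has already seen in the tour; the robot's existing navigation stack takes care of the driving.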

Beyond the Browser

While chatbots like Gemini typically operate within web browsers or apps, their integration into physical robots opens up new possibilities. These robots can parse visual cues as well as spoken commands, for instance following a route sketched on a whiteboard. This multimodal capacity extends their usefulness beyond mere conversation into practical, real-world assistance.

Charging Ahead

A practical example of this technology in action involves a robot guiding someone to a power outlet after recognizing a phone and hearing, “Where can I charge this?” By learning the layout through video tours, the Gemini-powered robot navigates accurately and efficiently, a significant leap from older models that required pre-mapped environments.

This advancement has led to real-world applications and further research. Investors are showing keen interest, funding startups like Physical Intelligence and Skild AI. These companies are exploring similar integrations of large language models with real-world robots, aiming for general problem-solving abilities.

Adapting to More Complex Questions

The capabilities of robots equipped with large language models extend beyond basic navigation and simple commands. Gemini, for instance, can answer more intricate questions by drawing on the context of its surroundings. Imagine someone surrounded by Coke cans asking, "Do they have my favorite drink today?" Guided by Gemini, the robot could check the fridge and report back with an accurate answer.
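As a rough illustration of that flow, the sketch below pairs a stored preference with a placeholder vision check. The preference table and detect_items function are made up for the example; a real robot would route its camera frame through the multimodal model rather than a hard-coded detector.

```python
# Hypothetical sketch of the "check the fridge" flow described above.
# The preference table and detector are placeholders, not Gemini's interface.

FAVORITE_DRINKS = {"alex": "Coke Zero"}


def detect_items(camera_frame) -> set:
    # Stand-in for a vision call; a real robot would pass its camera frame
    # to an object detector or the multimodal model itself.
    return {"Coke", "sparkling water"}


def answer_drink_question(user: str, camera_frame) -> str:
    """Combine a stored preference with what the robot currently sees."""
    favorite = FAVORITE_DRINKS.get(user)
    if favorite is None:
        return "I don't know your favorite drink yet."
    seen = detect_items(camera_frame)
    if favorite in seen:
        return f"Yes, there is {favorite} in the fridge."
    return f"I don't see any {favorite} right now."


if __name__ == "__main__":
    print(answer_drink_question("alex", camera_frame=None))
```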

This adaptability could make robots more versatile, responding to a wider range of commands and improving their usefulness in various settings. Further testing is underway to explore how Gemini can handle even more complex scenarios and different types of robots.

Humanizing Robots

What does this mean for our daily lives? Large language models enable robots to interact with humans more naturally and efficiently. This breakthrough isn’t just about sophisticated programming but about making technology approachable and practical. Tasks that once required detailed, methodically programmed instructions are now achievable through simple, natural spoken or written commands.

The demonstrations and research provide a glimpse into a future where robots could assist us in our homes and workplaces more effectively. For instance, robots might help locate misplaced items like keys or wallets, a common frustration for many.

Looking Forward

Despite these innovations, there are still challenges to overcome. Processing times for these commands can range from 10 to 30 seconds, which, while not excessive, highlights room for improvement. As the technology advances, we can expect quicker responses and even greater reliability from these helpful machines.

Embracing the Change

In conclusion, the integration of large language models into robots represents a step forward in human-robot interaction. As technology develops, these robots are becoming more adept at understanding and responding to our commands, making them valuable companions in our daily routines. While the journey is ongoing, the progress made so far holds promise for a future where robots are integral to both our personal and professional lives.

Stay tuned as we continue to follow these exciting developments in technology, providing insights into how they will shape our world.
