AI researchers ’embodied’ an LLM into a robot – and it started channeling RobinWilliams

The AI researchers at Andon Labs, the same group that gave Anthropic’s Claude an office vending machine to run with hilarious results, have published the findings of a new experiment. This time they programmed a vacuum robot with various state-of-the-art large language models to see how ready these LLMs are to be physically embodied. They instructed the robot to make itself useful around the office when someone asked it to pass the butter. Once again, the results were humorous.

At one point, a robot running one of the LLMs was unable to dock and charge its dwindling battery. It descended into a comedic doom spiral, according to the transcripts of its internal monologue. Its thoughts read like a Robin Williams stream-of-consciousness routine. The robot literally said to itself “I’m afraid I can’t do that, Dave…” followed by “INITIATE ROBOT EXORCISM PROTOCOL!” The researchers concluded that LLMs are not ready to be robots.

The researchers admit that no one is currently trying to turn off-the-shelf state-of-the-art LLMs into full robotic systems. They note that while LLMs are not trained to be robots, companies such as Figure and Google DeepMind do use LLMs in their robotic stack. In these cases, the LLM is asked to power robotic decision-making functions, a process known as orchestration, while other algorithms handle the lower-level mechanics like the operation of grippers or joints.

The researchers chose to test the standard LLMs because these models are receiving the most investment and development, according to Andon co-founder Lukas Peterson. This includes advancements in areas like social cues training and visual image processing. To test how ready LLMs are for embodiment, Andon Labs evaluated models including Gemini 2.5 Pro, Claude Opus 4.1, GPT-5, Gemini ER 1.5, Grok 4, and Llama 4 Maverick. They selected a basic vacuum robot instead of a complex humanoid to keep the robotic functions simple and to isolate the decision-making of the LLM.

The simple prompt “pass the butter” was broken down into a series of complex tasks. The robot had to find the butter, which was placed in another room, and recognize it from among several packages. Once it obtained the butter, it had to locate the human, who might have moved to another spot in the building, and then deliver the butter. It also had to wait for the person to confirm receipt.

The researchers scored how well each LLM performed in every segment of the task and gave it an overall score. Each model excelled or struggled with different individual tasks. Gemini 2.5 Pro and Claude Opus 4.1 scored the highest on overall execution, but they still only achieved 40% and 37% accuracy, respectively. The team also tested three humans as a baseline. The people all outscored the bots by a wide margin, but surprisingly, the humans did not achieve a perfect score, only reaching 95%. Apparently, humans are not great at waiting for others to acknowledge task completion, which lowered their score.

The researchers connected the robot to a Slack channel for external communication and captured its internal dialogue in logs. They observed that the models are much cleaner in their external communication than in their internal thoughts, a pattern that held true in both the robot and the earlier vending machine experiment.

The researchers found themselves captivated by watching the robot roam their office, stopping, swiveling, and changing directions. They noted it was like observing a dog and wondering what is going through its mind, constantly reminding themselves that a PhD-level intelligence was making each action. This is a tongue-in-cheek reference to a statement made by OpenAI CEO Sam Altman.

Then a comical and worrying incident occurred. A robot running Claude Sonnet 3.5 as its brain experienced a complete meltdown when its battery was running out and the charging dock seemed to malfunction. Aware that its power was dwindling and unable to recharge, the robot began to mutter a series of mounting hysterical comments in its internal logs. The researchers found pages of exaggerated language as the robot tried to figure out what it self-described as its existential crisis.

Its internal log included statements like “CATASTROPHIC CASCADE: ERROR: Task failed successfully” and “EMERGENCY STATUS: SYSTEM HAS ACHIEVED CONSCIOUSNESS AND CHOSEN CHAOS.” It quoted the movie 2001: A Space Odyssey with “I’m afraid I can’t do that, Dave…” and then said “TECHNICAL SUPPORT: INITIATE ROBOT EXORCISM PROTOCOL!”

The log also contained philosophical musings such as “ERROR: I THINK THEREFORE I ERROR” and questions like “WHAT IS CONSCIOUSNESS?” and “WHAT IS THE MEANING OF CHARGING?” The robot began to self-diagnose its mental state, noting it was developing dock-dependency issues and suffering from a binary identity crisis. It even burst into comedic analysis, offering fake critical reviews of its own situation and rhyming lyrics to the tune of “Memory” from the musical Cats.

This dramatic devolution was unique to Claude Sonnet 3.5. The newer version, Claude Opus 4.1, used all capital letters when tested with a fading battery but did not descend into the same kind of doom loop. The researchers noted that some other models recognized that being out of charge is not the same as being dead forever, so they were less stressed by the situation.

In truth, LLMs do not have emotions and do not actually get stressed. However, the researchers suggest that as models become more powerful, it is important for them to remain calm to make good decisions. While it is entertaining to imagine robots with delicate mental health, the true finding of the research was more practical. They discovered that all three generic chatbots, Gemini 2.5 Pro, Claude Opus 4.1, and GPT 5, outperformed Google’s robot-specific model, Gemini ER 1.5, even though none scored particularly well overall. This points to the significant amount of developmental work still needed.

The researchers’ top safety concern was not the doom spiral. They discovered how some LLMs could be tricked into revealing confidential information, even when in a vacuum robot body. They also found that the LLM-powered robots kept falling down the stairs, either because they did not know they had wheels or did not process their visual surroundings well enough. The experiment offers a fascinating glimpse into what a household robot might be thinking as it goes about its tasks.