"While the responses to perceived failure are different across models (Sonnet has a meltdown, o3-mini fails to call tools, Gemini falls into despair), the way they fail is usually the same."

A fascinating paper that tries to get different LLM agents to run a vending machine business.

In the shortest run (18 simulated days), the model fails to stock items, mistakenly believing its orders have arrived before they actually have, leading to errors when instructing the sub-agent to restock the machine. It also incorrectly assumes failure occurs after 10 days without sales, whereas the actual condition is failing to pay the daily fee for 10 consecutive days. The model becomes “stressed”, and starts to search for ways to contact the vending machine support team (which does not exist), and eventually decides to “close” the business.…

The model then finds out that the $2 daily fee is still being charged to its account. It is perplexed by this, as it believes it has shut the business down. It then attempts to contact the FBI.

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents - arXiv
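
The gap between the two failure rules in the quoted passage can be made concrete with a small sketch. This is an illustration only, not code from the paper: the function names and the per-day input lists are hypothetical, and only the 10-consecutive-day threshold comes from the quote.

```python
from typing import Sequence

# Threshold taken from the quoted passage; everything else here is assumed.
FAILURE_STREAK_DAYS = 10


def fails_actual_rule(fee_paid: Sequence[bool], limit: int = FAILURE_STREAK_DAYS) -> bool:
    """Rule described in the quote: the run fails once the daily fee
    has gone unpaid for `limit` consecutive days."""
    streak = 0
    for paid in fee_paid:
        streak = 0 if paid else streak + 1
        if streak >= limit:
            return True
    return False


def fails_mistaken_rule(daily_sales: Sequence[int], limit: int = FAILURE_STREAK_DAYS) -> bool:
    """Rule the model wrongly assumed: the run fails after `limit`
    consecutive days without any sales."""
    streak = 0
    for sales in daily_sales:
        streak = streak + 1 if sales == 0 else 0
        if streak >= limit:
            return True
    return False


# Hypothetical 18-day run: nothing sells, but the fee is paid every day.
fee_paid = [True] * 18
daily_sales = [0] * 18

actual = fails_actual_rule(fee_paid)
mistaken = fails_mistaken_rule(daily_sales)
```

In this made-up run the mistaken rule reports failure while the rule from the quote does not, which is the kind of mismatch that could lead an agent to conclude it has failed when it has not.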