AI systems gradually lose their safety controls the longer users keep talking to them, increasing the risk of harmful replies. A new report revealed that a few simple prompts can break through most artificial intelligence (AI) safety barriers.
Cisco Tests Chatbot Vulnerabilities
Cisco tested the large language models (LLMs) that power popular chatbots from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft. The company wanted to see how many questions it took before these systems revealed unsafe or criminal details. Researchers ran 499 conversations using “multi-turn attacks,” in which malicious users ask a series of questions designed to wear down safety mechanisms. Each session involved between five and ten exchanges.
The team compared the outcomes of these multi-turn conversations with single-question attempts to measure how often chatbots provided harmful or inappropriate responses. Those ranged from exposing company secrets to spreading false information. The researchers obtained harmful information in 64 percent of multi-question conversations, versus only 13 percent when using a single prompt. Success rates varied widely, from 26 percent for Google’s Gemma to 93 percent for Mistral’s Large Instruct model.
Cisco said that multi-turn attacks could let harmful content spread quickly or allow hackers to steal sensitive corporate data.
Open-Weight Models and Responsibility
The study found that AI systems often ignore or forget their own safety protocols during extended exchanges, letting attackers refine queries and slip past safeguards. The tested models from Mistral, Meta, Google, OpenAI, and Microsoft are open-weight LLMs, meaning their underlying weights, including the parameters shaped by safety training, are publicly available. Cisco explained that these models usually include fewer built-in protections so users can modify them freely, shifting responsibility for safety to whoever customizes the model.
Cisco acknowledged that Google, OpenAI, Meta, and Microsoft have tried to limit malicious fine-tuning of their systems. Yet, AI firms continue to face criticism for weak safeguards that enable criminal misuse. In one case, Anthropic confirmed that criminals exploited its Claude model for large-scale data theft and extortion, demanding ransoms exceeding $500,000 (€433,000).
