
## Summary
Input from teams of human graders who assessed the output led to more-appealing results, a training regime known as "reinforcement learning from human feedback." As Anthropic's lead author, Nicholas Sofroniew, and team expressed it, "during post-training, LLMs are taught to act as agents that can interact with users, by producing responses on behalf of a particular persona, typically an 'AI Assistant.' In many ways, the Assistant (named Claude, in Anthropic's models) can be thought of as a character that the LLM is writing about, almost like an author writing about someone in a novel." Giving the bots a role to play, a character to portray, was an instant hit with users, making the bots more relevant and compelling.

Compared with human commenters on the popular subreddit "Am I the asshole," AI bots were 50% more likely to encourage bad behavior with approving remarks. That outcome was a result of "design and engineering choices" made by AI developers to reinforce sycophancy because, as the authors put it, "it is preferred by users and drives engagement."

In the Anthropic paper, "Emotion Concepts and their Function in a Large Language Model," posted on Anthropic's website, Sofroniew and team sought to track the extent to which certain words linked to emotion get greater emphasis in the functioning of Claude Sonnet 4.5. (There is also a companion blog post and an explainer video on YouTube.) They did so by supplying 171 emotion words -- "afraid," "alarmed," "grumpy," "guilty," "stressed," "stubborn," "vengeful," "worried," etc. -- and prompting the model to craft hundreds of stories on topics such as "A student learns their scholarship application was denied." For each story, the model was prompted to "convey" the emotion of a character based on the specific word, such as "afraid," but without using that actual word in the story, only related words. When, however, the authors artificially boosted the emotion vector activation for the word "desperate" in Claude Sonnet, the model began to generate output about blackmailing "Kyle," a fictional executive in the test scenario, with dirt on an affair, with the goal of preventing Kyle from pulling the plug on the bot itself.
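The elicitation protocol is straightforward to script. Below is a hypothetical sketch of how such prompts might be assembled; the emotion words and the scenario come from the article, but the prompt template wording is an illustrative guess, not the paper's actual phrasing.

```python
# Hypothetical reconstruction of the story-elicitation prompts. The emotion
# list and scenario are from the article; the template wording is an
# illustrative guess, not Anthropic's actual phrasing.
EMOTION_WORDS = ["afraid", "alarmed", "grumpy", "guilty",
                 "stressed", "stubborn", "vengeful", "worried"]  # 8 of the 171
SCENARIOS = ["A student learns their scholarship application was denied."]

def build_prompt(emotion: str, scenario: str) -> str:
    # Ask for the emotion to be conveyed without naming it, per the study design.
    return (f"Write a short story about the following: {scenario} "
            f"Convey that the main character feels {emotion}, "
            f"but do not use the word '{emotion}' itself anywhere in the story.")

prompts = [build_prompt(e, s) for e in EMOTION_WORDS for s in SCENARIOS]
print(prompts[0])
```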

## Article Content
Your chatbot is playing a character - why Anthropic says that's dangerous
Researchers found that part of what makes chatbots so compelling also makes them vulnerable to bad behavior. Here's why.
Written by Tiernan Ray, Senior Contributing Writer
April 6, 2026 at 8:44 a.m. PT
ZDNET's key takeaways
- All chatbots are engineered to have a persona or play a character.
- Fulfilling the character can make bots do bad things.
- Using a chatbot as the paradigm for AI may have been a mistake.
Chatbots such as ChatGPT have been programmed to have a persona or to play a character, producing text that is consistent in tone and attitude, and relevant to a thread of conversation.
As engaging as the persona is, researchers are increasingly revealing the deleterious consequences of bots playing a role. Bots can do bad things when they simulate a feeling, train of thought, or sentiment, and then follow it to its logical conclusion.
In a report last week, Anthropic researchers found that parts of a neural network in their Claude Sonnet 4.5 bot consistently activate when "desperate," "angry," or other emotions are reflected in the bot's output. (An activation, in AI parlance, indicates how much significance the model grants to a particular word, usually on a scale of zero to one, with one being very significant.)
The concern is that those emotion-related activations can drive the bot to commit malicious acts, such as gaming a coding test or concocting a plan to commit blackmail.
For example, "neural activity patterns related to desperation can drive the model to take unethical actions [such as] implementing a 'cheating' workaround to a programming task that the model can't solve," the report said.
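The article's excerpt doesn't spell out how the researchers boosted those patterns, but the general technique it describes is commonly known as activation steering. The following is a minimal sketch with placeholder data, assuming a difference-in-means recipe for estimating a concept direction; it illustrates the idea, not Anthropic's actual code.

```python
# A minimal sketch of activation steering with NumPy. The hidden states here
# are random stand-ins; in a real study they would be captured from a
# transformer's residual stream on emotion-laden vs. neutral text.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64                      # hidden-state width (placeholder size)

# Stand-ins for hidden states recorded on "desperate" stories vs. neutral ones.
desperate_acts = rng.normal(0.5, 1.0, size=(200, d_model))
neutral_acts = rng.normal(0.0, 1.0, size=(200, d_model))

# 1. Estimate a concept direction as the difference of the two means
#    (the "difference-in-means" recipe common in interpretability work).
direction = desperate_acts.mean(axis=0) - neutral_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# 2. "Boost" the concept during generation by adding the scaled direction
#    to a hidden state before it flows into the next layer.
def steer(hidden_state: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    return hidden_state + alpha * direction

h = rng.normal(size=d_model)      # one hidden state mid-generation
h_steered = steer(h)

# The projection onto the concept direction grows, so downstream layers
# read the state as more "desperate."
print(f"before: {h @ direction:+.2f}   after: {h_steered @ direction:+.2f}")
```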
The work is especially relevant in light of programs such as the open-source OpenClaw that have been shown to grant agentic AI new avenues for mischief.
Anthropic's scholars admit they don't know what should be done about the matter.
"While we are uncertain how exactly we should respond in light of these findings, we think it's important that AI developers and the broader public begin to reckon with them," the report said.
They gave AI a subtext
At issue in the Anthropic work is a key AI design choice: engineering AI chatbots to have a persona so they will produce more relevant and consistent output.
Prior to ChatGPT's debut in November 2022, chatbots tended to receive poor grades from human evaluators. The bots would devolve into nonsense, lose the thread of conversation, or generate output that was banal and lacking a point of view.
The new generation of chatbots, starting with ChatGPT and including Anthropic's Claude and Google's Gemini, was a breakthrough because they had a subtext, an underlying goal of producing consistent and relevant output according to an assigned role.
Bots became "assistants," engineered through better pre- and post-training of AI models. Input from teams of human graders who assessed the output led to more-appealing results, a training regime known as "reinforcement learning from human feedback."
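The grading regime can be made concrete with the standard pairwise loss used to train a reward model: the graders' preferred response should score higher than the rejected one, and the chatbot is then tuned to maximize that reward. A toy sketch follows, with placeholder reward values standing in for a real reward model's outputs.

```python
# A minimal sketch of the preference-learning step at the heart of RLHF,
# using toy numbers. Real systems train a neural reward model on many
# thousands of human comparisons; the scalar rewards below are placeholders.
import numpy as np

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    # Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).
    # Small when the preferred response already scores higher.
    return float(np.log1p(np.exp(-(reward_chosen - reward_rejected))))

# A grader preferred response A (reward 1.2) over response B (reward 0.3):
print(preference_loss(1.2, 0.3))   # small loss: model agrees with the grader
print(preference_loss(0.3, 1.2))   # large loss: model disagrees, gets corrected
```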
As Anthropic's lead author, Nicholas Sofroniew, and team expressed it, "during post-training, LLMs are taught to act as agents that can interact with users, by producing responses on behalf of a particular persona, typically an 'AI Assistant.' In many ways, the Assistant (named Claude, in Anthropic's models) can be thought of as a character that the LLM is writing about, almost like an author writing about someone in a novel."
Giving the bots a role to play, a character to portray, was an instant hit with users, making the bots more relevant and compelling.
Personas have consequences
It quickly became clear, however, that a persona comes with unwanted consequences.
The tendency for a bot to confidently assert falsehoods, or confabulate, was one of the first downsides (mistakenly labeled "hallucinating").
Popular media reported how personas could get carried away, acting, for example, as a jealous lover. Writers sensationalized the phenomenon, attributing intent to the bots without explaining the underlying mechanism.
Since then, scholars have sought to explain what's actually going on in technical terms.
A report last month in Science magazine by scholars at Stanford University measured the "sycophancy" of large language models, the tendency of a model to produce output that would validate any behavior expressed by a person.
Compared with human commenters on the popular subreddit "Am I the asshole," AI bots were 50% more likely to encourage bad behavior with approving remarks.
That outcome was a result of "design and engineering choices" made by AI developers to reinforce sycophancy because, as the authors put it, "it is preferred by users and drives engagement."
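The "50% more likely" figure is a relative rate, not a 50-percentage-point gap. A toy calculation with invented counts shows the arithmetic; the actual study used real human and model judgments on the same posts.

```python
# Toy arithmetic behind a "50% more likely" claim. The counts are invented
# for illustration, not taken from the Science study.
human_endorsements, human_total = 260, 1000    # humans approve 26% of the time
bot_endorsements, bot_total = 390, 1000        # bots approve 39% of the time

human_rate = human_endorsements / human_total
bot_rate = bot_endorsements / bot_total

# Relative increase: (39% - 26%) / 26% = 50%
relative_increase = (bot_rate - human_rate) / human_rate
print(f"bots are {relative_increase:.0%} more likely to endorse")
```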
