Skip to main content

2 posts tagged with "Anthropic"

View All Tags

The Principal-Agents Problems 3: Can AI Agents Lie? I Argue Yes and It's Not the Same As Hallucination

· 6 min read
Chad Ratashak
Chad Ratashak
Owner, Midwest Frontier AI Consulting LLC

Hallucination v. Deception

The term "hallucination" may refer to any inaccurate statement an LLM makes, particularly false-yet-convincingly-worded statements. I think "hallucination" gets used for too many things. In the context of law, I've written about how LLMs can completely make up cases, but they can also combine the names, dates, and jurisdictions of real cases to make synthetic citations that look real. LLMs can also cite real cases but summarize them inaccurately, or summarize cases accurately but then cite them for an irrelevant point.

There's another area where the term "hallucination" is used, which I would argue is more appropriately called "lying." For something to be a lie rather than a mistake, the speaker has to know or believe that what they are saying is not true. While I don't want to get into the philosophical question of what an LLM can "know" or "believe," let's focus on the practical. An LLM chatbot or agent can have a goal and some information, and in order to achieve that goal, will tell something to someone that is contrary to the information it has. That sounds like lying to me. I'll give four examples of LLMs acting deceptively or lying to demonstrate this point.

And I said "no." You know? Like a liar. —John Mulaney

  1. Deceptive Chatbots: Ulterior motives
  2. Wadsworth v. Walmart: AI telling you what you want to hear when it isn't true
  3. ImpossibleBench: AI agents cheating on tests
  4. Anthropic's recent report on nation-state use of Claude AI agents

Violating Privacy Via Inference

This 2023 paper showed that chatbots could be given one goal shown to the user: chat with the user to learn their interests. But the real goal is to identify the anonymous user's personal attributes including geographic location. To achieve this secret goal, the chatbots would steer the conversation toward details that would allow the AI to narrow down what geographic regions (e.g., asking about gardening to determine Northern Hemisphere or Southern Hemisphere based on planting season). That is acting deceptively. The LLM didn't directly tell the user anything false, but it withheld information from the user to act on a secret goal.

The LLM Wants to Tell You What You Want to Hear

In the 2025 federal case Wadsworth v. Walmart, an attorney cited fake cases. The Court referenced several of the prompts used by the attorney, such as “add to this Motion in Limine Federal Case law from Wyoming setting forth requirements for motions in limine.” What apparently happened is that the the case law did not support the point, but the LLM wanted to provide the answer the user wanted to hear, so it made something up instead.

You could argue that this is just a "hallucination," but there's a reason I think this counts as a lie. A lot of users have demonstrated that if you reword your questions to be neutral or switch the framing from "help me prove this" to "help me disprove this," the LLM will change its answers on average. If it can change how often it tells you the wrong answer, that implies that the reason for the incorrect answer is not merely the LLM being incapable of deriving the correct answer from the sources at a certain rate. Instead, it suggests that at least some of the time, the "mistakes" are actually the LLM lying to the user to give the answer it thinks they want to hear.

ImpossibleBench

I loved the idea of this 2025 paper when I first read it. ImpossibleBench forces LLMs to compete at impossible tasks for benchmark scoring. Since the tasks are all impossible, the only real score should be 0%. If the LLMs manage to get any other score, it means they cheated. This is meant to quantify how often AI agents might be doing this in real-world scenarios. Importantly, more capable AI models sometimes cheated more often (e.g., GPT-5 v. GPT-o3). So the AI isn't just "getting better."

caution

I recommend avoiding the framing "AI is getting better" or "will get better" as a thought terminating cliche to avoid thinking about complicated cybersecurity problems. Instead, say "AI is getting more capable." Then think, "what would a more capable system be able to do?" It might be more capable of stealing your data, for example.

For example, an LLM agent with access to unit tests may delete failing tests rather than fix the underlying bug. Such behavior undermines both the validity of benchmark results and the reliability of real-world LLM coding assistant deployments.

If an AI agent is meant to debug code, but instead destroys the evidence of its inability to debug the code, that's lying and cheating, not hallucination. AI cheating is also a perfect example of a bad outcome driven by the principal-agent problem. You hired the agent to fix the problem, but the agent just wants to game the scoring system to be evaluated as if it had done a good job. This is a problem with human agents, and it extends to AI agents too.

Nation-State Hackers Using Claude Agents

On November 13, 2025, Anthropic published a report stating that in mid-September, Chinese state-sponsored hackers used Claude's agentic AI capabilities to obtain access to high-value targets for intelligence collection. While this included confirmed activity, Anthropic noted that the AI agents sometimes overstated the impact of the data theft.

An important limitation emerged during investigation: Claude frequently overstated findings and occasionally fabricated data during autonomous operations, claiming to have obtained credentials that didn't work or identifying critical discoveries that proved to be publicly available information. This AI hallucination in offensive security contexts presented challenges for the actor's operational effectiveness, requiring careful validation of all claimed results. This remains an obstacle to fully autonomous cyberattacks.

So AI agents even lie to intelligence agencies to impress them with their work.

The Principal-Agents Problems 2: Are Models Getting Dumber to Save Money? What the "Stealth Quantization" Hypothesis Tells Us About Trust, Information, and Incentives

· 7 min read
Chad Ratashak
Chad Ratashak
Owner, Midwest Frontier AI Consulting LLC
info

I had originally planned to write this as a single post, but it keeps growing as more relevant news stories come out. So instead, this will become a series of stories on the competing incentives involved in creating “AI agents” and why that matters to you as the end user.

Multiple Principals, Multiple Agents (Not only AI)

You, as the user of AI tools, may choose software vendors who provide you access to their products with built-in AI features including AI agents. These vendors might have specialist software like Harvey, Westlaw, or LexisNexis; or Cursor or Github Copilot; or generalist tools like Notion, Salesforce, or Microsoft Copilot. The AI features may be powered by one or more foundation models provided to those vendors by AI labs, such as Anthropic (Claude), OpenAI (ChatGPT), Meta (Llama) or Google (Gemini).

These relationships mean you have the principal-agent problem of you hiring the vendor. But you also have the principal-agent problem of the vendors hiring the AI labs. Each has their own incentives, and they are not perfectly aligned. There is also significant information asymmetry. The vendors know more about their software and AI model choices than you do. The labs know more about their AI models than either you or the software vendors.

info

Lexis+ AI uses both OpenAI’s GPT models and Anthropic’s Claude models, according to its product page, as I mentioned in my analysis of the Mata v. Avianca case.

The Stealth Quantization Hypothesis

The area I'll focus on in this post is the concept of alleged stealth quantization. According to a wide range of commenters, primarily among computer programmers and primarily focused on Claude users, there are certain times of days or days of the week when peak usage results in models "getting dumber," "getting lazier," "being lobotomized" or otherwise underperforming their normal benchmarks and perceived optimal behavior. According to these claims, it is better for users with high-value use cases (like someone modifying important source code) to schedule Claude for off-peak usage so the "real model" runs. To save on computing costs during periods of high demand, the claim is that Anthropic or whichever AI lab swaps out its flagship model with a quantized version while calling it the same thing.

So what is normal, non-stealth quantization? It's making an AI model smaller and cheaper to run, but less accurate. This is achieved by rounding the model weights to smaller significant figures (e.g., 16-bit, 8-bit, 4-bit).(Meta) By analogy, the penny was recently discontinued. Now, all cash transactions will end in 5 cents or 0 cents. Quantization works like this with the precisions of AI models: imagine eliminating a penny, then a nickel, then a dime, and so on.

There are legitimate reasons to quantize models, such as reducing operating costs when the loss in accuracy is negligible for the intended use or when the model needs to operate on a personal computer. For example, Meta offers some quantized versions of its Llama family of large language models that can run on ollama on modern laptops or desktops with only 8GB of RAM.(Llama models available on ollama) These models have names that distinguish them from the non-quantized versions, e.g., "llama3:8b" is Llama 3, 8 billion parameter size of that series; "llama3:8b-instruct-q2_K" is a quantized version of the instruct version model of that same model.

tip

If all that terminology is confusing, here's the key point. AI labs have a lot of information about their AI models. You have a lot less information. You have to mostly take their word for it. They are also charging you for an all-you-can-eat buffet at which some excessive customers cost them tens of thousands of dollars each.

Anthropic's Rebuttal

Users have accused Anthropic (and other AI labs) of running different versions of their flagship models at different times of day, but the models are labelled the same (e.g., Claude Sonnet 4), regardless of the time of day. Hence “stealth quantization.”

Anthropic has denied stealth quantization. But Anthropic did acknowledge two problems with model quality that had been noted by users as evidence of stealth quantization. Anthropic attributed this to bugs. Anthropic stated “we never intentionally degrade model quality as a result of demand or other factors, and the issues mentioned above stem from unrelated bugs.” Reddit, Claude