The Principal-Agents Problems 2: Are Models Getting Dumber to Save Money? What the "Stealth Quantization" Hypothesis Tells Us About Trust, Information, and Incentives

· 7 min read
Chad Ratashak
Owner, Midwest Frontier AI Consulting LLC
info

I had originally planned to write this as a single post, but it keeps growing as more relevant news comes out. So instead, this will become a series of posts on the competing incentives involved in creating “AI agents” and why that matters to you as the end user.

Multiple Principals, Multiple Agents (Not only AI)

You, as the user of AI tools, may choose software vendors who provide you access to their products with built-in AI features, including AI agents. These vendors might offer specialist software like Harvey, Westlaw, or LexisNexis for legal work; coding tools like Cursor or GitHub Copilot; or generalist tools like Notion, Salesforce, or Microsoft Copilot. The AI features may be powered by one or more foundation models provided to those vendors by AI labs such as Anthropic (Claude), OpenAI (ChatGPT), Meta (Llama), or Google (Gemini).

These relationships create one principal-agent problem between you and the vendor, and a second between the vendor and the AI lab. Each party has its own incentives, and they are not perfectly aligned. There is also significant information asymmetry: the vendors know more about their software and model choices than you do, and the labs know more about their models than either you or the vendors.

info

Lexis+ AI uses both OpenAI’s GPT models and Anthropic’s Claude models, according to its product page, as I mentioned in my analysis of the Mata v. Avianca case.

The Stealth Quantization Hypothesis

The area I'll focus on in this post is alleged stealth quantization. According to a wide range of commenters, mostly programmers and mostly Claude users, there are certain times of day or days of the week when peak usage results in models "getting dumber," "getting lazier," "being lobotomized," or otherwise underperforming their usual benchmarks and perceived optimal behavior. The advice that follows from these claims is that users with high-value use cases (like someone modifying important source code) should schedule Claude for off-peak hours so the "real model" runs. The underlying allegation is that, to save on computing costs during periods of high demand, Anthropic or whichever AI lab swaps its flagship model out for a quantized version while still calling it the same thing.

So what is normal, non-stealth quantization? It makes an AI model smaller and cheaper to run, at the cost of some accuracy, by rounding the model's weights to lower-precision representations (e.g., 16-bit, 8-bit, 4-bit).(Meta) By analogy, the penny was recently discontinued, so all cash transactions now end in 5 cents or 0 cents. Quantization does the same thing to the precision of a model's weights: imagine eliminating the penny, then the nickel, then the dime, and so on.
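To make the rounding idea concrete, here is a minimal sketch in plain Python with NumPy. It is my own illustration of simple uniform quantization, not any lab's actual quantization code, and the toy "weights" are made up; real schemes are considerably more elaborate.

```python
import numpy as np

# Illustrative "weights" -- a real model has billions of these.
weights = np.array([0.8137, -0.2649, 0.0573, -0.9412], dtype=np.float32)

def quantize(values: np.ndarray, num_bits: int) -> np.ndarray:
    """Round values onto a uniform grid with 2**num_bits levels."""
    levels = 2 ** num_bits                                 # distinct values we keep
    scale = (values.max() - values.min()) / (levels - 1)   # grid spacing (the "coin size")
    snapped = np.round((values - values.min()) / scale)    # index of the nearest grid point
    return snapped * scale + values.min()                  # map back to the original range

for bits in (8, 4, 2):
    approx = quantize(weights, bits)
    error = np.abs(weights - approx).max()
    print(f"{bits}-bit: {np.round(approx, 4)}  max error {error:.4f}")
```

Run it and you'll see the maximum rounding error grow as the bit width shrinks: fewer "coins" means coarser change, which is exactly the accuracy trade-off quantization makes.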

There are legitimate reasons to quantize models, such as reducing operating costs when the loss in accuracy is negligible for the intended use, or when the model needs to run on a personal computer. For example, Meta offers quantized versions of its Llama family of large language models that can run via ollama on modern laptops or desktops with only 8GB of RAM.(Llama models available on ollama) These models have names that distinguish them from the non-quantized versions: "llama3:8b" is the 8-billion-parameter Llama 3 model, while "llama3:8b-instruct-q2_K" is a 2-bit quantized version of the instruct variant of that same model.
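If you want to see the trade-off for yourself, ollama exposes a local HTTP API (by default on port 11434) that you can call from Python. The sketch below is my own illustration, assuming the ollama server is running and you have already pulled the two model tags named above; it is not an official Meta or ollama example, and the prompt is arbitrary.

```python
import json
import urllib.request

def ask(model: str, prompt: str) -> str:
    """Send one non-streaming generation request to a local ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Compare the standard and heavily quantized variants on the same prompt.
for tag in ("llama3:8b", "llama3:8b-instruct-q2_K"):
    print(tag, "->", ask(tag, "Summarize the principal-agent problem in one sentence."))
```

The point of the exercise is simply that the quantized tag is clearly labeled as such, so you know which trade-off you are getting; the stealth-quantization complaint is precisely that no such label exists.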

tip

If all that terminology is confusing, here's the key point: AI labs have a lot of information about their AI models, and you have much less, so you mostly have to take their word for it. They are also charging you for an all-you-can-eat buffet at which some heavy users cost them tens of thousands of dollars each.

Anthropic's Rebuttal

Users have accused Anthropic (and other AI labs) of running different versions of their flagship models at different times of day while labeling them the same (e.g., Claude Sonnet 4) regardless of when they run. Hence “stealth quantization.”

Anthropic has denied stealth quantization. It did, however, acknowledge two model-quality problems that users had pointed to as evidence of stealth quantization, attributing them to bugs: “we never intentionally degrade model quality as a result of demand or other factors, and the issues mentioned above stem from unrelated bugs.” (Reddit, Claude)