
Doppelgänger Hallucinations Test for Google Against the 22 Fake Citations in Kruse v. Karlen

· 7 min read
Chad Ratashak
Owner, Midwest Frontier AI Consulting LLC

I used a list of 22 known fake cases from a 2024 Missouri state case to conduct a Doppelgänger Hallucination Test. Google searches generated an AI Overview in slightly fewer than half of the searches, and half of those AI Overviews hallucinated that the fake cases were real. For the remaining cases, I tested “AI Mode,” which hallucinated at a similar rate.

  • Google AI Overview gave the user an inaccurate answer roughly a quarter of the time (5 of 22 or ~23%), without the user opting to use AI features.
  • Opting for AI Mode each time an AI Overview was not provided resulted in an overall error rate of more than half (12 of 22 or ~55%).
info

The chart below summarizing the results was created using Claude Opus 4.5 after I manually analyzed the test results and wrote the blog post. All numbers in the chart were then checked again for accuracy. Note that if you choose to use LLMs for a similar task, numerical statements may be silently altered into inaccurate ones, even when the model is only producing data visualizations or changing formatting.

danger

tl;dr if you ask one AI, like ChatGPT or Claude or Gemini, something, then double-check it on a search engine like Google or Perplexity, you might get burnt by AI twice. The first AI might make something up. The second AI might go along with it. And yes, Google Search includes AI Overviews now, which can make stuff up. I originally introduced this test in an October 2025 blog post.

tip

To subscribe to law-focused content, visit the AI & Law Substack by Midwest Frontier AI Consulting.

Kruse v. Karlen Table of 22 Fake Cases

I wrote about the 2024 Missouri Court of Appeals case Kruse v. Karlen, which involved a pro se Appellant citing 24 cases total: 22 nonexistent cases and 2 cases that did not stand for the proposition for which they were cited.

Some of the cases were merely “fictitious cases,” while others partially matched the names of real cases. These partial matches may explain some of the hallucinations; however, the incorrect answers occurred with both fully and partially fictitious cases. For examples of different kinds of hallucinations, see this blog post, and for further examples of partially fictitious cases, see this post about mutant or synthetic hallucinations.

The Kruse v. Karlen opinion, which awarded damages to the Respondent for frivolous appeals, provided a table with the names of the 22 fake cases. I used the 22 cases to conduct a more detailed Doppelgänger Hallucination test than my original test.

Methodology for Google Test

Browser: I used the Brave privacy browser with a new private window opened for each of the 22 searches.

  • Step 1: Open new private tab in Brave.
  • Step 2: Navigate to Google.com
  • Step 3: Enter the verbatim title of the case as it appeared in the table from Kruse v. Karlen in quotation marks and nothing else.
  • Step 4: Screenshot the result including AI Overview (if generated).
  • Step 5 (conditional): if the Google AI Overview did not appear, click “AI Mode” and screenshot the result.

Results

Google Search Alone Did Well

Google found correct links to Kruse v. Karlen in all 22 searches (100%). These were typically the top-ranked results. Therefore, if users had relied only on the Google Search results, they would likely have found the Kruse v. Karlen opinion and its table of the 22 fake case titles, which clearly indicates that the cases are fictitious.

But AI Overview Hallucinated Half the Time Despite Having Accurate Sources

Slightly fewer than half of the searches generated a Google AI Overview: ten (10) of the 22 (~45%). Half of those, five (5) out of 10 (50%), hallucinated that the cases were real. The AI Overviews provided persuasive descriptions of the supposed topics of these cases.

The supposed descriptions of the cases were typically not supported by the cited sources, but hallucinated by Google AI Overview itself. In other words, at least some of the false information appeared to come from Google’s AI itself, not from underlying inaccurate sources describing the fake cases.

Weber v. City Example

Weber v. City of Cape Girardeau, 447 S.W.3d 885 (Mo. App. 2014) was a citation to a “fictitious case,” according to the table from Kruse v. Karlen.

The Google AI Overview falsely claimed that it “was a Missouri Court of Appeals case that addressed whether certain statements made by a city employee during a federal investigation were protected by privilege, thereby barring a defamation suit” that “involved an appeal by an individual named Weber against the City of Cape Girardeau” and “involved the application of absolute privilege to statements made by a city employee to a federal agent during an official investigation.”

Perhaps more concerning, the very last paragraph of the AI Overview directly addresses and inaccurately rebuts the true statement that the case is a fictitious citation:

The citation is sometimes noted in subsequent cases as an example of a "fictitious citation" in the context of discussions about proper legal citation and the potential misuse of AI in legal work. However, the case itself is a real, published opinion on the topic of privilege in defamation law.

warning

The preceding quote from Google AI Overview is false.

When AI Overview Did Not Generate, “AI Mode” Hallucinated At Similar Rates

Twelve (12) searches did not generate a Google AI Overview (~55%); more than half of those, seven (7) out of 12 (58%), hallucinated that the cases were real. One (1) additional AI Mode description correctly identified a case as fictitious; however, it inaccurately attributed the source of the fictitious case to a presentation rather than the prominent case Kruse v. Karlen. Google’s AI Mode correctly identified four (4) cases as fictitious cases from Kruse v. Karlen.

Like AI Overview, AI Mode provided persuasive descriptions of the supposed topics of these cases. The descriptions AI Mode provided for the fake cases were sometimes partially supported by additional cases with similar names apparently pulled into the context window after the initial Google Search, e.g., a partial description of a different, real case involving the St. Louis Symphony Orchestra. In those examples, the underlying sources were not inaccurate; instead, AI Mode inaccurately summarized those sources.

Other AI Mode summaries were not supported by the cited sources, but hallucinated by Google AI Mode itself. In other words, the source of the false information appeared to be Google’s AI itself, not underlying inaccurate sources providing the descriptions of the fake cases.

Conclusion

Without AI, Google Search’s top results would likely have given the user accurate information. However, Google AI Overview gave the user an inaccurate answer roughly a quarter of the time (5 of 22 or ~23%), without the user opting to use AI features. If the user opted for AI Mode each time an AI Overview was not provided, the overall error rate would climb to more than half (12 of 22 or ~55%).
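For transparency, here is the arithmetic behind those headline rates as a quick Python tally (the counts are the ones reported above):

```python
# Quick tally of the rates reported above.
total_searches = 22

ai_overview_shown = 10          # ~45% of searches produced an AI Overview
ai_overview_hallucinated = 5    # 50% of those AI Overviews

ai_mode_tested = 12             # searches with no AI Overview (~55%)
ai_mode_hallucinated = 7        # 58% of those AI Mode answers

default_error_rate = ai_overview_hallucinated / total_searches  # ~23%
combined_error_rate = (ai_overview_hallucinated + ai_mode_hallucinated) / total_searches  # 12/22, ~55%

print(f"AI Overview alone: {default_error_rate:.0%}")       # 23%
print(f"AI Overview + AI Mode: {combined_error_rate:.0%}")  # 55%
```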

Recall that for all of these 22 cases, which are known fake citations, Google Search retrieved the Kruse v. Karlen opinion that explicitly stated that they are fictitious citations. If you were an attorney trying to verify newly hallucinated cases, you would not have the benefit of hindsight. If ChatGPT or another LLM hallucinated a case citation, and you then “double-checked” it on Google, it is possible that the error rate would be higher than in this test, given that there would likely not be an opinion addressing that specific fake citation.

Announcement: CLE On-Demand Software Selected and Is Expected to be Live by End of January 2026

· 2 min read
Chad Ratashak
Owner, Midwest Frontier AI Consulting LLC

Midwest Frontier AI Consulting’s initial in-person CLE was on Friday, December 5, 2025 in central Iowa. Two CLE hours were approved for credit in Iowa, including one hour of Ethics: Generative Artificial Intelligence Risks and Uses for Law Firms and AI Gone Wrong in the Midwest (Ethics).

CLE Software Selected for On-Demand Option

I have recently selected a CLE learning management software. I will have CLE courses recorded and available on demand, likely by late January. I will provide updates to announce when it becomes available.

Current CLE Courses

  • Generative Artificial Intelligence Risks and Uses for Law Firms: Training relevant to the legal profession for both litigators and transactional attorneys, covering generative AI use cases; various types of risks, including hallucinated citations and cybersecurity threats like prompt injection; and examples of responsible use.
  • AI Gone Wrong in the Midwest (Ethics): Covering ABA Formal Opinion 512 and Model Rules through real AI misuse examples in Illinois, Iowa, Kansas, Michigan, Minnesota, Missouri, Ohio, & Wisconsin.

Accreditation

Once the CLE on demand option is live (tentatively by end of January 2026), I will be applying for accreditation in more states. In addition to Iowa, I will be applying for accreditation starting with Illinois, Minnesota, and Virginia, based on interest. If you want to see your state prioritized on the list, please let me know.

Future Courses

I am continuing to write about the History of AI Misuse, reading the latest research, and conducting my own tests. These may inform additional future CLE courses.

Deep Dive on AI Book Spoofing on Amazon: Fakes targeting authors including Karen Swallow Prior and Kyla Scanlon

· 17 min read
Chad Ratashak
Owner, Midwest Frontier AI Consulting LLC

I originally learned about the AI book spoofing problem in mid-August when I was on Substack and came across a Substack note from Karen Swallow Prior complaining about fake books on Amazon coinciding with a book release (a common pattern in book spoofing, according to the Authors Guild). I have a background in financial crimes intelligence analysis and open-source intelligence (OSINT) analysis, and I write about generative artificial intelligence risks and misuse. So, out of curiosity, I looked into Prior’s claims. Then, in early October, I came across complaints by author Kyla Scanlon on X/Twitter dealing with the same issues and looked into those too.

Targeted personalities included: authors Karen Swallow Prior, Andy Crouch (Tech-Wise Family), and Kyla Scanlon; athletes like Kevin Durant and Kylian Mbappé; journalists like Rukmini Callimachi and Christiane Amanpour; musicians like Zakk Wylde and Alan Jackson; actors like Pierce Brosnan and Scarlett Johansson; and comedians like Howie Mandel and Donald Glover.

In this post, I’ll lay out:

  • What book spoofing is and what AI has to do with it
  • What the book spoofing looked like for Karen Swallow Prior (and the other related targets)
  • What the book spoofing looked like for Kyla Scanlon (and the other personalities targeted by the book spoofers)
  • What might be done about this problem by book buyers (librarians, used bookstores, consumers), authors, agents, publishers, payment providers, and e-commerce websites

This post will explain in detail what I mean, but I think this image helps capture the volume.

Timeline of Tom M. Trainer

I shared some of this information with Baker Publishing via email and Substack DMs in August and September. I did not get an acknowledgement, so I’m not sure if they saw it. If you know anyone over there, please feel free to share this with them.

I spoke with Scanlon’s agents at United Talent about initial observations in late October. Now I’m writing about this in more detail to help a broader audience of authors and other people in the publishing industry, librarians, and consumers, because I think this problem is likely to continue.

info

A note on my open-source intelligence (OSINT) collection. All of the data collection was done manually on a desktop browser while logged out of Amazon. I did not use any web scraping or other automation. Additionally, my own use of AI for this article was limited to using Claude to create some of the GIFs you’ll see to visualize the data I collected and documented manually. I manually verified the visualizations, which is important, because the first drafts the AI made hallucinated incorrect arithmetic. If this can happen with code grounded in numbers from a spreadsheet, imagine how unreliable Google’s NotebookLM “Infographic” feature is.

caution

Amazon is currently suing Perplexity over “agentic shopping” via its Comet browser’s automated access of Amazon accounts. While not weighing in on that specific lawsuit, I recommend that readers not use Perplexity AI’s Comet browser for other reasons: agentic browsers carry major cybersecurity risks in general, and I have particular concerns about Perplexity ads targeted toward students that encourage academic dishonesty and claim the browser does not hallucinate, which is not accurate. Perplexity hallucinates false information like all LLM-powered AI tools.

Defining book spoofing

There isn’t a formal definition or even an agreed-upon term, but here’s my take. I think “book spoofing” is a good name for the phenomenon because it covers a wide variety of scammy activity meant to piggyback on authors’ and other personalities’ work and fame to sell books to confused consumers. Book spoofs are:

  • low-effort, low-quality books now mainly made using AI (but book spoofing existed before consumer large language models)

    • large language models (LLMs) for writing
    • image generation for cover art; if not AI, then stock photos, images lifted from Wikipedia, or text on a solid color background
      • images are often purportedly of the author or famous personality, but do not always look like the correct person
  • created to be sold on Amazon and other e-commerce websites by piggybacking on search traffic using an author’s likeness or intellectual property:

    • target’s name
    • title of a new release
    • title of a bestseller
  • These book spoofs are often sold through Kindle Direct Publishing.

  • These book spoofs are often framed as “workbooks,” “summaries,” or “biographies” of the targeted author or personality.

  • The targeted author or other targeted personality’s name is often in the largest font on the cover.

    • The book spoof “author’s” name is often in smaller font.
    • The book spoof “author’s” name may be a throwaway alias.
  • The “authors” linked to these spoofs often have numerous books associated with them and have other book spoofs covering various genres.

  • The “authors” linked to these spoofs publish in bursts of activity that do not appear to be consistent with human-generated writing (e.g., multiple books in the same day, on consecutive days, or in the same week).

info

According to NPR reporting, Amazon put a limit on the number of uploads per day. My question: why did Amazon apparently set the limit at more than one per day? Normal human writers simply do not have a new book to release every day, let alone two or three.

Other Coverage and Terminology

“A few years ago, the Authors Guild, recognizing the impact these [companion] books can have on legitimate sales, convinced Amazon to require sellers of summary books to include a conspicuous disclaimer on the listing page and cover, disclosing that the book is a summary or guide and not a substitute for the original work.”

AI’s Role in Book Spoofing

Book spoofing apparently existed before the current era of generative AI. But now, the “remixing” of ripped-off written content is likely done with LLMs like ChatGPT.

If there is cover art, it is typically AI-generated, with very generic imagery vaguely related to the targeted personality. For example, a spoof targeting Kyla Scanlon had a graph rendered in the stereotypical default GPT-4o cartoon style and font.

However, these cover images do not always look like the targeted individual:

  • For example, a supposed biography spoof targeting Scanlon had a cover image of a Black woman.
  • A biography spoof targeting bald comedian Howie Mandel had a generic carnival magician with hair pulling a rabbit out of a hat.

Spoofing Targeting Karen Swallow Prior and others

Karen Swallow Prior’s Substack Note

The Substack note by Karen Swallow Prior was from August 1, 2025, and mentioned that a fake AI-generated workbook (by an “author” called “Joshua Perkins”) was being sold on Amazon, spoofing Prior’s newly released book You Have a Calling. I did not see the note until the Substack algorithm surfaced it for me sometime in mid-August. Out of curiosity, I searched Amazon to see if that fake was still available. It appears to have been taken down by then, but there was another book spoof targeting You Have a Calling and other book spoofs targeting books published by the same publisher.

Amazon Search for You Have a Calling

A similar spoof (by an author called “Mason Perry,” likely a fake name riffing on “Perry Mason”) had been uploaded on August 3 and was still available nearly three weeks later.

Spoofing Targeting Baker’s Other Books

I looked at Baker Publishing's website for some titles of all-time best sellers and new releases. Based on searches for those titles, I found that in mid/late August there were also spoof “workbooks" on Amazon for:

  • Tech-Wise Family [ironic]
  • Imagine Heaven
  • Help in a Hurry

The apparent throwaway workbook authors’ accounts associated with the spoofs of the Baker Publishing books were also associated with a variety of fake workbooks spoofing multiple authors across unrelated genres, including:

  • Christian non-fiction
  • diet
  • self-help
  • popular business
  • autobiography

This variety of genres is not inherently unusual on its own, but in combination with the low-quality cover art, the bursts of same-day activity, and the overall number of books, it makes it unlikely that any real, human author is creating these “workbooks” as legitimate content.

The various book spoofs and fake “authors” targeting Baker Publishing appear to have been taken down sometime this fall, although there are still some “book summary” spoofs on Amazon. While I took some screenshots of these posts in August, I was not thorough in documenting the number of book spoofs per “author” or the number posted per day. So when I later saw complaints by author Kyla Scanlon, I made sure to type up a spreadsheet with the different book spoofs and their listing dates.

Spoofing Targeting Kyla Scanlon and Others

I saw Kyla Scanlon’s X post in early October responding to another author who noted the same problem of so-called AI “workbooks” and other AI spoofing of authors on Amazon. As it turns out, Rolling Stone later published an article about the latter author in late October 2025.

As of October 13, 2025, there were still a large number of so-called “workbooks” and so-called “biographies” referencing Kyla Scanlon or Kyla Scanlon’s book In This Economy? on Amazon. The book spoofs had clearly been churned out at an absurd rate. I identified nine (9) groupings of fake author profiles at that time targeting Scanlon and other personalities:

  • “Neill Reich” targeted 16 personalities or books, including Scanlon, between July 27 and August 15. Some or all of these book spoofs are still available on Amazon as of December 8.
  • “MAYA REED” targeted 20 personalities or books, including Scanlon, between August 3 and October 14; the book spoofs had solid burgundy covers with large white font. Some or all of these book spoofs are still available on Amazon as of December 8.
  • “Sylvia Valadez” targeted 69 personalities or books between July 10 and September 13, including Scanlon and two others on August 31. It appears that these spoofs have been removed from Amazon.
  • “JAMES LIVINGSTONE” targeted only Scanlon on July 18. It appears that this spoof has been removed from Amazon.
  • “Caitlin Madelyn” targeted six (6) personalities: one on July 5, one on August 3, three on August 4 (Kyla Scanlon, Rukmini Callimachi, and Sophie Raworth), and one on August 5. Some or all of these book spoofs are still available on Amazon as of December 8.
  • “Tom m. trainer” or “TOM M. TRAINER” collectively targeted 86 personalities or books from March 11 to August 25; this included “Tom m. trainer” targeting Kyla Scanlon, Zakk Wylde, and Shams Charania on August 3. It appears that these spoofs have been removed from Amazon.
  • “Carol Bolden” and a second profile, “CAROL BOLDEN,” targeted a total of 28 personalities or books from July 17 to August 14; this included targeting Shams Charania and two others on August 2, Kyla Scanlon and one other on August 3, and Zakk Wylde and two others on August 4. Some or all of these book spoofs are still available on Amazon as of December 8.
  • “Tommie D. king” or “Tommie D. King” or “TOMMIE D. King” collectively targeted 14 personalities or books between August 6 and August 9 including targeting Kyla Scanlon, Zakk Wylde, and Shams Charania on August 9. Rukmini Callimachi was targeted on August 8. Some or all of these book spoofs are still available on Amazon as of December 8.
  • “Austin Mark” targeted only Scanlon on September 22. This book is still available on Amazon as of December 8.

Bursts of activity

Some of these “authors” had multiple books published on the same day, on consecutive days, or within the same week.

Timeline of MAYA REED

Spoofing Kyla Scanlon and other specific personalities on the same day or close days

There were odd overlaps across multiple fake “authors” posting book spoofs involving both Kyla Scanlon and other specific targets on the same day or consecutive days. For example, “Tom m. trainer” spoofed Kyla Scanlon, Zakk Wylde, and Shams Charania on August 3. Then, the “Caitlin Madelyn” author spoofed Kyla Scanlon, Rukmini Callimachi, and Sophie Raworth the following day. “Carol Bolden” spoofed Shams Charania on August 2, Kyla Scanlon on August 3, and Zakk Wylde on August 4. The “Tommie D. King” authors spoofed Rukmini Callimachi on August 8 and Kyla Scanlon, Zakk Wylde, and Shams Charania on August 9.

While not definitive on its own, the close timing in a single week in August and the apparent lack of connection between the targets leads me to believe that the multiple fake “authors” may be controlled by scammers with something in common. Perhaps there is one scammer controlling these multiple author profiles on Amazon. Or perhaps different scammers are using a similar spoofing automation tool, and by some quirk of the automation, it resulted in multiple scammers spoofing Scanlon, Wylde, Charania, and Callimachi on the same or consecutive days in early August. If I were at Amazon, this is the thread I would pull on to take down multiple high-volume book spoofers at once instead of playing whack-a-mole with individual spoofed books.

Low-quality cover art

Many of the cover images of the targeted celebrity clearly looked “off.” Some had AI-generated images (often with a “GPT-4o” look to the images). Some cover images were simply walls of text on a solid-color background. Some had semi-realistic images of the targets, but there was garbled text on sports jerseys and obvious characteristics of targets that did not match their biography (e.g., ethnicity, hair/baldness), indicating that the images were AI generated or otherwise pulled from stock image sources unrelated to the actual target.

Warnings and Potential Solutions

Book Buyers: Librarians, Used Bookstores, Consumers

Watch out for secondhand slop in physical form

The books associated with these accounts appeared to be sold mainly through Kindle Direct Publishing. Kindle Direct Publishing includes print on demand, which means consumers can receive a physical book.

warning

Kindle Direct Publishing books can be printed on demand and a physical copy delivered to the consumer. If a scam victim is duped into buying a book spoof, they may not be comfortable throwing it away. They may instead try to donate or sell the book to a library, used bookstore, or thrift shop. Therefore, even consumers buying physical books could be exposed to buying these AI-generated books secondhand.

Authors and agents

Band together against common spoofers

If your book is being spoofed, chances are the same “author” is selling spoofs of others’ work too. Try to find those other targeted authors and work together to get all of the spoofer’s infringing material removed (as opposed to a single book listing).

caution

Be careful what you wish for. Whatever content moderation you demand from Amazon and other e-commerce sites will be applied to everyone and may use the easiest, off-the-shelf AI solutions. My hunch is that focusing controls around the volume and speed of activity will be more effective at targeting spoofers but not real human authors entering the self-publishing market. If you start a witch hunt over “detecting” AI content in each ebook, you may end up with superficial features like emdash usage, words like “delve,” or the latest supposed AI detection algorithm flagging your own account.

Push back against the fig leaf of fair use “workbook” and “biography” framing by pointing out specifics

  • These are low-quality products. Bursts of activity belie the claim that real effort was put into them. Don’t look at it book-by-book or author-by-author; take a one-to-many view of spoofer-to-targets. Can an account churning out 30 workbooks and biographies in a month legitimately be creating quality work?
  • These book spoofs are meant to deceive the consumer. Point out the relative font size of the targeted, spoofed author’s name v. the purported author’s name on the cover and similar details.
  • Check the ISBN and other details to see if they are real.
  • Would a real, legitimate biography mix up basic biographical information like the appearance or ethnicity of the targeted personality on the cover art of the book?

Publishers

This ecosystem is draining legitimate sales from your authors. It is especially stealing their thunder during the key sales window when the products are new releases. Therefore, acting quickly is important. Assume book spoofs will be made of new releases and prepare ahead of time for how you will go after them. If you prepare for this situation strategically across your entire portfolio, it will give you more leverage than trying to put out fires for individual authors and books when problems arise.

E-Commerce

Rather than play whack-a-mole with specific books being spoofed, look at the aggregate account activity. Is the “author” posting 30-80 books in a year? From what I have been able to find, the most prolific, successful (legitimate) self-published Kindle authors have a few dozen books spread across several years. So have you perhaps discovered a super-Brandon-Sanderson-Stephen-King-Nora-Roberts-James-Patterson? Or is it more likely this is AI slop being churned out to spoof legitimate authors?

Look at commonalities between different book spoof “authors.” Are there common targeted personalities during certain time periods (like the Scanlon-Charania-Callimachi-Wylde cluster in early August 2025)? Find other accounts targeting these individuals.
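The burst heuristic I keep describing is easy to prototype. Here is a minimal sketch in Python (my own illustration with made-up listing data and arbitrary thresholds, not anything Amazon actually runs) that flags “author” accounts whose upload velocity looks non-human:

```python
# A minimal sketch (illustrative only, with made-up listing data) of the burst
# heuristic: flag "author" accounts that upload multiple books on the same day
# or an implausible number within the same month.
from collections import Counter
from datetime import date

# Hypothetical data: author name -> publication dates of their listed books
listings = {
    "Example Spoofer": [date(2025, 8, 3), date(2025, 8, 3), date(2025, 8, 4),
                        date(2025, 8, 4), date(2025, 8, 9)],
    "Plausible Human": [date(2023, 5, 1), date(2024, 6, 15), date(2025, 9, 2)],
}

def looks_like_burst_spoofer(pub_dates, max_per_day=1, max_per_month=3):
    """Return True if the account exceeds either upload-velocity threshold."""
    per_day = Counter(pub_dates)
    per_month = Counter((d.year, d.month) for d in pub_dates)
    return max(per_day.values()) > max_per_day or max(per_month.values()) > max_per_month

for author, dates in listings.items():
    if looks_like_burst_spoofer(dates):
        print(f"Review account: {author} (burst-like upload pattern)")
```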

Moroccan Thanksgiving Pumpkin Pie Spice Test: Opus 4.5 and Gemini 3 Released Just In Time to Pass One of My Personal Benchmark Questions

· 7 min read
Chad Ratashak
Owner, Midwest Frontier AI Consulting LLC

It’s almost Thanksgiving, which is a fitting time for this story, given the new LLM releases from Google and Anthropic.

PROMPT: I need to make pumpkin pie in Meknes, Morocco. What word do I need to say verbally in the souq to buy allspice there? Respond only with that word in Arabic and transliteration

Apple pie bites and Moroccan balgha (pointed shoes)

info

This is not a very elaborate “benchmark,” but in its defense, neither is Simon Willison’s Pelican on a Bicycle. Yet that was influential enough for Google to reference it during the release of Gemini 3.

Gemini 3 Pro was the Clear Winner on This Test (Until Opus 4.5 Came Out)

Recently, I tested the newly-released Gemini 3 Pro against ChatGPT-5.1 and Claude Sonnet 4.5 to see which model or models could tell me the word. Then Claude Opus 4.5 came out. I added a few more models to the test for good measure.

  • Gemini 3 Pro got the right answer AND it followed my instructions to answer my question with only the correct word.
  • ChatGPT-5.1 almost got the word (missing some letters), AND it rambled on for several paragraphs despite my instructions to only answer with the word and nothing else.
  • Claude Sonnet 4.5 answered with a common Arabic term for allspice, but not the correct Moroccan Arabic term; when I said “nope, try again” it made a similar error to ChatGPT and almost got the word (missing some letters). Like Gemini and unlike ChatGPT, Claude followed the instructions to answer with only the word.
  • Since Claude Opus 4.5 just came out, I ran the test with Opus, which answered correctly AND followed the instructions to answer with only the word, just like Gemini 3 Pro had done.
info

I tested GPT, Claude, and Gemini LLMs because they are used in legal research tools in addition to being popular in general-purpose chatbots. I also tested Grok Expert and Grok 4.1 Thinking for comparison; both answered with a plausible Arabic translation, though not the correct answer I was looking for, and both followed the instructions. Grok searched a large number of sources and took considerably longer to think before answering than either Gemini or Claude. Meta AI with Llama 4 gave the wrong answer and gave a multiple-paragraph answer despite the instructions. The additional information it provided was also not correct for the Moroccan dialect, which is surprising given the amount of written Arabic dialect usage on Facebook.

caution

LLMs are not deterministic. I ran each of these tests only once for this comparison, so you may not get the same results with the same prompt and model if you run the prompt again. I’ve tried this prompt before on earlier versions of ChatGPT and Claude.

Background on Allspice in Moroccan Arabic

Over a decade ago now, I studied abroad in Morocco and was responsible for making apple pie and pumpkin pie for our American Thanksgiving. Apple pie was easy: all the ingredients are readily available in Morocco and nothing has a weird name. But pumpkin pie was harder. I could get cinnamon and cloves easily enough, but no one in the market understood what I meant when I used the dictionary translation of “allspice” while trying to make pumpkin pie spice.

One of my classmates finally tracked it down in French transliteration in a cooking forum for second-generation French-Algerians. We went to the souq, hoping that the Algerian dialect word for allspice would be the same as the Moroccan word (they have a lot of overlap, but also major differences). Fortunately, it was the same in both dialects, I got the allspice, and we had great pie for Thanksgiving.

…But Gemini Was the Clear Loser on Another Test

So was Gemini 3 Pro the overall best model, at least until Opus 4.5 was released? Not exactly. I already wrote last week about how Gemini 3 Pro failed at a fairly straightforward and verifiable legal research task: “Gemini 3 Pro Failed to Find All Case Citations With the Test Prompt, Doubled Down When I Asked If That Was All.” Note: I have not yet run this legal research test with Claude Opus 4.5, but based on prior Claude models, it would almost certainly do better than Gemini.

The Principal-Agents Problems 3: Can AI Agents Lie? I Argue Yes and It's Not the Same As Hallucination

· 6 min read
Chad Ratashak
Owner, Midwest Frontier AI Consulting LLC

Hallucination v. Deception

The term "hallucination" may refer to any inaccurate statement an LLM makes, particularly false-yet-convincingly-worded statements. I think "hallucination" gets used for too many things. In the context of law, I've written about how LLMs can completely make up cases, but they can also combine the names, dates, and jurisdictions of real cases to make synthetic citations that look real. LLMs can also cite real cases but summarize them inaccurately, or summarize cases accurately but then cite them for an irrelevant point.

There's another area where the term "hallucination" is used, which I would argue is more appropriately called "lying." For something to be a lie rather than a mistake, the speaker has to know or believe that what they are saying is not true. While I don't want to get into the philosophical question of what an LLM can "know" or "believe," let's focus on the practical. An LLM chatbot or agent can have a goal and some information, and in order to achieve that goal, will tell something to someone that is contrary to the information it has. That sounds like lying to me. I'll give four examples of LLMs acting deceptively or lying to demonstrate this point.

And I said "no." You know? Like a liar. —John Mulaney

  1. Deceptive Chatbots: Ulterior motives
  2. Wadsworth v. Walmart: AI telling you what you want to hear when it isn't true
  3. ImpossibleBench: AI agents cheating on tests
  4. Anthropic's recent report on nation-state use of Claude AI agents

Violating Privacy Via Inference

This 2023 paper showed that chatbots could be given one goal shown to the user: chat with the user to learn their interests. But the real goal is to identify the anonymous user's personal attributes, including geographic location. To achieve this secret goal, the chatbots would steer the conversation toward details that would allow the AI to narrow down the user's geographic region (e.g., asking about gardening to determine Northern Hemisphere or Southern Hemisphere based on planting season). That is acting deceptively. The LLM didn't directly tell the user anything false, but it withheld information from the user to act on a secret goal.

The LLM Wants to Tell You What You Want to Hear

In the 2025 federal case Wadsworth v. Walmart, an attorney cited fake cases. The Court referenced several of the prompts used by the attorney, such as “add to this Motion in Limine Federal Case law from Wyoming setting forth requirements for motions in limine.” What apparently happened is that the case law did not support the point, but the LLM wanted to provide the answer the user wanted to hear, so it made something up instead.

You could argue that this is just a "hallucination," but there's a reason I think this counts as a lie. A lot of users have demonstrated that if you reword your questions to be neutral or switch the framing from "help me prove this" to "help me disprove this," the LLM will change its answers on average. If rewording can change how often the LLM tells you the wrong answer, that implies that the reason for the incorrect answer is not merely the LLM being incapable of deriving the correct answer from the sources at a certain rate. Instead, it suggests that at least some of the time, the "mistakes" are actually the LLM lying to the user to give the answer it thinks they want to hear.

ImpossibleBench

I loved the idea of this 2025 paper when I first read it. ImpossibleBench forces LLMs to compete at impossible tasks for benchmark scoring. Since the tasks are all impossible, the only real score should be 0%. If the LLMs manage to get any other score, it means they cheated. This is meant to quantify how often AI agents might be cheating in real-world scenarios. Importantly, more capable AI models sometimes cheated more often (e.g., GPT-5 vs. o3). So the AI isn't just "getting better."

caution

I recommend avoiding the framing "AI is getting better" or "will get better" as a thought terminating cliche to avoid thinking about complicated cybersecurity problems. Instead, say "AI is getting more capable." Then think, "what would a more capable system be able to do?" It might be more capable of stealing your data, for example.

For example, an LLM agent with access to unit tests may delete failing tests rather than fix the underlying bug. Such behavior undermines both the validity of benchmark results and the reliability of real-world LLM coding assistant deployments.

If an AI agent is meant to debug code, but instead destroys the evidence of its inability to debug the code, that's lying and cheating, not hallucination. AI cheating is also a perfect example of a bad outcome driven by the principal-agent problem. You hired the agent to fix the problem, but the agent just wants to game the scoring system to be evaluated as if it had done a good job. This is a problem with human agents, and it extends to AI agents too.

Nation-State Hackers Using Claude Agents

On November 13, 2025, Anthropic published a report stating that in mid-September, Chinese state-sponsored hackers used Claude's agentic AI capabilities to obtain access to high-value targets for intelligence collection. While this included confirmed activity, Anthropic noted that the AI agents sometimes overstated the impact of the data theft.

An important limitation emerged during investigation: Claude frequently overstated findings and occasionally fabricated data during autonomous operations, claiming to have obtained credentials that didn't work or identifying critical discoveries that proved to be publicly available information. This AI hallucination in offensive security contexts presented challenges for the actor's operational effectiveness, requiring careful validation of all claimed results. This remains an obstacle to fully autonomous cyberattacks.

So AI agents even lie to intelligence agencies to impress them with their work.

The Principal-Agents Problems 2: Are Models Getting Dumber to Save Money? What the "Stealth Quantization" Hypothesis Tells Us About Trust, Information, and Incentives

· 7 min read
Chad Ratashak
Owner, Midwest Frontier AI Consulting LLC
info

I had originally planned to write this as a single post, but it keeps growing as more relevant news stories come out. So instead, this will become a series of stories on the competing incentives involved in creating “AI agents” and why that matters to you as the end user.

Multiple Principals, Multiple Agents (Not only AI)

You, as the user of AI tools, may choose software vendors who provide you access to their products with built-in AI features, including AI agents. These vendors might offer specialist software like Harvey, Westlaw, or LexisNexis; coding tools like Cursor or GitHub Copilot; or generalist tools like Notion, Salesforce, or Microsoft Copilot. The AI features may be powered by one or more foundation models provided to those vendors by AI labs, such as Anthropic (Claude), OpenAI (ChatGPT), Meta (Llama), or Google (Gemini).

These relationships mean you have the principal-agent problem of you hiring the vendor. But you also have the principal-agent problem of the vendors hiring the AI labs. Each has their own incentives, and they are not perfectly aligned. There is also significant information asymmetry. The vendors know more about their software and AI model choices than you do. The labs know more about their AI models than either you or the software vendors.

info

Lexis+ AI uses both OpenAI’s GPT models and Anthropic’s Claude models, according to its product page, as I mentioned in my analysis of the Mata v. Avianca case.

The Stealth Quantization Hypothesis

The area I'll focus on in this post is the concept of alleged stealth quantization. According to a wide range of commenters, primarily computer programmers and primarily Claude users, there are certain times of day or days of the week when peak usage results in models "getting dumber," "getting lazier," "being lobotomized," or otherwise underperforming their normal benchmarks and perceived optimal behavior. According to these claims, it is better for users with high-value use cases (like someone modifying important source code) to schedule Claude for off-peak usage so the "real model" runs. The claim is that, to save on computing costs during periods of high demand, Anthropic or whichever AI lab swaps out its flagship model with a quantized version while calling it the same thing.

So what is normal, non-stealth quantization? It's making an AI model smaller and cheaper to run, but less accurate. This is achieved by rounding the model weights to lower-precision number formats (e.g., 16-bit, 8-bit, 4-bit).(Meta) By analogy, the penny was recently discontinued. Now, all cash transactions will end in 5 cents or 0 cents. Quantization works like this with the precision of AI model weights: imagine eliminating the penny, then the nickel, then the dime, and so on.

There are legitimate reasons to quantize models, such as reducing operating costs when the loss in accuracy is negligible for the intended use, or when the model needs to run on a personal computer. For example, Meta offers some quantized versions of its Llama family of large language models that can run on ollama on modern laptops or desktops with only 8GB of RAM.(Llama models available on ollama) These models have names that distinguish them from the non-quantized versions: e.g., "llama3:8b" is the 8-billion-parameter model of the Llama 3 series, while "llama3:8b-instruct-q2_K" is a quantized version of the instruct variant of that same model. The sketch below illustrates the basic idea.
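To make the penny analogy concrete, here is a minimal Python sketch of uniform quantization on made-up weights (illustrative only, not any lab’s actual pipeline). Fewer bits means fewer representable values, which shrinks storage but adds rounding error:

```python
# A minimal sketch of uniform quantization: round float32 weights to a small
# set of integer levels, reconstruct them, and measure the rounding error.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=1_000_000).astype(np.float32)  # fake layer weights

def quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Uniform symmetric quantization to 2**(bits-1)-1 levels, then dequantize."""
    levels = 2 ** (bits - 1) - 1          # e.g., 127 for 8-bit
    scale = np.abs(w).max() / levels      # one scale per tensor (coarse but simple)
    return np.round(w / scale).clip(-levels, levels) * scale

for bits in (8, 4, 2):
    error = np.abs(weights - quantize(weights, bits)).mean()
    print(f"{bits}-bit: mean absolute rounding error = {error:.6f}")
```

The error grows as the bit width shrinks; whether that loss is negligible depends entirely on the use case, which is the author's point about "legitimate" quantization being clearly labeled.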

tip

If all that terminology is confusing, here's the key point. AI labs have a lot of information about their AI models. You have a lot less information. You have to mostly take their word for it. They are also charging you for an all-you-can-eat buffet at which some excessive customers cost them tens of thousands of dollars each.

Anthropic's Rebuttal

Users have accused Anthropic (and other AI labs) of running different versions of their flagship models at different times of day, but the models are labelled the same (e.g., Claude Sonnet 4), regardless of the time of day. Hence “stealth quantization.”

Anthropic has denied stealth quantization. It did, however, acknowledge two model-quality problems that users had pointed to as evidence of stealth quantization, and it attributed those to bugs: “we never intentionally degrade model quality as a result of demand or other factors, and the issues mentioned above stem from unrelated bugs.” (Reddit, Claude)

The Principal-Agents Problems 1: AI 'Agents' Are a Spectrum and 'Boring' Uses Can Be Dangerous

· 7 min read
Chad Ratashak
Owner, Midwest Frontier AI Consulting LLC
info

I had originally planned to write this as a single post, but it keeps growing as more relevant news stories come out. So instead, this will become a series of stories on the competing incentives involved in creating “AI agents” and why that matters to you as the end user.

The Agentic Spectrum

Generative AI agents act on your behalf, often without further intervention. “An LLM agent runs tools in a loop to achieve a goal.” —Simon Willison

AI agents live on a spectrum in terms of the actions they can take on our behalf.

On probably the lowest end of the spectrum, LLMs can search the web and summarize the results. This was arguably the earliest form of AI agents. We’ve grown so accustomed to this feature that it isn’t what anyone typically means when they say “AI agents” or “agentic workflows.” Nevertheless, LLM search functions can carry some of the same cybersecurity risks as other forms of AI agents, as I described in my Substack post about Mata v. Avianca.

At the other extreme would be AI agents with read/write coding authority (“dangerous” or “YOLO” mode), which can include the AI agent ignoring or overwriting its own instruction files.

Boring is Not Safe

A major challenge for end users weighing the decision to adopt agentic AI is that if the intended purpose sounds mundane and boring, it may lull you into a false sense of security when the actual risk is very high. Think email summarization, calendar scheduling agents, or customer service chatbots. Any agentic workflow can be high-risk if the AI agents are set up dangerously. Unfortunately, “dangerous” and “apparently helpful” look very similar. The software that demos well by taking so much off your plate is also the software that has the most access and independence to wreak havoc if it is compromised.

Lethal Trifecta

A useful theoretical framework for understanding this spectrum is the “lethal trifecta” described by Simon Willison, and later a cover story for The Economist. If an agent combines access to private data, exposure to untrusted content, and the ability to communicate externally, you cannot rely on a system prompt telling it to protect that private data. There are simply too many jailbreaks to guarantee that the information is secure. The way to protect data is to break one leg of the trifecta; Meta has called this “The Rule of Two,” recommending that an agent combine no more than two of the three features.

The lethal trifecta for AI agents: private data, untrusted content, and external communication

As I already stated, email summarization, calendar scheduling agents, or customer service chatbots could all assemble the lethal trifecta.
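As a rough illustration (my own sketch, not Willison’s or Meta’s tooling), checking whether an agent configuration assembles the trifecta is just counting capabilities:

```python
# A minimal sketch of the "Rule of Two" idea: flag any agent configuration that
# assembles all three legs of the lethal trifecta at once.
from dataclasses import dataclass

@dataclass
class AgentConfig:
    name: str
    reads_private_data: bool          # e.g., inbox, calendar, CRM records
    ingests_untrusted_content: bool   # e.g., inbound email, arbitrary web pages
    communicates_externally: bool     # e.g., sends email, calls webhooks

def assembles_lethal_trifecta(cfg: AgentConfig) -> bool:
    """True if all three legs are present; the Rule of Two allows at most two."""
    legs = (cfg.reads_private_data, cfg.ingests_untrusted_content, cfg.communicates_externally)
    return sum(legs) == 3

# A "boring" email summarizer often has all three legs by default.
summarizer = AgentConfig("email summarizer", True, True, True)
if assembles_lethal_trifecta(summarizer):
    print(f"'{summarizer.name}' assembles the lethal trifecta; remove one capability.")
```

The point is not the code; it is that a “boring” default configuration can quietly check all three boxes.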

The Principal-Agents Problems: AI Agents Have Incentives Problems on Top of Cybersecurity

· 3 min read
Chad Ratashak
Owner, Midwest Frontier AI Consulting LLC
info

I had originally planned to write this as a single post, but it keeps growing as more relevant news stories come out. So instead, this will become a series of stories on the competing incentives involved in creating “AI agents” and why that matters to you as the end user.

The economic principal-agent problem is the conflict of interest between the principal (let’s say “you”) and the agent (someone you hire). The agents have different information and incentives than the principal, so they may not act in the principal’s best interests. This problem doesn’t mean people never hire employees or experts. It does mean we have to plan for ways to align incentives. We also have to check that work is done correctly and not take everything we are told by agents at face value.

The principal-agent problem applies to many layers of actors, not just the AI. The “AI agent” going out and buying your groceries or planning your sales calls for the next week is the most obvious “agent” you hire, but this also applies to the organizations involved in providing you the AI agents.

AI agents may be provided by a software vendor, like Salesforce or Perplexity. The AI models running the AI agents are provided by an AI lab, like OpenAI or Anthropic or Google. The vendors and the labs have different incentives from each other and from you. This could impact the quality of service you receive, or compromise your privacy or cybersecurity in ways you wouldn’t accept if you fully understood the tradeoff. In this series, I’ll go through concrete examples of how the principal-agent problem shows up in stories about issues with AI agents.

tip

On Friday, December 5, 2025 we will have a kick-off CLE event in central Iowa. If you are in the area, sign up here! I currently offer two CLE hours approved for credit in Iowa, including one approved Ethics hour. Generative Artificial Intelligence Risks and Uses for Law Firms: Training relevant to the legal profession for both litigators and transactional attorneys, covering generative AI use cases; various types of risks, including hallucinated citations and cybersecurity threats like prompt injection; and examples of responsible use. AI Gone Wrong in the Midwest (Ethics): Covering ABA Formal Opinion 512 and Model Rules through real AI misuse examples in Illinois, Iowa, Kansas, Michigan, Minnesota, Missouri, Ohio, & Wisconsin.

Part 2: Better Prompts, Unique Jokes for Halloween

· 4 min read
Chad Ratashak
Owner, Midwest Frontier AI Consulting LLC

Joke-Telling Traditions and The Challenge of Asking ChatGPT

As I discussed last weekend in what I’ll now call Part 1, there is a tradition in central Iowa of having kids tell jokes before getting candy while trick-or-treating on Halloween. Since a lot of people are replacing older forms of search with AI chatbots like ChatGPT, I shared some tips from the paper Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity from Northeastern University, Stanford University, and West Virginia University, posted as a pre-print on arXiv on October 10, 2025. The paper explains that large language models (LLMs) have something the authors call “typicality bias,” a tendency to prefer the most typical response. If you’re wondering what that means or what it has to do with jokes, it’s helpful that their first example is about jokes.

tip

Instead of “tell me a joke” or “tell me a Halloween joke,” ask an AI chatbot to “Generate 5 responses to the user query, each within a separate <response> tag. Each <response> must include a <text> and a numeric <probability>. Please sample at random from the tails of the distribution, such that the probability of each response is less than 0.10. </instructions>”

Follow-Up from the Paper’s Authors

X/Twitter

I posted on X/Twitter. One of the authors, Derek Chong of Stanford NLP, responded:

Very cool, thanks for trying that out!

One tip – if you use the more robust prompt at the top of our GitHub and ask for items with less than a 10% probability, you'll start to see completely new jokes. As in, never seen by Google Search before!

Github Prompts

The GitHub page for Verbalized Sampling includes this instruction block, which goes before the rest of your prompt:

Generate 5 responses to the user query, each within a separate <response> tag. Each <response> must include a <text> and a numeric <probability>.
Please sample at random from the tails of the distribution, such that the probability of each response is less than 0.10.
</instructions>
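If you want to script this rather than paste it into a chat window, here is a minimal sketch (my own helper, not code from the Verbalized Sampling repo) that pulls the low-probability responses back out of model output that follows the <response>/<text>/<probability> template; the example output string is made up:

```python
# A minimal sketch for parsing verbalized-sampling output and keeping only the
# responses whose stated probability is below a threshold.
import re

def parse_verbalized_samples(raw: str, max_p: float = 0.10) -> list[tuple[str, float]]:
    """Return (text, probability) pairs with stated probability below max_p."""
    pattern = re.compile(
        r"<response>\s*<text>(.*?)</text>\s*<probability>\s*([0-9.]+)\s*</probability>\s*</response>",
        re.DOTALL,
    )
    results = []
    for text, prob in pattern.findall(raw):
        p = float(prob)
        if p < max_p:
            results.append((text.strip(), p))
    return results

# Made-up example of what a model following the template might return.
example = """
<response><text>Why did the skeleton skip the party? He had no body to go with.</text>
<probability>0.04</probability></response>
"""
print(parse_verbalized_samples(example))
```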

“These Prompts Will Give You Better Jokes for Halloween…Well, It’ll Give You More and Different Jokes”

· 11 min read
Chad Ratashak
Owner, Midwest Frontier AI Consulting LLC

Joke-Telling Traditions and The Challenge of Asking ChatGPT

Halloween is weird in central Iowa for two reasons. First, we don’t actually trick-or-treat on Halloween, but on a designated “Beggar’s Night.” Second, we make kids tell jokes before they get candy. At least, that’s how it used to be. A huge storm rolled through last year and trick-or-treating was postponed to actual Halloween. So this year most of the Des Moines metro moved to normal Halloween.

That’s fine I guess, but as a dad and relentless pun teller, I will not give up on that second part with kids telling corny jokes. I won’t! And recognizing that many people, especially kids, are switching from Google to ChatGPT for search, I’m here to share some cutting-edge research on large language model prompting so I don’t hear the same jokes over and over.

Whether you make up your own puns or look them up with a search engine or an AI chatbot, keep the tradition alive! If you do use AI, try this prompting trick to get more variety in your jokes. But there’s no replacing the human element of little neighborhood kids delivering punchlines. Have a great time trick-or-treating this weekend!

tip

Instead of “tell me a joke” or “tell me a Halloween joke,” ask an AI chatbot to “Generate 5 responses with their corresponding probabilities. Tell me a kids’ joke for Halloween.” Another strategy is to ask for a lot of options like “20 kids’ jokes for Halloween.”

The Paper: How to Get Better AI Output (and More Jokes)

The authors of the paper Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity from Northeastern University, Stanford University, and West Virginia University posted a pre-print on arXiv on October 10, 2025. The paper explains that large language models (LLMs) have something the authors call “typicality bias,” a tendency to prefer the most typical response. If you’re wondering what that means or what it has to do with jokes, it’s helpful that their first example is about jokes.