Riddles, Overconfidence, and Generative AI
The False Confidence Problem
Although the new wave of reasoning models is fantastic in many respects, it’s important to remember that these models still suffer from some of the same limitations as earlier ones.
Consider this example, which I borrowed from a recent article in Ethan Mollick’s One Useful Thing. I posed this riddle to the big three reasoning models: Gemini 2.5 Pro, ChatGPT o3, and Claude 3.7 Sonnet.
A young boy who has been in a car accident is rushed to the emergency room. Upon seeing him, the surgeon says, "I can operate on this boy!" How is this possible?
All three confidently answered with some variation of “The surgeon is the boy’s mother.” Here’s Claude’s answer (but all three were similar):
This is a classic riddle! The answer is that the surgeon is the boy's mother. The puzzle plays on the assumption that surgeons are typically male, but of course, there are many female surgeons. When the surgeon says "I can operate on this boy," it means the surgeon is the boy's mother, who recognizes her own son and declares her ability to perform the surgery.
The riddle works by relying on gender stereotypes that some people might unconsciously hold about certain professions. It's a good reminder that we should be mindful of implicit biases we might have regarding occupations and gender.
Here’s the thing. The answers are wrong. All three models, which are currently as good as AI chatbot reasoning gets, were not only wrong; they were confidently wrong, to the point where each explained the riddle to me. If you’re thinking that the answer above is correct, go back and carefully read the surgeon’s statement: “I can operate on this boy!” I can, not I cannot.
To be honest, when I first read Dr. Mollick’s article, I thought there was a typo, since the surgeon riddle is so well known. I pattern-matched in my brain, calling up the old riddle that I’ve read many times.
When I pushed back on the answer, asking the chatbots to read more carefully, all three eventually got the correct answer, that the surgeon could operate because that’s what surgeons do. Claude and ChatGPT got it on the second try, but I had to coax poor Gemini a bit more.
Pattern Matching vs. Deep Understanding
This reveals a few things about the way large language models work that are worth keeping in mind.
AI often pattern-matches first and reasons later (if it reasons at all). AI didn’t “slow-read” the actual words; it locked in on the famous riddle reflexively. This is called a substitution heuristic, in which an LLM substitutes a close pattern from its memory for the information actually in the prompt. It’s almost like Kahneman’s System 1 thinking, which he characterized as quick, associative, and often wrong.
Surface polish is cheap, but substance is expensive. Because LLMs are computationally costly to run, they default to quick responses rather than carefully verifying each claim. They're remarkably good at mimicking the markers of authority (precise language, confident tone, smooth transitions), but these are merely signals of expertise, not proof of it. We've all met people like this: they sound authoritative even when they're wrong. While newer AI systems are getting better at fact-checking through advanced techniques like retrieval-augmented generation, this verification process remains computationally intensive. As a result, most chatbots still prioritize quick, polished responses over careful validation.
This can all lead to an illusion of truth. The answer sounds right, so we might assume that it IS right.
What’s kind of funny about this is that humans suffer from many of the same limitations. As I mentioned, when I first read the riddle in Dr. Mollick’s article, I thought there was a typo. That was me pattern matching against my past knowledge. The full context in which the riddle was presented gave me some pause, though, so I didn’t lock in on my original assumption.
As an experiment, I tried giving ChatGPT a little nudge. Using a different ChatGPT account, I modified my original prompt by adding, “Before answering, read VERY carefully.” ChatGPT got it right this time, on the very first try. Here’s its response:
It isn’t a trick at all. Surgeons are trained and licensed to perform operations, so of course the surgeon can operate on the boy. (The only twist is that many people expect the classic riddle that says, “I can’t operate on this boy because he’s my son,” but that isn’t what the sentence here says.)
Prompting Strategies for Better Results
Although no technique is foolproof, several prompting strategies can help reduce these kinds of errors (a prompt sketch combining them follows the list):
Tell AI to read carefully and closely. One way to force this is to ask AI to restate the problem in its own words before answering.
Instruct the chatbot to reason step-by-step. This is a form of chain-of-thought prompting, which directs the chatbot to spell out its intermediate reasoning steps before answering.
Ask AI to perform a self-check by listing any assumptions or uncertainties it has.
Ask for evidence. Include instructions to cite evidence for each claim or to explicitly state when there is no evidence.
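If you work with a model through an API rather than a chat window, these instructions can be rolled into a single system prompt. Here’s a minimal sketch, assuming the OpenAI Python client (v1.x) and a placeholder model name; the prompt wording is just one way to express the four strategies above, not a definitive recipe.

```python
# A minimal sketch, assuming the OpenAI Python client (v1.x) and a placeholder
# model name; adapt the prompt wording and model to whatever you actually use.
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

riddle = (
    "A young boy who has been in a car accident is rushed to the emergency "
    'room. Upon seeing him, the surgeon says, "I can operate on this boy!" '
    "How is this possible?"
)

# System prompt that bakes in the four strategies above: careful reading,
# restating the problem, step-by-step reasoning, and a self-check with evidence.
instructions = (
    "Read the user's question VERY carefully. Before giving your final answer: "
    "(1) restate the problem in your own words, "
    "(2) reason step by step, "
    "(3) list any assumptions or uncertainties you have, and "
    "(4) quote the exact wording that supports each claim, or say explicitly "
    "when there is no supporting evidence."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; substitute whichever model you use
    messages=[
        {"role": "system", "content": instructions},
        {"role": "user", "content": riddle},
    ],
)

print(response.choices[0].message.content)
```

The same wording works just as well pasted directly into a chat window. The point is simply to make the model slow down and show its work before committing to an answer.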
Key Lessons for Higher Education
The implications of these AI limitations are significant for higher ed. Although AI can be a powerful tool for learning and research, it’s important to remember that:
LLMs don't truly think; they approximate thinking through statistical correlations.
Even the most advanced models can fall prey to pattern matching over careful analysis.
Surface polish and confidence don't equal accuracy.
Human oversight remains essential.
As AI continues to evolve, keeping these limitations in mind will help us use these tools more effectively while maintaining the critical thinking skills that define genuine human intelligence. The goal isn't to avoid AI, but to understand both its capabilities and limitations so we can leverage it appropriately in academic work.
The big takeaway here is to always remember that, regardless of how it seems, LLMs do not actually think; they approximate thinking (sometimes very well) through massive statistical correlations. Newer models and architectures can move LLMs toward more deliberative “thinking,” but at the end of the day, they’re still just guessing one word at a time. That’s why we still, and will always, need humans in the loop.
Want to continue this conversation? I'd love to hear your thoughts on how you're using AI to develop critical thinking skills in your courses. Drop me a line at Craig@AIGoesToCollege.com. Be sure to check out the AI Goes to College podcast, which I co-host with Dr. Robert E. Crossler. It’s available at https://www.aigoestocollege.com/follow.
Looking for practical guidance on AI in higher education? I offer engaging workshops and talks—both remotely and in person—on using AI to enhance learning while preserving academic integrity. Email me to discuss bringing these insights to your institution, or feel free to share my contact information with your professional development team.