Terence Eden. He has a beard and is smiling.

Why do people have such dramatically different experiences using AI?

· 38 comments · 750 words · Viewed ~2,654 times


For some people, it seems, AI is an amazing machine which - while fallible - represents an incredible leap forward in productivity.

For other people, it seems, AI is wrong more often than right and - although occasionally useful - requires constant supervision.

Who is right?

I recently pointed out a few common problems with LLMs. I was discussing this with someone relatively senior who works on Google's Gemini. I explained that every time I get a Google AI overview it is wrong. Sometimes obviously wrong, sometimes subtly wrong. I asked if that was really the experience of AI Google wanted to promote? My friend replied (lightly edited for clarity):

I find AI Overview to be helpful for my searches and my work. I use it all the time to look up technical terms and hardware specs.

I, somewhat impolitely, called bullshit and sent a couple of screenshots of recent cases where Google was just laughably wrong. He replied:

Interesting. We are seeing the opposite.

Why is that?

I'll happily concede that LLMs are reasonable at outputting stuff which looks plausible and - in many cases - that's all that's necessary. If I can't remember which command line switch to use, AI is easier than crappy documentation. Similarly, if I don't know how to program a specific function, most AIs are surprisingly decent at providing me with something which mostly works.

But the more I know about something, the less competent the AI seems to be.

Let me give you a good example.

At my friend's prompting, I asked Gemini to OCR an old newspaper clipping. It is a decent resolution scan of English text printed in columns. The sort of thing a million AI projects have been trained on. Here's a sample:

Scan of some text.

So what did Gemini make of it when asked to extract the text from it?

Children at Witham's Chip-
ping Hill Infants School are en-
gaged in trying out all sorts of
imaginations ready for October
31... "And god knows what
strange spirits will be abroad."

That reads pretty well. It is utterly wrong, but it is convincing. This isn't a one-off either. Later in the clipping was this:

Scan of some text.

I'm sure a child of 6 could read that aloud without making any mistakes. Is Gemini as smart as a 6-year-old?

All the children say halloween
is fun. So it is for 6-year-old
Joanne Kirby admits she will be
staying up to watch on October
31, just in case. She has made a
paper "witch," to "tell stories
about witches," she said.

Again, superficially right, but not accurate in the slightest.

There were half a dozen mistakes in a 300-word article. That, frankly, is shit. I could have copy-typed it and made fewer mistakes. I probably spent more time correcting the output than I saved by using AI.

Boring old Tesseract - a mainstay of OCR - did far better. Yes, it might occasionally mistake a speck of dust for a comma or confuse two similar characters - but it has never invented new sentences!

Like a fool, I asked Gemini what was going on:

Me: That's a really bad job. You've invented lots of words which aren't there. Try again.

Gemini: I understand you weren't satisfied with the previous transcription. Unfortunately, I can't directly perform OCR on images. However, there are many apps available that can do this. You can search online for 'OCR apps' to find one that suits your needs.

Here's a link to the conversation if you don't believe me.

This isn't just a problem with Gemini - ChatGPT also invented brand-new sentences when scanning the text.

All the children say Halloween is fun, rather than frightening. Six-year-old Joanne Kirby admits she will be “a scary little witch” on the night, but she does like ghost stories.

So what's going on?

A question one has to ask of any source, including LLMs but also newspapers, influencers, podcasts, books, etc., is "how would I know if they were wrong?" This is not a prompt to doubt everything – down that path is denialism – but about reflecting on how much you rely on even "trusted" sources.

Adrian Hon (@adrianhon.bsky.social) 2025-06-17T15:39:06.772Z

With OCR, it is simple. I can read the ground-truth and see how it compares to the generated output. I don't have to trust; I can verify.
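
That comparison is easy to automate. As a minimal sketch, assuming Python and its standard-library `difflib`, the snippet below lines up a reference transcription against a model's output word by word and lists every span that differs. The helper names `normalise` and `word_diff` are made up for illustration. The `reference` string is a commenter's transcription of the first sample, which I'm assuming matches the print; the `candidate` is the Gemini output quoted above.

```python
import difflib


def normalise(text: str) -> list[str]:
    """Lower-case, split on whitespace, and strip punctuation from word edges."""
    words = (w.strip(".,;:!?\"'()-").lower() for w in text.split())
    return [w for w in words if w]


def word_diff(reference: str, candidate: str) -> tuple[float, list[str]]:
    """Return a 0..1 similarity score and a list of word-level differences."""
    ref, cand = normalise(reference), normalise(candidate)
    matcher = difflib.SequenceMatcher(a=ref, b=cand, autojunk=False)
    changes = [
        f"{tag}: {' '.join(ref[i1:i2])!r} -> {' '.join(cand[j1:j2])!r}"
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), changes


if __name__ == "__main__":
    # Assumed ground truth: a commenter's transcription, not checked against
    # the scan. Words hyphenated across line breaks have been joined.
    reference = (
        "Children at Witham's Chipping Hill Infants School are enjoying "
        "stretching their imaginations ready for October 31 - when who knows "
        "what strange spirits will be abroad?"
    )
    # Gemini's output as quoted in this post, with line-break hyphens joined.
    candidate = (
        "Children at Witham's Chipping Hill Infants School are engaged in "
        "trying out all sorts of imaginations ready for October 31... "
        "\"And god knows what strange spirits will be abroad.\""
    )
    score, changes = word_diff(reference, candidate)
    print(f"similarity: {score:.2f}")
    for change in changes:
        print(change)
```

A score of 1.0 with an empty list means the two texts agree word for word; anything else is a prompt to look at the scan again. The catch is that this only works when a trustworthy reference exists in the first place.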

I suppose I mostly use AI for things with which I have a passing familiarity. I can quickly see when it is wrong. I've never used it for, say, tax advice or instructions to dismantle a nuclear bomb. I'd have zero idea if the information it spat back was in any way accurate.

Is that the difference? If you don't understand what you're asking for then you can't judge whether you're being mugged off.

Or is there something more fundamentally different between users which results in this disparity of experience?

A t-shirt which says Dunning and Kruger and Gell and Mann.


38 thoughts on “Why do people have such dramatically different experiences using AI?”

  1. @Edent Nice read. More anecdotal evidence:

    Recently, Google AI somehow managed to get past my usual blockers, so I got to see what it types. I was looking for "bing chilling", a memey phrase associated with John Cena, which means "ice cream" in Chinese.

    The AI told me "Bing Chilling" is a game on steam blablabla

    Which is technically true. There is a game that named itself after the meme, but that's not what people think of when they say "bing chilling".

    Exactly as you point out.

    Reply | Reply to original comment on mastodon.social

  2. @Edent There is this benchmark of LLM vs traditional approach to OCR https://getomni.ai/blog/ocr-benchmark

    The evaluation criteria are very transparent and seem sensible. Key quote:

    "Traditional models tend to outperform on high-density pages (textbooks, research papers) as well as common document formats like tax forms."

    I found it linked in the comments of an HN thread from March discussing the announcement of Mistral OCR.

    I would of course take with a pinch of salt the evaluation of their own model.


    Reply | Reply to original comment on tooting.ch

  3. @blog on the OCR front, I have personally seen Copilot take a photo of a handwritten court ledger from the late 1600s and do a very very good job of not only OCR but translation from Latin into English.

    Whether it would do it repeatedly or reliably I don't know.

    Likewise, in another direct test the latest OpenAI model got a technical multiple-choice answer correct when the model answer was wrong... and 3 experts got to the right place eventually. (RAG test)

    Plural of anecdote and all that.

    Reply | Reply to original comment on mstdn.social

  4. @Edent The inherent volatility of the models also helps mask the flaws in the output if you aren't careful. It means that some of the time when they do actually bother to check the output (or the output has to do with their expertise) it might be one of the times it's randomly actually correct.

    And because human brains are bad at estimating both probability and frequency, this probably leads people to dismiss the errors they do notice as one-offs.

    Reply | Reply to original comment on toot.cafe

  5. @Edent It's because there are *so many* models, with so many differing capabilities, and so many different ways to access them, with commonly no way to tell WHAT you're actually using. I toss it in via the API, ensuring that I'm using the latest (maybe??) and it's perfect. If this was my only experience with AI, I'd always be singing its praises.

    Most of the hate or love of the AI tools is like getting a bug report but the user refuses to tell you the version of software they're using, or even what OS!

    Reply | Reply to original comment on social.lol

  6. I have tried many different types of AI in various contexts, and most of the time they are wrong. When citing 'facts', the source URL is often 404, or the content has nothing to do with the answer from the 'AI'. When coding, it constantly makes changes that I haven't asked for, even though these are simple tasks that a junior developer would do without any problem — they just need time. When I tell the 'AI' about it, it just makes excuses, saying that it only works on statistical analysis and doesn't really know what it's doing.

    And I hate the "yes man" mentality.

    My main issue is that I always verify the answer, so I quickly realise that it's wrong.


  7. @Edent I find those OCR results surprising because I've been experimenting with OCR against vision LLMs for the past year and have mostly had much better results

    How large was the image you fed into it? And which exact model were you using?

    The worst results I've had were feeding in a super long image a year ago, it turned out the API I was using resized it down to where the text was illegible and it then hallucinated the answer entirely!

    Reply | Reply to original comment on fedi.simonwillison.net

  8. @Edent I think it is - I've been banging the drum for a while that the biggest misconception in all of AI is that this stuff is easy to use - I've been exploring it on a daily basis for nearly three years now and I still pick up new tricks all the time

    I only started trusting the top models to do workable jobs with OCR against complex documents (like your newspaper clipping) in the past ~4 months

    Reply | Reply to original comment on fedi.simonwillison.net

  9. @simon @Edent I highlighted the text from one of the screenshots using the native iPhone text from pics function (machine learning based, obv) and it pasted this into notes

    "Children at Witham's Chip- ping Hill Infants School are en- joying stretching their imaginations ready for October 31 - when who knows what
    strange spirits will be abroad?"

    And then, to be more directly LLM about it, I pasted it into Claude Sonnet 4 with a short prompt

    So yes, we are seeing different results 🤷

    Reply | Reply to original comment on fed.beatworm.co.uk

  10. @Edent I think a user’s ability to assess truth is definitely a part of why some claim success while others do not, but the type of prompt also definitely matters. For OCR or math problems, “correct” is unambiguous. The answers to questions like “list some pros and cons of leadership style X” are ambiguous. Consequently, I’ve used AI successfully as a tool to discover and inspire; I don’t expect accuracy or correctness.

    Reply | Reply to original comment on mastodon.social

  11. @Edent I think there are two cases when an LLM or other genAI may give satisfactory results: 1) you're using an LLM as autocomplete only (essentially an extension of the keyboard) and make sure to read and edit all its output, and 2) you absolutely don't care whether your work output can withstand any examination beyond the most superficial.

    Reply | Reply to original comment on social.lol

  12. One of the big differences is that traditional methods tend to be bounded or predictable in how wrong they are, while generative methods can go much further from the desired goal, and their mistakes are harder to detect.

    I've noticed this when doing speech synthesis (TTS). Comparing the traditional TTS engines on Google Cloud with the newer "generative" ones, the generative TTS produces more expressive speech (at least without diving into SSML), but sometimes it just goes off-script and says something completely different.


  13. @Edent As a case in point (proving your point that people have dramatically different experiences), I took your newspaper clipping and ran it through Claude, ChatGPT, and Gemini. All OCR’ed it perfectly on the first try.

    My experience has been mixed. I find that Claude and ChatGPT (using the latest models) almost always get things like this correct.

    And Gemini frequently hallucinates. I was actually surprised it got the task correct for me, and was surprised to see in your test that ChatGPT got it wrong. In news stories about wildly wrong hallucinations, Gemini is often the culprit.

    Reply | Reply to original comment on hachyderm.io

  14. It's a good answer to the question, and it definitely seems to rely on how much you already know about a subject, and also how critical an error might be. I tend to only ever use it as a secondary check on things like research or writing after I've already done the work and can spot the AI lies

    Reply | Reply to original comment on bsky.app

  15. The OCR example is not very well-chosen. LLMs are not trained on that (while trained on old newspapers, they'd be OCR'ed beforehand) and unless you are in camp "LLM = general AI," it's not reasonable to expect it can OCR.

    That said, I agree with your central points, and I think there's an even simpler explanation why some people find LLMs useful: they never understood their job. They always coasted along on "good enough," hoping not to get caught. Basically, an inverse impostor syndrome.

    For many problems, a wonky half-way solution is enough most of the time, and issues will only matter much later. I have had many colleagues, not just juniors, fundamentally not understanding programming, yet churning out code that kind-of handles the happy case while silently returning wrong values in edge cases. Many of them were very enthusiastic about AI.

    That's the main use-case of AI: identifying the 0.1x member of the team.

    Reply | Reply to original comment on westergaard.social

  16. Giving both of your referenced images to Gemini to write gave 100% accurate responses. However, I think your point is a great one. AI is not the tool for every job, not close to most jobs at this point. Where it helps though is to advise on tooling, or when given tools to use (Claude with py, etc.).

    Reply | Reply to original comment on bsky.app

  17. This article gives a good explanation of why different people have different impressions of using LLMs. Impressions are emotional. Facts get verified.

    Reply | Reply to original comment on feddit.bg

  18. @Edent

    Some uses don't require error-free results, so the users don't care. 90% accurate minutes of a meeting are usually sufficient (unless called in legal proceedings down the line), so most will be happy with that. AI editing of blemishes in a photo usually won't add noticeable artifacts like extra fingers, so again most will be happy.

    But other use cases require accuracy. Programming is an example where a small error breaks everything.

    Reply | Reply to original comment on masto.bike

  19. @tony @Edent Code seems to be a bigger weakness than people imagine. I've never had any of the LLMs manage to produce working code from the outset. Most of the time it just repeats the junk it produced earlier even after being corrected. Trying to use it to speed up your progress seems to take longer and longer. I've tried a couple that let you upload documentation for it to use as reference, but there's still no guarantee that what's coming out isn't junk.

    Tried to get ChatGPT to produce a simple Perl script for WWW::Subsonic - nothing that came out of it was even close to working.

    Reply | Reply to original comment on toot.net-pbx.com

  20. @Edent Just last week, a friend of mine told me they got advice from a lawyer, then asked ChatGPT the same question and are now doubting the lawyer, because the #LLM gave a different answer.

    I suggested asking questions in their own area of expertise, to which they definitely know the correct answer, to assess how well the LLM actually performs.

    I think this is something most people don't even consider doing. It's like one of those introductory examples to testing hypotheses.


    Reply | Reply to original comment on fosstodon.org

