Why do people have such dramatically different experiences using AI?
For some people, it seems, AI is an amazing machine which - while fallible - represents an incredible leap forward in productivity.
For other people, it seems, AI is wrong more often than right and - although occasionally useful - requires constant supervision.
Who is right?
I recently pointed out a few common problems with LLMs. I was discussing this with someone relatively senior who works on Google's Gemini. I explained that every time I get a Google AI Overview it is wrong. Sometimes obviously wrong, sometimes subtly wrong. I asked if that was really the experience of AI Google wanted to promote. My friend replied (lightly edited for clarity):
I find AI Overview to be helpful for my searches and my work. I use it all the time to look up technical terms and hardware specs.
I, somewhat impolitely, called bullshit and sent a couple of screenshots of recent cases where Google was just laughably wrong. He replied:
Interesting. We are seeing the opposite.
Why is that?
I'll happily concede that LLMs are reasonable at outputting stuff which looks plausible and - in many cases - that's all that's necessary. If I can't remember which command line switch to use, AI is easier than crappy documentation. Similarly, if I don't know how to program a specific function, most AIs are surprisingly decent at providing me with something which mostly works.
But the more I know about something, the less competent the AI seems to be.
Let me give you a good example.
At my friend's prompting, I asked Gemini to OCR an old newspaper clipping. It is a decent resolution scan of English text printed in columns. The sort of thing a million AI projects have been trained on. Here's a sample:

So what did Gemini make of it when asked to extract the text from it?
Children at Witham's Chip-
ping Hill Infants School are en-
gaged in trying out all sorts of
imaginations ready for October
31... "And god knows what
strange spirits will be abroad."
That reads pretty well. It is utterly wrong, but it is convincing. This isn't a one-off either. Later in the clipping was this:

I'm sure a child of 6 could read that aloud without making any mistakes. Is Gemini as smart as a 6-year-old?
All the children say halloween
is fun. So it is for 6-year-old
Joanne Kirby admits she will be
staying up to watch on October
31, just in case. She has made a
paper "witch," to "tell stories
about witches," she said.
Again, superficially right, but not accurate in the slightest.
There were half a dozen mistakes in a 300-word article. That, frankly, is shit. I could have copy-typed it and made fewer mistakes. I probably spent more time correcting the output than I saved by using AI.
Boring old Tesseract - a mainstay of OCR - did far better. Yes, it might occasionally mistake a speck of dust for a comma or confuse two similar characters - but it has never invented new sentences!
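For what it's worth, that comparison takes almost no effort to reproduce. Here's a minimal sketch using the pytesseract wrapper - the filename is a placeholder, and the page-segmentation mode is just one reasonable guess for a single newspaper column:

```python
# Minimal sketch: run Tesseract over a scanned clipping via the
# pytesseract wrapper. "clipping.png" is a placeholder filename.
# --psm 4 asks Tesseract to assume a single column of text of
# variable sizes, which suits a newspaper column; other scans
# may need a different mode.
from PIL import Image
import pytesseract

image = Image.open("clipping.png")
text = pytesseract.image_to_string(image, config="--psm 4")
print(text)
```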
Like a fool, I asked Gemini what was going on:

Here's a link to the conversation if you don't believe me.
This isn't just a problem with Gemini - ChatGPT also invented brand-new sentences when scanning the text.
All the children say Halloween is fun, rather than frightening. Six-year-old Joanne Kirby admits she will be “a scary little witch” on the night, but she does like ghost stories.
So what's going on?
A question one has to ask of any source, including LLMs but also newspapers, influencers, podcasts, books, etc., is "how would I know if they were wrong?" This is not a prompt to doubt everything – down that path is denialism – but about reflecting on how much you rely on even "trusted" sources.
— Adrian Hon (@adrianhon.bsky.social) 2025-06-17T15:39:06.772Z
With OCR, it is simple. I can read the ground-truth and see how it compares to the generated output. I don't have to trust; I can verify.
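You can even mechanise that check. Here's a rough sketch - assuming a hand-checked transcript in transcript.txt and the model's output in ocr_output.txt, both placeholder filenames - which counts the differing words using nothing more than Python's standard difflib:

```python
# Rough sketch: compare OCR output against a hand-checked transcript
# and count how many words differ. The filenames are placeholders.
import difflib

ground_truth = open("transcript.txt", encoding="utf-8").read().split()
ocr_output = open("ocr_output.txt", encoding="utf-8").read().split()

matcher = difflib.SequenceMatcher(None, ground_truth, ocr_output)

# Every non-matching block contributes its words as errors
# (covering substitutions, insertions, and deletions).
errors = sum(
    max(i2 - i1, j2 - j1)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
)

print(f"{errors} differing words out of {len(ground_truth)}")
```

Half a dozen mistakes in a 300-word article works out at roughly a 2% word error rate - which sounds small until you remember that the errors were whole invented sentences rather than the odd misread character.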
I suppose I mostly use AI for things with which I have a passing familiarity. I can quickly see when it is wrong. I've never used it for, say, tax advice or instructions to dismantle a nuclear bomb. I'd have zero idea if the information it spat back was in any way accurate.
Is that the difference? If you don't understand what you're asking for then you can't judge whether you're being mugged off.
Or is there something more fundamentally different between users which results in this disparity of experience?

@Edent honestly, because some people can't tell good from bad. You're seeing it in that example - you see the glaring errors, other guy literally doesn't see them or care.
after a while AI bros become sure that quality is fake and if you think it isn't, you're *lying*
Reply to original comment on circumstances.run
@Edent Nice read. More anecdotal evidence:
Recently, google AI somehow managed to get past my usual blockers, so I got to see what it types. I was looking for "bing chilling", a memey phrase associated with John Cena, which means "ice cream" in Chinese.
The AI told me "Bing Chilling" is a game on steam blablabla
Which is technically true. There is a game that named itself after the meme, but that's not what people think of when they say "bing chilling".
Exactly as you point out.
Reply to original comment on mastodon.social
@Edent Because most people are lazy but many others are meticulous.
Reply to original comment on cyberplace.social
@Edent Who *is* right? There's only one way to find out...
FIIIIIIGHT! ...complete environmental collapse
Reply to original comment on mastodon.social
@Edent I think you've got it - it depends how well you can evaluate or fix the output.
I don't trust it, because when summarising, it can make small errors in a particle or a word or a sentence which might even accidentally convey an opposite meaning to the input.
Reply to original comment on front-end.social
@blog I think you nailed it with “If you don't understand what you're asking for then you can't judge whether you're being mugged off.”
@Edent There is this benchmark of LLM vs traditional approach to OCR https://getomni.ai/blog/ocr-benchmark
The evaluation criteria are very transparent and seem sensible. Key quote:
"Traditional models tend to outperform on high-density pages (textbooks, research papers) as well as common document formats like tax forms."
I found it linked in the comments of an HN thread from March discussing the announcement of Mistral OCR.
I would of course take with a pinch of salt the evaluation of their own model.
Reply to original comment on tooting.ch
@Edent I recently used an AI tool for something & the tool included suggested edits to make my prompt “better.” Basically, it suggested I write like James Patterson instead of just giving the AI instructions. So maybe there’s a difference between those of us who treat it like a computer program & those who treat it as a person.
Reply to original comment on mastodon.xyz
@blog on the OCR front, I have personally seen Copilot take a photo of a handwritten court ledger from the late 1600s and do a very very good job of not only OCR but translation from Latin into English.
Whether it would do it repeatedly or reliably I don't know.
Likewise, in another direct test, the latest OpenAI model got a technical multiple-choice answer correct when the model answer was wrong... and 3 experts got to the right place eventually. (RAG test)
Plural of anecdote and all that.
@Edent The HN thread in question is interesting: half of the comments are people testing Mistral OCR and saying it has major issues, the other half is waxing lyrical about how LLMs "solved" OCR https://news.ycombinator.com/item?id=43282905
Missing: a discussion about tradeoffs of accuracy vs genericity.
Reply to original comment on tooting.ch
absolutely agree with the premise here, but also am on the side of "this doesn't accord at all with my experience using the tools", which, for OCR like this, tends to be very accurate (with human checking). e.g. here's Claude
https://bsky.app/profile/amoeba.com.au/post/3lruzqhn4o22e
Reply to original comment on bsky.app
@Edent The inherent volatility of the models also helps mask the flaws in the output if you aren't careful. It means that some of the time when they do actually bother to check the output (or the output has to do with their expertise) it might be one of the times it's randomly actually correct.
And because human brains are bad at estimating both probability and frequency, this probably leads people to dismiss the errors they do notice as one-offs.
Reply to original comment on toot.cafe
@Edent It's because there are *so many* models, with so many differing capabilities, and so many different ways to access them, with commonly no way to tell WHAT you're actually using. I toss it in via the API, ensuring that I'm using the latest (maybe??) and it's perfect. If this was my only experience with AI, I'd always be singing its praises.
Most of the hate or love of the AI tools is like getting a bug report but the user refuses to tell you the version of software they're using, or even what OS!
Reply to original comment on social.lol
Amelia Szymańska says:
I have tried many different types of AI in various contexts, and most of the time they are wrong. When citing 'facts', the source URL is often 404, or the content has nothing to do with the answer from the 'AI'. When coding, it constantly makes changes that I haven't asked for, even though these are simple tasks that a junior developer would do without any problem — they just need time. When I tell the 'AI' about it, it just makes excuses, saying that it only works on statistical analysis and doesn't really know what it's doing.
And I hate the "yes man" mentality.
My main issue is that I always verify the answer, so I quickly realise that it's wrong.
@Edent I find those OCR results surprising because I've been experimenting with OCR against vision LLMs for the past year and have mostly had much better results
How large was the image you fed into it? And which exact model were you using?
The worst results I've had were feeding in a super long image a year ago, it turned out the API I was using resized it down to where the text was illegible and it then hallucinated the answer entirely!
Reply to original comment on fedi.simonwillison.net
@simon Literally the same resolution as in the blog post. Think the full thing was about 2400px wide for 6 columns of text.
Reply to original comment on mastodon.social
@Edent I think it is - I've been banging the drum for a while that the biggest misconception in all of AI is that this stuff is easy to use - I've been exploring it on a daily basis for nearly three years now and I still pick up new tricks all the time
I only started trusting the top models to do workable jobs with OCR against complex documents (like your newspaper clipping) in the past ~4 months
Reply to original comment on fedi.simonwillison.net
@Edent this post is a good example of how unpredictable these vision models can be - https://simonwillison.net/2025/May/18/qwen25vl-in-ollama/ - I got garbage results with a 6GB version of qwen2.5vl running via Ollama on my own machine, someone else then got much better results from the 9GB version of the same model run using MLX
Reply to original comment on fedi.simonwillison.net
@simon @Edent I highlighted the text from one of the screenshots using the native iPhone text from pics function (machine learning based, obv) and it pasted this into notes
"Children at Witham's Chip- ping Hill Infants School are en- joying stretching their imaginations ready for October 31 - when who knows what
strange spirits will be abroad?"
And then, to be more directly LLM about it, I pasted it into Claude Sonnet 4 with a short prompt.
So yes, we are seeing different results 🤷
Reply to original comment on fed.beatworm.co.uk
@blog This explains why managers in my organization, who are used to delegating work to subject matter experts, are all-in on AI. They can't tell the difference in the output.
@blog I have the same issue with science reporting. When it's on a subject I don't know, the report sounds convincing. When it's on a subject I know, I immediately find all sorts of misunderstandings the journalist had about the science they're reporting on, and can't trust anything in the article.
@Edent I think a user’s ability to assess truth is definitely a part of why some claim success while others do not, but the type of prompt also definitely matters. For OCR or math problems, “correct” is unambiguous. The answers to questions like “list some pros and cons of leadership style X” are ambiguous. Consequently, I’ve used AI successfully as a tool to discover and inspire — I don’t expect accuracy or correctness.
Reply to original comment on mastodon.social
@Edent I think there are two cases when an LLM or other genAI may give satisfactory results: 1) you're using an LLM as autocomplete only (essentially an extension of the keyboard) and make sure to read and edit all its output, and 2) you absolutely don't care whether your work output can withstand any examination beyond the most superficial.
Reply to original comment on social.lol
One of the big differences is that traditional methods tend to be bounded or predictable in how wrong they are, while generative methods can end up much further from the desired goal, and their mistakes are harder to detect.
I've noticed this when doing speech synthesis (TTS). Comparing the traditional TTS engines on Google Cloud with the newer "generative" ones, the generative TTS produces more expressive speech (at least without diving into SSML), but sometimes it just goes off-script and says something completely different.
@Edent As a case in point (proving your point that people have dramatically different experiences), I took your newspaper clipping and ran it through Claude, ChatGPT, and Gemini. All OCR’ed it perfectly on the first try.
My experience has been mixed. I find that Claude and ChatGPT (using the latest models) almost always get things like this correct.
And Gemini frequently hallucinates. I was actually surprised it got the task correct for me, and was surprised to see in your test that ChatGPT got it wrong. In news stories about wildly wrong hallucinations, Gemini is often the culprit.
Reply to original comment on hachyderm.io
I tried out the "How many i's in teamwork" question for Google to see if they had fixed it...
...they have not.
Reply to original comment on bsky.app
It's a good answer to the question, and it definitely seems to rely on how much you already know about a subject, and also how critical an error might be. I tend to only ever use it as a secondary check on things like research or writing after I've already done the work and can spot the AI lies.
Reply to original comment on bsky.app
The OCR example is not very well chosen. LLMs are not trained on that (while they are trained on old newspapers, those would have been OCR'ed beforehand), and unless you are in the "LLM = general AI" camp, it's not reasonable to expect them to be able to OCR.
That said, I agree with your central points, and I think there's an even simpler explanation why some people find LLMs useful: they never understood their job. They always coasted along on "good enough," hoping not to get caught. Basically, an inverse impostor syndrome.
For many problems, a wonky half-way solution is enough most of the time, and issues will only matter much later. I have had many colleagues, not just juniors, fundamentally not understanding programming, yet churning out code that kind-of handles the happy case while silently returning wrong values in edge cases. Many of them were very enthusiastic about AI.
That's the main use-case of AI: identifying the 0.1x member of the team.
Giving both of your referenced images to Gemini gave 100% accurate responses. However, I think your point is a great one. AI is not the tool for every job, nor even close to most jobs at this point. Where it helps, though, is to advise on tooling, or when given tools to use (Claude with Python, etc.).
Reply to original comment on bsky.app
@Edent And for others still, regardless of whether AI is 'right', it matters that genAI and the companies behind it destroy the earth, democracies and brains.
Reply to original comment on akademienl.social
@Edent wait wait wait, are you telling me that LLM outputs are not only inconsistent, but their inconsistencies are inconsistent?
Reply to original comment on meow.social
@Edent the bit where you talk about generative AI making mistakes with OCR reminded me of a great interview with @timnitGebru, and how she points out that LLMs can be less accurate at text-to-speech than existing machine learning approaches https://techwontsave.us/episode/267_ai_hype_enters_its_geopolitics_era_w_timnit_gebru
Reply to original comment on hachyderm.io
@Edent - I think you hit the nail on the head there, “But the more I know about something, the less competent the AI seems to be.”
Reply to original comment on front-end.social
This article gives a good explanation of why different people have such different impressions of using LLMs. Impressions are emotional. Facts can be checked.
Reply to original comment on feddit.bg
@Edent I still maintain the best summary I’ve seen is that “if you don’t care, it seems miraculous. If you do care, the illusion falls apart pretty quickly”. Without fail, the people I’ve encountered who use and value LLM output simply do not care. It saves them time and effort and, if the output is garbage… meh.
Reply to original comment on infosec.exchange
@Edent
Some uses don't require error-free results, so the users don't care. A 90% accurate set of meeting minutes is usually sufficient (unless called in legal proceedings down the line), so most will be happy with that. AI editing of blemishes in a photo usually won't add noticeable artifacts like extra fingers, so again most will be happy.
But other use cases require accuracy. Programming is an example where a small error breaks everything.
Reply to original comment on masto.bike
@tony @Edent Code seems to be a bigger weakness than people imagine. I've never had any of the LLMs manage to produce working code from the outset. Most of the time it just repeats the junk it produced earlier, even after being corrected. Trying to use it to speed up your progress seems to take longer and longer. I've tried a couple that let you upload documentation for it to use as reference, but there's still no guarantee that what's coming out isn't junk.
Tried to get ChatGPT to produce a simple Perl script for WWW::Subsonic - nothing that came out of it was even close to working.
Reply to original comment on toot.net-pbx.com
@Edent Just last week, a friend of mine told me they got advice from a lawyer, then asked ChatGPT the same question and are now doubting the lawyer, because the #LLM gave a different answer.
I suggested asking questions in their own area of expertise, to which they definitely know the correct answer, to assess how well the LLM actually performs.
I think this is something most people don't even consider doing. It's like one of those introductory examples to testing hypotheses.
Reply to original comment on fosstodon.org
As several* people saw that image and said "shut up and take my money" - I am happy to report that I am shutting up and taking people's money.
http://www.redbubble.com/shop/ap/1716...
Reply to original comment on bsky.app
|More comments on Mastodon.