Why are there strongly conflicting opinions about the effectiveness of LLMs?

I’ve read a lot of discussions of LLMs and LLM agents, and one of the most frustrating things about them is that for every comment that reports amazing results, there is its opposite suggesting that LLMs are good for nothing at all.

Why could that be? I think there are several factors at play:

  • Expectations in terms of what an LLM should be able to do
  • Expectations in terms of code quality
  • Application domain and target
  • Approach to using the tools
  • Choice of tools

Expectations in terms of what an LLM should be able to do

People have different reactions to the AI hype. Some try to ignore it, others set their expectations to align with the hype. Consequently, people will attempt different tasks with LLMs and react differently to the success or failure. One person might be happy that an LLM can generate a few CSS styles or one small function at a time. Another will be disappointed that it couldn’t generate a whole application in a single shot.

The sales pitches, of course, try to imply that LLMs can pop out complete applications, but the reality is likely very different.

Expectations in terms of code quality

People have very different levels of experience and very different standards regarding what good enough code looks like. A junior developer might be overjoyed with the LLM output because they are not aware of the problems contained in the generated code. If it works at all, it’s fine! A seasoned engineer might not be happy even if the output is passable, because it didn’t quite get the abstractions right or failed to deal with a corner case. People judge LLM output relative to the quality of their own output.

Application domain and target

It seems pretty clear that results vary depending on domain (e.g. programming language, the intended functionality) and target (i.e. a brand new project vs. altering an existing codebase).

The level of success in any given domain is largely driven by the volume of training data available for that domain.

In terms of target, it’s down to context limitations. It’s one thing to scaffold a new application. It’s another thing to bolt things onto an existing codebase.

The LLM context window is limited (and the usable portion is most likely smaller than advertised, in part because things like the system prompt consume some of it). There is only so much input that can go in, which means you can’t just chuck your million-LOC codebase into an LLM. It will therefore have limited knowledge of your codebase and will be more likely to fail.
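To make that concrete, here’s a rough back-of-envelope sketch. The numbers are assumptions for illustration: a common rule of thumb of roughly 4 characters per token, ~40 characters per line of code, and a 200K-token context window (in the ballpark of what many current models advertise).

```python
# Rough sketch: would a large codebase fit in an LLM context window?
# All figures below are illustrative assumptions, not exact measurements.

def estimate_tokens(total_chars: int, chars_per_token: float = 4.0) -> int:
    """Estimate token count using the common ~4 chars/token rule of thumb."""
    return int(total_chars / chars_per_token)

# A hypothetical 1,000,000-LOC codebase at ~40 characters per line:
codebase_chars = 1_000_000 * 40
tokens_needed = estimate_tokens(codebase_chars)

context_window = 200_000  # assumed advertised window, in tokens
fits = tokens_needed <= context_window

print(f"~{tokens_needed:,} tokens needed; fits in window: {fits}")
```

Even with generous assumptions, the codebase needs on the order of ten million tokens, roughly 50 times the assumed window, which is why agent tools have to be selective about what context they feed the model.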

Approach to using the tools

There is quite a bit of nuance in managing the context supplied to the LLM. Some people take a very systematic approach to it. I’ve seen others throwing three word prompts at an LLM and expecting a fully formed React component to come out.

Perhaps some are much better at “going with the grain” of the LLM output and accepting its implementation, while others want the code to look exactly as they would have written it.

People have different ideas about value for money as well. A person running four $200/month Claude Code instances in parallel might get quite a few more successes (and at the same time be happier with the value they’re getting) than somebody who’s trying out the free tier and not getting very far.

Choice of tools

People are using the tools differently. Some are still just copy-pasting in web chat. Others are leaning into the features of agent IDEs. Some are using the cheapest thing available whereas others think whatever they pay is great value because it’s so much cheaper than a developer.
