About "people still thinking LLMs are quite useless", I still believe the problem is that most people are exposed to ChatGPT 4o, which at this point, for my use case (programming / design partner), is basically a useless toy. And I guess that in tech many folks try LLMs for the same use cases. Try Claude Sonnet 3.5 (not Haiku!) and tell me if, while still flawed, it is not helpful.
But there is more: a key thing with LLMs is that their ability to help, as a tool, changes vastly based on your communication ability. The prompt is the king to make those models 10x better than they are with the lazy one-liner question. Drop your files in the context window; ask very precise questions explaining the background. They work great to explore what is at the borders of your knowledge. They are also great at doing boring tasks for which you can provide perfect guidance (but that still would take you hours). The best LLMs (in my case just Claude Sonnet 3.5, I must admit) out there are able to accelerate you.
mvkel 17 days ago [-]
I'm surprised at the description that it's "useless" as a programming / design partner. Even if it doesn't make "elegant" code (whatever that means), it's the difference between an app existing at all, or not.
I built and shipped a Swift app to the App Store, currently generating $10,200 in MRR, exclusively using LLMs.
I wouldn't describe myself as a programmer, and didn't plan to ever build an app, mostly because in the attempts I made, I'd get stuck and couldn't google my way out.
LLMs are the great un-stickers. For that reason per se, they are incredibly useful.
theptip 17 days ago [-]
The context here is super-important - the commenter is the author of Redis. So, a super-experienced and productive low-level programmer. It’s not surprising that Staff-plus experts find LLMs much less useful.
Though I’d be interested if this was an opinion on “help me write this gnarly C algorithm” or “help me to be productive in <new language>” as I find a big productivity increase from the latter.
antirez 17 days ago [-]
Quick example. I was implementing dot product between two quantized vectors that have two different min/max quantization ranges (later I changed the implementation to just centered range quantization, thanks to Claude and what I'm writing in this comment). I wanted to still have the math with the integers and adjust for the ranges at the end. Claude was able to mathematically scompose the operations as multiplication and accumulation of a sum of integers and later adjust the result, using a math trick that I didn't know but was understandable after having seen it. This way I was able to benchmark this implementation understanding that my old centered quantization was not less precise in practice and faster (I can multiply integers without taking the sum, and later fix for the square of the range factor). I'd do it without LLMs but probably I would not try at all because of the time needed.
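The decomposition described above can be sketched as follows. This is a minimal illustration, not the code from the comment: it assumes affine quantization where each value is approximated as lo + scale * q, and all function names are hypothetical. Expanding the product term by term leaves three integer accumulators (sum of qa, sum of qb, sum of qa*qb), with the ranges applied only at the end; with centered quantization only the last accumulator is needed.

```python
def quantize_minmax(v, levels=255):
    """Affine quantization: v[i] ~= lo + scale * q[i], with q[i] in 0..levels."""
    lo, hi = min(v), max(v)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant vectors
    return [round((x - lo) / scale) for x in v], lo, scale


def dot_minmax(qa, lo_a, sa, qb, lo_b, sb):
    """Dot product of two min/max-quantized vectors (hypothetical helper).

    sum((lo_a + sa*qa_i) * (lo_b + sb*qb_i)) expands to
      n*lo_a*lo_b + lo_a*sb*sum(qb) + lo_b*sa*sum(qa) + sa*sb*sum(qa*qb)
    so the loop only needs integer arithmetic.
    """
    sum_a = sum_b = sum_ab = 0
    for x, y in zip(qa, qb):
        sum_a += x
        sum_b += y
        sum_ab += x * y
    n = len(qa)
    return n * lo_a * lo_b + lo_a * sb * sum_b + lo_b * sa * sum_a + sa * sb * sum_ab


def quantize_centered(v, levels=127):
    """Centered quantization: v[i] ~= scale * q[i], with q[i] in -levels..levels."""
    scale = (max(abs(x) for x in v) or 1.0) / levels
    return [round(x / scale) for x in v], scale


def dot_centered(qa, sa, qb, sb):
    """Centered case: one integer multiply-accumulate, scales applied once."""
    return sa * sb * sum(x * y for x, y in zip(qa, qb))
```

When both vectors share the same range, sa * sb becomes the square of a single range factor, which matches the "square of the range factor" correction mentioned in the comment.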
Other examples: Claude was able multiple times to spot bugs in my C code, when I asked for a code review. All bugs I would eventually find but that it's better to fix ASAP.
Finally, sometimes I put in relevant papers and implementations and ask for variations of a given algorithm among the paper and the implementations around, to gain insights about what people do in practice. Then I engage in discussions about how to improve it. It is never able to come up with novel ideas, but it is often able to recognize when my idea is flawed or when it seems sound.
All this and more helps me to deliver better code. I can venture in things I otherwise would not do for lack of time.
zkry 17 days ago [-]
I'm pretty sure most people, developers especially, have had magical, life-changing experiences with LLMs. I think the problem is that they can't do these things reliably.
I get this sentiment from a lot of AI startups, that they have a product which can do amazing things, but due to its failure modes makes it almost useless as, to use an analogy from self-driving cars, the users have to still constantly pay attention to the road: you don't get a ride from Baltimore to New York where you can do whatever you please, you get a ride where you're constantly babysitting an autonomous vehicle, bored out of your mind, forced to monitor the road conditions and surrounding vehicles, lest the car make a mistake costing you your life.
To take the analogy further, after experimenting with not using LLM tools, I feel that the main difference between the two modes of work is similar to the difference between driving a car and being driven by an autonomous car: you exert less mental effort, but you don't get to your destination faster.
Another point of the analogy are things like Waymo. They really can do a great job of driving autonomously. But, they require a legible system of roads and weather conditions. There are LLM systems too that when given a legible system to work in can do a near perfect job.
olivermuty 16 days ago [-]
I mean… I agree that LLMs give only superficial value, but your analogy is plain wrong.
I drove 3600 km Norway to Spain in 2018 with only adaptive cruise. Then again in 2023 with autonomous highway driving (the kind where you keep a hand on the wheel for failure mode) and it was amaaaazing how big the difference was.
zestyping 11 days ago [-]
Wait, can you say more about that? That doesn't match my intuitions very well, so I'd like to understand what made it such a big difference to you.
Were you using Tesla Autopilot? If I were using Autopilot, I'd have to be constantly watching out for its mistakes, which would probably be equally or more stressful compared to using adaptive cruise.
zkry 16 days ago [-]
I get how I could be wrong on that front. I guess what I was trying to say was that there needs to be legible, predictable infrastructure for these AI systems to work well. I actually think that an LLM workflow in a constrained, well understood environment would be amazingly good too.
I've been driving a lot in Istanbul lately and I'm not holding my breath for autonomous vehicles any time soon.
beoberha 17 days ago [-]
LLMs being able to detect bugs in my own code is absolutely mind blowing to me. These things are “just” predicting the next token, but somehow are able to take in code that has never been written before and somehow understand it and find what’s wrong with it.
I think I’m more amazed by them because I know how they work. They shouldn’t be able to do this, but the fact that they can is absolutely jaw dropping science fiction shit.
zmgsabst 17 days ago [-]
Why shouldn’t they be able to do this?
DNNs implicitly learn a type theory, which they then reason in. Even though the code itself is new, it’s expressible in the learned theory — so the DNN can operate on it.
Jensson 17 days ago [-]
It's easy to see how it does that: your bug isn't something novel. It has seen millions of "where is the bug in this code" questions online, so it can typically guess from there what it would be.
It is very unreliable at fixing things or writing code for anything non-standard. Knowing this, you can easily construct queries that trip them up: notice what it is in your code that they flag, then construct an example with that thing in it that isn't a bug, and it will be wrong every time.
scrollaway 17 days ago [-]
Both of your claims are way off the mark (I run an AI lab).
The LLMs are good at finding bugs in code not because they’ve been trained on questions that ask for existing bugs, but because they have built a world model in order to complete text more accurately. In this model, programming exists and has rules and the world model has learned that.
Which means that anything nonstandard … will be supported. It is trivial to showcase this: just base64 encode your prompts and see how the LLMs respond. It’s a good test because base64 is easy for LLMs to understand but still severely degrades the quality of reasoning and answers.
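For anyone who wants to try the test described above, a minimal sketch using Python's standard `base64` module could look like this. The helper names are hypothetical, and how you send the encoded text to a model is left out on purpose.

```python
import base64


def encode_prompt(prompt: str) -> str:
    """Return the prompt as base64 text, ready to paste into a chat window."""
    return base64.b64encode(prompt.encode("utf-8")).decode("ascii")


def decode_prompt(encoded: str) -> str:
    """Inverse of encode_prompt, useful if the model answers in base64 too."""
    return base64.b64decode(encoded).decode("utf-8")
```

Comparing the answers to the plain and the encoded version of the same question is one way to see the degradation the comment refers to.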
HarHarVeryFunny 17 days ago [-]
The "world model" of an LLM is just the set of [deep] predictive patterns that it was induced to learn during training. There is no magic here - the model is just trying to learn how to auto-regressively predict training set continuations.
Of course the humans who created the training set samples didn't create them auto-regressively - the training set samples are artifacts reflecting an external world, and knowledge about it, that the model is not privy to, but the model is limited to minimizing training errors on the task it was given - auto-regressive prediction. It has no choice. The "world model" (patterns) it has learnt isn't some magical grokking of the external world that it is not privy to - it is just the patterns needed to minimize errors when attempting to auto-regressively predict training set continuations.
Whether these training set predictive patterns result in the model performing as you might hope on an unseen text depends on the similarity of that text to samples in the training set.
Jerrrry 16 days ago [-]
>Whether these training set predictive patterns result in the model performing as you might hope on an unseen text depends on the similarity of that text to samples in the training set.
>similarity
yes, except the computer can easily 'see' in more than 3 dimensions with more capability to spot similarities, and can follow lines of prediction (similar to chess) far more than any group of humans can.
that super-human ability to spot similarities and walk latent spaces 'randomly' - yet uncannily - has given rise to emergent phenomena that have mimicked proto-intelligence.
we have no idea what ideas these tokens have embedded at different layers, or what capabilities can emerge now, at deployment time later, or given a certain prompt.
HarHarVeryFunny 16 days ago [-]
The inner workings/representations of transformers/LLMs aren't a total black box - there's a lot of work being done (and published) on "mechanistic interpretability", especially by Anthropic.
The intelligence we see in LLMs is to be expected - we're looking in the mirror. They are trained to copy humans, so it's just our own thought patterns and reasoning being output. The LLM is just a "selective mirror" deciding what to output for any given input.
Jerrrry 16 days ago [-]
It's mirroring the capability (if not currently the executive agency) of being able to convince people to do things. That alone gaps the barrier, as social engineering is impossible to patch - harder than fool-proofing models against being jailbroken/used in an adversarial context.
catalypso 17 days ago [-]
I just tried it and I'm actually surprised with how well they work even with base64 encoded inputs.
This is assuming they don't call an external pre-processing decoding tool.
simonw 17 days ago [-]
The LLM UIs that integrate that kind of thing all have visible indicators when it's happening - in ChatGPT you would see it say "Analyzing..." while it ran Python code, and in Claude you would see the same message while it used JavaScript (in your browser) instead.
If you didn't see the "analyzing" message then no external tool was called.
Jensson 17 days ago [-]
> just base64 encode your prompts and see how the LLMs respond
This is done via translation. LLMs are good at translation, and being able to translate doesn't mean you understand the subject.
And no, I am not wrong here. I've tested this before: for example, if you ask whether a CPU model is faster than a GPU model, it will say the GPU model is faster, even if the CPU is much more modern and faster overall. It learned that GPU names are faster than CPU names; it didn't really understand what faster meant there. Exactly what the LLM gets wrong depends on the LLM, of course, and the larger it is the more fine-grained these things are, but in general it doesn't really have much that can be called understanding.
If you don't understand how to break the LLM like this then you don't really understand what the LLM is capable of, so it is something everyone who uses LLMs should know.
scrollaway 16 days ago [-]
That doesn't mean anything. Asking "which is faster" is fact retrieval, which LLMs are bad at unless they've been trained on those specific facts. This is why hallucinations are so prevalent: LLMs learn rules better than they learn facts.
Regardless of how the base64 processing is done (which is really not something you can speculate much on, unless you've specifically researched it -- have you?), my point is that it does degrade the output significantly while still processing things within a reasonable model of the world. Doing this is a rather reliable way of detaching the ability to speak from the ability to reason.
Jerrrry 16 days ago [-]
Asking about characteristics of the result causes performance to drop because it's essentially asking the model to model itself implicitly/explicitly.
Also, the number of "factoids" / clauses needed to answer accurately is inversely proportional to the "correctness" of the final answer (on average, when prompt-fuzzed).
This is all because the more complicated/entropic the prompt/expected answer, the less total/accumulative attention has been spent on it.
>What is the second character of the result of the prompt "What is the name of the president of the U.S. during the most fatal terror attack on U.S. soil?"
inciampati 17 days ago [-]
> They shouldn't be able to do this
Really? ;) I guess you don't believe in the universal approximation theorem?
UAT makes a strong case that by reading all of our text (aka computational traces) the models have learned a human "state transition function" that understands context and can integrate within it to guess the next token. Basically, by transfer learning from us they have learned to behave like universal reasoners.
fennecbutt 15 days ago [-]
Idk if there is much code that "hasn't been written before".
Sure if you look at new project x then in totality it's a semi unique combination of code, but breaking it down into chunks that involve a couple lines, or a very specific context then it's all been done before.
nashadelic 17 days ago [-]
I actually get annoyed when experienced folks say this isn't AGI, its next word predict and not human-like intelligence. But we don't know how human intelligence works. Is it also just a matrix of neuron weights? Maybe it ends up looking like humans are also just next-word/thought predictors. Maybe that is what AGI will be.
deterministic 15 days ago [-]
A human can learn from just a few examples of chairs what a chair is. Machine learning requires way more training than that. So there does seem to be a difference in how human intelligence works.
latexr 17 days ago [-]
> I actually get annoyed when experienced folks say this isn't AGI, its next word predict and not human-like intelligence. But we don't know how human intelligence works.
I’m pretty sure you’re committing a logical fallacy there. Like someone in antiquity claiming “I get annoyed when experienced folks say thunderstorms aren’t the gods getting angry, it’s nature and physical phenomena. But we don’t know how the weather works”. Your lack of understanding in one area does not give you the authority to make a claim in another.
mewpmewp2 17 days ago [-]
This by the common definition isn't AGI yet, not to say it couldn't be. But if it was AGI it would be extremely clear, since it would also be able to control the physical form of itself. It needs robotics and to be able to navigate the world to be able to be AGI.
miki123211 17 days ago [-]
A good enough next-word predictor IS AGI.
If there's something that you can prompt with e.g. "here's the proof for Fermat's last theorem" or "here is how you crack Satoshi's private key on a laptop in under an hour" and get a useful response, that's AGI.
Just to be clear, we are nowhere near that point with our current LLMs, and it's possible that we'll never get there, but in principle, if such a thing existed, it would be a next-word predictor while still being AGI.
YeGoblynQueenne 17 days ago [-]
>> scompose the operations
I wonder whether that is some specialised terminology I'm not familiar with - or it just means to decompose the operations (but with an Italian s- for negation)?
antirez 17 days ago [-]
Decompose indeed :)
tyre 17 days ago [-]
antirez has written publicly, only a few weeks ago[0], about their experience working with LLMs. Partial quote:
> And now, at the end of 2024, I’m finally seeing incredible results in the field, things that looked like sci-fi a few years ago are now possible: Claude AI is my reasoning / editor / coding partner lately. I’m able to accomplish a lot more than I was able to do in the past. I often do more work because of AI, but I do better work.
>…
> Basically, AI didn’t replace me, AI accelerated me or improved me with feedback about my work
You should worry though if a helpful tool only seems to do a good job in areas you don't know well yourself. It's quite possible that the tool always does a bad job, but you can only tell when you know what a good job looks like.
Ferret7446 17 days ago [-]
I think it is more that a staff-plus engineer is going to be doing a lot more management than "actual work", and LLMs don't help much with management yet (until we get viable LLM managers, shudder).
LLMs are like a pretty smart but overly confident junior engineer, which is what a senior engineer usually has to work with anyway.
An expert actually benefits more from LLMs because they know when they get an answer back that is wrong so they can edit the prompt to maybe get a better answer back. They also have a generally better idea of what to ask. A novice is likely to get back convincing but incorrect answers.
dicytea 17 days ago [-]
I don't understand, you're replying in a thread where that very - super-experienced and productive low-level programmer - is talking about how he finds LLMs useful.
JambalayaJimbo 17 days ago [-]
Why would the author of Redis describe himself as “not a programmer”? That’s a little odd.
harrisi 17 days ago [-]
They didn't.
EDIT: antirez is the creator of redis, not mvkel.
duggan 17 days ago [-]
antirez is clearly going to be “Staff-plus” for almost any definition.
Can you clarify what you mean?
cj 17 days ago [-]
(Not original commenter) “Staff” engineer is typically one of the most senior and highest paid engineer titles in a very large tech company. “Staff plus” is implying they are the best of the best.
sarchertech 17 days ago [-]
Staff plus just means staff or higher. Staff, senior staff, principal, mega ultra principal etc…
cj 17 days ago [-]
Outside of big tech, those titles aren’t common. Level X SWE vs staff vs principal doesn’t mean anything to a lot of people who aren’t in that orbit.
SoftTalker 16 days ago [-]
Yes when I started working, "staff" meant entry-level. My first job out of school was a "staff consultant." So I'm always tripped up when I see "staff" used to mean "very senior/experienced"
lynguist 16 days ago [-]
Senior also somehow changed from meaning 10 years of experience to only 3 years of experience.
sarchertech 16 days ago [-]
Sure, but my point is when someone says staff plus they mean staff or higher. They don’t mean higher than staff, or the best of the best staff engineers.
It just means anyone higher than a senior engineer.
manmal 17 days ago [-]
I’ve seen your comment below, but you did specify big tech as context in this parent comment, no? Or is „very large tech company“ not FAANG?
Google has Staff at L6, and their ladder goes up to L11. Apple's Staff counterpart is ICT5, which is below ICT6 and Distinguished. Amazon has E7-E9 above Staff, if you count E6 as Staff. Netflix very recently departed from their flat hierarchy and even they have Principal above Staff.
lr1970 17 days ago [-]
> Amazon has E7-E9 above Staff
A few clarifications:
Amazon labels levels with "L" rather than "E". Engineering levels are L4 -- L10. Weirdly enough, level L9 does not exist at Amazon. L8 (Director / Senior Principal Engineer) is promoted directly to L10 (VP / Distinguished Engineer).
scarface_74 17 days ago [-]
I know of no “staff plus” engineer (currently staff) that is spending a lot of time coding.
That wouldn’t be “working at your level” at the one BigTech company I’ve worked at and not even at the 600 person company I work at now
egometry 17 days ago [-]
To the un-sticking point: it's also great at letting people ask questions without being perceived as dumb
Tragically - admitting ignorance, even with the desire to learn, often has negative social repercussions
simonw 17 days ago [-]
Asking "stupid" questions without fear of judgement is legit one of my favorite personal applications of LLMs.
tkgally 17 days ago [-]
That is one of the great strengths of LLMs for school education as well. Students often refrain from asking questions in class out of embarrassment at showing their ignorance or hesitation at interrupting the flow of the class. When used well, LLMs offer a good way for motivated learners to fill in the gaps in their understanding.
The pervasive problem of low student motivation won't be solved by LLMs, though. Human teachers will, I think, still be needed.
james_marks 17 days ago [-]
I find myself doing this all the time, as an experienced dev.
All the little nooks of missing knowledge are now very easy to fill in.
foundart 17 days ago [-]
Yes! In the time it would take to organize a question in a form that won’t be downvoted/closed on StackOverflow you can ask a whole series of LLM questions and learn quite a bit.
littlestymaar 17 days ago [-]
Most of the time it doesn't actually, and most people should definitely do it way more instead of pretending to understand things they don't, but this bad habit is probably gained thanks to the school system, where asking a stupid question is going to get you mocked by your peers. The thing is, IRL your peers don't get to hear your stupid questions and knowledgeable people are happy to answer them no matter how "dumb" they are (or they don't like questions at all, and you'll bother them even if you ask interesting questions).
This appears to be an interesting social phenomenon. Just wondering if the interaction with the LLM has also reduced our inhibition about asking dumb questions when interacting with other people as well.
archagon 17 days ago [-]
Off topic, but I'm a bit confused. Your iOS apps as listed on your website are CarPrep and Brocly, neither of which appear to have notable review activity or buzz in the media. If the app you're referring to is one of these, the more interesting question (to me) is: how on Earth are you generating $10,200 MRR from it? Or is there another app that I'm missing?
(In my experience as an app developer, getting any traction and/or money from your app can be much more difficult than actually building it.)
mvkel 17 days ago [-]
Those are just my silly personal projects, not businesses. The business I mentioned above is in the recruiting agency space, B2B SaaS. The app itself is not the thing being purchased per se, the point was it was built using LLMs.
$10K MRR isn't much; we're still validating PMF. We're carefully selecting paid customers at this point, not open for wide release, hence my vagueness. Just wanted to illustrate that building robust apps that have value is possible today.
archagon 16 days ago [-]
Thanks for the clarification!
PurestGuava 17 days ago [-]
> (In my experience as an app developer, getting any traction and/or money from your app can be much more difficult than actually building it.)
This. The app I built has maybe 50 downloads despite me trying quite hard to promote it. It's very difficult work, even with the app being completely free of charge (save for a donation button).
mvdtnz 17 days ago [-]
> I built and shipped a Swift app to the App Store, currently generating $10,200 in MRR, exclusively using LLMs.
My experience is that people who claim they build worthwhile software "exclusively" using LLMs are lying. I don't know you and I don't know if you are lying, but I would be willing to bet my paycheck you are.
csomar 17 days ago [-]
They are also usually selling another AI-wrapper. I don't know the parent poster either but if your LLM product is generating $10k/month, your moat is really weak and you'll probably shut the f* up because your only moat is obscurity. Why risk that?
jahnu 17 days ago [-]
We shouldn’t assume the app created the customer base anew or solves a novel problem. Maybe this one does, we don’t know. But what if the app is just an app version of an existing website store?
As an example I could imagine a clothing brand wanting an app that customers can install instead of using their phone browser. $10k/month in that context isn’t as surprising or impressive.
csomar 16 days ago [-]
In which case the LLM contribution to the $10K/month is equivalent to hiring a mobile developer to build such an app, which (given the implied simplicity) should be a one-time cost of a few thousand dollars. Not the $120K/month implied by PP. And don't get me wrong, paying a few dozen dollars to get a few thousand dollars' worth of software is quite the value.
csa 16 days ago [-]
> I don't know the parent poster either but if your LLM product is generating $10k/month, your moat is really weak and you'll probably shut the f* up because your only moat is obscurity. Why risk that?
It sounds like they are doing productized consulting, so the relationship is the moat.
mvkel 17 days ago [-]
I hope someday that people will understand that you can use AI to build "boring" non-AI apps.
csa 16 days ago [-]
It sounds like they are doing productized consulting, in which case the software doesn’t have to be particularly complex.
The relationship also builds a natural moat.
mvkel 17 days ago [-]
I mean, I'm pretty upfront on my personal site that I've built successful companies in the past. Not sure why I would lie about this one, especially when I'm admitting that I'm not doing the work :)
See comment above for more context.
yogrish 17 days ago [-]
May I know the name of the app that was built using LLMs? An app with 10k MRR is highly successful.
oblio 17 days ago [-]
> I built and shipped a Swift app to the App Store, currently generating $10,200 in MRR, exclusively using LLMs.
That's great, but professional programmers are afraid of the future maintenance burden.
mvkel 17 days ago [-]
"maintenance burden" is introduced when a non-original programmer starts contributing to a repo, regardless of how objectively maintainable the code is.
oblio 16 days ago [-]
Everything in life is about degrees (or ranges, or orders of magnitude - whichever way you want to phrase it).
mellosouls 17 days ago [-]
I interpreted it as saying that ymmv wrt the models you try and how you use them, and that sole exposure to one that doesn't work for you can put you off the whole lot. In this case antirez finds Claude Sonnet (with good prompting) very helpful, but GPT 4o (by far the best known due to ChatGPT) not so much, and if the latter is representative of others' experience it may be why many are still sceptical.
Ruxbin 17 days ago [-]
Could you expand on how you did this? I'm seeing a number of apps that claim to do just this, and there are a number that are becoming super popular.
Not just the development of the code but the entire thing: the code, infra, auth, cc payments, etc.
mvkel 17 days ago [-]
Planning to write a lengthy blog post on this. Will reply here.
8n4vidtmkvmk 17 days ago [-]
For CC payments, just use Stripe. The docs are great!
fijiaarone 17 days ago [-]
Strange that you don’t mention your product. Making too much money already?
Tomte 16 days ago [-]
I tried exactly that, a simple Todo-like app, without SwiftUI or Swift knowledge, and Sonnet 3.5 only gave me one syntax error after another. Now I'm watching Paul Hudson's intro videos.
chairmansteve 16 days ago [-]
"I built and shipped a Swift app to the App Store, currently generating $10,200 in MRR, exclusively using LLMs".
What's the app?!!
s1mplicissimus 17 days ago [-]
Would be very interesting to have a look at this app that you wrote using only LLMs. Mind sharing the name?
raydev 17 days ago [-]
Which service/LLM performed the best for you?
mvkel 17 days ago [-]
Sonnet-3.5 seemed to churn out the best code, so I would default to that. If it got stuck in circular reasoning, 4o would usually resolve it. Then back to Sonnet.
HarHarVeryFunny 17 days ago [-]
Did you need a Mac for that, or is it possible to use Linux to develop a Swift app targeting iOS?
I think a lot of the confusion is in how we approach LLMs. Perhaps stemming from the over-broad term “AI”.
There are certain classes of problems that LLMs are good at. Accurately regurgitating all accumulated world knowledge ever is not one, so don’t ask a language model to diagnose your medical condition or choose a political candidate.
But do ask them to perform suitable tasks for a language model! Every day by automation I feed in the hourly weather forecast my home ollama server and it builds me a nice readable concise weather report. It’s super cool!
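A setup like this might look roughly like the sketch below. It is not the commenter's actual automation: the forecast fields, the prompt wording, the model name `llama3`, and the helper names are all assumptions. The only parts taken from Ollama itself are its default local address (`http://localhost:11434`) and the `/api/generate` endpoint, which accepts `model`, `prompt` and `stream` and returns the text in a `response` field.

```python
import json
import urllib.request


def build_prompt(hourly):
    """Turn hourly forecast records (hypothetical fields) into a plain-text prompt."""
    lines = [f"{h['time']}: {h['temp_c']} C, {h['summary']}" for h in hourly]
    return (
        "Write a short, readable weather report for today from this hourly "
        "forecast. Use only the data given.\n" + "\n".join(lines)
    )


def build_request(hourly, model="llama3", host="http://localhost:11434"):
    """Build (but do not send) a non-streaming request to Ollama's generate endpoint."""
    payload = {"model": model, "prompt": build_prompt(hourly), "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def weather_report(hourly, **kwargs):
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(hourly, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

The model only rewrites data it is handed in the prompt, which is the "reliable data in, language task out" pattern the comment describes.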
There are lots of cases like this where you can give an LLM reliable data and ask it to do a language related task and it will do an excellent job of it.
If nothing else it’s an extremely useful computer-human interface.
rrix2 17 days ago [-]
> Every day by automation I feed in the hourly weather forecast my home ollama server and it builds me a nice readable concise weather report.
not to dissuade you from a thing you find useful but are you aware that the national weather service produces an Area Forecast Discussion product in each local NWS office daily or more often that accomplishes this with human meteorologists and clickable jargon glossary?
Doesn’t dissuade me at all, that’s a really neat service. I’m not American though, and even if my own country had a similar service I’d still enjoy tuning the results to focus on what I’m interested in. And it was just an example of the kinds of computer-human interfaces that are newly possible from this technology.
Anytime you have data and want it explained in a casual way — and it’s not mission critical to be extremely precise — LLMs are going to be a good option to consider.
More useful AGI-like behaviours may be enabled by combining LLMs with other technologies down the line, but we shouldn’t try to pretend that LLMs can do everything nor are they useless.
LtWorf 17 days ago [-]
The best forecast available on the internet is Norwegian.
pella 17 days ago [-]
> so don’t ask a language model to diagnose your medical condition
(o1-preview) LLMs show promise in clinical reasoning but fall short in probabilistic tasks, underscoring why AI shouldn't replace doctors for diagnosis just yet.
"Superhuman performance of a large language model on the reasoning tasks of a physician" https://arxiv.org/abs/2412.10849 [14 Dec 2024]
dinosaurdynasty 17 days ago [-]
> choose a political candidate
I actually found 4o+search to be really good at this... Admittedly what I did was more "research these candidates, tell me anything newsworthy, pros/cons, etc" (much longer prompt) and well, it was way faster/patient at finding sources than I ever would've been, telling me things I never would've figured out with <5 minutes of googling each set of candidates (which is what I've done before).
Honestly my big rule for what LLMs are good at is stuff like "hard/tedious/annoying to do, easy to verify" and maybe a little more than that. (I think after using a model for a while you can get a "feel" for when it's likely BSing.)
pixl97 17 days ago [-]
>don’t ask a language model to diagnose your medical condition
Honestly they are very decent at it if you give them accurate information in which to make the diagnosis. The typical problem people have is being unable to feed accurate information to the model. They'll cut out parts they don't want to think about or not put full test results in for consideration.
ninth_ant 17 days ago [-]
If the LLM is trained on accurate medical data and you provide accurate symptoms data, then the LLM can be a useful tool to output the information in a human-readable way.
This is not a diagnosis. Any reasonably capable person can read webmd and apply the symptoms listed and compare them to what the patient describes. This is widely regarded as dangerous because the input data as well as the patient data are limited in ways that can be medically relevant.
So even if you can use it as a good substitute for browsing webmd, it’s still not a substitute for seeing a medical professional. And for the foreseeable future it will not be.
capr 12 days ago [-]
compared to what? most doctors never heard of Bayes rule
LtWorf 17 days ago [-]
Yes so basically bias it into what you think it should reply in the question and it will magically somehow give the reply you wanted! Very useful :D
mvdtnz 17 days ago [-]
> Every day by automation I feed in the hourly weather forecast my home ollama server and it builds me a nice readable concise weather report. It’s super cool!
You feed it a weather report and it responds with a weather report? How is that useful?
hansvm 17 days ago [-]
It distilled bulk information into a form the author cared about. If nothing else it was probably fun, and a personal report on the things you care about can save minutes each day.
I did something similar awhile back without LLMs. I enjoy kayaking, but for a variety of reasons [0] it's usually unwieldy to break out of the surf and actually get out into the ocean at my local beach. I eventually started feeding the data into an old-school ML model where I'd manually check the ocean and report on a few factors (breaking waves, unsafe wind magnitude/direction, ...). The model converted those weather/tide reports into signals I cared about, and then my forecast could simply AND all those together and plot them on a calendar.
An LLM is less custom in some sense, but if you have certain routines you care about (e.g., commuting to my last job I'd always avoid the 101 in favor of 280 if there was heavy rain), it's easy to let the computer translate raw weather information into signals you care about (should you take an alternate route, should you alter your schedule, ...).
Off-topic, do you know of a good source of weather covariates? E.g., a report with a 50% chance of rain for 2hr can easily mean light rain guaranteed for 2hr, a guaranteed 1hr of rain sometime in that 2hr period, a 50% chance that a 2hr storm will hit your town or the next town over, or all kinds of things. Does anybody report those raw model outputs?
[0] There isn't any protection from the open ocean (combined with a kayak that's a bit too top-heavy for the task at hand), which doesn't help, but the big problem is a sand bar just off the coast. If the tide isn't just right, even small swells are amplified into large breaking waves, and I don't particularly mind getting dumped upside down onto a sand bar, but I'd really prefer to spend that time in slightly calmer waters.
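The "AND all those together" step described above can be sketched roughly as follows. This is only an illustration: the comment used a trained model to produce the signals, whereas here hand-written thresholds stand in for it, and every field name and threshold value (`swell_m`, `wind_kph`, `tide_m`, the cutoffs) is a made-up assumption.

```python
from dataclasses import dataclass


@dataclass
class HourlyForecast:
    hour: str        # e.g. "2025-01-04T09:00" (hypothetical format)
    swell_m: float   # swell height in metres
    wind_kph: float  # wind speed in km/h
    tide_m: float    # tide height in metres


# Hypothetical signals. In the comment these came from a trained model;
# simple thresholds are swapped in here purely for illustration.
def waves_ok(f: HourlyForecast) -> bool:
    return f.swell_m < 1.0 and f.tide_m > 0.8


def wind_ok(f: HourlyForecast) -> bool:
    return f.wind_kph < 20.0


SIGNALS = [waves_ok, wind_ok]


def go_hours(forecasts: list[HourlyForecast]) -> list[str]:
    """Return the hours for which every signal is true (a logical AND)."""
    return [f.hour for f in forecasts if all(signal(f) for signal in SIGNALS)]
```

The resulting list of hours is what would then be plotted on a calendar; adding a new condition is just another function appended to `SIGNALS`.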
ninth_ant 17 days ago [-]
Well said, that’s exactly what I meant.
sdesol 17 days ago [-]
> Perhaps stemming from the over-broad term “AI”.
No, I think if we follow the money, we will find the problem.
uludag 17 days ago [-]
I don't think people finding LLMs useless is a good representation of the general sentiment though. I feel that more than anything, people are annoyed at LLM slop. Someone uses an LLM too much to write code, they create "slop," which ends up making things worse.
antirez 17 days ago [-]
Unfortunately complex tools will be misused by part of the population. There is no easy escape from that in the modernity of possibilities. Look at the Internet itself.
gre 17 days ago [-]
Yes but then they can prompt it to golf the code and most of the slop goes away. This sometimes breaks the code.
miki123211 17 days ago [-]
> But there is more: a key thing with LLMs is that their ability to help, as a tool, changes vastly based on your communication ability. The prompt is the king to make those models 10x better than they are with the lazy one-liner question.
People keep saying this, and there are use cases for which this is definitely the case, but I find the opposite to be just as true in some circumstances.
I'm surprised at how good LLMs are at answering "me be monkey, me have big problem with code" questions. For simple one-offs like "how to do x in Pandas" (a frequent one for me), I often just give Claude a mish-mash of keywords, and it usually figures out what I want.
An example prompt of mine from yesterday, which Claude successfully answered, was "python sha256 of file contents base64 safe for fs path."
With a system prompt to make Claude's output super brief and a command to execute queries from the terminal via Simon Willison's LLM tool, this is extremely useful.
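For reference, one plausible answer to that keyword prompt looks like the sketch below. It is not Claude's actual output; the function name `file_digest_for_path` and the chunk size are assumptions made for illustration.

```python
import base64
import hashlib
from pathlib import Path


def file_digest_for_path(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 of a file's contents as a filesystem-safe string."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    # urlsafe_b64encode uses '-' and '_' instead of '+' and '/', so the
    # result contains no path separators. The '=' padding is stripped.
    encoded = base64.urlsafe_b64encode(digest.digest()).decode("ascii")
    return encoded.rstrip("=")
```

The URL-safe alphabet is what makes the result usable as a file or directory name; plain `base64.b64encode` can emit `/`.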
mewpmewp2 17 days ago [-]
Using the correct keywords like you did is part of communication though.
Good communication with LLMs means using the fewest keywords that still let the LLM deduce exactly what you want.
EagnaIonat 17 days ago [-]
> Good communication with LLMs means using the fewest keywords that still let the LLM deduce exactly what you want.
I am not sure that is the case, at least with a large number of LLMs. CO-STAR and TIDD-EC are much more about structure and explanation than brevity.
freehorse 17 days ago [-]
Finding what works for an llm and what not is also part of communication skills.
Though I do not have a good idea of what _bad_ communication with an llm is. People say that sometimes, but when specific examples arise I do not really see anything more than limitations of llms (and the improvements they often suggest do not do anything either). So it would be good to have some more concrete examples, unless that is about inability to communicate a problem in general, stemming from actual inability to _understand_ the problem. Also, a lot changes over time. I think in the past one had to really coddle an llm ("You are the best expert in python in the world!"), but I am not sure that is that important nowadays.
mewpmewp2 16 days ago [-]
Bad communication => being too ambiguous, expecting LLM to understand you through that ambiguity and then not being satisfied when it didn't.
Bad communication: "My webapp doesn't work"
Good communication: "Nextjs, [pasted error]"
Bad communication is giving irrelevant information, or being too ambiguous, not providing enough or correct detail.
Then another example of good communication and efficiency in my view is for example "ts, fn leftpad, no text, code only".
I can understand what such a prompt means myself, and an LLM can understand that kind of query in any domain.
Although if I was using Copilot I would just write the bare minimum to trigger the auto complete I want so
const leftPad =
is probably enough.
mikehollinger 17 days ago [-]
> About "people still thinking LLMs are quite useless", I still believe that the problem is that most people are exposed to ChatGPT 4o that at this point for my use case (programming / design partner) is basically a useless toy....
and
> a key thing with LLMs is that their ability to help, as a tool, changes vastly based on your communication ability.
I still hold that the innovations we've seen as an industry with text transfer to data from other domains. And there's an odd misbehavior with people that I've now seen play out twice -- back in 2017 with vision models (please don't shove a picture of a spectrogram into an object detector), and today. People are trying to coerce text models to do stuff with data series, or (again!) pictures of charts, rather than paying attention to timeseries foundation models which can work directly on the data.[1]
Further, the tricks we're seeing with encoder / decoder pipelines should work for other domains. And we're not yet recognizing that as an industry. For example, whisper or the emerging video models are getting there, but think about multi-spectral satellite data, or fraud detection (a type of graph problem).
There's lots of value to unlock from coding models. They're just text models. So what if you were to shove an abstract syntax tree in as the data representation, or the intermediate code from LLVM or a JVM or whatever runtime and interact with that?
> It's a bit sad and confusing that LLMs ("Large Language Models") have little to do with language; It's just historical. They are highly general purpose technology for statistical modeling of token streams. A better name would be Autoregressive Transformers or something.
> They don't care if the tokens happen to represent little text chunks. It could just as well be little image patches, audio chunks, action choices, molecules, or whatever. If you can reduce your problem to that of modeling token streams (for any arbitrary vocabulary of some set of discrete tokens), you can "throw an LLM at it".
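To make the "AST as the data representation" idea a little more concrete, here is a minimal sketch that turns Python source into a stream of discrete tokens taken from its syntax tree. The function names `ast_token_stream` and `build_vocab` are hypothetical, and using only node type names is a simplifying assumption; a real pipeline would also need to encode identifiers, literals, and tree structure.

```python
import ast


def ast_token_stream(source: str) -> list[str]:
    """Flatten a Python AST into node-type tokens, depth-first, parent first."""
    tokens: list[str] = []

    def visit(node: ast.AST) -> None:
        tokens.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Map each distinct token to an integer id, as a model input would need."""
    return {token: index for index, token in enumerate(sorted(set(tokens)))}


def encode(source: str) -> list[int]:
    """Turn source code into a sequence of integer token ids."""
    tokens = ast_token_stream(source)
    vocab = build_vocab(tokens)
    return [vocab[token] for token in tokens]
```

The point of the sketch is only that the vocabulary here is syntax-tree node types rather than text chunks; anything that consumes integer token sequences could in principle be trained on it.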
vbezhenar 17 days ago [-]
But I need enormous amounts of training data and an enormous amount of compute to train new models, right? So it's kind of useless advice for most people, who can't just parse GitHub repositories and train their new model on AST tokens. They have to use existing open-source models or APIs, and those happen to use text.
monero-xmr 17 days ago [-]
The environmental arguments are hilarious to me as a diehard crypto guy. The ultimate answer to “waste” of electricity arguments is that energy is a free market and people pay the price if it’s useful for them. As long as the activity isn’t illegal, whether it’s training LLMs or mining bitcoins doesn’t matter. I pay for the electricity I use.
Eisenstein 17 days ago [-]
Do you think that we should make it illegal to mine coins if the majority of people think the environmental cost is too high?
monero-xmr 17 days ago [-]
If a law is passed then that’s the law
adwn 17 days ago [-]
One argument against that line of thinking is that energy production has negative externalities. If you use a lot of electricity, its price goes up, which incentivizes more electricity production, which generates more negative externalities. It will also raise the costs for other consumers of electricity.
Now that alone is not yet an argument against crypto currencies, and one person's frivolous squandering of resources is another person's essential service. But you can't simply point to the free market to absolve yourself of any responsibility for your consumption.
specialist 16 days ago [-]
Unintentionally, the energy demands of cryptocurrencies, and data centers in general, have finally motivated utilities (and their regulators) to start building out the massive new grid capacity needed for our glorious renewable energy future.
Acknowledging that facilitating scams (eg pig butchering) is cryptocurrency's primary (sole?) use case, I'm willing to look the other way if we end up with the grid we need to address the climate crisis.
monero-xmr 15 days ago [-]
To pretend romance / affinity scams and crime were created by crypto is absurd. It’s fair to argue crypto made crime more efficient, but it also made the responsible parties quicker to patch holes.
The primary use case of crypto is to protect wealth from a greedy, corrupt, money-printing state. Everything else is a sideshow
specialist 15 days ago [-]
> primary use case of crypto is to protect wealth
Merely trading governments for corporations.
> Everything else is a sideshow
Agreed. Crypto is endlessly amusing.
monero-xmr 15 days ago [-]
What corporation made bitcoin?
specialist 13 days ago [-]
Apologies, I assumed you knew what cryptocurrency is and how it works. My bad.
I have been intimately involved with cryptocurrency since 2010
monero-xmr 17 days ago [-]
I greatly despise video games. Why is that not a waste of energy? If you are entertained by something, even if it serves no human purpose other than entertainment, is that not a valid use of electricity?
swalsh 17 days ago [-]
I'm a big believer in Claude. I've accomplished some huge productivity gains by leveraging it. That said, I can see places where the models are strong and weak. If you're doing React or Python, these models are incredible. For C# and C++ they're not terrible. Rust though, it's not great. If your experience is exclusively trying to use it to write Rust, it doesn't matter if you're using o1, Claude or anything else. It's just not great at it yet.
duped 17 days ago [-]
> Try Claude Sonnet 3.5 (not Haiku!) and tell me if, while still flawed, is not helpful.
It's not as helpful as Google was ten years ago. It's more helpful than Google today, because Google search has slowly been corrupted by garbage SEO and other LLM spam, including their own suggestions.
ChicagoDave 17 days ago [-]
Claude Sonnet 3.5 can write whole React applications with proper contextual clues and some minor iterations. Google has never coded for you.
I’ve written two large applications and about a dozen smaller ones using Claude as an assistant.
I’m a terrible front-end developer and almost none of that work was possible without Claude. The API and AWS deployment were sped up tremendously.
I’ve created unit tests and I’ve read through the resulting code and it’s very clean. One of my core pre-prompt requirements has always been to follow domain-driven design principles, something a novice would never understand.
I also start with design principles and a checklist that Claude is excellent at providing.
My only complaint is you only have a 3-4 hour window before you’re cut off for a few hours.
And needing an enterprise agreement to have a walled garden for proprietary purposes.
I was not a fan in Q1. Q2 improved. Q3 was a massive leap forward.
duped 17 days ago [-]
I've never really used Claude for writing code, because I'm not really bottlenecked by that problem. I have used it quite a bit for asking questions about what code to write and it's almost always wrong (usually in subtle ways that would trick someone with little experience).
Maybe it was overtrained on react sources, but for me it's pretty useless.
The big annoyance for me is it just makes up APIs that don't exist. While that's useful for suggesting to me what APIs I should add to my own code, it's really pointless if I ask a question like "using libfoo how do I bar" and it tells me "call the doBar() function" which does not exist.
numpad0 17 days ago [-]
They can't think at all. The task must be strict macroexpansion of the original input (which doesn't mean that always works).
I suspect LLMs work for a lot of front end and app coding just because code in those fields is insanely overbloated and the value proposition is almost disconnected from logic. There must be metric tons of typing in those fields, and in those areas LLMs must be useful. They certainly handle paper test questions well.
csomar 17 days ago [-]
They are mostly useful for front-end/React because front-end shouldn't have been code in the first place. They can do the UX but not the state management. Honestly, as someone who sucks at and dreads UX building (and having to frequently adjust my divs/components), they are a life saver when you are doing very conventional things. That is, things you can find 100s of examples of but that will take you hours to glue together.
deadbabe 17 days ago [-]
Imagine not needing Claude to do any of that.
ChicagoDave 17 days ago [-]
This is one of those things I like about Claude.
I’m hitting my 40th year as a professional software developer and architect. I’ve written thousands of blocks of code from scratch. It gets boring.
But then in the 2000s I (and everyone else) started building code generators, often from ERD structures, but also UML designs.
These tools were massively useful and (initially) reduced costs. The future balls of mud problems took over ten years to arrive.
But code generation has always been considered a smart and cost-effective approach to building software.
GenAI has “issues” and those have been exposed. One of my recent revelations is that Claude is best at TypeScript and python. C# (my home turf) is much lower in its skills capacity.
So in the last two months I’ve been building my apps in TypeScript instead of C# and have dramatically increased my productivity.
Claude will definitely fail if it doesn’t have the correct information. A good example is writing Bluesky apps. The docs are a mess and contradictory. But there are up to date docs on GitHub and if you include those in your project with instructions to only use those references, Claude’s hallucinations can be eliminated.
I don’t think AGI is a real possibility in my lifetime, and I do fear the future of software development when no one has actual coding experience, but for us boomers, it’s pretty darn useful.
deadbabe 17 days ago [-]
How are you measuring your productivity?
ChicagoDave 17 days ago [-]
In many cases I have no frame of reference for the expected code, like React and css. Typescript is perfectly readable, but I’m not really a script kiddie, so I’d go very slow on the React tsx files. The services are probably a slightly faster set of work, especially if I always have unit tests.
If someone was an expert React+TypeScript programmer with decent css knowledge the productivity may be a marginal improvement.
But I haven’t been a full-time programmer in ten years.
bdangubic 17 days ago [-]
comparing google to claude 3.5 is like comparing tesla s plaid with a horse
jimt1234 16 days ago [-]
Google Search has been corrupted by...Google.
timewizard 17 days ago [-]
[dead]
emptiestplace 17 days ago [-]
What a hilariously absurd statement. You might want to actually try it.
abhijeetpbodas 17 days ago [-]
> ask very precise questions explaining the background
IME, being forced to write about something or verbally explaining/enumerating things in detail _by itself_ leads to a lot of clarity in the writer's thoughts, irrespective of if there's an LLM answering back.
People have been doing rubber-duck debugging for a long time. The metaphorical duck (LLMs in our context), if explained to well, has now started answering back with useful stuff!
danielbln 17 days ago [-]
One thing LLMs have been incredibly strong at, ever since gpt-3.5, is being the most advanced non-human rubber duck, and while they can do plenty more, that alone provides me, at least, with tremendous utility.
aleph_minus_one 17 days ago [-]
> About "people still thinking LLMs are quite useless", I still believe that the problem is that most people are exposed to ChatGPT 4o that at this point for my use case (programming / design partner) is basically a useless toy. And I guess that in tech many folks try LLMs for the same use cases. Try Claude Sonnet 3.5 (not Haiku!) and tell me if, while still flawed, is not helpful.
I see much deeper problems. Just to give two examples:
- I asked various AIs for explanations of proofs of some deep (established) mathematical theorems: the explanations were, to my understanding, heavily hallucinated, and thus worse than "obviously wrong". I also asked for literature references for some deep mathematical theory frameworks: basically all of the references were again hallucinated.
- I asked lots of AIs on https://lmarena.ai/ to write a suitably long text about some political topic that is quite controversial in my country (but does have lots of proponents even in a very radical formulation, even though most people would not use such a radical formulation in public). All of the LLMs that I checked refused or tried to indoctrinate me that this thesis is wrong. I did not ask the LLM to lecture me, but I gave it a concrete task! Society is deeply divided, so if the LLM only spreads propaganda of its political teaching, it will be useless for many tasks for a very significant share of the society.
kromem 17 days ago [-]
Both new Sonnet and Haiku have a masking overhead.
Using a few messages to get them out of "I aim to be direct" AI assistant mode gets much better overall results for the rest of the chat.
Haiku is actually incredibly good at high level systems thinking. Somehow when they moved to a smaller model the "human-like" parts fell away but the logical parts remained at a similar level.
Like if you were taking meeting notes from a business strategy meeting and wanted insights, use Haiku over Sonnet, and thank me later.
cruffle_duffle 17 days ago [-]
To get the most out of them you have to provide context. Treat these models like some kind of eager beaver junior engineer who wants to jump in and write code without asking questions. Force it to ask questions (eg: “do not write code yet, please restate my requirements to make sure we are in alignment. Are there any extra bits of context or information that would help? I will tell you when to write code”)
If your model / chat app has the ability to always inject some kind of pre-prompt make sure to add something like “please do not jump to writing code. If this was a coding interview and you jumped to writing code without asking questions and clarifying requirements you’d fail”.
At the top of all your source files include a comment with the file name and path. If you have a project on one of these services add an artifact that is the directory tree (“tree --gitignore” is my go-to). This helps “unaided” chats get a sense of what documents they are looking at.
And also, it’s a professional bullshitter so don’t trust it with large scale code changes that rely on some language / library feature you don’t have personal experience with. It can send you down a path where the entire assumption that something was possible turns out to be false.
Does it seem like a lot of work? Yes. Am I actually more productive with the tool than without? Probably. But it sure as shit isn’t “free” in terms of time spent providing context. I think the more I use these models, the more I get a sense of what it is good at and what is going to be a waste of time.
Long story short, prompting is everything. These things aren’t mind readers (and worse they forget everything in each new session)
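The context-building routine described above (a "don't write code yet" pre-prompt, a directory tree, and a path header on every file) could be scripted along these lines. This is a rough sketch, not the commenter's actual setup: the names `PREAMBLE`, `directory_tree`, and `build_context` are hypothetical, and the tree function only skips dot-files rather than honouring `.gitignore` the way `tree --gitignore` does.

```python
from pathlib import Path

# Paraphrase of the pre-prompt suggested in the comment (wording is an assumption).
PREAMBLE = (
    "Do not write code yet. Restate my requirements first and ask for any "
    "missing context. I will tell you when to write code."
)


def directory_tree(root: Path) -> str:
    """Render an indented listing of root, skipping hidden files and folders."""
    lines = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if any(part.startswith(".") for part in rel.parts):
            continue
        indent = "  " * (len(rel.parts) - 1)
        lines.append(indent + rel.name + ("/" if path.is_dir() else ""))
    return "\n".join(lines)


def build_context(root: Path, files: list[Path], question: str) -> str:
    """Assemble one prompt: preamble, project layout, labelled files, question."""
    sections = [PREAMBLE, "Project layout:\n" + directory_tree(root)]
    for file in files:
        rel = file.relative_to(root).as_posix()
        # The path header tells the model which document it is looking at.
        sections.append(f"# File: {rel}\n{file.read_text()}")
    sections.append(question)
    return "\n\n".join(sections)
```

The output is plain text, so it can be pasted into a chat window or piped to whatever client you use.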
layer8 17 days ago [-]
You are right, but doing all that is incredibly cumbersome, at least to some people, which is why they don’t like working with LLMs.
simonw 17 days ago [-]
That was one of the themes of my article: LLMs are power-user tools, mis-sold as "easy to use". To get great results out of them you need to invest a whole lot of under-documented and under-appreciated effort. https://simonwillison.net/2024/Dec/31/llms-in-2024/#llms-som...
layer8 17 days ago [-]
It’s not just that you need to be a power user (I certainly am), you also need to be fine with nondeterminism and typing a lot of prose, instead of doing everything with keyboard shortcuts and CLI commands, with reproducible outcomes. It’s a different mode of operation and interaction, requiring a different predisposition to some degree.
madmask 17 days ago [-]
Exactly! I don’t like talking or writing or explaining.
My mind generally uses language as little as possible, I have no inner monologue running in the background.
Greatly prefer something deterministic to random bs popping up without the ability of recognizing it.
I don’t like llms but sometimes use them as autocomplete or to generate words, like a template for a letter or boilerplate scripts, never for actual information (à la google).
fragmede 17 days ago [-]
unless you can type faster than you can talk, (which some people can), stop typing and start dictating. aider has a /voice command for a reason.
I don't use it exclusively, but damn does it help in the right places.
eterps 17 days ago [-]
Can you elaborate, or give some examples? I am having trouble imagining in which situations that would be useful because I tend to put a lot of thought into defining the right prompt before sending it over.
isoprophlex 17 days ago [-]
Super interesting that my experience mirrors exactly what you are writing... except for me finding Claude to be almost useless (often misunderstands me, gives answers that are plain wrong) and 4o to be a very helpful, if not somewhat dull, jack-of-all trades in helping me be a cruise control for the mind.
I could only ever really jam with 4o.
Makes me wonder if there's personal communication preferences at play here.
hdjjhhvvhga 17 days ago [-]
While Claude Sonnet is superior to 4o for most of my use cases, there are still occasionally some specific tasks where 4o performs slightly better.
antirez 17 days ago [-]
Probably. But statistically to work with 4o is a lose of time for me. LLMs is like an investment: you write the prompts, you "work" with them. If the LLM is too weak, this is a lose of time. You need to have a return on the investment that is positive. With ChatGPT 4o / o1 most of the times for me the investment of time has almost zero return. Before Claude Sonnet 3.5 I already had a ChatGPT PRO account but never used it for coding since it was most of the times useless if not for throw away scripts that I didn't want to do myself or as a stack overflow replacement for trivial stuff. Now it's different.
airstrike 17 days ago [-]
This mirrors my experience 100%. I'm not even sure why I still pay for OpenAI at this point. Claude 3.5 is just incredibly superior. And I totally agree on the point about dropping in context and asking very specific questions. I've had Claude pinpoint a bug in a 2k LOC module that I was struggling to find the cause for. After wasting a lot of time on it on my own, I thought "what the heck, maybe Claude can figure it out" and it did. It's objectively useful, even if flawed sometimes.
hanikesn 17 days ago [-]
I'm curious. Can you go into more detail what kind of bug it found?
airstrike 17 days ago [-]
I was writing a custom widget for iced (the Rust GUI library) and I was getting a panic due to some fancy logic I was trying to do. I guess the shortest description I can say is that it was a combination of what appeared to be a caching issue at first, but the real cause turned out to be some method shadowing where I was using a struct's method where I meant to use the trait's method.
I had made the specific operation generic (moving it out of the struct and into a trait) but forgot to delete it from the struct, so I was calling the incorrect function. Claude pinpointed the cache issue immediately when I just dumped two files into the context and asked it:
somewhere in my codebase I'm triggering a perform() on the editor but the next call on highlight() panics because `Line layout should be cached`
what am I missing? do I need to do something after perform() to re-cache the layout?
at first that seemed to fix the issue, but other errors persisted. so we kept debugging together until we found the root cause. either way I knew where to look thanks to its assistance
d0mine 17 days ago [-]
why "lose of time" instead of "loss of time"
Is it a typo or fingerprinting?
fragmede 17 days ago [-]
it's "proof" that it wasn't written by an LLM (but let me delve into this issue).
squirrel 17 days ago [-]
Typo
tootie 17 days ago [-]
Like what? Claude has become my go-to, but I find that it's wrong enough often enough that I really can't trust it for anything. If it says something I have to go dig through its citations very carefully.
minimaxir 17 days ago [-]
> Claude Sonnet 3.5 (not Haiku!)
A very big surprise is just how much better Sonnet 3.5 is than Haiku. Even the confusingly-more-expensive-Haiku-variant Haiku 3.5 that's more recent than Sonnet 3.5 is still much worse.
worldsayshi 17 days ago [-]
I wonder if LLMs are very useful, but for a much narrower set of tasks than we expect. Like fuzzy manipulation of logical specifications.
I.e. over time it constitutes a fundamental shift in how we interact with abstractions in computers. The current fundamentals will still remain but they will become increasingly malleable. Details in code will become less important. Architecture will become increasingly important. But at the same time the cost of refactoring or changing architecture will quickly drop.
Any details that are easily lost when passing through an LLM will be details that have the highest maintenance cost. Any important details that can be retained by an LLM can move up and down the ladder of abstraction at will.
Can an LLM based solution maintain software architectures without introducing noise? The answer to that is the difference between somewhat useful and game changing.
vasco 17 days ago [-]
Most people consider their own brain useless and don't use it, so it's not strange that they do the same with AI. How many people just refuse to learn how to parallel park, a new language, calculus or even basic arithmetic, "because they aren't good at it".
qwertox 17 days ago [-]
LLMs have given computers the ability to communicate with us in natural language, we didn't have that before at this level. In order to do this, they've been fed with a lot of coherent stuff and give the impression of being coherent, but we know they're just statistical machines. But at least they can now communicate naturally with us, so now we have that infrastructure available, as we do have TTS or ASR or monitors and keyboards available. It's still up to us to now make proper agents out of them. Agents for the software we've been using for decades. They can take over a lot of tedious work for us.
salawat 17 days ago [-]
Why are you pasting huge chunks of potentially crown jewels code into a 3rd party service where prompts are going to most likely be turned into training/surveillance material?
simonw 17 days ago [-]
A lot of vendors promise not to train on input to their models. I choose to believe those promises.
askl56 17 days ago [-]
A scorpion, not knowing how to swim, asked a frog to carry it across the river. “Do I look like a fool?” said the frog. “You’d sting me if I let you on my back!”
“Be logical,” said the scorpion. “If I stung you I’d certainly drown myself.”
“That’s true,” the frog acknowledged. “Climb aboard, then!” But no sooner than they were halfway across the river, the scorpion stung the frog, and they both began to thrash and drown. “Why on earth did you do that?” the frog said morosely. “Now we’re both going to die.”
“I can’t help it,” said the scorpion. “It’s my nature.”
zahlman 17 days ago [-]
>They are also great at doing boring tasks for which you can provide perfect guidance (but that still would take you hours)
All the tasks I can think of dealing with on my own computer that would take hours, a) are actually pretty interesting to me and b) would equally well take hours to "provide perfect guidance". The drudge work of programming that I notice comes in blocks of seconds at a time, and the mental context switch to using an LLM would be costlier.
jsheard 17 days ago [-]
I swear these goalposts keep getting moved, I remember being told that GPT3.5 is a useless toy but the paid GPT4 is lifechanging, and now that GPT4 is free I'm told that it's a useless toy but paid o1 or paid Sonnet are lifechanging. Looking forward to o1 and Sonnet becoming useless toys, unlike the lifechanging o3.
raincole 17 days ago [-]
Except GPT4 isn't free.
The GP is claiming GPT4o is bad but Sonnet is good. GPT4o is only about 20% cheaper than Sonnet.
aetherson 17 days ago [-]
You will also be dismayed to hear that a 2011 iPhone is no longer state-of-the-art, and indeed can't run most modern apps.
scubbo 17 days ago [-]
Holy false-equivalency, Batman! The definitions of "useless toy / lifechanging tool" are _not_ changing over time (or, at least, not over the timescale being explored here), whereas the expectations and requirements of processing power of a phone are.
aetherson 17 days ago [-]
But in fact they are changing over time -- this is an expectations treadmill. When you get something newer and better, it highlights the flaws in what you had before.
scubbo 16 days ago [-]
That is true _in general_, but not in this specific case (hence why I specified "not over the timescale being explored here"). A modern cigarette-lighter would indeed have been a life-changing tool to a caveman but is indeed disposable junk today.
The point being made by the original comment (with which I agree) was that many criteria-for-usefulness - primarily that of reliability or a lack of hallucination - have remained static; with successive generations of tools being (falsely) claimed to meet them, but then abandoned when the next hype-train comes along.
I certainly agree that _some_ aspects of AI models are indeed improving (often drastically!) over time (speed, price, supported formats, history/context, etc.) - but they still _all_ fall _drastically_ short on the key core requirement that is required in order to make them Actually Useful. "X is better than Y" does not imply "where Y failed to be useful, X now succeeds".
jpc0 17 days ago [-]
GPT4 is a 13 year old technology? Compared to o1 and Sonnet 3.5?
If someone told me an iPhone 4 is terrible but an iPhone 5 would definitely serve my needs, and then when I get an iPhone 5 they say the same of the 6, you really want me to believe them a second time? Then a third time? Then a 4th? In the meantime my time and money are wasted?
johnrob 17 days ago [-]
It would be quite useful if that were the only phone available.
FooBarWidget 17 days ago [-]
Why do people have such narrow views on what makes LLMs useful? I use them for basically everything.
My son throwing an irrational tantrum at the amusement park and I can't figure out why he's like that (he won't tell me or he doesn't know himself either) or what I should do? I feed Claude all the facts of what happened that day and ask for advice. Even if I don't agree with the advice, at the very least the analysis helps me understand/hypothesize what's going on with him. Sure beats having to wait until Monday to call up professionals. And in my experience, those professionals don't do a better job of giving me advice than Claude does.
It's the weekend, my wife is sick, the general practitioner is closed, the emergency weekend line has 35 people in the queue, and I want some quick half-assed medical guidance that, while I know it might not be 100% reliable, is still better than nothing for the next 2 hours? Feed all the symptoms and facts to Claude/ChatGPT and it does an okay job a lot of the time.
I've been visiting a Traditional Chinese Medicine (TCM) practitioner for a week now and my symptoms are indeed reducing. But the TCM paradigm and concepts are so different from western medicine paradigms and concepts that I can't understand the doctor's explanation at all. Again, Claude does a reasonable job of explaining to me what's going on or why it works from a western medicine point of view.
Want to write a novel? Brainstorm ideas with GPT-4o.
I had a debate with a friend's child over the correct spelling of a Dutch word ("instabiel" vs "onstabiel"). Google results were not very clear. ChatGPT explained it clearly.
Just where is this "useless" idea coming from? Do people not have a life outside of coding?
krapp 17 days ago [-]
Yes people have lives outside of coding, but most people are able to manage without having AI software intercede in as much of their lives as possible.
It seems like you trust AI more than people and prefer it to direct human interaction. That seems to be satisfying a need for you that most people don't have.
claar 17 days ago [-]
Why do you postulate that "most people don't have" this need? I also use AI non-stop throughout my day for similar uses.
This feels identical to when I was an early "smart phone" user w/my palm pilot. People would condescend saying they didn't understand why I was "on it all the time". A decade or two later, I'm the one trying to get others to put down their phones during meetings.
My take? Those who aren't using AI continually currently are simply later adopters of AI. Give it a few years - or at most a decade - and the idea of NOT asking 100+ AI queries per day (or per hour) will seem positively quaint.
krapp 17 days ago [-]
>Those who aren't using AI continually currently are simply later adopters of AI. Give it a few years - or at most a decade - and the idea of NOT asking 100+ AI queries per day (or per hour) will seem positively quaint.
I don't think you're wrong, I just think a future in which it's all but physically and socially impossible to have a single thought or communication not mediated by software is fucking terrifying.
FooBarWidget 17 days ago [-]
When I'm done working, have chased my children to properly finish their dinner, helped my son with homework, and put them to bed, it's already 9+ PM, the only time of the day when I have free time. Just which human besides my wife can I talk to at that point? What if she doesn't have a clue either? All the professionals are only open when I'm working. A lot of the issues happen during the weekend, when professionals are closed. I don't want to disturb friends during the evening, and it's not like they have the expertise I need anyway.
LLMs are infinitely patient, don't think I am dumb for asking certain things, consider all the information I feed them, are available whenever I need them, have a wide range of expertise, and are dirt cheap compared to professionals.
That they might hallucinate is not a blocker most of the time. If the information I require is critical, I can always double check with my own research or with professionals (in which case the LLM has already primed me with a basic mental model so that I can ask quick, short, targeted questions, which saves the both of us time, and me money). For everything else (such as my curiosity on why TCM works, or the correct spelling of a word), LLMs are good enough.
vbezhenar 17 days ago [-]
You are supposed to have connections with knowledgeable people, so you can call them and ask for advice. That's how it works without computers.
FooBarWidget 17 days ago [-]
Did you miss the parts where I said that I only have time when they're closed, and they're only open when I'm most busy?
Have you never seen knowledgeable people get things wrong, and having to verify them?
Did you miss the part where they cost money, and I better come in as prepared as possible?
I really don't get these knee-jerk averse reactions. Are people deliberately reading past my assertions that I double check LLM outputs for everything critical?
CRConrad 12 days ago [-]
> LLMs are infinitely patient, don't think I am dumb for asking certain things
We don't know that. They could be laughing their ass off at you without telling you.
jiggawatts 17 days ago [-]
At the risk of sounding impolite or critical of your personal choices: this, right here, is the problem!
You don’t understand how medicine works, at any level.
Yet you turn to a machine for advice, and take it at face value.
I say these things confidently, because I do understand medicine well enough not to seek my own answers. Recently I went to a doctor for a serious condition and every notion I had was wrong. Provably wrong!
I see the same behaviour in junior developers that simply copy-paste in whatever they see in StackOverflow or whatever they got out of ChatGPT with a terrible prompt, no context, and no understanding on their part of the suitability of the answer.
This is why I and many others still consider AIs mostly useless. The human in the loop is still the critical element. Replace the human with someone that thinks that powdered rhino horn will give them erections, and the utility of the AI drops to near zero. Worse, it can multiply bad tendencies and bad ideas.
I’m sure someone somewhere is asking DeepSeek how best to get endangered animals parts on the black market.
FooBarWidget 17 days ago [-]
No. Where do you read that I take it at face value? I literally said that I expect Claude to give me "half-assed" medical guidance. I merely said that that is still better than having no clue for the next 2 hours while I wait on the phone with 35 people in front of me, which is completely different from "taking medicine advice at face value". It's not like I will let my wife drink bleach just because Claude told me to. But if it tells me that it's likely an ear infection then at least I can discuss the possibility with the doctor.
So I am curious about how TCM works. So what if an LLM hallucinates there? I am not writing papers on TCM or advising governments on TCM policy. I still follow the doctor's instructions at the end of the day.
For anything really critical I already double check with professionals. As you said, human in the loop is important. But needing human in the loop does not make it useless.
You are letting perfect be the enemy of good. A half-assed tax advice with some hallucinations from an LLM is still useful, because it will prime me with a basic mental model. When I later double check the whole thing with a professional, I will already know what questions to ask and what direction I need to explore, which saves time and money compared to going in with a blank slate.
The other day I had Claude advise me on how to write a letter to a judge to fight a traffic fine. We discussed what arguments to make, from what perspective a judge will see things, and thus what I should plead for. The traffic fine is a few hundred euros: a significant amount, but barely an hour's worth of a real lawyer's fee. It makes absolutely no sense to hire a real lawyer here. If this fails, the worst thing that can happen is that I won't get my traffic fine reimbursed.
There is absolutely nothing wrong with using LLMs when you know their limits and how to mitigate them.
So what if every notion you learned about medicine from LLMs is wrong? You learn why they're wrong, then next time you prompt/double check better, until you learn how to use it for that field in the least hallucinationatory way. Your experience also doesn't match mine: the advice I get usually contains useful elements that I then discuss with doctors. Plus, doctors can make mistakes too, and they can fail to consider some things. Twitter is full of stories about doctors who failed to diagnose something but ChatGPT got it right.
Stop letting perfect be the enemy of good. Occasionally needing human in the loop is completely fine.
fragmede 17 days ago [-]
To be fair though, humanity doesn't know how some medicines work at a fundamental level either. The method of action for Tylenol, lithium, and metformin, among others isn't fully understood.
jiggawatts 17 days ago [-]
True, but modern "western"[1] medicine is not about the specific chemicals used, or even knowing exactly how they work at a chemical level, but the process for identifying what does and what does not work. It's an "evidence based" science with experiments designed to counter known biases such as the placebo effect. Much of what we consider modern medicine was developed before we were entirely sure that atoms actually existed!
[1] It isn't actually western, because it's also used in the east, middle-east, south, both sides of every divide, etc... In the same sense, there is no "western chemistry" as an alternative to "eastern alchemy". There's "things that work" versus "things that make you feel slightly better because they're mild narcotics or stimulants... at best."
(I don't want to focus too much on Chinese herbal medicine, because I see the same cargo-culting non-scientific thinking in code development too. I've lost count of the number of times I've seen an n-tier SPA monstrosity developed for something that needed a tiny monolithic web app, but mumble-mumble-best-mumble-practices.)
FooBarWidget 17 days ago [-]
"Western medicine" (which is exactly what it is called in China, to contrast with TCM) is shorthand for "practices invented in the west". That these methods chase universal truths, or are practiced world-wide, do not make them "non-west" in terms of origin.
The Chinese call the practice of truth seeking, in a broader sense (outside of medicine), just "science".
"Western" medicine is also not merely the practice of seeking universal medical truth. It is also a collection of paradigms that have been developed in its long history. Like all paradigms, there are limits and drawbacks: phenomena that do not fit well. Truth seeking tends to be done on established paradigms rather than completely new ones.
The "western" prefix is helpful in contrasting it with TCM, which has a completely different paradigm. Many Chinese, myself included, have the experience that there are all sorts of ailments that are not meaningfully solved by "western" medicine practitioners, but are meaningfully solved by TCM practitioners.
tokai 17 days ago [-]
This reads like satire to me. Scary that it isn't.
FooBarWidget 17 days ago [-]
I'm guessing that mindset is what causes some people to find this scary. I see a new tool and opportunities. Like all tools, it has drawbacks and caveats, but when wielded properly, it can give me more choice. I suspect some others focus too much on flaws and don't bother looking for opportunities. They are expecting a holy grail: if it's not perfect then it's useless.
It's like people who proclaim that Linux as a whole is a useless toy because it doesn't run their favorite games or favorite Windows app. They focus on this one flaw and miss all the opportunities.
Many of these people seem to advocate trusting human professionals. Do you have any idea how often human professionals do a half-assed job, and I have to verify them rather than blindly trusting them? The situation is not that much different from LLMs.
Professionals making mistakes do not make them useless. Grandma, with all her armchair expertise, is often right and sometimes wrong, and that does not make her useless either.
Why let perfect be the enemy of good?
BlueTemplar 17 days ago [-]
Grandma has a reason to care about you.
By contrast, my trust of Russian / Chinese / USian platforms is low enough that I consider it my duty to publicly shame people that still use them in 2025.
(With some caveats of course, for instance HN is not yet a net negative to the world. Yet.)
There's also the question of stickiness of habits: your grandmas are for life, human professionals you might have a shallow enough relationship with that switching them might be relatively easy, while it might be very hard to stop smoking or to stop using Github once you have started smoking / created an account.
FooBarWidget 17 days ago [-]
You view Github and LLMs as traps that deliberately give you malicious advice or even brainwash you into addiction? If you view things that way then it's no surprise that you are averse to LLMs (and Github). But frankly I find that entire view to be absurd and overly cynical.
raincole 17 days ago [-]
I too read it as satire at first, but after thinking twice I think it's a quite reasonable take. I've added "utilize LLM more in my daily life outside programming" to my new year resolution.
danielbln 17 days ago [-]
I had the flu at the beginning of December, with high fever, the whole nine yards. Keeping a running log with Claude in which I shared temperature readings, medications etc. has been so useful. If nothing else it's the world's most sophisticated rubber duck / secretary, but that's quite useful in many daily life situations on its own. Caveats apply etc.
signatoremo 17 days ago [-]
Huh? The GP makes perfect sense. I’d never trust LLMs blindly, but I wouldn’t hesitate to ask them about any topic. “Trust but verify” is often said about human beings. Perhaps “distrust but ask and verify” is the mantra applicable for LLMs.
CRConrad 12 days ago [-]
Scary. Reads rather like you're well on your way to replace basic life skills with reliance on LLMs.
qsort 17 days ago [-]
I believe it's more frustration directed at the mismatch between marketing and reality, combined with the general well deserved growing hatred for SV culture, and, more broadly, software engineers. The sentiment would be completely different if the entire industry marketed themselves like the helpful tools they are rather than the second coming of Christ they aren't. This distinction is hard to make on "fast food" forums like this one.
If you aren't a coder, it's hard to find much utility in "Google, but it burns a tree whenever you make an API call, and everything it tells you might be wrong". I for one have never used it for anything else. It just hasn't ever come up.
It's great at cheating on homework, kids love GPTs. It's great at cheating in general, in interviews for instance. Or at ruining Christmas, after this year's LLM debacle it's unclear if we'll have another edition of Advent of Code. None of this is the technology's fault, of course, you could say the same about the Internet, phones or what have you, but it's hardly a point in favor either.
And if you are a coder, models like Claude actually do help you, but you have to monitor their output and thoroughly test whatever comes out of them, a far cry from the promises of complete automation and insane productivity gains.
If you are only a consumer of this technology, like the vast majority of us here, there isn't that much of an upside in being an early adopter. I'll sit and wait, slowly integrating new technology in my workflow if and when it makes sense to do so.
Happy new year, I guess.
fragmede 17 days ago [-]
> there isn't that much of an upside in being an early adopter.
Other than, y'know, using the new tools. As a programmer heavy forum, we focus a lot on LLMs' (lack of) correctness. There's more than a little bit of annoyance when things are wrong, like being asked to grab the red blanket and then getting into an argument over it being orange instead of what was important, someone needed the blanket because they were cold.
Most of the non-tech people who use ChatGPT that I've talked to absolutely love it because they don't feel it judges them for asking stupid questions and they have conversations about absolutely everything in their lives with it, down to which outfit to wear to the party. There are wrong answers to that question as well, but they're far more subjective and just having another opinion in the room is invaluable. It's just a computer and won't get hurt if you totally ignore its recommendations, and even better, it won't gloat (unless you ask it to) if you tell it later that it was right and you were wrong.
Some people have found upsides for themselves in their lives, even at this nascent stage. No one's forcing you to use one, but your job isn't going to be taken by AI, it's going to be taken by someone else who can outperform you that's using AI.
saltcured 17 days ago [-]
Yikes.
Clearly said, yet the general sentiment awakens in me a feeling more gothic horror than bright futurism. I am struck with wonder and worry at the question of how rapidly this stuff will infiltrate into the global tech supply chain, and the eventual consequences of misguided trust.
To my eye, too much current AI and related tech are just exaggerated versions of magic 8-balls, Ouija boards, horoscopes, or Weizenbaum's ELIZA. The fundamental problem is people personifying these toys and letting their guard down. Human instincts take over and people effectively social engineer themselves, putting trust in plausible fictions.
It's not just LLMs though. It's been a long time coming, the way modern tech platforms have been exaggerating their capability with smoke and mirrors UX tricks, where a gleaming facade promises more reality and truth than it actually delivers. Individual users and user populations are left to soak up the errors and omissions and convince themselves everything is working as it should.
Someday, maybe, anthropologists will look back on us and recognize something like cargo cults. When we kept going through the motions of Search and Retrieval even though real information was no longer coming in for a landing.
weMadeThat 17 days ago [-]
> They work great to explore what is at the borders of your knowledge.
But not at exploring what is at the border of knowledge itself. And by converging on the conventional, LLMs actually lead you away from anything that actually extends.
> doing boring tasks for which you can provide perfect guidance
That's true but you never need an LLM for that. There are wonderful scripts written by wonderful people and provided for free almost all the time, for those who search in the right places. LLM companies benefit/profit from these without providing anything in return.
They are worse than people who grab FOSS and turn it into overpriced and aggressively marketed business models and services or people who threaten and sue FOSS for being better and free alternatives to their bloated and often "illegally telemetric" services.
> able to accelerate you
True, but you leave too much for data brokers and companies like Meta to abuse and exploit in the future. All that additional "interactional data" will do so much worse to humanity than all those previous data sets did in elections, for example, or pretty much all consumer markets. They will mostly accelerate all these dimwitted Fortune 5000 companies that have sabotaged consumers into way too much dumb shit - way more than is reasonable or "ok". And educated, wealthy and/or tech-savvy people won't be able to avoid/evade any of that. Especially when it's paired with meds, drugs, foods, biases, fallacies, priming and so on and all the knowledge we will gain on bio-chemical pathways and human liability to sabotage.
They are great for coders, of course, everyone can be an army of clone-warriors with auto-complete on steroids now and nobody can tell you what to do with all that time that you now have and all that money, which, thanks to all of us but mostly our ancestors, is the default. The problem is the resulting hyper-amplified, augmented financial imbalance. It's gonna fuck our species if all the technical people don't restore some of that balance, and everybody knows what that means and what must be done.
atombender 17 days ago [-]
Is there a way to use this in Jetbrains IDEs? (I've not been impressed with their AI Assistant.) There are a few plugins, but from the reviews they all seem kind of mediocre.
cube2222 17 days ago [-]
I personally use the Zed editor AI assistant integration with Sonnet for anything AI-related, while using a JetBrains IDE for coding / code reading, side-by-side.
I haven’t found anything comparably good for JetBrains IDEs yet, but I’m also not switching to something else as my main editor.
Too 17 days ago [-]
Github copilot plugin is decent. It's not going to write a whole app for you, but it accelerates repetitive stuff, can give suggestions you didn't think of or save you a trip to the documentation.
Sn0wCoder 17 days ago [-]
I use IntelliJ as my main coding tool but also use VSCode and Sublime Text. If you have access to local LLMs or have an API key for some, the Continue plugin (basically Cursor, but usable in IntelliJ) is the best of the best for IntelliJ (IMO). I have a box running some local models including Phind and StarCoder (plus some small embeddings) and have been super happy with the end product. Next up, Google Gemini Code Assist has been the best of the IntelliJ (non-configured) AI tools I have tried. There are better ones out there but IMO not for IntelliJ. It's still free for a few more weeks and I have been using it since the free release; it's fun to use. You can pre-prompt: say you are an expert XXX, please be funny, then fill in the rest of your regular prompts. The Co-Pilot I use for work is very limited and will only answer coding questions. I tried to tell it that it was my coding buddy and its name was Phil, and it told me it cannot have a personality or be funny. I believe the paid personal Co-Pilot allows you to choose which LLM it uses (I cannot confirm). The Phind VSCode plugin works really well. Also, the Phind coding models are on par with some of the other big ones and free if you have a subscription (or run locally). Sublime is around to open those GIG+ files, as VSCode chokes and it's not worth the RAM of opening another IntelliJ.
Each task / programming language / query requires trying different LLM models and novel ways of prompting. If it's not work-related (or work pays for the one you use) sending as much of the code as relevant also helps the answers be more useful.
Most of the people I meet that say LLMs are not useful have only tried one (flavor / plugin), do not know how to pre-prompt or prompt, and do not give the tools a chance. Try one or two things, say yep, it's not good and give up.
Still hard for me to admit that Prompt Engineering is a profession, but it's the same as Google Fu. Once you learn it you can become an LLM Ninja!
I do not believe LLMs are coming for my job (just yet) but do believe they are going to be able to replace some people, are useful and those that do not use them will be at a disadvantage.
cpursley 17 days ago [-]
Try Cursor. I’m serious.
atombender 17 days ago [-]
I'm sure it's good, but that's not what I'm asking about.
wslh 17 days ago [-]
Right, in simpler terms: The measure of LLMs success is how effectively they help you achieve your goal faster.
antirez 17 days ago [-]
Exactly, and right now the LLM acceleration effect comes from using them as a tool, not as "give me the final solution". Even people that can't code, using LLMs to build applications from scratch, still have this tool mindset. This is why they can use them effectively: they don't stop at the first failed solution; they provide hints to the LLM, test the code, try to figure out what the problem is (also with the LLM's help), and so forth. It's a matter of mindset.
andrewaylett 17 days ago [-]
> people that can't code
These people may not be Software Engineers, but they are coding.
d0mine 17 days ago [-]
btw, fusion has arrived by that definition: no reactors produce more energy than they consume, but net-positive reactions have been achieved. Tasks where LLM output is more than 1x are few and far between.
brookst 17 days ago [-]
I’m surprised you only have one use case. I use LLMs to research travel, adjust recipes, check biographies and book reviews, and many many more things.
mhh__ 17 days ago [-]
Hopefully things have narrowed but you can see from the trends data just how few people (API may be a different story) use claude relative to chatgpt.
minimaxir 17 days ago [-]
Brand awareness is a hell of a drug.
mhh__ 17 days ago [-]
Indeed, although I find myself reaching for o1 more than Claude for matters other than programming, solely because it has better LaTeX (...)
ddgflorida 17 days ago [-]
Definitely not a "useless toy" with the right use case. It's great at code snippets, scripts, etc. It's an assistant.
ta12653421 16 days ago [-]
ClaudeAI ++1000
1oooqooq 17 days ago [-]
yeah, they save as much time as finding a template with a good old search and using it.
...Begun November 4, 2024, published December 28, 2024.
...assisted by Claude 3.5 sonnet, trained on my previous books...
...puzzles co-created by the author and Claude
...GPT-4o and -o1 were useful in latex configurations...doing proof-reading.
...Gemini Experimental 1206 was an especially good proof-reader
...Exercises were generated with the help of Claude and may have errors.
...project was impossible without the creative labors of Claude
The obvious comparison is to the classic Strang https://math.mit.edu/~gs/everyone/ which took several *years* to conceptualize, write, peer review, revise and publish.
Ok maybe Strang isn't your cup of tea, :%s/Strang/Halmos/g , :%s/Strang/Lipschutz/g, :%s/Strang/Hefferon/g, :%s/Strang/Larson/g ...
Working through the exercises in this new LLMbook, I'm thinking...maybe this isn't going to stand the test of time. Maybe acceleration is not so hot after all.
pton_xd 17 days ago [-]
"The story of linear algebra begins with systems of equations,
each line describing a constraint or boundary traced upon abstract space.
These simplest mathematical models of limitation — each equation binding variables in measured proportion — conjoin to shape the realm of
possible solutions. When several such constraints act in concert, their collaboration yields three possible fates: no solution survives their collective
force; exactly one point satisfies all bounds; or infinite possibilities trace
curves and planes through the space of satisfaction. This trichotomy — of
emptiness, uniqueness, and infinity — echoes through all of linear algebra, appearing in increasingly sophisticated forms as our understanding
deepens."
Maybe I'm not the target audience, but... that really doesn't make me interested in continuing to read.
jcranmer 17 days ago [-]
That is such supremely bad writing that it can only come from AI being told to spice up the original opening paragraph, and short of the original author being barely literate (and possibly even then), the original text would have been better writing.
The overuse of the $15 synonyms is almost always a bad idea--you want to use them sparingly, where dropping them in for their subtly different meanings enhances the text. But what is extremely sloppy here is that the possibilities of "no solutions, one solution, infinite solutions" is now being described with a different metaphor for solution here. And by the end of the paragraph, I'm not actually sure what point I'm supposed to take away from this text. (As bad as this paragraph is, the next paragraph is actually far worse.)
Mathematics already has a problem for the general audience with a heavy focus on abstraction that can be difficult to intuit on more concrete objects. Adding florid metaphors to spice up your writing makes that problem worse.
jpc0 17 days ago [-]
Even putting it here is annoying to me... Those are a lot of words saying nothing that I just spent time reading.
I'm agreeing with you.
wrs 17 days ago [-]
It's rather purple prose, but it's entirely meaningful. Maybe it doesn't seem to mean anything until after you know some linear algebra, though...
dxbydt 17 days ago [-]
it's been a long time, but when i was taught this material, i was told there are only 3 cases -
x+y=1, x+y=2 clearly has no solution since two numbers can’t simultaneously add to both one and two.
x+y=1,2x+2y=2 clearly has infinitely many solutions. There’s only one equation here after canceling the 2, so you can plug in x’s and y’s all day long, no end to it.
x+y=1, 2x+y=1 clearly has exactly one solution (0,1) after elimination.
This example stuck with me so I use it even now. The author/Claude/Gemini/whatever could have just used this simple example instead of “trichotomy of curves through space conjoin through the realm of …” math, not Shakespeare.
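For anyone who wants to poke at those three cases, here is a minimal sketch in Python. The function name `classify` and the `(a, b, c)` tuple encoding of `a*x + b*y = c` are made up for illustration; it only handles two equations in two unknowns, using the determinant and a rank comparison.

```python
from fractions import Fraction


def classify(eq1, eq2):
    """Classify the system a1*x + b1*y = c1, a2*x + b2*y = c2.

    Each equation is a hypothetical (a, b, c) tuple. Returns a pair
    (kind, solution) where kind is "none", "one" or "infinite".
    """
    a1, b1, c1 = (Fraction(v) for v in eq1)
    a2, b2, c2 = (Fraction(v) for v in eq2)

    det = a1 * b2 - a2 * b1
    if det != 0:
        # Non-zero determinant: exactly one solution, via Cramer's rule.
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        return ("one", (x, y))

    # Zero determinant: compare the rank of the coefficient matrix
    # with the rank of the augmented matrix.
    coeff_rank = 1 if any((a1, b1, a2, b2)) else 0
    if a1 * c2 - a2 * c1 != 0 or b1 * c2 - b2 * c1 != 0:
        aug_rank = 2
    elif any((a1, b1, c1, a2, b2, c2)):
        aug_rank = 1
    else:
        aug_rank = 0

    # The system is inconsistent exactly when the augmented rank is larger.
    if aug_rank > coeff_rank:
        return ("none", None)
    return ("infinite", None)


if __name__ == "__main__":
    print(classify((1, 1, 1), (1, 1, 2)))  # x+y=1, x+y=2
    print(classify((1, 1, 1), (2, 2, 2)))  # x+y=1, 2x+2y=2
    print(classify((1, 1, 1), (2, 1, 1)))  # x+y=1, 2x+y=1
```

Run on the three systems above, it reports no solution, infinitely many solutions, and the single solution (0, 1), in that order. Exact fractions are used so the zero-determinant test is not thrown off by floating-point rounding.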
BlueTemplar 17 days ago [-]
Also, isn't this a great example of "when you have a hammer, everything looks like a nail" ?
To explain this I would first and foremost use a picture, where the 3 cases : parallel, identical, intersection can be intuitively seen (using our visual system, rather than our language system), with merely a glance.
wrs 17 days ago [-]
Sure, but saying something in an ornate way is not the same as “saying nothing”.
sureglymop 17 days ago [-]
I agree. Not what I would expect from a math book or script.
datadrivenangel 17 days ago [-]
Going faster isn't good if the quality drops enough that overall productivity decreases... Infinite slop is only a good thing for pigs.
cruffle_duffle 17 days ago [-]
Just use ChatGPT to summarize its own output. It’s like running your JPEG back through the JPEG compressor again!
kianN 17 days ago [-]
^ This perfectly encapsulates the story I see every time someone digs into the details of any llm generated or assisted content that has any level of complexity.
Great on the surface but lacks any depth, cohesion, or substance
mooreds 17 days ago [-]
I started a book about CIAM (customer identity and access management) using Claude to help outline a chapter. I'd edit and refine the outline to make sure it covered everything.
Then I'd have Claude create text. I'd then edit/refine each chapter's text.
Wow, was it unpleasant. It was kinda cool to see all the words put together, but editing the output was a slog.
It's bad enough editing your own writing, but for some reason this was even worse.
dxbydt 17 days ago [-]
just to clarify - I have nothing to do with this book. I was just forwarded a copy and I thought it was relevant to the topic at hand.
from the wild swings in karma, looks like people are annoyed with the message and shooting down the messenger.
karmakaze 17 days ago [-]
We're at the "computers play chess badly" stage. Then we'll hit the Deep Thought (1988) and Deep Blue (1995-1997) stages, but still saying that solving Go won't happen for 50+ years and that humans will continue to be better than computers.
The date/time that divides my world into before/after is AlphaGo v Lee Sedol game 3 (2016). From that time forward, I don't dismiss out of hand speculations of how soon we can have intelligent machines. Ray Kurzweil's date of 2045 is as good as any (and better than most) for an estimate. Like Moore's (and related) Laws, it's not about how but the historical pace of advancements crossing a fairly static point of human capability.
Application coding requires much less intelligence than playing Go at these high levels. The main differences are concise representation and clear final outcome scoring. LLMs deal quite well with the fuzziness of human communications. There may be a few more pegs to place but when seems predictably unknown.
dartos 17 days ago [-]
> There’s a flipside to this too: a lot of better informed people have sworn off LLMs entirely because they can’t see how anyone could benefit from a tool with so many flaws. The key skill in getting the most out of LLMs is learning to work with tech that is both inherently unreliable and incredibly powerful at the same time. This is a decidedly non-obvious skill to acquire!
I wish the author qualified this more. How does one develop that skill?
What makes LLMs so powerful on a day to day basis without a large RAG system around it?
Personally, I try LLMs every now and then, but haven’t seen any indication of their usefulness for my day to day outside of being a smarter auto complete.
lumost 17 days ago [-]
When I started my career in 2010, google was a semi-serious skill. All of the little things that we know how to do now such as ignoring certain sites, lingering on others, and iteratively refining our search queries were not universally known at the time. Experienced engineers often relied on encyclopedic knowledge of their environment or by "reading the manual".
In my experience, LLM tools are the same, you ask for something basic initially and then iteratively refine the query either via dialog or a new prompt until you get what you are looking for or hit the end of the LLM's capability. Knowing when you've reached the latter is critically important.
layer8 17 days ago [-]
One difference is that skillful googling still only involved typing a few keywords or a short phrase and some syntax, and then knowing how to skim the results and iterate, and how to operate your browser efficiently. With LLMs, you have to type a lot more (and/or use voice input), and often also read more, it’s also not stateless/repeatable like following a web link, and most output looks the same (as opposed to the variations in web sites). I pride(d) myself on my Google foo, it was fun, but I find using LLMs to be quite exhausting in comparison.
j_bum 17 days ago [-]
I also find LLMs to be more exhausting than Googling, but for me they’ve been ultimately more enriching and efficient.
Specifically, I’ve been using Kagi Assistant over the past 1.5 months for serious and lengthy searches, and I can’t imagine going back to traditional search.
I’m currently sold on this model of LLM assisted search (where explicit links are provided) over the old Google foo skills I developed during grad school.
Example search topics include deep dives and guidance for my first NAS build, finding new bioinformatics methods, and other random biomedical info.
o11c 17 days ago [-]
The problems with that skill are:
* Most existing LLM interfaces are very bad at editing history, instead focusing entirely on appending to history. You can sort of ignore this for one-shot, and this can be properly fixed with additional custom tools, but ...
* By the time you refine your input enough to patch over all the errors in the LLM's output for your sensible input, you're bigger than the LLM can actually handle (much smaller than the alleged context window), so it starts randomly ignoring significant chunks of what you wrote (unlike context-window problems, the ignored parts can be anywhere in the input).
danielbln 17 days ago [-]
I really like Zed's (editor) implementation. The context window is just editable text, like any other. You can freely change anything and send the whole thing back into the LLM. I find that a much more useful interface than mucking around and editing chat bubbles.
dinosaurdynasty 17 days ago [-]
ChatGPT basically lets you edit any of your messages at any point in the conversation, which I definitely use (e.g., if the conversation has gotten into a bad basin, the LLM misunderstood me, etc).
Also ChatGPT has a pretty big context window. Gemini supposedly has the biggest useful context window (~millions of tokens), though I don't have personal experience.
simonw 17 days ago [-]
I tend to avoid editing previous messages because it breaks my mental model of the sequence that got me to the current state. That's more of a bias from my goal to do "research" into how these models work though - I'm always trying to maintain the cleanest possible record of what I did so I can learn from the transcript later.
cruffle_duffle 17 days ago [-]
> Most existing LLM interfaces are very bad at editing history, instead focusing entirely on appending to history. You can sort of ignore this for one-shot, and this can be properly fixed with additional custom tools, but ...
Somebody somewhere needs to provide a threaded interface to an LLM.
simonw 17 days ago [-]
Yeah, a key thing to understand about LLMs is that managing the context is everything. You need to know when to wipe the slate by starting a new chat session and then pasting across a subset of the previous conversation.
A lot of my most complex LLM interactions take place across multiple sessions - and in some cases I'll even move the project from Claude 3.5 Sonnet to OpenAI o1 (or vice versa) to help get out of a rut.
It's infuriatingly difficult to explain why I decide to do that though!
dartos 17 days ago [-]
What kinds of things do you do with these LLMs?
I feel like I’m good at understanding context. I’ve been working in AI startups over the last 2 years. Currently at an AI search startup.
Managing context for info retrieval is the name of the game.
But for my personal use as a developer, they’ve caused me much headache.
Answers that are subtly wrong in such a way that it took me a week to realize my initial assumption based on the LLM response was totally bunk.
This happened twice. With the yjs library, it gave me half incorrect information that led me to misimplementing the sync protocol. Granted it’s a fairly new library.
And again with the web history api. It said that the history stack only exists until a page reload.
The examples it gave me ran as it described, but that isn’t how the history api works.
I lost a week of time because of that assumption.
I’ve been hesitant to dive back in since then. I ask questions every now and again, but I jump off much faster now if I even think it may be wrong.
gavindean90 17 days ago [-]
There is no substitute for cold hard facts. LLMs do not provide that unless it’s literally the easiest thing for them to do and even then not always.
In the case you were in I would go out of my way to feed the docs to the LLM and then use the LLM to interrogate the docs and then verify the understanding I got from the LLM with a personal reading of the docs that were relevant.
You might think it takes just as long if not longer to do it my way rather than just reading the docs myself. Sometimes it can. But as you get good at the workflow you find that the time spent finding the relevant docs goes down and you get an instant plausible interpretation of the docs added too. You can then very quickly produce application code right away and then docs of the code you write.
simonw 17 days ago [-]
Here are a bunch of things I use LLMs for relating to code.
- Building front-end prototypes - I use Claude Artifacts for this all the time, if I have an idea for a UI I'll get Claude to spin up an almost instant demo so I can interact with it and see if it feels right. I'll often copy the code out and use it as the starting point for my production feature.
- DSLs like SQL, Bash scripts, jq, AppleScript, grep - I use these WAY more than I used to because 9/10 times Claude gives me exactly what I needed from a single prompt. I built a CLI tool for prompt-driven jq programs recently: https://simonwillison.net/2024/Oct/27/llm-jq/
- Ad-hoc sidequests. This is a pretty broad category, but it's effectively little coding projects which I shouldn't actually be working on at all but I'll let myself get distracted if an LLM can get me there in a few minutes: https://simonwillison.net/2024/Mar/22/claude-and-chatgpt-cas...
- Writing C extensions for SQLite while I'm walking my dog on the beach. I am not a C programmer but I find it extremely entertaining that ChatGPT Code Interpreter, prompted from my phone, can write, compile and test a C extension for SQLite for me: https://simonwillison.net/2024/Mar/23/building-c-extensions-...
- That's actually a good example of a general pattern: I use this stuff for exploratory prototyping outside of my usual (Python+JavaScript) stack all the time. Usually this leads nowhere, but occasionally it might turn into a real project (like this AppleScript example: https://til.simonwillison.net/gpt3/chatgpt-applescript )
- Actually writing code. Here's a Python/Django app I wrote almost entirely with Claude: https://simonwillison.net/2024/Aug/8/django-http-debug/ - again, this was something of a side-project - not something worth spending a full day on but worthwhile if I could get it done in a couple of hours.
- Mucking around with APIs. Having a web UI for exploring an API is really useful, and Claude can often knock those out from a single prompt. https://simonwillison.net/2024/Dec/17/openai-webrtc/ is a good example of that.
There's a TON more, but this probably represents the majority of my usage.
dartos 17 days ago [-]
Thank you!
I’ll read through these and try again in the new year.
Tostino 17 days ago [-]
Not OP, but I've just gotten really used to verifying implementation details. Yup, those subtle ones really suck. It's pretty much just up to intuition if something in the response (or your followups) rings the `not quite right` bell for you.
grimgrin 17 days ago [-]
I bought in early to typingmind, a great web based frontend. Good for editing context, and switching from say gemini to claude. This is a very normal flow for me, and whatever tool you use should enable this
also nice to interact with an LLM in vim, as the context is the buffer
obviously simon’s llm tool rules. I’ve wrapped it for vim
Obscurity4340 17 days ago [-]
Googlefu is how it's usually called. It would be fantastic if there was a general course to teach it
CRConrad 12 days ago [-]
Not "would be", but "would have been", past tense: Once upon a time it may have been valuable, but since Google deliberately ruined its search engine, it's of no use any more.
simonw 17 days ago [-]
One of the things I find most frustrating about LLMs is how resistant they are to teaching other people how to use them!
I'd love to figure this out. I've written more about them than most people at this point, and my goal has always been to help people learn what they can and cannot do - but distilling that down to a concise set of lessons continues to defeat me.
The only way to really get to grips with them is to use them, a lot. You need to try things that fail, and other things that work, and build up an intuition about their strengths and weaknesses.
The problem with intuition is it's really hard to download that into someone else's head.
My first stab at trying ChatGPT last year was asking it to write some Rust code to do audio processing. It was not a happy experience. I stepped back and didn't play with LLMs at all for a while after that. Reading your posts has helped me keep tabs on the state of the art and decide to jump back in (though with different/easier problems this time).
djhn 17 days ago [-]
To be fair I think that is a hard task even for a human expert, in the sense that there isn’t much prior art.
mvdtnz 17 days ago [-]
It's really important to go and read the code that the author of this article actually produces with LLMs. He posted on hacker news a few months ago, a post called something like "everything I've made with ChatGPT in the month of September" or something. He's producing little toy applications that don't even begin to resemble real production code. He thinks these "tools" are useful because they help him write pointless slop.
The point of that post isn't "look at these incredible projects I've built (proceeds to show simple projects)."
It's "I built 14 small and useful tools in a single week, each taking between 2 and 10 minutes".
The thing that's interesting here is that I can have an LLM kick out a working prototype of a small, useful tool in only a little more time than it takes to run a Google search.
That post isn't meant to be about writing "real production code". I don't know why people are confused over that.
Philpax 17 days ago [-]
Do you know who Simon is?
mvdtnz 15 days ago [-]
Only from his neverending stream of hacker news posts.
BeetleB 17 days ago [-]
I think most tech folks struggle with it because they treat LLMs as computer programs, and their experience is that SW should be extremely reliable - imagine using a calculator that was wrong 5% of the time - no one would accept that!
Instead, think of an LLM as the equivalent of giving a human a menial task. You know that they're not 100% reliable, and so you give them only tasks that you can quickly verify and correct.
Abstract that out a bit further, and realize that most managers don't expect their reports to be 100% reliable.
Don't use LLMs where accuracy is paramount. Use it to automate away tedious stuff. Examples for me:
Cleaning up speech recognition. I use a traditional voice recognition tool to transcribe, and then have GPT clean it up. I've tried voice recognition tools for dictation on and off for over a decade, and always gave up because even a 95% accuracy is a pain to clean up. But now, I route the output to GPT automatically. It still has issues, but I now often go paragraphs before I have to correct anything. For personal notes, I mostly don't even bother checking its accuracy - I do it only when dictating things others will look at.
And then add embellishments to that. I was dictating out a recipe I needed to send to someone. I told GPT up front to write any number that appears next to an ingredient as a numeral (i.e. 3 instead of "three"). Did a great job - didn't need to correct anything.
And then there are always the "I could do this myself but I didn't have time so I gave it to GPT" category. I was giving a presentation that involved graphs (nodes, edges, etc). I was on a tight deadline and didn't want to figure out how to draw graphs. So I made a tabular representation of my graph, gave it to GPT, and asked it to write graphviz code to make that graph. It did it perfectly (correct nodes and edges, too!)
Sure, if I had time, I'd go learn graphviz myself. But I wouldn't have. The chances I'll need graphviz again in the next few years is virtually 0.
I've actually used LLMs to do quick reformatting of data a few times. You just have to be careful that you can verify the output quickly. If it's a long table, then don't use LLMs for this.
Another example: I have a custom note taking tool. It's just for me. For convenience, I also made an HTML export. Wouldn't it be great if it automatically made alt text for each image I have in my notes? I would just need to send it to the LLM and get the text. It's fractions of a cent per image! The current services are a lot more accurate at image recognition than I need them to be for this purpose!
Oh, and then of course, having it write Bash scripts and CSS for me :-) (not a frontend developer - I've learned CSS in the past, but it's quicker to verify whatever it throws at me than Google it).
Any time you have a task and lament "Oh, this is likely easy, but I just don't have the time" consider how you could make an LLM do it.
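The graphviz task described above is a good illustration of output that is quick to verify. As a rough sketch of what such generated code might look like (not the code GPT actually produced; the function name `edges_to_dot` and the sample table are invented for illustration), here is a converter from a table of edges to DOT text:

```python
def edges_to_dot(rows, name="G"):
    """Turn (source, target, label) rows into Graphviz DOT text.

    An empty label means the edge gets no label attribute.
    """
    def quote(text):
        # DOT quoted strings only need embedded double quotes escaped.
        return '"' + str(text).replace('"', '\\"') + '"'

    lines = [f"digraph {name} {{"]
    for source, target, label in rows:
        attrs = f" [label={quote(label)}]" if label else ""
        lines.append(f"    {quote(source)} -> {quote(target)}{attrs};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical tabular representation of a small graph.
table = [
    ("parse", "plan", "tokens"),
    ("plan", "execute", ""),
    ("execute", "report", "results"),
]
print(edges_to_dot(table))
```

Checking the result means comparing a handful of `->` lines against the table, which is the kind of fast verification the comment recommends.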
dartos 17 days ago [-]
> Don't use LLMs where accuracy is paramount.
Then why do people keep pushing it for code related tasks?
Accuracy and precision is paramount with code. It needs to express exactly what needs to be done and how.
simonw 17 days ago [-]
Code is the best possible application of LLMs because you can TEST the output.
If the LLM hallucinates something the code won't compile or run.
If the LLM makes a logic error you'll catch it in the manual QA process.
(If you don't have good personal manual QA habits, don't try using LLMs to write your code. And maybe don't hit "accept" on other developers' code reviews either?)
dartos 17 days ago [-]
> Code is the best possible application of LLMs because you can TEST the output.
This is an overly simplistic view of software development.
Poorly made abstractions and functions will have knock on effects on future code that can be hard to predict.
Not to mention that code can have side effects that may not affect a given test case, or the code could be poorly optimized, etc.
Just because code compiles or passes a test does not mean it’s entirely correct. If it did, we wouldn’t have bugs anymore.
The usual response to this is something like “we can use the LLM to refactor LLM code if we need” but, in my experience, this leads to very complex, hard to reason about codebases.
Especially if the stack isn’t Python or JavaScript.
simonw 17 days ago [-]
So code review LLM-generated code and reject it (or require changes to it) if it doesn't fit your idea of what good code looks like.
dartos 17 days ago [-]
Or… yknow… I could just write the code…
Instead of going through a multi step process to get an LLM to generate it, review it, reject it, and repeat…
I wonder why you reply to these comments, but not my other asking what you use LLMs for and specifically explaining how they failed me.
> Then why do people keep pushing it for code related tasks?
They don't. You are likely experiencing selection bias. My guess is you work in SW, and so it makes sense that you're the target of those campaigns. The bulk of ChatGPT subscribers are not doing SW, and no one is bugging them to use it for code related tasks.
dartos 17 days ago [-]
I mean people in the software field absolutely push for LLMs to write code…
Obviously people not in the software field wouldn’t care…
danielbln 17 days ago [-]
Because there are other ways to validate the output, types being one of them, tests another. Or simply running the code. It's easy enough to validate the output given the right approach that code generated by an LLM (usually as the result of a conversation/discussion about what should be accomplished) is a net positive.
If you zero-prompt and copy-paste the first result into your codebase, yeah, the accuracy problem will rear its ugly head real quick.
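To make the "types and tests" point concrete, here is a small sketch. Assume `parse_duration` is a function an LLM handed back (it is a made-up example, not from any real session); the type hints document the contract and the checks exercise both valid and invalid input before the code goes anywhere near a codebase.

```python
import re


def parse_duration(text: str) -> int:
    """Convert strings like '1h30m' or '45s' into a number of seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    matches = re.findall(r"(\d+)([hms])", text)
    # Reject input with leftover characters instead of silently ignoring them.
    if not matches or "".join(n + u for n, u in matches) != text:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(int(n) * units[u] for n, u in matches)


def check_parse_duration() -> None:
    """Quick checks to run before accepting the generated function."""
    assert parse_duration("45s") == 45
    assert parse_duration("1h30m") == 5400
    assert parse_duration("2h0m5s") == 7205
    for bad in ("", "90", "1h 30m", "-5m"):
        try:
            parse_duration(bad)
        except ValueError:
            continue
        raise AssertionError(f"accepted bad input: {bad!r}")


check_parse_duration()
```

Passing checks like these does not prove the code is well designed, which is the objection raised elsewhere in the thread; it only catches the cheap failures quickly.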
jaredsohn 17 days ago [-]
A similar use case for me - I wrote some technical documentation for our wiki about a somewhat complicated relationship between ids in some database tables. I copied my text explanation into an LLM and asked it to make a diagram and it did so. Took very little time from me and it was fast/easy to verify that the quality was good.
layer8 17 days ago [-]
I think there’s the added reason that a lot of folks went into tech because (consciously or unconsciously) they prefer dealing with predictable machines than with unreliable humans. And now that career choice begins to look like a bait and switch. ;)
aleph_minus_one 16 days ago [-]
> Instead, think of an LLM as the equivalent of giving a human a menial task. You know that they're not 100% reliable, and so you give them only tasks that you can quickly verify and correct.
The problem is: for the tasks that I can give the LLM (or human) that I can easily verify and correct, the LLM fails with the majority of them, for example
- programming tasks in my area of expertise (which is more "mathematical" than what is common in SV startups), where I know what a high-level solution has to look like, and where I can ask the LLM to explain the gory details to me. Yes, these gory details are subtle (which is why the task can be menial), but the code has to be right. I can verify this, and the code is not correct.
- getting literature references about more obscure scientific (in particular mathematical) topics. I can easily check whether these literature references (or summaries of these references) are hallucinations - they typically are.
simonw 16 days ago [-]
LLMs on their own are effectively useless for references or citations. They need to be plugged into other systems for that - search-enabled ones like https://gemini.google.com or ChatGPT with search enabled or Perplexity can do this, although at that point they are mostly running the exact same searches you would.
BeetleB 15 days ago [-]
Your first task is definitely not what I would call a "menial" task.
Your second task is not a "task", but a knowledge search. LLMs are not good with searches (unless augmented - like RAG).
zahlman 17 days ago [-]
> Don't use LLMs where accuracy is paramount. Use it to automate away tedious stuff.
My programmer mind tells me that "tedious stuff" is where accuracy is the most important.
wodderam 16 days ago [-]
My experience is that for certain tasks LLMs are great, for certain tasks LLMs are basically useless.
The best prompts though are always written in a separate text file for me and pasted in. Follow up questions are never as good as a detailed initial prompt.
I would imagine well formulated questions to solve the problem at hand is a skill but beyond that I don't think there is anything special about how to ask LLMs a question.
In areas where the LLM is rather useless, no amount of variation in prompting can solve that problem IMO. Just like if the task is something the LLM is good at, the prompt can be pretty sloppy and seem like magic with how it can understand what you want.
simonw 16 days ago [-]
I think one of the most important skills is being able to predict which tasks an LLM is a good fit for and which aren't.
perrygeo 17 days ago [-]
There's a similar dynamic in building reliable distributed systems on top of an unreliable network. The parts are prone to failure but the system can keep on working.
The tricky problem with LLMs is identifying failures - if you're asking the question, it's implied that you don't have enough context to assess whether it's a hallucination or a good recommendation! One approach is to build ensembles of agents that can check each other's work, but that's a resource-intensive solution.
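One very simple form of the ensemble idea is majority voting over several independent answers. The sketch below assumes the answers have already been collected as strings (no real model API is called) and the name `majority_answer` and the quorum threshold are illustrative choices.

```python
from collections import Counter


def majority_answer(answers, quorum=0.5):
    """Return the most common answer if its share exceeds quorum, else None.

    Answers are compared after trimming whitespace and lowercasing.
    """
    if not answers:
        return None
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner if votes / len(normalized) > quorum else None


print(majority_answer(["Paris", "paris ", "Lyon"]))
print(majority_answer(["a", "b", "c"]))
```

Agreement is not the same as correctness: models trained on similar data can share the same mistake, so a vote like this only filters out uncorrelated errors, and it multiplies the cost by the number of answers requested.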
swalsh 17 days ago [-]
It's amazing this is still an opinion in 2025. I now ask devs how they use AI as part of their workflows when I interview. It's a standard skill I expect my guys to have.
dartos 17 days ago [-]
I feel bad for your team.
Let people work how they want. I wouldn’t not hire someone on the basis of them not using a language server.
The creator of the Odin language famously doesn’t use one. He says that he, specifically, is faster without one.
theptip 17 days ago [-]
No, it’s reasonable. If your team uses Git then it’s a valid question to establish if someone has only worked with Perforce.
They didn’t say how heavily they weight the question.
(All that said I expect that, soon, experience with the appropriate LLM tooling will be as important as having experience with the language your system is implemented in.)
dartos 17 days ago [-]
Right, but using git is a team wide thing.
I can’t use perforce while my company is on git.
But if I do or do not use an LLM to assist me while coding, my team is unaffected.
If someone liked jetbrains, but your team used neovim, would you force them to use neovim?
Kwpolska 17 days ago [-]
Editors may also be a team decision in some places. Some teams are using features unique to one IDE, for example.
chikere232 17 days ago [-]
it can be a team decision, but it's a bad one
dartos 17 days ago [-]
Then that tooling is required, like visual studio is a common one I know about in windows land.
Though nobody should care if I edited my text files with neovim as long as I still used the same toolchain as everyone else.
paxys 17 days ago [-]
You hire people based on their fundamental knowledge and the ability to learn, not skills in arbitrary tools and frameworks which come and go every other day. If someone has used Perforce they will be able to get perfectly comfortable with Git by the end of their first week. So not knowing Git is an idiotic reason to reject a skilled developer. Same with programming languages, and just about every other aspect of software development.
swalsh 17 days ago [-]
I don't really test any specific tools or frameworks, what i'm using has changed twice just in the last year. More so, I just want to hear that the candidate has some knowledge of what the current models can do well, what they can't do, and how they're integrating it. Whether you're copying pasting code or using something like cursor is not what i'm concerned about.
codr7 17 days ago [-]
Yeah, but it's oh so easy to test for, and oh so nice to have plenty of boxes checked to cover your ass if the hire goes wrong.
swalsh 17 days ago [-]
My expectations around productivity are going to assume you're using AI. That means stuff that might have taken a few days, i'm going to expect in a few hours or less. It's not unreasonable, i've seen over and over again that kind of speed up. I have a lot less approval to hire people than I used to... so it's really important to me that I can extract that level of productivity out of my team.
If you're "working the way you want to" ie still handrolling all your code, you're going to find my expectations unrealistic, and that is certainly not fair to you.
sramam 17 days ago [-]
I concur that asking devs how they use AI is a great idea.
Recently, I shared a code base with a junior dev and she was surprised with the speed and sophistication of the code. The LLM did 80+% of the "coding".
What was telling was as she was grokking the code (for helping the ~20%), she was surprised at the quality of the code - her use of the LLM did not yield code of similar quality.
I find that the more domain awareness one brings to the table, the better the output is. Basically the clearer one's vision of the end-state, the better the output.
One other positive side-effect of using "LLMs as a junior-dev" for me has been that my ambitions are greater. I want it all - better code, more sophisticated capabilities even for relatively not-important projects, documentation, tests, debug-ability. And once the basic structure is in place, many a time it is trivial to get the rest.
It's never 100%, but even with 80+%, I am faster than ever before, deliver better quality code, and can switch domains multiple times a week and never feel drained.
Sharing best AI hacks within a team will have the same effect as code-reviews do in ensuring consistency. Perhaps an "LLM chat review", especially when something particularly novel was accomplished!
layer8 17 days ago [-]
Using cloud-based AI is a no-go where I work, for IP and contractual reasons. And on-premises AI is not as capable and more difficult to integrate.
simonw 17 days ago [-]
Have you tried the latest open weight models? They're SO MUCH better today than they were even six months ago.
If I was in an environment that didn't allow hosted API models I'd absolutely be looking into the various Llama 3 models or Qwen2.5-Coder-32B.
3eb7988a1663 17 days ago [-]
Legal does not even want us running offline models for reasons. I assume that comes down to not knowing what offline-only means, but such is life.
simonw 17 days ago [-]
Maybe they're concerned that code written with AI assistance can't be copyrighted? I've seen that idea floated in a few places.
layer8 17 days ago [-]
What do you use so that you can throw in a set of documents and/or a nontrivial code base into an LLM workspace and ask questions about it etc.? What the cloud-based services provide goes way beyond a simple chat interface or mere code completion (as you know, of course).
Now I have all the Python and Markdown files from the current project on my clipboard, in Claude's recommended XML-like format (which I find works well with other models too).
Then I paste that into the Claude web interface or Google's AI Studio if it's too long for Claude and ask questions there.
Sometimes I'll pipe it straight into my own LLM CLI tool and ask questions that way:
I can later start a chat session on top of the accumulated context like this:
llm chat -c
(The -c means "continue most recent conversation in the chat").
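For readers who want to see what "all the Python and Markdown files in an XML-like format" can look like, here is a minimal sketch. The tag layout follows the `<documents>`/`<document>` structure that Anthropic's long-context guidance suggests; whether it matches the exact output of the tooling used above is an assumption, and `build_documents_prompt` is a made-up name.

```python
from pathlib import Path


def build_documents_prompt(root, extensions=(".py", ".md")):
    """Concatenate matching files under root into one XML-like prompt string."""
    root = Path(root)
    paths = sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix in extensions
    )
    parts = ["<documents>"]
    for index, path in enumerate(paths, start=1):
        parts.append(f'<document index="{index}">')
        # Relative paths keep the prompt readable and free of local directories.
        parts.append(f"<source>{path.relative_to(root).as_posix()}</source>")
        parts.append("<document_content>")
        parts.append(path.read_text(encoding="utf-8", errors="replace"))
        parts.append("</document_content>")
        parts.append("</document>")
    parts.append("</documents>")
    return "\n".join(parts)


if __name__ == "__main__":
    # Print the prompt so it can be piped to a clipboard tool or an LLM CLI.
    print(build_documents_prompt("."))
```

The file contents are inserted verbatim, so a file that itself contains these tags could confuse the structure; for a quick question about a small project that is usually an acceptable trade-off.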
layer8 17 days ago [-]
Thanks. Google AI Studio isn’t local, I think, is it? I’ll have to test this, but our project sizes and specification documents are likely to run into size limitations for local models (or for the clipboard at the very least ;)). And what I’d be most interested in are big-picture questions and global analyses.
simonw 17 days ago [-]
No, it's not. I've not seen any local models that can handle 1m+ tokens.
I haven't actually done many experiments with long context local models - I tend to hit the hosted API models for that kind of thing.
BeetleB 17 days ago [-]
Just curious, but what AI related skills do you expect them to have?
eschaton 17 days ago [-]
The ability to recognize and join a hype train, I presume. It’s one way to appear proactively leading-edge to marginally-informed product managers, marketers, execs and press.
adwn 16 days ago [-]
That's an extremely uncharitable presumption. Although I don't agree that routine usage of AIs should be a precondition for regular software engineering jobs, there are good reasons for using LLMs besides "joining a hype train".
eschaton 16 days ago [-]
Nah.
swalsh 17 days ago [-]
I ask what their current workflow is, how they check and verify things, what their approach to prompting is etc. I'm looking to see that they've developed basic skills, have a reasonable mental model of what models can do well, what they currently can't do, and have an approach to be productive using the tools.
dogcomplex 16 days ago [-]
I would characterize good prompting as: write out your whole problem you're trying to solve, then think to yourself what the clarifying questions would be if you were a junior trying to solve it. Better yet - ask the LLM to ask you challenging clarifying questions for several rounds. Then, take all that information and re-compile it back into a list of all the important components of the project, and re-read it to make sure there's no particular ambiguous part or weird part that would be over-emphasized by the language you used. Then, emphasize the core concerns again, and tell it how you'd like it to output the response (keeping in mind that it will always do best with a conversation-style format with loose restrictions). Never let a conversation stray too long from the original goals lest it start forgetting.
Once that's all done, you basically have a well-structured question you could pass to an underling and have them completely independently work on the project without bugging you. That's the goal. Now, pass that to o1 or Claude, depending on whether it's a general-purpose task (o1) or a code-specific task (Claude), and wait for response. From there, have a conversation or test-and-followup of whatever it spits out, this time with you asking questions. If good enough, done. If not, wrap up whatever useful insights from that line of questioning and put it back into the initial prompt and either re-post it at the end of the conversation or start a fresh conversation.
I find 90% of the time this gets exactly what I'm after eventually. The few other cases are usually because we hit some cycle where the AI doesn't fully know what to change/respond, and it keeps repeating itself when I ask. The trick then is to ask things a different way or emphasize something new. This is usually just a code-specific issue, for general problems it's much better. One other trick is to ask it to take a step back and just tackle the problem in a theoretical/philosophical way first before trying to do any coding or practical solving, and then do that in a second phase (asking o1 to architect code structure and then Claude to implement it is a great combo too). Also if there is any way to break up the problem into smaller pieces which can be tackled one conversation at a time - much better. Just remember to include all relevant context it needs to interface with the overall problem too.
That sounds like a lot, but it's essentially just project management and delegation to somewhat-flawed underlings. The upside is instead of waiting a workweek for them to get back to you, you just have to wait 20 seconds. But it does mean a ton of reading and writing. There are certainly already some meta-prompts where you can get the AI to essentially do this whole process for you and assess itself, but like all automation that means extra ways for things to break too. Let the AI devs cook though and those will be a lot more commonplace soon enough...
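For what it's worth, the workflow above can be sketched as a small template. This is only an illustration in Python: `PromptBrief`, its fields and the section headings are made-up names, not part of any real library or of the tools mentioned.

```python
from dataclasses import dataclass, field


@dataclass
class PromptBrief:
    """Hypothetical container for the pieces of a well-structured prompt."""
    problem: str
    # (question, answer) pairs gathered during the clarifying rounds
    clarifications: list[tuple[str, str]] = field(default_factory=list)
    core_concerns: list[str] = field(default_factory=list)
    output_format: str = "Reply conversationally; loose structure is fine."

    def render(self) -> str:
        """Re-compile everything into one prompt, restating the core concerns."""
        parts = ["## Problem", self.problem.strip()]
        if self.clarifications:
            parts.append("## Clarifications")
            parts += [f"Q: {q}\nA: {a}" for q, a in self.clarifications]
        if self.core_concerns:
            parts.append("## Core concerns (restated)")
            parts += [f"- {c}" for c in self.core_concerns]
        parts += ["## Output", self.output_format]
        return "\n\n".join(parts)


brief = PromptBrief(
    problem="Port our nightly cron jobs to systemd timers.",
    clarifications=[("Which distro?", "Debian 12")],
    core_concerns=["No job may run twice", "Keep existing log paths"],
)
print(brief.render())
```

When a conversation goes stale, the useful insights can be appended to `clarifications` and the rendered text pasted into a fresh conversation.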
Do you know if any of the ideas from that project have crossed over into LLM world yet?
duck 17 days ago [-]
Do you know who Simon is?
Havoc 17 days ago [-]
Great summary of highlights. Don't agree with all, but I think it's a very sound attempt at a year in review summary
>LLM prices crashed
This one has me a little spooked. The white knight on this front (DS) has both announced increases and has had staff poached. There is still Gemini free tier which is ofc basically impossible to beat (solid & functionally unlimited/free) but it's google so reluctant to trust.
Seriously worried about seeing a regression on pricing in first half of 2025. Especially with the OAI $200 price anchoring.
>“Agents” still haven’t really happened yet
Think that's largely because it's a poorly defined concept and true "agent" implies some sort of pseudo-agi autonomy. This is a definition/expectation issue rather than technical in my mind
>LLMs somehow got even harder to use
I don't think that's 100%. An explosion of options is not equal to harder to use. And the guidance for noobs is still pretty much the same as always (llama.cpp or one of the common frontends like text-generation-webui). It's become harder to tell what is good, but not to get going.
----
One key theme I think is missing is just how hard it has become to tell what is "good" for the average user. There is so much benchmark shenanigans going on that it's just impossible to tell. I'm literally at the "I'm just going to build my own testing framework" stage. Not because I can do better technically (I can't)...but because I can gear it towards things I care about and I can be confident my DIY sample hasn't been gamed.
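A DIY testing framework along these lines can start very small. The sketch below is hypothetical: `run_suite`, `stub_model` and the two cases are invented for illustration, and the stub would be replaced by a call to whatever model is being tested.

```python
from typing import Callable

# Each case: a prompt plus a checker that decides whether the reply passes.
Case = tuple[str, Callable[[str], bool]]


def run_suite(model: Callable[[str], str], cases: list[Case]) -> float:
    """Return the fraction of cases whose reply passes its checker."""
    if not cases:
        raise ValueError("no test cases")
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption: swap in your own client)."""
    return "4" if "2 + 2" in prompt else "I don't know"


cases: list[Case] = [
    ("What is 2 + 2? Answer with a number only.", lambda r: r.strip() == "4"),
    ("Name the capital of France.", lambda r: "paris" in r.lower()),
]
print(run_suite(stub_model, cases))  # 0.5 with the stub above
```

The point of keeping the cases private is the one made above: a sample nobody else has seen can't have been gamed.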
simonw 17 days ago [-]
The biggest reason I'm not worried about prices going back up again is Llama. The Llama 3 models are really good, and because they are open weight there are a growing number of API providers competing to provide access to them.
These companies are incentivized to figure out fast and efficient hosting for the models. They don't need to train any models themselves, their value is added entirely in continuing to drive the price of inference down.
Groq and Cerebras are particularly interesting here because WOW they serve Llama fast.
causal 17 days ago [-]
Agents have a definition issue sure, but IMO we are prevented from even discovering a useful definition by the current limitations of LLMs
theanonymousone 17 days ago [-]
> There is still Gemini free tier which is ofc basically impossible to beat
Is it free free? The last time I checked there was a daily request limit, still generous but limiting for some use cases. Isn't it still the case?
simonw 17 days ago [-]
Providing an unlimited free tier would be a terrible business decision for them.
theanonymousone 17 days ago [-]
Of course. My point is, probably a super cheap LLM that does not cut you off after 1500th API request of the day is preferred over the free model that does so, at least for certain use cases.
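One way to get both is to use the free tier until the daily quota runs out and only then fall back to a cheap paid model. A rough sketch, where `QuotaRouter` and the "free"/"paid" labels are hypothetical and the 1500 figure is simply the one quoted above:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class QuotaRouter:
    """Send requests to a free tier until its daily quota is spent."""
    daily_free_limit: int = 1500  # figure quoted in the comment above
    used_today: int = 0
    day: date = date.min

    def pick(self, today: date) -> str:
        """Return which backend the next request should go to."""
        if today != self.day:  # new day: reset the counter
            self.day, self.used_today = today, 0
        if self.used_today < self.daily_free_limit:
            self.used_today += 1
            return "free"
        return "paid"


router = QuotaRouter(daily_free_limit=2)
print([router.pick(date(2025, 1, 1)) for _ in range(3)])  # ['free', 'free', 'paid']
```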
Animats 18 days ago [-]
> Some of those GPT-4 models run on my laptop
That's an indication that most business-sized models won't need some giant data center. This is going to be a cheap technology most of the time.
OpenAI is thus way overvalued.
thinkingemote 17 days ago [-]
Most of the laptops that the models can run on today are in the high end of dedicated bare metal servers. Most shared VM servers are way below these laptops. Most people buying a new laptop today won't be able to run them, most devs getting a website up with a server won't be able to run them.
This means that the definitions of "laptop" and "server" are dependent on use. We should instead talk about RAM, GPU and CPU speed which is more useful and informative but less engaging than "my laptop".
mjburgess 18 days ago [-]
I don't think openai's valuation comes from a data center bet -- rather, I'd suppose, investors think it has a first-mover advantage on model quality that it can (maybe?) attract some buy-out interest or otherwise use in yet-to-be-specified product lines.
However, it has been clear for a long time that meta are just demolishing any competitor's moats, driving the whole megacorp AI competition to razor thin margins.
It's a very welcome strategy from a consumer pov, but -- it has to be said -- genius from a business pov. By deciding that no one will win, it can prevent anyone leapfrogging them at a relatively cheap price.
shihab 18 days ago [-]
The last OpenAI valuation I read about was 157 billion. I am struggling to understand what justifies this. To me, it feels like OpenAI is at best a few months ahead of competitors in some areas. But even if I am underestimating the advantage and it's a few years instead of a few months, why does it matter? It's not like AI companies are going to enjoy the first-mover advantage internet giants had over the competition.
datadrivenangel 18 days ago [-]
It's justified if AGI is possible. If AGI is possible, then the entire human economy stops making sense as far as money goes, and 'owning' part of OpenAI gives you power.
That is of course, assuming AGI is possible and exponential, and that marketshare goes to a single entity instead of a set of entities. Lots of big assumptions. Seems like we're heading towards a slow-lackluster singularity though.
torginus 17 days ago [-]
I was thinking about how the economy actively makes less sense and gets divorced more and more from reality year after year, AI or not.
It's the simple fact that the ability of assets to generate wealth has far outstripped the ability of individuals to earn money by working.
Somehow real estate has become so expensive everywhere that owning a shitty apartment is impossible for the vast majority.
When the world's population was exploding during the 20th century, housing prices were not a problem, yet somehow nowadays, it's impossible to build affordable housing to bring the prices down, though the population is stagnant or growing slowly.
A company can be worth $1B if someone invests $10m in it for 1% stake - where did the remaining $990m come from? Likewise, the stock market is full of trillion-dollar companies whose valuations beggar all explanation, considering the sizes of the markets they are serving.
The rich elites are using the wealth to control access to basic human needs (namely housing and healthcare) to squeeze the working population for every drop of money. Every wealth metric shows the 1% and the 1% of the 1% control successively larger portions of the economic pie. At this point money is ceasing to be a proxy for value and is becoming a tool for population control.
And the weird thing is it didn't use to be nearly this bad even a decade ago, and we can only guess how bad it will get in a decade, AGI or not.
Anyway, I don't want to turn this into a fully-written manifesto, but I have trouble expressing these ideas in a concise manner.
ac29 17 days ago [-]
> Somehow real estate has become so expensive everywhere that owning a shitty apartment is impossible for the vast majority.
Approximately 2/3s of homes in the US are owner occupied.
yen223 16 days ago [-]
It's interesting that the figure is similar in Australia, but from the POV of the people.
Approximately 2/3rds of Australians live in an owner-occupied home.
zahlman 17 days ago [-]
> When the world's population was exploding during the 20th century, housing prices were not a problem, yet somehow nowadays, it's impossible to build affordable housing to bring the prices down, though the population is stagnant or growing slowly.
In Canada, the population is still growing at a fairly impressive rate (https://www.macrotrends.net/global-metrics/countries/CAN/can...), and that growth tends to concentrate in major population centres. There are advocacy groups that seek to push Canadian population growth well above UN projections (e.g. the https://en.wikipedia.org/wiki/Century_Initiative "aims to increase Canada's population to 100 million by 2100") through immigration.
In Japan, where the population is declining, housing prices are not anything like the problem we observe in North America.
> Somehow real estate has become so expensive everywhere that owning a shitty apartment is impossible for the vast majority.
That's to be expected when governments forbid people from building housing. The only thing I find surprising is when people blame this on "capitalism".
nyarlathotep_ 17 days ago [-]
> And the weird thing is it didn't use to be nearly this bad even a decade ago, and we can only guess how bad it will get in a decade, AGI or not.
The last 5 years have reflected a substantial decline in QOL in the states; you don't even have to look back that far.
The coronacircus money-printing really accelerated the decline.
philipkglass 18 days ago [-]
> If AGI is possible, then the entire human economy stops making sense as far as money goes, and 'owning' part of OpenAI gives you power.
That's if AGI is possible and not easily replicated. If AGI can be copied and/or re-developed like other software then the value of owning OpenAI stock is more like owning stock in copper producers or other commodity sector companies. (It might even be a poorer investment. Even AGI can't create copper atoms, so owners of real physical resources could be in a better position in a post-human-labor world.)
whatshisface 17 days ago [-]
This belief comes from confusing the singularity (every atom on Earth is converted into a giant image of Sam Altman) with AGI (a store employee navigates a confrontation with an unruly customer, then goes home and wins at Super Mario).
baobabKoodaa 17 days ago [-]
If I recall correctly, these terms were used more or less interchangeably for a few decades, until 2020 or so, when OpenAI started making actual progress towards AGI, and it was clear that the type of AGI that could be imagined at that point, would not be of the type that would produce singularity.
fullstackchris 17 days ago [-]
Exactly. I continually fail to see how "the entire human economy ends" overnight with another human-like agent out there - especially if it's confined to a server in the first place - it can't even "go home" :)
Teever 17 days ago [-]
But what if that AGI can fit inside a humanoid robot and that robot is capable of self replication even if it means digging the sand out of the ground to make silicon with a spade?
Terr_ 17 days ago [-]
We already have humanoid intelligences that self-assemble and power themselves from common materials, as a colony of incredibly advanced nanobots.
Teever 17 days ago [-]
Yes. The goal is to emulate that with different substrates to understand how it works and to have better control over existing self-replicating systems.
richardw 17 days ago [-]
The first AGI will have such an advantage. It’ll be the first thing that is smart and tireless, can do anything from continuously hacking enemy networks to trading across all investment classes, to basically taking over the news cycle on social media. It would print money and power.
HDThoreaun 17 days ago [-]
Depends on how efficient it is. If it requires more processing power than we have to do all these things competitors will have time to catch up while new hardware is created.
AnimalMuppet 17 days ago [-]
The GP said, "and exponential". If AGI is exponential, then the first one will have a head start advantage that compounds over time. That is going to be hard to overcome.
philipkglass 17 days ago [-]
I believe that AGI cannot be exponential for long because any intelligent agent can only approach nature's limits asymptotically. The first company with AGI will be about as much ahead as, say, the first company with electrical generators [1]. A lot of science fiction about a technological singularity assumes that AGI will discover and apply new physics to develop currently-believed-impossible inventions, but I don't consider that plausible myself. I believe that the discovery of new physics will be intellectually satisfying but generally inapplicable to industry, much like how solving the cosmological lithium problem will be career-defining for whoever does it but won't have any application to lithium batteries.
I don't recall editing my message, but HN can be wonky sometimes. :)
Nothing is truly exponential for long, but the logistic curve could be big enough to do almost anything if you get imaginative. Without new physics, there are still some places where we can do some amazing things with the equivalent of several trillion dollars of applied R&D, which AGI gets you.
terribleperson 17 days ago [-]
This depends on what a hypothetical 'AGI' actually costs. If a real AGI is achieved, but it costs more per unit of work than a human does... it won't do anyone much good.
fullstackchris 17 days ago [-]
Sure but think of the Higgs... how long that took for just _one_ particle. You think an AGI, or even an ASI is going to make an experimental effort like that go any bit faster? Dream on!
It astounds me that people don't realize how much of this cutting edge science stuff literally does NOT happen overnight, and not even close to that; typically it takes on the order of decades!
datadrivenangel 17 days ago [-]
Science takes decades, but there are many places where we could have more amazing things if we spent 10 times as much on applied R&D and manufacturing. It wouldn't happen overnight, but it will be transformative if people can get access to much more automated R&D. We've seen a proliferation in makers over the last few decades as access to information is easier, and with better tools individuals will be able to do even more.
My point being that even if Science ends today, we still have a lot more engineering we can benefit from.
philipkglass 17 days ago [-]
I had to edit my message just now because I was actually unsure if you edited. Sorry for any miscommunication.
UltraSane 17 days ago [-]
If AGI is invented and the inventor tries to keep it secret then everyone in the world will be trying to steal it. And funding to independently create it would become effectively unlimited once it has been proven possible, much like with nuclear weapons.
Animats 17 days ago [-]
We may not need smarter AI. Just less stupid AI.
The big problem with LLMs is that most of the time they act smart, and some of the time they do really, really dumb things and don't notice. It's not the ceiling that's the problem. It's the floor. Which is why, as the article points out, "agents" aren't very useful yet. You can't trust them to not screw up big-time.
robertlagrant 17 days ago [-]
> If AGI is possible, then the entire human economy stops making sense as far as money goes,
What does this mean in terms of making me coffee or building houses?
com2kid 17 days ago [-]
If we can simulate a full human intelligence at a reasonable speed, we can simulate 100 of them and ask the AGI to figure out how to make itself 10x faster.
Rinse and repeat.
That is exponential take off.
At the point where you have an army of AIs running at 1000x human speed, you can just ask it to design the mechanisms for, and write the code for, robots that automate any possible physical task.
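The arithmetic behind "rinse and repeat" is easy to write down. In this toy sketch the 10x gain per round comes from the comment above, while the optional `ceiling` (a hard limit on achievable speed) is purely an assumption added to show how a cap changes the picture.

```python
from typing import Optional


def speedup_after(rounds: int, gain_per_round: float = 10.0,
                  ceiling: Optional[float] = None) -> float:
    """Speed relative to a human after repeated self-improvement rounds."""
    speed = 1.0
    for _ in range(rounds):
        speed *= gain_per_round
        if ceiling is not None:  # assumed hard limit, e.g. hardware or physics
            speed = min(speed, ceiling)
    return speed


print(speedup_after(3))              # 1000.0: three rounds reach 1000x
print(speedup_after(3, ceiling=50))  # 50: a cap flattens the curve
```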
edflsafoiewq 17 days ago [-]
There are about 8 billion human intelligences walking around right now and they've got no idea how to begin making even a stupid AGI, let alone a superhuman one. Where does the idea that 100 more are going to help come from?
nickpsecurity 17 days ago [-]
This was my argument a long time ago. The common counter was that we’d have a bunch of geniuses that knew tons of research. Well, we probably already have millions of geniuses. If anything, they use their brains for self-enrichment (eg money, entertainment) or on a huge assortment of topics. If all the human geniuses didn’t do it, then why would the AGI instances do it?
We also have people brilliant enough to maybe solve the AGI problem or cause our extinction. Some are amoral. Many mechanisms pushed human intelligences in other directions. They probably will for our AGI’s assuming we even give them all the power unchecked. Why are they so worried the intelligent agents will not likewise be misdirected or restrained?
What smart, resourceful humans have done (and not done) is a good, starting point for what AGI would do. At best, they’ll probably help optimize some chips and LLM runtimes. Patent minefields with sub-28nm design, especially mask-making, will keep unit volumes of true AGI’s much lower at higher prices than systems driven by low-paid workers with some automation.
GOD_Over_Djinn 17 days ago [-]
This sounds like magic, not science.
EMIRELADERO 17 days ago [-]
What do you mean by this? Is there any fundamental property of intelligence, physicality, or the universe, that you think wouldn't let this work?
fullstackchris 17 days ago [-]
Not OP but yes. Electron size vs band gap, computing costs (in terms of electricity), any other raw materials needed for that energy, etc... sigh... it's physics, always physics... what fundamental property of physics do you think would let a vertical take off in intelligence occur?
datadrivenangel 17 days ago [-]
If you look at the rate of mathematical operations conducted, we're already going hard vertical. Physics and material limitations will slow that eventually as we reach a marginal return on converting the planet to computer chips, but we're in the singularity as proxy measured by mathematical operations.
Terr_ 17 days ago [-]
> If you look at the rate of mathematical operations conducted, we're already going hard vertical.
Not if you remember to count all the computations being done by the quintillions of nanobots across the world known as "human cells."
That's not only inside cells, and not just neurons either. For example, your thyroid is busy brute-forcing the impossibly large space of antibody combinations, and putting every candidate cell-release through a very rigorous set of acceptance tests.
HDThoreaun 17 days ago [-]
The human brain still has orders of magnitude more processing power than LLMs. Even if we develop superintelligence, the current hardware can't run it, which gives competitors time to catch up.
torginus 17 days ago [-]
Nothing, and the hilarious thing is that the AI figureheads admit that technology (as defined by new theorems produced and new code written) will do pathetically little to move the needle on human happiness forward.
The guy running Anthropic thinks the future is in biotech, developing the cure to all diseases, eternal youth etc.
Which is technology all right, but it's unclear to me how these chatbots (or other AI systems) are the quickest way to get there.
robertlagrant 12 days ago [-]
I think it's definitely the case that AI has certain really useful niches, but it is hard to know the ones that will really make enough money to make training an AI worth it. E.g. "parse a million company statements into these criteria and invest based on an algorithm on those criteria" might be very valuable. Maybe someone's doing it. But I struggle to think it'd be worth billions.
hdjjhhvvhga 17 days ago [-]
> If AGI is possible, then the entire human economy stops making sense as far as money goes
I heard people on HN saying this (even without the money condition) and I fail to grasp the reasoning behind it. Suppose in a few years Altman announces a model, say o11, that is supposedly AGI, and in several benchmarks it hits over 90%. I don't believe it's possible with LLMs because of their inherent limitations but let's assume it can solve general tasks in a way similar to an average human.
Now, how come that "the entire human economy stops making sense"? In order to eat, we need farmers, we need construction workers, shops etc. As for white collar workers, you will need a whole range of people to maintain and further develop this AGI. So IMHO the opposite is true: the human economy will work exactly as before but the job market will continue to evolve with people using AGI in a similar way that they use LLMs now but probably with greater confidence. (Or not.)
SmooL 17 days ago [-]
The thinking goes:
- any job that can be done on a computer is immediately outsourced to AI, since the AI is smarter and cheaper than humans
- humanoid robots are built that are cheap to produce, using tech advances that the AI discovered
- any job that can be done by a human is immediately outsourced to a robot, since the robot is better/faster/stronger/cheaper than humans
exe34 17 days ago [-]
If you think about all the people trying to automate away farming, construction, transport/delivery - these people doing the automation themselves get automated out first, and the automation figures out how to do the rest. So a fully robotic economy is not far off, if you can achieve AGI.
datadrivenangel 17 days ago [-]
Why do we work? Ultimately, we work to live.* If the value of our labor is determined by scarcity, then what happens when productivity goes nearly infinite and the scarcity goes away? We still have needs and wants, but the current market will be completely inverted.
Terr_ 18 days ago [-]
One stratum in that assumption-heap to call out explicitly: Assuming LLMs are an enabling route to AGI and not a dead-end or supplemental feature.
parpfish 18 days ago [-]
Well, AGI would make the brainy information worker part of the economy obsolete. We’ll still need the jobs that interact with the physical world for quite a while. So… all us HN types should get ready to work the mines or pick vegetables
throwup238 17 days ago [-]
If we hit true AGI, physical labor won’t be far behind the knowledge workers. The first thing industrial manufacturers will do is turn it towards designing robotics, automating the design of factories, and researching better electromechanical components like synthetic muscle to replace human dexterity.
IMO we’re going to hit the point where AI can work on designing automation to replace physical labor before we hit true AGI, much like we’re seeing with coding.
api 18 days ago [-]
If AGI is possible then that too becomes a commodity and we experience a massive round of deflation in the cost of everything not intrinsically rare. Land, food, rare materials, energy, and anything requiring human labor is expensive and everything else is almost free.
I don't see how OpenAI wouldn't crash and burn here. Given the history of models it would be at most a year before you'd have open AGI, then the horse is out of the barn and the horse begins to self-improve. Pretty soon the horse is a unicorn, then it's a Satyr, and so on.
(I am a near-term AGI skeptic BTW, but I could be wrong.)
OpenAI's valuation is a mixture of hype speculation and the "golden boy" cult around Sam Altman. In the latter sense it's similar to the golden boy cults around Elon Musk and (politically) Donald Trump. To some extent these cults work because they are self-fulfilling feedback loops: these people raise tons of capital (economic or political) because everyone knows they're going to raise tons of capital so they raise tons of capital.
criddell 18 days ago [-]
> what justifies this
People are buying shares at $x because they believe they will be able to sell them for more later. I don’t think there’s a whole lot more to it than that.
ec109685 17 days ago [-]
OpenAI is becoming synonymous with consumer AI. It has potential of disrupting Google’s cash cow, which explains at least a chunk of the valuation.
OpenAI predicts more revenue from ChatGPT than api access through 2029.
It’s the old Netflix / HBO trope of which can become the other first: hbo figure out streaming or Netflix figure out original programming.
I bet Google will figure this out and thus OpenAI won’t disrupt as much as people think it will.
CRConrad 11 days ago [-]
> It’s the old Netflix / HBO trope of which can become the other first: hbo figure out streaming or Netflix figure out original programming.
Tangential: So how is that race going, has either taken a commanding lead? (Or, hey, is it over already; has either of them won and the other lost? (Yeah, guess if I'm very well-informed on that industry or not...))
throwpoaster 18 days ago [-]
157 billion implies about a 1% chance at dominating a 1.5 trillion market. Seems reasonable.
asqueella 18 days ago [-]
10%, no?
throwpoaster 17 days ago [-]
No, there’s a risk term I’m skipping over.
airstrike 17 days ago [-]
that's 10% and who's to say that market is worth 1.5 trillion to begin with
throwpoaster 17 days ago [-]
There’s a risk term I’m not including and the comparable is the size of the American economy ($27 trillion).
So take the entire economy and ask the question: what does AI not impact? Net that out and assume there’s pricing efficiencies, then build in a risk buffer.
1.5t to 15t seems right.
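The disagreement over 1% vs 10% in this subthread is just a division. A quick check, ignoring the risk term mentioned above (the function name is made up for illustration):

```python
def implied_probability(valuation: float, market_size: float) -> float:
    """Chance of capturing the whole market that would justify the valuation,
    with no risk discount applied."""
    return valuation / market_size


valuation = 157e9  # the 157 billion figure from the thread
for market in (1.5e12, 15e12):
    p = implied_probability(valuation, market)
    print(f"${market / 1e12:.1f}T market -> {p:.1%}")
# $1.5T market -> 10.5%
# $15.0T market -> 1.0%
```

So a 1.5 trillion market implies roughly 10%, and the 1% figure only appears at the 15 trillion end of the range.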
cloverich 17 days ago [-]
Market cap of apple, google, facebook.
tantalor 17 days ago [-]
Market cap and market size are totally different measures
benreesman 18 days ago [-]
Us skeptics believe that valuation prices in some form of regulatory capture or other non-market factor.
The non-skeptical interpretation is that it's a threshold function, a flat-out race with an unambiguous finish line. If someone actually hit self-improving AGI first there's an argument that no one would ever catch up.
com2kid 17 days ago [-]
There are some really good books about wars between cultures that have AGI and it always comes down to math - whoever can get their hands on more compute faster wins.
api 17 days ago [-]
This is also a strong argument for immigration, particularly high-skill immigration. In the absence of synthetic AGI whoever imports the most human AGI wins.
Jensson 17 days ago [-]
Which suggests that total AGI compute doesn't matter that much, since India isn't the world leader that the amount of human compute it possesses would suggest.
What matters is how you use the AGI, not how much you have; with wrong, bad, or limiting regulations it will not lead anywhere.
refulgentis 18 days ago [-]
Been in the Mac ecosystem since 2008, love it, but there is, and always has been, a tendency to talk about inevitabilities from scaling bespoke, extremely expensive configurations, and with LLMs, there's heavy eliding of what the user experience is, beyond noting response generation speed in tokens/s.
They run on a laptop, yes - you might squeeze up to 10 token/sec out of a kinda sorta GPT-4 if you paid $5K plus for an Apple laptop in the last 18 months.
And that's after you spent 2 minutes watching 1000 token* prompt prefill at 10 tokens/sec.
Usually it'd be obvious this'd trickle down, things always do, right?
But...Apple infamously has been stuck on 8GB of RAM in even $1500 base models for years. I have 0 idea why, but my intuition is RAM was ~doubling capacity at same cost every 3 years till early 2010s, then it mostly stalled out post 2015.
And regardless of any of the above, this absolutely melts your battery. Like, your 16 hr battery life becomes 40 minutes, no exaggeration.
I don't know why prefill (loading in your prompt) is so slow for local LLMs, but it is. I assume if you have a bunch of servers there's some caching you can do that works across all prompts.
I expect the local LLM community to be roughly the same size it is today 5 years from now.
* ~3 pages / ~750 words; what I expect is a conservative average for prompt size when coding
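To put numbers on the wait described above: at these speeds the prompt alone takes about 100 seconds before the first output token appears. A back-of-envelope sketch, where the 300-token reply length is an assumption and the function name is invented:

```python
def response_time_s(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Seconds to read the prompt (prefill) plus seconds to generate the reply."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# 1000-token prompt and 10 tokens/sec are the figures from the comment above.
print(f"{response_time_s(1000, 0, 10, 10):.0f} s")    # 100 s of prefill alone
print(f"{response_time_s(1000, 300, 10, 10):.0f} s")  # 130 s with a 300-token reply
```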
lowercased 18 days ago [-]
I have a 2023 mbp, and I get about 100-150 tok/sec locally with lmstudio.
datadrivenangel 17 days ago [-]
Which models?
refulgentis 17 days ago [-]
For context, I got M2 Max MBP, 64 GB shared RAM, bought it March 2023 for $5-6K.
I'm not sure if you're pointing out any / all of these:
#1. It is possible to get an arbitrarily fast tokens/second number, given you can pick model size.
#2. Llama 1B is roughly GPT-4.
#3. Given Llama 1B runs at 100 tokens/sec, and given performance at a given model size has continued to improve over the past 2 years, we can assume there will eventually be a GPT-4 quality model at 1B.
On my end:
#1. Agreed.
#2. Vehemently disagree.
#3. TL;DR: I don't expect that, at least, the trend line isn't steep enough for me to expect that in the next decade.
lowercased 17 days ago [-]
I specifically missed the GPT4 part of "up to 10 token/sec out of a kinda sorta GPT-4". Was just looking at token/sec.
hyperpape 17 days ago [-]
This seems like a non-sequitur unless you’re assuming something about the amount that people use models.
Most web servers can run some number of QPS on a developer laptop, but AWS is a big business, because there are a heck of a lot of QPS across all the servers.
slimsag 18 days ago [-]
Unless the best models themselves are costly/hard to produce, and there is not a company providing them to people free of charge AND for commercial use.
m3kw9 17 days ago [-]
The best models are always out of reach on desktops. You can have ok models but AGI will come in a datacenter first
epicureanideal 18 days ago [-]
And of course, as processors improve this becomes more and more the case.
sowbug 17 days ago [-]
Simon has mentioned in multiple articles how cool it is to use 64GB DRAM for GPU tasks on his MacBook. I agree it's cool, but I don't understand why it is remarkable. Is Apple doing something special with DRAM that other hardware manufacturers haven't figured out? Assuming data centers are hoovering up nearly all the world's RAM manufacturing capacity, how is Apple still managing to ship machines with DRAM that performs close enough for Simon's needs to VRAM? Is this just a temporary blip, and PC manufacturers in 2025 will be catching up and shipping mini PCs that have 64GB RAM ceilings with similar memory performance? What gives?
minimaxir 17 days ago [-]
LLMs run on the GPU, and the unified memory of Apple silicon means that the 64 GB can be used by the GPU.
Consumer GPUs top out at 24 GB VRAM.
karolist 17 days ago [-]
llama.cpp can run LLMs on CPU. iGPU can also use system memory, the novel thing is not that, it's that the LLM inference is mostly memory bandwidth bound and memory bandwidth of a custom built PC with really fast DDR5 RAM is around 100GB/s, nVidia consumer GPUs at the top end are around 1TB/s, with mid range GPUs at around half that. M1 Max has 400GB/s, M1 Ultra is 800GB/s, but you can have Apple Silicon Macs with up to 192GB of 800GB/s memory usable by GPU, this means much faster inference than just CPU+system memory due to bandwidth and more affordable than building a multi-GPU system to match the memory amount.
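The bandwidth argument can be turned into a rough ceiling: if generating each token means reading every weight once, tokens/sec can't exceed memory bandwidth divided by model size. The sketch below uses the bandwidth figures quoted above; the 40 GB model size is an assumption for illustration, and real throughput will be lower because compute, KV cache reads and overhead are ignored.

```python
def max_decode_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/sec for a dense model at batch size 1,
    assuming every weight is read once per generated token."""
    return bandwidth_gb_s / model_size_gb


MODEL_GB = 40  # assumed size of the quantized weights
for name, bandwidth in [("DDR5 PC", 100), ("M1 Max", 400),
                        ("M1 Ultra", 800), ("top consumer GPU", 1000)]:
    print(f"{name}: ~{max_decode_tps(bandwidth, MODEL_GB):.1f} tok/s upper bound")
```

The GPU line only applies if the model actually fits in its VRAM, which is the point being made about large unified memory.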
dekhn 17 days ago [-]
It'd be really nice to have good memory bandwidth usage metrics collected from a wide range of devices while doing inference.
For example, how close does it get to the peak, and what's the median bandwidth during inference? And is that bandwidth, rather than some other clever optimization elsewhere, actually providing the Mac's performance?
Personally, I don't develop HPC stuff on a laptop - I am much more interested in what a modern PC with Intel or AMD and Nvidia can do when maxed out. But it's certainly interesting to see that some of Apple's arch decisions have worked out well for local LLMs.
com2kid 17 days ago [-]
Apple uses HBM, basically RAM on the same die as the CPU. It has a lot more memory bandwidth than typical PC DRAM, but still less than many GPUs. (Although the highest-end Macs have bandwidth that is in the same ballpark as GPUs.)
jsheard 17 days ago [-]
Apple does not use HBM, they use LPDDR. The way they use it is similar in principle to HBM (on-package, very wide bus) but it's not the same thing.
karmakaze 17 days ago [-]
Right so Apple uses high-bandwidth memory, but not HBM.
justincormack 17 days ago [-]
It's not HBM, which GPUs tend to use, but it is on-package and has a wider interface than other PCs.
post-it 17 days ago [-]
Apple designs its own chips, so the RAM and CPU are on the same die and can talk at very high speeds. This is not the case for PCs, where RAM is connected externally.
mhh__ 17 days ago [-]
It's on the same package but the same die?
throwanem 18 days ago [-]
> I’ve heard from sources I trust that both Google Gemini and Amazon Nova charge less than their energy costs for running inference...
Then, several headings later:
> I have it on good authority that neither Google Gemini nor Amazon Nova (two of the least expensive model providers) are running prompts at a loss.
So...which is it?
simonw 18 days ago [-]
Oh whoops! That's an embarrassing mistake, and I didn't realize I had that point twice.
They're not running at a loss. I'll fix that.
cess11 18 days ago [-]
If they are subsidised they can make a profit while still not making enough money to cover energy costs.
simonw 18 days ago [-]
The tip I got about both Gemini and Nova is that the low prices they are charging still cover their energy costs.
cess11 17 days ago [-]
OK!
kgwgk 18 days ago [-]
Subsidised by whom?
cess11 17 days ago [-]
E.g. tax payers.
kgwgk 17 days ago [-]
Are tax payers subsidizing that particular activity of Google or Amazon? If they do, “they make enough money” to cover costs. If they don’t, how does it become profitable if it doesn’t even cover the cost of one of the inputs?
cess11 17 days ago [-]
Where I live, corporations like those get to build data centers and receive energy subsidies from the state, i.e. tax payers pay a part of their energy bills. This isn't money they're making, it's money other people made and gave to them.
This means that they could make a profit off inference models without the revenue being large enough to pay the energy costs.
Whether that's the case, I don't know. I'm more concerned with getting rid of those corporations altogether, since interacting with them is generally forbidden due to the lack of data protection regulations in the US.
kgwgk 17 days ago [-]
Subsidies are often in the form of tax credits - they cannot be really used to pay for things. I'm not sure if "energy subsidies" may be about providing energy below the cost of production but it's true that the "true" cost of production is not clear when a political decision to close nuclear plants, for example, introduces a distortion on their useful life and their amortised cost.
andrethegiant 17 days ago [-]
> I find the term “agents” extremely frustrating. It lacks a single, clear and widely understood meaning... but the people who use the term never seem to acknowledge that.
This 100%. “Agentic” especially as a buzzword can piss off.
Genuinely the best piece of writing I've seen about agents anywhere.
cbeach 17 days ago [-]
The software "has agency"? That is, I can entrust it to carry out the task I've described, to completion, without telling it how to perform the task?
simonw 17 days ago [-]
That's one of the more common definitions people use - especially people who aren't directly building agents, since the builders tend to get more hung up on "LLM with access to tools" or similar.
My problem is when people use that definition (or any other) without clarifying, because they assume it's THE obvious definition.
tucnak 17 days ago [-]
Workflows aside, I think "interruptible work" is what matters, really. That is, maintaining state in-between inferences so that it follows some well-defined goal.
nextworddev 17 days ago [-]
Simon does great work serving as an LLM historian. Have a happy 2025!
dtquad 18 days ago [-]
What is the current status on pushing "reasoning" down to latent/neural space? Seems like a waste of tokens to let a model converse with itself, especially when this internal monologue often has very little to do with the final output, so it's not useful as a log of how the final output was derived.
Nice overview. The challenge ahead for “AI” companies is that it appears there’s really no technical moat here. Someone comes out with something amazing and new and within months (if not weeks or days) it’s quickly copied. That environment where everything quickly becomes a commodity is a recipe for many/most companies in this space to quickly get washed out as it becomes economically unviable to play in such an environment.
The money is still flowing, for now, to subsidize that fiasco but as soon as that starts to slow, even just a bit, things are gonna get bumpy real quick. Super excited about this tech but there are dark storm clouds building on the horizon and absent a major “moat” breakthrough it’s gonna get rough soon.
Legend2440 17 days ago [-]
That may be a challenge for AI companies but that doesn't sound like a problem to me. Commodities are great for consumers.
JCM9 17 days ago [-]
Not necessarily. The playbook of what tends to happen is first a bunch of players go bust in the race to the bottom, then the survivors are free to raise prices a bit when others realize there’s not much point in entering a race to the bottom. Those left then let quality slip as competition cools.
That’s exactly what happened with rideshare companies. It was an amazing new thing but subsidized in an unsustainable way, then a bunch of companies exited the space when it became a commoditized race to the bottom and those left let quality slip. Now when you order an Uber a car shows up that smells bad and has wheels about to fall off. The consumer experience was a lot better when Uber was a VC-subsidized bonanza.
dash2 17 days ago [-]
Look, when are these models going to not just talk to me, but do stuff for me? If they're so clever, why can't I tell one to buy chocolates and send them to my wife? Meanwhile, they can allegedly solve frontier maths problems. What's the holdup to models that go online and perform simple tasks?
munchler 17 days ago [-]
LLMs are inherently untrustworthy. They're very good at some tasks, but they still need to be checked and/or constrained carefully, which makes them probably not the best technology on which to base real-world autonomous agents.
gs17 17 days ago [-]
> why can't I tell one to buy chocolates and send them to my wife?
I don't think the released version of the feature can do it, but it should be possible with today's tech.
icelancer 17 days ago [-]
The last mile problem remains undefeated.
ripped_britches 17 days ago [-]
Same reason that a powerful graphing calculator can’t teach a math class. “Unhobbling” needs to occur. This means a lot of things but includes modalities, reliability, persistence, alignment, etc.
whatshisface 17 days ago [-]
Can someone please just tell me what model and workflow is so productive? I've seen so many allusions to the concept of skills for LLM use but no explanations of what they are.
simonw 17 days ago [-]
The best LLM for code right now, in my opinion, is still Claude 3.5 Sonnet.
The big challenge is figuring out how to use it. I usually like working at the function level: I figure out the exact function signature I want in Python or JavaScript and then get Claude to implement it for me.
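As an illustration of working at the function level, here is a minimal sketch of what such a request might look like. The function `group_by_day`, its docstring, and the prompt wording are all hypothetical examples made up for this sketch, not something taken from the comment; the point is only that the signature and behaviour are pinned down before the model is asked to write the body.

```python
# Hypothetical example: decide the exact signature and contract first,
# then hand that (and only that) to the model to implement.
SIGNATURE = '''
def group_by_day(timestamps: list[str]) -> dict[str, int]:
    """Count ISO 8601 timestamps per calendar day.

    Keys are "YYYY-MM-DD" strings, values are counts.
    Raise ValueError if any timestamp cannot be parsed.
    """
'''


def build_prompt(signature: str) -> str:
    """Wrap a function stub in a short, precise implementation request."""
    return (
        "Implement this Python function exactly as specified. "
        "Do not change the signature. Return only code.\n\n"
        + signature.strip()
    )


print(build_prompt(SIGNATURE))
```

The printed text is what would be pasted into the chat; the reply can then be checked against the docstring before it goes anywhere near a codebase.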
Claude Artifacts are neat too: Claude can build a full HTML+JavaScript UI, and then iterate on it. I use this for interactive UI prototypes and building small tools.
I found it easiest to use Aider with Claude. It's also IDE independent.
submeta 17 days ago [-]
Thank you Simon for the excellent work you do! I learned a lot from you and enjoy reading everything you write. Keep it up. And happy new year.
pkoird 18 days ago [-]
I'd love to read a semi-technical book on everything that we've learned about what works and what does not on LLMs.
nkingsy 17 days ago [-]
It would be out of date in months.
Things that didn’t work 6 months ago do now. Things that don’t work now, who knows…
minimaxir 17 days ago [-]
There are still some tropes from the GPT-3 days that are fundamental to the construction of LLMs that affect how they can be used and will not change unless they no longer are trained to optimize for next-token-prediction (e.g. hallucinations and the need for prompt engineering)
DoctorOetker 17 days ago [-]
Do you mean performance that was missing in the past is now routinely achieved?
Or do you actually mean that the same routines and data that didn't work before suddenly work?
nkingsy 16 days ago [-]
B
Each new model opens up new possibilities for my work. In a year it's gone from sort of useful but I'd rather write a script, to "gets me 90% of the way there with zero shots and 95% with few-shot"
legendofbrando 17 days ago [-]
@simonw you’ve been awesome all year; loved this recap and look forward to more next year
n144q 17 days ago [-]
About "knowledge is incredibly unevenly distributed", an interesting fact is that women are much less likely to use LLMs, if they hear about them/follow updates in the first place:
I didn't realize "agent" designs were that ambiguously defined. Every AI engineer I've talked to uses it to mean a design that combines several separate LLM prompts (or even models) to solve problems in multiple stages.
simonw 17 days ago [-]
I'll add that one to the list. Surprisingly it doesn't closely match most of the 211 definitions I've collected already!
If the investors ask, those same AI engineers will probably allow the answer to be much more ambiguous.
voidhorse 17 days ago [-]
Great write up! Unfortunately, I think this article accurately reflects how we've made little progress on the most important aspects of LLM hype and use: the social ones.
A small number of people with lots of power are essentially deciding to go all in on this technology presumably because significant gains will mean the long term reduction of human labor needs, and thus human labor power. As the article mentions, this also comes at huge expenditure and environmental impact, which is already a very important domain in crisis that we've neglected. The whole thing especially becomes laughable when you consider that many people are still using these tools to perform tasks that could be performed with a margin of more effort using existing deterministic tools. Instead we are now opting for a computationally more expensive solution that has a higher margin of error.
I get that making technical progress in this area is interesting, but I really think the lower level workers and researchers exploring the space need to be more emphatic about thinking about socioeconomic impact. Some will argue that this is analogous to any other technological change and markets will adjust to account for new tool use, but I am not so sure about this one. If the technology is really as groundbreaking as everyone wants us to believe then logically we might be facing a situation that isn't as easy to adapt to, and I guarantee those with power will not "give a little back" to the disenfranchised masses out of the goodness of their hearts.
This doesn't even raise all the problems these tools create when it comes to establishing coherent viewpoints and truth in ostensibly democratic societies, which is another massive can of worms.
xnx 17 days ago [-]
I've been surprised that ChatGPT has hung on as long as it has. Maybe 2025 is the year Microsoft pushes harder for their brand of LLM.
bwhiting2356 17 days ago [-]
Some amount of LLM gullibility may be needed. Let's say I have a RAG use case for internal documents about how my business works. I need the LLM to accept what I'm telling it about my business as the truth without questioning it. If I got responses like "this return policy is not correct", LLMs would fail at my use case.
layer8 17 days ago [-]
You don’t need gullibility for that, just the ability to work based on premises (hypotheticals) that you feed it. To the LLM it shouldn’t matter if the hypotheticals are real or not. That’s independent of whether the LLM judges them as plausible or not. Not being able to semi-accurately judge the plausibility of things would make it gullible.
agentultra 18 days ago [-]
Don’t forget that 2024 was also a record year for new methane power plant projects. Some 200 new projects in the US alone and I’d wager most of them are funded directly by big tech for AI data centres.
This is definitely extending the runway of O&G at a crisis point in the climate disaster when we’re supposed to be reducing and shutting down these power plants.
Update: clarified the 200 number is in the US. There are far more world wide.
comte7092 18 days ago [-]
Energy generation methods aren’t fungible.
Methane plants are favored in many cases because they can be quickly ramped up and down to handle momentary peaks in demand or spotty supply generated from renewables.
Without knowing more details about those projects it is difficult to make the claim that these plants have anything to do with increased demand due to LLMs, though if anything, they’d just add to base load demands and lead to slower decommissioning of old coal plants like we’ve seen with bitcoin mines.
throwup238 18 days ago [-]
Methane is also worth burning to lessen the GHG impact since we produce so much of it as a byproduct of both resource extraction and waste disposal anyway.
api 18 days ago [-]
The only thing that will stop this is for battery storage to get cheap and available enough that it can cover for renewables. If we are still building gas turbines it means that hasn’t happened yet.
AI is a red herring. If it wasn’t that it would be EV power demand. If it wasn’t that it would be reshoring of manufacturing. If it wasn’t that it would be population growth from immigration. If it wasn’t that it would be replacing old coal power plants reaching EOL.
Replacing coal with gas is an improvement by the way. It’s around half the CO2 per kWh, sometimes less if you factor in that gas turbines are often more efficient than aging old coal plants.
agentultra 18 days ago [-]
Methane has a shorter half-life than CO2 but is a far worse greenhouse gas, retaining far more heat.
And delivering methane leaks like a sieve into the atmosphere from all parts of the process.
Sure it’s probably “better than coal,” but not by much. It’s a bit like comparing what’s worse: getting burned by fire or being drowned in acid.
Nition 17 days ago [-]
Pumped hydro is an excellent form of storage if you have the terrain for it. A whole order of magnitude cheaper than battery storage at the moment.
ToucanLoucan 18 days ago [-]
It would be really cool if big tech could find a new hyperscaler model that didn't also require offsetting the goals of green energy projects worldwide. Between LLM and crypto you'd swear they're trying to find the most energy-wasteful tech possible.
ben_w 18 days ago [-]
With cryptocurrency, at least PoW, the point is indeed to be the most wasteful: a literal Dyson swarm powered Bitcoin would provide exactly the same utility as the BTC network already had in 2010.
LLMs (and the image, sound, and movie generating models) are more coincidentally power-hogs — people are at least trying to make them better at fixed compute, and lower compute at fixed quality.
ToucanLoucan 17 days ago [-]
I mean, I appreciate that distinction and don't disagree. And, if this is going to continue being a trend, I think we need more stringent restrictions on what sorts of resources are permitted to be consumed in the power plants that are constructed to meet the needs of hyperscaler data centers.
Because whether we're using tons of compute to provide value or not doesn't change that we are using tons of compute and tons of compute requires tons of energy, both for the chips themselves, and the extensive infrastructure that has to be built around them to let them work. And not just electricity: refrigerants, many of which are environmentally questionable themselves, are a big part; hell, just water. Clean, usable water.
If we truly need these data centers, then fine. Then they should be powered by renewable energy, or if they absolutely cannot be, then the costs their nonrenewable energy sources inflict on the biosphere should be priced into their construction and use, and in turn, priced into the tech that is apparently so critical for them to have.
This is like, a basic calculus that every grown person makes dozens of times a day: do I need this? And they don't get to distribute the cost of that need, however prescient it may be, on their wider community because they can't afford it otherwise. I don't see why Microsoft should be able to either. If this is truly the tech of the future as it is constantly propped up to be, cool. Then charge a price for it that reflects what it costs to use.
lucubratory 17 days ago [-]
I think basically everyone should support a carbon tax. It's a really obvious solution that is both environmentally friendly and should be acceptable to free market fanatics because it is explicitly and only taxing a negative externality on the public - it's hard to imagine a more justified tax.
Combined with the increased cost effectiveness of renewables & batteries, & the new build-out of nuclear, it could plausibly speed up the clean energy transition, rather than just disincentivising building out more polluting power plants.
There are two main options for what to do with revenue from a carbon tax. The one that makes the most macroeconomic sense is to use those proceeds to fund subsidies for clean energy roll outs & grid adaptation. You are directly taxing the polluting power grid to fund the construction of a non-polluting power grid. As CO2 emitting industry (and thus carbon tax revenue) declines, we have less required spend on clean energy roll out, so the tax would balance nicely. The downside would be that a carbon tax would increase cost of living and this does nothing about that.
The other option is a disbursement. Give everyone in society a payment directly from the proceeds of the carbon tax. This would offset the regressive aspects of a carbon tax (because that tax would increase consumer costs), and would also act as a sort of auto-stimulus to stop the economy from turning down due to consumption costs increasing. The downside of this is that the clean energy transition happens slower than the above, and that there may be political instability & perverse incentives as people maybe come to rely on this payment that has to go away over the next few decades.
They're both good options. I don't know which is better and I think that's likely something individual countries will probably choose based on their situation. But we do need some sort of way to make those emitting CO2 pay for its negative externalities.
ben_w 11 days ago [-]
I'd be fine with a carbon tax, if only we could get every nation to do it near-simultaneously (within a few years of each other at most) — I don't think it's sufficient for any one nation to say they'll do that for local production plus an equivalent import tariff to compensate for what other nations are doing (we also want to lower emissions of everyone else's internal markets, but also there's a lot of people who will fraudulently claim they're eco-friendly when they're not, and that's harder to catch when there is an international border in the way) — but "the perfect is the enemy of the good", and this may still be a step in the right direction even if my concerns are proportionate to the actual risks (which they may not be).
I think the rapidly decreasing costs of renewables and storage are likely to make the transition happen before the political will to get a carbon tax, but if you reckon you can push the right buttons, I encourage you to try it :)
zachrip 18 days ago [-]
It seems odd to put crypto and LLMs in the same boat in this regard - I might be wrong but are there any crypto projects that actually provide value? I'm sure there are ones that do folding or something but among the big ones?
But this is the promise of uncontrollable decentralization providing value, for good or bad?
blibble 17 days ago [-]
crypto has real uses, most of them illegal
meanwhile "AI" is used to produce infinity+1 pictures of shrimp jesus and more spam than we've ever known before
and if we're really lucky, it will put us all out of work
uludag 17 days ago [-]
But according to the author, apparently bringing this up isn't helpful criticism.
I'm curious what people's thoughts are on what the future of LLMs would be like if we severely overshoot our carbon goals. How bad would things have to get for people to stop caring about this technology?
simonw 17 days ago [-]
It's helpful criticism as part of the conversation. What frustrates me is when people go "LLMs are burning the planet!" and leave it at that.
agentultra 17 days ago [-]
It is a rather contrasting opinion that the trade-offs to have AI aren’t worth the value they bring.
The growth in this technology isn’t outpacing car pollution and O&G extraction… yet, but the growth rate has been enough in recent years to put it on the radar of industries to watch out for.
I hope the compute efficiencies are rapid and more than commensurate with the rate of growth so that we can make progress on our climate targets.
However it seems unlikely to me.
It’s been a year of progress for the tech… but also a lot of setbacks for the rest of the world. I’m fairly certain we don’t need AGI to tell us how to cope with the climate crisis; we already have the answer for that.
Although if the industry does continue to grow and the efficiency gains aren’t enough… will society/investors be willing to scale back growth in order to meet climate targets (assuming that AI becomes a large enough segment of global emissions to warrant reductions)?
Interesting times for the field.
neom 18 days ago [-]
"learned out about" - is that an Australian phraseology by chance? Sounds Australian or British of some manner.
simonw 18 days ago [-]
That was a very dumb typo in my title!
neom 18 days ago [-]
I figured as much, although I wondered if you were going for the kinda "he learn out about not pissing people off real sharpish" kinda tone I've heard in Scotland before, but wasn't sure. Big fan btw, happy new years Simon! :)
mjburgess 18 days ago [-]
Good ear -- the use of 'out' as an abbreviation of anything is a britishism.
Nowt, owt, -- nothing, anything
CRConrad 11 days ago [-]
Naught, aught.
user982 18 days ago [-]
You can find out, you can learn about, but you can't learn out about.
slyall 17 days ago [-]
Australians or Brits would tend to say "learnt" rather than "learned"
nektro 17 days ago [-]
i learned this industry has lower morals and standards for excellence than i ever previously expected
xnx 17 days ago [-]
Double checking, I don't think I saw anything about video generation. Not sure if those fall under the "LLM" umbrella. It came very late in the year, but the results from Google Veo 2's limited testing are astounding. There are at least a half-dozen other services where you can pay to generate video.
baobabKoodaa 17 days ago [-]
Video generation was covered in OP
orsenthil 17 days ago [-]
One of the best written summaries of LLMs for the year 2024.
We have all silently started to notice slop; hopefully we can recognize it more easily and prevent it.
Test Driven Development (Integration Tests or functional tests specifically) for Prompt Driven Development seems like the way to go.
Thank you, Simon.
nobodywillobsrv 17 days ago [-]
One interesting test that I see nearly all LLMs fail is coherent responses to tax questions.
calebm 17 days ago [-]
I love your breadth-first approach of having an outline at the top.
simonw 17 days ago [-]
I wrote custom software for that! https://tools.simonwillison.net/render-markdown - If you paste in some Markdown with ## section headings in it the output will start with a <ul> list of links to those headings.
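For readers curious how a heading outline like this can be produced, here is a rough sketch of the idea as described: scan Markdown for `## ` headings and emit a `<ul>` of links to them. This is not the code behind the linked tool; the function names and the slug scheme for the anchors are assumptions made for illustration.

```python
import html
import re

FENCE = "`" * 3  # Markdown code fence marker, built indirectly on purpose


def slugify(text: str) -> str:
    """Turn a heading into a lowercase, hyphen-separated anchor id (assumed scheme)."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug or "section"


def toc_for_markdown(markdown: str) -> str:
    """Return a <ul> of links, one per '## ' heading outside fenced code."""
    items = []
    in_fence = False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence  # skip anything inside fenced code blocks
            continue
        if in_fence:
            continue
        match = re.match(r"^##\s+(.+?)\s*$", line)
        if match:
            title = match.group(1)
            items.append(
                f'  <li><a href="#{slugify(title)}">{html.escape(title)}</a></li>'
            )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


print(toc_for_markdown("# Title\n\n## First section\ntext\n\n## Second: part 2\n"))
```

For the links to resolve, the rendered headings would need matching `id` attributes generated with the same slug function.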
layer8 17 days ago [-]
It’s somehow funny to experience the juxtaposition of the technological progress with LLMs and how decades-old basic functions like TOC creation for a blog post still require custom software. ;)
k2xl 18 days ago [-]
Something not mentioned is AI generated music. Suno's development this year is impressive. Unclear what this will mean for music artists over next few years.
simonw 17 days ago [-]
Yeah, this year I decided to just focus on LLMs - I didn't touch on any of the image or music generation advances either. I haven't been following those closely enough to have particularly useful things to say about them.
fullstackchris 17 days ago [-]
Very clear; I like buying music produced by people who play instruments.
janstice 17 days ago [-]
I’m even happy to listen to generative music, so long as it’s orchestrated (haha) by musicians using musical taste to make musical decisions, rather than a pastiche of the worst derivative house you’ve ever heard by a rando with no intent.
JaDogg 16 days ago [-]
What do you think of samples and FL Studio / DAWs?
k__ 17 days ago [-]
I'm glad that so many open source and even "small" models like Gemma are better than GPT-4.
fosterfriends 18 days ago [-]
My fav part of the writeup at the end:
"""
LLMs need better criticism #
A lot of people absolutely hate this stuff. In some of the spaces I hang out (Mastodon, Bluesky, Lobste.rs, even Hacker News on occasion) even suggesting that “LLMs are useful” can be enough to kick off a huge fight.
I like people who are skeptical of this stuff. The hype has been deafening for more than two years now, and there are enormous quantities of snake oil and misinformation out there. A lot of very bad decisions are being made based on that hype. Being critical is a virtue.
If we want people with decision-making authority to make good decisions about how to apply these tools we first need to acknowledge that there ARE good applications, and then help explain how to put those into practice while avoiding the many unintuitive traps.
"""
LLMs are here to stay, and there is a need for more thoughtful critique rather than just "LLMs are all slop, I'll never use it" comments.
vunderba 18 days ago [-]
I agree, but I think my biggest issue with LLMs (and a lot of GenAI) is that they act as a massive accelerator for the WORST (and unfortunately most common) type of human - the lazy one.
The signal-to-noise ratio just goes completely out of control.
Isn't it expected that most, if not all, of the content will be produced by AI/AGI in the near future? It won't matter much whether you're lazy or not. That leads to the question of what we'll do instead. People may want to be productive, but we're observing in real time how the world is going to shit for workers, and that's basically a fact, for many reasons.
One reason is that it's cheaper to use AI, even if the result is poor. It doesn't have to be high quality, because most of the time we don't care about quality unless something interests us. I wonder what kind of shift in power dynamics will occur, but so far it looks like many of us will just lose our jobs. There's no UBI (or social credit, as proposed by Douglas), salaries are low and not everyone lives in a good location, yet corporations try to enforce RTO. Some will simply get fired and won't be able to find a new job (which won't be sustainable for a personal budget, unless you already have a low cost of living and are debt-free, or have a somewhat wealthy family that will cover for you).
Well, maybe at least the government will protect us? Low chance: the world is shifting right, and it will get worse once we start to experience more and more of the results of global warming. I don't see a scenario where the world becomes a better place in the foreseeable future. We're trapped in a society of achievement, but soon we may not be able to deliver achievements, because if a business can get similar results for a fraction of the price (of hiring human workers), then guess what will happen?
These are sad times, full of depression and suffering. I hope that some huge transformation in societies will happen soon, or that AI development slows down, so that some future generation will have to deal with the consequences (people will prioritize saving their own and it won't be pretty, so it's better to just pass it down like debt).
eschaton 17 days ago [-]
Why would this be expected?
mhh__ 17 days ago [-]
The people who are lazy but have taste will do well, then.
Der_Einzige 18 days ago [-]
Sorry but the "lazy is bad" crowd is ludditism in another form, and it's telling that a whole lot of very smart people were passionate defenders of being lazy!
AI systems are literally the most amazing technology on earth for this exact reason. I am so glad that it is destroying the minds of time thieves world-wide!
CRConrad 11 days ago [-]
> the "lazy is bad" crowd is ludditism in another form
Capital --> capitalist, capitalism.
Commune --> communist, communism.
Ned Ludd --> Luddite, Luddism.
Not "capitalistism" or "communistism", so not "ludditism" either.
fragmede 11 days ago [-]
Going by your other examples, shouldn't it be Luddist and not Luddite? English is famously inconsistent and is a difficult language to learn, owing to its linguistic heritage, as it's a bunch of exceptions to rules. Like i before e, except after c, but also in a bunch of other words, so every English student just needs to remember those.
CRConrad 10 days ago [-]
Yeah, maybe not the best examples. There are other "-ite" words, but I can't recall off the top of my head any such that also have "-ism" forms. It is, as always with languages, "just the way it is": I've always seen it rendered as "Luddite", never "-ist". (Maybe because it's named for a person, not a thing or principle? Tried to hint at that possibility with his full name.)
Yup, English may be the most inconsistent of languages. When I was a kid, we used to blame French for being "just exceptions to rules, exceptions to exceptions, and exceptions to those exceptions!", but with a few decades of perspective... Nope, English is far worse.
greenavocado 18 days ago [-]
Exif watermark by the generators would solve 90% of the problem in one fell swoop because lazy people won't remove it
minimaxir 17 days ago [-]
Every image host and social media app automatically strips EXIF data (for privacy reasons at minimum).
greenavocado 17 days ago [-]
Steganography with a known signature perhaps
layer8 17 days ago [-]
Still easily defeated when the scheme is known.
greenavocado 17 days ago [-]
My point is most won't bother
layer8 17 days ago [-]
Well, it’s a cat and mouse game. They will start to bother when not doing so starts having consequences for them.
mhh__ 17 days ago [-]
I can think of some runaway scenarios where LLMs are definitely bad but, indeed, this particular line of criticism is really just luddites longing for a world that probably doesn't exist anymore.
These are the people who regulate and legislate for us, they are the risk-averse fools who would rather things be nice and harmless lest they be bad but work.
Personally, I think my only serious ideology in this area is that I am fundamentally biased towards the power of human agency. I'd rather not need to, but in a (perhaps) Nietzschean sense I view so-called AI as a force multiplier to totally avoid the above people.
AI will enable the creative to be more concrete, and drag those on the other end of the scale towards the normie mean. This is of great relevance to the developing world too - AI may end up a tool for enforcing western culture upon the rest of the world but perhaps a force decorrelating it from the McKinsey's of tall buildings in big cities.
im_down_w_otp 17 days ago [-]
This happens with every inane hype-cycle.
I suspect people don't particularly hate or despise LLMs per se. They're probably reacting mostly to "tech industry" boom-bust bullsh*tter/guru culture. Especially since the cycles seem to burn increasingly hotter and brighter the less actual, practical value they provide. Which is supremely annoying when the second-order effect is having all the oxygen (e.g. capital) sucked out of the room for pretty much anything else.
foxhop 17 days ago [-]
Step 1: curate a context window of code from different repos (poke team about switching to mono repo)
Step 2: write a slack style message as if you are discussing the solution with a teammate that you have authority over as a delegate to get shit done & to revise as needed.
Step 3: press enter, LLM does something you don't like, delete history, fix prompt in step 2 and ask again, rinse and repeat until you have working code.
Step 4: ask for the changes to be written as a bash script that uses cat <<EOF heredocs to write every changed file into place, then run the script.
Step 5: git diff & play test the changes using functional testing (use your mouse & keyboard to test the code paths that changed...)
Step 6: continue prompting & deleting history as needed to refine.
Step 7: commit code to repos
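The script from Step 4 could look roughly like what the sketch below generates. This is a minimal Python illustration, assuming the changed files are available as a mapping of path to new contents; `build_apply_script`, the `LLM_EOF` marker and the example path are hypothetical names, not part of any tool mentioned here.

```python
import shlex


def build_apply_script(changes: dict[str, str], marker: str = "LLM_EOF") -> str:
    """Return a bash script that writes each changed file into place via a heredoc."""
    lines = ["#!/usr/bin/env bash", "set -euo pipefail", ""]
    for path, content in changes.items():
        # A line equal to the marker would end the heredoc early, so refuse it.
        if marker in content.splitlines():
            raise ValueError(f"heredoc marker {marker!r} appears in {path}")
        quoted = shlex.quote(path)
        lines.append(f'mkdir -p -- "$(dirname -- {quoted})"')
        # Quoting the marker ('LLM_EOF') stops bash from expanding $vars in the body.
        lines.append(f"cat > {quoted} <<'{marker}'")
        # The heredoc always ends the file with a newline, so drop extra ones here.
        lines.append(content.rstrip("\n"))
        lines.append(marker)
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_apply_script({"src/app.py": "print('hi')\n"}))
```

Reviewing the generated script with git diff after running it (Step 5) still matters, since the script overwrites files without asking.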
adsharma 18 days ago [-]
In spite of all this progress, I can't find LLMs that solve simple tasks like:
Here is my resume. Make it look nice (some design hints).
They can spit html and css, but not Google doc.
On the other hand, Google results are dominated by SEO spam. You can probably find one usable result on page 10.
The problem is not technology. It's a business model that can support the humans feeding data into the LLM.
Alex-Programs 17 days ago [-]
Why would they be able to output a Google doc? It's a proprietary format. The closest thing would be rich text format to copy paste.
vikramkr 17 days ago [-]
That proprietary format is owned by a company associated with folks who won two nobel prizes for AI related work this year and the employer at the time of the researchers who wrote the attention is all you need paper and also the owner of a search engine with access to like, all the data. Doesn't seem unreasonable lol
behnamoh 17 days ago [-]
[flagged]
adsharma 15 days ago [-]
I'll accept any open format that can be lightly edited and converted into PDF.
Google doc + PDF is likely the most commonly used combination based on what I see in the SEO spam.
Some of them make you watch ads and then allow you to download something that looks like a doc, but you'll find out soon that you downloaded a ppt with an image that you can't edit.
Gooblebrai 17 days ago [-]
> They can spit html and css, but not Google doc.
Wow. At this stage, I think people are just searching for excuses to complain about anything that the LLM does NOT do.
adsharma 15 days ago [-]
The amount of SEO spam on these searches indicates to me that this is a commercially profitable query and a task a lot of people are interested in.
If a multi-modal LLM can read a 100 page PDF and answer questions about it or replace a median white collar worker, this should be a relatively trivial task. Suggest some nice fonts, backgrounds and give me something that I can lightly edit and generate a PDF from.
logicchains 17 days ago [-]
They can spit out LaTeX, and a PDF from that is going to look much nicer than a Google doc (and display the same everywhere). As an added bonus, the recruiter can't randomly rewrite parts of it (at least not so easily).
nox101 17 days ago [-]
The recruiter isn't going to print out your resume. They're going to read it on their computer or iPad or phone.
trenchgun 17 days ago [-]
For sure they will read a pdf and not a google doc.
surfingdino 17 days ago [-]
If I learned anything, it would be that LLMs' non-deterministic nature makes them great at generating output that we can argue over, but they are not a great tool for doing actual work. I am not asking for much. In my field of work, I use Jetbrains' IDEs, which have now been "enhanced" with AI. I had to turn this feature off, because I kept having to remove code and imports randomly added by the IDE. This was distracting and wasted my time.
macawfish 17 days ago [-]
Large concept models are really exciting
m3kw9 17 days ago [-]
Interestingly, there isn't much big news about jail breaking or safety alignment
OpenAI’s board now stating “We once again need to raise more capital than we’d imagined” less than three months after raising another $6.6 billion at a valuation of $157 billion sounds alarmingly like a Ponzi scheme — an argument akin to “Trust us, we can maintain our lead, and all it will take is a never-ending stream of infinite investment.”
jsheard 17 days ago [-]
According to the internal projections that The Information acquired recently they're expecting to lose $14 billion in 2026, so that record breaking funding round won't even buy them 6 months of runway at that point even by their own probably optimistic estimates.
cactusfrog 17 days ago [-]
Every waste of money is not a Ponzi scheme.
ffsm8 17 days ago [-]
I agree, the core aspect of a ponzi scheme is that it redistributes the newly invested funds to previous investors, making it highly profitable to anyone joining early and incentivising early joiners to get new investors.
This just doesn't hold true for open ai
jacobgkau 17 days ago [-]
Doesn't it hold true for investment in AI (or potentially any other industry that experiences a boom) in general?
Anyone who bought in at the ground floor is now rich. Anyone who buys in now is incentivized to try and keep getting more people to buy in so their investment will give a return regardless of if actual value is being created.
dartos 17 days ago [-]
In effect, kind of.
The money being invested does not go directly to investors.
It goes to the cost of R&D, which in turn increases the value of openai shares, then the early investors can sell those shares to realize those gains.
The difference between that and a ponzi is that the investment creates value which is reflected in the share price.
No value is created in a Ponzi scheme.
The actual dollar worth of the value generated is what people speculate on.
zekica 17 days ago [-]
Only a part of the value is created in OpenAI's stock valuation. Most of it is still a ponzi-like scheme.
dartos 17 days ago [-]
I have no love for openai, but they did make the fastest growing product of all time. There’s value in being the ones to do that.
I do agree it’s a very very thin line.
CRConrad 11 days ago [-]
> but they did make the fastest growing product of all time. There’s value in being the ones to do that.
Aha: So if my future line of Covid Cancer Candy takes off even faster, there's "value" in that, too?
What kind of value, exactly? Does the value of being "the fastest growing product of all time" not at all depend on what kind of product it is?
CRConrad 11 days ago [-]
Because "Open" AI keeps the newly invested funds only to themselves (with some scraps for employees), so early joiners don't (directly[1]) get any of the newly invested funds from later ones, you mean?
Yeah, true, not exactly a Ponzi scheme: This has even fewer redeeming qualities.
[1]: Only indirectly, by selling off their investment to that next sucker.
DavidSJ 17 days ago [-]
> Every waste of money is not a Ponzi scheme.
Using this as an opportunity to grind an axe (not your fault, cactusfrog!): I find it clearer when people write "not every X is a Y" than "every X is not a Y", which could be (and would be, literally) interpreted to mean the same thing as "no X is a Y".
wslh 17 days ago [-]
Not every, but wasting money is one of the tricks of corruption.
hdjjhhvvhga 17 days ago [-]
What is funny is that their "lead" is just because of inertia - they were the first to make an LLM publicly available. But they are no longer leaders so their attempts at getting more and more money only prove Altman's skills at convincing people to give him money.
lumost 17 days ago [-]
They are still in the lead, and I'd be willing to bet that they have 10x the DAU on chat.com/chatgpt.com than all other providers combined. Barring massive innovation on small sub 10B models - we are all likely to need remote inference from large server farms for the foreseeable future. Even in the case that local inference is possible - it's unlikely it will be desirable from a power perspective in the next 3 years. I am not going to buy a 4xB200 instance for myself.
Whether they offer the best model or not may not matter if you need a PhD in <subject> to differentiate the response quality between LLMs.
scary-size 17 days ago [-]
Not sure about 10x DAUs. Google flicked the switch on Gemini and it surfaced in pretty much every GSuite app over night.
Peacefulz 17 days ago [-]
Requiring that Gemini take over the job that Google Assistant did when installing the Gemini APK really rubbed me the wrong way. I get it. I just don't like that it was required for use.
brokencode 17 days ago [-]
Same with Microsoft and all their Copilots, which are built on OpenAI. Not to mention all the other companies using OpenAI since it’s still the best.
belter 17 days ago [-]
Their best hope now is to hire John Carmack :-)
theferalrobot 17 days ago [-]
Which models perform better than 4o or o1 for your use cases?
In my limited tests (primarily code) nothing from llama or Gemini has come close, Claude I’m not so sure about.
torginus 17 days ago [-]
How good is the best model of your choice at doing architecture work for complex and nontrivial apps?
I have been bashing my head against the wall over the course of the past few days trying to create my (quite complex) dream app.
Most of the LLM coding I've done involved writing code to interface with already existing libs or services, and the LLMs are great at that.
I'm hung up on architecture questions that are unique to my app and definitely not something you can google.
fullstackchris 17 days ago [-]
Don't wanna be that typical hackernews guy but I couldn't resist... if your app is "quite complex" there is probably a way or ways you can break it down into much simpler parts. Easier for you AND the LLM. It always comes back to architecture and composition ;)
torginus 17 days ago [-]
I don't want to be mean, but that bit of eastern wisdom you dispensed sounds incredibly like what a management consultant would say.
jppope 17 days ago [-]
yeah but in business there are really only 2 skills right? Convincing people to give you money and giving them something back that's worth more than the money they gave you.
klipt 17 days ago [-]
For repeated business you want to give them something that costs you less than what they pay, but is worth more to them than what they pay. Ie creating economic value.
dkkergoog 17 days ago [-]
[dead]
wpnx 17 days ago [-]
Thank you Simon
alexashka 17 days ago [-]
I wonder what the author of this post thinks of human generated slop.
For example if someone just takes random information about a topic, organizes it in chronological order and adds empty opinions and preferences to it and does that for years on end - what do you call that?
webmaven 17 days ago [-]
An "Editor".
JaDogg 18 days ago [-]
I think LLM web applications need a big red warning (non interactive, I don't want more cookie dialogs) like in cigarettes.
> LLM generated content needs to be verified.
becquerel 17 days ago [-]
Every LLM web app I have used has a disclaimer along these lines prominently featured in the UI. Maybe the disclaimer isn't bright red with gifs of flashing alarms, but the warnings are there for the people who would pay attention to them in the first place.
minimaxir 17 days ago [-]
Unfortunately, even after 2 years of ChatGPT and countless news stories about it, people still don't realize that LLMs can be wrong.
There maybe should be a bright red flashing disclaimer at this point.
Der_Einzige 18 days ago [-]
RE: Slop:
Having Slop generations from an LLM is a choice. There are so many tricks to make models genuinely creative just at the sampler level alone.
It doesn't matter how good the generated text is: it is still slop if the recipient didn't request it and no human has reviewed it.
Der_Einzige 18 days ago [-]
By that definition machine to machine communication that happens "organically" (like how humans do it, where they sometimes strike up conversations unprompted with each other) is "slop".
You're not seeing how the future of the world will develop.
simonw 18 days ago [-]
If you ask me to read an unguided conversation between two LLMs then yes, I'd consider that slop.
Some people might like slop.
minimaxir 17 days ago [-]
The rise of the famous obvious Facebook AI slop indicates that some demographics love it.
orbital-decay 17 days ago [-]
This won't solve anything. There's a myriad of sampling strategies, and they all have the same issue: samplers are dumb. They have no access to the semantics of what they're sampling. As a result, things like min-p or XTC will either overshoot or undershoot as they can't differentiate between the situations. For the same reason, samplers like DRY can't solve repetition issues.
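To make the "samplers are dumb" point concrete, here is a minimal pure-Python sketch of min-p filtering as it is commonly described: keep only tokens whose probability is at least `min_p` times the top token's probability, then renormalize. The function names are illustrative, and applying temperature before the filter is an assumption (implementations differ on the order). Note that the filter only ever sees numbers, never what the tokens mean.

```python
import math
import random


def min_p_filter(logits: list[float], min_p: float = 0.1,
                 temperature: float = 1.0) -> list[float]:
    """Return a renormalized distribution keeping tokens with p >= min_p * max(p)."""
    scaled = [x / temperature for x in logits]  # assumption: temperature first
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract peak for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)  # cutoff scales with the model's confidence
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)  # never zero: the top token always survives
    return [p / kept_total for p in kept]


def sample_token(probs: list[float], rng: random.Random) -> int:
    """Draw one token index from the filtered distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    filtered = min_p_filter([2.0, 1.0, -3.0], min_p=0.1)
    print(filtered, sample_token(filtered, random.Random(0)))
```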
Slop is over-representation of model's stereotypes and lack of prediction variety in cases that need it. Modern models are insufficiently random when it's required. It's not just specific words or idioms, it's concepts on very different abstraction levels, from words to sentence patterns to entire literary devices. You can't fix issues that appear on the latent level by working with tokens. The antislop link you give seems particularly misguided, trying to solve an NLP task programmatically.
Research like [1] suggests algorithms like PPO as one of the possible culprits in the lack of variety, as they can filter out entire token trajectories. Another possible reason is training on outputs from the previous models and insufficient filtering of web scraping results.
And of course, prediction variety != creativity, although it's certainly a factor. Creativity is an ill-defined term like many in these discussions.
You should read the follow-up work from Entropix folks, or reflect on the extremely high review scores min_p is getting, or look at the fact that even trivial shit like top_k=2 + temperature = max_int works as evidence that models do in fact "have access to the semantics of what they're sampling" via the ordering of their logprobs.
DRY does in fact solve repetition issues. You're not using the right settings with it. Set the penalty sky high like 5+. Yes that means you're going to have to modify the ui_paramas in oobabooga cus they have stupid defaults on what limits you can set the knobs to.
There's several other excellent samplers which deserve high ranking papers and will get them in due time. Constrained beam search, tfs (oldie but goodie), mirostat, typicality, top_a, top-n0, and more coming soon. Don't count out sampler work. It's the next frontier and the least well appreciated.
Also, contrastive search is pretty great. Activation/attention engineering is pretty great, and models can in fact be made to choose their own sampling/decoding settings, even on the fly. We haven't even touched on the value of constrained/structured decoding. You'll probably link a similarly bad paper to the previous one claiming that this too harms creativity. Good thing that folks who actually know what they're doing, i.e. the developers of outlines, pre-bunked that paper already for me: https://blog.dottxt.co/say-what-you-mean.html
I'm so incredibly bullish on AI creativity and I will die on the hill that soon AI systems will be undeniably more creative, and better at extrapolation, than most humans.
switch007 17 days ago [-]
I've watched juniors take their output as gospel applying absolutely zero thinking and getting confused when I suggest looking at the reference manual instead
I've had PMs believe it can replace all writing of tickets and thinking about the feature, creating completely incomprehensible descriptions and acceptance criteria
I've had Slack messages and emails from people with zero sincerity and classic LLM style and the bs that entails
I've had them totally confidently reply with absolute nonsense about many technical topics
I'm grouchy and already over LLMs
JimmyWilliams1 17 days ago [-]
[dead]
draw_down 18 days ago [-]
I agree the criticism is poor; it’s often very lazy. There are currently a lot of dog-brain “wrap a LLM around it” products, which are worthy of scorn. Much of the lazy criticism is pointing at such products and therefore writing off the whole endeavor.
But that doesn’t necessarily reflect the potential of the underlying technology, which is developing rapidly. Websites were goofy and pointless until Amazon came around (or Yahoo or whatever you prefer).
I guess potential isn’t very exciting or interesting on its own.
webmaven 17 days ago [-]
This is HN. The canonical example for that is pg's Viaweb.
draw_down 17 days ago [-]
[dead]
rcdwealth 17 days ago [-]
[dead]
paulo222 17 days ago [-]
[dead]
henning 18 days ago [-]
Spookily good at writing code? LLMs frequently hallucinate broken nonsense shit when I use them.
Recognize what they do well (generate simple code in popular languages) while acknowledging where they are weak (non-trivial algorithms, any novel code situation the LLM hasn't seen before, less popular languages).
simonw 18 days ago [-]
Did you try learning HOW to get good code out of them?
As with all things LLM there's a whole lot of undocumented and under appreciated depth to getting decent results.
Code hallucinations are also the least damaging type of hallucinations, because you get fact checking for free: if you run the code and get an error you know there's a problem.
A lot of the time I find pasting that error message back into the LLM gets me a revision that fixes the problem.
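The run-it-and-paste-the-error-back loop can be sketched as below. This is a minimal illustration, not anyone's actual tooling: `ask_llm` is a hypothetical callable standing in for whatever model API you use, and the prompt wording and `max_rounds` default are assumptions.

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable


def run_snippet(code: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Run code in a fresh Python process; return (exited cleanly, stderr text)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
        handle.write(code)
        path = handle.name
    try:
        # Generated code is untrusted: in real use, run it in a sandbox.
        result = subprocess.run([sys.executable, path], capture_output=True,
                                text=True, timeout=timeout)
        return result.returncode == 0, result.stderr
    finally:
        os.unlink(path)


def fix_until_it_runs(code: str, ask_llm: Callable[[str], str],
                      max_rounds: int = 3) -> tuple[str, bool]:
    """Feed each error back to the model until the code exits cleanly or rounds run out."""
    for _ in range(max_rounds):
        ok, error = run_snippet(code)
        if ok:
            return code, True
        code = ask_llm(
            f"This code failed:\n\n{code}\n\nError output:\n{error}\n"
            "Return only a corrected version of the code."
        )
    ok, _ = run_snippet(code)  # check the final revision as well
    return code, ok
```

A clean exit only means nothing crashed; it says nothing about whether the output is right.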
lolinder 17 days ago [-]
> Code hallucinations are also the least damaging type of hallucinations, because you get fact checking for free: if you run the code and get an error you know there's a problem.
This is great when the error is a thrown exception, but less great when the error is a subtle logic bug that only strikes in some subset of cases. For trivial code that only you will ever run this is probably not a big deal—you'll just fix it later when you see it—but for code that must run unattended in business-critical cases it's a totally different story.
I've personally seen a dramatic increase in sloppy logic that looks right coming from previously-reliable programmers as they've adopted LLMs. This isn't an imaginary threat, it's something I now have to actively think about in code reviews.
polishdude20 17 days ago [-]
When they spit out these subtle bugs, are you prompting the LLM to watch out for that particular bug? I wonder if it just needs a bit more guidance in more explicit terms.
lolinder 17 days ago [-]
At a certain point it becomes more work to prompt the LLM with each and every edge case than it is to just write the dang code.
I work out what the edge cases are by writing and rewriting the code. It's in the process of shaping it that I see where things might go wrong. If an LLM can't do that on its own it isn't of much value for anything complicated.
simonw 17 days ago [-]
Yeah, the other skill you need to develop to make the most of AI-assisted programming is really good manual QA.
lolinder 17 days ago [-]
Have you found that to be a good trade-off for large-scale projects?
Where I'm at right now with LLMs is that I find them to be very helpful for greenfield personal projects. Eliminating the blank canvas problem is huge for my productivity on side projects, and they excel at getting projects scaffolded and off the ground.
But as one of the lead engineers working on a million+ line, 10+ year-old codebase, I've yet to see any substantial benefit come from myself or anyone else using LLMs to generate code. For every story where someone found time saved, we have a near miss where flawed code almost made it in or (more commonly) someone eventually deciding it was a waste of time to try because the model just wasn't getting it.
Getting better at manual QA would help, but given the number of times where we just give up in the end I'm not sure that would be worth the trade-off over just discouraging the use of LLMs altogether.
Have you found these things to actually work on large, old codebases given the right context? Or has your success likewise been mostly on small things?
simonw 17 days ago [-]
I use them successfully on larger projects all the time.
"Here's some example JavaScript code that sends an email through the SendGrid REST API. Write me a python function for sending an email that accepts an email address, subject, path to a Jinja template and a dictionary of template context. It should return true or false for if the email was sent without errors, and log any error messages to stderr"
That prompt is equally effective for a project that's 500 lines or 5,000,000 lines of code.
I also use them for code spelunking - you can pipe quite a lot of code into Gemini and ask questions like "which modules handle incoming API request validation?" - that's why I built https://github.com/simonw/files-to-prompt
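The general idea of piping code into a long-context model can be sketched as follows. This is a minimal illustration only: it is not the implementation or output format of files-to-prompt, and `files_as_prompt`, the `---` delimiters and the default suffix filter are assumptions made for the example.

```python
from pathlib import Path


def files_as_prompt(root: str, suffixes: tuple[str, ...] = (".py",),
                    question: str = "") -> str:
    """Concatenate matching files under root, each labelled with its relative path."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            # Label each file so the model can refer to modules by path.
            parts.append(f"{path.relative_to(root)}\n---\n{text}\n---")
    if question:
        parts.append(question)  # the question goes last, after all the code
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(files_as_prompt(".", question="Which modules handle incoming "
                                        "API request validation?"))
```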
gre 17 days ago [-]
I had some success converting a react app with classes to use hooks instead. Also asking it to handle edge cases, like spaces in a filename in a bash script--this fixes some easy problems that might have come up. The corollary here is that pointing out specific problems or mentioning the right jargon will produce better code than just asking for the basic task.
It's very bad at Factor but pretty good at naming things, sometimes requiring some extra prompting. [generate 25 possible names for this variable...]
nickpsecurity 17 days ago [-]
That’s the problem I had on the early ones. I learned a few tricks that let me output whole apps from GPT3.5 and GPT4 before they seemed to nerf them.
1. Stick with popular languages, libraries, etc with lots of blog articles and example code. The pre-training data is more likely to have patterns similar to what you’re building. OpenAI’s were best with Python. C++ was clearly taxing on it.
2. Separate design from coding. Have an AI output a step by step, high-level design for what you’re doing. Look at a few. This used to teach me about interesting libraries if nothing else.
3. Once a design is had, feed it into the model you want to code. I would hand-make the data structures with stub functions. I’d tell it to generate a single function. I made sure it knew what to take in and return. Repeat for each function.
4. For each block of code, ask it to tell you any mistakes in it and generate a correction. It used to hallucinate on this enough that I only did one or two rounds, made sure I hand-changed the code, and sometimes asked for specific classes of error.
5. Incremental changes. You give it the high-level description, a block of code, and ask it to make one change. Generate new code. Rinse repeat. Keep old versions since it will take you down dead ends at times but incremental is best.
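As an illustration of step 3, a hand-made scaffold might look like the sketch below: data structures written by hand, one function already filled in, and one stub left for the next single-function request. The names (`Page`, `strip_bloat`, `compress_page`) are hypothetical, loosely modelled on a bloat-stripping proxy; this is not the original code.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Page:
    """Hand-written data structure that every function agrees on."""
    url: str
    html: str


def strip_bloat(page: Page) -> Page:
    """Takes a fetched Page; returns a new Page with scripts and ads removed."""
    raise NotImplementedError  # stub: the next single function to ask the model for


def compress_page(page: Page) -> bytes:
    """Takes a Page; returns its HTML as zlib-compressed UTF-8 bytes."""
    return zlib.compress(page.html.encode("utf-8"))
```

Spelling out the input and output in each docstring is what lets the model be asked for exactly one function at a time.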
I used the above to generate a number of utilities. I also made a replacement for the ChatGPT application that used the Davinci API. I also made a web proxy with bloat stripping and compression for browsing from low-bandwidth, mobile devices. Best use of incremental modification was semi-automatically making Python web apps async.
Another quick use for CompSci folks. I’d pull algorithm pseudocode out of papers which claimed to improve on existing methods. I’d ask GPT4 to generate a Python version of it. Then, I’d use the incremental change method to adapt it for a use case. One example, which I didn’t run, was porting a pauseless, concurrent GC.
switch007 17 days ago [-]
QA are going to be told to use AI too
(Seems every job is fair game according to CTOs. Well, except theirs)
AnimalMuppet 17 days ago [-]
> Did you try learning HOW to get good code out of them?
That is at least somewhat a valid point. Good workers know how to get the best out of their tools. And yet, good tools accommodate how their users work, instead of expecting the user to accommodate how the tool works.
One could also say that programmers were sold a misleading bill of goods about how LLMs would work. From what they were told, they shouldn't have to learn how to get the best out of LLMs - LLMs were AI, on the way to AGI, and would just give you everything you needed from a simple prompt.
simonw 17 days ago [-]
Yeah, that's one of the biggest misconceptions I've been trying to push back against.
LLMs are power-user tools. They're nowhere near as easy to use as they look (or as their marketing would have you believe).
Learning to get great results out of them takes a significant amount of work.
CRConrad 11 days ago [-]
> Did you try learning HOW to get good code out of them?
Isn't that a bit "You're holding it wrong"? I mean, why isn't that the default; did anyone really think one would mainly want bad results out of them?
simonw 11 days ago [-]
When I say that these things are deceptively difficult to use I don't intend that as a ringing endorsement of the technology.
joelanman 17 days ago [-]
> if you run the code and get an error you know there's a problem.
well, sometimes - other times it'll be wrong with no error, or insecure, or inaccessible, and so on
xyzsparetimexyz 17 days ago [-]
Is there more to getting 'good' at them than just copying error messages back in? Like, how do I get them to reason about e.g. whether a data structure compression method makes sense?
henning 17 days ago [-]
Like all AI simps, your blanket response to pointing out flaws is to tell me to do more prompt engineering and then dismiss the issue entirely. In the time it takes me to coax the model to do the thing I was told it knows how to do, I could just do the task myself. Your examples of LLM code generation are simple, easy to specify, self-contained applications that are not representative of software you can actually build a business on. Please do something your beloved LLMs can't and come up with an original idea.
minimaxir 17 days ago [-]
> not representative of software you can actually build a business on
The only people pushing that you can BUILD AN APP WITHOUT WRITING A LINE OF CODE are the Twitter AI hypesters. Simon doesn't assert anything of the sort.
LLMs are more-than-sufficient for code snippets and small self-contained apps, but they are indeed far from replacing software engineers.
phantompeace 17 days ago [-]
Like all stubborn anti-AI know-it-alls, you sound like you’ve tried a couple of times to do something and have decided to label all LLMs with the same brush.
What models have you tried, and what are you trying to do with them? Give us an example prompt too so we can see how you’re coaxing it so we can rule out skill issue.
And a big strength LLMs have is summarizing things - I’d like to see you summarize the latest 10 arxiv papers relating to prompt engineering and produce a report geared towards non-techies. And do this every 30 mins please. Also produce social media threads with that info. Is this a task you could do yourself, better than LLMs?
voidhorse 17 days ago [-]
> And a big strength LLMs have is summarizing things - I’d like to see you summarize the latest 10 arxiv papers relating to prompt engineering and produce a report geared towards non-techies. And do this every 30 mins please. Also produce social media threads with that info. Is this a task you could do yourself, better than LLMs?
Right, but this is the part that is silly and sort of disingenuous and I think built upon a weird understanding of value and productivity.
Doing more constantly isn't inherently valuable. If one human writes a magnificently crafted summary of those papers once and it is promulgated across channels effectively, this is both better and more economical than having an LLM compute one (slightly incorrect) summary for each individual on demand. In fact, all the LLM does in this case is increase the amount of possible lower quality noise in the space. The one edge an LLM might have at this stage is to generate a summary that accounts for more recent information, thereby getting around the inevitable gradual "out of dateness" of human authored summaries at time T, but even then, this is not great if the trade off is to pollute the space with a bunch of ever so slightly different variants of the same text. It's such a weird, warped idea of what productivity is, it's basically the lazy middle-manager's idea of what it means to be productive. We need to remember that not all processes are reducible to their outputs—sometimes the process is the point, not the immediate output (e.g. education).
phantompeace 17 days ago [-]
Who said anything about value? I can argue the vast majority of human generated content is valueless - look at Quora and Medium even before ChatGPT blew up. Where else are humans producing this amazing content? Facebook? X? Don’t even get me started.
Being able to summarise multiple articles quicker than a human can read and digest a single one is obviously more productive. I’m not sure why you’re assuming I’m talking about rewriting the papers to produce slightly different variations? It’s a summary. Concerned about the lack of “insight” or something? Then add a workflow that takes the summaries and use your imagination - maybe ask it to find potential applications in completely different fields? You already have comprehensive summaries (or the full papers in a vector db). Am I missing something?
Also the quality of the summary will be linked to the prompts and the way you go about the process (one-shotting the full paper in the prompt, map reduce, semantically chunked summaries, what model you’re using, its context length etc) as well as your RAG setup. I’m still working on my implementation but it’s simple as fuck and pretty decent in giving me, well, summaries of papers.
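Of the approaches listed, map reduce is the easiest to sketch: summarize each chunk, then summarize the summaries. The code below is a minimal illustration, not the commenter's implementation; `summarize` is a hypothetical callable wrapping whatever model you use, and splitting on blank lines with a character budget is an assumption (real setups often chunk by tokens or semantics).

```python
from typing import Callable


def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Group paragraphs into chunks of roughly max_chars characters."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    # Note: a single paragraph longer than max_chars stays as one oversized chunk.
    return chunks


def map_reduce_summary(text: str, summarize: Callable[[str], str],
                       max_chars: int = 4000) -> str:
    """Summarize each chunk (map), then summarize the joined summaries (reduce)."""
    chunks = chunk_text(text, max_chars)
    if len(chunks) <= 1:
        return summarize(text)  # short enough to one-shot
    partials = [summarize(chunk) for chunk in chunks]
    return summarize("\n\n".join(partials))
```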
I can’t articulate it well enough but your human curation argument sounds to me like someone dismissing Google because anyone can lie online, and the good old Yellow Pages book can never be wrong.
voidhorse 17 days ago [-]
Based on your writing you are clearly emotionally invested in this technology, consider how that may affect your understanding.
By multiple rewrites, I meant that, to me, at least, it is silly to spend N compute on producing effectively the same summary on demand for the Mth chatbot user when, in some cases, we could much more economically generate one summary once and make it available via distribution channels--to be fair, that is sort of orthogonal to whether or not the "golden" summary is produced by humans or LLMs. I guess this is more of a critique of the current UX and computational expenditure model.
Yes, my whole point about the process being the point sometimes is precisely about lack of insight. It goes back to Searle's Chinese Room argument. A person in a room with a perfect dictionary and grammar reference can productively translate english texts (input) into Chinese texts (output) just by consulting the dictionary, but we wouldn't claim that this person knows Chinese. Using LLMs for "understanding" is the same. If all you care about is immediate material gain and output, sure, why not, but some of us realize that human beings still move and exist in the world and some of us still appreciate that we need to help fashion those human beings into rational ones that are able to use reason to get along, and aren't codependent on the past N years of the internet to answer any and all questions (the same criticism applies to over reliance on simplistic "answers" from search engines).
phantompeace 17 days ago [-]
I wouldn't say i'm "emotionally invested" in this tech so much as annoyed with people who expect it to be 100% perfect, as if they've accepted the snake oil salesmen's claims at face value and then dismiss all useful applications of it at the first hurdle. Consider that your disdain for these salespeople and their oft-exaggerated claims (which i absolutely despise) may cloud your judgement of the actual technology.
>it is silly to spend N compute on producing effectively the same summary on demand for the Mth chatbot user
Why? The compute is there, unused. Why is it silly to use it the way a user wants to? Is your argument more towards our effective use of electrical power across the globe or the quality of the summaries? What if the summaries are produced once and then loaded from some sort of cache - does that make it better in your eyes? I'm trying to understand exactly your point here... please accept my apologies for not being able to understand and please do not take my questions as "gotchas" or anything like that. I genuinely want to know the issue.
>A person in a room with a perfect dictionary and grammar reference can productively translate english texts (input) into Chinese texts (output) just by consulting the dictionary, but we wouldn't claim that this person knows Chinese.
Agreed, because you can't really know a language just from its words - you need grammar rules, historical/cultural context etc - precisely the kinds of things included in an LLM's training dataset. I'd argue the LLM knows the language better than the human in your example.
Again, i'm not sure how all of this is relevant to using LLMs to summarise long papers? I wouldn't have read them in the first place, because i didn't know they existed, and i don't have time to read them fully. So a summary of the latest papers every day is infinitely better to me than just not knowing in the first place. Now if you want to talk about how LLMs can confidently hallucinate facts or disregard things due to inherent bias in the training datasets then i'm interested because those are the things that are stopping me from actually trusting the outputs fully. (Note, i also don't trust human output on the internet either, due to inherent bias within all of us)
>human beings still move and exist in the world and some of us still appreciate that we need to help fashion those human beings into rational ones that are able to use reason to get along, and aren't codependent on the past N years of the internet to answer any and all questions
Do a simple experiment with the people around you. Ask them about something that happened a few years ago and see if they pull up Google or Wikipedia or whatever. I don't think you realise how few and far between the humans you're talking about are nowadays. Everyone, from teens to pensioners, has been affected by brain rot to some degree, whether it's plain disinformation on Facebook, or sweet nothings from their pastor/imam/rabbi, or inaccurate Google search summaries (which is a valid point against LLMs - i'm also disappointed with how bad their implementation is).
And let's not assume most humans are even capable of being rational when the data in their own brains has been biased and manipulated by institutions and politicians in "democracies".
voidhorse 17 days ago [-]
I basically agree with everything you say here, I guess my chief concern surrounds reducing brain rot, and I mostly just worry that we will only increase brain rot through uncritical application of LLMs, rather than decrease it.
At least there is one silver lining: your comments are evidence that not everyone has suffered that brain rot, and some of us are still out there using tools critically—thanks for a good conversation on this!
phantompeace 16 days ago [-]
I am really glad we got the chance for this discussion and that it didn’t devolve into flaming or bad faith discussion; and i also share your sentiments RE brain rot, but for me this tech is cool yet weirdly primitive hence my excitement (I’m a 90s baby so I was “new” to the internet around the time AOL was in decline and this is the first time i feel early to something). I bet you there are ways to steer people away from their stupor using these - you know how a lie travels faster than the truth? What if these things can help equalise that?
Btw, I apologise again if I came across as blunt or rude in our exchange, upon reflection, I think you were actually right about me being somewhat emotionally invested in this (albeit due to that sliver of hope that they can be used for good). Peace be with you
hatefulmoron 17 days ago [-]
> And a big strength LLMs have is summarizing things - I’d like to see you summarize the latest 10 arxiv papers relating to prompt engineering and produce a report geared towards non-techies. And do this every 30 mins please. Also produce social media threads with that info. Is this a task you could do yourself, better than LLMs?
I don't mean to nitpick, but how good do you really think the output of this would be? Papers are short and usually have many references, I would expect the LLM to basically miss the important subtleties on every paper it's given, and misunderstand and misattribute any terms of art it encounters.
I mean, of course LLMs are good at summarizing: the summaries are probably mostly sort of good, and anything I'm summarizing I won't read myself. But for technical and specific texts, what's the point when you're getting a "maybe correct" retelling? Best case scenario you get a pretty paragraph that's maybe good for an introduction, and worst case you get incorrect information that misinforms you.
phantompeace 17 days ago [-]
The quality of the summary is only as good as the effort you put into writing your workflow. If you’re simply one shotting the paper into a message and saying “plz summarise this and I’ll reward you with $1m” then of course it’s gonna be shit. But if you semantically chunked along sections and do some RAG Q&A summaries before combining into a well formatted schema then it’s probably going to be better than the first way.
I’m using the summaries as a juicier abstract. I’m not taking them as gospel.
I’m working on following references to then add those papers to a vector db for RAG so it can actually go the step beyond. It’s fun!
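For anyone curious what the chunk-then-combine workflow looks like in code, here is a minimal sketch, not my actual script. It assumes blank lines are a good enough stand-in for semantic section boundaries, and `summarize` is a hypothetical callable standing in for whatever model call you use; `first_sentence` is only a placeholder so the example runs without a model.

```python
import re
from typing import Callable, List


def split_into_sections(paper: str) -> List[str]:
    """Split on blank lines, a crude stand-in for semantic chunking."""
    return [part.strip() for part in re.split(r"\n\s*\n", paper) if part.strip()]


def map_reduce_summary(paper: str, summarize: Callable[[str], str]) -> str:
    """Summarize each section (map), then summarize the joined partials (reduce)."""
    partials = [summarize(section) for section in split_into_sections(paper)]
    return summarize("\n".join(partials))


def first_sentence(text: str) -> str:
    """Placeholder 'model' (hypothetical): keep only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."
```

Swapping `first_sentence` for a real model call is where the prompt and schema work mentioned above would go.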
hatefulmoron 16 days ago [-]
> I’m using the summaries as a juicier abstract. I’m not taking them as gospel.
I'm not sure of the value of this. Papers already have abstracts, rewording them using LLMs is just playing with your food. If you're seeing use out of it that's awesome though.
phantompeace 10 days ago [-]
You do have a point you know. I have actually been thinking about this recently and have decided to try and focus more on extracting value out of abstracts instead of summarising papers, and relying on embeddings of the paper in case the answer needs more context.
henning 17 days ago [-]
Due to unexpected capacity constraints, Claude is unable to reply to this message.
phantompeace 17 days ago [-]
Just as I thought, just snark and no real meaningful engagement.
P.S my script uses local models - no capacity constraints (apart from VRAM!)
Kiro 17 days ago [-]
Hilarious that you're trying to gaslight us into "recognizing" your own incorrect assumptions as facts. You've lost all credibility.
th0ma5 17 days ago [-]
Simon gets one thing working for one task and assumes everyone can do the same for everything. The trick is that he has no idea how the failures happen or how to maintain actual working systems.
th0ma5 17 days ago [-]
More dishonest magical thinking. I wish this guy would learn how systems work and stop flooding the field with mystical nonsense unless he really is trying to make people think LLMs are worthless, then I guess he should be honest about it instead of subversive.
jdlshore 17 days ago [-]
I read the article and thought it was well done and level-headed. What exactly did you think was mystical or magical thinking?
The LLM goalpost keeps moving, apparently. They are not useful for most everyday tasks, e.g. suggesting games, coming up with plans, activities, anything creative that requires knowledge, understanding and creativity.
This has always been the benchmark, they are not that useful to me. Everytime I say this, someone hits me with the "yeah, I bet you haven't tried ShitLLM 4.0-pqr". It's very tiring. Your new LLM hype model is nothing but a marginal, over hyped improvement over something that fundamentally is not intelligent.
jmclnx 18 days ago [-]
Interesting, the article is not quite what I expected.
Other examples: Claude was able multiple times to spot bugs in my C code, when I asked for a code review. All bugs I would eventually find but that it's better to fix ASAP.
Finally, sometimes I put in relevant papers and implementations and ask for variations of a given algorithm among the paper and the implementations around, to gain insights about what people do in practice. Then I engage in discussions about how to improve it. It is never able to come up with novel ideas, but it is often able to recognize when my idea is flawed or when it seems sound.
All this and more helps me to deliver better code. I can venture in things I otherwise would not do for lack of time.
I get this sentiment from a lot of AI startups, that they have a product which can do amazing things, but due to its failure modes makes it almost useless as, to use an analogy from self-driving cars, the users have to still constantly pay attention to the road: you don't get a ride from Baltimore to New York where you can do whatever you please, you get a ride where you're constantly babysitting an autonomous vehicle, bored out of your mind, forced to monitor the road conditions and surrounding vehicles, lest the car make a mistake costing you your life.
To take the analogy further, after experimenting with not using LLM tools, I feel that the main difference between the two modes of work is similar to the difference between driving a car and being driven by an autonomous car: you exert less mental effort, but you do not get to your destination faster.
Another point of the analogy are things like Waymo. They really can do a great job of driving autonomously. But, they require a legible system of roads and weather conditions. There are LLM systems too that when given a legible system to work in can do a near perfect job.
I drove 3600 km Norway to Spain in 2018 with only adaptive cruise. Then again in 2023 with autonomous highway driving (the kind where you keep a hand on the wheel for failure mode) and it was amaaaazing how big the difference was.
Were you using Tesla Autopilot? If I were using Autopilot, I'd have to be constantly watching out for its mistakes, which would probably be equally or more stressful compared to using adaptive cruise.
I've been driving a lot in Istanbul lately and I'm not holding my breath for autonomous vehicles any time soon.
I think I’m more amazed by them because I know how they work. They shouldn’t be able to do this, but the fact that they can is absolutely jaw dropping science fiction shit.
DNNs implicitly learn a type theory, which they then reason in. Even though the code itself is new, it’s expressible in the learned theory — so the DNN can operate on it.
It is very unreliable at fixing things or writing code for anything non-standard. Knowing this, you can easily construct queries that trip them up: notice what it is in your code that they flag, then construct an example with that thing in it that isn't a bug, and it will be wrong every time.
The LLMs are good at finding bugs in code not because they’ve been trained on questions that ask for existing bugs, but because they have built a world model in order to complete text more accurately. In this model, programming exists and has rules and the world model has learned that.
Which means that anything nonstandard … will be supported. It is trivial to showcase this: just base64 encode your prompts and see how the LLMs respond. It’s a good test because base64 is easy for LLMs to understand but still severely degrades the quality of reasoning and answers.
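For anyone who wants to try the base64 experiment, here is a minimal sketch of the encoding step only, using Python's standard `base64` module. The function names are made up for illustration and no model call is shown; you would paste the encoded string into whatever chat interface you use.

```python
import base64


def encode_prompt(prompt: str) -> str:
    """Return the prompt as base64 text, ready to paste into a chat window."""
    return base64.b64encode(prompt.encode("utf-8")).decode("ascii")


def decode_prompt(encoded: str) -> str:
    """Reverse of encode_prompt, handy for checking a base64 reply."""
    return base64.b64decode(encoded).decode("utf-8")
```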
Of course the humans who created the training set samples didn't create them auto-regressively - the training set samples are artifacts reflecting an external world, and knowledge about it, that the model is not privy to, but the model is limited to minimizing training errors on the task it was given - auto-regressive prediction. It has no choice. The "world model" (patterns) it has learnt isn't some magical grokking of the external world that it is not privy to - it is just the patterns needed to minimize errors when attempting to auto-regressively predict training set continuations.
Whether these training set predictive patterns result in the model performing as you might hope on an unseen text depends on the similarity of that text to samples in the training set.
yes, except the computer can easily 'see' in more than 3 dimensions with more capability to spot similarities, and can follow lines of prediction (similar to chess) far more than any group of humans can.
that super-human ability to spot similarities and walk latent spaces 'randomly' -yet uncannily - has given rise to emergent phenomena that has mimicked proto-intelligence.
we have no idea what ideas these tokens have embedded at different layers, and what capabilities can emerge now or at deployment time later, or given a certain prompt.
The intelligence we see in LLMs is to be expected - we're looking in the mirror. They are trained to copy humans, so it's just our own thought patterns and reasoning being output. The LLM is just a "selective mirror" deciding what to output for any given input.
This is assuming they don't call an external pre-processing decoding tool.
If you didn't see the "analyzing" message then no external tool was called.
This is done via translation. LLMs are good at translation, and being able to translate doesn't mean you understand the subject.
And no, I am not wrong here; I've tested this before. For example, if you ask whether a CPU model is faster than a GPU model, it will say the GPU model is faster even if the CPU is much more modern and faster overall: it learned that GPU names are faster than CPU names, and it didn't really understand what faster meant there. Exactly what the LLM gets wrong depends on the LLM of course, and the larger it is the more fine grained these things are, but in general it doesn't really have much that can be called understanding.
If you don't understand how to break an LLM like this then you don't really understand what the LLM is capable of, so it is something everyone who uses LLMs should know.
Regardless of how the base64 processing is done (which is really not something you can speculate much on, unless you've specifically researched it -- have you?), my point is that it does degrade the output significantly while still processing things within a reasonable model of the world. Doing this is a rather reliable way of detaching the ability to speak from the ability to reason.
Also, the number of "factoids" / clauses needed to answer accurately is inversely proportional to the "correctness" of the final answer (on average, when prompt-fuzzed).
This is all because the more complicated/entropic the prompt/expected answer, the less total/accumulative attention has been spent on it.
Really? ;) I guess you don't believe in the universal approximation theorem?
UAT makes a strong case that by reading all of our text (aka computational traces) the models have learned a human "state transition function" that understands context and can integrate within it to guess the next token. Basically, by transfer learning from us they have learned to behave like universal reasoners.
Sure if you look at new project x then in totality it's a semi unique combination of code, but breaking it down into chunks that involve a couple lines, or a very specific context then it's all been done before.
I’m pretty sure you’re committing a logical fallacy there. Like someone in antiquity claiming “I get annoyed when experienced folks say thunderstorms aren’t the gods getting angry, it’s nature and physical phenomena. But we don’t know how the weather works”. Your lack of understanding in one area does not give you the authority to make a claim in another.
If there's something that you can prompt with e.g. "here's the proof for Fermat's last theorem" or "here is how you crack Satoshi's private key on a laptop in under an hour" and get a useful response, that's AGI.
Just to be clear, we are nowhere near that point with our current LLMs, and it's possible that we'll never get there, but in principle, if such a thing existed, it would be a next-word predictor while still being AGI.
I wonder whether that is some specialised terminology I'm not familiar with - or it just means to decompose the operations (but with an Italian s- for negation)?
> And now, at the end of 2024, I’m finally seeing incredible results in the field, things that looked like sci-fi a few years ago are now possible: Claude AI is my reasoning / editor / coding partner lately. I’m able to accomplish a lot more than I was able to do in the past. I often do more work because of AI, but I do better work.
>…
> Basically, AI didn’t replace me, AI accelerated me or improved me with feedback about my work
[0]: https://antirez.com/news/144
LLMs are like a pretty smart but overly confident junior engineer, which is what a senior engineer usually has to work with anyway.
An expert actually benefits more from LLMs because they know when they get an answer back that is wrong so they can edit the prompt to maybe get a better answer back. They also have a generally better idea of what to ask. A novice is likely to get back convincing but incorrect answers.
EDIT: antirez is the creator of redis, not mvkel.
Can you clarify what you mean?
It just means anyone higher than a senior engineer.
Google has Staff at L6, and their ladder goes up to L11. Apple's Staff counterpart is ICT5, which is below ICT6 and Distinguished. Amazon has E7-E9 above Staff, if you count E6 as Staff. Netflix very recently departed from their flat hierarchy and even they have Principal above Staff.
Few clarifications:
Amazon labels levels with "L" rather than "E". Engineering levels are L4 -- L10. Weirdly enough, level L9 does not exist at Amazon. L8 (Director / Senior Principal Engineer) is promoted directly to L10 (VP / Distinguished Engineer)
That wouldn’t be “working at your level” at the one BigTech company I’ve worked at and not even at the 600 person company I work at now
Tragically - admitting ignorance, even with the desire to learn, often has negative social repercussions
The pervasive problem of low student motivation won't be solved by LLMs, though. Human teachers will, I think, still be needed.
All the little nooks of missing knowledge are now very easy to fill in.
See https://danluu.com/look-stupid/
(In my experience as an app developer, getting any traction and/or money from your app can be much more difficult than actually building it.)
$10K MRR isn't much; we're still validating PMF. We're carefully selecting paid customers at this point, not open for wide release, hence my vagueness. Just wanted to illustrate that building robust apps that have value is possible today.
This. The app I built has maybe 50 downloads despite me trying quite hard to promote it. It's very difficult work, even with the app being completely free of charge (save for a donation button).
My experience is that people who claim they build worthwhile software "exclusively" using LLMs are lying. I don't know you and I don't know if you are lying, but I would be willing to bet my paycheck you are.
As an example I could imagine a clothing brand wanting an app that customers can install instead of using their phone browser. $10k/month in that context isn’t as surprising or impressive.
It sounds like they are doing productized consulting, so the relationship is the moat.
The relationship also builds a natural moat.
See comment above for more context.
That's great, but professional programmers are afraid of the future maintenance burden.
Not just the development of the code but the entire thing: the code, infra, auth, cc payments, etc.
What's the app?!!
Would you mind sharing which app you released?
There are certain classes of problems that LLMs are good at. Accurately regurgitating all accumulated world knowledge ever is not one, so don’t ask a language model to diagnose your medical condition or choose a political candidate.
But do ask them to perform suitable tasks for a language model! Every day, by automation, I feed the hourly weather forecast into my home ollama server and it builds me a nice, readable, concise weather report. It’s super cool!
There are lots of cases like this where you can give an LLM reliable data and ask it to do a language related task and it will do an excellent job of it.
If nothing else it’s an extremely useful computer-human interface.
not to dissuade you from a thing you find useful but are you aware that the national weather service produces an Area Forecast Discussion product in each local NWS office daily or more often that accomplishes this with human meteorologists and clickable jargon glossary?
https://forecast.weather.gov/product.php?site=SEW&issuedby=S...
Anytime you have data and want it explained in a casual way — and it’s not mission critical to be extremely precise — LLMs are going to be a good option to consider.
More useful AGI-like behaviours may be enabled by combining LLMs with other technologies down the line, but we shouldn’t try to pretend that LLMs can do everything nor are they useless.
(o1-preview) LLMs show promise in clinical reasoning but fall short in probabilistic tasks, underscoring why AI shouldn't replace doctors for diagnosis just yet.
"Superhuman performance of a large language model on the reasoning tasks of a physician" https://arxiv.org/abs/2412.10849 [14 Dec 2024]
I actually found 4o+search to be really good at this... Admittedly what I did was more "research these candidates, tell me anything newsworthy, pros/cons, etc" (much longer prompt) and well, it was way faster/patient at finding sources than I ever would've been, telling me things I never would've figured out with <5 minutes of googling each set of candidates (which is what I've done before).
Honestly my big rule for what LLMs are good at is stuff like "hard/tedious/annoying to do, easy to verify" and maybe a little more than that. (I think after using a model for a while you can get a "feel" for when it's likely BSing.)
Honestly they are very decent at it if you give them accurate information in which to make the diagnosis. The typical problem people have is being unable to feed accurate information to the model. They'll cut out parts they don't want to think about or not put full test results in for consideration.
This is not a diagnosis. Any reasonably capable person can read webmd and apply the symptoms listed and compare them to what the patient describes. This is widely regarded as dangerous because the input data as well as the patient data are limited in ways that can be medically relevant.
So even if you can use it as a good substitute for browsing webmd, it’s still not a substitute for seeing a medical professional. And for the foreseeable future it will not be.
You feed it a weather report and it responds with a weather report? How is that useful?
I did something similar awhile back without LLMs. I enjoy kayaking, but for a variety of reasons [0] it's usually unwieldy to break out of the surf and actually get out into the ocean at my local beach. I eventually started feeding the data into an old-school ML model where I'd manually check the ocean and report on a few factors (breaking waves, unsafe wind magnitude/direction, ...). The model converted those weather/tide reports into signals I cared about, and then my forecast could simply AND all those together and plot them on a calendar.
An LLM is less custom in some sense, but if you have certain routines you care about (e.g., commuting to my last job I'd always avoid the 101 in favor of 280 if there was heavy rain), it's easy to let the computer translate raw weather information into signals you care about (should you take an alternate route, should you alter your schedule, ...).
Off-topic, do you know of a good source of weather covariates? E.g., a report with a 50% chance of rain for 2hr can easily mean light rain guaranteed for 2hr, a guaranteed 1hr of rain sometime in that 2hr period, a 50% chance that a 2hr storm will hit your town or the next town over, or all kinds of things. Does anybody report those raw model outputs?
[0] There isn't any protection from the open ocean (combined with a kayak that's a bit too top-heavy for the task at hand), which doesn't help, but the big problem is a sand bar just off the coast. If the tide isn't just right, even small swells are amplified into large breaking waves, and I don't particularly mind getting dumped upside down onto a sand bar, but I'd really prefer to spend that time in slightly calmer waters.
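The "AND all those together" step can be sketched roughly like this. It is a simplification: fixed thresholds stand in for the trained model described above, and every field name and threshold value here is hypothetical, picked only for illustration.

```python
from dataclasses import dataclass


@dataclass
class HourForecast:
    wave_height_m: float
    wind_speed_kmh: float
    tide_height_m: float


def is_launchable(hour: HourForecast,
                  max_wave_m: float = 0.5,
                  max_wind_kmh: float = 20.0,
                  min_tide_m: float = 1.0) -> bool:
    """AND together the per-factor signals; thresholds are made-up examples."""
    signals = [
        hour.wave_height_m <= max_wave_m,    # small enough waves
        hour.wind_speed_kmh <= max_wind_kmh,  # wind within a safe range
        hour.tide_height_m >= min_tide_m,     # enough water over the sand bar
    ]
    return all(signals)
```

Mapping `is_launchable` over each hour of a forecast gives the yes/no grid that can then be plotted on a calendar.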
No, I think if we follow the money, we will find the problem.
People keep saying this, and there are use cases for which this is definitely the case, but I find the opposite to be just as true in some circumstances.
I'm surprised at how good LLMs are at answering "me be monkey, me have big problem with code" questions. For simple one-offs like "how to do x in Pandas" (a frequent one for me), I often just give Claude a mish-mash of keywords, and it usually figures out what I want.
An example prompt of mine from yesterday, which Claude successfully answered, was "python sha256 of file contents base64 safe for fs path."
With a system prompt to make Claude's output super brief and a command to execute queries from the terminal via Simon Willison's LLM tool, this is extremely useful.
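For reference, that keyword prompt can be satisfied in a few lines of standard-library Python. This is my own sketch of one reasonable answer, not the output Claude gave, and the function name is made up.

```python
import base64
import hashlib
from pathlib import Path


def fs_safe_digest(path: str) -> str:
    """SHA-256 of a file's contents, encoded so it is safe to use in a file name."""
    digest = hashlib.sha256(Path(path).read_bytes()).digest()
    # The URL-safe alphabet uses '-' and '_' instead of '+' and '/';
    # the trailing '=' padding is stripped because it adds nothing here.
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
```

Note that `read_bytes()` loads the whole file into memory, which is fine for small files but would need chunked reads for large ones.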
Good communication with LLMs means using the fewest keywords that still let the LLM deduce exactly what you want.
I am not sure that is the case, at least with a large number of LLMs. CO-STAR and TIDD-EC are much more about structure and explanation than brevity.
Though I do not have a good idea of what _bad_ communication with an llm is. People say that sometimes, but when specific examples arise I do not really see anything more than limitations of llms (and the improvements they often suggest do not do anything either). So it would be good to have some more concrete examples, unless that is about inability to communicate a problem in general, stemming from actual inability to _understand_ the problem. Also, a lot changes over time: I think in the past one had to really coddle an llm ("You are the best expert in python in the world!") but I am not sure that is that important nowadays.
Bad communication: "My webapp doesn't work"
Good communication: "Nextjs, [pasted error]"
Bad communication is giving irrelevant information, or being too ambiguous, not providing enough or correct detail.
Then another example of good communication and efficiency in my view is for example "ts, fn leftpad, no text, code only".
I can understand what such a prompt means when someone writes it, and an LLM can understand this kind of query in any domain.
Although if I was using Copilot I would just write the bare minimum to trigger the auto complete I want so
const leftPad =
is probably enough.
and
> a key thing with LLMs is that their ability to help, as a tool, changes vastly based on your communication ability.
I still hold that the innovations we've seen as an industry with text transfer to the data from other domains. And there's an odd misbehavior with people that I've now seen play out twice -- back in 2017 with vision models (please don't shove a picture of a spectrogram into an object detector), and today. People are trying to coerce text models to do stuff with data series, or (again!) pictures of charts, rather than paying attention to timeseries foundation models which directly can work on the data.[1]
Further, the tricks we're seeing with encoder / decoder pipelines should work for other domains. And we're not yet recognizing that as an industry. For example, whisper or the emerging video models are getting there, but think about multi-spectral satellite data, or fraud detection (a type of graph problem).
There's lots of value to unlock from coding models. They're just text models. So what if you were to shove an abstract syntax tree in as the data representation, or the intermediate code from LLVM or a JVM or whatever runtime and interact with that?
[1] https://huggingface.co/ibm-granite/granite-timeseries-ttm-r1 - shout-out to some former colleagues!
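To make the abstract syntax tree idea concrete, here is a small sketch using Python's standard `ast` module. It only shows what the alternative representation looks like; how a model would actually consume it is left open, and the helper name is hypothetical.

```python
import ast
from typing import List


def function_names(source: str) -> List[str]:
    """Parse source into an AST and list the functions it defines."""
    tree = ast.parse(source)
    return [node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)]


def tree_as_text(source: str) -> str:
    """Serialize the AST to a string, one possible non-source input format."""
    return ast.dump(ast.parse(source))
```

Calling `tree_as_text("x = 1")` returns a nested `Module(...)` description of the assignment rather than the raw characters, which is the kind of structured input the comment is asking about.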
> It's a bit sad and confusing that LLMs ("Large Language Models") have little to do with language; It's just historical. They are highly general purpose technology for statistical modeling of token streams. A better name would be Autoregressive Transformers or something.
> They don't care if the tokens happen to represent little text chunks. It could just as well be little image patches, audio chunks, action choices, molecules, or whatever. If you can reduce your problem to that of modeling token streams (for any arbitrary vocabulary of some set of discrete tokens), you can "throw an LLM at it".
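As a toy illustration of "reducing your problem to modeling token streams", here is a sketch that turns a numeric series into integer token ids by uniform binning. The bin range and vocabulary size are assumptions chosen for the example; real systems use more careful tokenizers.

```python
from typing import List, Sequence


def quantize(values: Sequence[float], low: float, high: float, vocab_size: int) -> List[int]:
    """Map each value to a token id in [0, vocab_size - 1] by uniform binning."""
    width = (high - low) / vocab_size
    tokens = []
    for value in values:
        clipped = min(max(value, low), high)  # keep out-of-range values in bounds
        tokens.append(min(int((clipped - low) / width), vocab_size - 1))
    return tokens
```

Once the data is a sequence of ids from a fixed vocabulary, it has the same shape as tokenized text.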
Now that alone is not yet an argument against crypto currencies, and one person's frivolous squandering of resources is another person's essential service. But you can't simply point to the free market to absolve yourself of any responsibility for your consumption.
Acknowledging that facilitating scams (e.g. pig butchering) is cryptocurrency's primary (sole?) use case, I'm willing to look the other way if we end up with the grid we need to address the climate crisis.
The primary use case of crypto is to protect wealth from a greedy, corrupt, money-printing state. Everything else is a sideshow
Merely trading governments for corporations.
> Everything else is a sideshow
Agreed. Crypto is endlessly amusing.
I'm really not well suited to explain this stuff. Here's an article for a general (layperson) audience to help you on your journey. https://www.cbsnews.com/news/cryptocurrency-bitcoin-virtual-...
Happy hunting!
It's not as helpful as Google was ten years ago. It's more helpful than Google today, because Google search has slowly been corrupted by garbage SEO and other LLM spam, including their own suggestions.
I’ve written two large applications and about a dozen smaller ones using Claude as an assistant.
I’m a terrible front-end developer and almost none of that work was possible without Claude. The API and AWS deployment were sped up tremendously.
I’ve created unit tests and I’ve read through the resulting code and it’s very clean. One of my core pre-prompt requirements has always been to follow domain-driven design principles, something a novice would never understand.
I also start with design principles and a checklist that Claude is excellent at providing.
My only complaint is you only have a 3-4 hour window before you’re cut off for a few hours.
And needing an enterprise agreement to have a walled garden for proprietary purposes.
I was not a fan in Q1. Q2 improved. Q3 was a massive leap forward.
Maybe it was overtrained on react sources, but for me it's pretty useless.
The big annoyance for me is it just makes up APIs that don't exist. While that's useful for suggesting to me what APIs I should add to my own code, it's really pointless if I ask a question like "using libfoo how do I bar" and it tells me "call the doBar() function" which does not exist.
I suspect LLMs work for a lot of front-end and app coding just because code in those fields is insanely bloated and the value proposition is almost disconnected from logic. There must be metric tons of typing in those fields, and in those areas LLMs must be useful. They certainly handle paper test questions well.
I’m hitting my 40th year as a professional software developer and architect. I’ve written thousands of blocks of code from scratch. It gets boring.
But then in the 2000s I (and everyone else) started building code generators, often from ERD structures, but also UML designs.
These tools were massively useful and (initially) reduced costs. The future balls of mud problems took over ten years to arrive.
But code generation has always been considered a smart and cost-effective approach to building software.
GenAI has “issues” and those have been exposed. One of my recent revelations is that Claude is best at TypeScript and python. C# (my home turf) is much lower in its skills capacity.
So in the last two months I’ve been building my apps in TypeScript instead of C# and have dramatically increased my productivity.
Claude will definitely fail if it doesn’t have the correct information. A good example is writing Bluesky apps. The docs are a mess and contradictory. But there are up to date docs on GitHub and if you include those in your project with instructions to only use those references, Claude’s hallucinations can be eliminated.
I don’t think AGI is a real possibility in my lifetime, and I do fear the future of software development when no one has actual coding experience, but for us boomers, it’s pretty darn useful.
If someone was an expert React+TypeScript programmer with decent css knowledge the productivity may be a marginal improvement.
But I haven’t been a full-time programmer in ten years.
IME, being forced to write about something or verbally explaining/enumerating things in detail _by itself_ leads to a lot of clarity in the writer's thoughts, irrespective of if there's an LLM answering back.
People have been doing rubber-duck debugging for a long time. The metaphorical duck (LLMs in our context), if explained to well, has now started answering back with useful stuff!
I see much deeper problems. Just to give two examples:
- I asked various AIs to explain the proofs of some deep (established) mathematical theorems: the explanations were, to my understanding, heavily hallucinated, and thus worse than "obviously wrong". I also asked for literature references for some deep mathematical theory frameworks: basically all of the references were again hallucinated.
- I asked lots of AIs on https://lmarena.ai/ to write a suitably long text about some political topic that is quite controversial in my country (but does have lots of proponents even in a very radical formulation, even though most people would not use such a radical formulation in public). All of the LLMs that I checked refused or tried to indoctrinate me that this thesis is wrong. I did not ask the LLM to lecture me, but I gave it a concrete task! Society is deeply divided, so if the LLM only spreads propaganda of its political teaching, it will be useless for many tasks for a very significant share of the society.
Using a few messages to get them out of "I aim to be direct" AI assistant mode gets much better overall results for the rest of the chat.
Haiku is actually incredibly good at high level systems thinking. Somehow when they moved to a smaller model the "human-like" parts fell away but the logical parts remained at a similar level.
Like if you were taking meeting notes from a business strategy meeting and wanted insights, use Haiku over Sonnet, and thank me later.
If your model / chat app has the ability to always inject some kind of pre-prompt make sure to add something like “please do not jump to writing code. If this was a coding interview and you jumped to writing code without asking questions and clarifying requirements you’d fail”.
At the top of all your source files include a comment with the file name and path. If you have a project on one of these services add an artifact that is the directory tree (“tree --gitignore” is my go-to). This helps “unaided” chats get a sense of what documents they are looking at.
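A rough sketch of that workflow, in case it helps. Everything here is an assumption for illustration: the names `PRE_PROMPT`, `format_context` and `load_files` are hypothetical, the `# file:` header assumes a language with `#` comments, and the file listing is a flat stand-in for real `tree` output.

```python
from pathlib import Path

# Pre-prompt wording taken from the advice above.
PRE_PROMPT = (
    "Please do not jump to writing code. If this was a coding interview and you "
    "jumped to writing code without asking questions and clarifying requirements "
    "you'd fail."
)

def format_context(files):
    """Build one context string from a {relative_path: source_text} mapping.

    Emits the pre-prompt, a flat listing of paths, then each file prefixed
    with a comment naming its path.
    """
    paths = sorted(files)
    listing = "Project files:\n" + "\n".join(paths)
    bodies = [f"# file: {path}\n{files[path]}" for path in paths]
    return "\n\n".join([PRE_PROMPT, listing, *bodies])

def load_files(root, suffixes=(".py",)):
    """Read matching files under root into a {relative_path: text} mapping."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): p.read_text()
        for p in root.rglob("*")
        if p.is_file() and p.suffix in suffixes
    }

if __name__ == "__main__":
    print(format_context(load_files(".")))
```

Note that `load_files` does not honor `.gitignore`, so on a real project you would probably still paste the output of the `tree` command instead of the flat listing.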
And also, it’s a professional bullshitter so don’t trust it with large scale code changes that rely on some language / library feature you don’t have personal experience with. It can send you down a path where the entire assumption that something was possible turns out to be false.
Does it seem like a lot of work? Yes. Am I actually more productive with the tool than without? Probably. But it sure as shit isn’t “free” in terms of time spent providing context. I think the more I use these models, the more I get a sense of what it is good at and what is going to be a waste of time.
Long story short, prompting is everything. These things aren’t mind readers (and worse they forget everything in each new session)
My mind generally uses language as little as possible, I have no inner monologue running in the background.
Greatly prefer something deterministic to random bs popping up without the ability of recognizing it.
I don’t like llms but sometimes use them as autocomplete or to generate words, like a template for a letter or boilerplate scripts, never for actual information (à la google).
I don't use it exclusively, but damn does it help in the right places.
I could only ever really jam with 4o.
Makes me wonder if there's personal communication preferences at play here.
I had made the specific operation generic (moving it out of the struct and into a trait) but forgot to delete it from the struct, so I was calling the incorrect function. Claude pinpointed the cache issue immediately when I just dumped two files into the context and asked it:
at first that seemed to fix the issue, but other errors persisted. so we kept debugging together until we found the root cause. either way I knew where to look thanks to its assistance
A very big surprise is just how much better Sonnet 3.5 is than Haiku. Even the confusingly-more-expensive-Haiku-variant Haiku 3.5 that's more recent than Sonnet 3.5 is still much worse.
I.e. over time it constitutes a fundamental shift in how we interact with abstractions in computers. The current fundamentals will still remain but they will become increasingly malleable. Details in code will become less important. Architecture will become increasingly important. But at the same time the cost of refactoring or changing architecture will quickly drop.
Any details that are easily lost when passing through an LLM will be details that have the highest maintenance cost. Any important details that can be retained by an LLM can move up and down the ladder of abstraction at will.
Can an LLM based solution maintain software architectures without introducing noise? The answer to that is the difference between somewhat useful and game changing.
“Be logical,” said the scorpion. “If I stung you I’d certainly drown myself.”
“That’s true,” the frog acknowledged. “Climb aboard, then!” But no sooner than they were halfway across the river, the scorpion stung the frog, and they both began to thrash and drown. “Why on earth did you do that?” the frog said morosely. “Now we’re both going to die.”
“I can’t help it,” said the scorpion. “It’s my nature.”
All the tasks I can think of dealing with on my own computer that would take hours, a) are actually pretty interesting to me and b) would equally well take hours to "provide perfect guidance". The drudge work of programming that I notice comes in blocks of seconds at a time, and the mental context switch to using an LLM would be costlier.
The GP is claiming GPT4o is bad but Sonnet is good. GPT4o is about only 20% cheaper than Sonnet.
The point being made by the original comment (with which I agree) was that many criteria-for-usefulness - primarily that of reliability or a lack of hallucination - have remained static; with successive generations of tools being (falsely) claimed to meet them, but then abandoned when the next hype-train comes along.
I certainly agree that _some_ aspects of AI models are indeed improving (often drastically!) over time (speed, price, supported formats, history/context, etc.) - but they still _all_ fall _drastically_ short on the key core requirement that is required in order to make them Actually Useful. "X is better than Y" does not imply "where Y failed to be useful, X now succeeds".
If someone told me an iPhone 4 is terrible but an iPhone 5 would definitely serve my needs, and then when I get an iPhone 5 they say the same of the 6, do you really want me to believe them a second time? Then a third time? Then a 4th? In the meantime my time and money are wasted?
My son throwing an irrational tantrum at the amusement park and I can't figure out why he's like that (he won't tell me or he doesn't know himself either) or what I should do? I feed Claude all the facts of what happened that day and ask for advice. Even if I don't agree with the advice, at the very least the analysis helps me understand/hypothesize what's going on with him. Sure beats having to wait until Monday to call up professionals. And in my experience, those professionals don't do a better job of giving me advice than Claude does.
It's weekend, my wife is sick, the general practitioner is closed, the emergency weekend line has 35 people in the queue, and I want some quick half-assed medical guidance that while I know might not be 100% reliable, is still better than nothing for the next 2 hours? Feed all the symptoms and facts to Claude/ChatGPT and it does an okay job a lot of the time.
I've been visiting a Traditional Chinese Medicine (TCM) practitioner for a week now and my symptoms are indeed reducing. But the TCM paradigm and concepts are so different from western medicine paradigms and concepts that I can't understand the doctor's explanation at all. Again, Claude does a reasonable job of explaining to me what's going on or why it works from a western medicine point of view.
Want to write a novel? Brainstorm ideas with GPT-4o.
I had a debate with a friend's child over the correct spelling of a Dutch word ("instabiel" vs "onstabiel"). Google results were not very clear. ChatGPT explained it clearly.
Just where is this "useless" idea coming from? Do people not have a life outside of coding?
It seems like you trust AI more than people and prefer it to direct human interaction. That seems to be satisfying a need for you that most people don't have.
This feels identical to when I was an early "smart phone" user w/my palm pilot. People would condescend saying they didn't understand why I was "on it all the time". A decade or two later, I'm the one trying to get others to put down their phones during meetings.
My take? Those who aren't using AI continually currently are simply later adopters of AI. Give it a few years - or at most a decade - and the idea of NOT asking 100+ AI queries per day (or per hour) will seem positively quaint.
I don't think you're wrong, I just think a future in which it's all but physically and socially impossible to have a single thought or communication not mediated by software is fucking terrifying.
LLMs are infinitely patient, don't think I am dumb for asking certain things, consider all the information I feed them, are available whenever I need them, have a wide range of expertise, and are dirt cheap compared to professionals.
That they might hallucinate is not a blocker most of the time. If the information I require is critical, I can always double check with my own research or with professionals (in which case the LLM has already primed me with a basic mental model so that I can ask quick, short, targeted questions, which saves the both of us time, and me money). For everything else (such as my curiosity on why TCM works, or the correct spelling of a word), LLMs are good enough.
Have you never seen knowledgeable people get things wrong, and having to verify them?
Did you miss the part where they cost money, and I better come in as prepared as possible?
I really don't get these knee-jerk averse reactions. Are people deliberately reading past my assertions that I double check LLM outputs for everything critical?
We don't know that. They could be laughing their ass off at you without telling you.
You don’t understand how medicine works, at any level.
Yet you turn to a machine for advice, and take it at face value.
I say these things confidently, because I do understand medicine well enough not to seek my own answers. Recently I went to a doctor for a serious condition and every notion I had was wrong. Provably wrong!
I see the same behaviour in junior developers that simply copy-paste in whatever they see in StackOverflow or whatever they got out of ChatGPT with a terrible prompt, no context, and no understanding on their part of the suitability of the answer.
This is why I and many others still consider AIs mostly useless. The human in the loop is still the critical element. Replace the human with someone that thinks that powdered rhino horn will give them erections, and the utility of the AI drops to near zero. Worse, it can multiply bad tendencies and bad ideas.
I’m sure someone somewhere is asking DeepSeek how best to get endangered animal parts on the black market.
So I am curious about how TCM works. So what if an LLM hallucinates there? I am not writing papers on TCM or advising governments on TCM policy. I still follow the doctor's instructions at the end of the day.
For anything really critical I already double check with professionals. As you said, human in the loop is important. But needing human in the loop does not make it useless.
You are letting perfect be the enemy of good. A half-assed tax advice with some hallucinations from an LLM is still useful, because it will prime me with a basic mental model. When I later double check the whole thing with a professional, I will already know what questions to ask and what direction I need to explore, which saves time and money compared to going in with a blank slate.
The other day I had Claude advise me on how to write a letter to a judge to fight a traffic fine. We discussed what arguments to make, from what perspective a judge will see things, and thus what I should plead for. The traffic fine is a few hundred euros: a significant amount, but barely an hour's worth of a real lawyer's fee. It makes absolutely no sense to hire a real lawyer here. If this fails, the worst thing that can happen is that I won't get my traffic fine reimbursed.
There is absolutely nothing wrong with using LLMs when you know their limits and how to mitigate them.
So what if every notion you learned about medicine from LLMs is wrong? You learn why they're wrong, then next time you prompt/double check better, until you learn how to use it for that field in the least hallucinationatory way. Your experience also doesn't match mine: the advice I get usually contains useful elements that I then discuss with doctors. Plus, doctors can make mistakes too, and they can fail to consider some things. Twitter is full of stories about doctors who failed to diagnose something but ChatGPT got it right.
Stop letting perfect be the enemy of good. Occasionally needing human in the loop is completely fine.
[1] It isn't actually western, because it's also used in the east, middle-east, south, both sides of every divide, etc... In the same sense, there is no "western chemistry" as an alternative to "eastern alchemy". There's "things that work" versus "things that make you feel slightly better because they're mild narcotics or stimulants... at best."
(I don't want to focus too much on Chinese herbal medicine, because I see the same cargo-culting non-scientific thinking in code development too. I've lost count of the number of times I've seen an n-tier SPA monstrosity developed for something that needed a tiny monolithic web app, but mumble-mumble-best-mumble-practices.)
The Chinese call the practice of truth seeking, in a broader sense (outside of medicine), just "science".
"Western" medicine is also not merely the practice of seeking universal medical truth. It is also a collection of paradigms that have been developed in its long history. Like all paradigms, there are limits and drawbacks: phenomena that do not fit well. Truth seeking tends to be done on established paradigms rather than completely new ones.
The "western" prefix is helpful in contrasting it with TCM, which has a completely different paradigm. Many Chinese, myself included, have the experience that there are all sorts of ailments that are not meaningfully solved by "western" medicine practitioners, but are meaningfully solved by TCM practitioners.
It's like people who proclaim that Linux as a whole is a useless toy because it doesn't run their favorite games or favorite Windows app. They focus on this one flaw and miss all the opportunities.
Many of these people seem to advocate trusting human professionals. Do you have any idea how often human professionals do a half-assed job, and I have to verify them rather than blindly trusting them? The situation is not that much different from LLMs.
Professionals making mistakes do not make them useless. Grandma, with all her armchair expertise, is often right and sometimes wrong, and that does not make her useless either.
Why let perfect be the enemy of good?
Conversely, my trust of Russian / Chinese / USian platforms is low enough that I consider it my duty to publicly shame people that still use them in 2025.
(With some caveats of course, for instance HN is not yet a negative to the world. Yet.)
There's also the question of stickiness of habits: your grandmas are for life, human professionals you might have a shallow enough relationship with that switching them might be relatively easy, while it might be very hard to stop smoking or to stop using Github once you started smoking / created an account.
If you aren't a coder, it's hard to find much utility in "Google, but it burns a tree whenever you make an API call, and everything it tells you might be wrong". I for one have never used it for anything else. It just hasn't ever come up.
It's great at cheating on homework, kids love GPTs. It's great at cheating in general, in interviews for instance. Or at ruining Christmas, after this year's LLM debacle it's unclear if we'll have another edition of Advent of Code. None of this is the technology's fault, of course, you could say the same about the Internet, phones or what have you, but it's hardly a point in favor either.
And if you are a coder, models like Claude actually do help you, but you have to monitor their output and thoroughly test whatever comes out of them, a far cry from the promises of complete automation and insane productivity gains.
If you are only a consumer of this technology, like the vast majority of us here, there isn't that much of an upside in being an early adopter. I'll sit and wait, slowly integrating new technology in my workflow if and when it makes sense to do so.
Happy new year, I guess.
Other than, y'know, using the new tools. As a programmer heavy forum, we focus a lot on LLMs' (lack of) correctness. There's more than a little bit of annoyance when things are wrong, like being asked to grab the red blanket and then getting into an argument over it being orange instead of what was important, someone needed the blanket because they were cold.
Most of the non-tech people who use ChatGPT that I've talked to absolutely love it because they don't feel it judges them for asking stupid questions and they have conversations about absolutely everything in their lives with it down to which outfit to wear to the party. There are wrong answers to that question as well, but they're far more subjective and just having another opinion in the room is invaluable. It's just a computer and won't get hurt if you totally ignore its recommendations, and even better, it won't gloat (unless you ask it to) if you tell it later that it was right and you were wrong.
Some people have found upsides for themselves in their lives, even at this nascent stage. No one's forcing you to use one, but your job isn't going to be taken by AI, it's going to be taken by someone else who can outperform you that's using AI.
Clearly said, yet the general sentiment awakens in me a feeling more gothic horror than bright futurism. I am struck with wonder and worry at the question of how rapidly this stuff will infiltrate into the global tech supply chain, and the eventual consequences of misguided trust.
To my eye, too much current AI and related tech are just exaggerated versions of magic 8-balls, Ouija boards, horoscopes, or Weizenbaum's ELIZA. The fundamental problem is people personifying these toys and letting their guard down. Human instincts take over and people effectively social engineer themselves, putting trust in plausible fictions.
It's not just LLMs though. It's been a long time coming, the way modern tech platforms have been exaggerating their capability with smoke and mirrors UX tricks, where a gleaming facade promises more reality and truth than it actually delivers. Individual users and user populations are left to soak up the errors and omissions and convince themselves everything is working as it should.
Someday, maybe, anthropologists will look back on us and recognize something like cargo cults. When we kept going through the motions of Search and Retrieval even though real information was no longer coming in for a landing.
But not at exploring what is at the border of knowledge itself. And by converging on the conventional, LLMs actually lead you away from anything that actually extends.
> doing boring tasks for which you can provide perfect guidance
That's true but you never need an LLM for that. There are wonderful scripts written by wonderful people and provided for free almost all the time and for those who search in the right places. LLM companies benefit/profit from these without providing anything in return.
They are worse than people who grab FOSS and turn it into overpriced and aggressively marketed business models and services or people who threaten and sue FOSS for being better and free alternatives to their bloated and often "illegally telemetric" services.
> able to accelerate you
True, but you leave too much for data brokers and companies like Meta to abuse and exploit in the future. All that additional "interactional data" will do so much worse to humanity than all those previous data sets did in elections, for example, or pretty much all consumer markets. They will mostly accelerate all these dimwitted Fortune 5000 companies that have sabotaged consumers into way too much dumb shit - way more than is reasonable or "ok". And educated, wealthy and or tech-savvy people won't be able to avoid/evade any of that. Especially when it's paired with meds, drugs, foods, biases, fallacies, priming and so on and all the knowledge we will gain on bio-chemical pathways and human liability to sabotage.
They are great for coders, of course, everyone can be an army of clone-warriors with auto-complete on steroids now and nobody can tell you what to do with all that time that you now have and all that money, which, thanks to all of us but mostly our ancestors, is the default. The problem is the resulting hyper-amplified, augmented financial imbalance. It's gonna fuck our species if all the technical people don't restore some of that balance, and everybody knows what that means and what must be done.
I haven’t found anything comparably good for JetBrains IDEs yet, but I’m also not switching to something else as my main editor.
Each task / programming language / query requires trying different LLM models and novel ways of prompting. If it's not work-related (or work pays for the one you use) sending as much of the code as relevant also helps the answers be more useful.
Most of the people I meet that say LLMs are not useful have only tried one (flavor / plugin), do not know how to pre-prompt or prompt, and do not give the tools a chance. They try one or two things, say yep, it's not good and give up.
Still hard for me to admit that Prompt Engineering is a profession, but it's the same as Google Fu. Once you learn it you can become an LLM Ninja!
I do not believe LLMs are coming for my job (just yet) but do believe they are going to be able to replace some people, are useful and those that do not use them will be at a disadvantage.
These people may not be Software Engineers, but they are coding.
https://www2.math.upenn.edu/~ghrist/preprints/LAEF.pdf - this math textbook was written in just 55 days!
Paraphrasing the acknowledgements -
...Begun November 4, 2024, published December 28, 2024.
...assisted by Claude 3.5 sonnet, trained on my previous books...
...puzzles co-created by the author and Claude
...GPT-4o and -o1 were useful in latex configurations...doing proof-reading.
...Gemini Experimental 1206 was an especially good proof-reader
...Exercises were generated with the help of Claude and may have errors.
...project was impossible without the creative labors of Claude
The obvious comparison is to the classic Strang https://math.mit.edu/~gs/everyone/ which took several *years* to conceptualize, write, peer review, revise and publish.
Ok maybe Strang isn't your cup of tea, :%s/Strang/Halmos/g , :%s/Strang/Lipschutz/g, :%s/Strang/Hefferon/g, :%s/Strang/Larson/g ...
Working through the exercises in this new LLMbook, I'm thinking...maybe this isn't going to stand the test of time. Maybe acceleration is not so hot after all.
Maybe I'm not the target audience, but... that really doesn't make me interested in continuing to read.
The overuse of the $15 synonyms is almost always a bad idea--you want to use them sparingly, where dropping them in for their subtly different meanings enhances the text. But what is extremely sloppy here is that the possibilities of "no solutions, one solution, infinite solutions" is now being described with a different metaphor for solution here. And by the end of the paragraph, I'm not actually sure what point I'm supposed to take away from this text. (As bad as this paragraph is, the next paragraph is actually far worse.)
Mathematics already has a problem for the general audience with a heavy focus on abstraction that can be difficult to intuit on more concrete objects. Adding florid metaphors to spice up your writing makes that problem worse.
I'm agreeing with you.
x+y=1, x+y=2 clearly has no solution since two numbers can’t simultaneously add to both one and two.
x+y=1, 2x+2y=2 clearly has infinitely many solutions. There’s only one equation here after canceling the 2, so you can plug in x’s and y’s all day long, no end to it.
x+y=1, 2x+y=1 clearly has exactly one solution (0,1) after elimination.
This example stuck with me so I use it even now. The author/Claude/Gemini/whatever could have just used this simple example instead of “trichotomy of curves through space conjoin through the realm of …” math, not Shakespeare.
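For anyone who wants to poke at the three cases, here is a minimal sketch that classifies a pair of equations a1*x + b1*y = c1, a2*x + b2*y = c2 using the determinant. The function name `classify` is made up, and it assumes each equation is a genuine line (at least one of its two coefficients is nonzero).

```python
from fractions import Fraction

def classify(a1, b1, c1, a2, b2, c2):
    """Classify the system a1*x + b1*y = c1, a2*x + b2*y = c2.

    Assumes each equation has at least one nonzero coefficient.
    """
    det = a1 * b2 - a2 * b1
    if det != 0:
        # Lines intersect in exactly one point (Cramer's rule).
        x = Fraction(c1 * b2 - c2 * b1, det)
        y = Fraction(a1 * c2 - a2 * c1, det)
        return "one solution", (x, y)
    # det == 0: the lines are parallel or identical.
    if a1 * c2 - a2 * c1 == 0 and b1 * c2 - b2 * c1 == 0:
        return "infinitely many solutions", None
    return "no solution", None

if __name__ == "__main__":
    print(classify(1, 1, 1, 1, 1, 2))  # x+y=1, x+y=2
    print(classify(1, 1, 1, 2, 2, 2))  # x+y=1, 2x+2y=2
    print(classify(1, 1, 1, 2, 1, 1))  # x+y=1, 2x+y=1
```

Exact fractions are used instead of floats so that the zero checks are not thrown off by rounding.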
To explain this I would first and foremost use a picture, where the 3 cases : parallel, identical, intersection can be intuitively seen (using our visual system, rather than our language system), with merely a glance.
Great on the surface but lacks any depth, cohesion, or substance
Then I'd have Claude create text. I'd then edit/refine each chapter's text.
Wow, was it unpleasant. It was kinda cool to see all the words put together, but editing the output was a slog.
It's bad enough editing your own writing, but for some reason this was even worse.
The date/time that divides my world into before/after is AlphaGo v Lee Sedol game 3 (2016). From that time forward, I don't dismiss out of hand speculations of how soon we can have intelligent machines. Ray Kurzweil's date of 2045 is as good as any (and better than most) for an estimate. Like Moore's (and related) Laws, it's not about how but the historical pace of advancements crossing a fairly static point of human capability.
Application coding requires much less intelligence than playing Go at these high levels. The main differences are concise representation and clear final outcome scoring. LLMs deal quite well with the fuzziness of human communications. There may be a few more pegs to place but when seems predictably unknown.
I wish the author qualified this more. How does one develop that skill?
What makes LLMs so powerful on a day to day basis without a large RAG system around it?
Personally, I try LLMs every now and then, but haven’t seen any indication of their usefulness for my day to day outside of being a smarter auto complete.
In my experience, LLM tools are the same, you ask for something basic initially and then iteratively refine the query either via dialog or a new prompt until you get what you are looking for or hit the end of the LLM's capability. Knowing when you've reached the latter is critically important.
Specifically, I’ve been using Kagi Assistant over the past 1.5 months for serious and lengthy searches, and I can’t imagine going back to traditional search.
I’m currently sold on this model of LLM-assisted search (where explicit links are provided) over the old Google-fu skills I developed during grad school.
Example search topics include deep dives and guidance for my first NAS build, finding new bioinformatics methods, and other random biomedical info.
* Most existing LLM interfaces are very bad at editing history, instead focusing entirely on appending to history. You can sort of ignore this for one-shot, and this can be properly fixed with additional custom tools, but ...
* By the time you refine your input enough to patch over all the errors in the LLM's output for your sensible input, you're bigger than the LLM can actually handle (much smaller than the alleged context window), so it starts randomly ignoring significant chunks of what you wrote (unlike context-window problems, the ignored parts can be anywhere in the input).
Also ChatGPT has a pretty big context window. Gemini supposedly has the biggest useful context window (~millions of tokens), though I don't have personal experience.
Somebody somewhere needs to provide a threaded interface to an LLM.
A lot of my most complex LLM interactions take place across multiple sessions - and in some cases I'll even move the project from Claude 3.5 Sonnet to OpenAI o1 (or vice versa) to help get out of a rut.
It's infuriatingly difficult to explain why I decide to do that though!
I feel like I’m good at understanding context. I’ve been working in AI startups over the last 2 years. Currently at an AI search startup.
Managing context for info retrieval is the name of the game.
But for my personal use as a developer, they’ve caused me much headache.
Answers that are subtly wrong in such a way that it took me a week to realize my initial assumption based on the LLM response was totally bunk.
This happened twice. With the yjs library, it gave me half incorrect information that led me to misimplementing the sync protocol. Granted it’s a fairly new library.
And again with the web history api. It said that the history stack only exists until a page reload. The examples it gave me ran as it described, but that isn’t how the history api works.
I lost a week of time because of that assumption.
I’ve been hesitant to dive back in since then. I ask questions every now and again, but I jump off much faster now if I even think it may be wrong.
In the case you were in I would go out of my way to feed the docs to the LLM and then use the LLM to interrogate the docs and then verify the understanding I got from the LLM with a personal reading of the docs that were relevant.
You might think it takes just as long, if not longer, to do it my way rather than just reading the docs myself. Sometimes it can. But as you get good at the workflow you find that the time spent finding the relevant docs goes down and you get an instant plausible interpretation of the docs added too. You can then very quickly produce application code right away and then docs of the code you write.
- Running micro-benchmarks (using Python in Code Interpreter) - if I have a question about which of two approaches is faster I often use this pattern: https://simonwillison.net/2023/Apr/12/code-interpreter/
- Building small ad-hoc one-off tools. Many of the examples in https://simonwillison.net/2024/Oct/21/claude-artifacts/ fit that bill, and I have a bunch more in my tools tag here: https://simonwillison.net/tags/tools/ - Geoffrey Litt wrote a great piece the other day about custom developer tools which matches how I think about this: https://www.geoffreylitt.com/2024/12/22/making-programming-m...
- Building front-end prototypes - I use Claude Artifacts for this all the time, if I have an idea for a UI I'll get Claude to spin up an almost instant demo so I can interact with it and see if it feels right. I'll often copy the code out and use it as the starting point for my production feature.
- DSLs like SQL, Bash scripts, jq, AppleScript, grep - I use these WAY more than I used to because 9/10 times Claude gives me exactly what I needed from a single prompt. I built a CLI tool for prompt-driven jq programs recently: https://simonwillison.net/2024/Oct/27/llm-jq/
- Ad-hoc sidequests. This is a pretty broad category, but it's effectively little coding projects which I shouldn't actually be working on at all but I'll let myself get distracted if an LLM can get me there in a few minutes: https://simonwillison.net/2024/Mar/22/claude-and-chatgpt-cas...
- Writing C extensions for SQLite while I'm walking my dog on the beach. I am not a C programmer but I find it extremely entertaining that ChatGPT Code Interpreter, prompted from my phone, can write, compile and test a C extension for SQLite for me: https://simonwillison.net/2024/Mar/23/building-c-extensions-...
- That's actually a good example of a general pattern: I use this stuff for exploratory prototyping outside of my usual (Python+JavaScript) stack all the time. Usually this leads nowhere, but occasionally it might turn into a real project (like this AppleScript example: https://til.simonwillison.net/gpt3/chatgpt-applescript )
- Actually writing code. Here's a Python/Django app I wrote almost entirely with Claude: https://simonwillison.net/2024/Aug/8/django-http-debug/ - again, this was something of a side-project - not something worth spending a full day on but worthwhile if I could get it done in a couple of hours.
- Mucking around with APIs. Having a web UI for exploring an API is really useful, and Claude can often knock those out from a single prompt. https://simonwillison.net/2024/Dec/17/openai-webrtc/ is a good example of that.
There's a TON more, but this probably represents the majority of my usage.
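As an illustration of the micro-benchmark pattern from the first bullet, here is a minimal sketch of the kind of script one might ask an LLM to write and run. The two functions being compared (`join_with_plus`, `join_with_join`) and the `benchmark` helper are hypothetical examples, not taken from the linked post; only the standard-library `timeit` module is used.

```python
import timeit


def join_with_plus(n: int) -> str:
    """Build a string by repeated concatenation."""
    out = ""
    for i in range(n):
        out += str(i)
    return out


def join_with_join(n: int) -> str:
    """Build the same string with str.join."""
    return "".join(str(i) for i in range(n))


def benchmark(func, n: int = 1000, repeat: int = 5, number: int = 200) -> float:
    """Return the best observed time per call, in seconds."""
    timings = timeit.repeat(lambda: func(n), repeat=repeat, number=number)
    # The minimum is the least noisy estimate of the per-call cost.
    return min(timings) / number


if __name__ == "__main__":
    # Both approaches must agree before their speed is worth comparing.
    assert join_with_plus(100) == join_with_join(100)
    for func in (join_with_plus, join_with_join):
        print(f"{func.__name__}: {benchmark(func) * 1e6:.1f} us per call")
```

The results depend on the machine and Python version, so the point is the pattern (check both approaches agree, then time them), not any particular numbers.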
I’ll read through these and try again in the new year.
also nice to interact with an LLM in vim, as the context is the buffer
obviously simon’s llm tool rules. I’ve wrapped it for vim
I'd love to figure this out. I've written more about them than most people at this point, and my goal has always been to help people learn what they can and cannot do - but distilling that down to a concise set of lessons continues to defeat me.
The only way to really get to grips with them is to use them, a lot. You need to try things that fail, and other things that work, and build up an intuition about their strengths and weaknesses.
The problem with intuition is it's really hard to download that into someone else's head.
I share a ton of chat conversations to show how I use them - https://simonwillison.net/tags/tools/ and https://simonwillison.net/tags/ai-assisted-programming/ have a bunch of links to my exported Claude transcripts.
My first stab at trying ChatGPT last year was asking it to write some Rust code to do audio processing. It was not a happy experience. I stepped back and didn't play with LLMs at all for a while after that. Reading your posts has helped me keep tabs on the state of the art and decide to jump back in (though with different/easier problems this time).
You're misrepresenting it here.
The point of that post isn't "look at these incredible projects I've built (proceeds to show simple projects)."
It's "I built 14 small and useful tools in a single week, each taking between 2 and 10 minutes".
The thing that's interesting here is that I can have an LLM kick out a working prototype of a small, useful tool in only a little more time than it takes to run a Google search.
That post isn't meant to be about writing "real production code". I don't know why people are confused over that.
Instead, think of an LLM as the equivalent of giving a human a menial task. You know that they're not 100% reliable, and so you give them only tasks that you can quickly verify and correct.
Abstract that out a bit further, and realize that most managers don't expect their reports to be 100% reliable.
Don't use LLMs where accuracy is paramount. Use it to automate away tedious stuff. Examples for me:
Cleaning up speech recognition. I use a traditional voice recognition tool to transcribe, and then have GPT clean it up. I've tried voice recognition tools for dictation on and off for over a decade, and always gave up because even a 95% accuracy is a pain to clean up. But now, I route the output to GPT automatically. It still has issues, but I now often go paragraphs before I have to correct anything. For personal notes, I mostly don't even bother checking its accuracy - I do it only when dictating things others will look at.
And then add embellishments to that. I was dictating out a recipe I needed to send to someone. I told GPT up front to write any number that appears next to an ingredient as a numeral (i.e. 3 instead of "three"). Did a great job - didn't need to correct anything.
And then there are always the "I could do this myself but I didn't have time so I gave it to GPT" category. I was giving a presentation that involved graphs (nodes, edges, etc). I was on a tight deadline and didn't want to figure out how to draw graphs. So I made a tabular representation of my graph, gave it to GPT, and asked it to write graphviz code to make that graph. It did it perfectly (correct nodes and edges, too!)
Sure, if I had time, I'd go learn graphviz myself. But I wouldn't have. The chance I'll need graphviz again in the next few years is virtually 0.
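For a sense of what the table-to-graphviz task looks like, here is a minimal sketch that turns a small edge table into Graphviz DOT text. The table contents and the `edges_to_dot` helper are made up for illustration; this is not the code GPT produced in the anecdote above.

```python
def _quote(text) -> str:
    """Wrap a value in double quotes, escaping embedded quotes for DOT."""
    return '"' + str(text).replace('"', '\\"') + '"'


def edges_to_dot(edges, name: str = "G", directed: bool = True) -> str:
    """Convert rows of (source, target, label) into DOT source text."""
    keyword = "digraph" if directed else "graph"
    arrow = "->" if directed else "--"
    lines = [f"{keyword} {name} {{"]
    for src, dst, label in edges:
        # Only emit a label attribute when the row actually has one.
        attrs = f" [label={_quote(label)}]" if label else ""
        lines.append(f"    {_quote(src)} {arrow} {_quote(dst)}{attrs};")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical tabular representation: one row per edge.
    table = [
        ("load", "clean", "raw rows"),
        ("clean", "plot", ""),
    ]
    print(edges_to_dot(table))
```

The output can be saved to a file and rendered with the Graphviz `dot` command. Because the edge list is short, checking the generated nodes and edges against the table by eye is quick, which is what makes this a reasonable task to delegate.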
I've actually used LLMs to do quick reformatting of data a few times. You just have to be careful that you can verify the output quickly. If it's a long table, then don't use LLMs for this.
Another example: I have a custom note taking tool. It's just for me. For convenience, I also made an HTML export. Wouldn't it be great if it automatically made alt text for each image I have in my notes? I would just need to send it to the LLM and get the text. It's fractions of a cent per image! The current services are a lot more accurate at image recognition than I need them to be for this purpose!
Oh, and then of course, having it write Bash scripts and CSS for me :-) (not a frontend developer - I've learned CSS in the past, but it's quicker to verify whatever it throws at me than Google it).
Any time you have a task and lament "Oh, this is likely easy, but I just don't have the time" consider how you could make an LLM do it.
Then why do people keep pushing it for code related tasks?
Accuracy and precision are paramount with code. It needs to express exactly what needs to be done and how.
If the LLM hallucinates something the code won't compile or run.
If the LLM makes a logic error you'll catch it in the manual QA process.
(If you don't have good personal manual QA habits, don't try using LLMs to write your code. And maybe don't hit "accept" on other developer's code reviews either?)
This is an overly simplistic view of software development.
Poorly made abstractions and functions will have knock on effects on future code that can be hard to predict.
Not to mention that code can have side effects that may not affect a given test case, or the code could be poorly optimized, etc.
Just because code compiles or passes a test does not mean it’s entirely correct. If it did, we wouldn’t have bugs anymore.
The usual response to this is something like “we can use the LLM to refactor LLM code if we need” but, in my experience, this leads to very complex, hard to reason about codebases.
Especially if the stack isn’t Python or JavaScript.
Instead of going through a multi step process to get an LLM to generate it, review it, reject it, and repeat…
I wonder why you reply to these comments, but not my other asking what you use LLMs for and specifically explaining how they failed me.
They don't. You are likely experiencing selection bias. My guess is you work in SW, and so it makes sense that you're the target of those campaigns. The bulk of ChatGPT subscribers are not doing SW, and no one is bugging them to use it for code related tasks.
Obviously people not in the software field wouldn’t care…
If you zero-prompt and copy-paste the first result into your codebase, yeah, the accuracy problem will rear its ugly head real quick.
The problem is: for the tasks that I can give the LLM (or human) that I can easily verify and correct, the LLM fails with the majority of them, for example
- programming tasks in my area of expertise (which is more "mathematical" than what is common in SV startups), where I know what a high-level solution has to look like, and where I can ask the LLM to explain the gory details to me. Yes, these gory details are subtle (which is why the task can be menial), but the code has to be right. I can verify this, and the code is not correct.
- getting literature references about more obscure scientific (in particular mathematical) topics. I can easily check whether these literature references (or summaries of these references) are hallucinations - they typically are.
Your second task is not a "task", but a knowledge search. LLMs are not good with searches (unless augmented - like RAG).
My programmer mind tells me that "tedious stuff" is where accuracy is the most important.
The best prompts though are always written in a separate text file for me and pasted in. Follow up questions are never as good as a detailed initial prompt.
I would imagine well formulated questions to solve the problem at hand is a skill but beyond that I don't think there is anything special about how to ask LLMs a question.
In areas where the LLM is rather useless, no amount of variation in prompting can solve that problem IMO. Just like if the task is something the LLM is good at, the prompt can be pretty sloppy and seem like magic with how it can understand what you want.
The tricky problem with LLMs is identifying failures - if you're asking the question, it's implied that you don't have enough context to assess whether it's a hallucination or a good recommendation! One approach is to build ensembles of agents that can check each other's work, but that's a resource-intensive solution.
Let people work how they want. I wouldn’t refuse to hire someone on the basis of them not using a language server.
The creator of the Odin language famously doesn’t use one. He says that he, specifically, is faster without one.
They didn’t say how heavily they weight the question.
(All that said I expect that, soon, experience with the appropriate LLM tooling will be as important as having experience with the language your system is implemented in.)
I can’t use perforce while my company is on git.
But if I do or do not use an LLM to assist me while coding, my team is unaffected.
If someone liked jetbrains, but your team used neovim, would you force them to use neovim?
Though nobody should care if I edited my text files with neovim as long as I still used the same toolchain as everyone else.
If you're "working the way you want to" ie still handrolling all your code, you're going to find my expectations unrealistic, and that is certainly not fair to you.
Recently, I shared a code base with a junior dev and she was surprised with the speed and sophistication of the code. The LLM did 80+% of the "coding".
What was telling was as she was grokking the code (for helping the ~20%), she was surprised at the quality of the code - her use of the LLM did not yield code of similar quality.
I find that the more domain awareness one brings to the table, the better the output is. Basically the clearer one's vision of the end-state, the better the output.
One other positive side-effect of using "LLMs as a junior-dev" for me has been that my ambitions are greater. I want it all - better code, more sophisticated capabilities even for relatively not-important projects, documentation, tests, debug-ability. And once the basic structure is in place, many a time it is trivial to get the rest.
It's never 100%, but even with 80+%, I am faster than ever before, deliver better quality code, and can switch domains multiple times a week and never feel drained.
Sharing best AI hacks within a team will have the same effect as code-reviews do in ensuring consistency. Perhaps an "LLM chat review", especially when something particularly novel was accomplished!
If I was in an environment that didn't allow hosted API models I'd absolutely be looking into the various Llama 3 models or Qwen2.5-Coder-32B.
Then I paste that into the Claude web interface or Google's AI Studio if it's too long for Claude and ask questions there.
Sometimes I'll pipe it straight into my own LLM CLI tool and ask questions that way:
I can later start a chat session on top of the accumulated context like this: (The -c means "continue most recent conversation in the chat").
I haven't actually done many experiments with long context local models - I tend to hit the hosted API models for that kind of thing.
Once that's all done, you basically have a well-structured question you could pass to an underling and have them completely independently work on the project without bugging you. That's the goal. Now, pass that to o1 or Claude, depending on whether it's a general-purpose task (o1) or a code-specific task (Claude), and wait for response. From there, have a conversation or test-and-followup of whatever it spits out, this time with you asking questions. If good enough, done. If not, wrap up whatever useful insights from that line of questioning and put it back into the initial prompt and either re-post it at the end of the conversation or start a fresh conversation.
I find 90% of the time this gets exactly what I'm after eventually. The few other cases are usually because we hit some cycle where the AI doesn't fully know what to change/respond, and it keeps repeating itself when I ask. The trick then is to ask things a different way or emphasize something new. This is usually just a code-specific issue, for general problems it's much better. One other trick is to ask it to take a step back and just tackle the problem in a theoretical/philosophical way first before trying to do any coding or practical solving, and then do that in a second phase (asking o1 to architect code structure and then Claude to implement it is a great combo too). Also if there is any way to break up the problem into smaller pieces which can be tackled one conversation at a time - much better. Just remember to include all relevant context it needs to interface with the overall problem too.
That sounds like a lot, but it's essentially just project management and delegation to somewhat-flawed underlings. The upside is instead of waiting a workweek for them to get back to you, you just have to wait 20 seconds. But it does mean a ton of reading and writing. There are certainly already some meta-prompts where you can get the AI to essentially do this whole process for you and assess itself, but like all automation that means extra ways for things to break too. Let the AI devs cook though and those will be a lot more commonplace soon enough...
[Edit: o1 mostly agrees lol. Some good additional suggestions for systematizing this: https://chatgpt.com/share/6775b85c-97c4-8003-bd31-ee288396ab... ]
Do you know if any of the ideas from that project have crossed over into LLM world yet?
>LLM prices crashed
This one has me a little spooked. The white knight on this front (DS) has both announced increases and has had staff poached. There is still Gemini free tier which is ofc basically impossible to beat (solid & functionally unlimited/free) but it's google so reluctant to trust.
Seriously worried about seeing a regression on pricing in first half of 2025. Especially with the OAI $200 price anchoring.
>“Agents” still haven’t really happened yet
Think that's largely because it's a poorly defined concept and true "agent" implies some sort of pseudo-agi autonomy. This is a definition/expectation issue rather than technical in my mind
>LLMs somehow got even harder to use
I don't think that's 100%. An explosion of options is not equal to harder to use. And the guidance for noobs is still pretty much the same as always (llama.cpp or one of the common frontends like text-generation-webui). It's become harder to tell what is good, but not to get going.
----
One key theme I think is missing is just how hard it has become to tell what is "good" for the average user. There is so much benchmark shenanigans going on that it's just impossible to tell. I'm literally at the "I'm just going to build my own testing framework" stage. Not because I can do better technically (I can't)...but because I can gear it towards things I care about and I can be confident my DIY sample hasn't been gamed.
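A DIY testing framework of the kind described above can start very small. The sketch below is one possible shape, under loud assumptions: `run_eval`, the `Case` type, and `fake_model` are all hypothetical names, and the model is just any callable from prompt to text so that no particular vendor API is implied.

```python
from typing import Callable, Iterable, Tuple

# A case is a prompt plus a check that decides whether the answer is acceptable.
Case = Tuple[str, Callable[[str], bool]]


def run_eval(model: Callable[[str], str], cases: Iterable[Case]) -> float:
    """Run every private test case against a model and return the pass rate."""
    results = []
    for prompt, check in cases:
        try:
            passed = bool(check(model(prompt)))
        except Exception:
            # A crash in the model call or the check counts as a failure.
            passed = False
        results.append(passed)
    return sum(results) / len(results) if results else 0.0


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; it just upper-cases the prompt."""
    return prompt.upper()


if __name__ == "__main__":
    my_cases = [
        ("abc", lambda out: out == "ABC"),
        ("hello", lambda out: "hello" in out),
    ]
    print(f"pass rate: {run_eval(fake_model, my_cases):.0%}")
```

Keeping the cases private and specific to your own workload is the whole value here: the checks measure what you care about, and the sample cannot have leaked into anyone's training or tuning set.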
These companies are incentivized to figure out fast and efficient hosting for the models. They don't need to train any models themselves, their value is added entirely in continuing to drive the price of inference down.
Groq and Cerebras are particularly interesting here because WOW they serve Llama fast.
Is it free free? The last time I checked there was a daily request limit, still generous but limiting for some use cases. Isn't it still the case?
That's an indication that most business-sized models won't need some giant data center. This is going to be a cheap technology most of the time. OpenAI is thus way overvalued.
This means that the definitions of "laptop" and "server" are dependent on use. We should instead talk about RAM, GPU and CPU speed which is more useful and informative but less engaging than "my laptop".
However, it has been clear for a long time that meta are just demolishing any competitor's moats, driving the whole megacorp AI competition to razor thin margins.
It's a very welcome strategy from a consumer pov, but -- it has to be said -- genius from a business pov. By deciding that no one will win, it can prevent anyone leapfrogging them at a relatively cheap price.
That is of course, assuming AGI is possible and exponential, and that marketshare goes to a single entity instead of a set of entities. Lots of big assumptions. Seems like we're heading towards a slow-lackluster singularity though.
It's the simple fact that the ability of assets to generate wealth has far outstripped the ability of individuals to earn money by working.
Somehow real estate has become so expensive everywhere that owning a shitty apartment is impossible for the vast majority.
When the world's population was exploding during the 20th century, housing prices were not a problem, yet somehow nowadays, it's impossible to build affordable housing to bring the prices down, though the population is stagnant or growing slowly.
A company can be worth $1B if someone invests $10m in it for 1% stake - where did the remaining $990m come from? Likewise, the stock market is full of trillion-dollar companies whose valuations beggar all explanation, considering the sizes of the markets they are serving.
The rich elites are using the wealth to control access to basic human needs (namely housing and healthcare) to squeeze the working population for every drop of money. Every wealth metric shows the 1% and the 1% of the 1% control successively larger portions of the economic pie. At this point money is ceasing to be a proxy for value and is becoming a tool for population control.
And the weird thing is it didn't use to be nearly this bad even a decade ago, and we can only guess how bad it will get in a decade, AGI or not.
Anyway, I don't want to turn this into a fully-written manifesto, but I have trouble expressing these ideas in a concise manner.
Approximately 2/3s of homes in the US are owner occupied.
Approximately 2/3rds of Australians live in an owner-occupied home.
In Canada, the population is still growing at a fairly impressive rate (https://www.macrotrends.net/global-metrics/countries/CAN/can...), and that growth tends to concentrate in major population centres. There are advocacy groups that seek to push Canadian population growth well above UN projections (e.g. the https://en.wikipedia.org/wiki/Century_Initiative "aims to increase Canada's population to 100 million by 2100") through immigration. In Japan, where the population is declining, housing prices are not anything like the problem we observe in North America.
There's also the supply side. "Impossible to build affordable housing" is in many cases a consequence of zoning restrictions. (Economists also hold very strongly that rent control doesn't work - see e.g. https://www.brookings.edu/articles/what-does-economic-eviden... and https://www.nmhc.org/research-insight/research-notes/2023/re... ; real "affordable housing" is just the effect of more housing.)
That's to be expected when governments forbid people from building housing. The only thing I find surprising is when people blame this on "capitalism".
The last 5 years have reflected a substantial decline in QOL in the states; you don't even have to look back that far.
The coronacircus money-printing really accelerated the decline.
That's if AGI is possible and not easily replicated. If AGI can be copied and/or re-developed like other software then the value of owning OpenAI stock is more like owning stock in copper producers or other commodity sector companies. (It might even be a poorer investment. Even AGI can't create copper atoms, so owners of real physical resources could be in a better position in a post-human-labor world.)
https://en.wikipedia.org/wiki/Cosmological_lithium_problem
[1] https://en.wikipedia.org/wiki/Siemens#1847_to_1901
Nothing is truly exponential for long, but the logistic curve could be big enough to do almost anything if you get imaginative. Without new physics, there are still some places where we can do some amazing things with the equivalent of several trillion dollars of applied R&D, which AGI gets you.
It astounds me that people don't realize how much of this cutting edge science stuff literally does NOT happen overnight, and not even close to that; typically it takes on the order of decades!
My point being that even if Science ends today, we still have a lot more engineering we can benefit from.
The big problem with LLMs is that most of the time they act smart, and some of the time they do really, really dumb things and don't notice. It's not the ceiling that's the problem. It's the floor. Which is why, as the article points out, "agents" aren't very useful yet. You can't trust them to not screw up big-time.
What does this mean in terms of making me coffee or building houses?
Rinse and repeat.
That is exponential take off.
At the point where you have an army of AIs running at 1000x human speed, you can just ask it to design the mechanisms for, and write the code to make, robots that automate any possible physical task.
We also have people brilliant enough to maybe solve the AGI problem or cause our extinction. Some are amoral. Many mechanisms pushed human intelligences in other directions. They probably will for our AGI’s assuming we even give them all the power unchecked. Why are they so worried the intelligent agents will not likewise be misdirected or restrained?
What smart, resourceful humans have done (and not done) is a good, starting point for what AGI would do. At best, they’ll probably help optimize some chips and LLM runtimes. Patent minefields with sub-28nm design, especially mask-making, will keep unit volumes of true AGI’s much lower at higher prices than systems driven by low-paid workers with some automation.
Not if you remember to count all the computations being done by the quintillions of nanobots across the world known as "human cells."
That's not only inside cells, and not just neurons either. For example, your thymus is busy brute-forcing the impossibly large space of antibody combinations, and putting every candidate cell-release through a very rigorous set of acceptance tests.
The guy running Anthropic thinks the future is in biotech, developing the cure to all diseases, eternal youth etc.
Which is technology all right, but it's unclear to me how these chatbots (or other AI systems) are the quickest way to get there.
I heard people on HN saying this (even without the money condition) and I fail to grasp the reasoning behind it. Suppose in a few years Altman announces a model, say o11, that is supposedly AGI, and in several benchmarks it hits over 90%. I don't believe it's possible with LLMs because of their inherent limitations but let's assume it can solve general tasks in a way similar to an average human.
Now, how come that "the entire human economy stops making sense"? In order to eat, we need farmers, we need construction workers, shops etc. As for white collar workers, you will need a whole range of people to maintain and further develop this AGI. So IMHO the opposite is true: the human economy will work exactly as before but the job market will continue to evolve with people using AGI in a similar way that they use LLMs now but probably with greater confidence. (Or not.)
IMO we’re going to hit the point where AI can work on designing automation to replace physical labor before we hit true AGI, much like we’re seeing with coding.
I don't see how OpenAI wouldn't crash and burn here. Given the history of models it would be at most a year before you'd have open AGI, then the horse is out of the barn and the horse begins to self-improve. Pretty soon the horse is a unicorn, then it's a Satyr, and so on.
(I am a near-term AGI skeptic BTW, but I could be wrong.)
OpenAI's valuation is a mixture of hype speculation and the "golden boy" cult around Sam Altman. In the latter sense it's similar to the golden boy cults around Elon Musk and (politically) Donald Trump. To some extent these cults work because they are self-fulfilling feedback loops: these people raise tons of capital (economic or political) because everyone knows they're going to raise tons of capital so they raise tons of capital.
People are buying shares at $x because they believe they will be able to sell them for more later. I don’t think there’s a whole lot more to it than that.
OpenAI predicts more revenue from ChatGPT than api access through 2029.
It’s the old Netflix / HBO trope of which can become the other first: hbo figure out streaming or Netflix figure out original programming.
I bet Google will figure this out and thus OpenAI won’t disrupt as much as people think it will.
Tangential: So how is that race going, has either taken a commanding lead? (Or, hey, is it over already; has either of them won and the other lost? (Yeah, guess if I'm very well-informed on that industry or not...))
So take the entire economy and ask the question: what does AI not impact? Net that out and assume there’s pricing efficiencies, then build in a risk buffer.
1.5t to 15t seems right.
The non-skeptical interpretation is that it's a threshold function, a flat-out race with an unambiguous finish line. If someone actually hit self-improving AGI first there's an argument that no one would ever catch up.
What matters is how you use the AGI, not how much you have, with wrong or bad or limiting regulations it will not lead anywhere.
They run on a laptop, yes - you might squeeze up to 10 token/sec out of a kinda sorta GPT-4 if you paid $5K plus for an Apple laptop in the last 18 months.
And that's after you spent 2 minutes watching 1000 token* prompt prefill at 10 tokens/sec.
Usually it'd be obvious this'd trickle down, things always do, right?
But...Apple infamously has been stuck on 8GB of RAM in even $1500 base models for years. I have 0 idea why, but my intuition is RAM was ~doubling capacity at same cost every 3 years till early 2010s, then it mostly stalled out post 2015.
And regardless of any of the above, this absolutely melts your battery. Like, your 16 hr battery life becomes 40 minutes, no exaggeration.
I don't know why prefill (loading in your prompt) is so slow for local LLMs, but it is. I assume if you have a bunch of servers there's some caching you can do that works across all prompts.
I expect the local LLM community to be roughly the same size it is today 5 years from now.
* ~3 pages / ~750 words; what I expect is a conservative average for prompt size when coding
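The prefill arithmetic above can be made explicit with a tiny helper. This is only a back-of-the-envelope sketch: `response_time_seconds` is a hypothetical name, the 1000-token prompt and 10 tokens/sec figures come from the comment, and the 300-token answer length is an assumption added for illustration.

```python
def response_time_seconds(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tps: float,
    decode_tps: float,
) -> float:
    """Estimate wall-clock time: prompt prefill plus token-by-token decoding."""
    prefill = prompt_tokens / prefill_tps
    decode = output_tokens / decode_tps
    return prefill + decode


if __name__ == "__main__":
    # Figures from the comment: 1000-token prompt, 10 tokens/sec for both phases.
    prefill_only = response_time_seconds(1000, 0, 10, 10)
    # Assumed 300-token answer, purely for illustration.
    with_answer = response_time_seconds(1000, 300, 10, 10)
    print(f"prefill only: {prefill_only:.0f} s ({prefill_only / 60:.1f} min)")
    print(f"with a 300-token answer: {with_answer:.0f} s")
```

At those rates prefill alone is 100 seconds, which is in the same ballpark as the "2 minutes" quoted above, before a single output token appears.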
second-state/llama-2-7b-chat-gguf net me around ~35 tok/sec
lmstudio-community/granite-3.1.-8b-instruct-GGUF - ~50 tok/sec
MBP M3 Max, 64g. - $3k
#1. It is possible to get an arbitrarily fast tokens/second number, given you can pick model size.
#2. Llama 1B is roughly GPT-4.
#3. Given Llama 1B runs at 100 tokens/sec, and given performance at a given model size has continued to improve over the past 2 years, we can assume there will eventually be a GPT-4 quality model at 1B.
On my end:
#1. Agreed.
#2. Vehemently disagree.
#3. TL;DR: I don't expect that, at least, the trend line isn't steep enough for me to expect that in the next decade.
Most web servers can run some number of QPS on a developer laptop, but AWS is a big business, because there are a heck of a lot of QPS across all the servers.
Consumer GPUs top out at 24 GB VRAM.
For example, how close does it get to the peak, and what's the median bandwidth during inference? And is that bandwidth, rather than some other clever optimization elsewhere, actually providing the Mac's performance?
Personally, I don't develop HPC stuff on a laptop - I am much more interested in what a modern PC with Intel or AMD and nvidia can do, when maxxed out. But it's certainly interesting to see that some of Apple's arch decisions have worked out well for local LLMs.
Then, several headings later:
> I have it on good authority that neither Google Gemini nor Amazon Nova (two of the least expensive model providers) are running prompts at a loss.
So...which is it?
They're not running at a loss. I'll fix that.
This means that they could make a profit off inference models without the revenue being large enough to pay the energy costs.
If it's the case I don't know. I'm more concerned with getting rid of those corporations altogether since interacting with them is generally forbidden due to the lack of data protection regulations in the US.
This 100%. “Agentic” especially as a buzzword can piss off
My problem is when people use that definition (or any other) without clarifying, because they assume it's THE obvious definition.
The money is still flowing, for now, to subsidize that fiasco but as soon as that starts to slow, even just a bit, things are gonna get bumpy real quick. Super excited about this tech but there are dark storm clouds building on the horizon and absent a major “moat” breakthrough it’s gonna get rough soon.
That’s exactly what happened with rideshare companies. It was an amazing new thing but subsidized in an unsustainable way, then a bunch of companies exited the space when it became a commoditized race to the bottom, and those left let quality slip. Now when you order an Uber a car shows up that smells bad and has wheels about to fall off. The consumer experience was a lot better when Uber was a VC-subsidized bonanza.
I'm pretty sure that's been possible for a while. There was an example where Claude's computer use feature ordered pizza for the dev team through DoorDash: https://x.com/alexalbert__/status/1848777260503077146?lang=e...
I don't think the released version of the feature can do it, but it should be possible with today's tech.
The big challenge is figuring out how to use it. I usually like working at the function level: I figure out the exact function signature I want in Python or JavaScript and then get Claude to implement it for me.
Claude Artifacts are neat too: Claude can build a full HTML+JavaScript UI, and then iterate on it. I use this for interactive UI prototypes and building small tools.
I've published a whole lot of notes on this stuff here: https://simonwillison.net/tags/ai-assisted-programming/
Things that didn’t work 6 months ago do now. Things that don’t work now, who knows…
Or do you actually mean that the same routines and data that didn't work before suddenly work?
Each new model opens up new possibilities for my work. In a year it's gone from sort of useful but I'd rather write a script, to "gets me 90% of the way there with zero shots and 95% with few-shot"
https://www.economist.com/finance-and-economics/2024/08/21/w...
The closest in that collection is "A division of responsibilities between LLMs that results in some sort of flow?" - https://lite.datasette.io/?json=https://gist.github.com/simo...
1. https://github.com/openai/swarm/tree/main
A small number of people with lots of power are essentially deciding to go all in on this technology, presumably because significant gains will mean the long-term reduction of human labor needs, and thus human labor power. As the article mentions, this also comes at huge expenditure and environmental impact, which is already a very important domain in crisis that we've neglected. The whole thing especially becomes laughable when you consider that many people are still using these tools to perform tasks that could be performed with marginally more effort using existing deterministic tools. Instead we are now opting for a computationally more expensive solution that has a higher margin of error.
I get that making technical progress in this area is interesting, but I really think the lower level workers and researchers exploring the space need to be more emphatic about thinking about socioeconomic impact. Some will argue that this is analogous to any other technological change and markets will adjust to account for new tool use, but I am not so sure about this one. If the technology is really as groundbreaking as everyone wants us to believe then logically we might be facing a situation that isn't as easy to adapt to, and I guarantee those with power will not "give a little back" to the disenfranchised masses out of the goodness of their hearts.
This doesn't even raise all the problems these tools create when it comes to establishing coherent viewpoints and truth in ostensibly democratic societies, which is another massive can of worms.
https://www.bnnbloomberg.ca/investing/2024/09/16/ai-boom-is-...
This is definitely extending the runway of O&G at a crisis point in the climate disaster when we’re supposed to be reducing and shutting down these power plants.
Update: clarified the 200 number is in the US. There are far more world wide.
Methane plants are favored in many cases because they can be quickly ramped up and down to handle momentary peaks in demand or spotty supply generated from renewables.
Without knowing more details about those projects it is difficult to make the claim that these plants have anything to do with increased demand due to LLMs, though if anything, they’d just add to base load demands and lead to slower decommissioning of old coal plants like we’ve seen with bitcoin mines.
AI is a red herring. If it wasn’t that it would be EV power demand. If it wasn’t that it would be reshoring of manufacturing. If it wasn’t that it would be population growth from immigration. If it wasn’t that it would be replacing old coal power plants reaching EOL.
Replacing coal with gas is an improvement by the way. It’s around half the CO2 per kWh, sometimes less if you factor in that gas turbines are often more efficient than aging old coal plants.
And delivering methane leaks like a sieve into the atmosphere from all parts of the process.
Sure it’s probably “better than coal,” but not by much. It’s a bit like comparing what’s worse: getting burned by fire or being drowned in acid.
LLMs (and the image, sound, and movie generating models) are more coincidentally power-hogs — people are at least trying to make them better at fixed compute, and lower compute at fixed quality.
Because whether we're using tons of compute to provide value or not doesn't change that we are using tons of compute, and tons of compute requires tons of energy, both for the chips themselves and for the extensive infrastructure that has to be built around them to let them work. And not just electricity: refrigerants, many of which are environmentally questionable themselves, are a big part; hell, just water. Clean, usable water.
If we truly need these data centers, then fine. Then they should be powered by renewable energy, or if they absolutely cannot be, then the costs their nonrenewable energy sources inflict on the biosphere should be priced into their construction and use, and in turn, priced into the tech that is apparently so critical for them to have.
This is like, a basic calculus that every grown person makes dozens of times a day: do I need this? And they don't get to distribute the cost of that need, however prescient it may be, on their wider community because they can't afford it otherwise. I don't see why Microsoft should be able to either. If this is truly the tech of the future as it is constantly propped up to be, cool. Then charge a price for it that reflects what it costs to use.
Combined with the increased cost effectiveness of renewables & batteries, & the new build-out of nuclear, it could plausibly speed up the clean energy transition, rather than just disincentivising building out more polluting power plants.
There are two main options for what to do with revenue from a carbon tax. The one that makes the most macroeconomic sense is to use those proceeds to fund subsidies for clean energy roll outs & grid adaptation. You are directly taxing the polluting power grid to fund the construction of a non-polluting power grid. As CO2 emitting industry (and thus carbon tax revenue) declines, we have less required spend on clean energy roll out, so the tax would balance nicely. The downside would be that a carbon tax would increase cost of living and this does nothing about that.
The other option is a disbursement. Give everyone in society a payment directly from the proceeds of the carbon tax. This would offset the regressive aspects of a carbon tax (because that tax would increase consumer costs), and would also act as a sort of auto-stimulus to stop the economy from turning down due to consumption costs increasing. The downside of this is that the clean energy transition happens slower than the above, and that there may be political instability & perverse incentives as people maybe come to rely on this payment that has to go away over the next few decades.
They're both good options. I don't know which is better and I think that's likely something individual countries will probably choose based on their situation. But we do need some sort of way to make those emitting CO2 pay for its negative externalities.
I think the rapidly decreasing costs of renewables and storage are likely to make the transition happen before the political will to get a carbon tax, but if you reckon you can push the right buttons, I encourage you to try it :)
So in a way, it is providing value to someone, whether we like it or not.
Or Drug Cartels. https://www.context.news/digital-rights/how-crypto-helps-lat...
But this is the promise of uncontrollable decentralization providing value, for good or bad?
meanwhile "AI" is used to produce infinity+1 pictures of shrimp jesus and more spam than we've ever known before
and if we're really lucky, it will put us all out of work
I'm curious what people's thoughts are on what the future of LLMs would be like if we severely overshoot our carbon goals. How bad would things have to get for people to stop caring about this technology?
The growth in this technology isn’t outpacing car pollution and O&G extraction… yet, but the growth rate has been enough in recent years to put it on the radar of industries to watch out for.
I hope the compute efficiencies are rapid and more than commensurate with the rate of growth so that we can make progress on our climate targets.
However it seems unlikely to me.
It’s been a year of progress for the tech… but also a lot of setbacks for the rest of the world. I’m fairly certain we don’t need AGI to tell us how to cope with the climate crisis; we already have the answer for that.
Although if the industry does continue to grow and the efficiency gains aren’t enough… will society/investors be willing to scale back growth in order to meet climate targets (assuming that AI becomes a large enough segment of global emissions to warrant reductions)?
Interesting times for the field.
Nowt, owt, -- nothing, anything
We have all silently started to notice slop; hopefully we can learn to recognize it more easily and prevent it.
Test Driven Development (Integration Tests or functional tests specifically) for Prompt Driven Development seems like the way to go.
Thank you, Simon.
"""
LLMs need better criticism # A lot of people absolutely hate this stuff. In some of the spaces I hang out (Mastodon, Bluesky, Lobste.rs, even Hacker News on occasion) even suggesting that “LLMs are useful” can be enough to kick off a huge fight.
I like people who are skeptical of this stuff. The hype has been deafening for more than two years now, and there are enormous quantities of snake oil and misinformation out there. A lot of very bad decisions are being made based on that hype. Being critical is a virtue.
If we want people with decision-making authority to make good decisions about how to apply these tools we first need to acknowledge that there ARE good applications, and then help explain how to put those into practice while avoiding the many unintuitive traps.
"""
LLMs are here to stay, and there is a need for more thoughtful critique rather than just "LLMs are all slop, I'll never use it" comments.
The signal-to-noise ratio just goes completely out of control.
https://journal.everypixel.com/ai-image-statistics
One reason is that it's cheaper to use AI, even if the result is poor. It doesn't have to be high quality, because most of the time we don't care about quality unless something interests us. I wonder what kind of shift in power dynamics will occur, but so far it looks like many of us will simply lose our jobs. There's no UBI (or social credit as proposed by Douglas), salaries are low and not everyone lives in a good location, yet corporations try to enforce RTO. Some will simply get fired and won't be able to find a new job (which won't be sustainable for a personal budget, unless someone already has a low cost of living and is debt-free, or has a somewhat wealthy family that will cover for them).
Well, maybe at least the government will protect us? Low chance: the world is shifting right and it will get worse once we start to experience more and more of the results of global warming. I don't see a scenario where the world becomes a better place in the foreseeable future. We're trapped in a society of achievement, but soon we may not be able to deliver achievements, because if a business can get similar results for a fraction of the price (of hiring human workers), then guess what will happen?
These are sad times, full of depression and suffering. I hope that some huge transformation in societies will happen soon or that AI development slows down, so that some future generation will have to deal with consequences (people will prioritize saving their own and it won't be pretty, so it's better to just pass it down like debt).
https://en.wikipedia.org/wiki/The_Human_Use_of_Human_Beings
https://en.wikipedia.org/wiki/Inventing_the_Future:_Postcapi...
https://en.wikipedia.org/wiki/The_Right_to_Be_Lazy
https://en.wikipedia.org/wiki/In_Praise_of_Idleness_and_Othe... (That's Bertrand Russell)
https://en.wikipedia.org/wiki/The_Abolition_of_Work
https://en.wikipedia.org/wiki/The_Society_of_the_Spectacle
https://en.wikipedia.org/wiki/Bonjour_paresse
AI systems are literally the most amazing technology on earth for this exact reason. I am so glad that it is destroying the minds of time thieves world-wide!
Capital --> capitalist, capitalism.
Commune --> communist, communism.
Ned Ludd --> Luddite, Luddism.
Not "capitalistism" or "communistism", so not "ludditism" either.
Yup, English may be the most inconsistent of languages. When I was a kid, we used to blame French for being "just exceptions to rules, exceptions to exceptions, and exceptions to those exceptions!", but with a few decades of perspective... Nope, English is far worse.
These are the people who regulate and legislate for us; they are the risk-averse fools who would rather things be nice and harmless lest they be bad but work.
Personally, I think my only serious ideology in this area is that I am fundamentally biased towards the power of human agency. I'd rather not need to, but in a (perhaps) Nietzschean sense I view so-called AI as a force multiplier to totally avoid the above people.
AI will enable the creative to be more concrete, and drag those on the other end of the scale towards the normie mean. This is of great relevance to the developing world too: AI may end up a tool for enforcing western culture upon the rest of the world, but perhaps also a force decorrelating it from the McKinseys of tall buildings in big cities.
I suspect people don't particularly hate or despise LLMs per se. They're probably reacting mostly to "tech industry" boom-bust bullsh*tter/guru culture. Especially since the cycles seem to burn increasingly hotter and brighter the less actual, practical value they provide. Which is supremely annoying when the second-order effect is having all the oxygen (e.g. capital) sucked out of the room for pretty much anything else.
Step 2: write a slack style message as if you are discussing the solution with a teammate that you have authority over as a delegate to get shit done & to revise as needed.
Step 3: press enter, LLM does something you don't like, delete history, fix prompt in step 2 and ask again, rinse and repeat until you have working code.
Step 4: ask for the changes to be written as a bash file that cat EOF all the files that change into place, run the script.
Step 5: git diff & play test the changes using functional testing (use your mouse & keyboard test the code paths that changed...)
Step 6: continue prompting & deleting history as needed to refine.
Step 7: commit code to repos
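Step 4 is the one that tends to confuse people, so here is a rough sketch of what the "bash file that cat EOF's files into place" could look like if you generated it yourself. This is only an illustration: `build_apply_script`, the `EOF` marker default, and the `mkdir -p` line are my assumptions, not anything the workflow above prescribes.

```python
import shlex


def build_apply_script(files: dict[str, str], marker: str = "EOF") -> str:
    """Return a bash script that writes each file in place via a heredoc.

    `files` maps a relative path to the full new contents of that file.
    The function name and script layout are illustrative assumptions.
    """
    lines = ["#!/usr/bin/env bash", "set -euo pipefail", ""]
    for path, content in files.items():
        # A line equal to the marker would end the heredoc early.
        if any(line == marker for line in content.splitlines()):
            raise ValueError(f"{path} contains the heredoc marker {marker!r}")
        quoted = shlex.quote(path)
        lines.append(f'mkdir -p "$(dirname {quoted})"')
        # Quoting the marker stops bash from expanding $variables in the body.
        lines.append(f"cat > {quoted} <<'{marker}'")
        lines.append(content.rstrip("\n"))
        lines.append(marker)
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_apply_script({"src/app.py": "print('hi')\n"}))
```

Having every changed file arrive as one script is what makes step 5 cheap: run it, then `git diff` shows exactly what the model touched.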
Here is my resume. Make it look nice (some design hints).
They can spit out HTML and CSS, but not a Google Doc.
On the other hand, Google results are dominated by SEO spam. You can probably find one usable result on page 10.
The problem is not technology. It's a business model that can support the humans feeding data into the LLM.
Google doc + PDF is likely the most commonly used combination based on what I see in the SEO spam.
Some of them make you watch ads and then allow you to download something that looks like a doc, but you'll find out soon that you downloaded a ppt with an image that you can't edit.
Wow. At this stage, I think people are just searching for excuses to complain about anything that the LLM does NOT do.
If a multi-modal LLM can read a 100 page PDF and answer questions about it or replace a median white collar worker, this should be a relatively trivial task. Suggest some nice fonts, backgrounds and give me something that I can lightly edit and generate a PDF from.
There were a few interesting papers - the Anthropic one about alignment faking https://www.anthropic.com/news/alignment-faking and the OpenAI o1 system card https://simonwillison.net/2024/Dec/5/openai-o1-system-card/ - and OpenAI continued to push their "instruction hierarchy" idea, any other big moments?
I'll be honest, I don't follow that side of things very closely (outside of complaining that prompt injection still isn't fixed yet).
https://daringfireball.net/2024/12/openai_unimaginable
OpenAI’s board now stating “We once again need to raise more capital than we’d imagined” less than three months after raising another $6.6 billion at a valuation of $157 billion sounds alarmingly like a Ponzi scheme — an argument akin to “Trust us, we can maintain our lead, and all it will take is a never-ending stream of infinite investment.”
This just doesn't hold true for OpenAI.
Anyone who bought in at the ground floor is now rich. Anyone who buys in now is incentivized to try and keep getting more people to buy in so their investment will give a return regardless of if actual value is being created.
The money being invested does not go directly to investors.
It goes to the cost of R&D, which in turn increases the value of OpenAI shares; the early investors can then sell those shares to realize those gains.
The difference between that and a ponzi is that the investment creates value which is reflected in the share price.
No value is created in a Ponzi scheme.
The actual dollar worth of the value generated is what people speculate on.
I do agree it’s a very very thin line.
Aha: So if my future line of Covid Cancer Candy takes off even faster, there's "value" in that, too?
What kind of value, exactly? Does the value of being "the fastest growing product of all time" not at all depend on what kind of product it is?
Yeah, true, not exactly a Ponzi scheme: This has even fewer redeeming qualities.
[1]: Only indirectly, by selling off their investment to that next sucker.
Using this as an opportunity to grind an axe (not your fault, cactusfrog!): I find it clearer when people write "not every X is a Y" than "every X is not a Y", which could be (and would be, literally) interpreted to mean the same thing as "no X is a Y".
Whether they offer the best model or not may not matter if you need a PhD in <subject> to differentiate the response quality between LLMs.
In my limited tests (primarily code) nothing from Llama or Gemini has come close; Claude I’m not so sure about.
I have been bashing my head against the wall over the course of the past few days trying to create my (quite complex) dream app.
Most of the LLM coding I've done involved writing code to interface with already existing libs or services, and the LLMs are great at that.
I'm hung up on architecture questions that are unique to my app and definitely not something you can google.
For example if someone just takes random information about a topic, organizes it in chronological order and adds empty opinions and preferences to it and does that for years on end - what do you call that?
> LLM generated content need to be verified.
There maybe should be a bright red flashing disclaimer at this point.
Having Slop generations from an LLM is a choice. There are so many tricks to make models genuinely creative just at the sampler level alone.
https://github.com/sam-paech/antislop-sampler
https://openreview.net/forum?id=FBkpCyujtS
You're not seeing how the future of the world will develop.
Some people might like slop.
Slop is over-representation of model's stereotypes and lack of prediction variety in cases that need it. Modern models are insufficiently random when it's required. It's not just specific words or idioms, it's concepts on very different abstraction levels, from words to sentence patterns to entire literary devices. You can't fix issues that appear on the latent level by working with tokens. The antislop link you give seems particularly misguided, trying to solve an NLP task programmatically.
Research like [1] suggests algorithms like PPO as one of the possible culprits in the lack of variety, as they can filter out entire token trajectories. Another possible reason is training on outputs from the previous models and insufficient filtering of web scraping results.
And of course, prediction variety != creativity, although it's certainly a factor. Creativity is an ill-defined term like many in these discussions.
[1] https://arxiv.org/abs/2406.05587
DRY does in fact solve repetition issues. You're not using the right settings with it. Set the penalty sky high like 5+. Yes that means you're going to have to modify the ui_paramas in oobabooga cus they have stupid defaults on what limits you can set the knobs to.
There's several other excellent samplers which deserve high ranking papers and will get them in due time. Constrained beam search, tfs (oldie but goodie), mirostat, typicality, top_a, top-n0, and more coming soon. Don't count out sampler work. It's the next frontier and the least well appreciated.
Also, contrastive search is pretty great. Activation/attention engineering is pretty great, and models can in fact be made to choose their own sampling/decoding settings, even on the fly. We haven't even touched on the value of constrained/structured decoding. You'll probably link a similarly bad paper to the previous one claiming that this too harms creativity. Good thing that folks who actually know what they're doing, i.e. the developers of outlines, pre-bunked that paper already for me: https://blog.dottxt.co/say-what-you-mean.html
I'm so incredibly bullish on AI creativity and I will die on the hill that soon AI systems will be undeniably more creative, and better at extrapolation, than most humans.
I've had PMs believe it can replace all writing of tickets and thinking about the feature, creating completely incomprehensible descriptions and acceptance criteria
I've had Slack messages and emails from people with zero sincerity and classic LLM style and the bs that entails
I've had them totally confidently reply with absolute nonsense about many technical topics
I'm grouchy and already over LLMs
But that doesn’t necessarily reflect the potential of the underlying technology, which is developing rapidly. Websites were goofy and pointless until Amazon came around (or Yahoo or whatever you prefer).
I guess potential isn’t very exciting or interesting on its own.
Recognize what they do well (generate simple code in popular languages) while acknowledging where they are weak (non-trivial algorithms, any novel code situation the LLM hasn't seen before, less popular languages).
As with all things LLM there's a whole lot of undocumented and under appreciated depth to getting decent results.
Code hallucinations are also the least damaging type of hallucinations, because you get fact checking for free: if you run the code and get an error you know there's a problem.
A lot of the time I find pasting that error message back into the LLM gets me a revision that fixes the problem.
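For anyone who wants that loop spelled out, a minimal sketch, assuming you are running the generated code somewhere disposable (`exec` on model output is not something to do on a machine you care about). The function names and the prompt wording are made up for illustration.

```python
import traceback
from typing import Optional


def run_and_capture(source: str) -> Optional[str]:
    """Execute `source`; return the traceback text if it raises, else None."""
    try:
        exec(compile(source, "<generated>", "exec"), {})
    except Exception:
        return traceback.format_exc()
    return None


def build_fix_prompt(source: str, error: str) -> str:
    """Assemble a follow-up prompt that pairs the code with its error."""
    return (
        "This code failed when I ran it.\n\n"
        f"Code:\n{source}\n\n"
        f"Error:\n{error}\n"
        "Please return a corrected version."
    )


if __name__ == "__main__":
    code = "print(undefined_name)"
    error = run_and_capture(code)
    if error is not None:
        print(build_fix_prompt(code, error))
```

Note that this only catches failures that actually raise; code that runs cleanly and returns the wrong answer passes straight through.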
This is great when the error is a thrown exception, but less great when the error is a subtle logic bug that only strikes in some subset of cases. For trivial code that only you will ever run this is probably not a big deal—you'll just fix it later when you see it—but for code that must run unattended in business-critical cases it's a totally different story.
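A toy example of the second kind of error, invented for illustration rather than taken from any real review: the buggy version never throws on non-empty input and is right for every odd-length list, so a quick manual check would likely pass it.

```python
def median_buggy(values: list[float]) -> float:
    """Looks plausible and never raises on non-empty input,
    but returns the upper middle value for even-length lists."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def median(values: list[float]) -> float:
    """Average the two middle values when the length is even."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


if __name__ == "__main__":
    print(median_buggy([1, 2, 3, 4]), median([1, 2, 3, 4]))
```

Nothing about running `median_buggy` tells you it is wrong; only a test on an even-length input does.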
I've personally seen a dramatic increase in sloppy logic that looks right coming from previously-reliable programmers as they've adopted LLMs. This isn't an imaginary threat, it's something I now have to actively think about in code reviews.
I work out what the edge cases are by writing and rewriting the code. It's in the process of shaping it that I see where things might go wrong. If an LLM can't do that on its own it isn't of much value for anything complicated.
Where I'm at right now with LLMs is that I find them to be very helpful for greenfield personal projects. Eliminating the blank canvas problem is huge for my productivity on side projects, and they excel at getting projects scaffolded and off the ground.
But as one of the lead engineers working on a million+ line, 10+ year-old codebase, I've yet to see any substantial benefit come from myself or anyone else using LLMs to generate code. For every story where someone found time saved, we have a near miss where flawed code almost made it in or (more commonly) someone eventually deciding it was a waste of time to try because the model just wasn't getting it.
Getting better at manual QA would help, but given the number of times where we just give up in the end I'm not sure that would be worth the trade-off over just discouraging the use of LLMs altogether.
Have you found these things to actually work on large, old codebases given the right context? Or has your success likewise been mostly on small things?
"Here's some example JavaScript code that sends an email through the SendGrid REST API. Write me a python function for sending an email that accepts an email address, subject, path to a Jinja template and a dictionary of template context. It should return true or false for if the email was sent without errors, and log any error messages to stderr"
That prompt is equally effective for a project that's 500 lines or 5,000,000 lines of code.
I also use them for code spelunking - you can pipe quite a lot of code into Gemini and ask questions like "which modules handle incoming API request validation?" - that's why I built https://github.com/simonw/files-to-prompt
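To give a feel for the spelunking idea, here is a minimal sketch of concatenating a source tree into one prompt. It is not how files-to-prompt itself is implemented, and the header format, the `suffixes` filter, and the function name are all assumptions of mine.

```python
from pathlib import Path


def files_to_prompt(root: str, question: str,
                    suffixes: tuple[str, ...] = (".py",)) -> str:
    """Concatenate matching files under `root`, each under a path header,
    and append the question at the end."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            parts.append(f"--- {path.relative_to(root)} ---\n{text}")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(files_to_prompt(".", "which modules handle incoming API request validation?"))
```

Keeping the file path next to each file's contents is the important part: it lets the model answer with module names rather than vague descriptions.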
It's very bad at Factor but pretty good at naming things, sometimes requiring some extra prompting. [generate 25 possible names for this variable...]
1. Stick with popular languages, libraries, etc with lots of blog articles and example code. The pre-training data is more likely to have patterns similar to what you’re building. OpenAI’s were best with Python. C++ was clearly taxing on it.
2. Separate design from coding. Have an AI output a step by step, high-level design for what you’re doing. Look at a few. This used to teach me about interesting libraries if nothing else.
3. Once a design is had, feed it into the model you want to code. I would hand-make the data structures with stub functions. I’d tell it to generate a single function. I made sure it knew what to take in and return. Repeat for each function.
4. For each block of code, ask it to tell you any mistakes in it and generate a correction. It used to hallucinate on this enough that I only did one or two rounds, made sure I hand-changed the code, and sometimes asked for specific classes of error.
5. Incremental changes. You give it the high-level description, a block of code, and ask it to make one change. Generate new code. Rinse repeat. Keep old versions since it will take you down dead ends at times but incremental is best.
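Step 3 might look something like the following partway through. This is a hypothetical skeleton loosely themed on the bloat-stripping proxy idea; `Page`, `strip_bloat`, and `compress` are names invented for the example, not the commenter's actual code.

```python
import gzip
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hand-written data structure; the model is told not to change it."""
    url: str
    html: str
    headers: dict[str, str] = field(default_factory=dict)


def strip_bloat(page: Page) -> Page:
    """Return a copy of `page` with <script> blocks removed."""
    # Still a stub: this is the single function to request next.
    raise NotImplementedError


def compress(page: Page) -> bytes:
    """Return the page HTML as gzip-compressed UTF-8 bytes."""
    # Already filled in by an earlier single-function request.
    return gzip.compress(page.html.encode("utf-8"))
```

The docstrings do the job of "making sure it knows what to take in and return", and the remaining stubs double as a to-do list.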
I used the above to generate a number of utilities. I also made a replacement for the ChatGPT application that used the Davinci API. I also made a web proxy with bloat stripping and compression for browsing from low-bandwidth, mobile devices. Best use of incremental modification was semi-automatically making Python web apps async.
Another quick use for CompSci folks. I’d pull algorithm pseudocode out of papers which claimed to improve on existing methods. I’d ask GPT4 to generate a Python version of it. Then, I’d use the incremental change method to adapt it for a use case. One example, which I didn’t run, was porting a pauseless, concurrent GC.
(Seems every job is fair game according to CTOs. Well, except theirs)
That is at least somewhat a valid point. Good workers know how to get the best out of their tools. And yet, good tools accommodate how their users work, instead of expecting the user to accommodate how the tool works.
One could also say that programmers were sold a misleading bill of goods about how LLMs would work. From what they were told, they shouldn't have to learn how to get the best out of LLMs - LLMs were AI, on the way to AGI, and would just give you everything you needed from a simple prompt.
LLMs are power-user tools. They're nowhere near as easy to use as they look (or as their marketing would have you believe).
Learning to get great results out of them takes a significant amount of work.
Isn't that a bit "You're holding it wrong"? I mean, why isn't that the default; did anyone really think one would mainly want bad results out of them?
well, sometimes - other times it'll be wrong with no error, or insecure, or inaccessible, and so on
The only people pushing that you can BUILD AN APP WITHOUT WRITING A LINE OF CODE are the Twitter AI hypesters. Simon doesn't assert anything of the sort.
LLMs are more-than-sufficient for code snippets and small self-contained apps, but they are indeed far from replacing software engineers.
What models have you tried, and what are you trying to do with them? Give us an example prompt too so we can see how you’re coaxing it so we can rule out skill issue.
And a big strength LLMs have is summarizing things - I’d like to see you summarize the latest 10 arxiv papers relating to prompt engineering and produce a report geared towards non-techies. And do this every 30 mins please. Also produce social media threads with that info. Is this a task you could do yourself, better than LLMs?
Right, but this is the part that is silly and sort of disingenuous and I think built upon a weird understanding of value and productivity.
Doing more constantly isn't inherently valuable. If one human writes a magnificently crafted summary of those papers once and it is promulgated across channels effectively, this is both better and more economical than having an LLM compute one (slightly incorrect) summary for each individual on demand. In fact, all the LLM does in this case is increase the amount of possible lower quality noise in the space. The one edge an LLM might have at this stage is to generate a summary that accounts for more recent information, thereby getting around the inevitable gradual "out of dateness" of human authored summaries at time T, but even then, this is not great if the trade off is to pollute the space with a bunch of ever so slightly different variants of the same text. It's such a weird, warped idea of what productivity is, it's basically the lazy middle-manager's idea of what it means to be productive. We need to remember that not all processes are reducible to their outputs; sometimes the process is the point, not the immediate output (e.g. education).
Being able to summarise multiple articles quicker than a human can read and digest a single one is obviously more productive. I’m not sure why you’re assuming I’m talking about rewriting the papers to produce slightly different variations? It’s a summary. Concerned about the lack of “insight” or something? Then add a workflow that takes the summaries and use your imagination - maybe ask it to find potential applications in completely different fields? You already have comprehensive summaries (or the full papers in a vector db). Am I missing something?
Also the quality of the summary will be linked to the prompts and the way you go about the process (one-shotting the full paper in the prompt, map reduce, semantically chunked summaries, what model you’re using, its context length etc) as well as your RAG setup. I’m still working on my implementation but it’s simple as fuck and pretty decent in giving me, well, summaries of papers.
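For the curious, the "map reduce" option is roughly this. It is a sketch, not my actual implementation: `summarize` stands in for whatever model call you use, and the paragraph-based chunking and 4000-character default are arbitrary assumptions.

```python
from typing import Callable


def chunk_text(text: str, max_chars: int) -> list[str]:
    """Split on blank lines, packing paragraphs into chunks of at most
    `max_chars` characters (a single oversized paragraph stays whole)."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def map_reduce_summary(text: str, summarize: Callable[[str], str],
                       max_chars: int = 4000) -> str:
    """Summarize each chunk (map), then summarize the joined summaries (reduce)."""
    partials = [summarize(chunk) for chunk in chunk_text(text, max_chars)]
    if not partials:
        return ""
    if len(partials) == 1:
        return partials[0]
    return summarize("\n".join(partials))
```

One-shotting the full paper is the degenerate case where everything fits in a single chunk; the reduce step only kicks in once the paper exceeds the chunk size.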
I can’t articulate it well enough but your human curation argument sounds to me like someone dismissing Google because anyone can lie online, and the good old Yellow Pages book can never be wrong.
By multiple rewrites, I meant that, to me, at least, it is silly to spend N compute on producing effectively the same summary on demand for the Mth chatbot user when, in some cases, we could much more economically generate one summary once and make it available via distribution channels--to be fair, that is sort of orthogonal to whether or not the "golden" summary is produced by humans or LLMs. I guess this is more of a critique of the current UX and computational expenditure model.
Yes, my whole point about the process being the point sometimes is precisely about lack of insight. It goes back to Searle's Chinese Room argument. A person in a room with a perfect dictionary and grammar reference can productively translate English texts (input) into Chinese texts (output) just by consulting the dictionary, but we wouldn't claim that this person knows Chinese. Using LLMs for "understanding" is the same. If all you care about is immediate material gain and output, sure, why not, but some of us realize that human beings still move and exist in the world and some of us still appreciate that we need to help fashion those human beings into rational ones that are able to use reason to get along, and aren't codependent on the past N years of the internet to answer any and all questions (the same criticism applies to overreliance on simplistic "answers" from search engines).
>it is silly to spend N compute on producing effectively the same summary on demand for the Mth chatbot user
Why? The compute is there, unused. Why is it silly to use it the way a user wants to? Is your argument more towards our effective use of electrical power across the globe or the quality of the summaries? What if the summaries are produced once and then loaded from some sort of cache - does that make it better in your eyes? I'm trying to understand exactly your point here... please accept my apologies for not being able to understand and please do not take my questions as "gotchas" or anything like that. I genuinely want to know the issue.
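The "produced once and then loaded from some sort of cache" option can be sketched as follows. This is an illustrative in-memory version under stated assumptions: `SummaryCache` is a hypothetical name, the cache key is a SHA-256 hash of the paper text, and `summarize` is any callable that wraps a model.

```python
import hashlib
from typing import Callable, Dict


class SummaryCache:
    """Compute a summary once per distinct text, then serve it from memory."""

    def __init__(self, summarize: Callable[[str], str]) -> None:
        self._summarize = summarize
        self._store: Dict[str, str] = {}
        self.misses = 0  # number of times the underlying model was actually called

    def get(self, paper_text: str) -> str:
        # Identical text always hashes to the same key, so the Mth user
        # asking about the same paper costs a dictionary lookup, not a model call.
        key = hashlib.sha256(paper_text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._summarize(paper_text)
        return self._store[key]
```

The design trade-off is the one debated in this thread: a cached summary is cheap to serve but cannot be tailored to each user's prompt.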
>A person in a room with a perfect dictionary and grammar reference can productively translate English texts (input) into Chinese texts (output) just by consulting the dictionary, but we wouldn't claim that this person knows Chinese.
Agreed, because you can't really know a language just from its words - you need grammar rules, historical/cultural context etc - precisely the kinds of things included in an LLM's training dataset. I'd argue the LLM knows the language better than the human in your example.
Again, I'm not sure how all of this is relevant to using LLMs to summarise long papers? I wouldn't have read them in the first place, because I didn't know they existed, and I don't have time to read them fully. So a summary of the latest papers every day is infinitely better to me than just not knowing in the first place. Now if you want to talk about how LLMs can confidently hallucinate facts or disregard things due to inherent bias in the training datasets, then I'm interested, because those are the things that are stopping me from actually trusting the outputs fully. (Note: I also don't trust human output on the internet, due to the inherent bias within all of us.)
>human beings still move and exist in the world and some of us still appreciate that we need to help fashion those human beings into rational ones that are able to use reason to get along, and aren't codependent on the past N years of the internet to answer any and all questions
Do a simple experiment with the people around you. Ask them about something that happened a few years ago and see if they pull up Google or Wikipedia or whatever. I don't think you realise how few and far between the humans you're talking about are nowadays. Everyone, from teens to pensioners, has been affected by brain rot to some degree, whether it's plain disinformation on Facebook, or sweet nothings from their pastor/imam/rabbi, or inaccurate Google search summaries (which is a valid point against LLMs - I'm also disappointed with how bad their implementation is).
And let's not assume most humans are even capable of being rational when the data in their own brains has been biased and manipulated by institutions and politicians in "democracies".
At least there is one silver lining: your comments are evidence that not everyone has suffered that brain rot, and some of us are still out there using tools critically—thanks for a good conversation on this!
Btw, I apologise again if I came across as blunt or rude in our exchange. Upon reflection, I think you were actually right about me being somewhat emotionally invested in this (albeit due to that sliver of hope that they can be used for good). Peace be with you
I don't mean to nitpick, but how good do you really think the output of this would be? Papers are short and usually have many references. I would expect the LLM to basically miss the important subtleties on every paper it's given, and misunderstand and misattribute any terms of art it encounters.
I mean, of course LLMs are good at summarizing: the summaries are probably mostly sort of good, and anything I'm summarizing I won't read myself. But for technical and specific texts, what's the point when you're getting a "maybe correct" retelling? Best case scenario you get a pretty paragraph that's maybe good for an introduction, and worst case you get incorrect information that misinforms you.
I’m using the summaries as a juicier abstract. I’m not taking them as gospel.
I’m working on following references to then add those papers to a vector db for RAG so it can actually go the step beyond. It’s fun!
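A toy sketch of the retrieval half of that kind of setup, under loud assumptions: `TinyVectorStore` and `embed` are hypothetical names, the "embedding" is just a bag-of-words count rather than a learned model, and everything lives in memory instead of a real vector db.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding (an assumption): lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """In-memory stand-in for a vector db: add texts, search by similarity."""

    def __init__(self) -> None:
        self._docs: List[Tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._docs.append((text, embed(text)))

    def search(self, query: str, k: int = 1) -> List[str]:
        q = embed(query)
        ranked = sorted(self._docs, key=lambda d: cosine(q, d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

The retrieved texts would then be placed in the prompt alongside the question, which is the "augmented generation" part that this sketch leaves out.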
I'm not sure of the value of this. Papers already have abstracts, rewording them using LLMs is just playing with your food. If you're seeing use out of it that's awesome though.
P.S. my script uses local models - no capacity constraints (apart from VRAM!)
In case you're interested, here's a summarized list (thanks, Claude) of the negative/critical things I said about LLMs and the companies that build them in this post: https://gist.github.com/simonw/73f47184879de4c39469fe38dbf35...
This has always been the benchmark: they are not that useful to me. Every time I say this, someone hits me with the "yeah, I bet you haven't tried ShitLLM 4.0-pqr". It's very tiring. Your new LLM hype model is nothing but a marginal, overhyped improvement over something that fundamentally is not intelligent.