This kind of mirrors my experience with LLMs. If I ask them non-original problems (make this API, write this test, update this function that must have been written hundreds of times by developers around the world, etc.), it works very well. Some minor changes here and there, but it saves time.
When I ask them to code things they have never heard of (I am working on an online sport game), it fails catastrophically. The LLM should know the sport, and what I ask is pretty clear for anyone who understands the game (I tested against actual people and it was obvious what to expect), but the LLM failed miserably. Even worse when I ask them to write some designs in CSS for the game. It seems that if you take them outside the 3-column layout or Bootstrap or the overused landing page, LLMs fail miserably.
It works very well for the known cases, but as soon as you want them to do something original, they just can't.
steve_adams_86 18 days ago [-]
This has made me wonder just how few developers have been doing even remotely interesting things. I encounter a lot of situations where the LLMs catastrophically fail, but I don’t think it’s for lack of trying to guide it through to solutions properly or sufficiently. I give as much or as little context as I can think to, give partial solutions or zero solutions, try different LLMs with the same prompts, etc.
Maybe we were mostly actually doing the exact same stuff and these things are trained on a whole lot of the same. Like with react components, these things are amazing at pumping out the most basic varieties of them.
lm28469 18 days ago [-]
Imho the hordes of code monkeys see an incredible improvement in their productivity and keep boasting about it online which is inflating the idea that LLMs are close to agi and revolutionizing the industry
Meanwhile people who work on things that are even slightly niche or complex (i.e. things which aren't answered in 20 different Stack Overflow threads, or things which require a few field "experts" to meet around a table and think hard for a few hours) don't understand the enthusiasm at all
I'm working on fairly simple things, but in a context/industry that isn't that common, and I have to spoon-feed LLMs to get anywhere; most of the time it doesn't even work. The next few years will be interesting, it might play out as it did for plane pilots over-relying on autopilots to the point of losing key skills
theK 17 days ago [-]
Another aspect of this might be what I'm experiencing.
I'm currently increasingly often confronted with overly bloated codebases that, with the added speed of llms / coding assistants, have reached levels of unmaintainability much faster than traditionally possible.
Now, fixing those repos with LLMs fails miserably in my experience, so you have to invest hundreds of hours of senior engineer time to make the projects workable again, regardless of the novelty of the problem solved by that codebase.
It also frequently makes the engineers pulled in to help miserable.
uludag 18 days ago [-]
My thoughts exactly. I honestly think that the majority of SWE work is completely unoriginal.
But then there are those idiosyncrasies that every company has, a finicky build system, strange hacks to get certain things to work, extensive webs of tribal knowledge, that will be the demise of any LLM SWE application besides being a tool for skilled professionals.
bamboozled 18 days ago [-]
The whole world is wrinkly; that's the way it is. Everyone and everything is slightly different.
bugglebeetle 18 days ago [-]
> This has made me wonder just how few developers have been doing even remotely interesting things.
There’s not much to ponder on here as the majority of human thought and action is unoriginal and irrelevant, so software development is not some special exception to this. This doesn’t mean it lacks meaning or value for those doing it, since that can be self-generated, extrinsically motivated, or both. But, by definition, most things done are within the boundaries of a normal distribution and not exceptional, transcendent, or sublime. You can strive for those latter categories, but it’s a path strewn with much failure, misery, and quite questionable rewards, even where it is achieved. Most prefer more achievable comforts or security to it, or, at most, dull-minded imitations thereof, such as amassing great wealth.
SkyBelow 18 days ago [-]
Is it that they aren't doing interesting things, or is it that developers doing interesting things tend to only ask the AI to handle the boring parts? I find I use AI quite a bit, but it is integrated with my own thinking and problem solving every step of the way. I see how others are using it and they often seem to be asking for full solutions, whereas I'm using it more like a rubber duck debugger stuffed with Stack Overflow feedback and a knowledge of any common APIs. It mostly helps by answering things that would have otherwise been web searches, but it doesn't create any solutions from scratch.
josephg 18 days ago [-]
Yeah; I've long argued that software should be divided into two separate professions.
If you think about it, lots of professions are already divided in two. You often have a profession of inventors / designers and a profession of creators. For example: electrical engineers vs electricians. Architects / civil engineers and construction workers. Chefs and line cooks. Composers and session musicians. And so on.
The two professions always need somewhat different skill sets. For example, electricians need business sense and to be personable - while electrical engineers need to understand how to design circuits from scratch. Session musicians need a fantastic memory for music, and to be able to perform in lots of different musical styles. Composers need a deeper understanding of how to write compelling music - but usually only in their preferred style.
I think programming should be divided in the same way. It sort of happens already: There's people who invent novel database storage engines, work on programming language design, write schedulers for operating systems and so on. And then there's people who do consulting and write almost all the software that businesses actually need to function. Eg, people who talk to actual clients and figure out their needs.
The first group invents React. The second group uses React to build online stores.
We sort of have these two professions already. We just don't have language to separate them.
The line is a bit fuzzy, but you can tell they're two professions because each needs a slightly different skill set. If you're writing a novel full-text search engine, you need to be really good at data structures & algorithms, but you don't really need to be good at working with clients. And if you work in frontend react all day, you don't really need to understand how a B-Tree works or understand the difference between LR and recursive descent parsing. (And it doesn't make much sense to ask these sort of programmers "leetcode problems" in job interviews).
ChatGPT is good at application programming. Or really, anything that it's seen lots of times before. But the more novel a problem - the more it's actual "software engineering" - the more it struggles to make working software. "Write a react component that does X" - easy. "Implement a B-tree with this weird quirk" - hard. It'll struggle through, but the result will be buggy. "Implement this new algorithm in a paper I just read" - it's more or less completely useless at that.
> as soon as you want them to do something original, they just can't.
An LLM is, famously, a "stochastic parrot". It has no reasoning. It has no creativity. It has no basis or mechanism for true originality: it is regurgitating what it has seen before, based on probabilities, nothing more. It looks more impressive, but that's all that is behind the curtain.
It surprises me that so many people expect an LLM to do something "clever" or "original". They're compression machines for knowledge, a tool for summarisation, rephrasing, not for creation.
Why is this not that widely known and accepted across the industry yet, and why do so many people have such high expectations?
redlock 18 days ago [-]
Because it isn't true according to Hinton, Sutskever, and other leading AI researchers.
FroshKiller 18 days ago [-]
Most foxes are herbivores according to the fellow guarding my henhouse.
redlock 18 days ago [-]
They do make compelling arguments if you listen to them. Also, from my coding with llms using Cursor, they obviously understand the code and my request more often than not. Mechanistic interpretability research shows evidence of concepts being represented within the layers. Golden Gate Claude is evidence of this:
https://www.anthropic.com/news/golden-gate-claude
To me this proves that llms learn concepts and multilayered representations within their networks, and not just some dumb statistical inference.
Even a famous llm skeptic like Francois Chollet doesn't invoke stochastic parrots anymore and has moved on to arguing that they don't generalize well and are just memorizing.
With GPT-2 and 3 I was of the same opinion, that they seemed like just sophisticated stochastic parrots, but current llms are a different class from early GPTs. Now that o3 has beaten a memorization-resistant benchmark like ARC-AGI, I think we can confidently move on from this stochastic parrots notion.
(And before you argue that o3 is not an llm, here is an openai researcher stating that it is an llm https://x.com/__nmca__/status/1870170101091008860?s=46&t=eTe... )
Neural networks are generalizing things as part of their optimization scheme. The current approach is just to throw many-layered neural networks (at the core, as in "deep learning") at the problems, but the networks are too regular, too "primitive" of sorts. What is needed are network topologies that strongly support this generalization, the creation of "meta abstraction levels"; otherwise it will get nowhere.
Biological networks of more intelligent species contain a few billion neurons and upwards from that while even the big LLMs are somewhere in the millions of "equivalent" at best. So, bad topology + much less "neurons" and the resulting capabilities shouldn't be too surprising. Plus it is clear that AGI is nowhere close, because one result of AGI is a proper understanding of "I". Crows have an understanding of "I", for example.
And that is where these "meta abstraction levels" come in: There are many needed to eventually reach the stage of "I". This can also be used to test how well neural networks perform, how far they can abstract things for real, how many levels of generalization are handled. But therein lies a problem: Let 2 persons specify abstraction levels and the results will be all across the board. This is also why ARC-AGI, while it dives into that, cannot really solve the problem of testing "AI", let alone "AGI": We as humans are currently unable to properly test intelligence in any meaningful way. All the tests we have are mere glimpses into it, and often (complex) multivariable + multi(abstraction)layered tests, and dealing with the results is, consequently, a total mess, even if we throw big formulas at it.
redlock 17 days ago [-]
I don't care much for the concept of AGI since it's ill defined.
However, out-of-distribution generation and generalization are a better, more useful metric. On these, Yann LeCun has argued that interpolation is meaningless in the high-dimensional spaces these "curves" are embedded in. (https://arxiv.org/abs/2110.09485)
ARC has proven they can generalize, and AlphaGo (not an llm but a deep network) has proven it can generate novel/creative solutions. We don't need AI to have a sense of self ("I") for it to beat us at every human skill and activity. In fact, it might be detrimental and not useful for us if AI developed a sense of self.
anonzzzies 18 days ago [-]
I ask llms things the way I would spec them out for my team, which is to say, when I write a task, I would not include things about the sport; I would explain the logic etc. one would need to write. No one needs to know what it is about in the abstract, as I see the same issue with humans: they get confused when using real-world things to see the connection with what they need to do. I mean I don't flesh it out to the extent that I might as well write the code, but enough that you don't need to know anything about the underlying subject.
We are doing a large EU healthcare project and things differ per country obviously; if I would assume people to have a modicum of world knowledge or interest in it and look it up, nothing would get done. It is easier to deliver excel sheets with the proper wording and formulas and leave out what it is even for.
Works fine with LLMs.
Disclaimer: the people in my company know everything about the subject matter; the people (and LLMs) that implement it are usually not ours: we just manage them, and in my experience, programmers at big companies are borderline worthless. Hence, in the past 35 years, I have taken to writing tasks as concretely as I can; we found that otherwise we either get garbage commits or tons of meetings with questions and then garbage commits. And now that comes in handy, as this works much better on LLMs too.
sagarm 13 days ago [-]
Breaking a problem down to the extent you are describing should only have to be done for fresh grads IMO. LLMs are indeed better at tasks broken down to this degree in my experience.
I'd expect more senior engineers to handle increasingly complex/ambiguous projects, while taking the business context into account as necessary.
18 days ago [-]
jppope 18 days ago [-]
Completely agree with this. The problem I'm starting to run into is that I do a lot of things that have been done before, so I get really efficient at doing that stuff with LLMs... and now I'm slow to solve the stuff I used to be really fast at.
Recursing 18 days ago [-]
The article and comments here _really_ underestimate the current state of LLMs (or overestimate how hard AoC 2024 was)
Here's a much better analysis from someone who got 45 stars using LLMs: https://www.reddit.com/r/adventofcode/comments/1hnk1c5/resul...
All the top 5 players on the final leaderboard https://adventofcode.com/2024/leaderboard used LLMs for most of their solutions.
LLMs can solve all days except 12, 15, 17, 21, and 24
michaelt 18 days ago [-]
> All the top 5 players on the final leaderboard [...] used LLMs for most of their solutions.
Note that the leaderboard points are given out on the time taken to produce a correct answer. 100 points for the first to submit an answer, 99 for the second, 98 for third and so on. No points if you're not in the first 100.
So if an LLM fails on 5 problems, but for the other 20 it can take a 600-word problem statement and solve it in 12 seconds? It'll rack up loads of points.
Whereas the best human programmer you know might be able to solve all 25 problems taking around 15 minutes per problem. On most days they would have zero points, as all 100 points finishes go to LLMs.
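To make the arithmetic concrete, here is a rough sketch of that scoring (my own illustration, simplified to a single 100-point award per day; the real leaderboard scores each of the two daily parts separately):

    # Points for one puzzle: 100 for the fastest correct submission,
    # 99 for the second, ..., 1 for the 100th, 0 after that.
    def leaderboard_points(rank: int) -> int:
        return max(0, 101 - rank)

    # An LLM that lands around 10th place on 20 of the 25 days easily
    # outscores a human who solves every day but never cracks the top 100.
    llm_total = 20 * leaderboard_points(10)     # 20 days at ~10th place
    human_total = 25 * leaderboard_points(500)  # 25 days, outside the top 100
    print(llm_total, human_total)               # 1820 vs 0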
jerpint 18 days ago [-]
Author here: the point of the article was only to evaluate zero-shot capabilities. I’m certain that had I used LLMs I would have definitely gotten more stars on AoC (got 41/50 without). Because I chose to solve this year without LLMs, I was simply curious to see how the opposite setup would do, using basically zero human intervention to solve AoC.
That said, if I cared only to produce the best results I would 100% pair up with an LLM
oytis 18 days ago [-]
Why does neither of the articles provide the actual raw chat logs? It's like the recent article about a non-released LLM solving non-public tasks that everyone is supposed to be impressed by
FrustratedMonky 18 days ago [-]
Wonder if different goals.
If the top 5 people on the leaderboard "used LLMs", meaning they used an LLM as a helping tool.
But the article, I think, is asking: what if the LLM played by itself? Just paste the questions in and see if it can do it all on its own?
Perhaps that is the difference, different goals.
jonathan_landy 18 days ago [-]
Seems to depend strongly on the model, perhaps. The Reddit post says
“Some other models tested that just didn't work: gpt-4o, gpt-o1, qwen qwq.”
Notably gpt-4o was used in the post linked here.
jebarker 18 days ago [-]
I don't know what they were doing, but I tried o1 with many problems after I solved them already and it did great. No special prompting, just "solve this problem with a python program".
Because the timings were not possible for humans: problems were solved in under 2 minutes. I've experimented with LLMs for a few days too; it yielded the same results.
WorldWideWebb 17 days ago [-]
Am I old and grumpy or does that kinda go against the whole point of AoC? A daily challenge for your brain, not your prompt writing abilities.
383toast 17 days ago [-]
Yep, AoC explicitly discourages using LLMs for leaderboard scoring
poincaredisk 17 days ago [-]
That's humanity for you. We automate thinking out of our lives so we can doomscroll unbothered.
nindalf 18 days ago [-]
This needs to be higher. Not only because it shows that LLMs can do better than what OP says, but also that there’s some difference in how they’re used. Clearly the person on reddit was able to use them more effectively.
danielbln 18 days ago [-]
That's the crux with most discussions on here regarding LLMs.
"I used gpt-4o with zero shot prompts and it failed terribly!"
"I used Claude/o1/o3, I fed various bits of information into the context and carefully guided the LLM"
Those two approaches (there are many more) would lead to very different results, yet all we read here are comments giving their opinions on "LLMs", as if there is only one LLM and one way of working with them.
stone->your ingredients->soup
LLM->your prompts->solution
After looking at the charts I was like "Whoa, damn, that Jerpint model seems amazing. Where do I get that??" I spent some time trying to find it on Huggingface before I realized...
j45 18 days ago [-]
Me too.
The author could make a model on huggingface routing requests to him. Latency might vary.
tbagman 18 days ago [-]
The good news is that training jerpint was probably cheaper than training the latest GPT models...
foldl2022 18 days ago [-]
Just open-weights it. We need this. :)
senordevnyc 18 days ago [-]
lol, I did the same thing
bryan0 18 days ago [-]
Since you did not give the models a chance to test their code and correct any mistakes, I think a more accurate comparison would be if you compared them against you submitting answers without testing (or even running!) your code first
angarg12 18 days ago [-]
People keep evaluating LLMs on essentially zero-shotting a perfect solution to a coding problem.
Once we use tools to easily iterate on code (e.g. generate, compile, test, use outcome to refine prompt) we will turbocharge LLMs coding abilities.
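For example, something like the following loop (a hypothetical sketch; `ask_llm` is a stand-in for whatever model API you would actually call, not a real library function):

    import subprocess
    import tempfile

    def ask_llm(prompt: str) -> str:
        """Placeholder for a real model call (OpenAI, Anthropic, a local model, ...)."""
        raise NotImplementedError

    def solve_with_feedback(problem: str, sample_input: str, expected: str,
                            max_rounds: int = 5) -> str | None:
        prompt = ("Write a Python program that reads the puzzle input from stdin "
                  "and prints the answer.\n\n" + problem)
        for _ in range(max_rounds):
            code = ask_llm(prompt)
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(code)
            try:
                result = subprocess.run(["python", f.name], input=sample_input,
                                        capture_output=True, text=True, timeout=300)
            except subprocess.TimeoutExpired:
                prompt += "\n\nYour previous program timed out; write a more efficient solution."
                continue
            if result.stdout.strip() == expected.strip():
                return code  # passes the sample case; worth trying on the real input
            # Feed the failure back into the next prompt instead of giving up.
            prompt += ("\n\nYour previous program printed:\n" + result.stdout +
                       "\nstderr:\n" + result.stderr + "\nPlease fix it.")
        return None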
xen0 18 days ago [-]
I'm a bit curious to see how close the solutions were.
When it couldn't solve it within the constraints, was it 'broadly correct' with some bugs? Was it on the right track or completely off?
jerpint 18 days ago [-]
The code they generated and the outputs (including traceback errors) are all included; you can view them on the post itself or from the hf space:
https://huggingface.co/spaces/jerpint/advent24-llm
Huh, don't know how I missed that. I was hoping for them to have done some analysis themselves.
Glancing through the code for some of the 'easier' problems from Claude (his best performer), it seems to have broadly correct (if a little strange and overwrought) code.
But my phone is not a good platform for me to do this kind of reading.
jerpint 18 days ago [-]
I should specify that unfortunately I didn’t store the raw output from the LLM, just the parsed code snippets they produced, but the code to reproduce it is all there
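(For anyone wondering what the parsing step looks like: it's typically something along these lines, shown here as an illustrative sketch rather than the exact code in the repo.)

    import re

    def extract_code_blocks(llm_output: str) -> list[str]:
        """Pull fenced code blocks (``` ... ```), with or without a language tag,
        out of a raw LLM response."""
        pattern = r"```(?:[a-zA-Z0-9_+-]*)\s*\n(.*?)```"
        return [m.strip() for m in re.findall(pattern, llm_output, flags=re.DOTALL)]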
rhubarbtree 18 days ago [-]
This smacks of "moving the goalposts" just as much as the other side is accused of when unimpressed by advances.
It's a reasonable test.
zaptheimpaler 18 days ago [-]
I’m adjacent to some people who do AoC competitively, and it’s clear many of the top 10 and maybe half of the top 100 this year were heavily LLM-assisted or wholly done by LLMs in a loop. They won first place on many days. It was disappointing to the community that people cheated and went against the community’s wishes, but it’s clear LLMs can do much better than described here
davidclark 18 days ago [-]
I’d like the same article topic but from the person who did Day 1 pt1 in 4s and pt2 in 5s (9s total time).
rhubarbtree 18 days ago [-]
No, that means it's clear that LLM-assisted coding can do better than described here. Which implies humans are adding a lot.
NitpickLawyer 18 days ago [-]
To be fair, there's probably not a lot that "humans" add when the solution is solved in 4 and 5 seconds respectively. That's clearly 100% automated. Most humans can't even read the problem in that timeframe.
A better implication would be that proper use of LLMs implies more than OP did here (proper prompting, looping the answer w/ a code check, etc.)
zaptheimpaler 17 days ago [-]
I think you're trying to dunk on me without actually knowing much about the matter. You can look at the leaderboard times and the github repos too. They are fully automated to fetch input, prompt LLMs and submit the answer within 10 seconds.
rhubarbtree 15 days ago [-]
Some of them, yes.
unclad5968 18 days ago [-]
Half the time I try to ask Gemini questions about the C++ std library, it fabricates non-existent types and functions. I'm honestly impressed it was able to solve any of the AoC problems.
devjab 18 days ago [-]
I’ve had a side job as an external examiner for CS students for almost a decade now. While LLMs are generally terrible at programming (in my experience), they excel at passing finals. If I were to guess, it’s likely a combination of the relatively simple (or perhaps isolated is a better word) tasks coupled with how many times similar problems have been solved before in the available training data. Somewhat ironically, the easiest way to spot students who “cheat” is when the code is great. Being an external examiner, meaning that I have a full-time job in software development, I personally find it sort of silly when students aren’t allowed to use a calculator. I guess changing the way you teach and test is a massive undertaking though, so right now we just pretend LLMs aren’t being used by basically every student. Luckily I’m not part of the “spot the cheater” process, so I can just judge them based on how well they can explain their code.
Anyway, I’m not at all surprised that they can handle AoC. If anything, I would worry about whether AoC will still be a fun project to author when many people solve it with AI. It sure won’t be fun to curate any form of leaderboard.
uludag 18 days ago [-]
leetcode/AoC-like problems are probably the easiest class of software-related tasks LLMs can do. Using the correct library, the correct way, at the correct version, especially if the library has some level of churn and isn't extremely common, can be a harder task for LLMs than the hardest AoC problems.
grumple 18 days ago [-]
I’m both surprised and not surprised. I’m surprised because these sort of problems with very clear prompts and fairly clear algorithmic requirements are exactly what I’d expect LLMs to perform best at.
But I’m not surprised because I’ve seen them fail on many problems even with lots of prompt engineering and test cases.
yunwal 18 days ago [-]
With no prompt engineering this seems like a weird comparison. I wouldn’t expect anyone to be able to one-shot most of the AOC problems. A fair fight would at least use something like cursor’s agent on YOLO mode that can review a command’s output, add logs, etc
fumeux_fume 18 days ago [-]
I certainly don't think it's weird to measure one-shot performance on the AOC. Sure, more could be done. More can always be done, but this is still useful and interesting.
rhubarbtree 18 days ago [-]
Seems reasonable to me. More people use ChatGPT than cursor, so use it like ChatGPT.
For most coding problems, people won't magically know if the code is "correct" or not, so any one-shot answer that is wrong could be a real hindrance.
I don't have time to prompt engineer for every bit of code. I need tools that accelerate my work, not tools that absorb my time.
segmondy 18 days ago [-]
I zero-shotted 27 solutions successfully with the local model code-qwen2.5-32b. I think adding Sonnet or the latest Gemini will probably get me to 40.
ben_w 18 days ago [-]
If you immediately know the candlelight is fire, then the meal was cooked a long time ago.
And so it is with success of LLMs in one-shot challenges and any job that depends on such challenges: cooked a long time ago.
NitpickLawyer 18 days ago [-]
Indeed. (wild to see a sg-1 quote in the wild!)
18 days ago [-]
bawolff 18 days ago [-]
I'm a bit of an AI skeptic, and I think I had the opposite reaction of the author. Even though this is far from welcoming our AI overlords, I am surprised that they are this good.
jebarker 18 days ago [-]
I'd be interested to know how o1 compares. On many days, after I completed the AoC puzzles, I was putting the questions into o1 and it seemed to do really well.
o1 got 20 out of 25 (or 19 out of 24, depending on how you want to count). Unclear experimental setup (it's not obvious how much it was prompted), but it seems to check out with leaderboard times, where problems solvable with LLMs had clear times flat out impossible for humans.
An agent-type setup using Claude got 14 out of 25 (or, again, 13/24):
https://github.com/JasonSteving99/agent-of-code/tree/main
I have to wonder why o1 didn't work. That post is unfortunately light on details that seem pretty important.
jebarker 18 days ago [-]
I was thinking 20/25 is pretty great! At least 5 of the problems were pretty tricky and easy to fail due to small errors.
moffkalast 18 days ago [-]
At first I was like "What is this jerpint model that's beating the competition so soundly?" then it hit me, lol.
Anyhow this is like night and day compared to last year, and it's impressive that Sonnet is now apparently 50% as good as a professional human at this sort of thing.
zkry 18 days ago [-]
I don't think comparing star counts would be a good measure though, as with AoC 90% of the effort and difficulty goes into the harder problems towards the end, and it was the beginning, easier problems that the bulk of Sonnet's stars came from.
moffkalast 18 days ago [-]
Ah yeah that's true, the difficulty curve is not very linear.
demirbey05 18 days ago [-]
o1 is not included; I think each benchmark should include o1 and reasoning models. The o-series has really changed the game.
airstrike 18 days ago [-]
I like the idea, but I feel like the execution left a bit to be desired.
My gut tells me you can get much better results from the models with better prompting. The whole "You are solving the 2024 advent of code challenge." form of prompting is just adding noise with no real value. Based on my empirical experience, that likely hurts performance instead of helping.
The time limit feels arbitrary and adds nothing to the benchmark. I don't understand why you wouldn't include o1 in the list of models.
There's just a lot here that doesn't feel very scientific about this analysis...
Tiberium 18 days ago [-]
Wanted to try with o1 and o1-mini but looks like there's no code available, although I guess I could just ask 3.5 Sonnet/o1 to make the evaluation suite ;)
jerpint 18 days ago [-]
Author here: all the code to reproduce this is actually on the huggingface space here [1]
[1] https://huggingface.co/spaces/jerpint/advent24-llm/tree/main
Thanks, I'll check with o1-mini and o1 and update this comment :) Also, the code has some small errors, although those can be easily fixed.
bongodongobob 18 days ago [-]
I think a major mistake was giving parts 1 and 2 all at once. I had great results having it solve 1, then 2. I think I got 4o to one shot parts 1 then 2 up to about day 12. It started to struggle a bit after that and I got bored with it at day 18. It did way better than I expected, I don't understand why the author is disappointed. This shit is magic.
antirez 18 days ago [-]
The most important thing is missing from this post: the performance of Jerpint+Claude. It's not a VS game.
guerrilla 18 days ago [-]
How far can LLMs get in Project Euler without messing up?
bru3s 18 days ago [-]
[flagged]
BugsJustFindMe 18 days ago [-]
I think this is a terrible analysis with a weak conclusion.
There's zero mention of how long it took the LLM to write the code vs the human. You have a 300 second runtime limit, but what was your coding time limit? The machine spat out code in, what, a few seconds? And how long did your solutions take to write?
Advent of code problems take me longer to just read than it takes an LLM to have a proposed solution ready for evaluation.
> they didn’t perform nearly as well as I’d expect
Is this a joke, though? A machine takes a problem description written as floridly hyperventilated as advent problems are, and, without any opportunity for automated reanalysis, it understands the exact problem domain, it understands exactly what's being asked, correctly models the solution, and spits out a correct single-shot solution on 20 of them in no time flat, often with substantially better running time than your own solutions, and that's disappointing?
> a lot of the submissions had timeout errors, which means that their solutions might work if asked more explicitly for efficient solutions. However the models should know very well what AoC solutions entail
You made up an arbitrary runtime limit and then kept that limit a secret, and you were surprised when the solutions didn't adhere to the secret limit?
> Finally, some of the submissions raised some Exceptions, which would likely be fixed with a human reviewing this code and asking for changes.
How many of your solutions got the correct answer on the first try without going back and fixing something?
mvdtnz 18 days ago [-]
> it understands the exact problem domain, it understands exactly what's being asked, correctly models the solution
It does not "understand" or "model" shit. Good grief you AI shills need to take a breath.
BugsJustFindMe 18 days ago [-]
I'll use a different word when you demonstrate that humans aren't also stochastic parrots constantly hallucinating reality. Until then, good enough for one is good enough for the other.
segmondy 18 days ago [-]
I did experiment with a local LLM running on my machine; most solutions were generated within 30-60 seconds. My overhead was really copying part 1 of the problem, generating the code, copying and pasting the code, running it, copying the input data, running it, entering the result, and repeating for part 2; for most of them that was about 5 minutes from start to finish. If I automated the process, it would probably be 1 minute or less for the problems it could solve.
Not the OP, but I was able to zero shot 27 solutions correctly without any back and forth, and 5 more with a little bit back and forth. Using local models.
keyle 18 days ago [-]
You raise some good points about the "total of hours spent", but I guess you don't consider training time included. Also, there is no need to quote the author's post and have a go at him personally. There are better ways to get your point across, by arguing the point made and not the sentences written.
BugsJustFindMe 18 days ago [-]
> I guess you don't consider training time included
In the same way that I don't consider the time an author spends writing a book when saying how long it took me to read the book. OP lost zero time training ChatGPT.
Or do you mean how much time OP spent training themself? Because that's a whole new can of worms. How many years should we add to OP's development times because probably they learned how to write code before this exercise?
keyle 18 days ago [-]
I was considering the energy, $, R&D, and time spent to train those models, which is colossal compared to the developer; factor that in and they don't look so impressive. For reference, I'm not anti-LLM, I use Claude and ChatGPT every day. I'm just raising those points if you want to consider all the facts.
lgas 18 days ago [-]
That cost is amortized over all requests made against all copies of the model though. Fortunately we don't have to re-train the model every time we want to perform inference.
bawolff 18 days ago [-]
That's a weird comparison.
Would you also count the number of hours it took for someone to write the OS, the editor, etc? The programmer wouldn't be effective without those things.
menaerus 18 days ago [-]
And now also consider the amount of time, school, university, practice and $$$ it took to train a software engineer. Repeat for the population of presumably ~30 million software engineers in the world. Time-wise it doesn't scale at all when contrasted to new LLM releases that occur ~once a year per company. $$$-wise is also debatable.
johnea 18 days ago [-]
LLMs are writing code for the coming of the lil' baby jesus?
valbaca 18 days ago [-]
adventofcode.com
cheevly 18 days ago [-]
Genuinely terrible prompt, not only in structure; it also contains grammatical errors. I'm confident you could at least double their score if you improved your prompting significantly.
rhubarbtree 18 days ago [-]
This isn't very constructive criticism. You could improve it through a number of levels:
1. You could have pointed out the grammatical errors and explained why they matter.
2. You could have pointed out the structural errors, and explained what structure should have been used.
3. You could have written a new prompt.
4. You could have re-run the experiment with the new prompt.
Otherwise what you say is just unsubstantiated criticism.