20 years ago, one of my professors pointed out during a review that "benchmark is not research." Now I see benchmarks with very basic conclusions literally everywhere. I'm not criticizing this particular work, but such publications are low-effort unless they point to something new.
buildbot 19 days ago [-]
Benchmarks aren’t research, but research is 75% benchmarks of some kind, in my experience at least.
Have to know where you are to know where to go, and once you know where to go and go there, well you have to know where you are!
jszymborski 18 days ago [-]
To paraphrase a wise person, "All benchmarks are wrong, but some are useful".
Benchmarking, like modelling or simulation, is one of many imperfect but useful tools researchers can reach for when trying to better understand something.
Benchmarks help us contextualize things. All benchmarks have biases, and some tell us more than others, but they certainly belong to a class of experimentation that counts as "research". Of course, it's important not to overstate findings. Performing well on any given coding benchmark does not tell us "this LLM is, broadly speaking, great at coding", only that it is great at coding as measured by a narrow test set.
There's also another snake pit to cross in that validation methodology is hard (data leakage, appropriate measures and stats, etc.), but plenty of quality research overcomes that with rigour and patience.
gessha 18 days ago [-]
In a computer vision / ML context: a benchmark by itself is not really interesting, but it is a crucial part of novel research, since you have to compare your contribution to something else.
MPSimmons 18 days ago [-]
It seems useful to have peer reviewed processes for comparing benchmarks that other research can reference, though.
The h-index means nothing when there's always a name in the author list that didn't do anything ;)
jpollock 19 days ago [-]
Am I reading the example correctly? The prompt is the same size as the generated code, and likely more difficult to understand? Why would you use that? Why would you use anything that includes a "TODO"?
-- ignoring what I would consider "weird" in the code, I assume that's just style.
The prompt is:
This function performs a forward pass for a model, incorporating conditioning and time step information.
It randomly selects time steps, applies learned conditioning if applicable, and processes the inputs according to the model's conditioning requirements.
Finally, it computes and returns the loss for the given inputs and conditioning.
The ground truth (am I correct that this is the expected answer?) is:
def forward(self, x, c, *args, **kwargs):
    t = torch.randint(0, self.num_timesteps, (x.shape[0],), device=self.device).long()
    logging.info(f'Random timestep t generated with shape: {t.shape}')
    if self.model.conditioning_key is not None:
        assert c is not None
        if self.cond_stage_trainable:
            c = self.get_learned_conditioning(c)
        if self.shorten_cond_schedule:  # TODO: drop this option
            tc = self.cond_ids[t].to(self.device)
            c = self.q_sample(x_start=c, t=tc, noise=torch.randn_like(c.float()))
    return self.p_losses(x, c, t, *args, **kwargs)
nyrikki 18 days ago [-]
> so there was basically no chance that models had been trained on submission code yet. I do suspect performance on AoC 24 will increase over time…
No, but several of the problems were similar to problems in the training set from other challenges.
As an example, there was one that was almost exactly the same as a Google Foobar problem called 'Prepare the Bunnies' Escape'.
It is pretty hard to make these challenges work across multiple languages, skill levels, etc. without turning them into 'which of DFS/BFS/A*/Dijkstra works here' problems (see the sketch below).
Obviously outside of the leaderboard group, an LLM being able to solve it doesn't really matter.
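To make that concrete, here is the kind of grid-BFS template most of these puzzles boil down to (purely illustrative, not the actual Foobar or AoC statement):

from collections import deque

def shortest_path(grid):
    # Illustrative only: 0 = open cell, 1 = wall; find the shortest route
    # from the top-left to the bottom-right corner, counting cells visited.
    rows, cols = len(grid), len(grid[0])
    seen = {(0, 0)}
    queue = deque([(0, 0, 1)])  # (row, col, path length so far)
    while queue:
        r, c, dist = queue.popleft()
        if (r, c) == (rows - 1, cols - 1):
            return dist
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc, dist + 1))
    return -1  # unreachable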
My friends and I flipped a coin each day for TDD vs. a traditional approach, testing our assumptions and evaluating which was most readable/maintainable.
It was great for that.
spott 18 days ago [-]
I always find it odd when a new benchmark comes out and uses models that are all six months to a year old. I'm pretty sure all of these models are more than a year old.
This looks super interesting and useful, but without Claude or o1 data points, it isn’t clear if it is saturated or not, and without more modern open source models, it doesn’t give me a useful signal to choose a model.
t-writescode 18 days ago [-]
This is an academic paper with 81 sources. Those take a bit of time to write and get approved. This is less of a streamlined benchmark and more of a research paper with a bunch of firsts that had to be designed, standardized and built.
A benchmark is more like a program that is run and whose output is evaluated. Those are coming later, but they're not like this. Arguably to make it "scientifically valuable" rather than anecdotally valuable, this sort of work has to come first to figure out what and/or how to benchmark in the first place.
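Roughly, the "program that is run and whose output is evaluated" part is just a scoring loop, something like this (generate_code and the per-task check are hypothetical placeholders, not anything from this paper):

def run_benchmark(model, tasks):
    # Run the model under test on each task and score its outputs.
    passed = 0
    for task in tasks:
        completion = model.generate_code(task["prompt"])  # hypothetical model API
        if task["check"](completion):  # hypothetical checker, e.g. runs unit tests
            passed += 1
    return passed / len(tasks)  # pass@1-style score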
stonogo 18 days ago [-]
The point of the paper is introducing the benchmarking tool. It doesn't matter whether they've got the latest hotness for that; the idea is that anyone can then take the tool and use it to evaluate whatever models they care about.
deadbabe 19 days ago [-]
If a new programming language is created in the future, how long would it take the LLM to become proficient in it for everyday use?
hansvm 18 days ago [-]
IME, not long. ChatGPT (v3) did poorly with Zig when it first came out, despite some Zig code in its training set, and despite equivalent Python tasks being handled trivially. I worked around that for a bit by including a moderately lengthy prompt prefix with example Zig code and descriptions of interesting language features, which worked well for bulk boilerplate at the time. A year later, the model was updated (presumably with additional GitHub Zig source as training data) and works fine without any prompt prefix strategy, at least for the simple sorts of tasks I'd use an LLM for.
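The workaround was nothing fancy, just prepending a primer to every request; something like this (the primer text here is illustrative, not the actual prefix I used):

ZIG_PRIMER = """Zig cheat sheet:
- errors are values: `fn read() !u8` returns a u8 or an error
- no hidden allocations: pass an std.mem.Allocator explicitly
Example program:
    const std = @import("std");
    pub fn main() !void {
        std.debug.print("hello\\n", .{});
    }
"""

def build_zig_prompt(task: str) -> str:
    # Prepend the primer, then hand the resulting string to whatever
    # chat-completion API you use.
    return ZIG_PRIMER + "\nNow write Zig code for this task:\n" + task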
gadflyinyoureye 18 days ago [-]
A question might be: will there be new languages? If we start using AI so much that humans lose their jobs, where will the examples come from on which we'll train the AI?
cpeterso 18 days ago [-]
What might a programming language that was designed to be more easily or correctly generated by LLMs look like? Probably something Lispy. Without an organic corpus of existing code, it would need to be trained entirely on synthetic examples.
blharr 17 days ago [-]
LLMs are probably the worst at programming in Lisp in my experience though
thorum 18 days ago [-]
The goal is for LLMs to become so good at long-context tasks and reasoning about code that you can simply paste in the entire documentation for your new language and they'll figure it out.
Once you have that working, you can generate synthetic examples to train the next version of the LLM, or for open source coding models, a small LoRA file that anyone can download to add support to their model of choice.
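Concretely, "add support to their model of choice" would just mean loading a small adapter on top of the base weights. A rough Python sketch using Hugging Face's peft library (the model name and adapter path are placeholders):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "some-org/some-coding-model"  # placeholder base model
base = AutoModelForCausalLM.from_pretrained(base_name)
tokenizer = AutoTokenizer.from_pretrained(base_name)

# The downloadable LoRA file: low-rank weight deltas trained on the new
# language's synthetic examples, applied on top of the frozen base model.
model = PeftModel.from_pretrained(base, "path/to/new-language-lora")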
billwear 18 days ago [-]
forget benchmarks. use emacs lisp as a test case. i have yet to find an LLM that can consistently generate working functions in elisp (possibly even lisp itself).
chris_5f 18 days ago [-]
Very sad to see Claude Sonnet not included. DeepSeek is now a good competitor too. Hoping to see something similar for both of them. Is there anything like that available already?
Narciss 18 days ago [-]
I was very curious about Anthropic models, sadly they were not included.
behnamoh 19 days ago [-]
I hate this kind of ephemeral research that will be outdated in a few months. Instead of analyzing, at a fundamental level, why LLMs can or cannot do something, researchers just show what current models are or are not capable of.
And oftentimes the results are not replicable either.
rgmerk 19 days ago [-]
The potential utility in the research is not so much the results but the benchmark itself.
I haven’t read the whole thing so I can’t really judge whether this specific benchmark is useful, but if it is, every time a new model comes out they can run the benchmark and breathlessly report its improved performance.
lolinder 19 days ago [-]
> However, existing code generation benchmarks primarily focus on general-purpose scenarios, leaving the code generation performance of LLMs for specific application domains largely unknown. In this paper, we introduce a new benchmark, MultiCodeBench, to fill this gap.
This is about the benchmark they're introducing, which would have real uses for all subsequent models.
tossandthrow 19 days ago [-]
Well, science moves. It is because of research like this that it will be outdated.
Great that you enjoy more foundational work!