Seems like a very limited subset of software development to be basing a benchmark on
Where’s the kernel dev? Where’s the embedded dev? Where’s the throwaway python script?
achierius 30 days ago [-]
We all know that nobody actually writes kernels, compilers, firmware, or video games, they're just mined straight from the earth.
sosuke 31 days ago [-]
No Hugging Face models, or did I just miss them? Edit: they mention doing open models at some point at the bottom of the page.
danpalmer 31 days ago [-]
> To mimic real-world development, HackerRank’s ASTRA Benchmark Dataset includes, on average, 12 source code and configuration files per question as model inputs.
How is 12 files "real-world development"? My hobby project currently has 142 files, and most non-trivial changes would involve adding a new file. My small work project has 79 files and, similarly, any non-trivial change will need to add a file. These are small codebases. My previous team's codebase was ~450k lines across thousands of files, and we managed that pretty effectively with 6 engineers.
Getting the right answer out of an LLM for these sorts of tasks is fine if you give it little enough context that it's an effectively greenfield task as most of these problems end up being. But giving them a whole codebase and expecting the right answer, or the process of choosing the right subset to give them, are still big unsolved problems.
At this point it honestly feels a bit like gaslighting, suggesting that a 12 file NodeJS server is representative of software engineering.
rushingcreek 31 days ago [-]
Would love to see how DeepSeek R1 compares to o1 here.
rokhayakebe 31 days ago [-]
How will programming change when we reach 99-100%?
CharlieDigital 31 days ago [-]
My take is that teams should start to think about selecting for code review skills instead of pure coding skills.
The AI is going to significantly improve coding output, but at least for a while, we're still going to need human shepherds to make a call on quality and to check for performance, security, and conformance to the bigger picture of the system. Maybe longer than we think, given that we still have conductors and pilots.
The startup I was at just wound down last December and I interviewed with a handful of companies. None had any code reviews as part of their process, even though they probably already have engineers using Copilot or Cursor.
There's only been one company, a YC startup, that I interviewed with that incorporated a code review (as the first round, no less). After really enjoying that process with the startup, I ended up creating a lightweight, open-source app[0] for teams to incorporate code reviews into the interview process more easily.
[0] https://coderev.app
Nice! I was just thinking my company should be doing code reviews as part of the interview process. Will definitely check this out.
QuadmasterXLII 31 days ago [-]
Today I, in a fit of laziness, asked Claude for a C function to invert a matrix instead of writing it myself. It gave me a function that is wrong if malloc gives a pointer to non-zeroed memory. If it's 99% and the 1% continues to be mistakes like that, programming is going to be hell.
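A minimal C sketch of the bug class described above (an illustration of code assuming malloc() returns zeroed memory, not the actual function Claude produced); Gauss-Jordan style inversion typically builds an identity matrix, and doing that with malloc() only works when the buffer happens to be zeroed:

    /* Sketch of the bug class, not the actual generated code. */
    #include <stdlib.h>

    double *identity_buggy(size_t n) {
        double *m = malloc(n * n * sizeof *m);   /* contents are indeterminate */
        if (!m) return NULL;
        for (size_t i = 0; i < n; i++)
            m[i * n + i] = 1.0;                  /* off-diagonal entries never written */
        return m;
    }

    double *identity_fixed(size_t n) {
        double *m = calloc(n * n, sizeof *m);    /* calloc() zero-initializes */
        if (!m) return NULL;
        for (size_t i = 0; i < n; i++)
            m[i * n + i] = 1.0;
        return m;
    }

The buggy version can still appear to pass quick tests, because fresh allocations often come back zeroed from the OS; it fails once the allocator reuses non-zeroed memory, which is exactly the kind of latent defect a human reviewer has to catch.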
maccard 30 days ago [-]
If Claude can spit that out, I am more than competent enough to review it. Saving 15-20m here and there is a good trade-off IMO.
bdhcuidbebe 30 days ago [-]
As long as you can analyze and fix those kinds of issues, your skills should be in high demand given current developments.
Bjorkbat 31 days ago [-]
My belief is that software engineering benchmarks are still a poor proxy for performance on real-world software engineering tasks, and that there's a decent chance a new model might saturate a benchmark while being kind of underwhelming.
A simple example: if a human scored 50% on SWE-bench Verified, it's fair to say that this person is a very competent software engineer. Popular frontier models like Claude Sonnet and OpenAI's o3 can score 50% on SWE-bench, and can score even higher with special tooling, but compared to an actual human software engineer they can't seem to competently perform a lot of programming tasks on their own.
Although, if a model did consistently score more than 99% on various software engineering benchmarks that might be different, as it would imply a very real sense of competence. That's a pretty substantial if though. To my knowledge there isn't a single model out there that can consistently score more than 99% on any given benchmark. The o1 model scored very well on certain MMLU categories, 98.1% on college mathematics, but I'm not sure if this result will continue to hold on a similar benchmark evaluating college-level undergraduate mathematics.
Also, something else to consider: we take for granted how often we're able to perform tasks with more than 99% accuracy, and how quickly things would fall apart if this weren't the case. If the average human driver were only able to make an accident-free trip 99% of the time, that would imply they'd get in a wreck every 100th time they drive their car.
Granted, software engineering might be the exception to this rule, but then again, it depends on what you're measuring. When it comes to more-or-less discrete steps, we're arguably pretty good at writing programs that capture our intent, and I could foresee an AI that only gets this right 99% of the time being a pain to work with. If a feature ticket requires 10 different sub-tasks to be done correctly, then an AI that can do each sub-task correctly 99% of the time has roughly a 90% chance of doing the whole feature ticket correctly, which is still good but, compounded over many feature tickets, could be exhausting to deal with. An AI that has only a 90% chance of doing each sub-task correctly would fail this hypothetical feature ticket more often than not.
Mind you, statistics is not my domain so if there are any errors in my logic please correct me.
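A quick numeric check of the compounding arithmetic in the comment above, assuming the 10 sub-tasks are independent (an illustrative simplification, not a claim about real tickets):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* Probability of completing all 10 independent sub-tasks correctly. */
        printf("0.99^10 = %.3f\n", pow(0.99, 10));  /* ~0.904 */
        printf("0.90^10 = %.3f\n", pow(0.90, 10));  /* ~0.349 */
        return 0;
    }

So a model that gets each step right 99% of the time lands the whole ticket roughly 90% of the time, while one at 90% per step succeeds only about a third of the time, which is why "almost right" compounds so badly.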
rvz 31 days ago [-]
It means assessment tools like LeetCode and HackerRank are not enough to evaluate programmer ability. In fact, they never were, and they have always been gamed for the sake of passing the interview.
With every passing day, the value of LeetCode and HackerRank as a measure decreases as 'reasoning' AI agents get better. It now takes a very skilled engineer to review the code from these LLMs, as they are still non-deterministic tools and still generate plausible code that looks correct but is in fact erroneous or suboptimal.
The real solution for proper software engineering interview assessments is live code review: how a candidate reasons about code written by anyone (themselves, another person, or an AI). For example, open-source contributions to highly significant repositories *always* require code review by the maintainers, which is very easy to verify and eliminates fraudulent claims about who did what.
Frankly speaking, LeetCode and HackerRank are playing a losing game against LLMs and AI agents; code-review ability, whether demonstrated on open-source projects or on example projects, is a much better way to assess SWEs in interviews.
CharlieDigital 31 days ago [-]
Exactly. When an AI can solve the coding challenge nearly instantly, what are you measuring by having a human do it as well?
That's like measuring a human against a hydraulic crane for lifting ability.
No, what you want is to measure if the human can safely and deftly operate the crane.