For Context:
SWE-bench (+ Verified) is the benchmark (resolving GitHub issues) that companies building coding agents are chasing: Devin, Claude, OpenAI, all of them!
A new #1 leader, CodeStory Midwit Agent + swe-search, has been crowned with a score of 62% on SWE-bench Verified (without even using any reasoning models like OpenAI o1 or o3).
More details on their approach - https://aide.dev/blog/sota-bitter-lesson
This is a very impressive result. OpenAI was able to achieve 72% with o3, but at a very high inference-time compute cost.
I'd be interested for Aide to release more metrics on token counts, total expenditure, etc. to better understand exactly how much test-time compute is involved here. They allude to it being a lot, but it would be nice to compare with OpenAI's o3.
Hey! One of the creators of Aide here.
ngl, the total expenditure was around $10k. In terms of test-time compute, we ran up to 20X agents on the same problem, first to understand whether the bitter lesson paradigm of "scale is the answer" really holds true.
The final submission ran 5X agents, and the decider was based on the mean score of the rewards. The cost per problem was around $20.
We are going to push this scaling paradigm a bit more. My honest gut feeling is that SWE-bench as a benchmark is primed for saturation real soon:
1. These problem statements are in the training data for the LLMs
2. Brute-forcing the answer the way we are doing works and we just proved it, so someone is going to take a better stab at it real soon
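For readers wondering what "run 5X agents and decide by the mean of the rewards" could look like in practice, here is a minimal sketch. It is not Aide's actual code: the `pick_best` helper, the agent names, and the reward numbers are all made up for illustration, and it assumes each agent run produces a list of per-step reward scores. The cost check at the end assumes the 500 problems in SWE-bench Verified and the roughly $20 per problem quoted above; whether the $10k total also covers the 20X experiments is not stated.

```python
from statistics import mean

def pick_best(runs):
    """Return the name of the run whose rewards have the highest mean.

    `runs` maps a (hypothetical) agent name to the list of reward
    scores collected along that agent's trajectory.
    """
    return max(runs, key=lambda name: mean(runs[name]))

# Five independent attempts at the same problem (illustrative numbers only).
runs = {
    "agent-1": [0.2, 0.4, 0.3],
    "agent-2": [0.9, 0.1, 0.2],
    "agent-3": [0.6, 0.7, 0.8],
    "agent-4": [0.5, 0.5],
    "agent-5": [0.0, 1.0],
}

best = pick_best(runs)  # the patch from this run would be the one submitted

# Rough cost check: 500 problems at about $20 each.
NUM_PROBLEMS = 500          # size of SWE-bench Verified
COST_PER_PROBLEM_USD = 20   # figure quoted in the comment above
estimated_total_usd = NUM_PROBLEMS * COST_PER_PROBLEM_USD
```

Taking the mean rather than the maximum reward favours runs that look consistently good over ones with a single lucky step, which is one plausible reason to pick it as the decider.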
tbh there has been some issue with their previous reporting
https://x.com/Alex_Cuadron/status/1876017241042587964
Thanks! It feels like we should switch the top link to that URL since it's a deeper dive into the new bit that's interesting here.
Edit: I've done that now. Submitted URL was https://www.swebench.com/ and submitted title was "SWE Bench just got updated – new #1s".
> The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.
So, in the long run, we'll just throw more and more hardware at AI, forever.
> The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex... we should build in only the meta-methods that can find and capture this arbitrary complexity.
So AI will permanently involve throwing a ton of compute at a ton of data.
I guess it's time to buy stock in computer hardware manufacturers.
We've been throwing more compute at our problems for what, 80 years?
Yeah. Longer if you include non-silicon/germanium computers [0].
The approach seems to continue to work.
[0] https://en.wikipedia.org/wiki/Computer_(occupation)
Where can we check the actual model output? The worry is that it could be an unreadable, buggy mess, even if it managed to close some specific bug.
It would be helpful to explain what this is and what's interesting about the updates. Anyone?
Edit: URL since changed - see https://news.ycombinator.com/item?id=42639155
---
Edit: I found these past related threads, but not much discussion there:
Pplx and Dbrx founder giving $1M to first OSS AI that gets 90% on SWE-bench - https://news.ycombinator.com/item?id=42413392 - Dec 2024 (3 comments)
We might be overestimating coding agent performance on SWE-Bench - https://news.ycombinator.com/item?id=42054973 - Nov 2024 (1 comment)
SWE-Bench Verified - https://news.ycombinator.com/item?id=41237204 - Aug 2024 (10 comments)
Show HN: Public and Free SWE-bench-lite evaluations - https://news.ycombinator.com/item?id=40974181 - July 2024 (1 comment)
#1 agent on swe-bench wrote 7% of its own code - https://news.ycombinator.com/item?id=40627095 - June 2024 (1 comment)
Aider Is SOTA for Both SWE Bench and SWE Bench Lite - https://news.ycombinator.com/item?id=40562121 - June 2024 (1 comment)
How Aider Scored SOTA 26.3% on SWE Bench Lite - https://news.ycombinator.com/item?id=40477191 - May 2024 (1 comment)
This bench seems to be entirely Python-based. Are there similar benchmarks that test different languages for these tools?
I'm one of the co-authors of SWE-bench. We just created a JavaScript (+visual) SWE-bench: https://www.swebench.com/multimodal.html
We're going to release the eval suite for this soon so that people can start making submissions.