Support Instance Level Images for SWE-Bench Evaluation #2874
Mostly LGTM! Thanks for this!
I can also help run an eval to validate the effectiveness of this when this PR is ready :)
Co-authored-by: Xingyao Wang <[email protected]>
All comments are addressed.
Yeah, that would be helpful, since I'm running out of disk space for storing the images 😢
LGTM! Let me run an eval first before merging it in!
@Jiayi-Pan There seems to be some issue with exiting Also, I think |
Thanks, I've disabled |
LGTM! Pending evaluation - set to "request changes" in case of accidental merge.
For some reason, the evaluation results are worse (64/300) than the baseline results (74/300): https://huggingface.co/spaces/OpenDevin/evaluation/tree/main/outputs/swe_bench_lite/CodeActAgent/claude-3-5-sonnet%4020240620_maxiter_30_N_v1.8-no-hint
I've attached the evaluated outputs: claude-3-5-sonnet@20240620_maxiter_30_N_v1.8-no-hint-pr2874.tar.gz
Feel free to compare them and quickly check the instances that failed.
I assigned you to this, @xingyaoww, but feel free to reassign if you think someone else is more appropriate.
What is the problem that this fixes or functionality that this introduces? Does it fix any open issues?
Give a brief summary of what the PR does, explaining any non-trivial design decisions
We were limited to using the large, all-in-one ghcr.io/opendevin/eval-swe-bench:full-v1.2.1 image for SWE-bench-lite evaluation, which is both heavy and inflexible.
This PR introduces support for using instance-level images for SWE-bench evaluation, which complements our previous approach. Each SWE-bench instance is now built into its own image (following SWE-bench's official format), enabling greater granularity and extensibility.
I've tested on 10 instances so far and plan to test on the entire SWE-bench-lite set to verify its effectiveness.
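As a rough illustration of the choice described above (not the PR's actual code), image selection might look something like the sketch below. The function name select_image, the use_instance_image flag, the example.registry/swe-bench prefix, and the lowercase-plus-:latest naming scheme are all hypothetical; only the all-in-one image tag comes from the description.

```python
# The all-in-one image named in the PR description.
FULL_IMAGE = "ghcr.io/opendevin/eval-swe-bench:full-v1.2.1"


def select_image(
    instance_id: str,
    use_instance_image: bool,
    prefix: str = "example.registry/swe-bench",  # hypothetical registry prefix
) -> str:
    """Return the Docker image to use for one SWE-bench instance.

    Hypothetical helper: falls back to the all-in-one image unless
    instance-level images are requested.
    """
    if not use_instance_image:
        return FULL_IMAGE
    # Assumed naming scheme: one image per instance, named after the
    # lowercased instance ID. The real scheme may differ.
    name = instance_id.lower()
    return f"{prefix.rstrip('/')}/{name}:latest"


if __name__ == "__main__":
    print(select_image("django__django-11099", use_instance_image=True))
    print(select_image("django__django-11099", use_instance_image=False))
```

With a per-instance image, only the images for the instances actually being evaluated need to be pulled, which is where the disk-space saving mentioned earlier in the thread would come from.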