
Support Instance Level Images for SWE-Bench Evaluation #2874

Merged · 10 commits · Jul 16, 2024

Conversation

@Jiayi-Pan (Contributor)

What is the problem that this fixes or functionality that this introduces? Does it fix any open issues?
Give a brief summary of what the PR does, explaining any non-trivial design decisions

We were limited to using the large, all-in-one ghcr.io/opendevin/eval-swe-bench:full-v1.2.1 for SWE-bench-lite evaluation, which is both heavy and inflexible.

This PR introduces support for using instance-level images for SWE-bench evaluation, which complements our previous approach. Each SWE-bench instance is now built into its own image (following SWE-bench's official format), enabling greater granularity and extensibility.

I've tested on 10 instances so far, and plan to run the entire SWE-bench-lite set to verify its effectiveness.
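The core idea above is a mapping from each SWE-bench instance to its own image, with the monolithic image as a fallback. A minimal sketch of that selection logic follows; the tag-rewriting rule and registry prefix are illustrative assumptions, not the exact naming convention this PR implements.

```python
# Hypothetical sketch of per-instance image selection. The registry
# prefix and the "__" -> "_s_" tag rewrite are assumptions for
# illustration (Docker tags cannot contain "__"); only the all-in-one
# image name below comes from the PR description.

ALL_IN_ONE_IMAGE = "ghcr.io/opendevin/eval-swe-bench:full-v1.2.1"

def instance_image_name(instance_id: str,
                        prefix: str = "ghcr.io/opendevin/eval-swe-bench") -> str:
    """Map a SWE-bench instance id to its own image tag (assumed format)."""
    safe_id = instance_id.replace("__", "_s_")  # make the id a legal Docker tag
    return f"{prefix}:{safe_id}"

def pick_image(instance_id: str, use_instance_images: bool) -> str:
    """Fall back to the monolithic image when per-instance images are off."""
    return instance_image_name(instance_id) if use_instance_images else ALL_IN_ONE_IMAGE

print(pick_image("django__django-11099", True))
print(pick_image("django__django-11099", False))
```

Keeping the fallback path means existing evaluation runs against the all-in-one image continue to work unchanged while the per-instance path is validated.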

Other references

@xingyaoww (Collaborator) left a comment


Mostly LGTM! Thanks for this!

Review comments (resolved, now outdated) on:
- evaluation/swe_bench/README.md
- evaluation/swe_bench/run_infer.py
- evaluation/swe_bench/scripts/run_infer.sh (two threads)
- evaluation/swe_bench/swe_env_box.py
@xingyaoww (Collaborator)

I can also help run an eval to validate the effectiveness of this when this PR is ready :)

@Jiayi-Pan (Contributor, Author)

Jiayi-Pan commented Jul 9, 2024

All comments are addressed.

> I can also help run an eval to validate the effectiveness of this when this PR is ready :)

Yeah, that would be helpful, since I'm running out of disk space for storing the images 😢

@xingyaoww (Collaborator) left a comment


LGTM! Let me run an eval first before merging it in!

Review comment (resolved, now outdated) on evaluation/swe_bench/README.md
@xingyaoww (Collaborator)

xingyaoww commented Jul 9, 2024

@Jiayi-Pan There seems to be some issue with exiting `git diff` commands... Did you run into those in your tests?

Also, I think instance_report won't work here, maybe we can disable instance_report for instance-level docker image?
(screenshot attached)

@Jiayi-Pan (Contributor, Author)

Thanks, I've disabled instance_report for instance images. The git diff issue should be fixed after #2878.

@xingyaoww xingyaoww marked this pull request as ready for review July 11, 2024 17:12
@xingyaoww (Collaborator) left a comment


LGTM! Pending evaluation - set to "request changes" in case of accidental merge.

@xingyaoww (Collaborator)

For some reason, the evaluation results are not as good (64/300) compared to the baseline results (74/300): https://huggingface.co/spaces/OpenDevin/evaluation/tree/main/outputs/swe_bench_lite/CodeActAgent/claude-3-5-sonnet%4020240620_maxiter_30_N_v1.8-no-hint

I've attached the evaluated outputs: claude-3-5-sonnet@20240620_maxiter_30_N_v1.8-no-hint-pr2874.tar.gz

Feel free to compare & quickly check the instances that failed.
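One quick way to follow the suggestion above is to diff the sets of resolved instance ids between the baseline run and this PR's run. A minimal sketch, assuming each report is a list of records with `instance_id` and `resolved` fields (the actual output format may differ):

```python
# Hypothetical sketch: find instances the baseline solved but the new
# run did not. The report schema below is an assumption for
# illustration, not the project's actual output format.

def resolved_ids(entries) -> set:
    """Collect instance ids marked resolved in one evaluation report."""
    return {e["instance_id"] for e in entries if e.get("resolved")}

# Tiny made-up sample standing in for the two runs' report files:
baseline = resolved_ids([
    {"instance_id": "django__django-11099", "resolved": True},
    {"instance_id": "sympy__sympy-20590", "resolved": True},
])
candidate = resolved_ids([
    {"instance_id": "django__django-11099", "resolved": True},
    {"instance_id": "sympy__sympy-20590", "resolved": False},
])

# Instances only the baseline solved are the first place to look for regressions.
regressions = sorted(baseline - candidate)
print(regressions)
```

Instances appearing only in the baseline's resolved set are the most likely candidates for an environment regression introduced by the image change.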

@neubig (Contributor)

neubig commented Jul 15, 2024

I assigned you to this @xingyaoww , but feel free to reassign if you think someone else is more appropriate.

@xingyaoww (Collaborator) left a comment


After some offline discussion with Jiayi, I re-ran the eval and got 76/300 (25.3%)! Good to merge now!

(screenshot attached)

@xingyaoww xingyaoww merged commit 7111e8e into All-Hands-AI:main Jul 16, 2024
2 checks passed