Support Instance Level Images for SWE-Bench Evaluation #2874

Jiayi-Pan · 2024-07-09T16:18:14Z

What is the problem that this fixes or functionality that this introduces? Does it fix any open issues?
Give a brief summary of what the PR does, explaining any non-trivial design decisions

We were limited to using the large, all-in-one ghcr.io/opendevin/eval-swe-bench:full-v1.2.1 for SWE-bench-lite evaluation, which is both heavy and inflexible.

This PR introduces support for using instance-level images for SWE-bench evaluation, which complements our previous approach. Each SWE-bench instance is now built into its own image (following SWE-bench's official format), enabling greater granularity and extensibility.
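As a rough illustration of how the two approaches can coexist, the sketch below picks between the all-in-one image and a per-instance image. It is not the code from this PR: the `USE_INSTANCE_IMAGE` variable name, the `sweb.eval.x86_64.` prefix, and both helper functions are assumptions made for the example.

```python
import os

# The all-in-one image named in the PR description.
FULL_IMAGE = "ghcr.io/opendevin/eval-swe-bench:full-v1.2.1"

# Hypothetical prefix modeled on SWE-bench-style image names (an assumption).
INSTANCE_IMAGE_PREFIX = "sweb.eval.x86_64."


def use_instance_image(env=None) -> bool:
    """Return True when the (assumed) USE_INSTANCE_IMAGE flag is set to 'true'."""
    env = os.environ if env is None else env
    return env.get("USE_INSTANCE_IMAGE", "false").lower() == "true"


def select_image(instance_id: str, env=None) -> str:
    """Pick the container image for one SWE-bench instance."""
    if not use_instance_image(env):
        # Default path: every instance shares the single large image.
        return FULL_IMAGE
    # Docker repository names must be lowercase, so normalize the instance id.
    return f"{INSTANCE_IMAGE_PREFIX}{instance_id.lower()}:latest"
```

With the flag unset, every instance resolves to the shared image; with it set to `true`, each instance id maps to its own tag, so the images an evaluation run needs can be pulled or removed independently.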

I've tested on 10 instances so far and plan to test on the entire SWE-bench-lite to verify its effectiveness.

Other references

xingyaoww

Mostly LGTM! Thanks for this!

evaluation/swe_bench/README.md

evaluation/swe_bench/run_infer.py

evaluation/swe_bench/scripts/docker/pull_all_eval_docker.sh

evaluation/swe_bench/scripts/run_infer.sh

evaluation/swe_bench/swe_env_box.py

xingyaoww · 2024-07-09T16:40:42Z

I can also help run an eval to validate the effectiveness of this when this PR is ready :)

Co-authored-by: Xingyao Wang <[email protected]>

Jiayi-Pan · 2024-07-09T17:14:47Z

All comments are addressed.

> I can also help run an eval to validate the effectiveness of this when this PR is ready :)

Yeah, that would be helpful, since I'm running out of disk space for storing the images 😢

xingyaoww

LGTM! Let me run an eval first before merging it in!

evaluation/swe_bench/README.md

xingyaoww · 2024-07-09T18:44:59Z

@Jiayi-Pan There seems to be an issue with exiting `git diff` commands. Did you run into it in your tests?

Also, I don't think instance_report will work here. Maybe we can disable instance_report for instance-level Docker images?

Co-authored-by: Xingyao Wang <[email protected]>

Jiayi-Pan · 2024-07-09T19:36:00Z

Thanks, I've disabled instance_report for instance images. The `git diff` issue should be fixed after #2878.

xingyaoww

LGTM! Pending evaluation - set to "request changes" in case of accidental merge.

xingyaoww · 2024-07-14T18:57:34Z

For some reason, the evaluation results are worse (64/300) than the baseline results (74/300): https://huggingface.co/spaces/OpenDevin/evaluation/tree/main/outputs/swe_bench_lite/CodeActAgent/claude-3-5-sonnet%4020240620_maxiter_30_N_v1.8-no-hint

I've attached the evaluated outputs: claude-3-5-sonnet@20240620_maxiter_30_N_v1.8-no-hint-pr2874.tar.gz

Feel free to compare & quickly check the instances that failed.

neubig · 2024-07-15T18:42:06Z

I assigned you to this, @xingyaoww, but feel free to reassign if you think someone else is more appropriate.

xingyaoww

After some offline discussion with Jiayi, I re-ran the eval and got 76/300 (25.3%)! Good to merge now!
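For anyone comparing the three runs mentioned in this thread, the resolve rates can be recomputed from the raw counts. This is only the arithmetic on the numbers quoted above; the `resolve_rate` helper and the run labels are made up for the example.

```python
def resolve_rate(resolved: int, total: int = 300) -> float:
    """Percentage of instances resolved, rounded to one decimal place."""
    return round(100.0 * resolved / total, 1)


# Counts quoted in the comments above (SWE-bench-lite has 300 instances).
runs = {
    "baseline": 74,
    "first run with this PR": 64,
    "re-run with this PR": 76,
}

for label, resolved in runs.items():
    print(f"{label}: {resolved}/300 = {resolve_rate(resolved)}%")
```

The re-run is two instances above the baseline, while the first run was ten below it.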

Jiayi-Pan added 2 commits July 9, 2024 15:58

rename pulled instance images

df3bbc4

Swebench: add support to instance level images

36ed95f

xingyaoww reviewed Jul 9, 2024

Jiayi-Pan and others added 2 commits July 9, 2024 12:49

Update evaluation/swe_bench/run_infer.py

7bc99dd

Co-authored-by: Xingyao Wang <[email protected]>

instance swebench: use env var and docker tags instead

124fdcc

xingyaoww reviewed Jul 9, 2024

Jiayi-Pan and others added 2 commits July 9, 2024 19:33

swebench disable instance report for instance images

6aede5a

Update evaluation/swe_bench/README.md

b10f47a

Co-authored-by: Xingyao Wang <[email protected]>

Merge branch 'main' into main

6bcaea5

xingyaoww mentioned this pull request Jul 11, 2024

Improve SWE-Bench evaluation inference with repo-specific pre-build Docker Image #2636

Closed

Merge branch 'main' into main

b8e3a4c

xingyaoww marked this pull request as ready for review July 11, 2024 17:12

xingyaoww requested changes Jul 11, 2024

xingyaoww added 2 commits July 12, 2024 05:53

Merge branch 'main' into main

332d909

Merge branch 'main' into main

35828c0

neubig assigned xingyaoww Jul 15, 2024

xingyaoww approved these changes Jul 16, 2024

xingyaoww merged commit 7111e8e into All-Hands-AI:main Jul 16, 2024
2 checks passed
