
[Bug]: eval-this workflow is not working #5106

Open

neubig opened this issue Nov 18, 2024 · 4 comments
Labels: bug

neubig (Contributor) commented Nov 18, 2024

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Describe the bug and reproduction steps

Currently the eval-this workflow is not working, so we should fix it.

Idea:

  • Switch the LLM to Claude Haiku
  • Use only a subset of SWE-bench instances to make it affordable (see the sketch below)
  • Run it and make sure that it works
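For concreteness, here is a minimal sketch of what those changes might look like as a workflow step. This is not the actual eval-this definition: the model id, secret name, script arguments, and the instance-limit value are all assumptions for illustration.

```yaml
# Hypothetical step for the eval-this workflow (all names assumed).
- name: Run SWE-bench eval on a small subset
  env:
    # Assumed litellm-style model id for Claude Haiku.
    LLM_MODEL: anthropic/claude-3-haiku-20240307
    LLM_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: |
    # The trailing "30" caps the number of SWE-bench instances to keep the
    # run affordable; the script signature here is an assumption, not the
    # repo's actual interface.
    ./evaluation/swe_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 30
```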

@csmith49 will take a look at this

OpenHands Installation

Docker command in README

OpenHands Version

No response

Operating System

None

Logs, Errors, Screenshots, and Additional Context

No response

enyst (Collaborator) commented Nov 23, 2024

The eval-this workflow has two parts:

  • the actual evaluation
  • the integration tests

openhands-agent has split them here:

I can confirm that the new integration tests workflow works with Haiku:

I feel we need the integration tests, in their new form, back in working state. At some point we removed them from ./tests and refactored them as external scripts like the evals, also using Deepseek, but those weren't working either.

What do you think about this? Could we have a nightly run for them (and maybe a label trigger too, just in case it's needed), also with Haiku?

IMHO it would be cool if we could also have a nightly on Deepseek or something, because:

  • Haiku has native function calling
  • Deepseek doesn't, so its runs go through a different prompt/code/conversion/pydantic-serialization path (it really affects things, IMHO); see the sketch below

There are currently only about six of these integration tests (and I'm working on a seventh), but they try to cover some realistic usage that we just don't have coverage for anywhere else.
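To make the nightly-plus-label idea concrete, a trigger along these lines would cover both, with a model matrix so the same job exercises Haiku's native function calling and Deepseek's prompt-based path. This is only a sketch: the cron time, label name, model ids, and script path are all assumptions.

```yaml
# Hypothetical .github/workflows/integration-tests.yml (all names assumed).
name: Integration tests
on:
  schedule:
    - cron: "0 2 * * *"   # nightly run; the time is arbitrary
  pull_request:
    types: [labeled]      # also run when a PR gets labeled

jobs:
  integration-tests:
    # Run on the schedule, or when the (assumed) trigger label is applied.
    if: github.event_name == 'schedule' || github.event.label.name == 'integration-test'
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # One leg with native function calling (Haiku), one without (Deepseek),
        # so both code paths get exercised. Model ids are assumed.
        llm: [anthropic/claude-3-haiku-20240307, deepseek/deepseek-chat]
    steps:
      - uses: actions/checkout@v4
      - name: Run integration tests
        env:
          LLM_MODEL: ${{ matrix.llm }}
        # Script path is an assumption for illustration.
        run: ./tests/integration/run_tests.sh
```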

enyst (Collaborator) commented Nov 23, 2024

Oh, also: at this time, the Deepseek API key defined on this repo is depleted. I doubt that's the original reason the eval workflow wasn't working, but it looks like the first blocker right now. 😅 Cc: @neubig

enyst mentioned this issue Nov 24, 2024
enyst (Collaborator) commented Nov 25, 2024

This is my proposal for these: (source)

[image: screenshot of the proposal]

neubig (Contributor, Author) commented Nov 27, 2024

Thanks so much for digging into this, @enyst!
Unfortunately I'm a bit short on time to look at it myself, but @mamoodi, if you'd be able to take a look I'd love your comments.
