
[Bug]: eval-this workflow is not working #5106

Open

neubig opened this issue Nov 18, 2024 · 4 comments
Labels: bug

neubig (Contributor) commented Nov 18, 2024

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Describe the bug and reproduction steps

Currently the eval-this workflow is not working, so we should fix it.

Idea:

  • Switch the LLM to Claude Haiku
  • Use only a subset of SWE-bench instances to make it affordable (see the sketch below)
  • Run it and make sure that it works
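For concreteness, here is a minimal sketch of what those changes might look like as a workflow step. This is not the actual eval-this definition: the model id, secret name, script arguments, and the instance-limit value are all assumptions for illustration.

```yaml
# Hypothetical step for the eval-this workflow (all names assumed).
- name: Run SWE-bench eval on a small subset
  env:
    # Assumed litellm-style model id for Claude Haiku.
    LLM_MODEL: anthropic/claude-3-haiku-20240307
    LLM_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: |
    # The trailing "30" caps the number of SWE-bench instances to keep the
    # run affordable; the script signature here is an assumption, not the
    # repo's actual interface.
    ./evaluation/swe_bench/scripts/run_infer.sh llm.eval HEAD CodeActAgent 30
```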

@csmith49 will take a look at this

OpenHands Installation

Docker command in README

OpenHands Version

No response

Operating System

None

Logs, Errors, Screenshots, and Additional Context

No response

enyst (Collaborator) commented Nov 23, 2024

The eval-this workflow has two parts:

  • the actual evaluation
  • the integration tests

openhands-agent has split them here:

I can confirm that the new integration tests workflow works with Haiku:

I feel we need the integration tests, in their new form, back in working state. At some point we removed them from ./tests and refactored them as external scripts like the evals, also using Deepseek, but those weren't working either.

What do you think about this? Could we have a nightly run for them (and maybe a label trigger too, just in case it's needed), also with Haiku?

IMHO it would be cool if we could also have a nightly on Deepseek or something, because:

  • Haiku has native function calling
  • Deepseek doesn't, so its runs go through a different prompt/code/conversion/pydantic-serialization path (it really affects things, IMHO); see the sketch below

There are currently only about six of these integration tests (and I'm working on a seventh), but they try to cover some realistic usage that we just don't have coverage for anywhere else.
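To make the nightly-plus-label idea concrete, a trigger along these lines would cover both, with a model matrix so the same job exercises Haiku's native function calling and Deepseek's prompt-based path. This is only a sketch: the cron time, label name, model ids, and script path are all assumptions.

```yaml
# Hypothetical .github/workflows/integration-tests.yml (all names assumed).
name: Integration tests
on:
  schedule:
    - cron: "0 2 * * *"   # nightly run; the time is arbitrary
  pull_request:
    types: [labeled]      # also run when a PR gets labeled

jobs:
  integration-tests:
    # Run on the schedule, or when the (assumed) trigger label is applied.
    if: github.event_name == 'schedule' || github.event.label.name == 'integration-test'
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # One leg with native function calling (Haiku), one without (Deepseek),
        # so both code paths get exercised. Model ids are assumed.
        llm: [anthropic/claude-3-haiku-20240307, deepseek/deepseek-chat]
    steps:
      - uses: actions/checkout@v4
      - name: Run integration tests
        env:
          LLM_MODEL: ${{ matrix.llm }}
        # Script path is an assumption for illustration.
        run: ./tests/integration/run_tests.sh
```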

enyst (Collaborator) commented Nov 23, 2024

Oh, also: at this time, the Deepseek API key defined on this repo is depleted. I doubt that's the original reason the eval workflow wasn't working, but it looks like the first blocker right now. 😅 Cc: @neubig

enyst mentioned this issue Nov 24, 2024
enyst (Collaborator) commented Nov 25, 2024

This is my proposal for these: (source)

[image: screenshot of the proposal]

neubig (Contributor, Author) commented Nov 27, 2024

Thanks so much for digging into this, @enyst!
Unfortunately I'm a bit short on time to look at it myself, but @mamoodi, if you'd be able to take a look I'd love your comments.
