Occasional LM-Eval job freeze #354

Closed
ruivieira opened this issue Nov 7, 2024 · 0 comments · Fixed by #358
Labels: kind/bug (Something isn't working), lm-eval (Issues related to LM-Eval)
ruivieira commented Nov 7, 2024

When using the following deployment, specifically with the Phi-3 model, the job sometimes freezes after a few requests (e.g. at 2% completion with mmlu).

Running tail -f output/stderr.log from inside the main container confirms the freeze.

However, if the exact same lm_eval invocation (as created by the driver) is manually started from the pod, it does not appear to freeze.
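
The manual check above (watching output/stderr.log for new output) could also be automated. The following is only a minimal sketch under stated assumptions: the helper names and the 300-second threshold are hypothetical, not part of the operator or of lm_eval, and a log that has stopped growing is only a heuristic signal of a freeze.

```python
import os
import time


def seconds_since_last_write(path: str) -> float:
    """Return how many seconds have passed since the file was last modified."""
    return time.time() - os.path.getmtime(path)


def looks_stalled(path: str, threshold_seconds: float = 300.0) -> bool:
    """Heuristic: report a stall if the log has not been modified within the threshold.

    Both the function name and the default threshold are assumptions made
    for illustration; tune the threshold to the expected gap between requests.
    """
    return seconds_since_last_write(path) > threshold_seconds
```

For example, calling looks_stalled("output/stderr.log") periodically from inside the main container would flag the point at which the log stops advancing.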

ruivieira added the kind/bug label on Nov 7, 2024
ruivieira added this to the LM-Eval milestone on Nov 7, 2024
ruivieira added the lm-eval label on Nov 7, 2024
ruivieira linked a pull request on Nov 11, 2024 that will close this issue
github-project-automation bot moved this from In Progress to Done in TrustyAI planning on Nov 12, 2024