Occasional LM-Eval job freeze #354

ruivieira · 2024-11-07T18:12:01Z

When using the following deployment, specifically with the Phi-3 model, sometimes the job freezes after a few requests (e.g. 2% with mmlu).

Performing tail -f output/stderr.log from inside the main container confirms this to be the case.

However, if the exact same lm_eval invocation (as created by the driver) is manually started from the pod, it does not appear to freeze.

The text was updated successfully, but these errors were encountered:

ruivieira added the kind/bug Something isn't working label Nov 7, 2024

ruivieira added this to the LM-Eval milestone Nov 7, 2024

ruivieira added this to TrustyAI planning Nov 7, 2024

github-project-automation bot moved this to Todo in TrustyAI planning Nov 7, 2024

ruivieira added the lm-eval Issues related to LM-Eval label Nov 7, 2024

ruivieira linked a pull request Nov 11, 2024 that will close this issue

fix(lmeval): Send Job logs to Pod StdOut #358

Merged

ruivieira closed this as completed in #358 Nov 12, 2024

github-project-automation bot moved this from In Progress to Done in TrustyAI planning Nov 12, 2024

Provide feedback