Bug description
I use an MLflow logger together with the SLURM plugin. At the end of the SLURM job, the MLflow logger raises an exception because it passes a lowercase status string where MLflow expects an upper-case one.
Error messages and logs
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/signal_connector.py", line 33, in __call__
    signal_handler(signum, frame)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/signal_connector.py", line 74, in slurm_sigusr_handler_fn
    logger.finalize("finished")
  File "/opt/conda/lib/python3.7/site-packages/lightning_utilities/core/rank_zero.py", line 24, in wrapped_fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loggers/mlflow.py", line 265, in finalize
    self.experiment.set_terminated(self.run_id, status)
  File "/opt/conda/lib/python3.7/site-packages/mlflow/tracking/client.py", line 1647, in set_terminated
    self._tracking_client.set_terminated(run_id, status, end_time)
  File "/opt/conda/lib/python3.7/site-packages/mlflow/tracking/_tracking_service/client.py", line 508, in set_terminated
    run_status=RunStatus.from_string(status),
  File "/opt/conda/lib/python3.7/site-packages/mlflow/entities/run_status.py", line 22, in from_string
    "status strings: %s" % (status_str, list(RunStatus._STRING_TO_STATUS.keys()))
Exception: Could not get run status corresponding to string finished. Valid run status strings: ['RUNNING', 'SCHEDULED', 'FINISHED', 'FAILED', 'KILLED']
srun: error: cluster: task 0: Exited with exit code 1
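The exception is consistent with a case-sensitive lookup of the status string. The snippet below is a minimal stand-in to illustrate that behaviour, not MLflow's actual code: `VALID_STATUSES` is copied from the error message, and `lookup_status` is a hypothetical helper introduced only for this illustration.

```python
# Valid status strings, copied from the error message in the traceback.
VALID_STATUSES = ["RUNNING", "SCHEDULED", "FINISHED", "FAILED", "KILLED"]


def lookup_status(status_str: str) -> str:
    """Hypothetical stand-in for a case-sensitive status lookup."""
    if status_str not in VALID_STATUSES:
        raise ValueError(
            "Could not get run status corresponding to string %s. "
            "Valid run status strings: %s" % (status_str, VALID_STATUSES)
        )
    return status_str


# The lowercase string is rejected, the upper-case one is accepted.
try:
    lookup_status("finished")
    lowercase_accepted = True
except ValueError:
    lowercase_accepted = False

uppercase_result = lookup_status("FINISHED")
```

With an exact-match lookup like this, "finished" and "FINISHED" are different keys, which matches what the traceback reports.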
Environment
Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
More info
When I look at the signal function:
    def slurm_sigusr_handler_fn(self, signum: _SIGNUM, frame: FrameType) -> None:
        rank_zero_info("handling auto-requeue signal")

        # save logger to make sure we get all the metrics
        for logger in self.trainer.loggers:
            logger.finalize("finished")
The problem seems to be that "finished" is passed in lowercase, while MLflow only accepts upper-case status strings such as "FINISHED".
I can submit a simple change to the MLflow logger that converts the status to upper case before passing it on.
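A minimal sketch of what such a change could look like, assuming the fix is to normalize the status inside the logger's `finalize` before it reaches the MLflow client. `to_mlflow_status`, `FakeClient`, and `SketchLogger` are hypothetical names used only for illustration. They are not part of PyTorch Lightning or MLflow, and the real logger has more responsibilities than shown here.

```python
from typing import List, Tuple

# Valid status strings, copied from the MLflow error message.
VALID_STATUSES = ("RUNNING", "SCHEDULED", "FINISHED", "FAILED", "KILLED")


def to_mlflow_status(status: str) -> str:
    """Hypothetical helper: upper-case a status and check that it is valid."""
    normalized = status.strip().upper()
    if normalized not in VALID_STATUSES:
        raise ValueError(f"Unsupported run status: {status!r}")
    return normalized


class FakeClient:
    """Stand-in for the tracking client; it only records the calls it receives."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str]] = []

    def set_terminated(self, run_id: str, status: str) -> None:
        self.calls.append((run_id, status))


class SketchLogger:
    """Hypothetical logger showing where the normalization would happen."""

    def __init__(self, client: FakeClient, run_id: str) -> None:
        self.client = client
        self.run_id = run_id

    def finalize(self, status: str) -> None:
        # Normalize here so callers can keep passing "finished".
        self.client.set_terminated(self.run_id, to_mlflow_status(status))


client = FakeClient()
SketchLogger(client, run_id="run-123").finalize("finished")
```

Normalizing inside the logger keeps the signal handler unchanged and leaves other loggers, which may expect the lowercase string, unaffected.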