Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SLURM signal and mlflow logger not compatible #16332

Closed
Forbu opened this issue Jan 11, 2023 · 2 comments · Fixed by #16340
Closed

SLURM signal and mlflow logger not compatible #16332

Forbu opened this issue Jan 11, 2023 · 2 comments · Fixed by #16340
Labels
bug Something isn't working logger: mlflow

Comments

@Forbu
Copy link
Contributor

Forbu commented Jan 11, 2023

Bug description

I have a mlflow logger coupled with a slurm plugin and at the end of the slurm job I have a bug in the mlflow logger due to a bad string that should be in upper case.

How to reproduce the bug

mlf_logger = MLFlowLogger(experiment_name="XXXXX", tracking_uri=PATH_MLFLOW)

    if MODE == "training":
        modele.train()
        trainer = pl.Trainer(max_time={"hours": NB_HOURS}, logger=mlf_logger,
                                ... , plugins=[SLURMEnvironment(requeue_signal=signal.SIGUSR1)])

Error messages and logs

File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/signal_connector.py", line 33, in __call__ signal_handler(signum, frame)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/signal_connector.py", line 74, in slurm_sigusr_handler_fn logger.finalize("finished")  
File "/opt/conda/lib/python3.7/site-packages/lightning_utilities/core/rank_zero.py", line 24, in wrapped_fn
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loggers/mlflow.py", line 265, in finalize 
self.experiment.set_terminated(self.run_id, status)
File "/opt/conda/lib/python3.7/site-packages/mlflow/tracking/client.py", line 1647, in set_terminated self._tracking_client.set_terminated(run_id, status, end_time)
File "/opt/conda/lib/python3.7/site-packages/mlflow/tracking/_tracking_service/client.py", line 508, in set_terminated    run_status=RunStatus.from_string(status),
File "/opt/conda/lib/python3.7/site-packages/mlflow/entities/run_status.py", line 22, in from_string "status strings: %s" % (status_str, list(RunStatus._STRING_TO_STATUS.keys()))  
Exception: Could not get run status corresponding to string finished. Valid run status strings: ['RUNNING', 'SCHEDULED', 'FINISHED', 'FAILED', 'KILLED']                            
srun: error: cluster: task 0: Exited with exit code 1     

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

When I look at the signal function:

    def slurm_sigusr_handler_fn(self, signum: _SIGNUM, frame: FrameType) -> None:
        rank_zero_info("handling auto-requeue signal")

        # save logger to make sure we get all the metrics
        for logger in self.trainer.loggers:
            logger.finalize("finished")

The problem seems to be that "finished" is in lowcase but mlflow need an upper case ...
I can push a simple modification in the mlflow logger to change the "finished" in upper case.

@Forbu Forbu added the needs triage Waiting to be triaged by maintainers label Jan 11, 2023
@Forbu
Copy link
Contributor Author

Forbu commented Jan 11, 2023

My idea was to simply modify the finalized method of class MLFlowLogger(Logger) with something like that :

 def finalize(self, status: str = "success") -> None:
        if not self._initialized:
            return
        if status == "success":
            status = "FINISHED"
        elif status == "failed":
            status = "FAILED"
        elif status == "finished":
            status = "FINISHED"

@Forbu Forbu changed the title SLURM signal SLURM signal and mlflow logger not compatible Jan 11, 2023
@awaelchli
Copy link
Contributor

@Forbu Your proposal sounds reasonable to me! Would you like to send a PR for this?

@awaelchli awaelchli added bug Something isn't working logger: mlflow and removed needs triage Waiting to be triaged by maintainers labels Jan 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working logger: mlflow
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants