
[tune] [ray] _stop in Trainable not running to completion #7963

Closed
VishDev12 opened this issue Apr 10, 2020 · 3 comments · Fixed by #7967
Labels
bug Something that is supposed to be working; but isn't

Comments

@VishDev12
Contributor

VishDev12 commented Apr 10, 2020

Ray version and other system information (Python version, TensorFlow version, OS):
Ray 0.8.4
Python 3.6.9

What is the problem?

We're using _stop in our Trainable class to upload weights files and call other end-of-training functions.

When uploading a file using boto3's S3Transfer or using subprocess.Popen("aws s3 cp source target", shell=True), the process exits too soon and the upload never completes.

Here's the rough flow:
_stop() -> our_function() -> upload_function() -> exit

As soon as upload_function hands work off to a different subprocess, the code seems to exit and the remaining lines of _stop are left unexecuted.
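For illustration, here's a minimal sketch of the kind of Trainable we're running (the weights path, bucket name, and upload command are hypothetical stand-ins, not our actual code):

import subprocess
from ray.tune import Trainable

class MyTrainable(Trainable):
    def _setup(self, config):
        self.weights_path = "/tmp/weights.h5"

    def _train(self):
        # ... one training iteration ...
        return {"mean_loss": 0.0}

    def _stop(self):
        # End-of-training cleanup; execution cuts out partway through here.
        proc = subprocess.Popen(
            "aws s3 cp {} s3://my-bucket/weights.h5".format(self.weights_path),
            shell=True,
        )
        proc.wait()
        print("upload finished")  # never reached when the actor exits early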

Observations:

  1. The code seems to exit exactly at the point where a different subprocess is called:
    a. future.result() in boto3.s3.transfer.S3Transfer.upload_file.
    b. After calling subprocess.Popen().
  2. If local_mode=True is passed into ray.init, there are no problems and the code in _stop executes fully.
  3. When a single trial is run, this error shows up every single time without fail.
  4. For larger experiments with n trials, (n-1) trials have no issue, but on the final trial the code exits before _stop completes.

Tested Fixes:

  1. [crude] Calling time.sleep just after tune.run so that _stop has time to finish executing.
  2. In _stop_trial of RayTrialExecutor, keep the object ID returned by the trial.runner.stop.remote() call and block on it with ray.get(object_id). This works perfectly:
object_id = trial.runner.stop.remote()   # keep the ObjectID instead of fire-and-forget
ray.get(object_id)                       # block until the actor's stop() (and therefore _stop) has finished
trial.runner.__ray_terminate__.remote()
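For context, here's a standalone sketch (a toy actor, not the actual Tune executor code) of why the ray.get call makes the difference: ray.get on the ObjectID returned by an actor method blocks until that method has run to completion.

import time
import ray

ray.init()

@ray.remote
class Worker:
    def stop(self):
        time.sleep(5)  # stands in for a slow upload
        return "cleanup done"

w = Worker.remote()
obj_id = w.stop.remote()  # schedules stop() asynchronously on the actor
ray.get(obj_id)           # blocks until stop() has run to completion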

Question:
While fix 2 above seems to work fine, there's no visibility into the output of _stop, since it seems to run in the background and produces no logs. What would be a better way to resolve this?

Note:
I wasn't able to provide reproduction code that doesn't depend on external libraries like boto3 or the AWS CLI.

cc: @richardliaw

VishDev12 added the bug label on Apr 10, 2020
VishDev12 changed the title from "_stop in Trainable not running to completion" to "[tune] [ray] _stop in Trainable not running to completion" on Apr 10, 2020
@richardliaw
Contributor

richardliaw commented Apr 10, 2020

@VishDev12 this is super helpful; I'm pretty sure I know what the problem is.

I'll push a fix soon; for now, one thing you can do is just check that all of the resources have been released:

import time
import ray
from ray import tune

tune.run(...)  # after the experiment returns

def all_resources_released():
    available = ray.available_resources()
    # cluster_resources() and available_resources() both return {resource_name: quantity} dicts
    for resource, value in ray.cluster_resources().items():
        if available.get(resource, 0) != value:
            return False
    return True

# Poll until every trial actor has exited and returned its resources to the cluster.
while not all_resources_released():
    time.sleep(1)

richardliaw self-assigned this on Apr 10, 2020
richardliaw mentioned this issue on Apr 10, 2020
@richardliaw
Contributor

Opened a PR to fix this (#7967).

@VishDev12
Contributor Author

Thanks @richardliaw, I'll implement the resource-checking workaround for now and wait for the fix in an upcoming release.
