
[tune] [ray] _stop in Trainable not running to completion #7963

Closed
VishDev12 opened this issue Apr 10, 2020 · 3 comments · Fixed by #7967
Labels
bug Something that is supposed to be working; but isn't

Comments

@VishDev12
Contributor

VishDev12 commented Apr 10, 2020

Ray version and other system information (Python version, TensorFlow version, OS):
Ray 0.8.4
Python 3.6.9

What is the problem?

We're using _stop in our Trainable class to upload weights files and call other end-of-training functions.

When uploading a file using boto3's S3Transfer or using subprocess.Popen("aws s3 cp source target", shell=True), the process exits too soon and the upload never completes.

Here's the rough flow:
_stop() -> our_function() -> upload_function() -> exit

As soon as upload_function hands work off to a different subprocess, the code seems to exit and the remaining lines of _stop are left unexecuted.
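For illustration, here's a minimal sketch of the kind of Trainable we're running (the weights path, bucket name, and upload command are hypothetical stand-ins, not our actual code):

import subprocess
from ray.tune import Trainable

class MyTrainable(Trainable):
    def _setup(self, config):
        self.weights_path = "/tmp/weights.h5"

    def _train(self):
        # ... one training iteration ...
        return {"mean_loss": 0.0}

    def _stop(self):
        # End-of-training cleanup; execution cuts out partway through here.
        proc = subprocess.Popen(
            "aws s3 cp {} s3://my-bucket/weights.h5".format(self.weights_path),
            shell=True,
        )
        proc.wait()
        print("upload finished")  # never reached when the actor exits early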

Observations:

  1. The code seems to exit exactly at the point where a different subprocess is called:
    a. future.result() in boto3.s3.transfer.S3Transfer.upload_file.
    b. After calling subprocess.Popen().
  2. If local_mode=True is passed into ray.init, there are no problems and the code in _stop executes fully.
  3. When a single trial is run, this error shows up every single time without fail.
  4. For larger experiments with n trials, (n-1) trials have no issue, but on the final trial the code exits before _stop completes.

Tested Fixes:

  1. [crude] Calling time.sleep just after tune.run so that _stop has time to finish executing.
  2. In _stop_trial of RayTrialExecutor, keep the object ID returned by the trial.runner.stop.remote() call and block on it with ray.get(object_id). This works perfectly:
object_id = trial.runner.stop.remote()   # keep the ObjectID instead of fire-and-forget
ray.get(object_id)                       # block until the actor's stop() (and therefore _stop) has finished
trial.runner.__ray_terminate__.remote()
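For context, here's a standalone sketch (a toy actor, not the actual Tune executor code) of why the ray.get call makes the difference: ray.get on the ObjectID returned by an actor method blocks until that method has run to completion.

import time
import ray

ray.init()

@ray.remote
class Worker:
    def stop(self):
        time.sleep(5)  # stands in for a slow upload
        return "cleanup done"

w = Worker.remote()
obj_id = w.stop.remote()  # schedules stop() asynchronously on the actor
ray.get(obj_id)           # blocks until stop() has run to completion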

Question:
While fix 2 above seems to work fine, there's no visibility into the output of _stop, since it seems to run in the background and produces no logs. What would be a better way to resolve this?

Note:
I wasn't able to provide reproduction code that doesn't depend on external libraries like boto3 or the AWS CLI.

cc: @richardliaw

VishDev12 added the bug label on Apr 10, 2020
VishDev12 changed the title from "_stop in Trainable not running to completion" to "[tune] [ray] _stop in Trainable not running to completion" on Apr 10, 2020
@richardliaw
Contributor

richardliaw commented Apr 10, 2020

@VishDev12 this is super helpful; I'm pretty sure I know what the problem is.

I'll push a fix soon; for now, one thing you can do is just check that all of the resources have been released:

import time
import ray
from ray import tune

tune.run(...)  # after the experiment returns

def all_resources_released():
    available = ray.available_resources()
    # cluster_resources() and available_resources() both return {resource_name: quantity} dicts
    for resource, value in ray.cluster_resources().items():
        if available.get(resource, 0) != value:
            return False
    return True

# Poll until every trial actor has exited and returned its resources to the cluster.
while not all_resources_released():
    time.sleep(1)

richardliaw self-assigned this on Apr 10, 2020
richardliaw mentioned this issue on Apr 10, 2020
@richardliaw
Contributor

Opened a PR to fix this (#7967).

@VishDev12
Contributor Author

Thanks @richardliaw, I'll implement the resource-checking workaround for now and wait for the fix in an upcoming release.
