Ray version and other system information (Python version, TensorFlow version, OS):
ray 0.8.4
python 3.6.9
What is the problem?
We're using _stop in our Trainable class to upload the weights files and call other end-of-training functions.
When uploading a file using boto3's S3Transfer or using subprocess.Popen("aws s3 cp source target", shell=True), the code exits too soon and doesn't complete.
So, as soon as the upload_function makes a call to a different subprocess, the code seems to exit and the remaining lines of code in _stop are left unexecuted.
Observations:
1. The code seems to exit exactly at the point where a different subprocess is called:
   a. future.result() in boto3.s3.transfer.S3Transfer.upload_file.
   b. After calling subprocess.Popen().
2. If local_mode=True is passed into ray.init, there are no problems and the code in _stop executes fully.
3. When a single trial is run, this error occurs every single time.
4. For larger experiments with n trials, (n-1) trials have no issue, but for the final trial, the code exits before _stop completes.
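One detail worth ruling out for observation 1b: subprocess.Popen returns as soon as the child process is started, so anything that tears down the parent afterwards can cut the upload short. Below is a minimal sketch of a blocking alternative. The helper name run_upload and the timeout value are made up for illustration, and a short Python one-liner stands in for the real "aws s3 cp source target" command so the snippet runs without AWS credentials. Note that this only makes _stop itself wait for the upload; it does not by itself keep the Ray worker alive until _stop returns.

```python
import subprocess
import sys


def run_upload(cmd, timeout=600):
    """Run an upload command and return only after it has exited.

    Raises subprocess.CalledProcessError on a non-zero exit code and
    subprocess.TimeoutExpired if the command runs longer than `timeout`.
    (Hypothetical helper, not part of Ray or boto3.)
    """
    completed = subprocess.run(
        cmd,
        check=True,                  # fail loudly instead of silently continuing
        timeout=timeout,
        stdout=subprocess.PIPE,      # capture output so it can be logged
        stderr=subprocess.STDOUT,
        universal_newlines=True,     # decode output as text (works on Python 3.6)
    )
    return completed.stdout


# Stand-in for ["aws", "s3", "cp", source, target].
fake_upload = [sys.executable, "-c", "import time; time.sleep(0.2); print('uploaded')"]
output = run_upload(fake_upload)
print(output.strip())
```

Passing the command as a list avoids shell=True, and the captured output can be written wherever the rest of _stop reports its progress.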
Tested Fixes:
1. [crude] Calling time.sleep just after tune.run so that the _stop function has time to finish its execution.
2. In _stop_trial of RayTrialExecutor, capture the object ID returned by the trial.runner.stop.remote() call and pass it to ray.get(object_id). This works perfectly.
Question:
While fix 2 above seems to work fine, there's no visibility for any outputs from _stop since it seems to run in the background and produces no logs. What could be a better way to resolve this?
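As a possible stopgap for the visibility problem, the code called from _stop could write its progress to a file of its own instead of relying on the worker's stdout being relayed. The sketch below assumes this is acceptable in the setup described; the logger name, the helper names, and the log path are all hypothetical, and the upload is passed in as a plain callable so the snippet runs without boto3 or the AWS CLI.

```python
import logging
import os
import tempfile


def make_stop_logger(log_path):
    """Return a logger that appends to log_path (hypothetical helper)."""
    logger = logging.getLogger("trainable_stop")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger
    if not logger.handlers:   # avoid adding duplicate handlers on repeat calls
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def stop_with_logging(upload, logger):
    """Run the end-of-training upload and record each step in the log file."""
    logger.info("_stop: upload starting")
    try:
        upload()
        logger.info("_stop: upload finished")
    except Exception:
        logger.exception("_stop: upload failed")
        raise


# Usage with a no-op stand-in for the real upload function.
log_path = os.path.join(tempfile.mkdtemp(), "stop.log")
logger = make_stop_logger(log_path)
stop_with_logging(lambda: None, logger)
with open(log_path) as f:
    log_text = f.read()
print(log_text)
```

If the "upload finished" line is missing from the file after a run, that pinpoints the process being killed mid-upload rather than the output merely being hidden.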
Note:
Wasn't able to provide code to reproduce this issue without the use of external libraries like boto3 or the AWS CLI.
VishDev12 changed the title from "_stop in Trainable not running to completion [tune] [ray]" to "[tune] [ray] _stop in Trainable not running to completion" on Apr 10, 2020
Here's the rough flow:
_stop() -> our_function() -> upload_function() -> exit
cc: @richardliaw