-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Attach traceback to Exception & Test disatpching process #1036
Conversation
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this looks good to me! I ran the code locally to confirm that it works and the message looks right. It worked.
One thing is that it took awhile to raise the exception (and eventually shutdown). Any reason why?
) | ||
return datapipe | ||
return [datapipe] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: In what cases will users intentionally want to set num_instances=1
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the case when num_workers=1
with dispatching process.
) | ||
return process, req_queue, res_queue | ||
|
||
|
||
def CreateThreadForDataPipeline(datapipe): | ||
def CreateThreadForDataPipeline(datapipe, name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to ask before: if this function is used anywhere currently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is any usage for sure. We can remove it in the future since we might rely on asyncio
to achieve threading.
self.assertTrue("Caught _CustomException in dispatching process" in exc_msg) | ||
self.assertTrue("Original Traceback" in exc_msg) | ||
self.assertTrue("_CustomException: oops" in exc_msg) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Maybe check if DL2 is shutdown properly? Is that possible to test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, DL2 is not properly shutdown for now because we only raise Error in the main process. We haven't handled to shutdown other worker processes when on process has Error. This probably the same reason that it takes longer time to shutdown.
This is tracked in #969
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: Partially fixes pytorch#969 ### Changes - Add `ExceptionWrapper` to attach traceback to the Exception - Reason: traceback is unserializable. So, it has to be passed by string - In order to provide informative Error message, pass name for each process like `dispatching process` and `worker process <id>`. - Add tests to validate Error propagation from the dispatching process - parametrize the tests - Fix a bug for `round_robin_demux` to return a list of DataPipe rather than a single DataPipe when `num_of_instances` is 1. Pull Request resolved: pytorch#1036 Reviewed By: NivekT Differential Revision: D43472709 Pulled By: ejguan fbshipit-source-id: e5c9e581ca881f523fb568b6f46bf16ecfc243d2
Summary: Partially fixes #969 ### Changes - Add `ExceptionWrapper` to attach traceback to the Exception - Reason: traceback is unserializable. So, it has to be passed by string - In order to provide informative Error message, pass name for each process like `dispatching process` and `worker process <id>`. - Add tests to validate Error propagation from the dispatching process - parametrize the tests - Fix a bug for `round_robin_demux` to return a list of DataPipe rather than a single DataPipe when `num_of_instances` is 1. Pull Request resolved: #1036 Reviewed By: NivekT Differential Revision: D43472709 Pulled By: ejguan fbshipit-source-id: e5c9e581ca881f523fb568b6f46bf16ecfc243d2
Partially fixes #969
Changes
ExceptionWrapper
to attach traceback to the Exceptiondispatching process
andworker process <id>
.round_robin_demux
to return a list of DataPipe rather than a single DataPipe whennum_of_instances
is 1.