Training hangs at the end of the first epoch when using a PyDataset and workers > 1. #20425
Comments
Thanks for the report. This issue appears to have been introduced in fd8bbe2. @hertschuh can you take a look? I started debugging it, and here's my reading: the following code

```python
except queue.Empty:
    pass
```

is reached and leads to an infinite loop. That's because we never get to the exit condition:

```python
if i >= num_batches - 1:
    self.enqueuer.stop()
    return
```

which is because …
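For illustration, here is a minimal, self-contained sketch of that consumption pattern; the names (`iter_batches`, `output_queue`) are illustrative and this is not the actual Keras enqueuer code:

```python
import queue

def iter_batches(output_queue, num_batches):
    # Illustrative consumer: yields batches from a queue and stops only
    # after num_batches of them have been seen.
    i = 0
    while True:
        try:
            batch = output_queue.get(timeout=0.1)
        except queue.Empty:
            # Nothing arrived yet; loop around and keep waiting. If the
            # missing batch is never produced (e.g. a worker has already
            # shut down), this branch repeats forever and the exit
            # condition below is never reached -- the observed hang.
            continue
        yield batch
        if i >= num_batches - 1:
            return  # normal exit: all expected batches were consumed
        i += 1
```

If the producer enqueues only some of the expected batches, iterating this generator never terminates, which matches the hang at the epoch boundary.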
I added a workaround at HEAD to continue training when the issue occurs. It's not a definitive solution but it should help.
The issue was caused by the iterator not being fully consumed, so `on_epoch_end` was never called. Added an exception to catch this situation in the future, and a unit test that exercises `model.fit()` with all the combinations of data adapters.
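As a rough illustration of the guard described above (a sketch under assumed names, not the actual Keras change), one can track how many batches the epoch iterator actually yields and raise if it stops early, so a silent hang or a skipped `on_epoch_end` becomes a loud error:

```python
def checked_epoch(batches, expected_num_batches, on_epoch_end):
    # Illustrative wrapper: yields every batch, raises if the epoch
    # iterator stops before all expected batches were consumed, and only
    # then triggers the end-of-epoch callback.
    seen = 0
    for batch in batches:
        yield batch
        seen += 1
    if seen < expected_num_batches:
        raise ValueError(
            f"Epoch iterator stopped after {seen} of {expected_num_batches} "
            "batches; the dataset was not fully consumed."
        )
    on_epoch_end()
```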
This should now be fixed at HEAD.
Training using a PyDataset and workers > 1 will hang at the end of the first epoch with Keras 3.6. This issue does not seem to occur with Keras 3.5.
Example Code
Here is a slightly modified version of https://keras.io/examples/vision/mnist_convnet/ to reproduce the issue.
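The snippet itself is not reproduced above; a minimal sketch of the kind of setup that triggers the hang (the class name, model, and hyperparameters below are illustrative, not the reporter's exact code) is:

```python
import numpy as np
import keras

class MNISTPyDataset(keras.utils.PyDataset):
    # Wraps in-memory arrays so that model.fit() goes through the
    # PyDataset adapter with multiple workers.
    def __init__(self, x, y, batch_size=128, **kwargs):
        super().__init__(**kwargs)
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[sl], self.y[sl]

(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.astype("float32")[..., None] / 255.0
y_train = keras.utils.to_categorical(y_train, 10)

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.MaxPooling2D(2),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# workers > 1 is the condition under which the hang appears at the end
# of the first epoch with Keras 3.6.
train_ds = MNISTPyDataset(x_train, y_train, workers=4)
model.fit(train_ds, epochs=5)
```

Per the report, the same setup runs to completion under Keras 3.5.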
Traceback
Here is the traceback I receive when interrupting the process.