DataLoader2 Memory Behavior is very strange on Epoch Resets #1185
Comments
@ejguan: Do you have any suggestions for properly resetting DataLoader2 after each epoch? With e.g.
Hello, I have also run into DL2 memory usage skyrocketing, and have temporarily decided to switch back to DL1. May I ask how to set up a datapipe with DL1 for multi-process, multi-GPU training? Do I need to set up distributed sampling in DL1?
@Adenialzz: To get what I showed above, it's more or less the same setup as for a torch Dataset. Replace the dataset with a datapipe.
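A minimal sketch of that substitution, assuming torchdata's IterableWrapper as a stand-in for the real pipeline (the actual pipe is not shown in this thread):

```python
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

# Hypothetical stand-in pipeline; the real datapipe is not shown in the thread.
pipe = IterableWrapper(range(10_000)).shuffle().sharding_filter()

# Same DataLoader call as with a map-style Dataset, just passing the datapipe.
loader = DataLoader(pipe, batch_size=32, num_workers=4)

for batch in loader:
    pass
```

For the distributed part, attaching a DistributedSampler is the map-style route, and it requires the dataset to expose a length, which is what the next reply runs into.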
This DistributedSampler requires that my dataset (datapipe) have a __len__ method, but the length of my datapipe cannot be computed because it is an iterable datapipe. Have you run into a problem like this?
I'll give it a try today.
Thanks, please let me know when you make progress.
Sorry for the delay @Adenialzz. You are correct that it doesn't work with DDP when the iterable datapipe has no length. I reverted to DL2 despite its notably slower performance, since the slowdown only really occurs at the start of each epoch.
set
@Adenialzz Hi, could I ask you for a clarification? How was it used, and which problem did it fix exactly? I'd appreciate it very much.
🐛 Describe the bug
Memory increases at the start of each iteration (epoch) after the first
I have been trying to use DataLoader2 with multiprocessing (and distributed in some cases). In general, its behavior is quite strange relative to the original data loader implementation (which I'll call DataLoader1 below). It seems that after an epoch (one full iteration) completes, the dataloader holds onto all of its data state instead of resetting, so memory usage increases from the train epoch to the validation epoch.
More problematic still, when the next epoch starts, the previous epoch's state appears to still be held, causing memory usage to spike. I imagine this causes some of the many recently reported memory errors, and it certainly did in my case when training with DDP. DataLoader1 has none of these issues.
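For concreteness, the usage pattern being described is roughly the following (a minimal sketch assuming torchdata 0.6-style imports and a toy pipe in place of the real one):

```python
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torchdata.datapipes.iter import IterableWrapper

# Toy stand-in pipeline, sharded across the reading service's workers.
pipe = IterableWrapper(range(100_000)).shuffle().sharding_filter()

rs = MultiProcessingReadingService(num_workers=4)
dl = DataLoader2(pipe, reading_service=rs)

for epoch in range(3):
    for batch in dl:
        pass  # memory reportedly grows at the start of every epoch after the first

dl.shutdown()
```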
I tested with a relatively complicated datapipe, using multiplexing, several intermediate one-to-many yielding stages, and producing a pair (audio: tensor, metadata: dict).
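A toy stand-in for that kind of pipeline, purely illustrative (the sources, shapes, and metadata fields are assumptions, not the original pipe):

```python
import torch
from torchdata.datapipes.iter import IterableWrapper


def repeat_twice(x):
    # One-to-many yielding stage: each input produces two outputs.
    return [x, x]


def to_sample(x):
    # (audio: tensor, metadata: dict), as described above; the values are made up.
    return torch.randn(16_000), {"id": x, "label": x % 10}


source_a = IterableWrapper(range(0, 1_000, 2))
source_b = IterableWrapper(range(1, 1_000, 2))

pipe = (
    source_a.mux(source_b)   # multiplexing two sources
    .flatmap(repeat_twice)   # intermediate 1:many stage
    .map(to_sample)
)
```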
I saw a recent post claiming that dictionaries were the issue. From what I have seen, it is the reading service rather than the dictionaries.
DataLoader2 compared with torch.utils.data.DataLoader (DataLoader1)
Here is the code that I used to produce the results below.
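The original script is not reproduced in this copy; a minimal sketch of that kind of comparison (psutil for RSS measurement and a toy pipe are both assumptions) would be:

```python
import os

import psutil  # assumed here for RSS measurement
import torch
from torch.utils.data import DataLoader
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torchdata.datapipes.iter import IterableWrapper


def rss_mb() -> float:
    """Resident set size of the main process, in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1e6


def to_sample(x):
    return torch.randn(1_024), {"id": x}


def make_pipe():
    # Toy stand-in for the real audio pipeline.
    return IterableWrapper(range(50_000)).shuffle().sharding_filter().map(to_sample)


def run_epochs(loader, name, epochs=3):
    for epoch in range(epochs):
        for _ in loader:
            pass
        print(f"{name} epoch {epoch}: {rss_mb():.1f} MB")


if __name__ == "__main__":
    # DataLoader1 over the datapipe.
    dl1 = DataLoader(make_pipe(), batch_size=None, num_workers=4)
    run_epochs(dl1, "DL1")

    # DataLoader2 with the multiprocessing reading service.
    dl2 = DataLoader2(make_pipe(), reading_service=MultiProcessingReadingService(num_workers=4))
    run_epochs(dl2, "DL2")
    dl2.shutdown()
```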
Results:
Memory usage graph:
Memory Usage:
Attempts to get the resetting behavior of DL1
I studied the internal state variables kept inside DataLoader2 and the reading service.
In the reading service, there are per-worker pipes that stick around after the epoch ends.
I was able to effectively eliminate the epoch-start delay by resetting all of these values back to their initial values. However, this amounted to creating a whole new dataloader and doubled the memory usage (attached image).
Q: Is there a way to embed the reset behavior into the 'worker_reset_fn' variable of the reading service without causing the memory increase?
Are there other recommendations for hard-resetting the data loader every epoch? Compared to DL1, it is much less efficient to keep that memory around, and a hard reset briefly holds two dataloaders' worth of RAM. It also nearly triples the per-epoch startup time of my jobs before they proceed as normal.
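The brute-force version of that hard reset, tearing down and rebuilding DataLoader2 each epoch, would look roughly like this sketch; it pays the startup cost every epoch, as described above:

```python
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torchdata.datapipes.iter import IterableWrapper

pipe = IterableWrapper(range(10_000)).shuffle().sharding_filter()

for epoch in range(5):
    # Rebuild the reading service and loader so no worker state survives the epoch.
    rs = MultiProcessingReadingService(num_workers=4)
    dl = DataLoader2(pipe, reading_service=rs)
    for batch in dl:
        pass
    # Shut the workers down before building the next epoch's loader.
    dl.shutdown()
```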
I left my original comment here: #1150
Small note about datapipes: the problem isolates to the reading service
Datapipe performance is very consistent across iterator resets. This may already be clear from the DL1 results, but I ran the test, so I'm showing it here:
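The kind of check meant here, sketched with an assumed toy pipe and simple wall-clock timing (not the original test):

```python
import time

from torchdata.datapipes.iter import IterableWrapper

pipe = IterableWrapper(range(200_000)).shuffle()

for run in range(3):
    start = time.perf_counter()
    for _ in pipe:  # a fresh iterator per pass, no reading service involved
        pass
    print(f"pass {run}: {time.perf_counter() - start:.2f}s")
```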
Versions
EnvInfo.txt