OOM issues with 3D FCMAE fine-tuning #201

Open
edyoshikun opened this issue Nov 5, 2024 · 2 comments
Labels
bug (Something isn't working) · translation (Image translation (VS))

Comments

@edyoshikun (Contributor)

Currently, if we use DDP with the FCMAE model for fine-tuning on the virtual staining tasks, there appears to be a 'memory leak'. A possible fix is to expose the relevant DataLoader parameters on the ViscyTrainer.

Using PyTorch Lightning's CombinedLoader with Distributed Data Parallel (DDP) spawns multiple processes (one per GPU) and seems to lead to excessive memory accumulation in a subset of worker processes. Setting persistent_workers=False restarts the DataLoader workers at the beginning of each epoch, which prevents the accumulation of memory and disk space. There is a performance trade-off here, as there is with reducing the hardcoded prefetch factor from 4 to 2.
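A minimal sketch of exposing these loader knobs instead of hardcoding them. The class name SketchDataModule and its constructor parameters are illustrative only, not the actual viscy API; the point is just that persistent_workers and prefetch_factor become user-facing options.

```python
# Illustrative sketch only: not the viscy HCS datamodule, just the shape of
# the change (persistent_workers / prefetch_factor exposed as arguments).
from lightning.pytorch import LightningDataModule
from torch.utils.data import DataLoader


class SketchDataModule(LightningDataModule):
    def __init__(self, dataset, batch_size=4, num_workers=8,
                 persistent_workers=False, prefetch_factor=2):
        super().__init__()
        self.dataset = dataset
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.persistent_workers = persistent_workers
        self.prefetch_factor = prefetch_factor

    def train_dataloader(self):
        return DataLoader(
            self.dataset,
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=self.num_workers,
            # False -> workers are shut down and re-spawned every epoch,
            # trading some startup cost for bounded memory/disk growth.
            persistent_workers=self.persistent_workers,
            # prefetch_factor is only valid when num_workers > 0.
            prefetch_factor=self.prefetch_factor,
        )
```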

@edyoshikun (Contributor, Author)

Using prefetch_factor=4 vs. prefetch_factor=2 has no effect on training speed for the neuromast VS training; we are mostly limited by the CPU->GPU transfer bandwidth.
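A small timing sketch for checking this on a given machine, assuming a CUDA device is available. The dummy TensorDataset is a stand-in for 3D patches, not the real HCS data; only the relative timings between the two prefetch settings are of interest.

```python
# Hypothetical benchmark: compare loader throughput at prefetch_factor 2 vs 4.
# If the host-to-device copy dominates, the two timings should be similar.
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for 3D patches (N, C, Z, Y, X); replace with the real dataset.
dataset = TensorDataset(torch.randn(64, 1, 32, 128, 128))

for prefetch in (2, 4):
    loader = DataLoader(
        dataset,
        batch_size=4,
        num_workers=4,
        prefetch_factor=prefetch,
        persistent_workers=False,
    )
    start = time.perf_counter()
    for (batch,) in loader:
        batch = batch.cuda()  # blocking CPU->GPU copy dominates the loop
    torch.cuda.synchronize()
    print(f"prefetch_factor={prefetch}: {time.perf_counter() - start:.2f} s")
```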

@ziw-liu (Collaborator)

ziw-liu commented Nov 14, 2024

When I enable pinned memory in #195, I see this issue: pytorch/pytorch#97432. However, this is likely not related to the HCS datamodule, since that one does not use pinned memory.
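For reference, a minimal sketch of the configuration in question, assuming a CUDA device; the dataset here is a dummy stand-in, not the #195 setup. pytorch/pytorch#97432 concerns the pin-memory path in the DataLoader, so pin_memory=True is the knob to toggle when trying to reproduce.

```python
# Illustrative only: loader with pinned host memory enabled, as in #195.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 1, 32, 128, 128))  # dummy stand-in

loader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=8,
    pin_memory=True,          # page-locked host buffers enable async copies
    persistent_workers=False,
)
for (batch,) in loader:
    # non_blocking=True only has an effect when the source tensor is pinned.
    batch = batch.cuda(non_blocking=True)
```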
