OOM issues with 3D FCMAE fine-tuning #201

Open
edyoshikun opened this issue Nov 5, 2024 · 2 comments
Labels
bug (Something isn't working) · translation (Image translation (VS))

Comments

@edyoshikun (Contributor)

Currently, if we use DDP with the FCMAE model for fine-tuning on the virtual staining tasks, there appears to be a 'memory leak'. A possible fix is to expose the relevant DataLoader parameters on the ViscyTrainer.

Using PyTorch Lightning's CombinedLoader with Distributed Data Parallel (DDP) spawns multiple processes (one per GPU) and seems to lead to excessive memory accumulation in a subset of worker processes. Setting persistent_workers=False restarts the DataLoader workers at the beginning of each epoch, which prevents the accumulation of memory and disk space. There is a performance trade-off here, as there is with reducing the hardcoded prefetch factor from 4 to 2.
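A minimal sketch of exposing these loader knobs instead of hardcoding them. The class name SketchDataModule and its constructor parameters are illustrative only, not the actual viscy API; the point is just that persistent_workers and prefetch_factor become user-facing options.

```python
# Illustrative sketch only: not the viscy HCS datamodule, just the shape of
# the change (persistent_workers / prefetch_factor exposed as arguments).
from lightning.pytorch import LightningDataModule
from torch.utils.data import DataLoader


class SketchDataModule(LightningDataModule):
    def __init__(self, dataset, batch_size=4, num_workers=8,
                 persistent_workers=False, prefetch_factor=2):
        super().__init__()
        self.dataset = dataset
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.persistent_workers = persistent_workers
        self.prefetch_factor = prefetch_factor

    def train_dataloader(self):
        return DataLoader(
            self.dataset,
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=self.num_workers,
            # False -> workers are shut down and re-spawned every epoch,
            # trading some startup cost for bounded memory/disk growth.
            persistent_workers=self.persistent_workers,
            # prefetch_factor is only valid when num_workers > 0.
            prefetch_factor=self.prefetch_factor,
        )
```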

@edyoshikun (Contributor, Author)

Using prefetch_factor=4 vs. prefetch_factor=2 has no effect on training speed for the neuromast VS training; we are mostly limited by the CPU->GPU transfer bandwidth.
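A small timing sketch for checking this on a given machine, assuming a CUDA device is available. The dummy TensorDataset is a stand-in for 3D patches, not the real HCS data; only the relative timings between the two prefetch settings are of interest.

```python
# Hypothetical benchmark: compare loader throughput at prefetch_factor 2 vs 4.
# If the host-to-device copy dominates, the two timings should be similar.
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for 3D patches (N, C, Z, Y, X); replace with the real dataset.
dataset = TensorDataset(torch.randn(64, 1, 32, 128, 128))

for prefetch in (2, 4):
    loader = DataLoader(
        dataset,
        batch_size=4,
        num_workers=4,
        prefetch_factor=prefetch,
        persistent_workers=False,
    )
    start = time.perf_counter()
    for (batch,) in loader:
        batch = batch.cuda()  # blocking CPU->GPU copy dominates the loop
    torch.cuda.synchronize()
    print(f"prefetch_factor={prefetch}: {time.perf_counter() - start:.2f} s")
```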

@ziw-liu (Collaborator)

ziw-liu commented Nov 14, 2024

When I enable pinned memory in #195, I see this issue: pytorch/pytorch#97432. However, this is likely not related to the HCS datamodule, since that one does not use pinned memory.
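For reference, a minimal sketch of the configuration in question, assuming a CUDA device; the dataset here is a dummy stand-in, not the #195 setup. pytorch/pytorch#97432 concerns the pin-memory path in the DataLoader, so pin_memory=True is the knob to toggle when trying to reproduce.

```python
# Illustrative only: loader with pinned host memory enabled, as in #195.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 1, 32, 128, 128))  # dummy stand-in

loader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=8,
    pin_memory=True,          # page-locked host buffers enable async copies
    persistent_workers=False,
)
for (batch,) in loader:
    # non_blocking=True only has an effect when the source tensor is pinned.
    batch = batch.cuda(non_blocking=True)
```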
