
espaloma training with reweighting (train_sampler) is not well tested #10

Open
kntkb opened this issue Mar 16, 2024 · 1 comment
Labels: bug (Something isn't working)

kntkb commented Mar 16, 2024

A CUDA out-of-memory error is raised for tags <= 0.1.2:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 10.75 GiB total capacity; 9.76 GiB already allocated; 7.62 MiB free; 10.40 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
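
For reference, the allocator workaround mentioned in the message is set through an environment variable. A minimal sketch (the value 128 is only an assumed example, and it must be set before CUDA is initialized):

```python
import os
# Assumed example value; must be set before the first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
import torch  # imported after setting the variable so the allocator picks it up
```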

From tag 0.1.3, gradient accumulation is used to resolve this problem, but it has not been well tested.
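
For context, a minimal sketch of gradient accumulation in PyTorch (illustrative only, not espfit's actual training loop; the toy model, random data, and `accumulation_steps=4` are assumptions):

```python
import torch

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
accumulation_steps = 4  # assumed value; one optimizer step per 4 micro-batches

optimizer.zero_grad()
for step in range(16):
    x, y = torch.randn(2, 8), torch.randn(2, 1)   # small micro-batch keeps peak memory low
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accumulation_steps).backward()        # accumulate scaled gradients
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                          # update with the accumulated gradients
        optimizer.zero_grad()
```

The idea is that each micro-batch only needs enough GPU memory for its own forward/backward pass, while the optimizer step still sees gradients equivalent to the larger effective batch.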

kntkb commented Mar 16, 2024

Unexpected error from espfit/tests/test_app_train_sampler.py

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
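
This error typically means a loss tensor built on an already-freed graph is reused, e.g. when a gradient-accumulation or reweighting loop calls .backward() a second time without recomputing the loss. A minimal, hypothetical reproduction (not taken from espfit):

```python
import torch

x = torch.randn(3, requires_grad=True)

loss = (x ** 2).sum()
loss.backward()          # frees the saved intermediate tensors
try:
    loss.backward()      # second backward on the same graph -> RuntimeError
except RuntimeError as e:
    print(e)

# Fix 1: recompute the loss (fresh graph) for every backward call.
loss = (x ** 2).sum()
loss.backward()

# Fix 2: keep the graph alive explicitly if the same loss must be reused.
loss = (x ** 2).sum()
loss.backward(retain_graph=True)
loss.backward()
```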

kntkb added the bug label on Mar 16, 2024