Good practices of training #73

Open
DarioMarzella opened this issue Sep 28, 2022 · 0 comments
Labels
notes: Things to remember that do not necessarily cause an issue at the moment, but may do in the future

DarioMarzella commented Sep 28, 2022

A list of good practices to prevent issues that have occurred before:

  1. Check that the batch size is not too big. An oversized batch can overload GPU memory and fail with an out-of-memory allocation error such as:

     RuntimeError: CUDA out of memory. Tried to allocate 494.00 MiB (GPU 0; 39.44 GiB total capacity; 13.79 MiB already allocated; 168.62 MiB free; 22.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
     RuntimeError: CUDA error: out of memory

  2. Check that your batch size is smaller than the smallest cluster of test data you have; otherwise the loader will try to fetch more cases than are available and crash. See the batch-size sketch after this list.

  3. Did you get a "ValueError: No avaiable training data after filtering" error? You may simply have entered the wrong data path, so no HDF5 file is found and there is no data to load.

  4. Use the num_workers argument of PyTorch's DataLoader to have multiple workers pre-load the data during training, especially if you train on GPU. In deeprank, you specify it in model.train(). Do not assign more num_workers than the number of CPU cores you have available (on Snellius, 18 CPU cores per GPU card). See the DataLoader sketch after this list.

  5. Do you want to set one shape (input or output) in your model as a fraction of a variable (e.g. input_shape/2)? Plain division returns a float, and you may hit the following error:

     TypeError: empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType), but expected one of: * (tuple of ints size, *, tuple of names names, torch.memory_format memory_format, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad) * (tuple of ints size, *, torch.memory_format memory_format, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)

     To prevent it, write your shape with integer division ('//') and an explicit 'int()' cast, as in the following example (expanded into a runnable sketch after this list):
     nn.Conv3d(input_shape[0], int(input_shape[0] // 2), kernel_size=1)

  6. You can check your GPU memory consumption in real time by submitting a job, connecting through ssh to the node running that job, and running nvidia-smi. A helper for logging memory from inside the training script is sketched after this list.
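
A minimal sketch of the batch-size checks from items 1 and 2, in plain PyTorch. The cluster_sizes list, the grid shape, and the TensorDataset stand-in are all hypothetical; in deeprank the cases would come from the HDF5 files:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical per-cluster case counts; in deeprank these come from the data.
cluster_sizes = [120, 64, 18]

batch_size = 16
# Item 2: the batch must fit inside the smallest cluster,
# or the loader will request more cases than exist and crash.
assert batch_size <= min(cluster_sizes), "batch_size exceeds the smallest cluster"

# Stand-in dataset: 3D grids with 8 feature channels and binary labels.
dataset = TensorDataset(torch.randn(202, 8, 30, 30, 30), torch.randint(0, 2, (202,)))

# Item 1: if this batch size still triggers CUDA out-of-memory errors, reduce it.
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
```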
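For item 4, a sketch of a DataLoader with multiple loading workers. The dataset is again a stand-in, and the worker count is an illustrative choice; in deeprank the equivalent setting is passed to model.train() rather than built by hand:

```python
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 8, 30, 30, 30), torch.randint(0, 2, (256,)))

# Cap the workers at the CPU cores actually available
# (on Snellius, 18 CPU cores per GPU card).
num_workers = min(4, os.cpu_count() or 1)

loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=num_workers,  # workers pre-load batches in parallel with the GPU
    pin_memory=True,          # speeds up host-to-GPU copies
)
```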
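Item 5's fix, expanded into a self-contained example. The input_shape tuple is a hypothetical grid shape chosen for illustration:

```python
import torch
import torch.nn as nn

input_shape = (8, 30, 30, 30)  # hypothetical (channels, x, y, z) grid shape

# input_shape[0] / 2 evaluates to the float 4.0 and raises the TypeError above;
# integer division plus an explicit int() cast keeps the channel count an int.
conv = nn.Conv3d(input_shape[0], int(input_shape[0] // 2), kernel_size=1)

x = torch.randn(2, *input_shape)  # batch of 2 cases
print(conv(x).shape)              # torch.Size([2, 4, 30, 30, 30])
```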
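As a complement to nvidia-smi in item 6, a small helper (my addition, not part of the original issue) that prints PyTorch's own view of GPU memory from inside the training loop:

```python
import torch

def log_gpu_memory(tag: str = "") -> None:
    """Print currently allocated and reserved GPU memory in MiB."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**2
        reserved = torch.cuda.memory_reserved() / 1024**2
        print(f"[{tag}] allocated: {allocated:.1f} MiB | reserved: {reserved:.1f} MiB")

# Example: call between batches to watch consumption grow.
log_gpu_memory("after batch")
```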

DarioMarzella added the notes label on Sep 28, 2022