List of good practices to prevent issues that have already occurred:
Check that the batch size is not too large. An oversized batch can overload the GPU memory and cause a failure to allocate memory, with an error such as:
```
RuntimeError: CUDA out of memory. Tried to allocate 494.00 MiB (GPU 0; 39.44 GiB total capacity; 13.79 MiB already allocated; 168.62 MiB free; 22.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
RuntimeError: CUDA error: out of memory
```
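As a minimal illustration (plain PyTorch, not the DeepRank API; the dataset shape and batch size below are made up), lowering the `batch_size` passed to the data loader is usually the first fix:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for 3D grid features (8 channels, 30x30x30 voxels).
dataset = TensorDataset(torch.randn(1000, 8, 30, 30, 30))

# A batch size of 64 may not fit on the GPU for large grids;
# halving it (and halving again if needed) until the job fits avoids the OOM error.
loader = DataLoader(dataset, batch_size=16, shuffle=True)
```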
Check that your batch size is smaller than the smallest cluster of test data you have; otherwise the loader will try to fetch more cases than are available and crash.
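A quick sanity check along these lines can catch the problem before the run starts (the cluster sizes and variable names here are purely illustrative, not part of DeepRank):

```python
# Illustrative numbers: cases available in each test cluster.
cluster_sizes = {"cluster_0": 120, "cluster_1": 35, "cluster_2": 12}
batch_size = 16

smallest = min(cluster_sizes.values())
assert batch_size <= smallest, (
    f"batch_size={batch_size} exceeds the smallest test cluster ({smallest} cases)"
)
```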
Did you get a "ValueError: No avaiable training data after filtering" error? You might simply have entered the wrong data path, so no HDF5 file is found there and no data can be loaded.
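Before training, you can verify that the path actually contains HDF5 files (the path below is a placeholder):

```python
import glob
import os

data_path = "/path/to/your/features"  # placeholder: wherever your .hdf5 files should be
hdf5_files = glob.glob(os.path.join(data_path, "*.hdf5"))

if not hdf5_files:
    raise FileNotFoundError(f"No .hdf5 files found in {data_path}; check the data path.")
print(f"Found {len(hdf5_files)} HDF5 file(s)")
```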
Use the PyTorch `DataLoader` `num_workers` argument to have multiple workers pre-loading the data for training, especially if you train on GPU. In DeepRank, you have to specify it in `model.train()`. Do not assign more `num_workers` than the number of CPU cores you have available (on Snellius, 18 CPU cores per GPU card).
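A plain PyTorch sketch of the idea (not the DeepRank `model.train()` call itself; the dataset is a dummy):

```python
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 8, 30, 30, 30))  # dummy data

# Cap num_workers at the CPU cores actually available to this job
# (on Snellius, 18 cores per GPU card).
available_cores = len(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else os.cpu_count()
num_workers = min(18, available_cores)

loader = DataLoader(dataset, batch_size=16, num_workers=num_workers)
```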
Do you want to set one shape (input or output) in your model as a fraction of a variable (e.g. `input_shape/2`)? You might encounter the following issue:

```
TypeError: empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType), but expected one of: * (tuple of ints size, *, tuple of names names, torch.memory_format memory_format, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad) * (tuple of ints size, *, torch.memory_format memory_format, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
```

To prevent it, write your shape as in the following example, using `//` and `int()`:

```python
nn.Conv3d(input_shape[0], int(input_shape[0]//2), kernel_size=1)
```
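For context, a self-contained sketch of the wrong and the right way (the grid shape is made up):

```python
import torch
import torch.nn as nn

input_shape = (8, 30, 30, 30)  # hypothetical (channels, x, y, z) grid shape

# Wrong: plain division yields a float, and Conv3d channel counts must be ints.
# conv = nn.Conv3d(input_shape[0], input_shape[0] / 2, kernel_size=1)  # raises the TypeError above

# Right: floor division plus an explicit int() keeps the channel count an integer.
conv = nn.Conv3d(input_shape[0], int(input_shape[0] // 2), kernel_size=1)

out = conv(torch.randn(1, *input_shape))
print(out.shape)  # torch.Size([1, 4, 30, 30, 30])
```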
You can check your GPU memory consumption in real time by submitting a job, connecting through ssh to the node running that job (e.g. `ssh <username>@<node_name>`) and running `nvidia-smi`.
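If you prefer to check from inside the script, PyTorch's own memory counters give a rough picture for the current process (a complement to `nvidia-smi`, not a replacement):

```python
import torch

if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1024**2  # MiB currently held by tensors
    reserved = torch.cuda.memory_reserved() / 1024**2    # MiB reserved by PyTorch's caching allocator
    print(f"allocated: {allocated:.1f} MiB, reserved: {reserved:.1f} MiB")
```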