Good practices of training #73

Open
DarioMarzella opened this issue Sep 28, 2022 · 0 comments
Labels
notes: Things to remember that do not necessarily cause an issue at the moment, but may do in the future

DarioMarzella commented Sep 28, 2022

A list of good practices to prevent issues that have occurred before:

  1. Check that the batch size is not too big. An oversized batch can overload GPU memory and fail with an out-of-memory allocation error such as:

     RuntimeError: CUDA out of memory. Tried to allocate 494.00 MiB (GPU 0; 39.44 GiB total capacity; 13.79 MiB already allocated; 168.62 MiB free; 22.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
     RuntimeError: CUDA error: out of memory

  2. Check that your batch size is smaller than the smallest cluster of test data you have; otherwise the loader will try to fetch more cases than are available and crash. See the batch-size sketch after this list.

  3. Did you get a "ValueError: No avaiable training data after filtering" error? You may simply have entered the wrong data path, so no HDF5 file is found and there is no data to load.

  4. Use the num_workers argument of PyTorch's DataLoader to have multiple workers pre-load the data during training, especially if you train on GPU. In deeprank, you specify it in model.train(). Do not assign more num_workers than the number of CPU cores you have available (on Snellius, 18 CPU cores per GPU card). See the DataLoader sketch after this list.

  5. Do you want to set one shape (input or output) in your model as a fraction of a variable (e.g. input_shape/2)? Plain division returns a float, and you may hit the following error:

     TypeError: empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType), but expected one of: * (tuple of ints size, *, tuple of names names, torch.memory_format memory_format, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad) * (tuple of ints size, *, torch.memory_format memory_format, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)

     To prevent it, write your shape with integer division ('//') and an explicit 'int()' cast, as in the following example (expanded into a runnable sketch after this list):
     nn.Conv3d(input_shape[0], int(input_shape[0] // 2), kernel_size=1)

  6. You can check your GPU memory consumption in real time by submitting a job, connecting through ssh to the node running that job, and running nvidia-smi. A helper for logging memory from inside the training script is sketched after this list.
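
A minimal sketch of the batch-size checks from items 1 and 2, in plain PyTorch. The cluster_sizes list, the grid shape, and the TensorDataset stand-in are all hypothetical; in deeprank the cases would come from the HDF5 files:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical per-cluster case counts; in deeprank these come from the data.
cluster_sizes = [120, 64, 18]

batch_size = 16
# Item 2: the batch must fit inside the smallest cluster,
# or the loader will request more cases than exist and crash.
assert batch_size <= min(cluster_sizes), "batch_size exceeds the smallest cluster"

# Stand-in dataset: 3D grids with 8 feature channels and binary labels.
dataset = TensorDataset(torch.randn(202, 8, 30, 30, 30), torch.randint(0, 2, (202,)))

# Item 1: if this batch size still triggers CUDA out-of-memory errors, reduce it.
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
```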
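For item 4, a sketch of a DataLoader with multiple loading workers. The dataset is again a stand-in, and the worker count is an illustrative choice; in deeprank the equivalent setting is passed to model.train() rather than built by hand:

```python
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 8, 30, 30, 30), torch.randint(0, 2, (256,)))

# Cap the workers at the CPU cores actually available
# (on Snellius, 18 CPU cores per GPU card).
num_workers = min(4, os.cpu_count() or 1)

loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=num_workers,  # workers pre-load batches in parallel with the GPU
    pin_memory=True,          # speeds up host-to-GPU copies
)
```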
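Item 5's fix, expanded into a self-contained example. The input_shape tuple is a hypothetical grid shape chosen for illustration:

```python
import torch
import torch.nn as nn

input_shape = (8, 30, 30, 30)  # hypothetical (channels, x, y, z) grid shape

# input_shape[0] / 2 evaluates to the float 4.0 and raises the TypeError above;
# integer division plus an explicit int() cast keeps the channel count an int.
conv = nn.Conv3d(input_shape[0], int(input_shape[0] // 2), kernel_size=1)

x = torch.randn(2, *input_shape)  # batch of 2 cases
print(conv(x).shape)              # torch.Size([2, 4, 30, 30, 30])
```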
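As a complement to nvidia-smi in item 6, a small helper (my addition, not part of the original issue) that prints PyTorch's own view of GPU memory from inside the training loop:

```python
import torch

def log_gpu_memory(tag: str = "") -> None:
    """Print currently allocated and reserved GPU memory in MiB."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**2
        reserved = torch.cuda.memory_reserved() / 1024**2
        print(f"[{tag}] allocated: {allocated:.1f} MiB | reserved: {reserved:.1f} MiB")

# Example: call between batches to watch consumption grow.
log_gpu_memory("after batch")
```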

DarioMarzella added the notes label on Sep 28, 2022