
GPU usage for RNN Training #172

Open
thatsri9ht opened this issue May 23, 2024 · 5 comments

@thatsri9ht

My PC has an NVIDIA GeForce RTX 3070 GPU and 16 GB of RAM. When training my LSTM model on the GPU, I encountered a CUDA out-of-memory error, indicating there was insufficient GPU memory to allocate tensors. I've tried reducing the batch size and simplifying the model architecture, but the issue persists. Any suggestions or guidance on how to address this problem would be greatly appreciated!

torchvision is not available - cannot save figures
INFO:root:##################################################
Starting training sequence 1...
##################################################
Training: 75%|█████████████████████ | 3750/5000 [00:12<00:04, 310.02it/s]
Traceback (most recent call last):
File "/home/thatsri9ht/python/openwakeword/openwakeword/train.py", line 888, in
best_model = oww.auto_train(
File "/home/thatsri9ht/python/openwakeword/openwakeword/train.py", line 276, in auto_train
self.train_model(
File "/home/thatsri9ht/python/openwakeword/openwakeword/train.py", line 519, in train_model
val_predictions = self.model(x_val)
File "/home/thatsri9ht/anaconda3/envs/openwakeword/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/thatsri9ht/anaconda3/envs/openwakeword/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/thatsri9ht/python/openwakeword/openwakeword/train.py", line 94, in forward
out, h = self.layer1(x)
File "/home/thatsri9ht/anaconda3/envs/openwakeword/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/thatsri9ht/anaconda3/envs/openwakeword/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/thatsri9ht/anaconda3/envs/openwakeword/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 888, in forward
c_zeros = torch.zeros(self.num_layers * num_directions,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 472.00 MiB. GPU

@EthanEpp

Are you using your own feature embedding npys, or the provided ones?

@thatsri9ht
Author

I use both my own features and the 2,000 hours of features available on Hugging Face.

@EthanEpp

EthanEpp commented Jun 12, 2024

I see, I believe the issue comes from this line of the train function:

X_val_fp = np.array([X_val_fp[i:i+input_shape[0]] for i in range(0, X_val_fp.shape[0]-input_shape[0], 1)]) # reshape to match model
This line handles the false-positive validation set features, which also come from Hugging Face, but it does so by loading the entire false-positive validation feature set into memory in order to reshape it, so memory often runs out on other sets. If your features are already the shape the model expects, (n, 16, 96) I believe, you can just comment out this line and it should work. I am working on a more robust fix that can do the resizing, and will hopefully have a PR up for that in the next day or so.
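If the features do need reshaping, a lower-memory alternative to that list comprehension (a sketch, not the project's code; the names X_val_fp and input_shape follow the line quoted above) is numpy's sliding_window_view, which builds the overlapping windows as a strided view instead of copying each window into a Python list:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def window_features(feats, window):
    # Overlapping windows of length `window` along the first axis,
    # materialized as a zero-copy strided view.
    # sliding_window_view puts the window axis last, so move it back.
    win = sliding_window_view(feats, window, axis=0)  # (N - window + 1, 96, window)
    win = np.moveaxis(win, -1, 1)                     # (N - window + 1, window, 96)
    # The original range() loop stops one window earlier; drop the
    # last window to produce an identical result.
    return win[:-1]

# Toy sizes: 16-frame windows over 100 frames of 96-dim features
feats = np.zeros((100, 96), dtype=np.float32)
windows = window_features(feats, 16)
print(windows.shape)  # (84, 16, 96)
```

The view itself allocates almost nothing; a copy is still made when the array is later converted to a tensor, but the Python-level list of slices is avoided.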

The 75% comes from this line

val_steps = np.linspace(steps-int(steps*0.25), steps, 20).astype(np.int64)

since it runs the false-positive validation test starting at 75% completion of training. This is not the actual source of the issue, though; it is just the reason the error occurs at 75%.
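To illustrate the schedule (a self-contained sketch; `steps = 5000` is an assumption chosen to match the progress bar in the traceback above):

```python
import numpy as np

steps = 5000  # assumed total training steps, matching "3750/5000" above
# 20 validation checkpoints spread over the last 25% of training
val_steps = np.linspace(steps - int(steps * 0.25), steps, 20).astype(np.int64)
print(val_steps[0], val_steps[-1])  # 3750 5000
```

The first checkpoint lands at step 3750, i.e. exactly the 75% mark where the out-of-memory error appears.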

Also, are you generating your own features using the training_models notebook? I think there might be an issue with using these generated embeddings with the automatic model training notebook as-is. I am also working on a robust fix for that, but if you are, I can post the sort of band-aid fix I am doing now to make it compatible.

@dscripka
Owner

@EthanEpp is correct; the script currently loads the validation data into memory, as it is generally small enough not to cause any issues. Training RNN-based models can dramatically increase the memory requirements for training (at least in comparison to the default simple DNN models), so in this case you may need to make modifications to train.py.
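One possible modification (a sketch, not the project's code; `model` and `x_val` stand in for the corresponding objects in train.py) is to run validation predictions in fixed-size chunks under `torch.no_grad()`, which bounds peak GPU memory by the chunk size instead of the full validation set:

```python
import torch

def predict_in_chunks(model, x_val, chunk_size=1024):
    """Run validation predictions chunk by chunk so peak GPU memory
    is bounded by chunk_size rather than the size of the full set."""
    preds = []
    with torch.no_grad():  # skip autograd buffers during validation
        for start in range(0, x_val.shape[0], chunk_size):
            batch = x_val[start:start + chunk_size]
            preds.append(model(batch).cpu())  # move results off the GPU
    return torch.cat(preds)
```

Replacing a call like `self.model(x_val)` with `predict_in_chunks(self.model, x_val)` would be the idea, though the right chunk size depends on the model and GPU.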

If it helps: in my testing, RNN-based models only rarely perform better than DNN models for short wake words.

@Joseph513shen

Hello, when I try to train with train.py, I find that there is no "generate_samples" function. Have you encountered this problem? Thanks!
