CUDA OOM Issues #8
That's a big problem... batch size and VRAM usage are only partially correlated; there is a minimum amount of VRAM needed just to load the full optimizer states of the GPT model. There is one immediate thing you could attempt: enable FP16 training, and keep the batch sizes at a reasonable level (some multiple of …). If that is not sufficient, then more complicated efforts will be required to reduce VRAM usage. One possibility would be to preprocess the dataset into quantized mels rather than running the VQVAE on the fly. But personally, if it still doesn't work at that point, I would recommend using the new Colab notebook.
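For reference, here is a minimal sketch of what FP16 (mixed-precision) training means at the PyTorch level, using autocast and a gradient scaler. This is only an illustration of the mechanism, not the code path DL-Art-School actually uses (there the switch is presumably a config option); `model`, `optimizer`, and the random batch are stand-ins.

import torch

# Stand-in model, optimizer, and data just to make the sketch runnable.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    batch = torch.randn(4, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():      # forward pass runs in half precision
        loss = model(batch).pow(2).mean()
    scaler.scale(loss).backward()        # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()

Activations in FP16 roughly halve their memory footprint; the optimizer states themselves stay in FP32, which is why this only partially addresses the OOM.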
That will not work. The architecture of the model must not be adjusted; you will get nonsense results if the model isn't fully loaded.
If someone knew how to implement LoRA, it might be applicable to this situation, but I think Colab is the best option for now. I will close this issue until the situation changes.
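For context, the idea behind LoRA is to freeze the base weights and train only small low-rank adapters, which shrinks the optimizer states that dominate VRAM here. A minimal, hypothetical sketch of the technique (nothing like this exists in the repo):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # freeze the original weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # adapters start as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

Only the small A and B matrices carry gradients and optimizer state, so Adam's moment buffers for the frozen GPT weights disappear entirely.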
Have you tried implementing it in Colossal-AI? It claims to get a 1.5x to 8x speedup on PCs for training OPT- and GPT-type models through larger RAM/pagefile magic.
The primary problem with using ColossalAI, or any other "GPT-2 infer/train speedup" project, is that the GPT model here is not exactly the same as a normal GPT-2 model. It injects the conditional latents into the input embeddings on the first forward pass (or on all forward passes when there is no kv_cache). A speedup framework that doesn't expose callbacks at the forward pass (which is all I have seen) would have to be redeveloped in some manner. It is possible I am missing some obvious performance gains, but so far integration has not been an obvious process for me.
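To illustrate why generic GPT-2 wrappers don't fit: the forward pass here starts from embeddings with the conditioning latents spliced in, not from token ids. A rough sketch of that pattern, with hypothetical names (`gpt`, `cond_latent`, `text_emb`) that do not match the actual code:

import torch

def conditioned_forward(gpt, cond_latent, text_emb):
    # cond_latent: (batch, n_cond, dim); text_emb: (batch, n_text, dim)
    # The conditioning latents are prepended to the token embeddings, so the
    # transformer has to accept pre-built embeddings instead of token ids --
    # a hook most "GPT-2 speedup" frameworks do not expose at the forward pass.
    inputs_embeds = torch.cat([cond_latent, text_emb], dim=1)
    return gpt(inputs_embeds=inputs_embeds)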
bitsandbytes
Following the mrq implementation, I have added 8-bit training with bitsandbytes in 091c6b1. However, this will only work on Linux, because Windows has issues with direct pip installations of bnb. Paging @devilismyfriend for help here.
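The essence of 8-bit training is swapping the optimizer for bitsandbytes' 8-bit variant, which stores the Adam moment buffers in 8 bits and cuts optimizer-state VRAM roughly 4x. A minimal sketch of that swap (the names here are stand-ins, not the wiring used in the commit above):

import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(1024, 1024).cuda()   # stand-in for the GPT model

# Drop-in replacement for torch.optim.AdamW with 8-bit optimizer states.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4, weight_decay=0.01)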
Yeah, should be an easy fix for Windows.
Sorry to be here again.
I have a 3070 8GB.
My dataset is fine now, but I keep getting CUDA errors. I've identified 3 places in the yml where I can reduce batch sizes, but even setting them to 1 gets me an error.
I've also tried changing
mega_batch_factor:
as per your notes. I tried a much smaller dataset of 600 wav files.
I get this:
Traceback (most recent call last):

H:\DL-Art-School\codes\train.py:370 in <module>
    367 │   │   torch.cuda.set_device(torch.distributed.get_rank())
    368 │
    369 │   trainer.init(args.opt, opt, args.launcher)
  ❱ 370 │   trainer.do_training()
    371

H:\DL-Art-School\codes\train.py:325 in do_training
    322 │   │
    323 │   │   _t = time()
    324 │   │   for train_data in tq_ldr:
  ❱ 325 │   │   │   self.do_step(train_data)
    326 │
    327 │   def create_training_generator(self, index):
    328 │   │   self.logger.info('Start training from epoch: {:d}, iter: {:d}'.format(self.start

H:\DL-Art-School\codes\train.py:206 in do_step
    203 │   │   │   print("Update LR: %f" % (time() - _t))
    204 │   │   _t = time()
    205 │   │   self.model.feed_data(train_data, self.current_step)
  ❱ 206 │   │   gradient_norms_dict = self.model.optimize_parameters(self.current_step, return_g
    207 │   │   iteration_rate = (time() - _t) / batch_size
    208 │   │   if self._profile:
    209 │   │   │   print("Model feed + step: %f" % (time() - _t))

H:\DL-Art-School\codes\trainer\ExtensibleTrainer.py:302 in optimize_parameters
    299 │   │   │   new_states = {}
    300 │   │   │   self.batch_size_optimizer.focus(net)
    301 │   │   │   for m in range(self.batch_factor):
  ❱ 302 │   │   │   │   ns = step.do_forward_backward(state, m, step_num, train=train_step, no_d
    303 │   │   │   │   # Call into post-backward hooks.
    304 │   │   │   │   for name, net in self.networks.items():
    305 │   │   │   │   │   if hasattr(net.module, "after_backward"):

H:\DL-Art-School\codes\trainer\steps.py:214 in do_forward_backward
    211 │   │   local_state = {}  # <-- Will store the entire local state to be passed to inject
    212 │   │   new_state = {}  # <-- Will store state values created by this step for returning
    213 │   │   for k, v in state.items():
  ❱ 214 │   │   │   local_state[k] = v[grad_accum_step]
    215 │   │   local_state['train_nets'] = str(self.get_networks_trained())
    216 │   │   loss_accumulator = self.loss_accumulator if loss_accumulator is None else loss_a
    217

IndexError: list index out of range
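A guess at the cause, assuming the trainer splits each batch into mega_batch_factor chunks for gradient accumulation: with a batch size of 1 and a larger mega_batch_factor, a split like torch.chunk returns fewer chunks than requested, and indexing the missing chunk raises exactly this kind of IndexError. A small sketch of that behaviour (values are illustrative only):

import torch

mega_batch_factor = 4                     # chunks requested per optimizer step
batch = torch.randn(1, 80, 200)           # effective batch size of 1

chunks = torch.chunk(batch, mega_batch_factor, dim=0)
print(len(chunks))                        # 1, not 4: chunk() caps at the batch size

chunks[3]                                 # raises IndexError, like the traceback above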