Avoid OOM on TPU #1690
-
Hi, I've been able to solve an OOM on a TPU v3-8 with an ugly hack that I don't understand.

Problem you have encountered: When running my training script on a TPU v3-8, I get an OOM error.

What you expected to happen: Due to my quick hack (see below), it should run with no problem.

Logs, error messages, etc:
How do I solve it?
Note:
Steps to reproduce:
-
So I think the issue is due to the following:
Another solution I had considered is to move all the weights to the CPU and then move them back to the TPU. Is there a cleaner way to handle TPU memory allocation?
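For reference, a minimal sketch of what that CPU round trip could look like (the dummy `state` pytree and device choices are assumptions, not code from this thread):

```python
import jax
import jax.numpy as jnp

# Dummy "weights" pytree standing in for a real train state (assumption).
state = {"w": jnp.ones((1024, 1024)), "b": jnp.zeros((1024,))}

cpu = jax.devices("cpu")[0]
accel = jax.devices()[0]  # default backend device, e.g. TPU core 0 on a v3-8

# Move every array in the pytree to host memory; the old device copies become
# eligible for deallocation once nothing else references them.
state = jax.device_put(state, cpu)

# ... run whatever needs the freed device memory here ...

# Move the pytree back onto the accelerator before resuming training.
state = jax.device_put(state, accel)
```

Note that `jax.device_put` only makes the old TPU copies collectable; whether the allocator can then reuse that space without fragmentation is a separate question.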
-
This is quite odd for sure. Fragmentation and being close to the limit in terms of memory could of course result in errors that appear almost randomly. One thing you could try is to initialize the model on CPU, e.g. `jax.jit(model.init, backend="cpu")`. The params are moved to TPU automatically during training or during replication of the state (e.g. `jax_utils.replicate`).
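A minimal, self-contained sketch of that CPU-init pattern might look like the following (the toy `MLP` model, shapes, and RNG seed are assumptions, not the original poster's setup):

```python
import jax
import jax.numpy as jnp
import flax.linen as nn
from flax import jax_utils

class MLP(nn.Module):
    @nn.compact
    def __call__(self, x):
        x = nn.Dense(4096)(x)
        return nn.Dense(10)(x)

model = MLP()
rng = jax.random.PRNGKey(0)
dummy_input = jnp.ones((1, 512))

# Run the parameter initialization on the host so the full parameter tree
# never has to be materialized on a single TPU core.
init_fn = jax.jit(model.init, backend="cpu")
params = init_fn(rng, dummy_input)

# Replicating the state (or pmapping the train step) moves the params onto the TPUs.
replicated_params = jax_utils.replicate(params)
```

The point of `backend="cpu"` is that the unreplicated parameter tree lives in host RAM, so the first thing the TPU ever holds is the copy actually used by the train step.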
-
I'm having a similar issue, even when I run it on CPU. It's baffling. I'm not sure how it's a parameter issue if, like @borisdayma said, the memory allocation seems to scale with batch size.
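One way to check how allocation scales with batch size is to dump a device memory profile after a step and compare the snapshots in pprof. A hedged sketch (the toy `step` function and file names are assumptions, not code from this thread):

```python
import jax
import jax.numpy as jnp

@jax.jit
def step(x):
    # Stand-in for a real train step (assumption).
    return (x @ x.T).sum()

for batch_size in (8, 32, 128):
    x = jnp.ones((batch_size, 4096))
    step(x).block_until_ready()
    # Writes a pprof-compatible snapshot of live device allocations.
    jax.profiler.save_device_memory_profile(f"memory_bs{batch_size}.prof")
```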
-
One issue that I faced in using
-
I'm having a similar issue, also around checkpointing. The difference is that I'm starting from randomly initialized weights and then checkpointing with orbax. The initial training works fine and the checkpointing works fine, but when the training loop resumes after the first checkpoint, the error appears. I'm guessing that checkpointing moves the model off of the TPU to the CPU, so my hunch is that when it tries to get it back onto the TPU, there isn't enough space. I'm wondering if it'd be possible somehow to "flush" the TPU after checkpointing? I tried the

```
  File "train_tcn.py", line 492, in <module>
    state, loss = jit_train_step(state, input, target, config.loss_fn)
  File "/home/mike/jaxfun/.venv/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 166, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "/home/mike/jaxfun/.venv/lib/python3.8/site-packages/jax/_src/pjit.py", line 250, in cache_miss
    outs, out_flat, out_tree, args_flat, jaxpr = _python_pjit_helper(
  File "/home/mike/jaxfun/.venv/lib/python3.8/site-packages/jax/_src/pjit.py", line 163, in _python_pjit_helper
    out_flat = pjit_p.bind(*args_flat, **params)
  File "/home/mike/jaxfun/.venv/lib/python3.8/site-packages/jax/_src/core.py", line 2677, in bind
    return self.bind_with_trace(top_trace, args, params)
  File "/home/mike/jaxfun/.venv/lib/python3.8/site-packages/jax/_src/core.py", line 383, in bind_with_trace
    out = trace.process_primitive(self, map(trace.full_raise, args), params)
  File "/home/mike/jaxfun/.venv/lib/python3.8/site-packages/jax/_src/core.py", line 815, in process_primitive
    return primitive.impl(*tracers, **params)
  File "/home/mike/jaxfun/.venv/lib/python3.8/site-packages/jax/_src/pjit.py", line 1203, in _pjit_call_impl
    return xc._xla.pjit(name, f, call_impl_cache_miss, [], [], donated_argnums,
  File "/home/mike/jaxfun/.venv/lib/python3.8/site-packages/jax/_src/pjit.py", line 1187, in call_impl_cache_miss
    out_flat, compiled = _pjit_call_impl_python(
  File "/home/mike/jaxfun/.venv/lib/python3.8/site-packages/jax/_src/pjit.py", line 1143, in _pjit_call_impl_python
    return compiled.unsafe_call(*args), compiled
  File "/home/mike/jaxfun/.venv/lib/python3.8/site-packages/jax/_src/profiler.py", line 314, in wrapper
    return func(*args, **kwargs)
  File "/home/mike/jaxfun/.venv/lib/python3.8/site-packages/jax/_src/interpreters/pxla.py", line 1349, in __call__
    results = self.xla_executable.execute_sharded(input_bufs)
jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Error loading program: Attempting to reserve 2.61G at the bottom of memory. That was not possible. There are 1.70G free, 0B reserved, and 1.70G reservable.: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
```
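On the "flush after checkpointing" idea, one thing worth trying is to make the host copy explicit and short-lived, so the device-resident state stays the only long-lived copy. This is a hedged sketch assuming an orbax `PyTreeCheckpointer` and a `state` pytree; it is not necessarily what orbax does internally:

```python
import jax
import orbax.checkpoint as ocp

checkpointer = ocp.PyTreeCheckpointer()

def save_checkpoint(state, path):
    # Pull the train state into host memory explicitly so the bytes written
    # to disk come from host buffers rather than an extra device-side copy.
    host_state = jax.device_get(state)
    checkpointer.save(path, host_state)
    # Drop the host copy right away; the original device-resident `state`
    # remains the only live copy for the training loop.
    del host_state

# Example usage with a dummy pytree (assumption; orbax may require an absolute path):
# save_checkpoint({"w": jax.numpy.zeros((8, 8))}, "/tmp/ckpt/step_1")
```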