ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,12,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU #8
What's going on here? Am I out of memory? I can't get it to train.
conda install -c anaconda cudnn==6.0.0 --yes seemed to fix the problem...
@josai how did you know that you needed to do that?
Googling similar errors and trying their solutions until one worked.
Is this with the 345M model? I've found it only just fits on a 1080 Ti, so anything using substantial VRAM, like a browser running in the background, can push it over the edge.
No, neither model was working until I conda installed cuDNN. I am currently retraining the 345M model on a GTX 970 with several applications, including Chrome, running in the background with no problems.
I have a GTX 1060 6GB and I also have this problem.
Same problem. Are there any diagnostics we can run?
iocaposk8, that change is one way to reduce the memory usage. You are basically shortening the model's memory there, allowing it to remember only the last 512 words instead of the full 1024 during training. I'm not sure how much that would affect output quality.
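For a rough sense of why that helps, here is a minimal sketch (the helper name is just illustrative) estimating the size of a single attention tensor like the one in the OOM message, shape [1, 12, 1024, 1024] in float32:

```python
# Size of one attention tensor of shape [batch, heads, ctx, ctx]
# stored as float32 (4 bytes per value), reported in MiB.
def attn_tensor_mib(batch=1, heads=12, ctx=1024, bytes_per_float=4):
    return batch * heads * ctx * ctx * bytes_per_float / 2**20

print(attn_tensor_mib(ctx=1024))  # 48.0 MiB per attention tensor at ctx=1024
print(attn_tensor_mib(ctx=512))   # 12.0 MiB at ctx=512
```

Halving the context length quarters the per-layer attention memory, which is why shortening it can make training fit on a smaller GPU.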
Thanks for the answer. So can I use, for example, 895 as the value, or is a number like 128, 512, or 1024 better? Another question: I am training a model for my language; in your opinion, how low should the loss be to get a good model without overfitting? Last question: how can I generate texts that talk about a certain topic? Thank you.
I'm having the same problem trying to train the 355M model on an RTX 2070 8GB, even after trying both suggestions. I didn't use Conda, but I have cuDNN installed manually (cuDNN v7.6.2.24 on CUDA 9.0).
This fix worked for me.
Caused by op 'model/h3/attn/truediv_1', defined at:
  File "train.py", line 293, in <module>
    main()
  File "train.py", line 138, in main
    opt_grads = memory_saving_gradients.gradients(loss, train_vars)
  File "C:\Users\The Atomizer\Desktop\text\gpt2\memory_saving_gradients.py", line 250, in gradients
    copied_sgv, info = ge.copy_with_input_replacements(ge.sgv(ops_to_copy), {})
  File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 673, in copy_with_input_replacements
    sgv, dst_graph, dst_scope, src_scope, reuse_dst_scope=reuse_dst_scope)
  File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 453, in __call__
    self.copy_ops(info)
  File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 467, in copy_ops
    op, op_outputs = self.transform_op_handler(info, op, new_inputs)
  File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py", line 177, in copy_op_handler
    [], input_types_, None, op_def_)
  File "C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\python\framework\ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,12,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model/h3/attn/truediv_1 (defined at C:\Users\The Atomizer\Miniconda3\envs\gtext\lib\site-packages\tensorflow\contrib\graph_editor\transform.py:177) = RealDiv[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/h3/attn/Exp_1, model/h3/attn/Sum_1)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
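If anyone wants to follow that hint, a minimal TF 1.x sketch (the session and fetches in the comment are placeholders; only the options= argument is the point) would look something like:

```python
import tensorflow as tf

# Ask TensorFlow to report the list of live tensor allocations if an OOM occurs.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Then pass it to the training step, e.g.:
# sess.run([opt_apply, loss], feed_dict={context: batch}, options=run_options)
```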