
munmap_chunk(): invalid pointer #358

Closed
soares-f opened this issue Mar 3, 2019 · 10 comments
@soares-f

soares-f commented Mar 3, 2019

Hello,

I am encountering an error while trying to use onmt-main train_and_eval.
After evaluation, the following error happens: munmap_chunk(): invalid pointer

2019-03-03 11:38:19.711832: I tensorflow/core/kernels/lookup_util.cc:373] Table trying to initialize from file /gpfs/scratch/bsc88/bsc88260/WMT_2018_BSC/ESPT_EN/Tokenized/processed_espt_enSP32k.vocab is already initialized.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation predictions saved to /gpfs/scratch/bsc88/bsc88260/WMT_2018_BSC/ESPT_EN/model/eval/predictions.txt.1323
*** Error in `/gpfs/projects/bsc88/Environments_P9/OpenNMT-tf/bin/python3': munmap_chunk(): invalid pointer: 0x00000000f8b7a860 ***
*** Error in `/gpfs/projects/bsc88/Environments_P9/OpenNMT-tf/bin/python3': munmap_chunk(): invalid pointer: 0x00000000f8a2f690 ***
*** Error in `/gpfs/projects/bsc88/Environments_P9/OpenNMT-tf/bin/python3': munmap_chunk(): invalid pointer: 0x00000000f9afebd0 ***
*** Error in `/gpfs/projects/bsc88/Environments_P9/OpenNMT-tf/bin/python3': munmap_chunk(): invalid pointer: 0x00000000f81e3260 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x92324)[0x7fffac212324]
/lib64/libc.so.6(cfree+0x970)[0x7fffac21a1a0]
/lib64/libcuda.so.1(+0x24e37c)[0x7fff7569e37c]
/lib64/libcuda.so.1(+0x2cab34)[0x7fff7571ab34]
/lib64/libcuda.so.1(+0x2511ac)[0x7fff756a11ac]
/lib64/libpthread.so.0(+0x8af4)[0x7fffac4c8af4]
/lib64/libc.so.6(clone+0xe4)[0x7fffac2a8814]

I am running in a HPC cluster with:
2 x IBM Power9 8335-GTH
4 x GPU NVIDIA V100
CUDA 9.1
CUDNN 7.1.3
Python 3.6.5

The job is running in an exclusive node using all 4 GPUs.

My configuration file is:

train:
  save_checkpoints_steps: 1000
  batch_type: tokens
  batch_size: 3048
  num_threads: 160
  train_steps: 800000
eval:
  eval_delay: 1200  # Every 1 hour
  external_evaluators: BLEU
  batch_size: 32
  num_threads: 160
infer:
  batch_size: 32

The complete stderr dump is attached.
config_basic1.txt

Any ideas about what is causing this and how to solve it?

@guillaumekln
Contributor

Hi,

The first thing I notice is the very high num_threads value. It's actually the number of threads used to preprocess the dataset and probably does not need to be higher than 2 or 4 for training. (If you interpret this parameter differently, please let me know; the documentation should then be improved.)
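
For intuition, here is a minimal tf.data sketch of the kind of parallel preprocessing this parameter controls (an illustration only, not OpenNMT-tf's actual input pipeline; the file name and tokenization step are placeholders):

import tensorflow as tf

# Read the training corpus line by line ("train.txt" is a placeholder path).
dataset = tf.data.TextLineDataset("train.txt")

# The preprocessing step (here a simple whitespace split) runs on
# num_parallel_calls threads; a handful of threads is usually enough to keep
# the input pipeline ahead of the GPUs.
dataset = dataset.map(lambda line: tf.string_split([line]).values,
                      num_parallel_calls=4)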

More generally, this error can hide an out-of-memory error. Are you running low on system memory when running the training?

@soares-f
Author

soares-f commented Mar 5, 2019

Hi,

Actually, I set num_threads to the total number of hardware threads available on the node (each CPU has 40 cores with 2 threads per core). The memory usage is currently low (less than 10%), as the node has 512GB of RAM.

I managed to get it running without train_and_eval, only with train. However, the speed is incredibly slow (~500 words/sec). From the logs, TensorFlow seems to be using the GPUs.
My training data has around 14M sentences, but that shouldn't be the reason for such slow performance, should it?

I am using the standard TensorFlow build compiled on the cluster (version 1.9.0); later I will try the latest one with CUDA 10.0 to see if anything changes.

Regards,

@guillaumekln
Contributor

Actually, I set num_threads to the total number of hardware threads available on the node (each CPU has 40 cores with 2 threads per core)

I still recommend setting it to a small value. Can you try running without this parameter so that the default value is used?

I managed to get it running without train_and_eval, only with train. However, the speed is incredibly slow (~500 words/sec). From the logs, TensorFlow seems to be using the GPUs.
My training data has around 14M sentences, but that shouldn't be the reason for such slow performance, should it?

What does the GPU usage look like when running watch nvidia-smi?

@soares-f
Author

soares-f commented Mar 5, 2019

I tried without the parameter and got the same result.
It takes ages for the GPUs to start getting busy, and even then usage is only around 8%.
This is a screenshot of both CPU and memory usage:
[screenshot: CPU and memory usage]

I then tried with a smaller dataset of around 50k sentences; the GPUs became fully busy and I got a speed of around 17,000 words/sec.
Thus, the only thing varying is the size of the training data, which is weird. I might be missing something, because the cluster has SSD drives and even the whole dataset would fit in memory.

@guillaumekln
Contributor

Thanks for all the info.

Can you try setting a fixed and smaller shuffle buffer, e.g.:

train:
  sample_buffer_size: 1000000

@soares-f
Author

soares-f commented Mar 5, 2019

It finally worked!
Thank you!

By the way, what exactly is sample_buffer_size? I checked another issue (#21) but it was not very clear to me.
In practice, what are the implications of different values?

Thank you so much for your help and quick responses!

@guillaumekln
Contributor

Great!

This parameter controls the level of data shuffling. Instead of reading the dataset sequentially, training randomly samples sentences from a buffer of the next sample_buffer_size sentences. When auto_config is used, it is set to the dataset size so that the dataset is uniformly shuffled.
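
To make the mechanism concrete, here is a minimal generic tf.data sketch (an illustration only, not OpenNMT-tf's actual pipeline; the file name is a placeholder):

import tensorflow as tf

# Elements are drawn at random from a sliding buffer of buffer_size elements:
# a buffer as large as the dataset gives a uniform shuffle, while a smaller
# buffer gives only an approximate shuffle but needs far less host memory and
# fills up much faster at startup.
dataset = tf.data.TextLineDataset("train.txt")   # placeholder path
dataset = dataset.shuffle(buffer_size=1000000)   # plays the role of sample_buffer_size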

I find it a bit surprising that performance is impacted that much when using a very large dataset. I should probably set an upper limit of around 5M on this buffer.

@soares-f
Author

soares-f commented Mar 5, 2019

Yes, in fact, this also fixed the long wait for the shuffle buffer to fill.
Tomorrow I'll run more experiments to see to what extent I can change this value (for the sake of curiosity).
I was very confused because I'm using pretty good hardware and wasn't getting good performance.
Thanks again!

@soares-f
Author

soares-f commented Mar 6, 2019

Just for information, the maximum sample_buffer_size I could use for this dataset was 8M.

Everything is working fine. Thanks :)

@guillaumekln
Contributor

Thanks for the information!
