
munmap_chunk(): invalid pointer #358

Closed
soares-f opened this issue Mar 3, 2019 · 10 comments
@soares-f

soares-f commented Mar 3, 2019

Hello,

I am encountering an error while trying to use onmt-main train_and_eval.
After evaluation, the following error happens: munmap_chunk(): invalid pointer

2019-03-03 11:38:19.711832: I tensorflow/core/kernels/lookup_util.cc:373] Table trying to initialize from file /gpfs/scratch/bsc88/bsc88260/WMT_2018_BSC/ESPT_EN/Tokenized/processed_espt_enSP32k.vocab is already initialized.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation predictions saved to /gpfs/scratch/bsc88/bsc88260/WMT_2018_BSC/ESPT_EN/model/eval/predictions.txt.1323
*** Error in `/gpfs/projects/bsc88/Environments_P9/OpenNMT-tf/bin/python3': munmap_chunk(): invalid pointer: 0x00000000f8b7a860 ***
*** Error in `/gpfs/projects/bsc88/Environments_P9/OpenNMT-tf/bin/python3': munmap_chunk(): invalid pointer: 0x00000000f8a2f690 ***
*** Error in `/gpfs/projects/bsc88/Environments_P9/OpenNMT-tf/bin/python3': munmap_chunk(): invalid pointer: 0x00000000f9afebd0 ***
*** Error in `/gpfs/projects/bsc88/Environments_P9/OpenNMT-tf/bin/python3': munmap_chunk(): invalid pointer: 0x00000000f81e3260 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x92324)[0x7fffac212324]
/lib64/libc.so.6(cfree+0x970)[0x7fffac21a1a0]
/lib64/libcuda.so.1(+0x24e37c)[0x7fff7569e37c]
/lib64/libcuda.so.1(+0x2cab34)[0x7fff7571ab34]
/lib64/libcuda.so.1(+0x2511ac)[0x7fff756a11ac]
/lib64/libpthread.so.0(+0x8af4)[0x7fffac4c8af4]
/lib64/libc.so.6(clone+0xe4)[0x7fffac2a8814]

I am running in a HPC cluster with:
2 x IBM Power9 8335-GTH
4 x GPU NVIDIA V100
CUDA 9.1
CUDNN 7.1.3
Python 3.6.5

The job is running in an exclusive node using all 4 GPUs.

My configuration file is:

train:
  save_checkpoints_steps: 1000
  batch_type: tokens
  batch_size: 3048
  num_threads: 160
  train_steps: 800000
eval:
  eval_delay: 1200  # Every 1 hour
  external_evaluators: BLEU
  batch_size: 32
  num_threads: 160
infer:
  batch_size: 32

The complete stderr dump is attached.
config_basic1.txt

Any ideas about what is causing this and how to solve it?

@guillaumekln
Contributor

Hi,

The first thing I notice is the very high num_threads value. It's actually the number of threads used to preprocess the dataset and probably does not need to be higher than 2 or 4 for training. (If you interpret this parameter differently, please let me know; the documentation should then be improved.)
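
For intuition, here is a minimal tf.data sketch of the kind of parallel preprocessing this parameter controls (an illustration only, not OpenNMT-tf's actual input pipeline; the file name and tokenization step are placeholders):

import tensorflow as tf

# Read the training corpus line by line ("train.txt" is a placeholder path).
dataset = tf.data.TextLineDataset("train.txt")

# The preprocessing step (here a simple whitespace split) runs on
# num_parallel_calls threads; a handful of threads is usually enough to keep
# the input pipeline ahead of the GPUs.
dataset = dataset.map(lambda line: tf.string_split([line]).values,
                      num_parallel_calls=4)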

More generally, this error can hide an out-of-memory error. Are you running low on system memory when running the training?

@soares-f
Author

soares-f commented Mar 5, 2019

Hi,

Actually, I set num_threads to the total number of hardware threads available on the node (each CPU has 40 cores with 2 threads per core). The memory usage is currently low (less than 10%), as the node has 512GB of RAM.

I managed to get it running without train_and_eval, only with train. However, the speed is incredibly slow (~500 words/sec). From the logs, TensorFlow seems to be using the GPUs.
My training data has around 14M sentences, but that shouldn't be the reason for such slow performance, should it?

I am using the standard TensorFlow build compiled on the cluster (version 1.9.0); later I will try the latest one with CUDA 10.0 to see if anything changes.

Regards,

@guillaumekln
Contributor

Actually, I set num_threads to the total number of hardware threads available on the node (each CPU has 40 cores with 2 threads per core)

I still recommend setting it to a small value. Can you try running without this parameter so that the default value is used?

I managed to get it running without train_and_eval, only with train. However, the speed is incredibly slow (~500 words/sec). From the logs, TensorFlow seems to be using the GPUs.
My training data has around 14M sentences, but that shouldn't be the reason for such slow performance, should it?

What does the GPU usage look like when running watch nvidia-smi?

@soares-f
Author

soares-f commented Mar 5, 2019

I tried without the parameter and got the same result.
It takes ages for the GPUs to start getting busy, and even then usage is only around 8%.
This is a screenshot of both CPU and memory usage:
[screenshot: CPU and memory usage]

I then tried with a smaller dataset of around 50k sentences; the GPUs became fully busy and I got a speed of around 17,000 words/sec.
Thus, the only thing varying is the size of the training data, which is weird. I might be missing something, because the cluster has SSD drives and even the whole dataset would fit in memory.

@guillaumekln
Contributor

Thanks for all the info.

Can you try setting a fixed and smaller shuffle buffer, e.g.:

train:
  sample_buffer_size: 1000000

@soares-f
Author

soares-f commented Mar 5, 2019

It finally worked!
Thank you!

By the way, what exactly is sample_buffer_size? I checked another issue (#21) but it was not very clear to me.
In practice, what are the implications of different values?

Thank you so much for your help and quick responses!

@guillaumekln
Contributor

Great!

This parameter controls the level of data shuffling. Instead of reading the dataset sequentially, training randomly samples sentences from a buffer of the next sample_buffer_size sentences. When auto_config is used, it is set to the dataset size so that the dataset is uniformly shuffled.
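
To make the mechanism concrete, here is a minimal generic tf.data sketch (an illustration only, not OpenNMT-tf's actual pipeline; the file name is a placeholder):

import tensorflow as tf

# Elements are drawn at random from a sliding buffer of buffer_size elements:
# a buffer as large as the dataset gives a uniform shuffle, while a smaller
# buffer gives only an approximate shuffle but needs far less host memory and
# fills up much faster at startup.
dataset = tf.data.TextLineDataset("train.txt")   # placeholder path
dataset = dataset.shuffle(buffer_size=1000000)   # plays the role of sample_buffer_size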

I find it a bit surprising that performance is impacted that much when using a very large dataset. I should probably set an upper limit of around 5M on this buffer.

@soares-f
Author

soares-f commented Mar 5, 2019

Yes, in fact, this also fixed the long wait for the shuffle buffer to fill.
Tomorrow I'll run more experiments to see to what extent I can change this value (for the sake of curiosity).
I was very confused because I'm using pretty good hardware and wasn't getting good performance.
Thanks again!

@soares-f
Author

soares-f commented Mar 6, 2019

Just for information, the maximum sample_buffer_size I could use for this dataset was 8M.

Everything is working fine. Thanks :)

@guillaumekln
Contributor

Thanks for the information!
