FORCE-TERMINATE AT Data unpack would read past end of buffer #1172

Open
sj6077 opened this issue Jun 27, 2019 · 3 comments

sj6077 commented Jun 27, 2019

Environment:

  1. Framework: (TensorFlow, Keras, PyTorch, MXNet)
    TensorFlow
  2. Framework version:
    1.11
  3. Horovod version:
    0.16.4
  4. MPI version:
    4.0.1
  5. CUDA version:
    10.0
  6. NCCL version:
    2.4
  7. Python version:
    3.6
  8. OS and version:
    Ubuntu 18.04
  9. GCC version:
    7.4.0

Checklist:

  1. Did you search issues to find if somebody asked this question before?
    yes
  2. If your question is about hang, did you read this doc?
    yes
  3. If your question is about docker, did you read this doc?
    yes

Bug report:
I got the error below. It works fine when the number of MPI processes is less than 12, but the error appears with more than 12 processes. Can you let me know why this happens?
I also checked whether hwloc is installed; it is not installed on any machine.

`[elsa-05:25920] [[9783,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355
[elsa-02:18424] [[9783,0],2] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355

An internal error has occurred in ORTE:

[[9783,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(359)

This is something that should be reported to the developers.


An internal error has occurred in ORTE:

[[9783,0],2] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(359)

This is something that should be reported to the developers.
--------------------------------------------------------------------------`
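
When the same error repeats across many daemons, it can help to summarize which hosts reported it. The following is a minimal sketch that parses `ORTE_ERROR_LOG` lines of the shape shown above. The regular expression is inferred only from the two lines in this report, not from any Open MPI specification, and the field names (`job_family`, `job`, `rank`) and the helper `parse_orte_errors` are hypothetical labels chosen for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the log lines in this report (an assumption, not an
# official Open MPI format). Group names are illustrative labels only.
ORTE_LINE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\] "
    r"\[\[(?P<job_family>\d+),(?P<job>\d+)\],(?P<rank>\d+)\] "
    r"ORTE_ERROR_LOG: (?P<message>.+?) in file (?P<file>\S+) at line (?P<line>\d+)"
)


def parse_orte_errors(text):
    """Return one dict per ORTE_ERROR_LOG line found in text."""
    return [m.groupdict() for m in ORTE_LINE.finditer(text)]


# The two ORTE_ERROR_LOG lines copied from the report above.
sample = (
    "[elsa-05:25920] [[9783,0],1] ORTE_ERROR_LOG: Data unpack would read past "
    "end of buffer in file grpcomm_direct.c at line 355\n"
    "[elsa-02:18424] [[9783,0],2] ORTE_ERROR_LOG: Data unpack would read past "
    "end of buffer in file grpcomm_direct.c at line 355\n"
)

errors = parse_orte_errors(sample)
hosts = Counter(e["host"] for e in errors)  # number of error lines per host

for host, count in sorted(hosts.items()):
    print(f"{host}: {count} error line(s)")
```

Grouping by host this way only shows where the error was logged; it does not by itself identify the cause.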

sj6077 added the bug label Jun 27, 2019

sj6077 commented Jul 1, 2019

@alsrgv Yes, I confirmed that hwloc is not installed on any machine.

alsrgv commented Jul 1, 2019

@sj6077, this may be a good question to ask in https://github.com/open-mpi/ompi, perhaps in open-mpi/ompi#6762
