This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Cpp CI may be broken #10974

Closed
asitstands opened this issue May 16, 2018 · 4 comments

Comments

@asitstands
Contributor

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10970/1/pipeline/717

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10311/36/pipeline/718

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10433/30/pipeline/721

[11:04:45] /work/mxnet/tests/cpp/engine/threaded_engine_test.cc:133: Stopping: NaiveEngine
[11:04:45] /work/mxnet/tests/cpp/engine/threaded_engine_test.cc:135: Stopped: NaiveEngine Starting...
[11:04:45] /work/mxnet/tests/cpp/engine/threaded_engine_test.cc:137: Started: NaiveEngine Done...
[11:04:45] /work/mxnet/tests/cpp/engine/threaded_engine_test.cc:133: Stopping: ThreadedEnginePooled
terminate called after throwing an instance of 'std::system_error'
  what():  Operation not permitted
/work/runtime_functions.sh: line 477:     7 Aborted                 (core dumped) build/tests/mxnet_unit_tests
build.py: 2018-05-16 11:04:45,909 Running of command in container failed (134): nvidia-docker run --rm -t --shm-size=500m -v /home/jenkins_slave/workspace/ut-cpp-gpu:/work/mxnet -v /home/jenkins_slave/workspace/ut-cpp-gpu/build:/work/build -u 1001:1001 mxnet/build.ubuntu_gpu /work/runtime_functions.sh unittest_ubuntu_gpu_cpp
build.py: 2018-05-16 11:04:45,909 You can try to get into the container by using the following command: nvidia-docker run --rm -t --shm-size=500m -v /home/jenkins_slave/workspace/ut-cpp-gpu:/work/mxnet -v /home/jenkins_slave/workspace/ut-cpp-gpu/build:/work/build -u 1001:1001 -ti --entrypoint /bin/bash mxnet/build.ubuntu_gpu /work/runtime_functions.sh unittest_ubuntu_gpu_cpp
into container: False
Traceback (most recent call last):
  File "ci/build.py", line 263, in <module>
    sys.exit(main())
  File "ci/build.py", line 204, in main
    container_run(platform, docker_binary, shared_memory_size, command)
  File "ci/build.py", line 126, in container_run
    raise subprocess.CalledProcessError(ret, cmd)
subprocess.CalledProcessError: Command 'nvidia-docker run --rm -t --shm-size=500m -v /home/jenkins_slave/workspace/ut-cpp-gpu:/work/mxnet -v /home/jenkins_slave/workspace/ut-cpp-gpu/build:/work/build -u 1001:1001 mxnet/build.ubuntu_gpu /work/runtime_functions.sh unittest_ubuntu_gpu_cpp' returned non-zero exit status 134
script returned exit code 1
@marcoabreu
Contributor

marcoabreu commented May 16, 2018

Confirmed. Steps to reproduce:

Instance: g3.8xlarge (don't forget nvidia-docker)

Build using

ci/build.py --download-docker-cache --docker-cache-bucket mxnet-ci-docker-cache-prod --platform ubuntu_gpu --shm-size 500m /work/runtime_functions.sh build_ubuntu_gpu_cmake

Run using

ci/build.py --download-docker-cache --docker-cache-bucket mxnet-ci-docker-cache-prod --nvidiadocker --platform ubuntu_gpu --shm-size 500m /work/runtime_functions.sh unittest_ubuntu_gpu_cpp

Failure log:

[----------] 4 tests from Engine
[ RUN      ] Engine.start_stop
[13:17:58] /work/mxnet/tests/cpp/engine/threaded_engine_test.cc:133: Stopping: NaiveEngine
[13:17:58] /work/mxnet/tests/cpp/engine/threaded_engine_test.cc:135: Stopped: NaiveEngine Starting...
[13:17:58] /work/mxnet/tests/cpp/engine/threaded_engine_test.cc:137: Started: NaiveEngine Done...
[13:17:58] /work/mxnet/tests/cpp/engine/threaded_engine_test.cc:133: Stopping: ThreadedEnginePooled
terminate called after throwing an instance of 'std::system_error'
  what():  Operation not permitted
/work/runtime_functions.sh: line 477:     7 Aborted                 (core dumped) build/tests/mxnet_unit_tests
build.py: 2018-05-16 13:17:59,886 Running of command in container failed (134): nvidia-docker run --rm -t --shm-size=500m -v /home/jenkins_slave/workspace/ut-cpp-gpu:/work/mxnet -v /home/jenkins_slave/workspace/ut-cpp-gpu/build:/work/build -u 1001:1001 mxnet/build.ubuntu_gpu /work/runtime_functions.sh unittest_ubuntu_gpu_cpp
build.py: 2018-05-16 13:17:59,886 You can try to get into the container by using the following command: nvidia-docker run --rm -t --shm-size=500m -v /home/jenkins_slave/workspace/ut-cpp-gpu:/work/mxnet -v /home/jenkins_slave/workspace/ut-cpp-gpu/build:/work/build -u 1001:1001 -ti --entrypoint /bin/bash mxnet/build.ubuntu_gpu /work/runtime_functions.sh unittest_ubuntu_gpu_cpp
into container: False
Traceback (most recent call last):
  File "ci/build.py", line 306, in <module>
    sys.exit(main())
  File "ci/build.py", line 243, in main
    container_run(platform, docker_binary, shared_memory_size, command)
  File "ci/build.py", line 155, in container_run
    raise subprocess.CalledProcessError(ret, cmd)
subprocess.CalledProcessError: Command 'nvidia-docker run --rm -t --shm-size=500m -v /home/jenkins_slave/workspace/ut-cpp-gpu:/work/mxnet -v /home/jenkins_slave/workspace/ut-cpp-gpu/build:/work/build -u 1001:1001 mxnet/build.ubuntu_gpu /work/runtime_functions.sh unittest_ubuntu_gpu_cpp' returned non-zero exit status 134
script returned exit code 1
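
A note on reading the exit status in the log above: shells conventionally report a process killed by a signal as 128 plus the signal number, so 134 corresponds to signal 6, which is SIGABRT on Linux. That matches the "Aborted (core dumped)" line. The helper below is only a small sketch for decoding such statuses; `describe_exit_status` is a hypothetical name and is not part of `ci/build.py`.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell-style exit status (hypothetical helper, not in ci/build.py)."""
    if status > 128:
        # Shell convention: 128 + N means the process was killed by signal N.
        try:
            return f"killed by {signal.Signals(status - 128).name}"
        except ValueError:
            return f"exit status {status} (no matching signal)"
    return f"exit status {status}"


if __name__ == "__main__":
    # 134 is the status reported in the failure log.
    print(describe_exit_status(134))
```

Statuses of 128 or below are reported unchanged, since they are ordinary exit codes rather than signals.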

@marcoabreu
Contributor

GDB reports "During startup program terminated with signal SIGABRT, Aborted.". Unfortunately, I can't get more information because the error is flaky: sometimes it happens several times in a row, sometimes it takes ages to show up.
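
Because the failure is intermittent, one way to catch it is to rerun the test command in a loop and stop at the first non-zero exit. The following is a minimal sketch under that assumption; `run_until_failure` and `max_attempts` are hypothetical names, and the demo uses a stand-in command rather than the real CI invocation.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_until_failure(
    cmd: Sequence[str], max_attempts: int = 50
) -> Optional[subprocess.CompletedProcess]:
    """Run cmd repeatedly; return the first failing run, or None if every run passes."""
    for attempt in range(1, max_attempts + 1):
        # Merge stderr into stdout so the captured log keeps its original ordering.
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        if result.returncode != 0:
            print(f"attempt {attempt} failed with status {result.returncode}")
            return result
    return None


if __name__ == "__main__":
    # Stand-in command that always fails; replace it with the real test command.
    failing = run_until_failure(
        [sys.executable, "-c", "import sys; sys.exit(3)"], max_attempts=3
    )
    if failing is not None:
        print(failing.stdout)
```

To chase this issue, `cmd` would be the "Run using" command from the reproduction steps, split into an argument list and run from the repository root. The captured `stdout` of the failing run then holds the full log for that attempt.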

@eric-haibin-lin
Member

@szha
Member

szha commented Jun 12, 2018

This has been fixed by #11065, thanks to @haojin2.

@szha closed this as completed Jun 12, 2018