This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Address already in use during tutorial test #11120

Closed

marcoabreu opened this issue Jun 1, 2018 · 30 comments

Comments

@marcoabreu
Contributor

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10656/7/pipeline/

+ nosetests-3.4 test_tutorials.py --nologcapture

....[19:25:36] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: ./data/caltech.rec, use 4 threads for decoding..

.......[19:27:51] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)

.....[19:28:58] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)

..[19:32:06] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)

[19:37:22] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)

....[19:39:25] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)

.[19:39:51] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...

[19:39:51] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!

.....[19:40:24] src/operator/././../common/utils.h:416: Optimizer with lazy_update = True detected. Be aware that lazy update with row_sparse gradient is different from standard update, and may lead to different empirical results. See https://mxnet.incubator.apache.org/api/python/optimization/optimization.html for more details.

..[19:40:43] src/operator/././../common/utils.h:416: Optimizer with lazy_update = True detected. Be aware that lazy update with row_sparse gradient is different from standard update, and may lead to different empirical results. See https://mxnet.incubator.apache.org/api/python/optimization/optimization.html for more details.

..Traceback (most recent call last):

  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main

    "__main__", mod_spec)

  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code

    exec(code, run_globals)

  File "/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py", line 16, in <module>

    app.launch_new_instance()

  File "/usr/local/lib/python3.5/dist-packages/traitlets/config/application.py", line 657, in launch_instance

    app.initialize(argv)

  File "<decorator-gen-123>", line 2, in initialize

  File "/usr/local/lib/python3.5/dist-packages/traitlets/config/application.py", line 87, in catch_config_error

    return method(app, *args, **kwargs)

  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelapp.py", line 456, in initialize

    self.init_sockets()

  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelapp.py", line 248, in init_sockets

    self.control_port = self._bind_socket(self.control_socket, self.control_port)

  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelapp.py", line 180, in _bind_socket

    s.bind("tcp://%s:%i" % (self.ip, port))

  File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind

  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc

zmq.error.ZMQError: Address already in use

F..

======================================================================

FAIL: test_tutorials.test_unsupervised_learning_gan

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest

    self.test(*self.arg)

  File "/work/mxnet/tests/tutorials/test_tutorials.py", line 203, in test_unsupervised_learning_gan

    assert _test_tutorial_nb('unsupervised_learning/gan')

AssertionError: 

-------------------- >> begin captured stdout << ---------------------

Kernel died before replying to kernel_info



--------------------- >> end captured stdout << ----------------------



----------------------------------------------------------------------

Ran 35 tests in 1024.216s



FAILED (failures=1)

build.py: 2018-05-31 19:41:38,780 Running of command in container failed (1): nvidia-docker run --rm -t --shm-size=3g -v /home/jenkins_slave/workspace/it-tutorials-py2:/work/mxnet -v /home/jenkins_slave/workspace/it-tutorials-py2/build:/work/build -u 1001:1001 mxnet/build.ubuntu_gpu /work/runtime_functions.sh tutorialtest_ubuntu_python2_gpu

build.py: 2018-05-31 19:41:38,780 You can try to get into the container by using the following command: nvidia-docker run --rm -t --shm-size=3g -v /home/jenkins_slave/workspace/it-tutorials-py2:/work/mxnet -v /home/jenkins_slave/workspace/it-tutorials-py2/build:/work/build -u 1001:1001 -ti --entrypoint /bin/bash mxnet/build.ubuntu_gpu /work/runtime_functions.sh tutorialtest_ubuntu_python2_gpu

into container: False

Traceback (most recent call last):

  File "ci/build.py", line 307, in <module>

    sys.exit(main())

  File "ci/build.py", line 243, in main

    container_run(platform, docker_binary, shared_memory_size, command)

  File "ci/build.py", line 154, in container_run

    raise subprocess.CalledProcessError(ret, cmd)

subprocess.CalledProcessError: Command 'nvidia-docker run --rm -t --shm-size=3g -v /home/jenkins_slave/workspace/it-tutorials-py2:/work/mxnet -v /home/jenkins_slave/workspace/it-tutorials-py2/build:/work/build -u 1001:1001 mxnet/build.ubuntu_gpu /work/runtime_functions.sh tutorialtest_ubuntu_python2_gpu' returned non-zero exit status 1

script returned exit code 1

Possibly related to hardcoding ports.

@marcoabreu
Contributor Author

@ThomasDelteil

@ThomasDelteil
Contributor

I have been tracking this issue for a while and am still not sure why it is happening. My first guess was that the previous notebook run didn't clear the socket in time for the next run.

I added a delay between each notebook run, which should have been plenty of time for the socket to be released.

I'll do some more digging.

@marcoabreu, how many containers are running concurrently per instance? Only one, right?

@marcoabreu
Contributor Author

marcoabreu commented Jun 1, 2018

They are running concurrently; we run up to 4 containers in parallel. Why should it matter whether a socket is released or not? First, networking is virtualized per container, so there should be no problem on that side. Second, we should not require a specific port but just use a random free one.
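
For illustration only (this is not what the test harness or ipykernel currently does), pyzmq can pick a free port itself rather than the caller supplying one; a minimal sketch:

```python
# Minimal sketch, assuming pyzmq is installed; not the ipykernel code itself.
# bind_to_random_port() keeps trying ports in the given range until one binds
# and returns the chosen port, so a pre-selected port can never collide.
import zmq

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.ROUTER)

port = sock.bind_to_random_port("tcp://127.0.0.1",
                                min_port=49152, max_port=65535)
print("bound to free port", port)

sock.close(linger=0)
ctx.term()
```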

@ThomasDelteil
Contributor

The ports are already random, but I don't know whether availability is checked. I think the issue could be that the ports are not released and that, on rare occasions, the same random port is picked again for the next run.

I have also seen a 'linger=1000', which could mean that the port is still taken for 1s after closing, so increasing the delay between notebooks from 0.5s to 1.1s could also solve the issue.

I'll see if I can make the ports deterministically different.
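
A rough sketch of both ideas, checking availability and making the port block deterministically different per run. Note that is_port_free() and pick_run_ports() are hypothetical helpers for illustration, not functions from test_tutorials.py or ipykernel:

```python
# Hedged sketch only; these helpers are illustrative, not part of the CI code.
import socket

def is_port_free(port, host="127.0.0.1"):
    """True if nothing is currently bound to host:port (still racy between
    the check and the eventual bind, but avoids most clashes)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

def pick_run_ports(run_index, count=5, base=20000, stride=100):
    """Deterministically different port block per run: run 0 gets 20000-20099,
    run 1 gets 20100-20199, and so on, so runs never overlap."""
    start = base + run_index * stride
    free = [p for p in range(start, start + stride) if is_port_free(p)]
    return free[:count]

print(pick_run_ports(run_index=0))
```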

@marcoabreu
Contributor Author

I'm not a fan of the delay between notebooks anyway because it masks a problem. I'd propose that we remove the delay entirely and track down all the issues that come from that. Otherwise, we're flaky and depend on timing.

@ThomasDelteil
Contributor

ThomasDelteil commented Jun 1, 2018

The linger variable is hard-coded in kernelapp.py. If we don't want to use this technique, then we need to assign deterministic ports that are different from each other and update the notebook's config file between runs.
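
As a sketch of what that could look like: the write_connection_file() helper and the port numbers below are made up for illustration; only the JSON layout follows Jupyter's kernel connection-file format.

```python
# Hedged sketch: write a kernel connection file with explicit, non-overlapping
# ports per run. Helper name and port values are illustrative only.
import json
import uuid

def write_connection_file(path, base_port):
    conn = {
        "transport": "tcp",
        "ip": "127.0.0.1",
        # Five consecutive ports per kernel; shift base_port per run so two
        # runs never ask for the same ports.
        "shell_port": base_port,
        "iopub_port": base_port + 1,
        "stdin_port": base_port + 2,
        "control_port": base_port + 3,
        "hb_port": base_port + 4,
        "key": uuid.uuid4().hex,
        "signature_scheme": "hmac-sha256",
    }
    with open(path, "w") as f:
        json.dump(conn, f, indent=2)

# e.g. run N could use base_port = 21000 + 10 * N
write_connection_file("/tmp/kernel-run0.json", base_port=21000)
```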

@marcoabreu
Contributor Author

I think that's what we should do. In general, this is required to pave the way for parallel execution.

@access2rohit
Contributor

Hi Marco,
To reproduce this issue I have to build the docs within MXNet, since it requires the gan.ipynb file. Can you share the setup to reproduce it?

@marcoabreu
Contributor Author

@ThomasDelteil would you mind assisting here?

@anirudh2290
Member

anirudh2290 commented Jun 27, 2018

Assigned to @reminisce. @access2rohit is working on this.

@ThomasDelteil
Contributor

ThomasDelteil commented Jun 28, 2018 via email

@marcoabreu
Contributor Author

marcoabreu commented Jun 28, 2018 via email

@access2rohit
Contributor

@ThomasDelteil: Can you specify which "Jenkins functions for tutorial tests" you are referring to in order to reproduce the tests?

@marcoabreu
Contributor Author

marcoabreu commented Jul 12, 2018

@ThomasDelteil
Contributor

ThomasDelteil commented Jul 17, 2018

@marcoabreu the suite was run 300 times and the "address already in use" error wasn't triggered. Can we close this issue?

@access2rohit
Contributor

@marcoabreu The address-in-use issue was not triggered by running the suite a couple hundred times. Should this issue be closed? If not, can you share the exact setup to reproduce it? I worked with @ThomasDelteil to get these tests to run, and since a single run takes a while to complete, I ran the suite 300 times to try to reproduce the error, but it still succeeded.

@marcoabreu
Contributor Author

Did you remove the thread.sleep you introduced to work around this error?

The environment is the regular CI pipeline.

I'll leave it up to you, Thomas.

@ThomasDelteil
Contributor

The sleep is still there and I think it is necessary. It doesn't add much extra delay: 30s on a 30-minute test run is roughly 1.7% overhead, which is acceptable.

I would close it, thanks.

@marcoabreu
Contributor Author

The question is whether we are masking a problem with that sleep. It feels like fixing a race condition by adding a sleep.

@ThomasDelteil
Contributor

There is a hardcoded 1000 ms linger on the socket in the Jupyter code base, which I believe is the root cause of the problem.

We could fork Jupyter or monkey-patch it, but I think the sleep is the better trade-off between performance and maintainability.
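
In other words, the workaround amounts to something like the sketch below, where run_notebook() stands in for the real helper in the test file (its actual name and signature may differ):

```python
# Rough sketch of the sleep workaround, assuming the 1000 ms linger hardcoded
# in ipykernel's kernelapp.py; run_notebook() is a stand-in, not the real helper.
import time

SOCKET_LINGER_SECONDS = 1.0   # hardcoded linger in the Jupyter kernel
SAFETY_MARGIN_SECONDS = 0.1   # wait a little longer than the linger

def run_all(notebooks, run_notebook):
    results = {}
    for nb in notebooks:
        results[nb] = run_notebook(nb)
        # Give the previous kernel time to fully release its ZMQ ports
        # before the next kernel tries to bind.
        time.sleep(SOCKET_LINGER_SECONDS + SAFETY_MARGIN_SECONDS)
    return results
```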

@marcoabreu
Contributor Author

Okay, that makes sense. Please note that even with the increased delay we sometimes experienced this error.

To reproduce, please run multiple instances of the tests in parallel and check whether the error occurs.

We can easily increase the delay to more than a second if that fixes the problem.

@ThomasDelteil
Contributor

ThomasDelteil commented Jul 25, 2018

To clarify: I first put in a 0.5s delay as a wild guess, back when we still had the error. Only then did I find out about the 1000 ms socket linger and increased the delay to 1.1s, but the test was already disabled at that point. @access2rohit tested that version and didn't reproduce the bug.

@marcoabreu
Contributor Author

Oh, I was under the impression that we disabled the test after adding the 1.1s delay because it was still failing and we didn't know why, since it should have worked given the 1s linger.

@ThomasDelteil
Contributor

@marcoabreu
Contributor Author

marcoabreu commented Jul 25, 2018

Great, thanks a lot for the link Thomas!

Can we bring the tutorial tests into nightly and then close this issue? Currently, this issue serves as the master ticket marking the tests as disabled.

@ThomasDelteil
Contributor

Sure, it's in the works by @vishaalkapoor

@marcoabreu
Contributor Author

marcoabreu commented Aug 3, 2018

Happened again: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTests_onBinaries/detail/NightlyTests_onBinaries/102/pipeline

+ nosetests-3.4 --with-xunit --xunit-file nosetests_straight_dope_python3_single_gpu.xml test_notebooks_single_gpu.py --nologcapture

.......[01:16:58] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)

.....[01:28:00] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)

[01:28:00] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)

.Traceback (most recent call last):

  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main

    "__main__", mod_spec)

  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code

    exec(code, run_globals)

  File "/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py", line 16, in <module>

    app.launch_new_instance()

  File "/usr/local/lib/python3.5/dist-packages/traitlets/config/application.py", line 657, in launch_instance

    app.initialize(argv)

  File "<decorator-gen-123>", line 2, in initialize

  File "/usr/local/lib/python3.5/dist-packages/traitlets/config/application.py", line 87, in catch_config_error

    return method(app, *args, **kwargs)

  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelapp.py", line 456, in initialize

    self.init_sockets()

  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelapp.py", line 238, in init_sockets

    self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)

  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelapp.py", line 180, in _bind_socket

    s.bind("tcp://%s:%i" % (self.ip, port))

  File "zmq/backend/cython/socket.pyx", line 549, in zmq.backend.cython.socket.Socket.bind

  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc

zmq.error.ZMQError: Address already in use

ERROR:root:Kernel died before replying to kernel_info

F...........................[01:48:50] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)

.

======================================================================

FAIL: test_generative_adversarial_networks (test_notebooks_single_gpu.StraightDopeSingleGpuTests)

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/work/mxnet/tests/nightly/straight_dope/test_notebooks_single_gpu.py", line 274, in test_generative_adversarial_networks

    assert _test_notebook('chapter14_generative-adversarial-networks/conditional')

AssertionError



----------------------------------------------------------------------

Ran 42 tests in 2291.515s



FAILED (failures=1)

build.py: 2018-08-03 01:49:31,003 Running of command in container failed (1):

nvidia-docker\

	run\

	--rm\

	-t\

	--shm-size=500m\

	-v\

	/home/jenkins_slave/workspace/straight_dope-single_gpu:/work/mxnet\

	-v\

	/home/jenkins_slave/workspace/straight_dope-single_gpu/build:/work/build\

	-v\

	/tmp/ci_ccache:/work/ccache\

	-u\

	1001:1001\

	-e\

	CCACHE_MAXSIZE=500G\

	-e\

	CCACHE_TEMPDIR=/tmp/ccache\

	-e\

	CCACHE_DIR=/work/ccache\

	-e\

	CCACHE_LOGFILE=/tmp/ccache.log\

	mxnetci/build.ubuntu_nightly_gpu\

	/work/runtime_functions.sh\

	nightly_straight_dope_python3_single_gpu_tests



build.py: 2018-08-03 01:49:31,003 You can get into the container by adding the -i option

Traceback (most recent call last):

  File "ci/build.py", line 408, in <module>

    sys.exit(main())

  File "ci/build.py", line 337, in main

    local_ccache_dir=args.ccache_dir, interactive=args.interactive)

  File "ci/build.py", line 224, in container_run

    raise subprocess.CalledProcessError(ret, cmd)

subprocess.CalledProcessError: Command 'nvidia-docker\

	run\

	--rm\

	-t\

	--shm-size=500m\

	-v\

	/home/jenkins_slave/workspace/straight_dope-single_gpu:/work/mxnet\

	-v\

	/home/jenkins_slave/workspace/straight_dope-single_gpu/build:/work/build\

	-v\

	/tmp/ci_ccache:/work/ccache\

	-u\

	1001:1001\

	-e\

	CCACHE_MAXSIZE=500G\

	-e\

	CCACHE_TEMPDIR=/tmp/ccache\

	-e\

	CCACHE_DIR=/work/ccache\

	-e\

	CCACHE_LOGFILE=/tmp/ccache.log\

	mxnetci/build.ubuntu_nightly_gpu\

	/work/runtime_functions.sh\

	nightly_straight_dope_python3_single_gpu_tests' returned non-zero exit status 1

script returned exit code 1

@ThomasDelteil
Contributor

ThomasDelteil commented Aug 3, 2018 via email

vishaalkapoor added a commit to vishaalkapoor/incubator-mxnet that referenced this issue Oct 11, 2018
The tutorial tests were disabled due to a flaky port acquisition strategy (apache#11170). The issue has been remedied (apache#11120). This change re-enables the tutorial tests in the Nightly test suite.
@vishaalkapoor
Contributor

Notes:

@srochel
Contributor

srochel commented Nov 10, 2018

#13099 is merged, please close.
Why did we not use the mechanism described in https://blog.github.com/2013-05-14-closing-issues-via-pull-requests/, i.e. add "fixes #13099" in the PR? Are people not aware of it?
