Address already in use during tutorial test #11120
Comments
I tracked this issue for a while and I'm still not sure why it is happening. My first guess was that the previous notebook run didn't clear the socket in time for the next run. I added a delay between each notebook run, which should have been plenty of time for the socket to be released. I'll do some more digging. @marcoabreu, how many containers are running per instance concurrently? Only one, right? |
They are running concurrently. We got up to 4 containers in parallel. Why should it matter whether a socket is released or not? First of all, networking is virtualized per container, so there should be no problem on that side. Second, we should not require a certain port but just use a random free one. |
The ports are already random, but I don't know whether availability is checked. I think the issue could be that the ports are not released, and on rare occasions the same random port gets picked again before it is freed. I have also seen a 'linger=1000', which could mean that the port might still be taken 1s after closing, so increasing the delay between notebooks from 0.5s to 1.1s could also solve the issue. I'll see if I can make the ports deterministically different. |
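For illustration, here is a minimal sketch of the suspected failure mode using plain Python sockets (this is not the MXNet or Jupyter test code, just an illustration of the mechanics): a port that has not been released yet fails a new bind with "Address already in use", which is why both the delay and an availability check are being discussed.

```python
# Illustrative only: not the MXNet/Jupyter test code, just a minimal demo of the
# suspected failure mode with plain Python sockets.
import socket

def port_is_free(port, host="127.0.0.1"):
    """Best-effort check: try to bind the port and release it immediately."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:  # EADDRINUSE if something still holds the port
            return False

# Hold a port, as a lingering kernel socket might after a notebook run...
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))   # let the OS pick a free port
port = holder.getsockname()[1]

print(port_is_free(port))       # False: "Address already in use" territory

holder.close()
print(port_is_free(port))       # True once the socket is actually released
```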
I'm not a fan of the delay in between notebooks anyway because it masks a problem. I'd propose that we now remove the delay entirely and track down all the issues coming from that. Otherwise, we're flaky and depend on timing. |
I think that's what we should do. In general, this is required to prepare the path for parallel execution. |
Hi Marco,
@ThomasDelteil would you mind assisting here? |
assigned to @reminisce. @access2rohit is working on this. |
To reproduce the setup, I suggest looking at the Jenkins function for the tutorial tests.
I had put a fix in my last PR before removing them from CI, so the issue might be gone already. Can someone start a few hundred runs of the tutorial tests and see if it still happens? Note that they take ~25min, so that could take a few days.
Actually, commenting out most tests except three very fast ones might be a better idea, since the problem isn't related to a specific test and a simple test runs in 2-3s with the Jupyter kernel overhead. To know which ones are fast, check the tutorials; some, like the NDArray ones, don't do much.
My current best guess is that the issue is related to the fact that the ports used by Jupyter's internal mechanism are chosen randomly, and that there is a linger=1000 hard-coded somewhere in the Jupyter code that keeps a port in use for 1s after closing. For every test there is a ~1/10000 chance that the same port will be reused (3 ports are picked between 1 and 100000), which becomes ~1/300 because we have 30 tests and ~1/150 because we run on both Python 2 and Python 3. That seems roughly consistent with the number of reports we've had, about once every 150 CI runs.
There is no easy way to set the ports to a fixed, deterministic value. My latest fix added a non-ideal 1.1s sleep between tests. Let's see if that fixed it. The above explanation might be bogus, too.
I'm on my phone in a plane and can assist more from Friday onwards.
Thanks for looking into it @reminisce and @access2rohit!
|
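As a rough back-of-the-envelope sketch, the estimate above can be reproduced like this (the 1/10000 per-test figure is taken directly from the comment, not re-derived):

```python
# Back-of-the-envelope check of the collision estimate above; the per-test
# probability is the figure quoted in the comment, not re-derived here.
p_per_test = 1.0 / 10_000    # chance one test re-picks a still-lingering port
tests_per_run = 30           # tutorial notebooks per suite run
interpreters = 2             # the suite runs under Python 2 and Python 3

p_per_run = p_per_test * tests_per_run * interpreters
print(f"expected: ~1 failure every {1 / p_per_run:.0f} CI runs")  # ~1 in 167, roughly 1/150
```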
Just to be on the safe side: please try what Thomas suggested only locally and don't submit these jobs to CI. This test suite causes resource exhaustion on our CI, and running so many in parallel would basically knock out our instances.
|
@ThomasDelteil: Can you specify which "Jenkins functions for tutorial tests" you are referring to, so the tests can be reproduced? |
They are defined here: https://github.com/apache/incubator-mxnet/blob/master/ci/docker/runtime_functions.sh#L580. The tests were disabled in #11170. |
@marcoabreu the suite was run 300 times and the "address already in use" error wasn't triggered. Can we close this issue? |
@marcoabreu The address-in-use issue was not triggered by running the suite a couple hundred times. Should this issue be closed? If not, can you share the exact setup to reproduce it? I worked with @ThomasDelteil to get these tests to run, and since a single run takes a while to complete, I ran the suite 300 times to try to reproduce the error, but it still succeeded every time. |
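For reference, a repeat-run harness along these lines can be sketched as below; the test command is a placeholder and should be replaced with the actual tutorial-test invocation from ci/docker/runtime_functions.sh:

```python
# Sketch of a local repeat-run harness; CMD is a hypothetical placeholder and
# must be replaced with the real tutorial-test invocation for your setup.
import subprocess

CMD = ["python", "tests/tutorials/test_tutorials.py"]  # placeholder command
RUNS = 300

failures = 0
for i in range(RUNS):
    result = subprocess.run(CMD, capture_output=True, text=True)
    output = result.stdout + result.stderr
    if result.returncode != 0 and "Address already in use" in output:
        failures += 1
        print(f"run {i}: hit the address-already-in-use error")

print(f"{failures}/{RUNS} runs failed with the error")
```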
Did you remove the thread.sleep you introduced to work around this error? The environment is the regular CI pipeline. I'll leave it up to you, Thomas. |
The sleep is still there and I think it is necessary. It doesn't add much extra delay: 30s on a 30min test run is about 1.6% overhead, which is acceptable. I would close it, thanks. |
The question is whether we are masking a problem with that sleep. It feels like fixing a race condition by adding a sleep. |
There is a hardcoded 1000ms linger on the socket in the Jupyter code base, which I believe is the root cause of the problem. We could fork Jupyter or monkey-patch it, but I think the sleep is the better trade-off between performance and maintainability. |
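For context, a minimal pyzmq sketch (illustrative only, not Jupyter's actual code) of how a non-zero LINGER can keep a just-closed socket, and its bound port, alive for up to that many milliseconds, which is why a sleep slightly longer than one second clears it:

```python
# Illustrative pyzmq sketch, not Jupyter's actual code: a non-zero LINGER lets a
# closed socket (and its bound port) live on for up to LINGER ms while pending
# messages are flushed, so a freshly-closed port may not be reusable immediately.
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PUSH)
sock.setsockopt(zmq.LINGER, 1000)                    # the 1000 ms value discussed above
port = sock.bind_to_random_port("tcp://127.0.0.1")   # Jupyter also picks random ports
print("bound to", port)

# With undelivered messages queued, close()/term() keep the socket (and the bound
# port) alive for up to LINGER ms; setting LINGER to 0 instead drops them at once.
sock.close()
ctx.term()
```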
Okay, that makes sense. Please note that even with the increased delay we sometimes experienced this error. For reproducing, please run multiple instances of the tests in parallel to see if the error occurs. We can easily increase the delay to more than a second if that fixes the problem. |
To clarify: I first put in a 0.5s delay as a wild guess while we still had the error. Only then did I find out about the 1000ms socket linger and increased the delay to 1.1s, but the tests were already disabled at that point. @access2rohit tested that version and didn't reproduce the bug. |
Oh, I was under the impression that we disabled the tests after adding the 1.1s delay because they were still failing and we didn't know why, since that should have worked given the 1s timeout. |
Great, thanks a lot for the link, Thomas! Can we bring the tutorial tests into the nightly suite and then close this issue? Currently, this issue serves as the master ticket marking the tests as disabled. |
Sure, it's in the works by @vishaalkapoor |
Happened again: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTests_onBinaries/detail/NightlyTests_onBinaries/102/pipeline
|
Looks like the sleep(1.1) got removed, as far as I can see.
Edit: it's not; investigating with Vishaal. It seems we might have misdiagnosed this error: it happens consistently when a notebook crashes.
|
The tutorial tests were disabled due to a flaky port acquisition strategy (apache#11170). The issue has been remedied (apache#11120). This change re-enables the tutorial tests in the Nightly test suite.
Notes:
|
#13099 is merged, please close. |
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10656/7/pipeline/
Possibly related to hardcoding ports.