Failures in ML TooManyJobsIT on Debian 8 #66885
Pinging @elastic/ml-core (:ml) |
Both these failures indicate a problem determining the amount of memory on the machine. All the failures seem to happen on Debian 8. I think this is a special case of #66629. |
This isn't just 7.10. https://gradle-enterprise.elastic.co/s/blt6fjge3bkus is an example from 7.x and https://gradle-enterprise.elastic.co/s/xtx7u77ntetcy is an example from 7.11. Debian 8 isn't newly added to the test matrix, so I am not sure what changed 17 days ago when #66629 was opened. The worry is that this isn't purely a test issue and is affecting end users on Debian 8. |
Between https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.10+multijob-unix-compatibility/os=debian-8&&immutable/141/consoleFull (success) and https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.10+multijob-unix-compatibility/os=debian-8&&immutable/142/consoleFull (failure) the runtime JDK used on the Debian 8 CI workers was upgraded from 8u241 to 8u271 by the look of it, so maybe it’s the combination of Debian 8 with Java 8u271. JDK 8 is no longer supported with 8.x, so this would also explain why the failures aren't seen on master. Hopefully this will also limit the number of affected end users, as the bundled JDK is not Java 8. |
Another bunch of similar failures: https://gradle-enterprise.elastic.co/s/a4wwanx2zcwyk This time it's
And the error messages are as follows:
A new failure message is
It is not reproducible with:
|
Another one, this time is Three failures:
and the error messages are similar:
Reproduce line:
Not reproducible. |
Another similar one for 7.11 on debian-8: https://gradle-enterprise.elastic.co/s/4yd24wpnfyfom |
@ywangd just wanted to clarify when you say “not reproducible” are you trying on Debian 8? (I am not saying you should try on Debian 8, just that since every failure has happened on Debian 8 it’s probably not worth bothering on other distributions.) |
No, I didn't try on Debian 8; it was on macOS. I should have made that explicit. The original title had |
Another one in |
#67089 (comment) contains the likely explanation. I'm not sure what changed in mid-December though that made this start failing. |
The JDK 8 version on Debian 8 was upgraded between December 15th and December 17th, at which point OsProbeTests started failing because memory is reported as 0. December 15th build using JDK 8u241: December 17th build, first OsProbeTests failure, using 8u271: This Java bug was marked fixed for 8u272: Checking the code of Oracle Java 8u271, it does include at least the Java parts of the change, in which a missing memory subsystem is interpreted as 0 memory. Given that this is fixed in Java 15, and that at least in the past it was normal not to have a memory subsystem, it looks like a Java bug. |
I had a look on some CI servers for various supported platforms to see what this looks like in the file system.
There is a comment in https://stackoverflow.com/questions/21337522/trying-to-use-cgroups-in-debian-wheezy-and-no-daemons that "Debian disables the CentOS 6 appears not to mount cgroups by default, and this doesn't appear to confuse Java 8u271, so it must be that if the memory subsystem is missing but everything is missing then that's OK. Ubuntu is based on Debian so I suspect the versions based on Debian 7/8 will suffer the same problem. Thankfully this doesn't impact our support matrix enormously, as the last such Ubuntu version was 15.10. Ubuntu 16.04 is based on Debian 9 and that's the oldest we support in ES 7.x. Based on this the only currently supported combination affected apart from Debian 8 would be ES 6.8 on Ubuntu 14.04. To summarize, this problem affects:
Then the workarounds would be to either enable the memory subsystem - instructions in https://dawnbringer.net/blog/1033/cgroup%20support - or else upgrade Java to a fixed version. Since failure to obtain the amount of memory on a node is really bad for ML we will document this as a known issue for ML. |
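For anyone who wants to check a machine for the missing memory subsystem discussed above, here is a minimal sketch. It assumes the cgroup v1 `/proc/cgroups` layout (columns `subsys_name`, `hierarchy`, `num_cgroups`, `enabled`). The function name and the sample text are made up for illustration. This is only a way to inspect the host, not a description of what the JDK itself reads.

```python
from pathlib import Path


def memory_controller_enabled(proc_cgroups_text: str) -> bool:
    """Return True if the cgroup v1 'memory' controller is listed as enabled.

    Hypothetical helper: expects text in the /proc/cgroups format, whose
    columns are subsys_name, hierarchy, num_cgroups, enabled.
    """
    for line in proc_cgroups_text.splitlines():
        # Skip the header line and any blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "memory":
            return fields[3] == "1"
    # No 'memory' row at all also counts as "not enabled".
    return False


# Illustrative sample only (not captured from a real CI worker).
SAMPLE_DISABLED = (
    "#subsys_name\thierarchy\tnum_cgroups\tenabled\n"
    "cpuset\t2\t1\t1\n"
    "cpu\t3\t1\t1\n"
    "memory\t0\t1\t0\n"
)

if __name__ == "__main__":
    path = Path("/proc/cgroups")
    if path.exists():
        print("memory controller enabled:", memory_controller_enabled(path.read_text()))
    else:
        print("/proc/cgroups not found (not Linux, or /proc not mounted)")
```

If this reports that the controller is not enabled on an affected host, the two workarounds mentioned above apply: enable the memory subsystem or move to a fixed Java version.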
The 4 affected tests will be selectively muted on Debian 8 when #67422 is merged and backported. |
The selective muting implemented for autoscaling in elastic#67159 is extended to the ML tests that also fail when machine memory is reported as 0. Most of the logic to determine when memory will not be accurately reported is now in a utility method in the base class. Relates elastic#66885 Backport of elastic#67422
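As a rough illustration of what a "will memory be misreported?" check could look like, here is a sketch in Python. It is not the utility method from elastic#67422. The function names, the `PRETTY_NAME` prefix match and the `>= 271` threshold are all assumptions based only on the observations in this thread (8u241 passing, 8u271 failing, Debian 8 only).

```python
import re
from typing import Optional


def java_8_update(java_version: str) -> Optional[int]:
    """Extract the update number from a Java 8 version string such as '1.8.0_271'.

    Returns None for anything that is not a Java 8 'java.version' string.
    """
    match = re.fullmatch(r"1\.8\.0_(\d+)(?:-.*)?", java_version)
    return int(match.group(1)) if match else None


def memory_may_be_misreported(os_pretty_name: str, java_version: str) -> bool:
    """Hypothetical predicate: True for the Debian 8 + Java 8u271-or-later combination.

    os_pretty_name is assumed to be PRETTY_NAME from /etc/os-release,
    e.g. 'Debian GNU/Linux 8 (jessie)'. The 271 threshold is an assumption
    taken from the failures reported in this issue.
    """
    on_debian_8 = re.match(r"Debian GNU/Linux 8\b", os_pretty_name) is not None
    update = java_8_update(java_version)
    return on_debian_8 and update is not None and update >= 271
```

A test could then skip itself when `memory_may_be_misreported(...)` is true rather than being muted on every platform.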
@droberts195 This morning I spotted another failure for tests already mentioned in this issue: 7.11 with Debian 8. Judging by the investigation done and the code merged two days ago, there shouldn't be any more failures. For the failing tests, should we proactively mute them using Build scan: https://gradle-enterprise.elastic.co/s/d4utquzs42lsm
|
The two failing tests are YAML tests. I don't know a way to selectively mute based on a complex condition in the |
I noticed the following 3 yaml test failures today on Debian 8:
The tests ( I think these failures are related to this issue. Would someone be able to confirm this? |
Yes, all the test failures on Debian 8 that relate to memory being reported as zero are basically the same thing.
👍 |
Tests still failing, e.g. https://gradle-enterprise.elastic.co/s/nmzh3f4juwn4e . |
Some tests fitting this issue still fail today on 7.10, e.g. Not sure if this should or shouldn't happen any more with elastic/infra#26251 being merged, maybe @albertzaharovits can see whether this is some different configuration/issue or not? |
Also 7.x just now: https://gradle-enterprise.elastic.co/s/nndh4zezzcr7i |
Today's failures are still using Java 8u271: you can search for |
It looks like at least https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.11+multijob-unix-compatibility/os=debian-8&&immutable/86/ is using oracle-8u281 now, but it is apparently still experiencing issues similar to those reported here. |
I have now muted the failing YAML tests in 7.x, 7.11 and 7.10. Please remove the general skip on all platforms when #67681 makes it possible to be more selective here. |
#67681 is ready for review |
This should be fixed by #68542. |
Unmute the YAML tests that were muted due to the problem of elastic#66885. The underlying problem was fixed by elastic#68542.
This has been failing a bunch of times on 7.10 recently:
https://gradle-enterprise.elastic.co/s/ppzyiud65lopu
Interestingly enough, instances of this failure coincide with the following REST test failure twice today.