[CI] mixed-cluster:v6.0.0#mixedClusterTestRunner failure #73459
Pinging @elastic/es-delivery (Team:Delivery)
I can't reproduce this locally with the exact same Java versions. Perhaps it is something to do with the underlying system?
That stood out to me because we hard-code these jobs to run on centos-7, but the versions don't line up: in this case it's versions 5.6-6.0 that are broken, while the cgroups issue is with 5.0-5.3. I also find it odd that the stack trace above is being thrown by the current version (6.8.17), yet these failures only occur with certain BWC versions... odd.
I don't quite understand the stack trace, but is it possible the 6.8.17 node is trying to read this information from the BWC node, which is reporting things wrong, and that's what's blowing up?
@williamrandolph I'm picking on you since you seem to be the latest person to touch some of this code, so you might have some context or "fresh" memory of what might be happening here. Is it possible that a minor change to the runtime JDK for these BWC clusters is causing them to report bad stats, which then causes the 6.8.17 node in the cluster to fail to start?
I do think that's possible, though it would take some digging to confirm. I remember looking into this and digging around in JVM code, but I haven't managed to find my notes. In our current code, we have guards that return 0 instead of a negative value if the OS reports used memory higher than free memory, or something like that. We also have assertions that catch situations where a negative value is being passed. It's clear how this can be a problem in mixed clusters, where old-version nodes pass bad values to newer nodes, which then fail to deserialize the NodeStats. It might be better to update the 6.8 code to warn in this situation, but I'd like to see if we can figure out what change in the JDK code brought this up on our test OSes. Linking my PR for awareness: #68554
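For illustration, here is a minimal sketch of the kind of probe-side guard and assertion described above. The class and field names (MemoryProbe, MemoryStats) are hypothetical stand-ins, not the actual Elasticsearch OsProbe/OsStats code, whose fields and checks differ in detail:

```java
import java.util.Locale;

public class MemoryProbe {

    public static final class MemoryStats {
        final long total;
        final long free;

        MemoryStats(long total, long free) {
            // Assertions catch callers that construct stats from bad values directly.
            assert total >= 0 : "total must be >= 0, got " + total;
            assert free >= 0 && free <= total
                : String.format(Locale.ROOT, "free out of range: free=%d total=%d", free, total);
            this.total = total;
            this.free = free;
        }
    }

    public MemoryStats probe(long reportedTotal, long reportedFree) {
        // Guard: the OS (or a particular JDK build) may report -1, or free > total.
        long total = Math.max(reportedTotal, 0L);
        long free = Math.min(Math.max(reportedFree, 0L), total);
        return new MemoryStats(total, free);
    }
}
```

The guard protects the local node, but as noted above it does nothing for values that arrive over the wire from an older node that lacks the fix.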
@williamrandolph what area should own this? Core/Infra? |
While backporting that change to warn and use 0 in place of negative values is good for 6.8, it won't fix the issue for already-released versions of Elasticsearch, like the 6.0.0 this CI issue references. Aside from whatever the underlying JDK change may have been, I think we should change 7.x/master to be lenient when reading from older nodes, so that instead of asserting >= 0, they convert negative values from those nodes to 0 when reading (and can then warn).
These failures are on the 6.8 branch though. Don't you mean we should build that leniency in there? There is no wire-compatibility between 7.x/master and 6.0. |
Ah, I misread, I thought it was 7.x testing bwc with 6.x. We should still do the backport William opened, but I just think the change I suggested is also necessary. The change can go back to 6.8 as well. |
The PR that Will linked was merged back in February. I think we'll need something additional here to implement the leniency you describe. |
Yes I meant my suggestion as a new change. Sorry I didn’t mean to conflate the two. |
We've had a series of bug fixes for cases where an OsProbe gives negative values, most often just -1, to the OsStats class. We added assertions to catch cases where we were initializing OsStats with bad values. Unfortunately, these fixes turned out not to be backwards-compatible. In this commit, we simply coerce bad values to 0 when data is coming from nodes that don't have the relevant bug fixes.

Relevant PRs:
* #42725
* #56435
* #57317

Fixes #73459
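A minimal sketch of the wire-level leniency the commit message describes, under stated assumptions: the real code reads from Elasticsearch's StreamInput and compares org.elasticsearch.Version constants, whereas DataInput, the integer version id, and the cutoff constant used here are purely illustrative:

```java
import java.io.DataInput;
import java.io.IOException;

public class LenientStatsReader {

    // Hypothetical cutoff: senders older than this may legitimately report -1.
    private static final int FIRST_FIXED_VERSION_ID = 6_08_18_00;

    long readMemoryValue(DataInput in, int senderVersionId) throws IOException {
        long value = in.readLong();
        if (senderVersionId < FIRST_FIXED_VERSION_ID && value < 0) {
            // Old sender without the probe fixes: coerce to 0 (the real fix can also warn).
            return 0L;
        }
        // Senders that do have the fixes should never report negative values.
        assert value >= 0 : "negative value " + value + " from sender version " + senderVersionId;
        return value;
    }
}
```

The key design point is that leniency is keyed on the sender's version, so assertions still catch genuinely bad values coming from nodes that already contain the probe fixes.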
Build scan:
https://gradle-enterprise.elastic.co/s/64inre6ljqgcc
Repro line:
Reproduces locally?:
Haven't tried
Applicable branches:
6.8
Failure history:
https://gradle-enterprise.elastic.co/scans/failures?failures.failureClassification=all_failures&failures.failureMessage=Execution%20failed%20for%20task%20*%0A%3E%20Process%20%27kill%20%5B-9,%20*%20finished%20with%20non-zero%20exit%20value%201&search.relativeStartTime=P7D&search.timeZoneId=Europe/Athens
Failure excerpt:
Could be related to the Java-related updates:
14f3273
bb580a4