Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preventing serialization errors in the nodes stats API #90319

Merged
merged 5 commits into from
Sep 27, 2022

Conversation

masseyke
Copy link
Member

@masseyke masseyke commented Sep 23, 2022

On a few occasions we have seen the nodes stats API fail with this (this stack is from 7.14.0, but we have also seen it on higher 7.x versions and 8.x):

[es/i-38/es.log] [2022-08-24T13:32:54,997][WARN ][org.elasticsearch.transport.OutboundHandler] [instance-0000000038] failed to serialize outbound message [Response{46294885}{false}{false}{false}{class org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse}]
java.lang.IllegalStateException: Negative longs unsupported, use writeLong or writeZLong for negative numbers [-1]
	at org.elasticsearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:289) ~[elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.ingest.IngestStats$Stats.writeTo(IngestStats.java:186) ~[elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.ingest.IngestStats.writeTo(IngestStats.java:76) ~[elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.common.io.stream.StreamOutput.writeOptionalWriteable(StreamOutput.java:978) ~[elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.action.admin.cluster.node.stats.NodeStats.writeTo(NodeStats.java:288) ~[elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse.writeTo(ClusterStatsNodeResponse.java:89) ~[elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.transport.OutboundMessage.serialize(OutboundMessage.java:75) ~[elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:180) [elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.transport.OutboundHandler.sendResponse(OutboundHandler.java:145) [elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:61) [elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:41) [elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:305) [elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:299) [elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:341) [x-pack-security-7.17.5.jar:7.17.5]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$3.onResponse(SecurityServerTransportInterceptor.java:404) [x-pack-security-7.17.5.jar:7.17.5]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$3.onResponse(SecurityServerTransportInterceptor.java:394) [x-pack-security-7.17.5.jar:7.17.5]
	at org.elasticsearch.xpack.security.authz.AuthorizationService.authorizeSystemUser(AuthorizationService.java:620) [x-pack-security-7.17.5.jar:7.17.5]
	at org.elasticsearch.xpack.security.authz.AuthorizationService.authorize(AuthorizationService.java:250) [x-pack-security-7.17.5.jar:7.17.5]
	at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.lambda$inbound$1(ServerTransportFilter.java:136) [x-pack-security-7.17.5.jar:7.17.5]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136) [elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:101) [elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.xpack.security.authc.AuthenticatorChain.authenticateAsync(AuthenticatorChain.java:102) [x-pack-security-7.17.5.jar:7.17.5]
	at org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:199) [x-pack-security-7.17.5.jar:7.17.5]
	at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.inbound(ServerTransportFilter.java:128) [x-pack-security-7.17.5.jar:7.17.5]
	at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:415) [x-pack-security-7.17.5.jar:7.17.5]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:67) [elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:260) [elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) [elasticsearch-7.17.5.jar:7.17.5]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.5.jar:7.17.5]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

The cause is that the ingestCurrent counter has gone below 0, and writeVLong cannot handle negative numbers. The ingestCurrent counter ought to never be below 0. I have not been able to reproduce this despite trying a lot of things. I believe that the only way that this could happen (before 7.17.2 anyway) is if a processor's listener is called more than once, since the counter is always incremented before the processor is run, and then decremented once in the processor's listener. Unfortunately I have not been able to find any place where a listener is called more than once, and there are hundreds of listener uses we would need to go through. Since I have not found the root cause, this PR is meant to detect and prevent the serialization problem, and to log the stack trace when it is found (so that we can potentially find the root cause(s)).
Closes #77973
Related #84838

@masseyke masseyke added >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v8.5.0 v7.17.7 v8.6.0 labels Sep 23, 2022
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Sep 23, 2022
@elasticsearchmachine
Copy link
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Hi @masseyke, I've updated the changelog YAML for you.

@masseyke masseyke requested a review from joegallo September 23, 2022 20:23
@masseyke
Copy link
Member Author

@elasticmachine run elasticsearch-ci/part-1

@masseyke
Copy link
Member Author

@elasticmachine run elasticsearch-ci/bwc

@masseyke
Copy link
Member Author

@elasticmachine update branch

@joegallo
Copy link
Contributor

Anything else you'd like to see here @DaveCTurner? It LGTM, but I don't want to front run your ✅ if there's something else you'd like to see.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, nothing more from me, I was just passing through 😁

@masseyke masseyke merged commit e21be6e into elastic:main Sep 27, 2022
@masseyke masseyke deleted the fix/prevent-serialization-error branch September 27, 2022 20:40
masseyke added a commit to masseyke/elasticsearch that referenced this pull request Sep 27, 2022
Preventing serialization errors in the nodes stats API, and adding logging to the ingest counter code
so that we can find the root cause of the problem in the future.
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.5
7.17 Commit could not be cherrypicked due to conflicts
8.4

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 90319

masseyke added a commit to masseyke/elasticsearch that referenced this pull request Sep 27, 2022
Preventing serialization errors in the nodes stats API, and adding logging to the ingest counter code
so that we can find the root cause of the problem in the future.
elasticsearchmachine pushed a commit that referenced this pull request Sep 27, 2022
Preventing serialization errors in the nodes stats API, and adding logging to the ingest counter code
so that we can find the root cause of the problem in the future.
elasticsearchmachine pushed a commit that referenced this pull request Sep 28, 2022
Preventing serialization errors in the nodes stats API, and adding logging to the ingest counter code
so that we can find the root cause of the problem in the future.

Co-authored-by: Elastic Machine <[email protected]>
masseyke added a commit to masseyke/elasticsearch that referenced this pull request Sep 28, 2022
slobodanadamovic added a commit to slobodanadamovic/elasticsearch that referenced this pull request Sep 29, 2022
lockewritesdocs pushed a commit that referenced this pull request Sep 29, 2022
* Update docs for v8.4.3 release

* Update release highlights for 8.4.3 version.

* Update docs/reference/release-notes/8.4.3.asciidoc

Co-authored-by: Adam Locke <[email protected]>

* Update docs/reference/release-notes/8.4.3.asciidoc

Co-authored-by: Adam Locke <[email protected]>

* Update docs/reference/release-notes/highlights.asciidoc

Co-authored-by: Adam Locke <[email protected]>

* Make link external type

* Update release notes to include #90319 PR after creating BC2.

* Remove release note for #90302

* Minor grammar fix

Co-authored-by: Adam Locke <[email protected]>
lockewritesdocs pushed a commit to lockewritesdocs/elasticsearch that referenced this pull request Sep 29, 2022
* Update docs for v8.4.3 release

* Update release highlights for 8.4.3 version.

* Update docs/reference/release-notes/8.4.3.asciidoc

Co-authored-by: Adam Locke <[email protected]>

* Update docs/reference/release-notes/8.4.3.asciidoc

Co-authored-by: Adam Locke <[email protected]>

* Update docs/reference/release-notes/highlights.asciidoc

Co-authored-by: Adam Locke <[email protected]>

* Make link external type

* Update release notes to include elastic#90319 PR after creating BC2.

* Remove release note for elastic#90302

* Minor grammar fix

Co-authored-by: Adam Locke <[email protected]>
(cherry picked from commit 25a196f)

# Conflicts:
#	docs/reference/release-notes.asciidoc
#	docs/reference/release-notes/highlights.asciidoc
lockewritesdocs pushed a commit to lockewritesdocs/elasticsearch that referenced this pull request Sep 29, 2022
* Update docs for v8.4.3 release

* Update release highlights for 8.4.3 version.

* Update docs/reference/release-notes/8.4.3.asciidoc

Co-authored-by: Adam Locke <[email protected]>

* Update docs/reference/release-notes/8.4.3.asciidoc

Co-authored-by: Adam Locke <[email protected]>

* Update docs/reference/release-notes/highlights.asciidoc

Co-authored-by: Adam Locke <[email protected]>

* Make link external type

* Update release notes to include elastic#90319 PR after creating BC2.

* Remove release note for elastic#90302

* Minor grammar fix

Co-authored-by: Adam Locke <[email protected]>
(cherry picked from commit 25a196f)

# Conflicts:
#	docs/reference/release-notes.asciidoc
#	docs/reference/release-notes/highlights.asciidoc
lockewritesdocs pushed a commit that referenced this pull request Sep 30, 2022
* Update docs for v8.4.3 release

* Update release highlights for 8.4.3 version.

* Update docs/reference/release-notes/8.4.3.asciidoc

Co-authored-by: Adam Locke <[email protected]>

* Update docs/reference/release-notes/8.4.3.asciidoc

Co-authored-by: Adam Locke <[email protected]>

* Update docs/reference/release-notes/highlights.asciidoc

Co-authored-by: Adam Locke <[email protected]>

* Make link external type

* Update release notes to include #90319 PR after creating BC2.

* Remove release note for #90302

* Minor grammar fix

Co-authored-by: Adam Locke <[email protected]>
(cherry picked from commit 25a196f)

# Conflicts:
#	docs/reference/release-notes.asciidoc
#	docs/reference/release-notes/highlights.asciidoc

Co-authored-by: Slobodan Adamović <[email protected]>
lockewritesdocs pushed a commit that referenced this pull request Sep 30, 2022
* Update docs for v8.4.3 release

* Update release highlights for 8.4.3 version.

* Update docs/reference/release-notes/8.4.3.asciidoc

Co-authored-by: Adam Locke <[email protected]>

* Update docs/reference/release-notes/8.4.3.asciidoc

Co-authored-by: Adam Locke <[email protected]>

* Update docs/reference/release-notes/highlights.asciidoc

Co-authored-by: Adam Locke <[email protected]>

* Make link external type

* Update release notes to include #90319 PR after creating BC2.

* Remove release note for #90302

* Minor grammar fix

Co-authored-by: Adam Locke <[email protected]>
(cherry picked from commit 25a196f)

# Conflicts:
#	docs/reference/release-notes.asciidoc
#	docs/reference/release-notes/highlights.asciidoc

Co-authored-by: Slobodan Adamović <[email protected]>
justincr-elastic pushed a commit to justincr-elastic/elasticsearch that referenced this pull request Oct 8, 2022
Preventing serialization errors in the nodes stats API, and adding logging to the ingest counter code
so that we can find the root cause of the problem in the future.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team v7.17.7 v8.4.3 v8.5.0 v8.6.0
Projects
None yet
5 participants