
Investigate and Fix Serialization Issue with IngestStats #52339

Closed
original-brownbear opened this issue Feb 13, 2020 · 9 comments · Fixed by #52543
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team

Comments

@original-brownbear
Member

We have a pretty detailed report about ingest stats not serializing properly in https://discuss.elastic.co/t/netty4tcpchannel-negative-longs-unsupported-repeated-in-logs/219235/6

What it comes down to is that the number of currently executing processors has somehow become negative and therefore fails to serialize (and it obviously shouldn't be negative in the first place):

[2020-02-13T15:56:52,878][WARN ][o.e.t.OutboundHandler    ] [ela3] send message failed [channel: Netty4TcpChannel{localAddress=/A.B.C.95:9300, remoteAddress=/A.B.C.93:56542}]
java.lang.IllegalStateException: Negative longs unsupported, use writeLong or writeZLong for negative numbers [-84034]
        at org.elasticsearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:299) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.ingest.IngestStats$Stats.writeTo(IngestStats.java:197) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.ingest.IngestStats.writeTo(IngestStats.java:103) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.common.io.stream.StreamOutput.writeOptionalWriteable(StreamOutput.java:897) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.action.admin.cluster.node.stats.NodeStats.writeTo(NodeStats.java:255) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.OutboundMessage.writeMessage(OutboundMessage.java:87) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.OutboundMessage.serialize(OutboundMessage.java:64) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:166) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:152) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.OutboundHandler$SendContext.get(OutboundHandler.java:199) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.OutboundHandler.internalSend(OutboundHandler.java:129) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:124) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.OutboundHandler.sendResponse(OutboundHandler.java:104) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:64) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:244) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:240) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) [x-pack-security-7.6.0.jar:7.6.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:315) [x-pack-security-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.6.0.jar:7.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]

This seems to be caused by a pipeline throwing:

[2020-02-13T17:53:31,229][DEBUG][o.e.a.b.T.BulkRequestModifier] [ela1] failed to execute pipeline [_none] for document [filebeat-7.6.0/_doc/null]

I didn't investigate the deeper cause here, but I'm assuming that on error some path issues too many decrement calls to org.elasticsearch.ingest.IngestMetric#ingestCurrent.
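
For context, a minimal sketch of why a negative count then breaks serialization. The encoder below is only an illustration of a VLong-style (unsigned variable-length) encoding, not the actual StreamOutput implementation; the guard it shows is the one producing the exception in the stack trace above.

// Illustration only: a VLong is an unsigned variable-length encoding, so a
// negative long has no valid representation, and a writeVLong-style method
// rejects it up front, which is the IllegalStateException reported above.
public class VLongSketch {
    static byte[] writeVLong(long value) {
        if (value < 0) {
            throw new IllegalStateException(
                "Negative longs unsupported, use writeLong or writeZLong for negative numbers [" + value + "]");
        }
        byte[] buffer = new byte[10];
        int i = 0;
        while ((value & ~0x7FL) != 0) {               // more than 7 bits remaining
            buffer[i++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        buffer[i++] = (byte) value;
        return java.util.Arrays.copyOf(buffer, i);
    }

    public static void main(String[] args) {
        System.out.println(writeVLong(84034).length); // fine: 3 bytes
        writeVLong(-84034);                           // throws, as in the report
    }
}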

@original-brownbear original-brownbear added >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Feb 13, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Ingest)

@danhermann danhermann self-assigned this Feb 13, 2020
@jasontedor
Member

The bug here is that in Processor#execute, accept on the handler parameter can be called multiple times if there is a failure, and that accept method decrements the counter:

default void execute(IngestDocument ingestDocument, BiConsumer<IngestDocument, Exception> handler) {
    try {
        IngestDocument result = execute(ingestDocument);
        handler.accept(result, null);
    } catch (Exception e) {
        handler.accept(null, e);
    }
}

This explains the over-decrementing, and thus the negative values.
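
For illustration, a self-contained sketch of that double notification (the counter and document handling below are hypothetical stand-ins; only the shape of the default execute method above comes from the source):

import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BiConsumer;

// Hypothetical reproduction of the double-decrement path: the handler
// decrements a "current" counter and then throws while handling the
// successful result, which lands us in the catch block and calls the
// handler a second time.
public class DoubleNotificationSketch {
    static final AtomicLong current = new AtomicLong();

    static void execute(String document, BiConsumer<String, Exception> handler) {
        try {
            String result = document.toUpperCase();  // stands in for execute(ingestDocument)
            handler.accept(result, null);            // success path, decrements once...
        } catch (Exception e) {
            handler.accept(null, e);                 // ...then the failure path decrements again
        }
    }

    public static void main(String[] args) {
        current.incrementAndGet();                   // one document in flight
        execute("doc", (result, e) -> {
            current.decrementAndGet();
            if (result != null) {
                // downstream work after the success notification fails
                throw new RuntimeException("downstream failure after success");
            }
        });
        System.out.println("current = " + current.get()); // prints -1
    }
}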

@jasontedor
Member

To be clear about what I mean here: line 51 succeeds, handler.accept(result, null) on line 52 is invoked and the current metric is decremented, then a subsequent exception is thrown later in that accept invocation (there is a lot of potential code that executes after it), so we end up in the catch block on line 54 invoking handler.accept(null, e), which calls accept a second time and leads to a double decrement. It's effectively a double notification problem.
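
A sketch of one possible guard against double notification, wrapping the handler so it can fire only once; this is just an illustration of the idea, not necessarily how #52543 or #69818 implement the fix:

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BiConsumer;

// Hypothetical "notify once" wrapper: whichever of the success or failure
// paths fires first wins, and any later invocation is ignored, so the
// current-count metric is decremented at most once per document.
public final class NotifyOnce<T> implements BiConsumer<T, Exception> {
    private final AtomicBoolean notified = new AtomicBoolean();
    private final BiConsumer<T, Exception> delegate;

    public NotifyOnce(BiConsumer<T, Exception> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void accept(T result, Exception e) {
        if (notified.compareAndSet(false, true)) {
            delegate.accept(result, e);
        }
        // else: already notified, drop the duplicate callback
    }
}

Wrapping the handler in something like new NotifyOnce<>(handler) before it is passed to execute would turn the second accept call in the catch block into a no-op, so the current metric is decremented at most once per document.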

@MakoWish

+1 - I am also seeing this after an upgrade from 7.5.2 to 7.6.0. Although all my data nodes do show up on Stack Monitoring, there are zero metrics for any of them, and the node count does not include the data nodes. I was directed to this issue at the AMA booth of Elastic{ON} Anaheim. Please let me know if there is any information I can provide from my cluster to help.

@danhermann
Contributor

Thank you, @MakoWish. We're working on a fix for it and we'll ping you if additional information would be helpful.

@mayya-sharipova
Contributor

mayya-sharipova commented Apr 29, 2020

Reopening this issue, as we still see failures in stats due to a negative processor count in 7.6.1, where PR #52543 was merged.

@danhermann
Contributor

See also the additional information that @srikwit provided on another instance of this bug here: #62087 (comment)

@dnhatn
Member

dnhatn commented Jan 19, 2021

Another case happening on 7.10.

java.lang.IllegalStateException: Negative longs unsupported, use writeLong or writeZLong for negative numbers [-3]
        at org.elasticsearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:298) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.ingest.IngestStats$Stats.writeTo(IngestStats.java:197) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.ingest.IngestStats.writeTo(IngestStats.java:87) ~[elasticsearch-7.10.0.jar:7.10.0]

original-brownbear added a commit that referenced this issue Mar 3, 2021, followed by backport commits in original-brownbear/elasticsearch and this repository on Mar 3–4, 2021, all with the same message:

There was an obvious race here where the async processor and the final pipeline could run concurrently (or the final pipeline could run multiple times from the while loop).

relates #52339 (fixes one failure scenario here, but since the failure also occurred in 7.10.x, not all of them)
@original-brownbear
Member Author

This should be fixed by #69818 and has not been reported since.
