
Investigate and Fix Serialization Issue with IngestStats #52339

Closed
original-brownbear opened this issue Feb 13, 2020 · 9 comments · Fixed by #52543
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team

Comments

@original-brownbear
Member

We have a pretty detailed report about ingest stats not serializing properly in https://discuss.elastic.co/t/netty4tcpchannel-negative-longs-unsupported-repeated-in-logs/219235/6

What it comes down to is that the number of currently executing processors has somehow become negative and therefore fails to serialize (and it obviously shouldn't be negative in the first place):

[2020-02-13T15:56:52,878][WARN ][o.e.t.OutboundHandler    ] [ela3] send message failed [channel: Netty4TcpChannel{localAddress=/A.B.C.95:9300, remoteAddress=/A.B.C.93:56542}]
java.lang.IllegalStateException: Negative longs unsupported, use writeLong or writeZLong for negative numbers [-84034]
        at org.elasticsearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:299) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.ingest.IngestStats$Stats.writeTo(IngestStats.java:197) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.ingest.IngestStats.writeTo(IngestStats.java:103) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.common.io.stream.StreamOutput.writeOptionalWriteable(StreamOutput.java:897) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.action.admin.cluster.node.stats.NodeStats.writeTo(NodeStats.java:255) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.OutboundMessage.writeMessage(OutboundMessage.java:87) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.OutboundMessage.serialize(OutboundMessage.java:64) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:166) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:152) ~[elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.OutboundHandler$SendContext.get(OutboundHandler.java:199) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.OutboundHandler.internalSend(OutboundHandler.java:129) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:124) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.OutboundHandler.sendResponse(OutboundHandler.java:104) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:64) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:244) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:240) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) [x-pack-security-7.6.0.jar:7.6.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:315) [x-pack-security-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) [elasticsearch-7.6.0.jar:7.6.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.6.0.jar:7.6.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]

This seems to be caused by a pipeline throwing:

[2020-02-13T17:53:31,229][DEBUG][o.e.a.b.T.BulkRequestModifier] [ela1] failed to execute pipeline [_none] for document [filebeat-7.6.0/_doc/null]

I didn't investigate the deeper cause here, but I'm assuming that on error some path issues too many decrement calls to org.elasticsearch.ingest.IngestMetric#ingestCurrent.
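
For context, a minimal sketch of why a negative count then breaks serialization. The encoder below is only an illustration of a VLong-style (unsigned variable-length) encoding, not the actual StreamOutput implementation; the guard it shows is the one producing the exception in the stack trace above.

// Illustration only: a VLong is an unsigned variable-length encoding, so a
// negative long has no valid representation, and a writeVLong-style method
// rejects it up front, which is the IllegalStateException reported above.
public class VLongSketch {
    static byte[] writeVLong(long value) {
        if (value < 0) {
            throw new IllegalStateException(
                "Negative longs unsupported, use writeLong or writeZLong for negative numbers [" + value + "]");
        }
        byte[] buffer = new byte[10];
        int i = 0;
        while ((value & ~0x7FL) != 0) {               // more than 7 bits remaining
            buffer[i++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        buffer[i++] = (byte) value;
        return java.util.Arrays.copyOf(buffer, i);
    }

    public static void main(String[] args) {
        System.out.println(writeVLong(84034).length); // fine: 3 bytes
        writeVLong(-84034);                           // throws, as in the report
    }
}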

@original-brownbear original-brownbear added >bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Feb 13, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Ingest)

@danhermann danhermann self-assigned this Feb 13, 2020
@jasontedor
Member

The bug here is that in Processor#execute, accept on the handler parameter can be called multiple times if there is a failure, and that accept method decrements the counter:

default void execute(IngestDocument ingestDocument, BiConsumer<IngestDocument, Exception> handler) {
    try {
        IngestDocument result = execute(ingestDocument);
        handler.accept(result, null);
    } catch (Exception e) {
        handler.accept(null, e);
    }
}

This explains the over-decrementing, and thus the negative values.
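
For illustration, a self-contained sketch of that double notification (the counter and document handling below are hypothetical stand-ins; only the shape of the default execute method above comes from the source):

import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BiConsumer;

// Hypothetical reproduction of the double-decrement path: the handler
// decrements a "current" counter and then throws while handling the
// successful result, which lands us in the catch block and calls the
// handler a second time.
public class DoubleNotificationSketch {
    static final AtomicLong current = new AtomicLong();

    static void execute(String document, BiConsumer<String, Exception> handler) {
        try {
            String result = document.toUpperCase();  // stands in for execute(ingestDocument)
            handler.accept(result, null);            // success path, decrements once...
        } catch (Exception e) {
            handler.accept(null, e);                 // ...then the failure path decrements again
        }
    }

    public static void main(String[] args) {
        current.incrementAndGet();                   // one document in flight
        execute("doc", (result, e) -> {
            current.decrementAndGet();
            if (result != null) {
                // downstream work after the success notification fails
                throw new RuntimeException("downstream failure after success");
            }
        });
        System.out.println("current = " + current.get()); // prints -1
    }
}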

@jasontedor
Member

To be clear about what I mean here: line 51 succeeds, handler.accept(result, null) on line 52 is invoked and the current metric is decremented, then a subsequent exception is thrown later in that accept invocation (there is a lot of potential code that executes after it), so we end up in the catch block on line 54 invoking handler.accept(null, e), which calls accept a second time and leads to a double decrement. It's effectively a double notification problem.
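
A sketch of one possible guard against double notification, wrapping the handler so it can fire only once; this is just an illustration of the idea, not necessarily how #52543 or #69818 implement the fix:

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BiConsumer;

// Hypothetical "notify once" wrapper: whichever of the success or failure
// paths fires first wins, and any later invocation is ignored, so the
// current-count metric is decremented at most once per document.
public final class NotifyOnce<T> implements BiConsumer<T, Exception> {
    private final AtomicBoolean notified = new AtomicBoolean();
    private final BiConsumer<T, Exception> delegate;

    public NotifyOnce(BiConsumer<T, Exception> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void accept(T result, Exception e) {
        if (notified.compareAndSet(false, true)) {
            delegate.accept(result, e);
        }
        // else: already notified, drop the duplicate callback
    }
}

Wrapping the handler in something like new NotifyOnce<>(handler) before it is passed to execute would turn the second accept call in the catch block into a no-op, so the current metric is decremented at most once per document.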

@MakoWish

+1 - I am also seeing this after an upgrade from 7.5.2 to 7.6.0. Although all my data nodes do show up on Stack Monitoring, there are zero metrics for any of them, and the node count does not include the data nodes. I was directed to this issue at the AMA booth of Elastic{ON} Anaheim. Please let me know if there is any information I can provide from my cluster to help.

@danhermann
Contributor

Thank you, @MakoWish. We're working on a fix for it and we'll ping you if additional information would be helpful.

@mayya-sharipova
Contributor

mayya-sharipova commented Apr 29, 2020

Reopening this issue, as we still see failures in stats due to a negative processor count in 7.6.1, where PR #52543 was merged.

@danhermann
Contributor

See also the additional information that @srikwit provided on another instance of this bug here: #62087 (comment)

@dnhatn
Member

dnhatn commented Jan 19, 2021

Another case happening on 7.10.

java.lang.IllegalStateException: Negative longs unsupported, use writeLong or writeZLong for negative numbers [-3]
        at org.elasticsearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:298) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.ingest.IngestStats$Stats.writeTo(IngestStats.java:197) ~[elasticsearch-7.10.0.jar:7.10.0]
        at org.elasticsearch.ingest.IngestStats.writeTo(IngestStats.java:87) ~[elasticsearch-7.10.0.jar:7.10.0]

original-brownbear added a commit that referenced this issue Mar 3, 2021, followed by backport commits in original-brownbear/elasticsearch and this repository on Mar 3–4, 2021, all with the same message:

There was an obvious race here where the async processor and the final pipeline could run concurrently (or the final pipeline could run multiple times from the while loop).

relates #52339 (fixes one failure scenario here, but since the failure also occurred in 7.10.x, not all of them)
@original-brownbear
Member Author

This should be fixed by #69818 and has not been reported since.
