
failed to serialize outbound message [Response{778394113}{false}{false}{false}{class org.elasticsearch.action.admin.cluster.stats.ClusterStatsNodeResponse #77973

Closed
ptamba opened this issue Sep 17, 2021 · 12 comments · Fixed by #90319
Labels
>bug · :Data Management/Stats (Statistics tracking and retrieval APIs) · Team:Data Management (Meta label for data/management team)

Comments


ptamba commented Sep 17, 2021

Elasticsearch version (bin/elasticsearch --version): 7.14.0

Plugins installed: []

JVM version (java -version): openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)

OS version (uname -a if on a Unix-like system): Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Feb 18 16:39:12 EST 2020 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Cluster status is green:

{
  "cluster_name" : "******",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 213,
  "active_shards" : 390,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

However, cluster stats shows two failed nodes:

{
  "_nodes" : {
    "total" : 4,
    "successful" : 2,
    "failed" : 2,
    "failures" : [
      {
        "type" : "failed_node_exception",
        "reason" : "Failed node [iZZhcgQNSzyxyKlc8JtP6A]",
        "node_id" : "iZZhcgQNSzyxyKlc8JtP6A",
        "caused_by" : {
          "type" : "illegal_state_exception",
          "reason" : "Negative longs unsupported, use writeLong or writeZLong for negative numbers [-1089]"
        }
      },
      {
        "type" : "failed_node_exception",
        "reason" : "Failed node [fIBQM07-QiaABb73sYjjpw]",
        "node_id" : "fIBQM07-QiaABb73sYjjpw",
        "caused_by" : {
          "type" : "illegal_state_exception",
          "reason" : "Negative longs unsupported, use writeLong or writeZLong for negative numbers [-1182]"
        }
      }
    ]
  },

This also causes the monitoring cluster to not display anything (not sure whether the two are related).

The log below appears in the node logs.

Provide logs (if relevant):

[2021-09-17T10:01:03,243][WARN ][o.e.t.OutboundHandler    ]  failed to serialize outbound message [Response{2808893}{false}{false}{false}{class org.elasticsearch.action.admin.cluster.node.stats.NodeStats}]
java.lang.IllegalStateException: Negative longs unsupported, use writeLong or writeZLong for negative numbers [-1096]
        at org.elasticsearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:288) ~[elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.ingest.IngestStats$Stats.writeTo(IngestStats.java:186) ~[elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.ingest.IngestStats.writeTo(IngestStats.java:92) ~[elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.common.io.stream.StreamOutput.writeOptionalWriteable(StreamOutput.java:972) ~[elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.action.admin.cluster.node.stats.NodeStats.writeTo(NodeStats.java:279) ~[elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.transport.OutboundMessage.serialize(OutboundMessage.java:69) ~[elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:140) [elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.transport.OutboundHandler.sendResponse(OutboundHandler.java:118) [elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:52) [elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:41) [elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:282) [elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:275) [elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:259) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$3.onResponse(SecurityServerTransportInterceptor.java:317) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$3.onResponse(SecurityServerTransportInterceptor.java:307) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$authorizeAction$4(AuthorizationService.java:281) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authz.AuthorizationService$AuthorizationResultListener.onResponse(AuthorizationService.java:689) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authz.AuthorizationService$AuthorizationResultListener.onResponse(AuthorizationService.java:664) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31) [elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authz.RBACEngine.authorizeClusterAction(RBACEngine.java:154) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.authorizeAction(AuthorizationService.java:283) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.maybeAuthorizeRunAs(AuthorizationService.java:266) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$authorize$1(AuthorizationService.java:230) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:134) [elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31) [elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authz.RBACEngine.lambda$resolveAuthorizationInfo$1(RBACEngine.java:127) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:134) [elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authz.store.CompositeRolesStore.roles(CompositeRolesStore.java:165) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authz.store.CompositeRolesStore.getRoles(CompositeRolesStore.java:250) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authz.RBACEngine.getRoles(RBACEngine.java:133) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authz.RBACEngine.resolveAuthorizationInfo(RBACEngine.java:121) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.authorize(AuthorizationService.java:232) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.lambda$inbound$1(ServerTransportFilter.java:130) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:134) [elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$authenticateAsync$0(AuthenticationService.java:336) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$lookForExistingAuthentication$8(AuthenticationService.java:414) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lookForExistingAuthentication(AuthenticationService.java:425) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.authenticateAsync(AuthenticationService.java:333) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.access$000(AuthenticationService.java:274) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:184) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.inbound(ServerTransportFilter.java:121) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:328) [x-pack-security-7.14.0.jar:7.14.0]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:61) [elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:212) [elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.14.0.jar:7.14.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.14.0.jar:7.14.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:831) [?:?]
ptamba added the >bug and needs:triage labels Sep 17, 2021
jtibshirani added the :Data Management/Stats label and removed the needs:triage label Sep 17, 2021
elasticmachine added the Team:Data Management label Sep 17, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

@joegallo (Contributor)

Tracing that back through the code, these are the lines implicated by the stack trace:

diff --git a/server/src/main/java/org/elasticsearch/action/admin/cluster/node/stats/NodeStats.java b/server/src/main/java/org/elasticsearch/action/admin/cluster/node/stats/NodeStats.java
index 0dae95df480..c0d2ddb93d1 100644
--- a/server/src/main/java/org/elasticsearch/action/admin/cluster/node/stats/NodeStats.java
+++ b/server/src/main/java/org/elasticsearch/action/admin/cluster/node/stats/NodeStats.java
@@ -276,7 +276,7 @@ public class NodeStats extends BaseNodeResponse implements ToXContentFragment {
         out.writeOptionalWriteable(breaker);
         out.writeOptionalWriteable(scriptStats);
         out.writeOptionalWriteable(discoveryStats);
-        out.writeOptionalWriteable(ingestStats);
+        out.writeOptionalWriteable(ingestStats); // this
         if (out.getVersion().onOrAfter(Version.V_6_1_0)) {
             out.writeOptionalWriteable(adaptiveSelectionStats);
         } if (out.getVersion().onOrAfter(Version.V_7_8_0) && out.getVersion().before(Version.V_7_9_0)) {
diff --git a/server/src/main/java/org/elasticsearch/ingest/IngestStats.java b/server/src/main/java/org/elasticsearch/ingest/IngestStats.java
index f485c18b706..3e71978074a 100644
--- a/server/src/main/java/org/elasticsearch/ingest/IngestStats.java
+++ b/server/src/main/java/org/elasticsearch/ingest/IngestStats.java
@@ -159,7 +159,7 @@ public class IngestStats implements Writeable, ToXContentFragment {

         private final long ingestCount;
         private final long ingestTimeInMillis;
-        private final long ingestCurrent;
+        private final long ingestCurrent; // this
         private final long ingestFailedCount;

         public Stats(long ingestCount, long ingestTimeInMillis, long ingestCurrent, long ingestFailedCount) {
@@ -183,7 +183,7 @@ public class IngestStats implements Writeable, ToXContentFragment {
         public void writeTo(StreamOutput out) throws IOException {
             out.writeVLong(ingestCount);
             out.writeVLong(ingestTimeInMillis);
-            out.writeVLong(ingestCurrent);
+            out.writeVLong(ingestCurrent); // here
             out.writeVLong(ingestFailedCount);
         }

So my thinking is ingestCurrent probably rolled over to negative, and now things are blowing up trying to serialize it as a positive long because it's a negative long. 😬
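
For context on the exception itself: writeVLong writes seven bits at a time and only covers non-negative values, which is why a negative ingestCurrent only blows up at serialization time rather than where it was decremented. A minimal standalone sketch of that behaviour (assumed encoding details; not the actual StreamOutput implementation):

import java.io.ByteArrayOutputStream;

public class VLongSketch {
    // Sketch of a VLong-style writer: 7 data bits per byte, high bit used as a continuation flag.
    static void writeVLong(ByteArrayOutputStream out, long value) {
        if (value < 0) {
            // Same failure mode as the logs above: the encoding has no way to represent negatives.
            throw new IllegalStateException(
                "Negative longs unsupported, use writeLong or writeZLong for negative numbers [" + value + "]");
        }
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7FL) | 0x80L));
            value >>>= 7;
        }
        out.write((int) value);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVLong(out, 12345L);  // fine
        writeVLong(out, -1089L);  // throws, just like the failed node stats response
    }
}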

@joegallo (Contributor)

> So my thinking is ingestCurrent probably rolled over to negative, and now things are blowing up trying to serialize it as a positive long because it's a negative long. 😬

Errrmmmmmm, no. Thinking about it more, it's meant to be a count of the number of ingest actions currently in flight (hence ingestCurrent), so arithmetic rollover doesn't make much sense. That means it's probably double-decrementing somewhere.

@joegallo (Contributor)

@ptamba would you perhaps be willing to share your ingest pipeline definitions (GET _ingest/pipeline)? That might help me narrow down the search space.

@joegallo (Contributor)

We've seen some issues along this line before, though, so there's some prior art here in #52411, #52406, and #52581.

ptamba (Author) commented Sep 21, 2021

@joegallo I wouldn't mind sharing it, but we have some client-specific information in the pipeline (just customer names really, but I prefer to share it in a more controlled manner). We have a support contract with Elastic and I can share it through that channel.

@joegallo (Contributor)

Great! Thanks, @ptamba!

@joegallo joegallo self-assigned this Sep 21, 2021
@joegallo (Contributor)

Also very much related to #52339

@joegallo (Contributor)

I merged #81450 to help us detect this earlier at dev-time, but so far we haven't been able to find any other evidence of this happening for other customers in the wild.

Rebooting an affected node would have the effect of restarting the counter at 0, so it wouldn't be negative (and would therefore serialize just fine).
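
For reference, the kind of dev-time guard that catches this class of bug early is an assertion that the in-flight count never drops below zero. A hypothetical sketch of the idea (invented names; not the actual change in #81450):

import java.util.concurrent.atomic.AtomicLong;

public class GuardedMetricSketch {
    // Hypothetical in-flight metric with a dev-time guard; names invented for illustration.
    private final AtomicLong current = new AtomicLong();

    void preIngest() {
        current.incrementAndGet();
    }

    void postIngest() {
        long value = current.decrementAndGet();
        // With assertions enabled (java -ea), a double decrement fails fast in tests
        // instead of surfacing later as a transport serialization error.
        assert value >= 0 : "in-flight ingest count went negative: " + value;
    }

    public static void main(String[] args) {
        GuardedMetricSketch metric = new GuardedMetricSketch();
        metric.preIngest();
        metric.postIngest();
        metric.postIngest(); // a second decrement for the same document trips the assertion under -ea
    }
}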

@joegallo joegallo removed their assignment Dec 13, 2021
@matschaffer (Contributor)

Is there a good person to engage with for work on this issue? We found an ECE customer that's seeing this and I think it's causing internal collection of stack monitoring data to fail.

Additionally, I can see 7 ESS clusters currently exhibiting the error message (version range from 7.17.5 to 8.1.3).

@crespocarlos (Contributor)

@elastic/es-data-management this is an old ticket, but the problem seems to be affecting an ECE customer, as @matschaffer said above. Is it possible to take a look at this?

@masseyke (Member)

@joegallo it looks like we can double-decrement in CompoundProcessor, doesn't it? Wouldn't that happen if an exception was thrown somewhere in https://github.com/elastic/elasticsearch/blob/v7.17.4/server/src/main/java/org/elasticsearch/ingest/CompoundProcessor.java#L141-L147?
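
A hypothetical illustration of that pattern (invented names; not the actual CompoundProcessor code): if the in-flight metric is decremented before the downstream handler is invoked, and the handler itself throws, a catch block that also decrements would count the same document twice:

import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

public class DoubleDecrementSketch {
    // Invented names for illustration only; not the actual ingest code.
    private static final AtomicLong ingestCurrent = new AtomicLong();

    static void innerExecute(String doc, Consumer<String> handler) {
        ingestCurrent.incrementAndGet();
        try {
            // ... processors run successfully ...
            ingestCurrent.decrementAndGet();   // decrement once the document is handled
            handler.accept(doc);               // but the downstream handler can still throw...
        } catch (RuntimeException e) {
            ingestCurrent.decrementAndGet();   // ...and the catch decrements the same document again
        }
    }

    public static void main(String[] args) {
        innerExecute("doc-1", d -> { throw new RuntimeException("listener failed"); });
        System.out.println(ingestCurrent.get()); // prints -1: the counter has gone negative
    }
}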
