
Bulk index tasks stuck forever #31099

Closed
jeancornic opened this issue Jun 5, 2018 · 16 comments
Labels
>bug, :Distributed Indexing/CRUD (a catch-all label for issues around indexing, updating and getting a doc by id; not search)

Comments

@jeancornic

jeancornic commented Jun 5, 2018

Elasticsearch version (bin/elasticsearch --version): 6.1.3

Plugins installed: []

JVM version (java -version):

java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

OS version (uname -a if on a Unix-like system):

Linux *** 4.4.0-1049-aws #58-Ubuntu SMP Fri Jan 12 23:17:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

We're seeing a weird behaviour: bulk index actions that never finish.
They are triggered by the Elasticsearch Java client (version 6.1.3), through a BulkRequestBuilder, with the default timeout (1m). It seems we never enter the ActionListener callbacks.

I found out recently that it's linked to tasks running indefinitely. When checking the /_tasks API, I'm seeing tasks that have been running for several days (!).

$ curl 'localhost:9200/_tasks?pretty&human&actions=indices:data/write/bulk&detailed'
...
    "pd5MT4eUSEqsOy_oq_q-vw" : {
      "name" : "***",
      "transport_address" : "***",
      "host" : "***",
      "ip" : "***",
      "roles" : [
        "data"
      ],
      "attributes" : {
        "availability_zone" : "us-east-1c",
        "tag" : "fresh"
      },
      "tasks" : {
        "pd5MT4eUSEqsOy_oq_q-vw:208477658" : {
          "node" : "pd5MT4eUSEqsOy_oq_q-vw",
          "id" : 208477658,
          "type" : "netty",
          "action" : "indices:data/write/bulk",
          "start_time" : "2018-04-12T21:17:45.175Z",
          "start_time_in_millis" : 1523567865175,
          "running_time" : "53.4d",
          "running_time_in_nanos" : 4622108629250516,
          "cancellable" : false
        }
      }
    },
...

I would expect the bulk action to run for 1 min and then time out.
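
For reference, at the REST level that timeout corresponds to the timeout query parameter on the _bulk endpoint. A minimal sketch of such a request with an explicit 1-minute timeout (index, type and document here are made up for illustration, not taken from the actual workload):

# hypothetical bulk request with an explicit 1m timeout
curl -s -X POST 'localhost:9200/_bulk?timeout=1m&pretty' \
  -H 'Content-Type: application/x-ndjson' \
  --data-binary $'{"index":{"_index":"my-index","_type":"doc","_id":"1"}}\n{"field":"value"}\n'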

I came across several similar issues / PRs on GitHub.

Steps to reproduce: I have no idea; it happens randomly, ever since we migrated from 5.6 to 6.1.3.

Provide logs (if relevant): found nothing relevant, neither in the ES logs nor in the client logs.

@mayya-sharipova added the >bug and :Core/Infra/Core labels on Jun 5, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@jeancornic
Author

Do you need more details? Or should I run some commands on the cluster?

I also found out recently that this bug is responsible for stuck relocations. Somehow the ongoing bulk operation prevents the relocation from finishing completely, and the workaround I found is to close / re-open the index.
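
For the record, that workaround is just the standard close/open index APIs; a sketch (the index name is illustrative):

# close and re-open the affected index
curl -X POST 'localhost:9200/my-index/_close?pretty'
curl -X POST 'localhost:9200/my-index/_open?pretty'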

@s1monw
Contributor

s1monw commented Jun 14, 2018

@bleskes this seems more of a distributed systems issue? Can you follow up?

@bleskes
Contributor

bleskes commented Jun 18, 2018

@jeancornic what's peculiar about this issue is that the task indices:data/write/bulk is a top-level bulk request. That one potentially creates the index (are you using this feature?) and then breaks the request into sub-requests containing all operations that target a specific shard, and sends those off. Are you seeing any of those being stuck? You can identify them by indices:data/write/bulk[s]* (note the [s] at the end).
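
To check for those, something along these lines should list only the per-shard sub-requests (curl's -g flag disables globbing so the brackets in the action name are passed through literally):

curl -g 'localhost:9200/_tasks?pretty&detailed&group_by=parents&actions=indices:data/write/bulk[s]*'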

@jeancornic
Author

@bleskes sure, here follows a JSON extract (group_by=parents view).
And no, we disabled auto_create_index.
We see them on several nodes and across several indices.

    ...
    "pd5MT4eUSEqsOy_oq_q-vw:208477658" : {
      "node" : "pd5MT4eUSEqsOy_oq_q-vw",
      "id" : 208477658,
      "type" : "netty",
      "action" : "indices:data/write/bulk",
      "description" : "requests[938], indices[]",
      "start_time" : "2018-04-12T21:17:45.175Z",
      "start_time_in_millis" : 1523567865175,
      "running_time" : "66.7d",
      "running_time_in_nanos" : 5765089168375518,
      "cancellable" : false,
      "children" : [
        {
          "node" : "pd5MT4eUSEqsOy_oq_q-vw",
          "id" : 208477692,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "rerouted"
          },
          "description" : "requests[6], index[17.1523351040000.0]",
          "start_time" : "2018-04-12T21:17:45.177Z",
          "start_time_in_millis" : 1523567865177,
          "running_time" : "66.7d",
          "running_time_in_nanos" : 5765089166199822,
          "cancellable" : false,
          "parent_task_id" : "pd5MT4eUSEqsOy_oq_q-vw:208477658",
          "children" : [
            {
              "node" : "pacubgT0TfeaE0sIHWzbfA",
              "id" : 261543271,
              "type" : "netty",
              "action" : "indices:data/write/bulk[s]",
              "status" : {
                "phase" : "waiting_on_primary"
              },
              "description" : "requests[6], index[17.1523351040000.0]",
              "start_time" : "2018-04-12T21:17:45.182Z",
              "start_time_in_millis" : 1523567865182,
              "running_time" : "66.7d",
              "running_time_in_nanos" : 5765089159914921,
              "cancellable" : false,
              "parent_task_id" : "pd5MT4eUSEqsOy_oq_q-vw:208477692",
              "children" : [
                {
                  "node" : "pacubgT0TfeaE0sIHWzbfA",
                  "id" : 261543299,
                  "type" : "direct",
                  "action" : "indices:data/write/bulk[s][p]",
                  "status" : {
                    "phase" : "primary"
                  },
                  "description" : "requests[6], index[17.1523351040000.0]",
                  "start_time" : "2018-04-12T21:17:45.196Z",
                  "start_time_in_millis" : 1523567865196,
                  "running_time" : "66.7d",
                  "running_time_in_nanos" : 5765089146064289,
                  "cancellable" : false,
                  "parent_task_id" : "pacubgT0TfeaE0sIHWzbfA:261543271"
                },
                {
                  "node" : "3lgZ2bOCQs6xXcUda6OyuQ",
                  "id" : 256995983,
                  "type" : "netty",
                  "action" : "indices:data/write/bulk[s][r]",
                  "status" : {
                    "phase" : "replica"
                  },
                  "description" : "requests[6], index[17.1523351040000.0]",
                  "start_time" : "2018-04-12T21:17:45.217Z",
                  "start_time_in_millis" : 1523567865217,
                  "running_time" : "66.7d",
                  "running_time_in_nanos" : 5765089124501837,
                  "cancellable" : false,
                  "parent_task_id" : "pacubgT0TfeaE0sIHWzbfA:261543271"
                }
              ]
            }
          ]
        }
      ]
    },
    ...
    "Pnwo6h_ATI6aV0WDKnubfw:225377459" : {
      "node" : "Pnwo6h_ATI6aV0WDKnubfw",
      "id" : 225377459,
      "type" : "netty",
      "action" : "indices:data/write/bulk",
      "description" : "requests[706], indices[]",
      "start_time" : "2018-04-13T07:04:22.308Z",
      "start_time_in_millis" : 1523603062308,
      "running_time" : "66.3d",
      "running_time_in_nanos" : 5729892034527237,
      "cancellable" : false,
      "children" : [
        {
          "node" : "Pnwo6h_ATI6aV0WDKnubfw",
          "id" : 225377480,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "rerouted"
          },
          "description" : "requests[11], index[210.1523585220000.0]",
          "start_time" : "2018-04-13T07:04:22.310Z",
          "start_time_in_millis" : 1523603062310,
          "running_time" : "66.3d",
          "running_time_in_nanos" : 5729892032405972,
          "cancellable" : false,
          "parent_task_id" : "Pnwo6h_ATI6aV0WDKnubfw:225377459",
          "children" : [
            {
              "node" : "pacubgT0TfeaE0sIHWzbfA",
              "id" : 279050835,
              "type" : "netty",
              "action" : "indices:data/write/bulk[s]",
              "status" : {
                "phase" : "waiting_on_primary"
              },
              "description" : "requests[11], index[210.1523585220000.0]",
              "start_time" : "2018-04-13T07:04:22.312Z",
              "start_time_in_millis" : 1523603062312,
              "running_time" : "66.3d",
              "running_time_in_nanos" : 5729892030298850,
              "cancellable" : false,
              "parent_task_id" : "Pnwo6h_ATI6aV0WDKnubfw:225377480",
              "children" : [
                {
                  "node" : "pacubgT0TfeaE0sIHWzbfA",
                  "id" : 279050836,
                  "type" : "direct",
                  "action" : "indices:data/write/bulk[s][p]",
                  "status" : {
                    "phase" : "primary"
                  },
                  "description" : "requests[11], index[210.1523585220000.0]",
                  "start_time" : "2018-04-13T07:04:22.312Z",
                  "start_time_in_millis" : 1523603062312,
                  "running_time" : "66.3d",
                  "running_time_in_nanos" : 5729892030288175,
                  "cancellable" : false,
                  "parent_task_id" : "pacubgT0TfeaE0sIHWzbfA:279050835"
                },
                {
                  "node" : "pd5MT4eUSEqsOy_oq_q-vw",
                  "id" : 220870271,
                  "type" : "netty",
                  "action" : "indices:data/write/bulk[s][r]",
                  "status" : {
                    "phase" : "replica"
                  },
                  "description" : "requests[11], index[210.1523585220000.0]",
                  "start_time" : "2018-04-13T07:04:22.320Z",
                  "start_time_in_millis" : 1523603062320,
                  "running_time" : "66.3d",
                  "running_time_in_nanos" : 5729892022670077,
                  "cancellable" : false,
                  "parent_task_id" : "pacubgT0TfeaE0sIHWzbfA:279050835"
                }
              ]
            }
          ]
        }
      ]
    },
    ...
   "bGuSrzBFTUii9qto-vPdQw:2391948877" : {
      "node" : "bGuSrzBFTUii9qto-vPdQw",
      "id" : 2391948877,
      "type" : "netty",
      "action" : "indices:data/write/bulk",
      "description" : "requests[835], indices[]",
      "start_time" : "2018-06-12T16:05:56.318Z",
      "start_time_in_millis" : 1528819556318,
      "running_time" : "5.9d",
      "running_time_in_nanos" : 513398024435017,
      "cancellable" : false,
      "children" : [
        {
          "node" : "bGuSrzBFTUii9qto-vPdQw",
          "id" : 2391948914,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "rerouted"
          },
          "description" : "requests[4], index[841.1528651620000.1]",
          "start_time" : "2018-06-12T16:05:56.323Z",
          "start_time_in_millis" : 1528819556323,
          "running_time" : "5.9d",
          "running_time_in_nanos" : 513398020168910,
          "cancellable" : false,
          "parent_task_id" : "bGuSrzBFTUii9qto-vPdQw:2391948877",
          "children" : [
            {
              "node" : "evdbfKMRT_CDufvaHRaDAA",
              "id" : 2819410023,
              "type" : "netty",
              "action" : "indices:data/write/bulk[s]",
              "status" : {
                "phase" : "waiting_on_primary"
              },
              "description" : "requests[4], index[841.1528651620000.1]",
              "start_time" : "2018-06-12T16:05:56.322Z",
              "start_time_in_millis" : 1528819556322,
              "running_time" : "5.9d",
              "running_time_in_nanos" : 513398020793778,
              "cancellable" : false,
              "parent_task_id" : "bGuSrzBFTUii9qto-vPdQw:2391948914",
              "children" : [
                {
                  "node" : "evdbfKMRT_CDufvaHRaDAA",
                  "id" : 2819410080,
                  "type" : "direct",
                  "action" : "indices:data/write/bulk[s][p]",
                  "status" : {
                    "phase" : "primary"
                  },
                  "description" : "requests[4], index[841.1528651620000.1]",
                  "start_time" : "2018-06-12T16:05:56.346Z",
                  "start_time_in_millis" : 1528819556346,
                  "running_time" : "5.9d",
                  "running_time_in_nanos" : 513397996695634,
                  "cancellable" : false,
                  "parent_task_id" : "evdbfKMRT_CDufvaHRaDAA:2819410023"
                },
                {
                  "node" : "FDW9DMoWTiOX7AeQ8xmdCA",
                  "id" : 2589448571,
                  "type" : "netty",
                  "action" : "indices:data/write/bulk[s][r]",
                  "status" : {
                    "phase" : "replica"
                  },
                  "description" : "requests[4], index[841.1528651620000.1]",
                  "start_time" : "2018-06-12T16:05:56.380Z",
                  "start_time_in_millis" : 1528819556380,
                  "running_time" : "5.9d",
                  "running_time_in_nanos" : 513397960736985,
                  "cancellable" : false,
                  "parent_task_id" : "evdbfKMRT_CDufvaHRaDAA:2819410023"
                }
              ]
            }
          ]
        }
      ]
    },
    ...

@bleskes added the :Distributed Indexing/CRUD label and removed the :Core/Infra/Core label on Jun 18, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@bleskes
Contributor

bleskes commented Jun 18, 2018

@jeancornic thanks. It seems these are stuck waiting for a replica that never responded. The two tickets that you mention are similar in nature, but I don't see anything here that points in that direction.

I understand that you see this regularly. Is that true? If so, can you maybe upgrade to 6.3.0 and see how it goes? It would help reduce the scope of the search.

Some more generic questions about your setup:

  1. Do you use dynamic mapping?
  2. Is there anything else of note in the logs? The replica nodes (FDW9DMoWTiOX7AeQ8xmdCA & pd5MT4eUSEqsOy_oq_q-vw) are interesting.

@jeancornic
Author

jeancornic commented Jun 19, 2018

@bleskes Thanks for those comments.
We see this regularly indeed, several times a week.

Migrating to 6.3.0 is not very straightforward for us to be honest.

  1. We don't use dynamic mapping
  2. I investigated another, more recent task than the one in the example:
"WpsOgw9JTxivBHkMatVPrQ:2158610039" : {
      "node" : "WpsOgw9JTxivBHkMatVPrQ",
      "id" : 2158610039,
      "type" : "netty",
      "action" : "indices:data/write/bulk",
      "description" : "requests[671], indices[]",
      "start_time" : "2018-06-17T22:06:29.484Z",
      "start_time_in_millis" : 1529273189484,
      "running_time" : "1.7d",
      "running_time_in_nanos" : 153023445214125,
      "cancellable" : false,
      "children" : [
        {
          "node" : "WpsOgw9JTxivBHkMatVPrQ",
          "id" : 2158610093,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "rerouted"
          },
          "description" : "requests[29], index[13.1529271600000.1]",
          "start_time" : "2018-06-17T22:06:29.487Z",
          "start_time_in_millis" : 1529273189487,
          "running_time" : "1.7d",
          "running_time_in_nanos" : 153023441823453,
          "cancellable" : false,
          "parent_task_id" : "WpsOgw9JTxivBHkMatVPrQ:2158610039",
          "children" : [
            {
              "node" : "63tYulq1Q7qhsMAcQXR6Bg",
              "id" : 1928749409,
              "type" : "netty",
              "action" : "indices:data/write/bulk[s]",
              "status" : {
                "phase" : "waiting_on_primary"
              },
              "description" : "requests[29], index[13.1529271600000.1]",
              "start_time" : "2018-06-17T22:06:29.488Z",
              "start_time_in_millis" : 1529273189488,
              "running_time" : "1.7d",
              "running_time_in_nanos" : 153023440722690,
              "cancellable" : false,
              "parent_task_id" : "WpsOgw9JTxivBHkMatVPrQ:2158610093",
              "children" : [
                {
                  "node" : "FDW9DMoWTiOX7AeQ8xmdCA",
                  "id" : 2871784213,
                  "type" : "netty",
                  "action" : "indices:data/write/bulk[s][r]",
                  "status" : {
                    "phase" : "replica"
                  },
                  "description" : "requests[29], index[13.1529271600000.1]",
                  "start_time" : "2018-06-17T22:06:29.544Z",
                  "start_time_in_millis" : 1529273189544,
                  "running_time" : "1.7d",
                  "running_time_in_nanos" : 153023387248385,
                  "cancellable" : false,
                  "parent_task_id" : "63tYulq1Q7qhsMAcQXR6Bg:1928749409"
                },
                {
                  "node" : "63tYulq1Q7qhsMAcQXR6Bg",
                  "id" : 1928749487,
                  "type" : "direct",
                  "action" : "indices:data/write/bulk[s][p]",
                  "status" : {
                    "phase" : "primary"
                  },
                  "description" : "requests[29], index[13.1529271600000.1]",
                  "start_time" : "2018-06-17T22:06:29.499Z",
                  "start_time_in_millis" : 1529273189499,
                  "running_time" : "1.7d",
                  "running_time_in_nanos" : 153023430180000,
                  "cancellable" : false,
                  "parent_task_id" : "63tYulq1Q7qhsMAcQXR6Bg:1928749409"
                }
              ]
            }
          ]
        }
      ]
    }

In the logs (on the 3 nodes WpsOgw9JTxivBHkMatVPrQ, 63tYulq1Q7qhsMAcQXR6Bg & FDW9DMoWTiOX7AeQ8xmdCA), I don't find anything relevant:

  • indexing slowlogs, < 500ms

  • some debug logs, async flush request, async roll translog generation & merge

...
[2018-06-17T22:11:22,131][DEBUG][o.e.i.s.IndexShard       ] [***] [13.1529271600000.1][4] submitting async flush request
...
[2018-06-17T22:06:37,917][DEBUG][o.e.i.s.IndexShard       ] [***] [13.1529271600000.1][0] submitting async roll translog generation request
...
[2018-06-17T22:06:41,358][DEBUG][o.e.i.e.InternalEngine$EngineMergeScheduler] [***] [13.1529271600000.1][4] merge segment [_7e] done: took [25.4s], [375.2 MB], [526,460 docs], [0s stopped], [17.8s throttled], [364.3 MB written], [16.5 MB/sec throttle]
...

I'm also seeing this one:

[2018-06-19T02:18:09,096][WARN ][o.e.t.n.Netty4Transport  ] [***] send message failed [channel: org.elasticsearch.transport.netty4.NettyTcpChannel@542b0ecb]
java.nio.channels.ClosedChannelException: null
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]

but it occurred 4 hours after the bulk, so it's probably unrelated.

@bleskes
Contributor

bleskes commented Jun 22, 2018

@jeancornic thanks. Nothing pops up directly here, which means it's going to be a long search. It's good that it happens frequently, as it greatly increases the chance of finding out what it is.

The last log message you sent is important:

[2018-06-19T02:18:09,096][WARN ][o.e.t.n.Netty4Transport  ] [***] send message failed [channel: org.elasticsearch.transport.netty4.NettyTcpChannel@542b0ecb]
java.nio.channels.ClosedChannelException: null
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]

Can you post the full stack trace for this? Can you also check other nodes for transport-level logs that look suspicious?
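
A quick way to scan for those is something like the following (the log path is an assumption; adjust it to your installation):

# look for transport-level warnings/errors in the node logs
grep -E 'Netty4Transport|TransportService' /var/log/elasticsearch/*.log | grep -E 'WARN|ERROR'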

@linyy2233

linyy2233 commented Oct 9, 2018

We also encountered the same problem: these tasks caused the load on several nodes to become very high, and then the cluster could not accept writes. We had to restart the high-load nodes to recover. Is there a solution? Thank you.

@linyy2233

node hot_threads
[screenshot of hot_threads output]

@ywelsch
Contributor

ywelsch commented Oct 9, 2018

@linyy2233 it looks like you're having a different (unrelated) problem, possibly caused by having a very large JSON number in your document? Can you please open a topic on discuss.elastic.co for further investigation? Thanks.

Also, as there has been no reply on the original ticket, I'm closing this.

@ywelsch ywelsch closed this as completed Oct 9, 2018
@seborys

seborys commented Dec 20, 2018

@jeancornic Did you ever get more info on this? Experiencing the same.

@ywelsch
Contributor

ywelsch commented Dec 20, 2018

For the case where tasks are stuck on replicas (Tasks are organized in a tree hierarchy, and the leaves are tasks with an action name indices:data/write/bulk[s][r], i.e., ending in [r]), we found the root cause yesterday and created a fix (see #36770). Note that there can be other reasons for stuck tasks, so if your leaf tasks are not replica tasks, they might be stuck for a different reason.
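
If you want to check whether you are in that case, listing only the replica-level tasks should show them (again with curl -g so the brackets stay literal):

curl -g 'localhost:9200/_tasks?pretty&detailed&actions=indices:data/write/bulk[s][r]'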

@linyy2233

Thanks, I changed this parameter (numeric_detection) to false and the problem no longer appears.
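
For reference, numeric_detection is a mapping-level setting; a minimal 6.x-style sketch of disabling it at index creation (index and type names are illustrative):

curl -X PUT 'localhost:9200/my-index?pretty' -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "doc": {
      "numeric_detection": false
    }
  }
}'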

@IvanBoyko

IvanBoyko commented Mar 30, 2020

We experience the same problem on 6.8.7

Is there any generic workaround to delete stuck tasks?
We are OK with losing this data; it would just be great to recover the cluster.

Example:

    "hep4suvHTgWKgYGcDoseCA:5469": {
      "node": "hep4suvHTgWKgYGcDoseCA",
      "id": 5469,
      "type": "transport",
      "action": "indices:data/write/bulk",
      "description": "requests[2], indices[.monitoring-es-6-2020.03.30]",
      "start_time_in_millis": 1585591712424,
      "running_time_in_nanos": 6472053578715,
      "cancellable": false,
      "headers": {},
      "children": [
        {
          "node": "hep4suvHTgWKgYGcDoseCA",
          "id": 5483,
          "type": "transport",
          "action": "indices:data/write/bulk[s]",
          "status": {
            "phase": "waiting_on_primary"
          },
          "description": "requests[2], index[.monitoring-es-6-2020.03.30]",
          "start_time_in_millis": 1585591712440,
          "running_time_in_nanos": 6472037944119,
          "cancellable": false,
          "parent_task_id": "hep4suvHTgWKgYGcDoseCA:5469",
          "headers": {},
          "children": [
            {
              "node": "hep4suvHTgWKgYGcDoseCA",
              "id": 6148,
              "type": "direct",
              "action": "indices:data/write/bulk[s][p]",
              "status": {
                "phase": "primary"
              },
              "description": "requests[2], index[.monitoring-es-6-2020.03.30]",
              "start_time_in_millis": 1585591720956,
              "running_time_in_nanos": 6463522255014,
              "cancellable": false,
              "parent_task_id": "hep4suvHTgWKgYGcDoseCA:5483",
              "headers": {}
            }
          ]
        }
      ]
    }
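
For what it's worth, the task cancellation API is the usual way to kill a running task, but it only works for tasks that report "cancellable": true; the bulk tasks above report "cancellable": false, so the close / re-open workaround mentioned earlier in the thread (or restarting the affected node) is what remains. A sketch of the cancel call anyway, using the task id from the example above:

curl -X POST 'localhost:9200/_tasks/hep4suvHTgWKgYGcDoseCA:5469/_cancel?pretty'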
