
Add details about what acquired the shard lock last #38807

Merged

dakrone merged 3 commits into elastic:master from lock-details Feb 28, 2019

Conversation

@dakrone (Member) commented Feb 12, 2019

This adds a `details` parameter to shard locking in `NodeEnvironment`. It is
intended to be used for diagnosing issues such as

```
  1> [2019-02-11T14:34:19,262][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_s0] [.tasks/oSYOG0-9SHOx_pfAoiSExQ] deleting index
  1> [2019-02-11T14:34:19,279][WARN ][o.e.i.IndicesService     ] [node_s0] [.tasks/oSYOG0-9SHOx_pfAoiSExQ] failed to delete index
  1> org.elasticsearch.env.ShardLockObtainFailedException: [.tasks][0]: obtaining shard lock timed out after 0ms
  1> 	at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:736) ~[main/:?]
  1> 	at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:655) ~[main/:?]
  1> 	at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:601) ~[main/:?]
  1> 	at org.elasticsearch.env.NodeEnvironment.deleteIndexDirectorySafe(NodeEnvironment.java:554) ~[main/:?]
```

The hope is that the recorded details will let us determine why the shard is still locked.

Relates to #30290 as well as some other CI failures.
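To make the idea concrete, here is a minimal standalone sketch, assuming a simplified lock class (`DetailedShardLock` is a made-up name; the real logic lives in `NodeEnvironment$InternalShardLock`). On a successful acquire the caller's `details` string is recorded; on a timeout the error reports whatever the previous holder recorded:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Minimal sketch only: DetailedShardLock is a hypothetical stand-in for
// NodeEnvironment$InternalShardLock, not the real Elasticsearch code.
class DetailedShardLock {
    private final Semaphore mutex = new Semaphore(1);
    private String details = "none"; // written only by the thread that just acquired the mutex

    void acquire(long timeoutMs, String details) throws InterruptedException {
        if (mutex.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS)) {
            this.details = details; // record what acquired the lock, for later diagnostics
        } else {
            // Report the previous holder's details in the failure message.
            throw new IllegalStateException("obtaining shard lock timed out after " + timeoutMs
                + "ms, previous lock details: [" + this.details
                + "] trying to lock for [" + details + "]");
        }
    }

    void release() {
        mutex.release();
    }
}
```

A caller would then pass a reason such as "shard creation" when locking, so a timeout immediately says what is still holding the lock.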

@dakrone added the WIP, :Core/Infra/Core (Core issues without another label), and v8.0.0 labels Feb 12, 2019
@elasticmachine (Collaborator) commented

Pinging @elastic/es-core-infra

@dakrone (Member, Author) commented Feb 12, 2019

I opened this as a "WIP" because I want to solicit feedback about it.

If folks think there are better ways to approach debugging these sorts of things, I would love to hear them! Please do let me know if you like or don't like this approach. (It's not thread-safe, but I didn't want to complicate the locking; I wish we had a SemaphoreWithMessage class, sketched below.)
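The wished-for class might look something like the following hypothetical sketch (no such class exists in the JDK or in Elasticsearch; the name and shape are illustrative only):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical SemaphoreWithMessage: a plain Semaphore that also remembers
// a message describing its current holder. Purely illustrative.
class SemaphoreWithMessage extends Semaphore {
    private volatile String message = "never acquired";

    SemaphoreWithMessage(int permits) {
        super(permits);
    }

    // Like tryAcquire(timeout, unit), but records who acquired the permit.
    boolean tryAcquire(long timeout, TimeUnit unit, String message) throws InterruptedException {
        boolean acquired = super.tryAcquire(timeout, unit);
        if (acquired) {
            this.message = message; // volatile, so a failed acquirer sees a recent value
        }
        return acquired;
    }

    String currentMessage() {
        return message;
    }
}
```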

@talevy (Contributor) commented Feb 14, 2019

@dakrone I think this looks like it can be helpful; this issue comes up pretty frequently in CI. Mind adding a Java comment around the threading concern?

@dakrone removed the WIP label Feb 22, 2019
@danielmitterdorfer (Member) left a comment

I left a few comments. Can you also please explain why this is not thread-safe?

@dakrone (Member, Author) commented Feb 26, 2019

> Can you also please explain why this is not thread-safe?

Actually, I think I was in error: this looks like it is thread-safe after all. Even though `details` is not synchronized, it is only ever written on code paths where the mutex has just been acquired, and it never escapes those paths, so it shouldn't ever change while the mutex is not held.
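A small runnable demonstration of that reasoning, using hypothetical names rather than the real `InternalShardLock`: the only write to `details` sits on the path where the semaphore was just acquired, so at most one thread can ever be writing it.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Demonstrates the invariant: `details` is written only by the thread that
// holds the mutex, so the unsynchronized field never changes concurrently.
public class LockDetailsDemo {
    static final Semaphore mutex = new Semaphore(1);
    static String details = "none";

    public static void main(String[] args) throws InterruptedException {
        mutex.acquire();
        details = "closing shard"; // main thread holds the lock and records why

        Thread contender = new Thread(() -> {
            try {
                if (!mutex.tryAcquire(100, TimeUnit.MILLISECONDS)) {
                    // Timed out: this read is unsynchronized, but no writer can
                    // be active while main still holds the mutex, so it is safe
                    // to use for a best-effort diagnostic message.
                    System.out.println("timed out; lock held for [" + details + "]");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        contender.start(); // Thread.start() also makes the write above visible
        contender.join();
        mutex.release();
    }
}
```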

@danielmitterdorfer (Member) left a comment

Thanks for iterating! LGTM

@dakrone merged commit d743ea7 into elastic:master Feb 28, 2019
dakrone added a commit that referenced this pull request Feb 28, 2019
@dakrone deleted the lock-details branch February 28, 2019 18:17
dakrone added a commit that referenced this pull request Feb 28, 2019
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 18, 2020
Today a common reason for a `ShardLockObtainFailedException` is when a
shard is removed from a node and then assigned straight back to it again
before the node has had a chance to shut the previous shard instance
down. For instance, this can happen if a node briefly leaves the cluster
holding a primary with no in-sync replicas.

The message in this case is typically as follows:

    obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]

This is pretty hard to interpret, and doesn't raise the important
question: "why didn't the shard shut down sooner?"

With this change we reword the message a bit, report the age of the
shard lock, and adjust the details to report that the lock is held by a
closing shard:

    obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [12345ms]

Relates elastic#38807
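As a rough illustration of the new format, a hypothetical helper (not the actual Elasticsearch formatting code) could derive the age from a timestamp recorded when the lock was acquired:

```java
// Hypothetical sketch of assembling the reworded message; the real
// implementation lives in Elasticsearch's shard-lock code.
class ShardLockMessage {
    static String timeoutMessage(String wantedFor, long timeoutMs,
                                 String heldFor, long acquiredAtMillis) {
        long ageMs = System.currentTimeMillis() - acquiredAtMillis;
        return "obtaining shard lock for [" + wantedFor + "] timed out after ["
            + timeoutMs + "ms], lock already held for [" + heldFor
            + "] with age [" + ageMs + "ms]";
    }

    public static void main(String[] args) {
        // Prints the example message quoted in the commit description.
        System.out.println(timeoutMessage("starting shard", 5000,
            "closing shard", System.currentTimeMillis() - 12345));
    }
}
```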
DaveCTurner added a commit that referenced this pull request Aug 19, 2020
DaveCTurner added a commit that referenced this pull request Aug 19, 2020