Capture stack traces while issuing IndexShard operations permits to easy debugging #28567

bleskes · 2018-02-08T10:31:48Z

Today we acquire a permit from the shard to coordinate between indexing operations, recoveries and other state transitions. When we leak a permit it's practically impossible to find who the culprit is. This PR adds stack trace capturing for each permit so we can identify which part of the code is responsible for acquiring the unreleased permit. This code is only active when assertions are enabled.

The output is something like:

java.lang.AssertionError: shard [test][1] on node [node_s0] has pending operations:
--> java.lang.RuntimeException: something helpful 2
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:223)
	at org.elasticsearch.index.shard.IndexShard.<init>(IndexShard.java:322)
	at org.elasticsearch.index.IndexService.createShard(IndexService.java:382)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:514)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:143)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:552)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:529)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:231)
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:498)
	at java.base/java.lang.Iterable.forEach(Iterable.java:75)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:495)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:482)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:161)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:566)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
	at java.base/java.lang.Thread.run(Thread.java:844)

--> java.lang.RuntimeException: something helpful
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:223)
	at org.elasticsearch.index.shard.IndexShard.<init>(IndexShard.java:311)
	at org.elasticsearch.index.IndexService.createShard(IndexService.java:382)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:514)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:143)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:552)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:529)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:231)
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:498)
	at java.base/java.lang.Iterable.forEach(Iterable.java:75)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:495)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:482)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:161)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:566)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
	at java.base/java.lang.Thread.run(Thread.java:844)
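
A minimal sketch of the idea described above, for readers who want to see its shape outside Elasticsearch. `TracingPermits` and its methods are hypothetical names for illustration, not the code in this PR, and tracking is switched on through a constructor flag here so it can be exercised without `-ea`.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a permit issuer; not the Elasticsearch class.
final class TracingPermits {
    private final boolean tracking;
    // One entry per permit that has been issued and not yet released.
    private final Map<Object, RuntimeException> issued = new ConcurrentHashMap<>();

    TracingPermits(boolean tracking) {
        this.tracking = tracking;
    }

    // Returns true only when the JVM runs with assertions enabled (-ea).
    static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true; // the assignment only executes under -ea
        return enabled;
    }

    // Issues a permit; running the returned Runnable releases it.
    Runnable acquire(String debugInfo) {
        if (!tracking) {
            return () -> {};
        }
        Object key = new Object();
        // Constructing the exception records the acquirer's stack trace.
        issued.put(key, new RuntimeException(debugInfo));
        return () -> issued.remove(key);
    }

    int pendingCount() {
        return issued.size();
    }

    // Renders the unreleased permits; iteration order is not guaranteed.
    String describePending(String shard) {
        StringWriter out = new StringWriter();
        PrintWriter writer = new PrintWriter(out);
        writer.println(shard + " has pending operations:");
        for (RuntimeException e : issued.values()) {
            writer.print("--> ");
            e.printStackTrace(writer);
        }
        writer.flush();
        return out.toString();
    }
}
```

To mirror the "only active when assertions are active" behaviour, a caller could construct it as `new TracingPermits(TracingPermits.assertionsEnabled())`.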

ywelsch

LGTM. Left some nits

ywelsch · 2018-02-08T14:24:40Z

server/src/main/java/org/elasticsearch/index/shard/IndexShard.java

    }

    public int getActiveOperationsCount() {
        return indexShardOperationPermits.getActiveOperationsCount(); // refCount is incremented on successful acquire and decremented on close
    }

+    /**
+     * @return a list of containing an exceptions for each operation permit that wasn't released yet. The stack traces of the exceptions


there seems to be a missing word

There's an unneeded s in exceptions and a missing s in contain. Can you clarify?

ok, I thought it meant to say "a list of items/things/XYZ containing an exception ..."

ywelsch · 2018-02-08T14:26:11Z

server/src/main/java/org/elasticsearch/index/shard/IndexShardOperationPermits.java

    private volatile boolean closed;
    private boolean delayed; // does not need to be volatile as all accesses are done under a lock on this

+    // only valid when assertions are enabled. Key is AtomicBoolean associated with each permit to ensure close once semantics. Value is an
+    // exception with some extra info in the message + a stack trace of the acquirer
+    private final Map<AtomicBoolean, RuntimeException> issuedPermits;


any particular reason to use RuntimeException, and not just Throwable like in MockPageCacheRecycler?

Jason scared me into never using Throwable. will switch.
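
To illustrate the "close once" role of the `AtomicBoolean` key mentioned in the code comment above, here is a small sketch. It is an assumption about how such a map could be used, not the PR's implementation; `Permit`, `CloseOncePermits`, `activeOperations` and `trackedPermits` are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;

// Local stand-in for a releasable handle; close() is narrowed to throw no checked exception.
interface Permit extends AutoCloseable {
    @Override
    void close();
}

final class CloseOncePermits {
    private static final int TOTAL_PERMITS = Integer.MAX_VALUE;
    private final Semaphore semaphore = new Semaphore(TOTAL_PERMITS);
    // AtomicBoolean does not override equals/hashCode, so each key is matched by identity.
    private final Map<AtomicBoolean, Throwable> issuedPermits = new ConcurrentHashMap<>();

    Permit acquire(String debugInfo) {
        if (!semaphore.tryAcquire()) {
            throw new IllegalStateException("no permits available");
        }
        AtomicBoolean closed = new AtomicBoolean();
        issuedPermits.put(closed, new Throwable(debugInfo));
        return () -> {
            // Only the first close() wins the CAS, so a double close cannot release twice.
            if (closed.compareAndSet(false, true)) {
                issuedPermits.remove(closed);
                semaphore.release();
            }
        };
    }

    int activeOperations() {
        return TOTAL_PERMITS - semaphore.availablePermits();
    }

    int trackedPermits() {
        return issuedPermits.size();
    }
}
```

Using the flag as the map key means the release path needs no second lookup structure: the same object guards against double release and identifies the debug entry to drop.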

ywelsch · 2018-02-08T14:27:51Z

server/src/main/java/org/elasticsearch/index/shard/IndexShardOperationPermits.java

+     * @return a list of containing an exceptions for each permit that wasn't released yet. The stack traces of the exceptions
+     *         was captured when the operation acquired the permit and their message contain the debug information supplied at the time.
+     */
+    List<RuntimeException> getActiveOperations() {


can you test this method in IndexShardOperationPermitsTests that it returns something meaningful?

yeah, so I was doubting about that one. What is meaningful? I can check there is something for each open op and also the message contains the debug info. Is that what you mean?

yes. To check that the active operations are actually captured.
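
The check agreed on here (one entry per open operation, each message carrying the supplied debug info) could look roughly like the helper below. This is a sketch in plain Java rather than the test that was actually added to `IndexShardOperationPermitsTests`; `ActiveOperationsCheck` and `problems` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class ActiveOperationsCheck {
    // Returns a list of problems; an empty list means every open operation was captured.
    static List<String> problems(List<? extends Throwable> active, List<String> expectedDebugInfo) {
        List<String> problems = new ArrayList<>();
        // There should be exactly one captured entry per open operation.
        if (active.size() != expectedDebugInfo.size()) {
            problems.add("expected " + expectedDebugInfo.size()
                + " active operations but found " + active.size());
        }
        // Each piece of debug info should show up in some captured message.
        for (String info : expectedDebugInfo) {
            boolean found = active.stream()
                .anyMatch(t -> t.getMessage() != null && t.getMessage().contains(info));
            if (!found) {
                problems.add("no active operation mentions [" + info + "]");
            }
        }
        // A captured entry is only useful if it carries a stack trace.
        for (Throwable t : active) {
            if (t.getStackTrace().length == 0) {
                problems.add("operation [" + t.getMessage() + "] has no stack trace");
            }
        }
        return problems;
    }
}
```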

ywelsch · 2018-02-08T14:30:19Z

server/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java

@@ -142,7 +142,7 @@ public RecoveryResponse recoverToTarget() throws IOException {
                throw new DelayRecoveryException("source node does not have the shard listed in its state as allocated on the node");
            }
            assert targetShardRouting.initializing() : "expected recovery target to be initializing but was " + targetShardRouting;
-        });
+        }, shardId + " validating recovery target registered");


add targetShardRouting??

bleskes · 2018-02-08T16:28:15Z

@ywelsch I pushed some commits that address your feedback. Can you take another look?

ywelsch

LGTM. Thanks

jasontedor

I left a suggestion.

jasontedor · 2018-02-08T21:37:42Z

server/src/main/java/org/elasticsearch/index/shard/IndexShardOperationPermits.java

+                        final Object debugInfo) {
+        final Throwable debugInfoWithStackTrace;
+        if (Assertions.ENABLED) {
+            debugInfoWithStackTrace = new Throwable(debugInfo.toString());


Is it possible to do this without a Throwable? Can you use a Tuple of Object, StackTraceElement[] populated with Thread.currentThread().getStackTrace() (and as an added bonanza avoid rendering the Object to a String unless absolutely needed)?
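
A rough sketch of what this suggestion could look like, assuming a small holder class in place of a `Tuple`. `PermitDebugInfo`, `capture` and `render` are hypothetical names; only `Thread.currentThread().getStackTrace()` comes from the comment above.

```java
// Holds the caller-supplied debug object and the stack trace captured at acquire time.
final class PermitDebugInfo {
    private final Object debugInfo; // kept as an Object; toString() is deferred until render()
    private final StackTraceElement[] stackTrace;

    private PermitDebugInfo(Object debugInfo, StackTraceElement[] stackTrace) {
        this.debugInfo = debugInfo;
        this.stackTrace = stackTrace;
    }

    // Captures the current thread's stack without constructing a Throwable.
    // The first frames are getStackTrace() and capture() themselves.
    static PermitDebugInfo capture(Object debugInfo) {
        return new PermitDebugInfo(debugInfo, Thread.currentThread().getStackTrace());
    }

    // Renders the entry in a layout similar to a printed stack trace.
    String render() {
        StringBuilder sb = new StringBuilder("--> ").append(debugInfo);
        for (StackTraceElement element : stackTrace) {
            sb.append(System.lineSeparator()).append("\tat ").append(element);
        }
        return sb.toString();
    }
}
```

Because the debug object is stored unrendered, a permit that is released normally never pays for building its description string.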

bleskes · 2018-02-08T21:58:43Z

@jasontedor and I agreed to get this in now and follow up on removing the throwable.

…ack traces (#28598) This is a follow-up to #28567 that changes the method used to capture stack traces, as requested during the review. Instead of creating a throwable, we explicitly capture the stack trace of the current thread. This should Make Jason Happy Again ™️ .

In elastic#28567 we introduced into `IndexShardOperationPermits` the tracking of extra information about the permits it has handed out, which would help if a test failed due to a leaked permit. I don't think we've seen any such test failures in a very long time, so this extra test-only code is not really useful any more. This commit removes it.

bleskes added 2 commits February 8, 2018 00:24

add debug tooling for permits when assertions are enabled

8e72aa2

java docs

3df5aba

bleskes added >non-issue v7.0.0 v6.3.0 labels Feb 8, 2018

bleskes requested a review from ywelsch February 8, 2018 10:31

ywelsch approved these changes Feb 8, 2018

bleskes added 2 commits February 8, 2018 17:23

feedback plus test

44f85ad

Forgot some feedback

a6f2d81

ywelsch approved these changes Feb 8, 2018

jasontedor self-requested a review February 8, 2018 16:52

jasontedor requested changes Feb 8, 2018

bleskes closed this Feb 8, 2018

bleskes reopened this Feb 8, 2018

bleskes merged commit ba59cf1 into elastic:master Feb 8, 2018

bleskes deleted the indexshard_permit_debug_map branch February 8, 2018 21:59

bleskes mentioned this pull request Feb 9, 2018

IndexShardOperationPermits: shouldn't use new Throwable to capture stack traces #28598

Merged

bleskes mentioned this pull request Apr 23, 2018

[CI] RecoveryWhileUnderLoadIT.testRecoverWhileUnderLoadAllocateReplicasTest fails #29660

Closed

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

DaveCTurner mentioned this pull request Apr 17, 2023

Remove debugging cruft from IndexShardOperationPermits #95275

Merged
