Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Capture stack traces while issuing IndexShard operations permits to easy debugging #28567

Merged
merged 4 commits into from
Feb 8, 2018

Conversation

bleskes
Copy link
Contributor

@bleskes bleskes commented Feb 8, 2018

Today we acquire a permit from the shard to coordinate between indexing operations, recoveries and other state transitions. When we leak an permit it's practically impossible to find who the culprit is. This PR add stack traces capturing for each permit so we can identify which part of the code is responsible for acquiring the unreleased permit. This code is only active when assertions are active.

The output is something like:

java.lang.AssertionError: shard [test][1] on node [node_s0] has pending operations:
--> java.lang.RuntimeException: something helpful 2
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:223)
	at org.elasticsearch.index.shard.IndexShard.<init>(IndexShard.java:322)
	at org.elasticsearch.index.IndexService.createShard(IndexService.java:382)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:514)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:143)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:552)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:529)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:231)
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:498)
	at java.base/java.lang.Iterable.forEach(Iterable.java:75)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:495)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:482)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:161)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:566)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
	at java.base/java.lang.Thread.run(Thread.java:844)

--> java.lang.RuntimeException: something helpful
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:223)
	at org.elasticsearch.index.shard.IndexShard.<init>(IndexShard.java:311)
	at org.elasticsearch.index.IndexService.createShard(IndexService.java:382)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:514)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:143)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:552)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:529)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:231)
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:498)
	at java.base/java.lang.Iterable.forEach(Iterable.java:75)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:495)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:482)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:161)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:566)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
	at java.base/java.lang.Thread.run(Thread.java:844)

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left some nits

}

public int getActiveOperationsCount() {
return indexShardOperationPermits.getActiveOperationsCount(); // refCount is incremented on successful acquire and decremented on close
}

/**
* @return a list of containing an exceptions for each operation permit that wasn't released yet. The stack traces of the exceptions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there seems to be a missing word

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an unneeded s in exceptions and a missing s in contain. Can you clarify?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I thought it meant to say "a list of items/things/XYZ containing an exception ..."

private volatile boolean closed;
private boolean delayed; // does not need to be volatile as all accesses are done under a lock on this

// only valid when assertions are enabled. Key is AtomicBoolean associated with each permit to ensure close once semantics. Value is an
// exception with some extra info in the message + a stack trace of the acquirer
private final Map<AtomicBoolean, RuntimeException> issuedPermits;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any particular reason to use RuntimeException, and not just Throwable like in MockPageCacheRecycler

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jason scared me into never using Throwable. will switch.

* @return a list of containing an exceptions for each permit that wasn't released yet. The stack traces of the exceptions
* was captured when the operation acquired the permit and their message contain the debug information supplied at the time.
*/
List<RuntimeException> getActiveOperations() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you test this method in IndexShardOperationPermitsTests that it returns something meaningful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, so I was doubting about that one. What is meaningful? I can check there is something for each open op and also the message contains the debug info. Is that what you mean?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. To check that the active operations are actually captured.

@@ -142,7 +142,7 @@ public RecoveryResponse recoverToTarget() throws IOException {
throw new DelayRecoveryException("source node does not have the shard listed in its state as allocated on the node");
}
assert targetShardRouting.initializing() : "expected recovery target to be initializing but was " + targetShardRouting;
});
}, shardId + " validating recovery target registered");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add targetShardRouting??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@bleskes
Copy link
Contributor Author

bleskes commented Feb 8, 2018

@ywelsch I pushed some commits that addresses your feedback. Can you take another look?

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks

@jasontedor jasontedor self-requested a review February 8, 2018 16:52
Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a suggestion.

final Object debugInfo) {
final Throwable debugInfoWithStackTrace;
if (Assertions.ENABLED) {
debugInfoWithStackTrace = new Throwable(debugInfo.toString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to do this without a Throwable? Can you use a Tuple of Object, StackTraceElement[] populated with Thread.currentThread().getStackTrace() (and as an added bonanza avoid rendering the Object to a String unless absolutely needed)?

@bleskes
Copy link
Contributor Author

bleskes commented Feb 8, 2018

@jasontedor and I agreed to get this in now and follow up on removing the throwable.

@bleskes bleskes closed this Feb 8, 2018
@bleskes bleskes reopened this Feb 8, 2018
@bleskes bleskes merged commit ba59cf1 into elastic:master Feb 8, 2018
@bleskes bleskes deleted the indexshard_permit_debug_map branch February 8, 2018 21:59
bleskes added a commit that referenced this pull request Feb 8, 2018
…asy debugging (#28567)

Today we acquire a permit from the shard to coordinate between indexing operations, recoveries and other state transitions. When we leak an  permit it's practically impossible to find who the culprit is. This PR add stack traces capturing for each permit so we can identify which part of the code is responsible for acquiring the unreleased permit. This code is only active when assertions are active.

The output is something like:
```
java.lang.AssertionError: shard [test][1] on node [node_s0] has pending operations:
--> java.lang.RuntimeException: something helpful 2
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:223)
	at org.elasticsearch.index.shard.IndexShard.<init>(IndexShard.java:322)
	at org.elasticsearch.index.IndexService.createShard(IndexService.java:382)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:514)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:143)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:552)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:529)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:231)
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:498)
	at java.base/java.lang.Iterable.forEach(Iterable.java:75)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:495)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:482)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:161)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:566)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
	at java.base/java.lang.Thread.run(Thread.java:844)

--> java.lang.RuntimeException: something helpful
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:223)
	at org.elasticsearch.index.shard.IndexShard.<init>(IndexShard.java:311)
	at org.elasticsearch.index.IndexService.createShard(IndexService.java:382)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:514)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:143)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:552)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:529)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:231)
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:498)
	at java.base/java.lang.Iterable.forEach(Iterable.java:75)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:495)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:482)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:161)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:566)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
	at java.base/java.lang.Thread.run(Thread.java:844)

```
bleskes added a commit that referenced this pull request Feb 12, 2018
…ack traces (#28598)

The is a follow up to #28567 changing the method used to capture stack traces, as requested
during the review. Instead of creating a throwable, we explicitly capture the stack trace of the
current thread. This should Make Jason Happy Again ™️ .
bleskes added a commit that referenced this pull request Feb 12, 2018
…ack traces (#28598)

The is a follow up to #28567 changing the method used to capture stack traces, as requested
during the review. Instead of creating a throwable, we explicitly capture the stack trace of the
current thread. This should Make Jason Happy Again ™️ .
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Apr 17, 2023
In elastic#28567 we introduced into `IndexShardOperationPermits` the tracking
of extra information about the permits it has handed out, which would
help if a test failed due to a leaked permit. I don't think we've seen
any such test failures in a very long time, so this extra test-only code
is not really useful any more. This commit removes it.
elasticsearchmachine pushed a commit that referenced this pull request Apr 18, 2023
In #28567 we introduced into `IndexShardOperationPermits` the tracking
of extra information about the permits it has handed out, which would
help if a test failed due to a leaked permit. I don't think we've seen
any such test failures in a very long time, so this extra test-only code
is not really useful any more. This commit removes it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants