-
Notifications
You must be signed in to change notification settings - Fork 24.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Capture stack traces while issuing IndexShard operations permits to easy debugging #28567
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Left some nits
} | ||
|
||
public int getActiveOperationsCount() { | ||
return indexShardOperationPermits.getActiveOperationsCount(); // refCount is incremented on successful acquire and decremented on close | ||
} | ||
|
||
/** | ||
* @return a list of containing an exceptions for each operation permit that wasn't released yet. The stack traces of the exceptions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there seems to be a missing word
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an unneeded s in exceptions and a missing s in contain. Can you clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I thought it meant to say "a list of items/things/XYZ containing an exception ..."
private volatile boolean closed; | ||
private boolean delayed; // does not need to be volatile as all accesses are done under a lock on this | ||
|
||
// only valid when assertions are enabled. Key is AtomicBoolean associated with each permit to ensure close once semantics. Value is an | ||
// exception with some extra info in the message + a stack trace of the acquirer | ||
private final Map<AtomicBoolean, RuntimeException> issuedPermits; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any particular reason to use RuntimeException, and not just Throwable like in MockPageCacheRecycler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jason scared me into never using Throwable. will switch.
* @return a list of containing an exceptions for each permit that wasn't released yet. The stack traces of the exceptions | ||
* was captured when the operation acquired the permit and their message contain the debug information supplied at the time. | ||
*/ | ||
List<RuntimeException> getActiveOperations() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you test this method in IndexShardOperationPermitsTests that it returns something meaningful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, so I was doubting about that one. What is meaningful? I can check there is something for each open op and also the message contains the debug info. Is that what you mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. To check that the active operations are actually captured.
@@ -142,7 +142,7 @@ public RecoveryResponse recoverToTarget() throws IOException { | |||
throw new DelayRecoveryException("source node does not have the shard listed in its state as allocated on the node"); | |||
} | |||
assert targetShardRouting.initializing() : "expected recovery target to be initializing but was " + targetShardRouting; | |||
}); | |||
}, shardId + " validating recovery target registered"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add targetShardRouting??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
@ywelsch I pushed some commits that addresses your feedback. Can you take another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a suggestion.
final Object debugInfo) { | ||
final Throwable debugInfoWithStackTrace; | ||
if (Assertions.ENABLED) { | ||
debugInfoWithStackTrace = new Throwable(debugInfo.toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to do this without a Throwable
? Can you use a Tuple
of Object, StackTraceElement[]
populated with Thread.currentThread().getStackTrace()
(and as an added bonanza avoid rendering the Object
to a String
unless absolutely needed)?
@jasontedor and I agreed to get this in now and follow up on removing the throwable. |
…asy debugging (#28567) Today we acquire a permit from the shard to coordinate between indexing operations, recoveries and other state transitions. When we leak an permit it's practically impossible to find who the culprit is. This PR add stack traces capturing for each permit so we can identify which part of the code is responsible for acquiring the unreleased permit. This code is only active when assertions are active. The output is something like: ``` java.lang.AssertionError: shard [test][1] on node [node_s0] has pending operations: --> java.lang.RuntimeException: something helpful 2 at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:223) at org.elasticsearch.index.shard.IndexShard.<init>(IndexShard.java:322) at org.elasticsearch.index.IndexService.createShard(IndexService.java:382) at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:514) at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:143) at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:552) at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:529) at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:231) at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:498) at java.base/java.lang.Iterable.forEach(Iterable.java:75) at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:495) at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:482) at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:161) at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:566) at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641) at java.base/java.lang.Thread.run(Thread.java:844) --> java.lang.RuntimeException: something helpful at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:223) at org.elasticsearch.index.shard.IndexShard.<init>(IndexShard.java:311) at org.elasticsearch.index.IndexService.createShard(IndexService.java:382) at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:514) at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:143) at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:552) at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:529) at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:231) at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:498) at java.base/java.lang.Iterable.forEach(Iterable.java:75) at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:495) at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:482) at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:161) at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:566) at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641) at java.base/java.lang.Thread.run(Thread.java:844) ```
In elastic#28567 we introduced into `IndexShardOperationPermits` the tracking of extra information about the permits it has handed out, which would help if a test failed due to a leaked permit. I don't think we've seen any such test failures in a very long time, so this extra test-only code is not really useful any more. This commit removes it.
In #28567 we introduced into `IndexShardOperationPermits` the tracking of extra information about the permits it has handed out, which would help if a test failed due to a leaked permit. I don't think we've seen any such test failures in a very long time, so this extra test-only code is not really useful any more. This commit removes it.
Today we acquire a permit from the shard to coordinate between indexing operations, recoveries and other state transitions. When we leak an permit it's practically impossible to find who the culprit is. This PR add stack traces capturing for each permit so we can identify which part of the code is responsible for acquiring the unreleased permit. This code is only active when assertions are active.
The output is something like: