Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make GetTrainedModelsStatsAction cancellable #87931

Closed
DaveCTurner opened this issue Jun 22, 2022 · 5 comments · Fixed by #88009
Closed

Make GetTrainedModelsStatsAction cancellable #87931

DaveCTurner opened this issue Jun 22, 2022 · 5 comments · Fixed by #88009
Assignees
Labels
>bug :ml Machine learning Team:ML Meta label for the ML team

Comments

@DaveCTurner
Copy link
Contributor

I encountered a Cloud cluster with an overworked master due (partly) to processing multiple calls to GET /_ml/anomaly_detectors/_all/_stats originating from an external Metricbeat monitoring process. Metricbeat imposes a 10s timeout after which it closes the HTTP connection and tries again. However, GetTrainedModelsStatsAction does not notice if the client connection closes (i.e. the REST handler does not use RestCancellableNodeClient and the resulting transport task is not a CancellableTask) so it carries on wastefully processing the request even after the client timeout.

Relates #55550

@DaveCTurner DaveCTurner added >bug :ml Machine learning labels Jun 22, 2022
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jun 22, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@benwtrent benwtrent self-assigned this Jun 22, 2022
@droberts195
Copy link
Contributor

Based on the Slack thread that went with this I think it's actually GetJobsStatsAction that was the problem, not GetTrainedModelsStatsAction. But we should probably add cancellation capability to all our stats actions that do more than one step internally to get the stats.

@DaveCTurner
Copy link
Contributor Author

Hm. Sorry, I think I made a mistake translating what I saw in the thread dump to the REST action I mentioned in Slack. The threads in question were here:

"elasticsearch[instance-REDACTED][generic][T#761]" ...
   java.lang.Thread.State: RUNNABLE
	at org.joni.ast.CClassNode.addCTypeByRange(CClassNode.java:236)
	at org.joni.ast.CClassNode.addCType(CClassNode.java:306)
	at org.joni.Parser.parseCharType(Parser.java:1253)
	at org.joni.Parser.parseExp(Parser.java:814)
	at org.joni.Parser.parseBranch(Parser.java:1342)
	at org.joni.Parser.parseSubExp(Parser.java:1368)
	at org.joni.Parser.parseEnclose(Parser.java:437)
	at org.joni.Parser.parseExp(Parser.java:767)
	at org.joni.Parser.parseBranch(Parser.java:1351)
	at org.joni.Parser.parseSubExp(Parser.java:1368)
	at org.joni.Parser.parseEnclose(Parser.java:687)
	at org.joni.Parser.parseExp(Parser.java:767)
	at org.joni.Parser.parseBranch(Parser.java:1351)
	at org.joni.Parser.parseSubExp(Parser.java:1368)
	at org.joni.Parser.parseRegexp(Parser.java:1401)
	at org.joni.Analyser.compile(Analyser.java:78)
	at org.joni.Regex.<init>(Regex.java:155)
	at org.joni.Regex.<init>(Regex.java:134)
	at org.joni.Regex.<init>(Regex.java:129)
	at org.elasticsearch.grok.Grok.<init>(Grok.java:102)
	at org.elasticsearch.grok.Grok.<init>(Grok.java:80)
	at org.elasticsearch.ingest.common.GrokProcessor.<init>(GrokProcessor.java:58)
	at org.elasticsearch.ingest.common.GrokProcessor$Factory.create(GrokProcessor.java:176)
	at org.elasticsearch.ingest.common.GrokProcessor$Factory.create(GrokProcessor.java:136)
	at org.elasticsearch.ingest.ConfigurationUtils.readProcessor(ConfigurationUtils.java:583)
	at org.elasticsearch.ingest.ConfigurationUtils.readProcessor(ConfigurationUtils.java:547)
	at org.elasticsearch.ingest.ConfigurationUtils.readProcessorConfigs(ConfigurationUtils.java:467)
	at org.elasticsearch.ingest.Pipeline.create(Pipeline.java:82)
	at org.elasticsearch.xpack.ml.action.TransportGetTrainedModelsStatsAction.lambda$pipelineIdsByModelIdsOrAliases$16(TransportGetTrainedModelsStatsAction.java:272)
	at org.elasticsearch.xpack.ml.action.TransportGetTrainedModelsStatsAction$$Lambda$8350/0x0000000800ee3c78.accept(Unknown Source)
	at java.util.HashMap.forEach([email protected]/HashMap.java:1421)
	at java.util.Collections$UnmodifiableMap.forEach([email protected]/Collections.java:1553)
	at org.elasticsearch.xpack.ml.action.TransportGetTrainedModelsStatsAction.pipelineIdsByModelIdsOrAliases(TransportGetTrainedModelsStatsAction.java:270)
	at org.elasticsearch.xpack.ml.action.TransportGetTrainedModelsStatsAction.lambda$doExecute$4(TransportGetTrainedModelsStatsAction.java:133)
	at org.elasticsearch.xpack.ml.action.TransportGetTrainedModelsStatsAction$$Lambda$8344/0x0000000800ee2f58.accept(Unknown Source)
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136)
	at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)
	at org.elasticsearch.client.internal.node.NodeClient.lambda$executeLocally$0(NodeClient.java:107)
	at org.elasticsearch.client.internal.node.NodeClient$$Lambda$6455/0x0000000801d39680.accept(Unknown Source)
	at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:176)
	at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:170)
	at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)
	at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:169)
	at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter$$Lambda$6463/0x0000000801d2ee78.accept(Unknown Source)
	at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:219)
	at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:447)
	at org.elasticsearch.action.support.nodes.TransportNodesAction.newResponseAsync(TransportNodesAction.java:181)
	at org.elasticsearch.action.support.nodes.TransportNodesAction.newResponse(TransportNodesAction.java:156)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.lambda$finishHim$0(TransportNodesAction.java:291)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$$Lambda$6601/0x0000000801efd570.run(Unknown Source)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:717)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
	at java.lang.Thread.run([email protected]/Thread.java:833)

"elasticsearch[instance-REDACTED][generic][T#654]" ...
   java.lang.Thread.State: RUNNABLE
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2120)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2052)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2063)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2052)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2248)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2052)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2052)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2063)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2052)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2063)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2052)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2063)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2310)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2052)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2063)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2052)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2248)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2063)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2052)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2063)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2052)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2248)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2052)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2063)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2052)
	at org.joni.Analyser.optimizeNodeLeft(Analyser.java:2063)
	at org.joni.Analyser.setOptimizedInfoFromTree(Analyser.java:2346)
	at org.joni.Analyser.compile(Analyser.java:149)
	at org.joni.Regex.<init>(Regex.java:155)
	at org.joni.Regex.<init>(Regex.java:134)
	at org.joni.Regex.<init>(Regex.java:129)
	at org.elasticsearch.grok.Grok.<init>(Grok.java:102)
	at org.elasticsearch.grok.Grok.<init>(Grok.java:80)
	at org.elasticsearch.ingest.common.GrokProcessor.<init>(GrokProcessor.java:53)
	at org.elasticsearch.ingest.common.GrokProcessor$Factory.create(GrokProcessor.java:176)
	at org.elasticsearch.ingest.common.GrokProcessor$Factory.create(GrokProcessor.java:136)
	at org.elasticsearch.ingest.ConfigurationUtils.readProcessor(ConfigurationUtils.java:583)
	at org.elasticsearch.ingest.ConfigurationUtils.readProcessor(ConfigurationUtils.java:547)
	at org.elasticsearch.ingest.ConfigurationUtils.readProcessorConfigs(ConfigurationUtils.java:467)
	at org.elasticsearch.ingest.Pipeline.create(Pipeline.java:82)
	at org.elasticsearch.xpack.ml.action.TransportGetTrainedModelsStatsAction.lambda$pipelineIdsByModelIdsOrAliases$16(TransportGetTrainedModelsStatsAction.java:272)
	at org.elasticsearch.xpack.ml.action.TransportGetTrainedModelsStatsAction$$Lambda$8350/0x0000000800ee3c78.accept(Unknown Source)
	at java.util.HashMap.forEach([email protected]/HashMap.java:1421)
	at java.util.Collections$UnmodifiableMap.forEach([email protected]/Collections.java:1553)
	at org.elasticsearch.xpack.ml.action.TransportGetTrainedModelsStatsAction.pipelineIdsByModelIdsOrAliases(TransportGetTrainedModelsStatsAction.java:270)
	at org.elasticsearch.xpack.ml.action.TransportGetTrainedModelsStatsAction.lambda$doExecute$4(TransportGetTrainedModelsStatsAction.java:133)
	at org.elasticsearch.xpack.ml.action.TransportGetTrainedModelsStatsAction$$Lambda$8344/0x0000000800ee2f58.accept(Unknown Source)
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136)
	at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)
	at org.elasticsearch.client.internal.node.NodeClient.lambda$executeLocally$0(NodeClient.java:107)
	at org.elasticsearch.client.internal.node.NodeClient$$Lambda$6455/0x0000000801d39680.accept(Unknown Source)
	at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:176)
	at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:170)
	at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)
	at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:169)
	at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter$$Lambda$6463/0x0000000801d2ee78.accept(Unknown Source)
	at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:219)
	at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:447)
	at org.elasticsearch.action.support.nodes.TransportNodesAction.newResponseAsync(TransportNodesAction.java:181)
	at org.elasticsearch.action.support.nodes.TransportNodesAction.newResponse(TransportNodesAction.java:156)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.lambda$finishHim$0(TransportNodesAction.java:291)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$$Lambda$6601/0x0000000801efd570.run(Unknown Source)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:717)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
	at java.lang.Thread.run([email protected]/Thread.java:833)

I'm not sure that this is due to Metricbeat now. Still, +1 on adding support for cancellation to all the things.

@droberts195
Copy link
Contributor

OK, I saw /_ml/anomaly_detectors/_all/_stats and that corresponds to GetJobsStatsAction.

But it looks like it really is GetTrainedModelsStatsAction.

@benwtrent as well as making the action cancellable it looks like we need a more efficient way to list which ingest pipelines a particular trained model is used in. It looks like it's instantiating all the processors at the moment, then looking for inference processors, taking the ID and throwing everything else away. But instantiating a Grok processor seems to be very expensive. So maybe we need a specialist method at a lower level that can search for inference processors in an ingest pipeline without instantiating every other processor.

@benwtrent
Copy link
Member

Arg, when I first wrote this code, I was extracting from opaque maps. Then I figured "Ya know, we could build the ingest pipelines and just extract from there and not iterate maps...". Welp, Time to go back and find that old code... 🤦

benwtrent added a commit that referenced this issue Jun 24, 2022
This change makes all the trained model APIs cancellable, and addresses the handful of APIs that rely on our abstract resource structure.

closes: #87931
benwtrent added a commit that referenced this issue Jun 28, 2022
Previous, get trained model stats API would build every pipeline defined in cluster state.

This is problematic when MANY pipelines are defined. Especially if those pipelines take some time to parse (consider GROK).

This improvement is part of fixing: #87931
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Jun 28, 2022
Previous, get trained model stats API would build every pipeline defined in cluster state.

This is problematic when MANY pipelines are defined. Especially if those pipelines take some time to parse (consider GROK).

This improvement is part of fixing: elastic#87931
elasticsearchmachine pushed a commit that referenced this issue Jun 28, 2022
Previous, get trained model stats API would build every pipeline defined in cluster state.

This is problematic when MANY pipelines are defined. Especially if those pipelines take some time to parse (consider GROK).

This improvement is part of fixing: #87931
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Jun 28, 2022
Previous, get trained model stats API would build every pipeline defined in cluster state.

This is problematic when MANY pipelines are defined. Especially if those pipelines take some time to parse (consider GROK).

This improvement is part of fixing: elastic#87931
elasticsearchmachine pushed a commit that referenced this issue Jun 28, 2022
Previous, get trained model stats API would build every pipeline defined in cluster state.

This is problematic when MANY pipelines are defined. Especially if those pipelines take some time to parse (consider GROK).

This improvement is part of fixing: #87931
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :ml Machine learning Team:ML Meta label for the ML team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants