
[BUG] Detector creation gets stuck on clusters with large shards and heavy ingestion #870

Closed
eirsep opened this issue Feb 27, 2024 · 0 comments
Labels: bug

eirsep (Member) commented on Feb 27, 2024

What is the bug?

Detector creation requests get stuck on clusters with large shards and heavy ingestion: their transport tasks keep running for hours, as the task dump below shows.

curl localhost:9200/_cat/tasks?v | less
action                                                     task_id                        parent_task_id                 type      start_time    timestamp running_time ip            node
cluster:admin/opensearch/securityanalytics/detector/write  NS5L3EYoSM2ED7Ivqq1snQ:990205  -                              transport 1706578051254 01:27:31  5.2h         10.212.107.51 5112ba5b511cfd4495
cluster:admin/opensearch/securityanalytics/detector/write  NS5L3EYoSM2ED7Ivqq1snQ:991197  -                              transport 1706578171258 01:29:31  5.2h         10.212.107.51 5112ba5b511cfd4495
cluster:admin/opensearch/securityanalytics/rule/search     wWwwf7eRSD2oo8KulgOF7Q:917083  -                              transport 1706578176167 01:29:36  5.2h         10.212.27.178 a173576e5c9b149d2e
cluster:admin/opendistro/alerting/monitor/write            NS5L3EYoSM2ED7Ivqq1snQ:991834  -                              transport 1706578242304 01:30:42  5.2h         10.212.107.51 5112ba5b511cfd4495
cluster:admin/opensearch/securityanalytics/detector/write  6x4YBILlRNqCh-H5SEGz4g:929275  -                              transport 1706578277667 01:31:17  5.2h         10.212.98.228 f8a85ed4b86db333fc
cluster:admin/opensearch/securityanalytics/mapping/get     NS5L3EYoSM2ED7Ivqq1snQ:992971  -                              transport 1706578360527 01:32:40  5.1h         10.212.107.51 5112ba5b511cfd4495
indices:admin/mappings/get   

How can one reproduce the bug?

There are a few blocking calls (invocations of actionGet()) that cause deadlocks in the detector creation flow (a sketch of the pattern follows below). On clusters with heavy ingestion and large shards the problem is magnified: the cluster chokes up and runs out of resources as threads sit stuck in these deadlocks.
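A minimal sketch of the blocking pattern described above, assuming a hypothetical mappings lookup during detector creation (the class and method names are illustrative, not the actual security-analytics code):

```java
import org.opensearch.action.admin.indices.mapping.get.GetMappingsRequest;
import org.opensearch.action.admin.indices.mapping.get.GetMappingsResponse;
import org.opensearch.client.Client;

public class BlockingMappingLookup {
    // Anti-pattern: actionGet() parks the calling thread until the response arrives.
    // If that thread belongs to a pool the downstream request also needs, or the
    // response is delayed by heavy ingestion on large shards, the task never
    // completes -- matching the multi-hour detector/write tasks in the dump above.
    static GetMappingsResponse getMappingsBlocking(Client client, String logIndex) {
        GetMappingsRequest request = new GetMappingsRequest().indices(logIndex);
        return client.admin().indices().getMappings(request).actionGet(); // blocking
    }
}
```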

What is the expected behavior?
The code should be event-driven, using the listener-based SPIs exposed by the OpenSearch transport client instead of blocking actionGet() calls (see the sketch below).
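For contrast, a sketch of the same lookup rewritten in the listener style (again illustrative, not the actual plugin code; the ActionListener import path varies across OpenSearch versions):

```java
import org.opensearch.action.admin.indices.mapping.get.GetMappingsRequest;
import org.opensearch.action.admin.indices.mapping.get.GetMappingsResponse;
import org.opensearch.client.Client;
import org.opensearch.core.action.ActionListener;

public class AsyncMappingLookup {
    // Event-driven variant: the calling thread is released immediately; the
    // continuation runs when the mappings response (or a failure) arrives.
    static void getMappingsAsync(Client client, String logIndex,
                                 ActionListener<GetMappingsResponse> listener) {
        GetMappingsRequest request = new GetMappingsRequest().indices(logIndex);
        client.admin().indices().getMappings(request, ActionListener.wrap(
                response -> {
                    // continue the detector-creation flow here instead of blocking
                    listener.onResponse(response);
                },
                listener::onFailure));
    }
}
```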

eirsep added the bug and untriaged labels on Feb 27, 2024
riysaxen-amzn pushed a commit to riysaxen-amzn/security-analytics that referenced this issue on Mar 25, 2024:

Fix getAlerts API for standard Alerting monitors (opensearch-project#875)
Signed-off-by: Ashish Agrawal <[email protected]>