Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[7.16] [ML] APM Correlations: Fix usage in load balancing/HA setups. (#115145) #118004

Merged
merged 1 commit into from
Nov 9, 2021

Conversation

walterra
Copy link
Contributor

@walterra walterra commented Nov 9, 2021

Backports the following commits to 7.16:

…c#115145)

- The way we customized the use of search strategies caused issues with race conditions when multiple Kibana instances were used for load balancing. This PR migrates away from search strategies and uses regular APM API endpoints.
- The task that manages calling the sequence of queries to run the correlations analysis is now in a custom React hook (useFailedTransactionsCorrelations / useLatencyCorrelations) instead of a task on the Kibana server side. While they show up as new lines/files in the git diff, the code for the hooks is more or less a combination of the previous useSearchStrategy and the server side service files that managed queries and state.
- The consuming React UI components only needed minimal changes. The above mentioned hooks return the same data structure as the previously used useSearchStrategy. This also means functional UI tests didn't need any changes and should pass as is.
- API integration tests have been added for the individual new endpoints. The test files that were previously used for the search strategies are still there to simulate a full analysis run, the assertions for the resulting data have the same values, it's just the structure that had to be adapted.
- Previously all ES queries of the analysis were run sequentially. The new endpoints run ES queries in parallel where possible. Chunking is managed in the hooks on the client side.
- For now the endpoints use the standard current user's esClient. I tried to use the APM client, but it was missing a wrapper for the fieldCaps method and I ran into a problem when trying to construct a random_score query. Sticking to the esClient allowed to leave most of the functions that run the actual queries unchanged. If possible I'd like to pick this up in a follow up. All the endpoints still use withApmSpan() now though.
- The previous use of generators was also refactored away, as mentioned above, the queries are now run in parallel.
Because we might run up to hundreds of similar requests for correlation analysis, we don't want the analysis to fail if just a single query fails like we did in the previous search strategy based task. I created a util splitAllSettledPromises() to handle Promise.allSettled() and split the results and errors to make the handling easier. Better naming suggestions are welcome 😅 . A future improvement could be to not run individual queries but combine them into nested aggs or using msearch. That's out of scope for this PR though.
@walterra walterra self-assigned this Nov 9, 2021
@walterra walterra requested a review from qn895 November 9, 2021 13:25
@walterra
Copy link
Contributor Author

walterra commented Nov 9, 2021

Tested this 7.16 manual backport again in a load balancing setup using CCS against dev-oblt.

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
apm 1184 1189 +5

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
apm 2.7MB 2.7MB +4.0KB

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
apm 37 41 +4

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @walterra

@qn895
Copy link
Member

qn895 commented Nov 9, 2021

Tested and functionality all works as expected 🎉

@walterra walterra merged commit 304911b into elastic:7.16 Nov 9, 2021
@walterra walterra deleted the backport/7.16/pr-115145 branch November 9, 2021 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants