-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[task manager] provide better diagnostics when task manager performance is degraded #109741
Conversation
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM once using docLinks
plugin 👍 great work!
if (statusWithoutCapacity === HealthStatus.Warning) { | ||
logLevel = LogLevel.Warn; | ||
} else if (statusWithoutCapacity === HealthStatus.Error && !isEmpty(monitoredHealth.stats)) { | ||
logLevel = LogLevel.Error; | ||
} | ||
|
||
const message = `Latest Monitored Stats: ${JSON.stringify(monitoredHealth)}`; | ||
const detectedProblemMessage = `Task Manager detected a degradation in performance. This is usually temporary, and Kibana can recover automatically. If the problem persists, check the docs for troubleshooting information: https://www.elastic.co/guide/en/kibana/current/task-manager-health-monitoring.html .`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably generate the URL using the docLinks
plugin. (example PR https://github.com/elastic/kibana/pull/92953/files).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woops - I meant to leave a TODO for this (which would be easy to miss anyway), but there is no server-side docLinks plugin, yet - I did just add it as a TODO though :-)
To Do
open issue to change the new performance warning error to use server-side docLinks instead of the
current
reference, once we have server-side doclinks via this issue: [core] docLinks server-side service #95389 - the new issue will be blocked on that
Open for other ideas, felt like hard-coding current
was going to be the best story for now ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh that's true, it's only public.
What about something like the following?
import { kibanaPackageJson } from '@kbn/utils';
https://www.elastic.co/guide/en/kibana/${kibanaPackageJson.branch}/task-manager-health-monitoring.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #109937 to track this (and deleted the TODO in the top comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya, it does look like the existing docLink stuff makes use of the branch
property, so seems doable. Up to you - we'll need to thread it through all the calls ... :-) I think long-term we still want docLinks for this, to handle cases when the URLs change (I assume it deals with that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a url shortening service? I vaguely remember using one in the past
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A URL shortener to make the doc link shorter? I'm not sure I want that - I'd prefer the user see that we're pointing them to doc, vs pointing them to the great unknown :-).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out using @kbn/utils
to get the patch version is absurdly easy: in commit e8c9b08
Thanks for the tip, Mike! I want to use more packages like this :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Left one comment about any potential follow up issues.
logger.warn( | ||
`Detected potential performance issue with Task Manager. Set 'xpack.task_manager.monitored_stats_health_verbose_log.enabled: true' in your Kibana.yml to enable debug logging` | ||
); | ||
logger.debug(detectedProblemMessage); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand why we're changing it back to debug for this PR, but we're effectively undoing this issue that was done for O11y of Alerting, where the intent was to try to have task manager self-introspect when it might be having problems and let the user know so they could turn on debug logging easily. Are we officially giving up on that part of O11y of Alerting because we've determined it's too noisy and too many false positives or will there be another follow up issue to try to achieve it in a less noisy way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's it in a nutshell - it's too noisy and too many false positives. I'll open a new issue to track this, note some of the issues we've already seen. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
opened #109941 to track this
@elasticmachine merge upstream |
@elasticmachine merge upstream |
@elasticmachine merge upstream |
…ce is degraded (elastic#109741) resolves elastic#109095 resolves elastic#106854 Changes the way task manager and alerting perform their health / status checks: - no longer sets an `unavailable` status; now uses `degraded` instead - change task manager "hot stats freshness" calculation to allow for staler data before signalling a problem - Changed the "Detected potential performance issue" message to sound less scary, include a doc link to task manager health monitoring, and log a debug instead of warning level - add additional debug logging when task manager sets a status that's not `available`, indicating why it's setting that status (in the code, it's when task manager uses HealthStatus.Warning or Error)
…ce is degraded (elastic#109741) resolves elastic#109095 resolves elastic#106854 Changes the way task manager and alerting perform their health / status checks: - no longer sets an `unavailable` status; now uses `degraded` instead - change task manager "hot stats freshness" calculation to allow for staler data before signalling a problem - Changed the "Detected potential performance issue" message to sound less scary, include a doc link to task manager health monitoring, and log a debug instead of warning level - add additional debug logging when task manager sets a status that's not `available`, indicating why it's setting that status (in the code, it's when task manager uses HealthStatus.Warning or Error)
…ce is degraded (elastic#109741) resolves elastic#109095 resolves elastic#106854 Changes the way task manager and alerting perform their health / status checks: - no longer sets an `unavailable` status; now uses `degraded` instead - change task manager "hot stats freshness" calculation to allow for staler data before signalling a problem - Changed the "Detected potential performance issue" message to sound less scary, include a doc link to task manager health monitoring, and log a debug instead of warning level - add additional debug logging when task manager sets a status that's not `available`, indicating why it's setting that status (in the code, it's when task manager uses HealthStatus.Warning or Error) # Conflicts: # x-pack/plugins/task_manager/server/monitoring/capacity_estimation.ts # x-pack/plugins/task_manager/server/monitoring/task_run_statistics.test.ts # x-pack/plugins/task_manager/server/routes/health.test.ts
…ce is degraded (#109741) (#110870) resolves #109095 resolves #106854 Changes the way task manager and alerting perform their health / status checks: - no longer sets an `unavailable` status; now uses `degraded` instead - change task manager "hot stats freshness" calculation to allow for staler data before signalling a problem - Changed the "Detected potential performance issue" message to sound less scary, include a doc link to task manager health monitoring, and log a debug instead of warning level - add additional debug logging when task manager sets a status that's not `available`, indicating why it's setting that status (in the code, it's when task manager uses HealthStatus.Warning or Error)
…ce is degraded (#109741) (#110869) resolves #109095 resolves #106854 Changes the way task manager and alerting perform their health / status checks: - no longer sets an `unavailable` status; now uses `degraded` instead - change task manager "hot stats freshness" calculation to allow for staler data before signalling a problem - Changed the "Detected potential performance issue" message to sound less scary, include a doc link to task manager health monitoring, and log a debug instead of warning level - add additional debug logging when task manager sets a status that's not `available`, indicating why it's setting that status (in the code, it's when task manager uses HealthStatus.Warning or Error)
…rformance is degraded (#109741) (#110875) * [task manager] provide better diagnostics when task manager performance is degraded (#109741) resolves #109095 resolves #106854 Changes the way task manager and alerting perform their health / status checks: - no longer sets an `unavailable` status; now uses `degraded` instead - change task manager "hot stats freshness" calculation to allow for staler data before signalling a problem - Changed the "Detected potential performance issue" message to sound less scary, include a doc link to task manager health monitoring, and log a debug instead of warning level - add additional debug logging when task manager sets a status that's not `available`, indicating why it's setting that status (in the code, it's when task manager uses HealthStatus.Warning or Error) # Conflicts: # x-pack/plugins/task_manager/server/monitoring/capacity_estimation.ts # x-pack/plugins/task_manager/server/monitoring/task_run_statistics.test.ts # x-pack/plugins/task_manager/server/routes/health.test.ts * fix backport to remove post-7.14 stuff
💔 Build Failed
Failed CI Steps
Test FailuresKibana Pipeline / general / task-queue-process-15 / X-Pack Endpoint API Integration Tests.x-pack/test/security_solution_endpoint_api_int/apis/metadata·ts.Endpoint plugin test metadata api POST /api/endpoint/metadata when index is not empty metadata api should return one entry for each host with default pagingStandard Out
Stack Trace
Kibana Pipeline / general / Plugin Functional Tests.test/plugin_functional/test_suites/data_plugin/session·ts.data plugin Session management Dashboard "after each" hook for "on load there is a single session"Standard Out
Stack Trace
Kibana Pipeline / general / Chrome UI Functional Tests.test/functional/apps/visualize/_heatmap_chart·ts.visualize app visualize ciGroup10 heatmap chart should show 6 color ranges if changed on optionsStandard Out
Stack Trace
and 24 more failures, only showing the first 3. Metrics [docs]
History
To update your PR or re-run it, just comment with: |
Summary
resolves #109095
resolves #106854
resolves #110332
changes to task manager:
unavailable
if it found a Health status of error, which caused Kibana itself to claim it wasunavailable
. It no longer does this, and will instead set it's status todegraded
, which is also reflected as the Kibana status. There have been cases of users seeing theunavailable
status and misunderstanding that Kibana crashed, which it did not.changes to alerting:
unavailable
plugin status, changing all references of that todegraded
Checklist
Delete any items that are not applicable to this PR.
[ ] Documentation was added for features that require explanation or tutorialsAny UI touched in this PR is usable by keyboard only (learn more about keyboard accessibility)Any UI touched in this PR does not create any new axe failures (run axe in browser: FF, Chrome)If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker listThis renders correctly on smaller devices using a responsive layout. (You can test this in your browser)This was checked for cross-browser compatibilityRelease note
Improves Task Manager performance monitoring to log more diagnostic data in debug logs, provide a more accurate logged message when performance degradation is noticed, and never set the plugin status of "unavailable", using "degraded" instead.