Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Task Manager] Potential performance problem log happens on startup #106854

Closed
chrisronline opened this issue Jul 27, 2021 · 14 comments · Fixed by #109741
Closed

[Task Manager] Potential performance problem log happens on startup #106854

chrisronline opened this issue Jul 27, 2021 · 14 comments · Fixed by #109741
Assignees
Labels
estimate:needs-research Estimated as too large and requires research to break down into workable issues Feature:Task Manager insight Issues related to user insight into platform operations and resilience Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@chrisronline
Copy link
Contributor

chrisronline commented Jul 27, 2021

Reported originally by @LeeDr

Some users are seeing this on cloud upgrade and it might lead users to not understand if there is a problem or not. We might need to consider changing the verbiage on the message:

Detected potential performance issue with Task Manager. Set 'xpack.task_manager.monitored_stats_health_verbose_log.enabled: true' in your Kibana.yml to enable debug logging`)

to indicate that not all users need to worry about this, as this might happen if Kibana is stopped for some period of time.

Feel free to add more information to this.

@chrisronline chrisronline added loe:needs-research This issue requires some research before it can be worked on or estimated Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jul 27, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote
Copy link
Contributor

We'll keep this issue scoped to changing the message, if we can find a better one.

@pmuellr pmuellr self-assigned this Jul 29, 2021
@gmmorris gmmorris added resilience Issues related to Platform resilience in terms of scale, performance & backwards compatibility usability insight Issues related to user insight into platform operations and resilience estimate:needs-research Estimated as too large and requires research to break down into workable issues and removed resilience Issues related to Platform resilience in terms of scale, performance & backwards compatibility usability labels Aug 13, 2021
@pmuellr
Copy link
Member

pmuellr commented Aug 20, 2021

@gchaps we're hoping you can provide some alternative wording that would make this less "scary". Even with "potential" in there, folks have been concerned about this.

The basic idea is that if you see this message just once in a while, things are likely fine. Kibana was busy for a bit, then recovered. If you see it a lot, then you probably need to investigate.

We have some separate work, probably longer-term, to try to log this message less frequently, but right now it can get logged a few times in succession, and then stop. These are the situations we want to make this "less scary".

And although the message is actionable, and useful, in terms of the config setting that can be changed, I would actually like to change that to point to our current doc that describes all this stuff, here: https://www.elastic.co/guide/en/kibana/current/task-manager-health-monitoring.html . Do we do that in other places? Presumably we'd use a version-specific URL here (instead of current).

This link might be better, which is a FAQ-like page - https://www.elastic.co/guide/en/kibana/current/task-manager-troubleshooting.html - and presumably we'd update that doc to put info regarding this particular message at the top of the page - and that copy would presumably end up pointing to the task-manager-health-monitoring.html page.

@pmuellr
Copy link
Member

pmuellr commented Aug 20, 2021

I'm not sure there's actually a way to turn this message OFF - and wondering if we should provide that. Presumably a new cloud/docker-allowable config (or piggy-backed somehow on an existing one) to avoid logging it altogether.

@chrisronline
Copy link
Contributor Author

chrisronline commented Aug 20, 2021

FWIW, if the config is disabled, we default to the existing behavior where it would log at debug level: https://github.com/elastic/kibana/blob/master/x-pack/plugins/task_manager/server/lib/log_health_metrics.ts#L47

@pmuellr
Copy link
Member

pmuellr commented Aug 23, 2021

if the config is disabled, we default to the existing behavior where it would log at debug level:

except for the logger.warn(), here :-)

} else {
// This is legacy support - we used to always show this
logger.debug(message);
if (logLevel !== LogLevel.Debug && lastLogLevel === LogLevel.Debug) {
logger.warn(
`Detected potential performance issue with Task Manager. Set 'xpack.task_manager.monitored_stats_health_verbose_log.enabled: true' in your Kibana.yml to enable debug logging`
);
}
}

I'm thinking we probably want to change that to a logger.debug() ...

@chrisronline
Copy link
Contributor Author

Ah yea, true. I think we added that to avoid spamming the user if the config was off, but still give them some awareness that something might be off

@pmuellr
Copy link
Member

pmuellr commented Aug 23, 2021

Per #109095, we're still spamming the user :-), hence my thinking we could change this to a logger.debug(). I don't mind it being as spammy there ...

@gchaps
Copy link
Contributor

gchaps commented Aug 23, 2021

Does something like this work?

Task Manager detected a degradation in performance

This is usually temporary, and Kibana can recover automatically. If the problem persists, check the docs for information about debug logging and troubleshooting.

@pmuellr
Copy link
Member

pmuellr commented Aug 23, 2021

That does sound better, but is quite wordy for a log message :-). I think it captures what I'm hoping to express though, so is better than what we have now.

Any thoughts on adding a doc link? I believe we may be able to add a doc link to our "health status" that gets percolated all the way up to Kibana's overall status, so another option is to refer to the overall Kibana status here (also a link), which presumably would provide more information, including a doc link into the task manager docs.

@gchaps
Copy link
Contributor

gchaps commented Aug 23, 2021

Here are a couple of shorter versions

Task Manager detected a degradation in performance
This is usually temporary, and Kibana can recover automatically. If the problem persists, check the docs for troubleshooting information.

Task Manager detected a potential performance problem
Kibana was likely busy, or stopped, and then recovered. If the error persists, check the docs.

Can we put the link in the description?

@pmuellr
Copy link
Member

pmuellr commented Aug 24, 2021

Seeing the formatting, I'm wondering if you think these messages will be in a UI. Currently, this message only occurs in the Kibana logs. There's no fixed length, but it's nice to be shorter of course. And no notion of "bolding" anything. And link would need to be displayed in full.

For example, the log message would be (this is my preferred version so far):

Task Manager detected a degradation in performance. This is usually temporary, and Kibana can recover automatically. If the problem persists, check the docs for troubleshooting information at https://www.elastic.co/guide/en/kibana/7.14/task-manager-health-monitoring.html

Super-long, but I think captures everything I wanted it to.

Will need to figure out how to generate a per-version URL to the doc. I assume other folks are doing this as well - I think I've seen it in the client code, but this is the server :-)

@gchaps
Copy link
Contributor

gchaps commented Aug 24, 2021

The text and the way you have formatted looks good to me. I'd add a period after the link.

@pmuellr
Copy link
Member

pmuellr commented Aug 24, 2021

We don't yet have a docLinks service server-side, so not sure about including a link now. We could point to current - for now, until we do, or we could just name the chapter in the doc. Using the current version sounds better, we can open an issue to fix that to use a docLink, once we have docLinks server-side.

I'd add a period after the link.

Oooh, I hate this! :-). Some renderers aren't smart enough to strip these off, so would use the URL WITH THE PERIOD AT THE END. Github gets it right, when rendered in markdown: https://www.elastic.co/guide/en/kibana/7.14/task-manager-health-monitoring.html. But other UI things, like "select word" kind of UI actions will pick it up, and assume the URL includes the period, which would send them to https://www.elastic.co/guide/en/kibana/7.14/task-manager-health-monitoring.html.

I hate it because I want to include puctuation, but to be safe, have to put a space in front of it, then it looks AWFUL.

But heh, it's just a log message. I should probably include a space-period, so if they "select word" from a log viewer, it would select the URL since it will be space separated, since if it's at the end of a string, it could have a " at the end of the URL (same problem as with ., but different char).

pmuellr added a commit that referenced this issue Sep 1, 2021
…ce is degraded (#109741)

resolves #109095
resolves #106854

Changes the way task manager and alerting perform their health / status
checks:

- no longer sets an `unavailable` status; now uses `degraded` instead
- change task manager "hot stats freshness" calculation to allow for
  staler data before signalling a problem
- Changed the "Detected potential performance issue" message to sound
  less scary, include a doc link to task manager health monitoring, and
  log a debug instead of warning level
- add additional debug logging when task manager sets a status that's
  not `available`, indicating why it's setting that status (in the code,
  it's when task manager uses HealthStatus.Warning or Error)
pmuellr added a commit to pmuellr/kibana that referenced this issue Sep 1, 2021
…ce is degraded (elastic#109741)

resolves elastic#109095
resolves elastic#106854

Changes the way task manager and alerting perform their health / status
checks:

- no longer sets an `unavailable` status; now uses `degraded` instead
- change task manager "hot stats freshness" calculation to allow for
  staler data before signalling a problem
- Changed the "Detected potential performance issue" message to sound
  less scary, include a doc link to task manager health monitoring, and
  log a debug instead of warning level
- add additional debug logging when task manager sets a status that's
  not `available`, indicating why it's setting that status (in the code,
  it's when task manager uses HealthStatus.Warning or Error)
pmuellr added a commit to pmuellr/kibana that referenced this issue Sep 1, 2021
…ce is degraded (elastic#109741)

resolves elastic#109095
resolves elastic#106854

Changes the way task manager and alerting perform their health / status
checks:

- no longer sets an `unavailable` status; now uses `degraded` instead
- change task manager "hot stats freshness" calculation to allow for
  staler data before signalling a problem
- Changed the "Detected potential performance issue" message to sound
  less scary, include a doc link to task manager health monitoring, and
  log a debug instead of warning level
- add additional debug logging when task manager sets a status that's
  not `available`, indicating why it's setting that status (in the code,
  it's when task manager uses HealthStatus.Warning or Error)
pmuellr added a commit to pmuellr/kibana that referenced this issue Sep 1, 2021
…ce is degraded (elastic#109741)

resolves elastic#109095
resolves elastic#106854

Changes the way task manager and alerting perform their health / status
checks:

- no longer sets an `unavailable` status; now uses `degraded` instead
- change task manager "hot stats freshness" calculation to allow for
  staler data before signalling a problem
- Changed the "Detected potential performance issue" message to sound
  less scary, include a doc link to task manager health monitoring, and
  log a debug instead of warning level
- add additional debug logging when task manager sets a status that's
  not `available`, indicating why it's setting that status (in the code,
  it's when task manager uses HealthStatus.Warning or Error)

# Conflicts:
#	x-pack/plugins/task_manager/server/monitoring/capacity_estimation.ts
#	x-pack/plugins/task_manager/server/monitoring/task_run_statistics.test.ts
#	x-pack/plugins/task_manager/server/routes/health.test.ts
pmuellr added a commit that referenced this issue Sep 1, 2021
…ce is degraded (#109741) (#110870)

resolves #109095
resolves #106854

Changes the way task manager and alerting perform their health / status
checks:

- no longer sets an `unavailable` status; now uses `degraded` instead
- change task manager "hot stats freshness" calculation to allow for
  staler data before signalling a problem
- Changed the "Detected potential performance issue" message to sound
  less scary, include a doc link to task manager health monitoring, and
  log a debug instead of warning level
- add additional debug logging when task manager sets a status that's
  not `available`, indicating why it's setting that status (in the code,
  it's when task manager uses HealthStatus.Warning or Error)
pmuellr added a commit that referenced this issue Sep 1, 2021
…ce is degraded (#109741) (#110869)

resolves #109095
resolves #106854

Changes the way task manager and alerting perform their health / status
checks:

- no longer sets an `unavailable` status; now uses `degraded` instead
- change task manager "hot stats freshness" calculation to allow for
  staler data before signalling a problem
- Changed the "Detected potential performance issue" message to sound
  less scary, include a doc link to task manager health monitoring, and
  log a debug instead of warning level
- add additional debug logging when task manager sets a status that's
  not `available`, indicating why it's setting that status (in the code,
  it's when task manager uses HealthStatus.Warning or Error)
pmuellr added a commit that referenced this issue Sep 1, 2021
…rformance is degraded (#109741) (#110875)

* [task manager] provide better diagnostics when task manager performance is degraded (#109741)

resolves #109095
resolves #106854

Changes the way task manager and alerting perform their health / status
checks:

- no longer sets an `unavailable` status; now uses `degraded` instead
- change task manager "hot stats freshness" calculation to allow for
  staler data before signalling a problem
- Changed the "Detected potential performance issue" message to sound
  less scary, include a doc link to task manager health monitoring, and
  log a debug instead of warning level
- add additional debug logging when task manager sets a status that's
  not `available`, indicating why it's setting that status (in the code,
  it's when task manager uses HealthStatus.Warning or Error)

# Conflicts:
#	x-pack/plugins/task_manager/server/monitoring/capacity_estimation.ts
#	x-pack/plugins/task_manager/server/monitoring/task_run_statistics.test.ts
#	x-pack/plugins/task_manager/server/routes/health.test.ts

* fix backport to remove post-7.14 stuff
@gmmorris gmmorris removed the loe:needs-research This issue requires some research before it can be worked on or estimated label Sep 2, 2021
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
estimate:needs-research Estimated as too large and requires research to break down into workable issues Feature:Task Manager insight Issues related to user insight into platform operations and resilience Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants