/api/stats endpoint responds with 503 if a plugin is 'critical' #169788
Comments
Pinging @elastic/kibana-core (Team:Core)
IMO, I don't think this is an issue exclusive to /api/stats. That error comes from the authentication service: kibana/x-pack/plugins/security/server/authentication/authentication_service.ts Lines 127 to 134 in 987a850
The license becomes not available because, when we fail to fetch the license, it emits a new Errored License: kibana/x-pack/plugins/licensing/server/plugin.ts Lines 181 to 216 in 9e940b5
I wonder if we should handle the errors and reuse the cached license for some ES errors (e.g. network glitches). WDYT?
Correct, this is common to most (only authenticated? all? not sure) routes. If the security plugin can't fetch the license for any reason, then authc can't be performed and requests fail with a 503. This is done so many layers below the actual route handler that I'm not really sure what the best way to improve this would be (or if we even want/need to).
The only way I think we can improve this is by improving the error handler in the license fetcher: kibana/x-pack/plugins/licensing/server/plugin.ts Lines 204 to 216 in 9e940b5
If Kibana or Elasticsearch is overloaded and the license refresh fails, we might want to reuse the cached license. Maybe a retry count before flagging the license as failed could help? Separate finding: did you know that we force-refresh the license whenever we reply >= 400? kibana/x-pack/plugins/licensing/server/on_pre_response_handler.ts Lines 17 to 25 in 3730dd0
That feels so wrong on many levels!
It could be a good quick win to avoid sporadic problems due to connectivity issues or such. We don't want to re-use the cached value for a long period and should still throw at some point though, otherwise we're fully changing our behavior.
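A rough sketch of the retry-count idea, in TypeScript: keep the last successfully fetched license, keep serving it (flagged as stale) for a bounded number of consecutive failures, and only then report the license as unavailable. The names here (`TolerantLicenseState`, `FetchOutcome`, `maxFailures`) are made up for illustration and are not the licensing plugin's actual API.

```typescript
// Hypothetical types for illustration; not the real Kibana licensing types.
interface License {
  status: 'active' | 'expired';
  fetchedAt: number;
}

type FetchOutcome =
  | { ok: true; license: License }
  | { ok: false; error: string };

type LicenseState =
  | { kind: 'available'; license: License; stale: boolean }
  | { kind: 'unavailable'; error: string };

class TolerantLicenseState {
  private lastGood: License | undefined = undefined;
  private consecutiveFailures = 0;
  private readonly maxFailures: number;

  constructor(maxFailures: number) {
    this.maxFailures = maxFailures;
  }

  // Fold the outcome of one fetch attempt into the current state.
  update(outcome: FetchOutcome): LicenseState {
    if (outcome.ok) {
      // A successful fetch replaces the cache and resets the failure counter.
      this.lastGood = outcome.license;
      this.consecutiveFailures = 0;
      return { kind: 'available', license: outcome.license, stale: false };
    }
    this.consecutiveFailures += 1;
    // Reuse the cached license only for a bounded number of failures in a row.
    if (this.lastGood !== undefined && this.consecutiveFailures <= this.maxFailures) {
      return { kind: 'available', license: this.lastGood, stale: true };
    }
    // Too many failures, or nothing cached yet: surface the error as before.
    return { kind: 'unavailable', error: outcome.error };
  }
}
```

With `maxFailures` set to 2, two failed refreshes in a row would still return the cached license, and the third would flip the state to unavailable, so the existing "throw at some point" behavior is preserved.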
I probably knew at some point, and my brain likely decided it was better for my sanity to delete the information (and I can't blame it). Yeah, it feels somewhat wrong (and the associated issue introducing that behavior is from 6 years ago). The interval-based refresh might be sufficient. Or maybe we can adapt this Note, we're using kibana/x-pack/plugins/licensing/common/license_update.ts Lines 32 to 36 in b3a67b9
We will be queuing a lot of refresh requests though, which I agree should ideally be avoided.
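For what it's worth, one generic way to avoid piling up refresh requests is to coalesce concurrent callers onto a single in-flight fetch. This is only a sketch in plain TypeScript, not how the licensing plugin's RxJS pipeline works; `coalesce` and the fetcher passed to it are hypothetical names.

```typescript
// Hypothetical helper: wraps an async fetcher so that concurrent calls
// share one in-flight request instead of each triggering a new one.
function coalesce<T>(fetcher: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | undefined;
  return () => {
    if (inFlight !== undefined) {
      // A refresh is already running: reuse it.
      return inFlight;
    }
    const pending = fetcher().finally(() => {
      // Once settled (success or failure), allow the next refresh to start.
      inFlight = undefined;
    });
    inFlight = pending;
    return pending;
  };
}
```

A burst of responses with status >= 400 would then trigger at most one license fetch at a time, rather than one per response.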
I think this wouldn't help solving the related issue, as it could mean showing Kibana as "healthy" while it is not. Regarding the RxJS refresh pipeline, perhaps we could add a |
IMO, if there's a network issue when connecting to ES, the
That sounds like a good compromise, IMO.
I agree that it wouldn't directly help for the issue. Now, I found that issue (for totally unrelated reasons, from a SDH) but it looks like it could help with some other problems: From the linked issue:
If it doesn't fix the issue, that's probably still an improvement we may want to perform. I'll take a quick look, given it seems like a fairly quick win that could help in those edge cases.
## Summary
Related to #169788
Fix #117394
Co-authored-by: kibanamachine
#170006 has been merged, and I don't think we want to go further on this, so I'll consider it done.
Kibana version:
Fails on current main
Steps to reproduce:
/api/stats endpoint
Problems it causes:
If/when this endpoint fails, Metricbeat is not able to properly collect Kibana status.
NB the information collected by Metricbeat is used from Stack Management UI to show Kibana status.
Thus, SM UI will rely on potentially old information for some time (up to 2 minutes).
Stale: I believe this explains issues such as https://github.com/elastic/sdh-kibana/issues/4194
Proposed solution:
Make the /api/stats endpoint more resilient to degraded plugins, and allow it to correctly report the appropriate status.