[Fleet] [Security Solution] Fleet cannot upgrade: Transform stop times out #91570
Comments
Pinging @elastic/security-solution (Team: SecuritySolution) |
Pinging @elastic/security-onboarding-and-lifecycle-mgt (Team:Onboarding and Lifecycle Mgt) |
Pinging @elastic/fleet (Team:Fleet) |
I agree that the error shouldn't block the Fleet UI, but I would also be interested to learn how to clean up the problematic transform, especially to have a workaround for already affected customers. |
These appear to be symptoms of one underlying problem: the cluster has no transform nodes. If a cluster is configured that way, it is a show-stopper for Security Solution, and likely for Fleet. So this situation may still end in a blocking error box, but hopefully we can make that box more informative about which requirements must be met (at least one transform node). Currently, we may need some reworked error messaging lower in the stack before we can handle that properly from Fleet code. |
Thanks for getting the user unblocked, @pzl. I agree with your assessment about bubbling up errors and keeping Fleet unblocked. I think we should relax this failure in the Fleet code for the transform so that Fleet remains usable for everything else. @pzl - is there any quick way to unblock Fleet in these cases? Are there specific error codes from the ES Transform API? We could ignore those, like we do with 404s, to get users unblocked. I think we should try to get something out in the next patch release, along with some docs on the workaround. @caitlinbetz I think we could put messaging both where users add the Endpoint integration and in the Admin tab. Ideally, we're able to detect the node configuration and conditionally show warnings. |
Looking at the slack discussion it seems that the transform errors coming back have statuses |
We cannot wholesale ignore those error codes. There are legitimate cases where we want to bubble up stuck transform states that return 409. Say we continue to trigger transform race conditions, and the transform exists but its internal configuration does not (a state we have triggered before). That would show up as a 409 with, actually, the exact same error text. There are a few other ways to end up with those errors that we need to keep surfacing (otherwise endpoint data will silently fall on the floor). I don't yet see a good identifier that tells us these are the benign errors, but I can keep looking. Otherwise, we perform a check for the root cause: query the node definitions and see whether there is no transform node. |
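The approach described in the comment above could be sketched roughly as follows. This is only an illustration, not Kibana or Fleet code: the helper name `classify_transform_error`, the `Action` values, and the choice to treat only 409 as the "possibly benign" status are assumptions made for the example. The result of the node-role check is passed in as a boolean.

```python
from enum import Enum
from typing import FrozenSet


class Action(Enum):
    IGNORE = "ignore"                 # continue Fleet setup silently
    WARN_NO_TRANSFORM_NODE = "warn"   # keep Fleet usable, tell the user what to fix
    RAISE = "raise"                   # surface the error as before


def classify_transform_error(
    status: int,
    has_transform_node: bool,
    maybe_benign: FrozenSet[int] = frozenset({409}),  # assumption for this sketch
) -> Action:
    """Decide how Fleet setup could react to a failed transform request.

    Hypothetical helper. `status` is the HTTP status returned by the
    transform API; `has_transform_node` is the outcome of a separate
    check of the cluster's node roles.
    """
    if status == 404:
        # The transform does not exist; the thread says 404s are already ignored.
        return Action.IGNORE
    if status in maybe_benign and not has_transform_node:
        # Root cause is cluster configuration rather than a stuck transform.
        return Action.WARN_NO_TRANSFORM_NODE
    # Everything else, including a 409 on a correctly configured cluster,
    # may be a genuinely stuck transform and has to stay visible.
    return Action.RAISE
```

The point of the design is that the status code alone never downgrades an error; only the combination of a known status and a confirmed missing transform node does.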
After a chat with @pzl and the team: Possible Actions:
|
@kevinlog We are looking for 7.11.2, but we haven't confirmed the release yet. |
I am looking at this on the transform side as we speak and am targeting a fix asap (7.11.2). In my opinion, the timeout error we see in 7.11 should not get a workaround; instead, transform will answer the request properly. That said, it of course makes sense to harden Fleet <-> transform for other error conditions. |
@ph no problem, I edited my comment above. Let's figure out what's possible. |
I did some investigations and created the following upstream issue in ES: We plan to fix transform:
UX on the Fleet side, e.g. warning users who try to provision Fleet without a transform node, needs to be done by Fleet. |
@kevinlog @ph can we confirm specs on what is implemented for 7.11.2? The test team need not be involved if we have adequate automation, of course, but let us know if we want regression coverage or new tests written for specific scenarios. And can we update the label if nothing new went in for 7.11.2 from the Kibana / Fleet / OLM side? |
@EricDavisX for 7.11.2, there has only been some better error handling on the ES side when users hit this case. Also, the API will be able to stop and delete a transform even if the user doesn't have the correct node roles. @hendrikmuhs can give more here. There have been no Kibana changes yet. There is a larger issue for better error handling in Fleet: #91864 The expectation here is that users will need to have the correct node roles to use the transform. This will not change. Our future work is to make this clearer in the UI when cases like this come up. So for this, there isn't much to test from the Kibana side. |
Thanks - I'll update our test plans accordingly, as nothing was checked in to Kibana side for it for 7.11.2 |
For 7.13: elastic/elasticsearch#70139 introduces extra checks and warnings on the API level for LBNL This is to inform you about potential side effects although best to my knowledge headers are ignored. |
addressed this with docs and finer grained error handling. |
Describe the bug:
Fleet page presents error box:
"Unable to initialize Fleet - An internal server error occurred"
Kibana/Elasticsearch Stack version:
7.10 -> 7.11 Upgrade
may also be present on package upgrades staying within the same minor (e.g. 7.11)
Steps to reproduce:
Alternatively, you can just spin up a 7.11, and define roles without including a transform node, but this would be incorrect configuration according to the docs. Omitting a transform role in 7.10 was allowed (It may not have existed as a role then).
Current behavior:
Fleet error box visible on Fleet page
Expected behavior:
No error box
Screenshots (if relevant):
Errors in browser console (if relevant):
Provide logs and/or server output (if relevant):
Any additional context (logs, chat logs, magical formulas, etc.):
Workarounds
Please ensure at least one node in your cluster has the "transform" role. You can view nodes and roles with GET /_nodes. Make sure one is configured as a transform node.
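As a rough illustration of the check in the workaround, the sketch below scans a parsed `GET /_nodes` response for nodes that carry a given role. The function name `nodes_with_role` and the sample node IDs and names are made up for the example; the only thing taken from the API is that each entry under `nodes` lists its `roles`.

```python
from typing import Any, Dict, List


def nodes_with_role(nodes_info: Dict[str, Any], role: str = "transform") -> List[str]:
    """Return the names of nodes whose `roles` list contains `role`.

    `nodes_info` is the parsed JSON body of `GET /_nodes`.
    """
    names = []
    for node_id, node in nodes_info.get("nodes", {}).items():
        if role in node.get("roles", []):
            # Fall back to the node ID if the entry has no name.
            names.append(node.get("name", node_id))
    return sorted(names)


# Illustrative response, trimmed to the fields used here (made-up nodes).
sample = {
    "nodes": {
        "id-1": {"name": "node-a", "roles": ["master", "data", "ingest"]},
        "id-2": {"name": "node-b", "roles": ["data", "transform"]},
    }
}

transform_nodes = nodes_with_role(sample)
if not transform_nodes:
    print("No transform node found: add the transform role to at least one node.")
else:
    print("Transform nodes:", ", ".join(transform_nodes))
```

An empty result means the cluster matches the misconfiguration described in this issue.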