[Fleet] [Security Solution] Fleet cannot upgrade: Transform stop times out #91570

Closed
pzl opened this issue Feb 16, 2021 · 19 comments
Labels
bug Fixes for quality problems that affect the customer experience impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. Team:Defend Workflows “EDR Workflows” sub-team of Security Solution Team:Fleet Team label for Observability Data Collection Fleet team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v7.11.1 v7.12.0

Comments

@pzl
Member

pzl commented Feb 16, 2021

Describe the bug:

Fleet page presents error box: "Unable to initialize Fleet - An internal server error ocurred"

Kibana/Elasticsearch Stack version:
7.10 -> 7.11 Upgrade

may also be present on package upgrades staying within the same minor (e.g. 7.11)

Steps to reproduce:

  1. Explicitly define node roles in a 7.10 cluster (without any transform roles)
  2. Upgrade to 7.11
  3. Visit the Fleet page in Kibana to trigger the package upgrade
  4. See Timeouts in log lines (below)

Alternatively, you can spin up a 7.11 cluster and define node roles without including a transform node, although this is an incorrect configuration according to the docs. Omitting the transform role was allowed in 7.10 (the role may not have existed then).

Current behavior:

Fleet error box visible on Fleet page

Expected behavior:

No error box

  • a more robust rollback perhaps
  • or handling the timeout in a way that doesn't leave a blocked state

Screenshots (if relevant):

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

{"type":"log","@timestamp":"2021-02-15T22:23:57+01:00","tags":["info","plugins","fleet"],"pid":6159,"message":"Found previous transform references:\n [{\"id\":\"endpoint.metadata_current-default-0.16.1\",\"type\":\"transform\"}]"}
{"type":"log","@timestamp":"2021-02-15T22:23:57+01:00","tags":["info","plugins","fleet"],"pid":6159,"message":"Deleting currently installed transform ids endpoint.metadata_current-default-0.16.1"}
{"type":"log","@timestamp":"2021-02-15T22:24:27+01:00","tags":["error","plugins","fleet"],"pid":6159,"message":"Request Timeout after 30000ms"}
{"type":"log","@timestamp":"2021-02-15T22:24:27+01:00","tags":["error","http"],"pid":6159,"message":"Error: options.statusCode is expected to be set. given options: undefined\n    at Object.customError (/usr/share/kibana/src/core/server/http/router/response.js:136:13)\n    at defaultIngestErrorHandler (/usr/share/kibana/x-pack/plugins/fleet/server/errors/handlers.js:117:19)\n    at FleetSetupHandler (/usr/share/kibana/x-pack/plugins/fleet/server/routes/setup/handlers.js:111:50)\n    at processTicksAndRejections (internal/process/task_queues.js:93:5)\n    at Router.handle (/usr/share/kibana/src/core/server/http/router/router.js:163:30)\n    at handler (/usr/share/kibana/src/core/server/http/router/router.js:124:50)\n    at module.exports.internals.Manager.execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/toolkit.js:45:28)\n    at Object.internals.handler (/usr/share/kibana/node_modules/@hapi/hapi/lib/handler.js:46:20)\n    at exports.execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/handler.js:31:20)\n    at Request._lifecycle (/usr/share/kibana/node_modules/@hapi/hapi/lib/request.js:312:32)\n    at Request._execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/request.js:221:9)"}
{"type":"error","@timestamp":"2021-02-15T22:23:47+01:00","tags":[],"pid":6159,"level":"error","error":{"message":"Internal Server Error","name":"Error","stack":"Error: Internal Server Error\n    at HapiResponseAdapter.toInternalError (/usr/share/kibana/src/core/server/http/router/response_adapter.js:58:19)\n    at Router.handle (/usr/share/kibana/src/core/server/http/router/router.js:177:34)\n    at processTicksAndRejections (internal/process/task_queues.js:93:5)\n    at handler (/usr/share/kibana/src/core/server/http/router/router.js:124:50)\n    at module.exports.internals.Manager.execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/toolkit.js:45:28)\n    at Object.internals.handler (/usr/share/kibana/node_modules/@hapi/hapi/lib/handler.js:46:20)\n    at exports.execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/handler.js:31:20)\n    at Request._lifecycle (/usr/share/kibana/node_modules/@hapi/hapi/lib/request.js:312:32)\n    at Request._execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/request.js:221:9)"},"url":"https://kib01.tld.local:5601/api/fleet/setup","message":"Internal Server Error"}

Any additional context (logs, chat logs, magical formulas, etc.):

Workarounds

Please ensure at least one node in your cluster has the "transform" role.

You can view nodes and their roles with GET /_nodes. Make sure at least one node is configured as a transform node.
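As a rough illustration of the check described above, the sketch below scans a GET /_nodes response body for nodes that carry the "transform" role. The function name `findTransformNodes`, the trimmed-down response type, and the sample data are assumptions made for this example; they are not taken from the Fleet or Elasticsearch code.

```typescript
// Minimal shape of the parts of a GET /_nodes response used here.
// Real responses contain many more fields per node.
interface NodesInfoResponse {
  nodes: { [nodeId: string]: { name: string; roles?: string[] } };
}

// Returns the names of all nodes that list the "transform" role.
function findTransformNodes(response: NodesInfoResponse): string[] {
  const names: string[] = [];
  for (const nodeId of Object.keys(response.nodes)) {
    const node = response.nodes[nodeId];
    const roles = node.roles || [];
    if (roles.indexOf("transform") !== -1) {
      names.push(node.name);
    }
  }
  return names;
}

// Hypothetical sample data, not taken from a real cluster.
const sampleResponse: NodesInfoResponse = {
  nodes: {
    "node-id-1": { name: "es01", roles: ["master", "data", "ingest"] },
    "node-id-2": { name: "es02", roles: ["data", "transform"] },
  },
};

const transformNodes = findTransformNodes(sampleResponse);
if (transformNodes.length === 0) {
  console.log("No transform node configured");
} else {
  console.log("Transform nodes: " + transformNodes.join(", "));
}
```

An empty result would correspond to the misconfiguration described in this issue.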

@pzl pzl added bug Fixes for quality problems that affect the customer experience Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. labels Feb 16, 2021
@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@kevinlog kevinlog added the Team:Defend Workflows “EDR Workflows” sub-team of Security Solution label Feb 16, 2021
@elasticmachine
Contributor

Pinging @elastic/security-onboarding-and-lifecycle-mgt (Team:Onboarding and Lifecycle Mgt)

@kevinlog kevinlog added the Team:Fleet Team label for Observability Data Collection Fleet team label Feb 16, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@skh
Contributor

skh commented Feb 17, 2021

I agree that the error shouldn't block the Fleet UI, but I would also be interested to learn how to clean up the problematic transform, especially to have a workaround for already affected customers.

@pzl
Member Author

pzl commented Feb 17, 2021

It appears that these are symptoms of an underlying problem: the cluster has no transform nodes.

If that is how a cluster is configured, it is a show-stopper for Security Solution, and likely Fleet. So we may still end up with this situation leading to a blocking box, but hopefully we can make it more informative about which requirements must be met (at least one transform node).

Currently, we may need some reworked error messaging lower in the stack before we can handle that properly from Fleet code.

@kevinlog
Contributor

Thanks for getting the user unblocked, @pzl. I agree with your assessment about bubbling up errors and keeping Fleet unblocked.

@ph @skh

I think we should relax this failure in the Fleet code for the transform so that Fleet remains usable for everything else. @pzl - is there any quick way to unblock Fleet in these cases? If there are specific error codes from the ES Transform API, we could ignore those like we do with 404s to get users unblocked. I think we should try to get something out in the next patch release along with some docs on the workaround.

@caitlinbetz
We're going to need some type of messaging for users to let them know if they need to configure a transform node. This only applies to users with custom configurations. Users who haven't touched their node configs directly should be OK.

I think we could put messaging both where users add the Endpoint integration and in the Admin tab. Ideally, we're able to detect the node configuration and conditionally show warnings.

@skh
Contributor

skh commented Feb 18, 2021

Looking at the slack discussion it seems that the transform errors coming back have statuses 408 and 409 -- @pzl would it be safe to ignore these, or should we check the exact error messages?

@pzl
Member Author

pzl commented Feb 18, 2021

We cannot ignore those error codes wholesale. There are legitimate cases where we want to bubble up stuck transform states that return a 409. Say we continue to trigger transform race conditions, and the transform exists but its internal configuration does not (a state we have triggered before). That would show up as a 409 with, actually, the exact same error text. There are a few other ways to end up with those errors that we need to keep surfacing (otherwise endpoint data will silently fall on the floor).

I don't yet see a good identifier to know that these are the benign errors, but I can keep looking. Otherwise we perform a check for the root cause: query the node definitions and see if there is no transform node.
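To make the root-cause approach concrete, here is a minimal sketch of how a 408/409 could be treated as the "missing transform node" case only when a separate node-role check confirms that no transform node exists, with every other case still surfaced. The function `classifyTransformError`, its parameters, and the action names are hypothetical and do not reflect what Fleet actually implemented.

```typescript
// Possible outcomes for a failed transform stop/delete request (hypothetical names).
type TransformErrorAction = "report-missing-transform-node" | "surface-error";

// Decides how to treat a failed transform request.
// hasTransformNode is assumed to come from a separate query of the node definitions.
function classifyTransformError(
  statusCode: number,
  hasTransformNode: boolean
): TransformErrorAction {
  const isTimeoutOrConflict = statusCode === 408 || statusCode === 409;
  if (isTimeoutOrConflict && !hasTransformNode) {
    // Root cause confirmed: the cluster cannot run transforms at all.
    return "report-missing-transform-node";
  }
  // Everything else, including a 409 on a correctly configured cluster,
  // keeps bubbling up so stuck transforms are not silently ignored.
  return "surface-error";
}

console.log(classifyTransformError(409, false));
console.log(classifyTransformError(409, true));
```

The point of the design is that the status code alone never suppresses an error; only the confirmed absence of a transform node does.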

@ph
Contributor

ph commented Feb 18, 2021

@kevinlog Agree we should relax this behavior here, @skh is looking into this from our side but let's collaborate to define the behavior.

@kevinlog
Contributor

kevinlog commented Feb 18, 2021

After a chat with @pzl and the team:

Possible Actions:

  • The ML team is working on a specific error for this case (maybe 7.11.2?)
  • Fleet/Endpoint will detect the error in Fleet and not lock it up (maybe 7.11.2?)
  • The Endpoint admin page will display an error/info box telling the user that a transform node needs to be configured (maybe 7.11.2?)
  • The Add Endpoint Integration flow in Fleet can display the need to configure a transform node (maybe 7.11.2?)

@ph
Contributor

ph commented Feb 18, 2021

@kevinlog We are looking for 7.11.2, but we haven't confirmed the release yet.

@hendrikmuhs

hendrikmuhs commented Feb 18, 2021

I am looking at this on the transform side as we speak and am targeting a fix asap (7.11.2).

In my opinion, the timeout error we see in 7.11 should not get a workaround; instead, transform should answer the request properly.

That said, it of course makes sense to harden fleet <-> transform for other error conditions.

@kevinlog
Contributor

We are looking for 7.11.2, but we haven't confirmed the release yet.

@ph no problem, I edited my comment above. Let's figure out what's possible.

@hendrikmuhs

I did some investigation and created the following upstream issue in ES:

elastic/elasticsearch#69260

We plan to fix transform:

  • operational: stabilize the APIs to answer with proper error messages
  • ux: better user feedback if they try to use transform without having the corresponding role

UX on the Fleet side, e.g. warning the user if they try to provision Fleet without a transform node, needs to be done by Fleet.

@EricDavisX
Contributor

@kevinlog @ph can we confirm the specs for what is implemented in 7.11.2? The test team need not be involved if we have adequate automation, of course, but let us know if we want regression coverage or new tests written for specific cases. And can we update the label if nothing new went in for 7.11.2 from the Kibana / Fleet / OLM side?

@kevinlog
Contributor

kevinlog commented Mar 2, 2021

@EricDavisX for 7.11.2, there has only been some better error handling when users hit this case from the ES side. Also the API will be able to stop and delete a transform even if the user doesn't have the correct node roles. @hendrikmuhs can give more here.

There have been no Kibana changes yet.

There is a larger issue for better error handling in Fleet: #91864
I believe the above is targeted for 7.13, @ph could give more updates

The expectation here is that users will need to have the correct node roles to use the transform. This will not change. Our future work is to make this clearer in the UI when cases like this come up.

So for this, there isn't much to test from the Kibana side.

@EricDavisX EricDavisX added v7.11.1 and removed v7.11.2 labels Mar 2, 2021
@EricDavisX
Contributor

Thanks - I'll update our test plans accordingly, since nothing was checked in on the Kibana side for 7.11.2.

@hendrikmuhs

For 7.13:

elastic/elasticsearch#70139 introduces extra checks and warnings at the API level for 7.13. This might save time and effort for the Fleet integration: whether or not you have looked into implementing your own checks, you can make use of the transform APIs. For example, _stats will return a warning as part of the headers.

Last but not least: this is to inform you about potential side effects, although to the best of my knowledge headers are ignored.
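For anyone who wants to read such warnings rather than ignore them, the sketch below pulls the quoted text out of HTTP `Warning` header values, which follow the `warn-code warn-agent "warn-text"` layout. It assumes the transform warnings arrive in the standard `Warning` header; the function name `extractWarningTexts` and the sample header value are made up for illustration, and the real warning text is not given in this thread.

```typescript
// Extracts the quoted warn-text from each HTTP Warning header value.
// Values that do not match the expected layout are skipped.
function extractWarningTexts(headerValues: string[]): string[] {
  const texts: string[] = [];
  for (const value of headerValues) {
    // warn-code (3 digits), warn-agent (no spaces), then a quoted string.
    const match = /^\d{3} \S+ "((?:[^"\\]|\\.)*)"/.exec(value);
    if (match) {
      // Undo backslash escaping inside the quoted string.
      texts.push(match[1].replace(/\\(.)/g, "$1"));
    }
  }
  return texts;
}

// Hypothetical header value; the actual agent string and message will differ.
const sampleHeaders = ['299 Elasticsearch-7.13.0 "hypothetical transform warning"'];
console.log(extractWarningTexts(sampleHeaders));
```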

@kevinlog
Contributor

Addressed this with docs and finer-grained error handling.

#91864
elastic/security-docs#608


8 participants