[Fleet] [Security Solution] Fleet cannot upgrade: Transform stop times out #91570

pzl · 2021-02-16T21:42:18Z

Describe the bug:

Fleet page presents error box: "Unable to initialize Fleet - An internal server error occurred"

Kibana/Elasticsearch Stack version:
7.10 -> 7.11 Upgrade

May also occur on package upgrades that stay within the same minor (e.g. 7.11).

Steps to reproduce:

1. Explicitly define node roles in a 7.10 cluster (without any transform roles)
2. Upgrade to 7.11
3. Visit the Fleet page in Kibana to trigger the package upgrade
4. See timeouts in the log lines (below)

Alternatively, you can spin up a 7.11 cluster and define roles without including a transform node, although according to the docs this is an incorrect configuration. Omitting the transform role was allowed in 7.10 (it may not have existed as a role then).

Current behavior:

Fleet error box visible on Fleet page

Expected behavior:

No error box

Perhaps a more robust rollback, or handling the timeout in a way that doesn't leave Fleet in a blocked state.

Screenshots (if relevant):

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

{"type":"log","@timestamp":"2021-02-15T22:23:57+01:00","tags":["info","plugins","fleet"],"pid":6159,"message":"Found previous transform references:\n [{\"id\":\"endpoint.metadata_current-default-0.16.1\",\"type\":\"transform\"}]"}
{"type":"log","@timestamp":"2021-02-15T22:23:57+01:00","tags":["info","plugins","fleet"],"pid":6159,"message":"Deleting currently installed transform ids endpoint.metadata_current-default-0.16.1"}

{"type":"log","@timestamp":"2021-02-15T22:24:27+01:00","tags":["error","plugins","fleet"],"pid":6159,"message":"Request Timeout after 30000ms"}
{"type":"log","@timestamp":"2021-02-15T22:24:27+01:00","tags":["error","http"],"pid":6159,"message":"Error: options.statusCode is expected to be set. given options: undefined\n    at Object.customError (/usr/share/kibana/src/core/server/http/router/response.js:136:13)\n    at defaultIngestErrorHandler (/usr/share/kibana/x-pack/plugins/fleet/server/errors/handlers.js:117:19)\n    at FleetSetupHandler (/usr/share/kibana/x-pack/plugins/fleet/server/routes/setup/handlers.js:111:50)\n    at processTicksAndRejections (internal/process/task_queues.js:93:5)\n    at Router.handle (/usr/share/kibana/src/core/server/http/router/router.js:163:30)\n    at handler (/usr/share/kibana/src/core/server/http/router/router.js:124:50)\n    at module.exports.internals.Manager.execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/toolkit.js:45:28)\n    at Object.internals.handler (/usr/share/kibana/node_modules/@hapi/hapi/lib/handler.js:46:20)\n    at exports.execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/handler.js:31:20)\n    at Request._lifecycle (/usr/share/kibana/node_modules/@hapi/hapi/lib/request.js:312:32)\n    at Request._execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/request.js:221:9)"}
{"type":"error","@timestamp":"2021-02-15T22:23:47+01:00","tags":[],"pid":6159,"level":"error","error":{"message":"Internal Server Error","name":"Error","stack":"Error: Internal Server Error\n    at HapiResponseAdapter.toInternalError (/usr/share/kibana/src/core/server/http/router/response_adapter.js:58:19)\n    at Router.handle (/usr/share/kibana/src/core/server/http/router/router.js:177:34)\n    at processTicksAndRejections (internal/process/task_queues.js:93:5)\n    at handler (/usr/share/kibana/src/core/server/http/router/router.js:124:50)\n    at module.exports.internals.Manager.execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/toolkit.js:45:28)\n    at Object.internals.handler (/usr/share/kibana/node_modules/@hapi/hapi/lib/handler.js:46:20)\n    at exports.execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/handler.js:31:20)\n    at Request._lifecycle (/usr/share/kibana/node_modules/@hapi/hapi/lib/request.js:312:32)\n    at Request._execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/request.js:221:9)"},"url":"https://kib01.tld.local:5601/api/fleet/setup","message":"Internal Server Error"}
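To pick the relevant entries out of a longer Kibana log, something like the following may help. It is a minimal sketch that assumes the JSON-lines format shown above (one object per line with `tags` and `message` fields); the function name `fleet_errors` is made up for illustration.

```python
import json


def fleet_errors(log_lines):
    """Return the messages of entries tagged with both 'error' and 'fleet'."""
    found = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank separators between entries
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue  # a valid JSON value that is not a log object
        tags = entry.get("tags", [])
        if "error" in tags and "fleet" in tags:
            found.append(entry.get("message", ""))
    return found


# Two of the log lines from this report, plus a blank separator.
sample = [
    '{"type":"log","@timestamp":"2021-02-15T22:23:57+01:00","tags":["info","plugins","fleet"],"pid":6159,"message":"Deleting currently installed transform ids endpoint.metadata_current-default-0.16.1"}',
    "",
    '{"type":"log","@timestamp":"2021-02-15T22:24:27+01:00","tags":["error","plugins","fleet"],"pid":6159,"message":"Request Timeout after 30000ms"}',
]
print(fleet_errors(sample))
```

With the sample above, only the "Request Timeout after 30000ms" entry is reported; the `http`-tagged stack trace is deliberately left out because it does not carry the `fleet` tag.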

Any additional context (logs, chat logs, magical formulas, etc.):

Workarounds

Please ensure at least one node in your cluster has the "transform" role.

You can view nodes and roles with GET /_nodes. Make sure at least one node is configured as a transform node.
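The check can also be scripted. The sketch below assumes you have already fetched the GET /_nodes response body and parsed it into a dict, and that each node entry carries a `roles` list; the function name and the sample node IDs and names are hypothetical.

```python
def transform_nodes(nodes_response):
    """Return the sorted names of nodes whose roles include 'transform'."""
    nodes = nodes_response.get("nodes", {})
    return sorted(
        info.get("name", node_id)  # fall back to the node ID if unnamed
        for node_id, info in nodes.items()
        if "transform" in info.get("roles", [])
    )


# Hypothetical, trimmed-down GET /_nodes response body.
sample_nodes = {
    "nodes": {
        "aaa": {"name": "es01", "roles": ["master", "data", "ingest"]},
        "bbb": {"name": "es02", "roles": ["data", "transform"]},
    }
}

names = transform_nodes(sample_nodes)
if names:
    print("transform nodes:", ", ".join(names))
else:
    print("no transform node configured")
```

An empty result means the cluster is in the state described in this issue.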


elasticmachine · 2021-02-16T21:42:19Z

Pinging @elastic/security-solution (Team: SecuritySolution)

elasticmachine · 2021-02-16T21:43:13Z

Pinging @elastic/security-onboarding-and-lifecycle-mgt (Team:Onboarding and Lifecycle Mgt)

elasticmachine · 2021-02-16T21:43:20Z

Pinging @elastic/fleet (Team:Fleet)

skh · 2021-02-17T14:02:31Z

I agree that the error shouldn't block the Fleet UI, but I would also be interested to learn how to clean up the problematic transform, especially to have a workaround for already affected customers.

pzl · 2021-02-17T20:23:52Z

It appears that these are symptoms of one underlying problem: the cluster has no transform nodes.

If that is how a cluster is configured, it is a show-stopper for Security Solution, and likely for Fleet. So this situation may still lead to a blocking error box, but hopefully we can make it more informative about which requirements must be met (at least one transform node).

Currently, we may need some reworked error messaging lower in the stack before we can handle that properly from Fleet code.

kevinlog · 2021-02-18T13:35:35Z

Thanks for getting the user unblocked, @pzl. I agree with your assessment about bubbling up errors and keeping Fleet unblocked.

@ph @skh

I think we should relax this failure in the Fleet code for the transform so that Fleet remains usable for everything else. @pzl - is there any quick way to unblock Fleet in these cases? If there are specific error codes from the ES Transform API, we could ignore them, as we do with 404s, to get users unblocked. I think we should try to get something out in the next patch release, along with some docs on the workaround.

@caitlinbetz
We're going to need some type of messaging for users to let them know if they need to configure a transform node. This only applies to users with custom configurations. Users who haven't touched their node configs directly should be OK.

I think we could put messaging both where users add the Endpoint integration and in the Admin tab. Ideally, we can detect the node configuration and show warnings conditionally.

skh · 2021-02-18T13:57:33Z

Looking at the Slack discussion, it seems that the transform errors coming back have statuses 408 and 409. @pzl would it be safe to ignore these, or should we check the exact error messages?

pzl · 2021-02-18T14:04:59Z

We cannot wholesale ignore those error codes. There are legitimate cases where we want to bubble up stuck transform states that return 409. Say we continue to trigger transform race conditions, and the transform exists but its internal configuration does not (a state we have triggered before). That would show up as a 409 with, actually, the exact same error text. There are a few other ways to end up with those errors that we need to keep surfacing (otherwise endpoint data will silently fall on the floor).

I don't yet see a good identifier to know that these are the benign errors, but I can keep looking. Otherwise we perform a check for the root cause: query the node definitions and see if there is no transform node.
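The root-cause check described here could look roughly like the sketch below. It is not the Fleet implementation: the function name, the return labels, and the input shape (one list of roles per node) are assumptions made for illustration. The point is that a 408 or 409 is only treated as the known benign case when no node has the transform role; everything else keeps being surfaced.

```python
def classify_transform_error(status_code, node_roles):
    """Decide how to treat an error from a transform stop/delete call.

    node_roles is a list with one list of role names per cluster node.
    Returns 'missing_transform_node' for the known benign case, else 'surface'.
    """
    has_transform_node = any("transform" in roles for roles in node_roles)
    if status_code in (408, 409) and not has_transform_node:
        # Root cause is the cluster configuration, not a stuck transform.
        return "missing_transform_node"
    # Any other combination may be a real problem and must not be hidden.
    return "surface"


print(classify_transform_error(409, [["master", "data"], ["ingest"]]))
print(classify_transform_error(409, [["master", "data", "transform"]]))
```

The first call reports the missing-node case; the second has a transform node available, so the 409 is still surfaced.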

ph · 2021-02-18T15:32:19Z

@kevinlog Agreed, we should relax this behavior. @skh is looking into this from our side, but let's collaborate to define the behavior.

kevinlog · 2021-02-18T15:35:11Z

After a chat with @pzl and the team:

Possible Actions:

- ML team is working on a specific error for this case (maybe 7.11.2?)
- Fleet/Endpoint will detect the error in Fleet and not lock it up (maybe 7.11.2?)
- Endpoint admin page will display an error/info box saying that the user needs to configure a transform node (maybe 7.11.2?)
- The Add Endpoint Integration flow in Fleet can display the need to configure a transform node (maybe 7.11.2?)

ph · 2021-02-18T15:36:20Z

@kevinlog We are looking for 7.11.2, but we haven't confirmed the release yet.

hendrikmuhs · 2021-02-18T15:37:36Z

I am looking at the transform side as we speak and am targeting a fix ASAP (7.11.2).

In my opinion, the timeout error we see in 7.11 should not get a workaround; instead, transform will answer the request properly.

That said, it of course makes sense to harden Fleet <-> transform for other error conditions.

kevinlog · 2021-02-18T15:37:42Z

> We are looking for 7.11.2, but we haven't confirmed the release yet.

@ph no problem, I edited my comment above. Let's figure out what's possible.

hendrikmuhs · 2021-02-19T13:03:36Z

I did some investigations and created the following upstream issue in ES:

elastic/elasticsearch#69260

We plan to fix transform:

- operational: stabilize the APIs so that they answer with proper error messages
- UX: give better feedback when a user tries to use transform without having the corresponding role

UX on the Fleet side, e.g. warning users who try to provision Fleet without a transform node, needs to be done by Fleet.

EricDavisX · 2021-03-02T15:19:21Z

@kevinlog @ph can we confirm the specs on what is implemented for 7.11.2? The test team need not be involved if we have adequate automation, of course, but let us know if you want regression coverage or new tests written for specific scenarios. And can we update the label if nothing new went in for 7.11.2 from the Kibana / Fleet / OLM side?

kevinlog · 2021-03-02T16:08:15Z

@EricDavisX for 7.11.2, there has only been some better error handling on the ES side for when users hit this case. Also, the API will be able to stop and delete a transform even if the user doesn't have the correct node roles. @hendrikmuhs can give more detail here.

There have been no Kibana changes yet.

There is a larger issue for better error handling in Fleet: #91864
I believe the above is targeted for 7.13, @ph could give more updates

The expectation here is that users will need to have the correct node roles to use the transform. This will not change. Our future work is to make this clearer in the UI when cases like this come up.

So for this, there isn't much to test from the Kibana side.

EricDavisX · 2021-03-02T20:00:54Z

Thanks - I'll update our test plans accordingly, since nothing was checked in on the Kibana side for 7.11.2.

hendrikmuhs · 2021-03-10T09:24:42Z

For 7.13:

elastic/elasticsearch#70139 introduces extra checks and warnings at the API level for 7.13. This might save time and effort for the Fleet integration: whether or not you have looked into implementing your own checks, you can make use of the transform APIs. E.g. _stats will return a warning as part of the response headers.

Last but not least: this is to inform you about potential side effects, although to the best of my knowledge headers are ignored.
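For anyone who does want to read those headers, here is a minimal sketch. It assumes the warning arrives in the standard HTTP `Warning` header using the `299 <agent> "<text>"` layout that Elasticsearch uses for its other warnings; the exact wording of the transform warning is not given in this thread, so the sample text below is made up, as is the function name.

```python
import re

# Matches: 299 <agent> "<text>", allowing backslash-escaped characters in the text.
_WARNING_RE = re.compile(r'299 \S+ "((?:[^"\\]|\\.)*)"')


def warning_messages(headers):
    """Extract the quoted warning texts from a response's Warning header."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    return _WARNING_RE.findall(lowered.get("warning", ""))


# Hypothetical header value; the real warning text may differ.
sample_headers = {
    "Content-Type": "application/json",
    "Warning": '299 Elasticsearch-7.13.0-abc123 "hypothetical: no transform node available"',
}
print(warning_messages(sample_headers))
```

A response without a `Warning` header yields an empty list, so callers that ignore headers today are unaffected.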

kevinlog · 2021-04-29T12:47:02Z

Addressed this with docs and finer-grained error handling:

#91864
elastic/security-docs#608

pzl added the bug and Team: SecuritySolution labels Feb 16, 2021

kevinlog added the Team:Defend Workflows label Feb 16, 2021

kevinlog added the Team:Fleet label Feb 16, 2021

kevinlog added the v7.12.0 label Feb 16, 2021

kevinlog added the planning and v7.11.2 labels Feb 18, 2021

kevinlog assigned pzl Feb 18, 2021

kevinlog removed the planning label Feb 18, 2021

skh mentioned this issue Feb 18, 2021

[Fleet] Have finer-grained error handling for errors during /api/fleet/setup #91864

Closed

MindyRS added the impact:high label Feb 18, 2021

hendrikmuhs mentioned this issue Feb 19, 2021

[Transform] can't delete transform, stop start after rolling upgrade / node role change elastic/elasticsearch#69260

Closed

EricDavisX added v7.11.1 and removed v7.11.2 labels Mar 2, 2021

kevinlog closed this as completed Apr 29, 2021
