[Fleet] Finer-grained error information from install/upgrade API #95649

Merged
merged 11 commits into from
Apr 18, 2021

Conversation

@skh skh commented Mar 29, 2021

Summary

Partially implements #91864

The approach, trying to stay minimally invasive, is:

  • leave _installPackage() unchanged. If errors happen, they are thrown upwards.
  • in installPackageFromRegistry() and installPackageByUpload(), add a try / catch block around the call to _installPackage(). Caught errors are not rethrown; they are added to the returned installResult, together with the installType of the operation (install, upgrade, etc.)
  • for this purpose, the interface InstallResult is changed
  • the changed InstallResult is returned from installPackage()
  • callers of installPackage() inspect the return value and throw any error they find, so that overall behavior isn't changed.
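
The approach above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the code in this PR: the fields of InstallResult, the `broken` flag, and the helper installOrThrow() are assumptions made for the example, and the real functions are presumably asynchronous.

```typescript
// Hypothetical shapes; the actual InstallResult interface in Fleet may differ.
type InstallType = 'install' | 'upgrade';

interface InstallResult {
  assets?: string[];
  error?: Error;
  installType: InstallType;
}

// Stand-in for _installPackage(): left unchanged, it throws on failure.
function _installPackage(pkgName: string, broken: boolean): string[] {
  if (broken) {
    throw new Error('failed to install ' + pkgName);
  }
  return [pkgName + '-dashboard'];
}

// Stand-in for installPackageFromRegistry(): catches the error and reports it
// in the result together with the install type, instead of throwing.
function installPackageFromRegistry(
  pkgName: string,
  installType: InstallType,
  broken: boolean
): InstallResult {
  try {
    return { assets: _installPackage(pkgName, broken), installType };
  } catch (e) {
    const error = e instanceof Error ? e : new Error(String(e));
    return { error, installType };
  }
}

// Hypothetical caller: inspects the result and rethrows, so that the overall
// behavior stays the same as before the change.
function installOrThrow(pkgName: string, installType: InstallType, broken: boolean): string[] {
  const installResult = installPackageFromRegistry(pkgName, installType, broken);
  if (installResult.error) {
    throw installResult.error;
  }
  return installResult.assets || [];
}
```

A later caller could skip the rethrow for errors it considers non-fatal and report installResult.error back to the user instead.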

This is all in preparation for callers being able to not re-throw the error they find, but report it back to the user in case they deem it non-fatal. This will happen in a second PR.

How to test this

In all tests, adjust BASEPATH to match your locally running system, or remove it. For reference, BASEPATH in all curl commands below is rch. (You can set this with server.basePath: "/rch" in kibana.dev.yml)
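
To make explicit how BASEPATH enters the URLs used below, here is a small TypeScript sketch. The helper fleetUrl() is hypothetical and not part of Kibana; only the host, the base path rch, and the route are taken from the curl commands in this description.

```typescript
// Hypothetical helper: joins host, optional base path, and an API route.
function fleetUrl(host: string, basePath: string, route: string): string {
  // Strip leading and trailing slashes so '', 'rch', and '/rch/' all work.
  const trimmed = basePath.replace(/^\/+|\/+$/g, '');
  const prefix = trimmed === '' ? '' : '/' + trimmed;
  return host + prefix + route;
}

// With server.basePath: "/rch"
const withBasePath = fleetUrl('http://localhost:5601', '/rch', '/api/fleet/epm/packages/system-0.10.9');

// With no base path configured
const withoutBasePath = fleetUrl('http://localhost:5601', '', '/api/fleet/epm/packages/system-0.10.9');
```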

After the changes in this PR, the behavior in error situations should be exactly as before. Unfortunately, not all error scenarios are covered by the tests we run in CI, so some need to be tested manually. Specifically, testing the setup API endpoints with broken required packages is not possible in CI: we can have only one registry setup there (a combination of a docker container containing the registry and most packages, and the packages contained in fleet_api_integration/api/fixtures), and we need a working setup for most other integration tests.

Tests with system

Craft a system package that triggers an error during installation and serve it from a locally running registry. To do so, edit any dashboard in system/0.10.9 to include

"migrationVersion": {
    "dashboard": "9.3.0"
  }

instead of 7.3.0.
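
If you prefer to script the edit, something along these lines might do. It is only a sketch: the SavedObjectLike type and breakDashboard() are hypothetical names, and reading and writing the dashboard file in your local package checkout is left out.

```typescript
// Hypothetical, minimal view of a dashboard asset; real assets have more fields.
interface SavedObjectLike {
  migrationVersion?: { [type: string]: string };
  [key: string]: unknown;
}

// Returns a copy of the dashboard whose migrationVersion.dashboard is set to a
// version newer than the running Kibana, which makes the import fail.
function breakDashboard(dashboard: SavedObjectLike, version: string): SavedObjectLike {
  return {
    ...dashboard,
    migrationVersion: { ...dashboard.migrationVersion, dashboard: version },
  };
}

const originalDashboard: SavedObjectLike = {
  id: 'example-dashboard-id', // placeholder id, not taken from the package
  migrationVersion: { dashboard: '7.3.0' },
};
const brokenDashboard = breakDashboard(originalDashboard, '9.3.0');
```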

Use this to test the error during a plain installation, by calling

curl -X POST -u elastic:changeme http://localhost:5601/rch/api/fleet/epm/packages/system-0.10.9 -H 'kbn-xsrf: xyz'
  • observe the return value 422 Unprocessable entity
  • observe the errors in the log:
server    log   [17:10:15.106] [error][fleet][plugins] uninstalling system-0.10.9 after error installing
server    log   [17:10:15.115] [error][fleet][plugins] failed to uninstall or rollback package after installation error Error: system is installed by default and cannot be removed
server    log   [17:10:15.116] [error][fleet][plugins] Document "windows-01c54730-fee6-11e9-8405-516218e3d268" has property "dashboard" which belongs to a more recent version of Kibana [9.3.0]. The last known version is [7.11.0]

Note that the rollback refuses to uninstall system because it is a required package.

Now delete system with force: true like this:

curl -X DELETE -u elastic:changeme http://localhost:5601/rch/api/fleet/epm/packages/system-0.10.9 -H 'kbn-xsrf: xyz' -H "Content-Type: application/json" -d '{"force": true}'

and verify it is uninstalled with

curl -X GET -u elastic:changeme http://localhost:5601/rch/api/fleet/epm/packages/system-0.10.9 -H 'kbn-xsrf: xyz'

The response should contain "status":"not_installed".


Use the same broken system package as above.

Install a previous, non-broken version of the package with

curl -X POST -u elastic:changeme http://localhost:5601/rch/api/fleet/epm/packages/system-0.10.7 -H 'kbn-xsrf: xyz' -H "Content-Type: application/json" -d '{"force": true}'

Then update the system package with a call to

curl -X POST -u elastic:changeme http://localhost:5601/rch/api/fleet/epm/packages/system-0.10.9 -H 'kbn-xsrf: xyz'

Observe the errors in the return value from the API call, and in the log. Verify that the rollback to the older version worked with another call to

curl -X GET -u elastic:changeme http://localhost:5601/rch/api/fleet/epm/packages/system-0.10.7 -H 'kbn-xsrf: xyz'

and checking the value of status, which should be installed. Note that the response also contains "install_version":"0.10.9" and "install_status":"installing"; these are leftovers of the failed upgrade and subsequent rollback.
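
The check described above could be expressed like this. It is a sketch under assumptions: the three field names come from the response described here, but the flat PackageStatusFields shape and the function hasUpgradeLeftovers() are made up for illustration, and the real response may nest these fields differently.

```typescript
// Hypothetical flat view of the fields mentioned above.
interface PackageStatusFields {
  status: string;
  install_version?: string;
  install_status?: string;
}

// True when the package counts as installed but still carries traces of a
// failed upgrade: a pending install_status or a different install_version.
function hasUpgradeLeftovers(fields: PackageStatusFields, rolledBackTo: string): boolean {
  if (fields.status !== 'installed') {
    return false;
  }
  const stillInstalling = fields.install_status === 'installing';
  const versionMismatch =
    fields.install_version !== undefined && fields.install_version !== rolledBackTo;
  return stillInstalling || versionMismatch;
}

// Values as described above, after the failed upgrade from 0.10.7 to 0.10.9.
const afterRollback: PackageStatusFields = {
  status: 'installed',
  install_version: '0.10.9',
  install_status: 'installing',
};
const leftoversFound = hasUpgradeLeftovers(afterRollback, '0.10.7');
```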

Delete the system package again with

curl -X DELETE -u elastic:changeme http://localhost:5601/rch/api/fleet/epm/packages/system-0.10.7 -H 'kbn-xsrf: xyz' -H "Content-Type: application/json" -d '{"force": true}'

Now use the same broken system package to test the setup and bulk install endpoints. Test the fresh install of the broken system-0.10.9 as well as the update from the non-broken system-0.10.7 to the broken system-0.10.9. The relevant curl commands are:

curl -X POST -u elastic:changeme http://localhost:5601/rch/api/fleet/setup -H 'kbn-xsrf: xyz'
curl -X POST -u elastic:changeme http://localhost:5601/rch/api/fleet/agents/setup -H 'kbn-xsrf: xyz'
curl -X POST -u elastic:changeme http://localhost:5601/rch/api/fleet/epm/packages/_bulk -H 'kbn-xsrf: xyz' -H "Content-Type: application/json" -d '{"packages": ["system"]}'

Tests with endpoint

One way to break the endpoint package is to make one of the ingest pipelines in the contained data streams invalid by changing "ignore_failure": true to "ignore_failures": true in the processor (or introducing any other syntax error here).

Then open the UI and navigate directly to Security -> Overview. (Do NOT open the Fleet page first, as this would call the setup endpoints which would invalidate this test.)

Observe the call to the /api/fleet/epm/packages/_bulk endpoint in the network requests. It responds with 200 OK, and the response {"response":[{"name":"endpoint","statusCode":500,"error":"parse_exception"}]}. This behavior is unchanged from before. (The parse_exception comes from the invalid ingest pipeline.)
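
Because the endpoint answers 200 OK even when a package fails, a client has to look at the individual items. A minimal sketch of that inspection follows; the interface names and failedPackages() are hypothetical, and only the fields visible in the response above are assumed.

```typescript
// Item shape inferred from the response shown above; other fields may exist.
interface BulkInstallItem {
  name: string;
  statusCode?: number;
  error?: string;
}

interface BulkInstallResponse {
  response: BulkInstallItem[];
}

// Collects the items that report an error or an error-class status code.
function failedPackages(body: BulkInstallResponse): BulkInstallItem[] {
  return body.response.filter(
    (item) => item.error !== undefined || (item.statusCode !== undefined && item.statusCode >= 400)
  );
}

const bulkBody: BulkInstallResponse = JSON.parse(
  '{"response":[{"name":"endpoint","statusCode":500,"error":"parse_exception"}]}'
);
const bulkFailures = failedPackages(bulkBody);
```

Here bulkFailures contains the single endpoint item with its parse_exception error.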

In the log, observe these errors:

server    log   [14:16:56.892] [error][fleet][plugins] uninstalling endpoint-0.18.0 after error installing
server    log   [14:16:56.938] [error][fleet][plugins] failed to uninstall or rollback package after installation error Error: endpoint is installed by default and cannot be removed
server    log   [14:16:56.943] [error][fleet][plugins] ResponseError: parse_exception
    at onBody (/home/skh/projects/kibana/node_modules/@elastic/elasticsearch/lib/Transport.js:337:23)
    [...]

(Specifically, note that Fleet again refuses to uninstall the endpoint package during rollback as it is a required package.)

Additional tests would be to first install a non-broken earlier version of endpoint, then open the Security UI and verify that an attempt to update the package to the current, broken version was made and the rollback was successful. This is, however, already covered by testing the _bulk endpoint in isolation as described above.


Other possible tests

  • Use a broken non-mandatory package to test errors from the /epm/packages route (optional, if you're only testing through the API with curl calls, as you can install system like any other package before the UI triggers the setup)
  • Use the same non-mandatory package to test errors from the direct package upload route
  • Configure a non-existing registry and test the install, setup, and bulk install endpoints again
  • Check in the UI that an error during setup still blocks access to the Fleet UI.

In addition to that, verify that normal functionality is not broken, i.e. try to break it in any way you can think of.

@skh skh self-assigned this Mar 29, 2021
@skh skh added Feature:EPM Fleet team's Elastic Package Manager (aka Integrations) project Feature:Fleet Fleet team's agent central management project release_note:skip Skip the PR/issue when compiling release notes Team:Fleet Team label for Observability Data Collection Fleet team v7.13.0 v8.0.0 labels Mar 29, 2021
@skh skh force-pushed the 91864-return-error-metainfo branch 3 times, most recently from 166f76f to 716957a Compare April 1, 2021 16:22
@skh

skh commented Apr 1, 2021

@jen-huang after 716957a installing a broken package with a direct POST should correctly report errors and attempt to roll back, but the bulk install code path (which is also used by the setup code path) will still fail. Feel free to have a look, otherwise I'll just continue working on it.

@skh skh force-pushed the 91864-return-error-metainfo branch 3 times, most recently from 8bed5f9 to 3f217d8 Compare April 8, 2021 09:29
@skh skh marked this pull request as ready for review April 8, 2021 09:44
@skh skh requested a review from a team as a code owner April 8, 2021 09:44
@elasticmachine

Pinging @elastic/fleet (Feature:EPM)

@skh

skh commented Apr 8, 2021

@jen-huang This is ready for review. If you don't have time to do all the manual tests, could you review the test scenarios in the initial description to check if I missed something important?

@kevinlog This touches the bulk install functionality used by the security solution, who from your team could review this?

@skh skh requested a review from jen-huang April 8, 2021 09:47
@skh

skh commented Apr 8, 2021

@afgomez maybe you could have a look at this too, and maybe you find the test descriptions helpful for other EPM work.

@skh skh force-pushed the 91864-return-error-metainfo branch 2 times, most recently from f1a8da4 to f2243fc Compare April 12, 2021 10:20
@afgomez afgomez self-assigned this Apr 12, 2021
@skh skh force-pushed the 91864-return-error-metainfo branch from f2243fc to 9e9e11e Compare April 15, 2021 12:33

@jfsiii jfsiii left a comment

I did not run this locally but I am comfortable merging it based on the description, tests, and my understanding of the code.

Let's get this in before FF and keep 👁️ and 👂 open for any issues 🚀

@skh skh force-pushed the 91864-return-error-metainfo branch from 9e9e11e to d3a4379 Compare April 16, 2021 10:43
@ruflin ruflin requested a review from afgomez April 16, 2021 11:52
@ruflin

ruflin commented Apr 16, 2021

++ on getting this in.


@kevinlog kevinlog left a comment

LGTM, this will be really helpful moving forward

@skh skh force-pushed the 91864-return-error-metainfo branch from a198cb6 to 9fa2e24 Compare April 18, 2021 12:39
@kibanamachine

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

API count

id     before  after  diff
fleet  1069    1071   +2

API count missing comments

id     before  after  diff
fleet  979     981    +2

History

  • 💚 Build #120273 succeeded d3a43791e36d1362afd062130e670d0557fb2335
  • 💚 Build #119964 succeeded 9e9e11e9d02ee9deeb4cfb824c3e32e901265220
  • 💚 Build #118653 succeeded f2243fc9dfc50e9cafa4104d714a2c950d58dabc
  • 💔 Build #118505 failed f1a8da47eb21fcb034b40b8bd186c852e5a828e9
  • 💔 Build #118460 failed 573c1be3a125584293d4404bbbfc1f04df0ad265

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @afgomez @skh

@skh skh merged commit 05bd1c0 into elastic:master Apr 18, 2021
skh added a commit to skh/kibana that referenced this pull request Apr 18, 2021
…stic#95649)

* Intercept installation errors and add meta info.

* Adjust mock.

* Catch errors in all steps of install/upgrade.

* Adjust handler for direct package upload.

* Don't throw not-found errors on assets during rollback.

* Correctly catch errors from _installPackage()

* Propagate error from installResult in bulk install case.

* Add tests for rollback.

* Remove unused code.

* Skipping test that doesn't test what it says.

* Fix and reenable test.
skh added a commit that referenced this pull request Apr 18, 2021
) (#97400)

@skh skh deleted the 91864-return-error-metainfo branch April 18, 2021 17:12
jloleysens added a commit to jloleysens/kibana that referenced this pull request Apr 19, 2021
…te-legacy-es-client

* 'master' of github.com:elastic/kibana: (102 commits)
  [Exploratory view] integerate page views to exploratory view (elastic#97258)
  Fix typo in license_api_guard README name and import http server mocks from public interface (elastic#97334)
  Avoid mutating KQL query when validating it (elastic#97081)
  Add description as title on tag badge (elastic#97109)
  Remove legacy ES client usages in `home` and `xpack_legacy` (elastic#97359)
  [Fleet] Finer-grained error information from install/upgrade API (elastic#95649)
  Rule registry bundle size (elastic#97251)
  [Partial Results] Move other bucket into Search Source (elastic#96384)
  [Dashboard] Makes lens default editor for creating new panels (elastic#96181)
  skip flaky suite (elastic#97387)
  [Asset Management] Agent picker follow up (elastic#97357)
  skip flaky suite (elastic#97382)
  [Security Solutions] Fixes flake with cypress tests (elastic#97329)
  skip flaky suite (elastic#97355)
  Skip test to try and stabilize master
  minimize number of so fild asserted in tests. it creates flakines when implementation details change (elastic#97374)
  [Search Sessions] Client side search cache (elastic#92439)
  [SavedObjects] Add aggregations support (elastic#96292)
  [Reporting] Remove legacy elasticsearch client usage from the reporting plugin (elastic#97184)
  [kbnClient] fix basePath handling and export reponse type (elastic#97277)
  ...

# Conflicts:
#	x-pack/plugins/watcher/server/lib/license_pre_routing_factory/license_pre_routing_factory.ts
#	x-pack/plugins/watcher/server/plugin.ts
#	x-pack/plugins/watcher/server/routes/api/indices/register_get_route.ts
#	x-pack/plugins/watcher/server/routes/api/license/register_refresh_route.ts
#	x-pack/plugins/watcher/server/routes/api/register_list_fields_route.ts
#	x-pack/plugins/watcher/server/routes/api/register_load_history_route.ts
#	x-pack/plugins/watcher/server/routes/api/settings/register_load_route.ts
#	x-pack/plugins/watcher/server/routes/api/watch/action/register_acknowledge_route.ts
#	x-pack/plugins/watcher/server/routes/api/watch/register_activate_route.ts
#	x-pack/plugins/watcher/server/routes/api/watch/register_deactivate_route.ts
#	x-pack/plugins/watcher/server/routes/api/watch/register_delete_route.ts
#	x-pack/plugins/watcher/server/routes/api/watch/register_execute_route.ts
#	x-pack/plugins/watcher/server/routes/api/watch/register_history_route.ts
#	x-pack/plugins/watcher/server/routes/api/watch/register_load_route.ts
#	x-pack/plugins/watcher/server/routes/api/watch/register_save_route.ts
#	x-pack/plugins/watcher/server/routes/api/watch/register_visualize_route.ts
#	x-pack/plugins/watcher/server/routes/api/watches/register_delete_route.ts
#	x-pack/plugins/watcher/server/routes/api/watches/register_list_route.ts
#	x-pack/plugins/watcher/server/shared_imports.ts
#	x-pack/plugins/watcher/server/types.ts
Labels
Feature:EPM Fleet team's Elastic Package Manager (aka Integrations) project Feature:Fleet Fleet team's agent central management project release_note:skip Skip the PR/issue when compiling release notes Team:Fleet Team label for Observability Data Collection Fleet team v7.13.0 v8.0.0

7 participants