Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Security Solution][Detections] Fix fetching package info from registry for installed integrations #134732

Conversation

banderror
Copy link
Contributor

@banderror banderror commented Jun 20, 2022

Fixes: #134639

Summary

In Cloud, Elastic APM and Fleet Server integrations are installed by default. However, attempts to fetch their packages from Elastic Package Registry via Fleet services on the server-side fail with the following errors:

{
    "message": "[email protected] not found",
    "status_code": 500
}
{
    "message": "[email protected] not found",
    "status_code": 500
}

Screenshot 2022-06-20 at 11 28 18

This behavior happens in some Cloud environments (like the one in the related ticket). It seems to not happen in Cloud CI environments and locally.

This PR adds error handling for this edge case to GET /internal/detection_engine/fleet/integrations/installed?packages= endpoint.

  • It logs fetching errors to the console logs of Kibana.
  • It uses a "best-effort" approach for returning data from the endpoint. If we could successfully read existing integration policies, we already have all of the needed data except correct integration titles. So, if after that any request to EPR results in an error, we:
    • Still return 200 with a list of installed integrations
    • Include correct titles for those packages that were successfully fetched
    • Include "best guess" titles for those packages that failed
[2022-06-20T12:57:10.270+02:00][ERROR][plugins.securitySolution] Error fetching package info from registry for 
[email protected]. Boom!
[2022-06-20T12:57:10.270+02:00][ERROR][plugins.securitySolution] Error fetching package info from registry for 
[email protected]. Boom!

Screenshot 2022-06-20 at 13 05 08

Checklist

@banderror banderror force-pushed the fix-fetching-package-info-for-installed-integrations branch from 6b3aab9 to be59ee5 Compare June 20, 2022 10:40
@banderror banderror self-assigned this Jun 20, 2022
@banderror banderror added bug Fixes for quality problems that affect the customer experience release_note:skip Skip the PR/issue when compiling release notes impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. Feature:Detection Rules Security Solution rules and Detection Engine Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. auto-backport Deprecated - use backport:version if exact versions are needed Feature:Rule Management Security Solution Detection Rule Management area Team:Detection Rule Management Security Detection Rule Management Team v8.3.0 Feature:Prebuilt Detection Rules Security Solution Prebuilt Detection Rules area Feature:Rule Details Security Solution Detection Rule Details page v8.4.0 ci:deploy-cloud labels Jun 20, 2022
@banderror banderror marked this pull request as ready for review June 20, 2022 10:47
@banderror banderror requested review from a team as code owners June 20, 2022 10:47
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@banderror banderror force-pushed the fix-fetching-package-info-for-installed-integrations branch from be59ee5 to 58bb62c Compare June 20, 2022 11:53
@banderror
Copy link
Contributor Author

banderror commented Jun 20, 2022

So I tested it in the Cloud CI deployment. There seem to be no errors when fetching Elastic APM and Fleet Server integrations on the BE. This is different from the behavior described in #134639.

Notice the "correct" Elastic APM Integration title. It indicates that the package data with this title was fetched from EPR. If there was a fetch error, it would be the "best guess" title Elastic APM Apmserver (see the PR description).

Screenshot 2022-06-20 at 15 25 03

Screenshot 2022-06-20 at 15 24 40

set.getPackages().map((packageInfo) => {
return fleet.packages.getRegistryPackage(
const registryPackages = await initPromisePool({
concurrency: MAX_CONCURRENT_REQUESTS_TO_PACKAGE_REGISTRY,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding a pool and capping the requests here! I think we discussed this in the initial PR for this endpoint, but I'd be curious to see what the upper bounds are here for large deployments leveraging integrations. Would be good to test with a large number of integration policies, and then also with a large number of installed integrations.

One issue we have with fetching package policies is we're not handling pagination, so we'll cap out at the max for that initial page. Then for fetching the individual packages, I'm a little worried if there's say 100+ installed packages and we end up making 100+ requests to fleet each time an individual user hits the Rules Mgmt or Details pages. We do have caching client-side through the react-query wrapper, but nothing here server side, so could get messy if there's a few different users bouncing around those pages.

If this becomes an issue in production, users can at least disable this feature and short-circuit this request on the Rule Mgmt page w/ the Kibana Advanced Setting, but definitely something we'll want to test further. Will be good to get feedback from the fleet folks about any other API's we might be able to use here to ease the pressure on the fleet side -- would be great if we could just get all installed packages in one request instead of going the integration policy route.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to test with a large number of integration policies, and then also with a large number of installed integrations.

Agree, will do it 👍

One issue we have with fetching package policies is we're not handling pagination, so we'll cap out at the max for that initial page.

I'll check how this method works when the perPage is not specified. I thought it just returns all the policies, but it might also set a default page size instead. Anyway, definitely need to test this 👍

I'm a little worried if there's say 100+ installed packages and we end up making 100+ requests to fleet each time an individual user hits the Rules Mgmt or Details pages. We do have caching client-side through the react-query wrapper, but nothing here server side

FWIW Fleet service caches fetched packages on the server-side. So only the first request may be slow (it depends linearly on the number of installed packages). The subsequent requests should be fast and not generate outgoing HTTP requests to EPR. I'm not sure 100% about this though -- so will test it as well. 👍

I will also test it with a large number of installed packages and a large number of integration policies.

Will be good to get feedback from the fleet folks about any other API's we might be able to use here to ease the pressure on the fleet side -- would be great if we could just get all installed packages in one request instead of going the integration policy route.

I'll open a discussion issue and share it with Fleet folks. 👍

Thank you @spong, this is a lot of great points!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW Fleet service caches fetched packages on the server-side. So only the first request may be slow (it depends linearly on the number of installed packages). The subsequent requests should be fast and not generate outgoing HTTP requests to EPR. I'm not sure 100% about this though -- so will test it as well. 👍

This makes me feel a lot better about the upper bound cases. I didn't realize the fleet service cached as well, thanks @banderror!

And sounds good with the other points as well, thank you for verifying here 👍

Comment on lines 62 to 64
const logMessage = `Error fetching package info from registry for ${item.package_name}@${item.package_version}`;
const logReason = error instanceof Error ? error.message : String(error);
logger.error(`${logMessage}. ${logReason}`);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While nice we're logging these now, I'm curious how actionable this will be for users, and potential overall noisiness of these logs as well. Perhaps debug may be a better log level for now?

Copy link
Contributor Author

@banderror banderror Jun 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spong I think it depends on the error. Something like [email protected] not found as in the GH issue probably wouldn't be actionable for users. But if a user has a custom package registry, and it's not available, it would likely end up with an exception here and let the user know that they need to fix their environment.

It could be noisy, I agree. I think we have a similar situation with the implementation of IRuleExecutionLogForRoutes. There, we don't know for sure what exceptions we might catch (it could be a network error, a bug in the underlying code of SO Client or Event Log Reader, or a business error), but we log them via logger.error. Would you rather log them on the debug level?

I don't have a strong opinion on this, both options are ok to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spong I think it depends on the error. Something like [email protected] not found as in the GH issue probably wouldn't be actionable for users. But if a user has a custom package registry, and it's not available, it would likely end up with an exception here and let the user know that they need to fix their environment.

Yeah, that's a good point. I was just thinking from the standpoint of linking this error back to the related integrations feature so there was some context as to where this was bubbling up from.

This was the error I saw in testing:
image

and was just thinking adding some context like Unable to retrieve required integrations. Error fetching package info from registry for test@7. whoops may be helpful.


As for the log-level, no strong opinions here either, but was just trying to think if there's a common scenario where this could be spammy. I'm fine leaving as-is, and we can just check up on things post-release to see what the impact looks like. 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved console logs a bit in d6a6a46

Errors should be less noisy but still visible to users by default. Added Unable to retrieve installed integrations for more context.

[2022-06-20T21:03:44.551+02:00][ERROR][plugins.securitySolution] Unable to retrieve installed integrations. Error fetching packages from registry: [email protected], [email protected].
[2022-06-20T21:03:44.551+02:00][DEBUG][plugins.securitySolution] Error fetching package info from registry for [email protected]. Boom!
[2022-06-20T21:03:44.551+02:00][DEBUG][plugins.securitySolution] Error fetching package info from registry for [email protected]. Boom!

Copy link
Member

@spong spong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked out, tested locally and LGTM! 👍

Left a couple comments around scale testing and a known bug with pagination, but this specific change LGTM with regards to fixing issues when errors are returned from fleet.packages.getRegistryPackage(), so I'm all good for merging and getting this extra guard in. Note: I was not able to do a full e2e test as I'm not sure of the scenarios where fleet would return errors here (still waiting to hear back from QA on any specific config they had that was causing this), but I did repro by throwing an error manually and everything looked good as a result of that.

Copy link
Contributor

@dplumlee dplumlee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@banderror banderror force-pushed the fix-fetching-package-info-for-installed-integrations branch from 58bb62c to 068ad21 Compare June 20, 2022 19:05
@banderror banderror force-pushed the fix-fetching-package-info-for-installed-integrations branch from 068ad21 to d6a6a46 Compare June 20, 2022 19:06
@banderror banderror enabled auto-merge (squash) June 20, 2022 19:14
@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

  • 💚 Build #51815 succeeded 58bb62cd4d8f2beea7bdad1500a06f123122b6d3
  • 💔 Build #51786 failed be59ee5aed3d737509b7f0efea1e11c383442d38

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @banderror

@banderror banderror merged commit cdcb272 into elastic:main Jun 20, 2022
kibanamachine pushed a commit that referenced this pull request Jun 20, 2022
…ry for installed integrations (#134732)

**Fixes:** #134639

## Summary

In Cloud, `Elastic APM` and `Fleet Server` integrations are installed by default. However, attempts to fetch their packages from Elastic Package Registry via Fleet services on the server-side fail with the following errors:

```json
{
    "message": "[email protected] not found",
    "status_code": 500
}
```
```json
{
    "message": "[email protected] not found",
    "status_code": 500
}
```

<img width="797" alt="Screenshot 2022-06-20 at 11 28 18" src="https://user-images.githubusercontent.com/7359339/174571610-4c24e777-c49a-49e0-addf-54c6301cc8ca.png">

This behavior happens in some Cloud environments (like the one in the related ticket). It seems to not happen in Cloud CI environments and locally.

This PR adds error handling for this edge case to `GET /internal/detection_engine/fleet/integrations/installed?packages=` endpoint.

- It logs fetching errors to the console logs of Kibana.
- It uses a "best-effort" approach for returning data from the endpoint. If we could successfully read existing integration policies, we already have all of the needed data except correct integration titles. So, if after that any request to EPR results in an error, we:
  - Still return 200 with a list of installed integrations
  - Include correct titles for those packages that were successfully fetched
  - Include "best guess" titles for those packages that failed

```
[2022-06-20T12:57:10.270+02:00][ERROR][plugins.securitySolution] Error fetching package info from registry for
[email protected]. Boom!
[2022-06-20T12:57:10.270+02:00][ERROR][plugins.securitySolution] Error fetching package info from registry for
[email protected]. Boom!
```

<img width="1085" alt="Screenshot 2022-06-20 at 13 05 08" src="https://user-images.githubusercontent.com/7359339/174588468-d28c1383-3a25-4f16-8905-bad3ca73e63e.png">

### Checklist

- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

(cherry picked from commit cdcb272)
@kibanamachine
Copy link
Contributor

💚 All backports created successfully

Status Branch Result
8.3

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Jun 20, 2022
…ry for installed integrations (#134732) (#134784)

**Fixes:** #134639

## Summary

In Cloud, `Elastic APM` and `Fleet Server` integrations are installed by default. However, attempts to fetch their packages from Elastic Package Registry via Fleet services on the server-side fail with the following errors:

```json
{
    "message": "[email protected] not found",
    "status_code": 500
}
```
```json
{
    "message": "[email protected] not found",
    "status_code": 500
}
```

<img width="797" alt="Screenshot 2022-06-20 at 11 28 18" src="https://user-images.githubusercontent.com/7359339/174571610-4c24e777-c49a-49e0-addf-54c6301cc8ca.png">

This behavior happens in some Cloud environments (like the one in the related ticket). It seems to not happen in Cloud CI environments and locally.

This PR adds error handling for this edge case to `GET /internal/detection_engine/fleet/integrations/installed?packages=` endpoint.

- It logs fetching errors to the console logs of Kibana.
- It uses a "best-effort" approach for returning data from the endpoint. If we could successfully read existing integration policies, we already have all of the needed data except correct integration titles. So, if after that any request to EPR results in an error, we:
  - Still return 200 with a list of installed integrations
  - Include correct titles for those packages that were successfully fetched
  - Include "best guess" titles for those packages that failed

```
[2022-06-20T12:57:10.270+02:00][ERROR][plugins.securitySolution] Error fetching package info from registry for
[email protected]. Boom!
[2022-06-20T12:57:10.270+02:00][ERROR][plugins.securitySolution] Error fetching package info from registry for
[email protected]. Boom!
```

<img width="1085" alt="Screenshot 2022-06-20 at 13 05 08" src="https://user-images.githubusercontent.com/7359339/174588468-d28c1383-3a25-4f16-8905-bad3ca73e63e.png">

### Checklist

- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

(cherry picked from commit cdcb272)

Co-authored-by: Georgii Gorbachev <[email protected]>
@banderror banderror deleted the fix-fetching-package-info-for-installed-integrations branch June 20, 2022 22:43
@banderror
Copy link
Contributor Author

@spong Yesterday I noticed the same type of errors, but locally. For example, right now I can see this in the logs (my changes are rebased on top of the latest main):

[2022-06-22T14:01:34.683+02:00][ERROR][plugins.securitySolution] Unable to retrieve installed integrations. Error fetching packages from registry: [email protected].
[2022-06-22T14:01:34.684+02:00][DEBUG][plugins.securitySolution] Error fetching package info from registry for [email protected]. [email protected] not found

This is how it looks in the UI.

Screenshot 2022-06-22 at 14 24 08

Screenshot 2022-06-22 at 14 25 42

It feels like whether an error will be thrown or not depends on the requested package version. Maybe. It's weird because the Fleet page itself says that [email protected] is installed, so how could it be not found??

I think there's still some kind of a bug here. I will open a separate ticket for it.

@spong
Copy link
Member

spong commented Jun 23, 2022

Yeah, something interesting here... I was able to repro by installing an older version of a package ([email protected]) and got the error in the kibana log the first time, but haven't seen it since. Tried adding another older version package and no error log for that one. Restarted with debugger attached and still not able to reproduce the error from the fleet side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Deprecated - use backport:version if exact versions are needed bug Fixes for quality problems that affect the customer experience ci:cloud-deploy Create or update a Cloud deployment Feature:Detection Rules Security Solution rules and Detection Engine Feature:Prebuilt Detection Rules Security Solution Prebuilt Detection Rules area Feature:Rule Details Security Solution Detection Rule Details page Feature:Rule Management Security Solution Detection Rule Management area impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. release_note:skip Skip the PR/issue when compiling release notes Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v8.3.0 v8.4.0
Projects
None yet
7 participants