[Security Solution][Detections] Fix fetching package info from registry for installed integrations #134732
Conversation
Force-pushed from 6b3aab9 to be59ee5
Pinging @elastic/security-detections-response (Team:Detections and Resp)
Pinging @elastic/security-solution (Team: SecuritySolution)
Force-pushed from be59ee5 to 58bb62c
So I tested it in the Cloud CI deployment. There seem to be no errors when fetching. Notice the "correct" titles.
```ts
set.getPackages().map((packageInfo) => {
  return fleet.packages.getRegistryPackage(

const registryPackages = await initPromisePool({
  concurrency: MAX_CONCURRENT_REQUESTS_TO_PACKAGE_REGISTRY,
```
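(For context on the snippet above: a minimal sketch of what a bounded-concurrency pool does — at most N registry requests in flight, with per-item failures collected instead of rejecting the whole batch. This is an illustration of the pattern, not the actual `initPromisePool` implementation.)

```ts
interface PoolResult<Item, Result> {
  results: Array<{ item: Item; result: Result }>;
  errors: Array<{ item: Item; error: unknown }>;
}

// Runs `executor` over `items` with at most `concurrency` promises in flight.
async function runWithConcurrency<Item, Result>(
  items: Item[],
  concurrency: number,
  executor: (item: Item) => Promise<Result>
): Promise<PoolResult<Item, Result>> {
  const results: Array<{ item: Item; result: Result }> = [];
  const errors: Array<{ item: Item; error: unknown }> = [];
  let next = 0;

  // Each worker repeatedly pulls the next unprocessed item until none are left.
  const workers = Array.from({ length: Math.min(concurrency, items.length) }, async () => {
    while (next < items.length) {
      const item = items[next++];
      try {
        results.push({ item, result: await executor(item) });
      } catch (error) {
        // A failed item is recorded instead of failing the whole pool.
        errors.push({ item, error });
      }
    }
  });

  await Promise.all(workers);
  return { results, errors };
}
```

Usage would then look roughly like `runWithConcurrency(packages, MAX_CONCURRENT_REQUESTS_TO_PACKAGE_REGISTRY, (pkg) => fleet.packages.getRegistryPackage(...))`, matching the shape of the call shown in the diff.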
Thanks for adding a pool and capping the requests here! I think we discussed this in the initial PR for this endpoint, but I'd be curious to see what the upper bounds are here for large deployments leveraging integrations. Would be good to test with a large number of integration policies, and then also with a large number of installed integrations.
One issue we have with fetching package policies is that we're not handling pagination, so we'll cap out at the max for that initial page. Then, for fetching the individual packages, I'm a little worried that if there are, say, 100+ installed packages, we'll end up making 100+ requests to Fleet each time an individual user hits the Rules Mgmt or Details pages. We do have caching client-side through the react-query wrapper, but nothing server-side here, so it could get messy if there are a few different users bouncing around those pages.
If this becomes an issue in production, users can at least disable this feature and short-circuit this request on the Rule Mgmt page with the Kibana Advanced Setting, but it's definitely something we'll want to test further. It will be good to get feedback from the Fleet folks about any other APIs we might be able to use here to ease the pressure on the Fleet side -- it would be great if we could just get all installed packages in one request instead of going the integration policy route.
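(On the pagination concern above: a rough sketch of paging through all policies instead of capping at the first page. The `listPage` callback and its `{ items, total }` shape are hypothetical stand-ins, not the actual Fleet client API.)

```ts
interface PolicyPage<T> {
  items: T[];
  total: number;
}

// Hypothetical paginated lister, e.g. backed by the Fleet package policy service.
type ListPage<T> = (page: number, perPage: number) => Promise<PolicyPage<T>>;

// Keeps requesting pages until every policy has been collected.
async function fetchAllPackagePolicies<T>(listPage: ListPage<T>, perPage = 100): Promise<T[]> {
  const all: T[] = [];
  let page = 1;
  let total = Infinity;

  while (all.length < total) {
    const { items, total: reportedTotal } = await listPage(page, perPage);
    total = reportedTotal;
    all.push(...items);
    if (items.length === 0) break; // safety stop if the service under-reports the total
    page += 1;
  }

  return all;
}
```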
> Would be good to test with a large number of integration policies, and then also with a large number of installed integrations.

Agree, will do it 👍

> One issue we have with fetching package policies is that we're not handling pagination, so we'll cap out at the max for that initial page.

I'll check how this method works when `perPage` is not specified. I thought it just returns all the policies, but it might also set a default page size instead. Anyway, we definitely need to test this 👍

> I'm a little worried that if there are, say, 100+ installed packages, we'll end up making 100+ requests to Fleet each time an individual user hits the Rules Mgmt or Details pages. We do have caching client-side through the react-query wrapper, but nothing server-side here.

FWIW, the Fleet service caches fetched packages on the server-side. So only the first request may be slow (it depends linearly on the number of installed packages). The subsequent requests should be fast and should not generate outgoing HTTP requests to EPR. I'm not 100% sure about this though -- so I will test it as well. 👍

I will also test it with a large number of installed packages and a large number of integration policies.

> It will be good to get feedback from the Fleet folks about any other APIs we might be able to use here to ease the pressure on the Fleet side -- it would be great if we could just get all installed packages in one request instead of going the integration policy route.

I'll open a discussion issue and share it with the Fleet folks. 👍

Thank you @spong, these are all great points!
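(On the server-side caching point above: if Fleet did not already cache these lookups, a simple TTL cache in front of the registry fetch would be one way to ease the pressure. This is only an illustration of the idea, not how Fleet actually implements its caching.)

```ts
// Minimal in-memory TTL cache keyed by "name@version"; a sketch only, assuming
// the cached value is whatever the registry fetch resolves to.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(private readonly ttlMs: number) {}

  async getOrFetch(key: string, fetcher: () => Promise<V>): Promise<V> {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > Date.now()) {
      return hit.value; // served from cache, no outgoing request to EPR
    }
    const value = await fetcher();
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }
}
```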
> FWIW, the Fleet service caches fetched packages on the server-side. So only the first request may be slow (it depends linearly on the number of installed packages). The subsequent requests should be fast and should not generate outgoing HTTP requests to EPR. I'm not 100% sure about this though -- so I will test it as well. 👍
This makes me feel a lot better about the upper bound cases. I didn't realize the fleet service cached as well, thanks @banderror!
And sounds good with the other points as well, thank you for verifying here 👍
```ts
const logMessage = `Error fetching package info from registry for ${item.package_name}@${item.package_version}`;
const logReason = error instanceof Error ? error.message : String(error);
logger.error(`${logMessage}. ${logReason}`);
```
While it's nice that we're logging these now, I'm curious how actionable this will be for users, and about the potential overall noisiness of these logs as well. Perhaps `debug` may be a better log level for now?
@spong I think it depends on the error. Something like `[email protected] not found` as in the GH issue probably wouldn't be actionable for users. But if a user has a custom package registry and it's not available, it would likely end up with an exception here and let the user know that they need to fix their environment.

It could be noisy, I agree. I think we have a similar situation with the implementation of `IRuleExecutionLogForRoutes`. There, we don't know for sure what exceptions we might catch (it could be a network error, a bug in the underlying code of the SO Client or Event Log Reader, or a business error), but we log them via `logger.error`. Would you rather log them on the `debug` level?

I don't have a strong opinion on this, both options are OK with me.
> @spong I think it depends on the error. Something like `[email protected] not found` as in the GH issue probably wouldn't be actionable for users. But if a user has a custom package registry and it's not available, it would likely end up with an exception here and let the user know that they need to fix their environment.
Yeah, that's a good point. I was just thinking from the standpoint of linking this error back to the related integrations feature so there was some context as to where this was bubbling up from.
This was the error I saw in testing, and I was just thinking that adding some context like `Unable to retrieve required integrations. Error fetching package info from registry for test@7. whoops` may be helpful.
As for the log-level, no strong opinions here either, but was just trying to think if there's a common scenario where this could be spammy. I'm fine leaving as-is, and we can just check up on things post-release to see what the impact looks like. 👍
Improved console logs a bit in d6a6a46. Errors should be less noisy but still visible to users by default. Added `Unable to retrieve installed integrations` for more context.
```
[2022-06-20T21:03:44.551+02:00][ERROR][plugins.securitySolution] Unable to retrieve installed integrations. Error fetching packages from registry: [email protected], [email protected].
[2022-06-20T21:03:44.551+02:00][DEBUG][plugins.securitySolution] Error fetching package info from registry for [email protected]. Boom!
[2022-06-20T21:03:44.551+02:00][DEBUG][plugins.securitySolution] Error fetching package info from registry for [email protected]. Boom!
```
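(Roughly, the logging pattern behind the output above: one aggregated, user-visible `error` line listing the packages that could not be fetched, with per-package details at `debug`. Names and shapes here are illustrative, not the PR's exact code.)

```ts
// `failed` would come from the errors collected by the promise pool.
interface FailedPackage {
  package_name: string;
  package_version: string;
  error: unknown;
}

interface MinimalLogger {
  error: (message: string) => void;
  debug: (message: string) => void;
}

function logRegistryFailures(logger: MinimalLogger, failed: FailedPackage[]): void {
  if (failed.length === 0) return;

  // One aggregated error with just the list of failed packages.
  const packages = failed.map((f) => `${f.package_name}@${f.package_version}`).join(', ');
  logger.error(`Unable to retrieve installed integrations. Error fetching packages from registry: ${packages}.`);

  // Per-package details stay at debug level to keep the default logs quiet.
  for (const f of failed) {
    const reason = f.error instanceof Error ? f.error.message : String(f.error);
    logger.debug(`Error fetching package info from registry for ${f.package_name}@${f.package_version}. ${reason}`);
  }
}
```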
Checked out, tested locally and LGTM! 👍
Left a couple of comments around scale testing and a known bug with pagination, but this specific change LGTM with regards to fixing issues when errors are returned from `fleet.packages.getRegistryPackage()`, so I'm all good for merging and getting this extra guard in. Note: I was not able to do a full e2e test as I'm not sure of the scenarios where Fleet would return errors here (still waiting to hear back from QA on any specific config they had that was causing this), but I did repro by throwing an error manually and everything looked good as a result of that.
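(One way to repro the error path by throwing manually, as mentioned above: wrap the Fleet client so every registry lookup rejects with the same failure seen in Cloud. The `FleetLike` shape is an assumption for illustration, not the real Fleet client type.)

```ts
interface FleetLike {
  packages: {
    getRegistryPackage: (pkgName: string, pkgVersion: string) => Promise<unknown>;
  };
}

// Wraps the client so every package lookup rejects, mimicking the error from the GH issue.
function withFailingRegistry(fleet: FleetLike): FleetLike {
  return {
    ...fleet,
    packages: {
      ...fleet.packages,
      getRegistryPackage: async () => {
        throw new Error('[email protected] not found'); // mimics the Cloud failure
      },
    },
  };
}
```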
lgtm!
Force-pushed from 58bb62c to 068ad21
Force-pushed from 068ad21 to d6a6a46
💚 Build Succeeded

cc @banderror
…ry for installed integrations (#134732)

**Fixes:** #134639

## Summary

In Cloud, `Elastic APM` and `Fleet Server` integrations are installed by default. However, attempts to fetch their packages from Elastic Package Registry via Fleet services on the server-side fail with the following errors:

```json
{
  "message": "[email protected] not found",
  "status_code": 500
}
```

```json
{
  "message": "[email protected] not found",
  "status_code": 500
}
```

<img width="797" alt="Screenshot 2022-06-20 at 11 28 18" src="https://user-images.githubusercontent.com/7359339/174571610-4c24e777-c49a-49e0-addf-54c6301cc8ca.png">

This behavior happens in some Cloud environments (like the one in the related ticket). It seems to not happen in Cloud CI environments and locally.

This PR adds error handling for this edge case to the `GET /internal/detection_engine/fleet/integrations/installed?packages=` endpoint.

- It logs fetching errors to the console logs of Kibana.
- It uses a "best-effort" approach for returning data from the endpoint. If we could successfully read existing integration policies, we already have all of the needed data except correct integration titles. So, if after that any request to EPR results in an error, we:
  - Still return 200 with a list of installed integrations
  - Include correct titles for those packages that were successfully fetched
  - Include "best guess" titles for those packages that failed

```
[2022-06-20T12:57:10.270+02:00][ERROR][plugins.securitySolution] Error fetching package info from registry for [email protected]. Boom!
[2022-06-20T12:57:10.270+02:00][ERROR][plugins.securitySolution] Error fetching package info from registry for [email protected]. Boom!
```

<img width="1085" alt="Screenshot 2022-06-20 at 13 05 08" src="https://user-images.githubusercontent.com/7359339/174588468-d28c1383-3a25-4f16-8905-bad3ca73e63e.png">

### Checklist

- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

(cherry picked from commit cdcb272)
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions? Please refer to the Backport tool documentation
…ry for installed integrations (#134732) (#134784)

(cherry picked from commit cdcb272)

Co-authored-by: Georgii Gorbachev <[email protected]>
@spong Yesterday I noticed the same type of errors, but locally. For example, right now I can see this in the logs (my changes are rebased on top of the latest
This is how it looks in the UI. It feels like whether an error will be thrown or not depends on the requested package version. Maybe. It's weird given what the Fleet page itself says. I think there's still some kind of bug here. I will open a separate ticket for it.
Yeah, something interesting here... I was able to repro by installing an older version of a package (…)
**Fixes:** #134639

## Summary

In Cloud, `Elastic APM` and `Fleet Server` integrations are installed by default. However, attempts to fetch their packages from Elastic Package Registry via Fleet services on the server-side fail with `not found` errors from the registry (status code 500). This behavior happens in some Cloud environments (like the one in the related ticket). It seems to not happen in Cloud CI environments and locally.

This PR adds error handling for this edge case to the `GET /internal/detection_engine/fleet/integrations/installed?packages=` endpoint.

### Checklist

- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
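(As a rough illustration of the "best guess" titles behavior described in the PR summary above, not the PR's exact code: packages whose registry lookup failed get a title derived from the package name, while successful lookups keep the registry title.)

```ts
interface InstalledPackage {
  package_name: string;
  package_version: string;
}

interface InstalledIntegration extends InstalledPackage {
  integration_title: string;
}

// Hypothetical "best guess" title: "fleet_server" -> "Fleet Server".
const guessTitle = (packageName: string): string =>
  packageName
    .split(/[_-]/)
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join(' ');

// Merges registry results into the package list; packages whose registry fetch
// failed fall back to a title guessed from the package name.
function buildInstalledIntegrations(
  installed: InstalledPackage[],
  registryTitles: Map<string, string> // keyed by "name@version", successful fetches only
): InstalledIntegration[] {
  return installed.map((pkg) => ({
    ...pkg,
    integration_title:
      registryTitles.get(`${pkg.package_name}@${pkg.package_version}`) ??
      guessTitle(pkg.package_name),
  }));
}
```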