Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Security Solution][Endpoint] Potential improvements in endpoint user artifact packager #160387

Conversation

dasansol92
Copy link
Contributor

@dasansol92 dasansol92 commented Jun 23, 2023

Summary

This pr includes different changes to:

  • Do less requests to .kibana and .fleet-artifacts indices
  • Speed up the endpoint:user-artifact-packager task.

Actions taken:

  • Do package policy updates in parallel using pMap. This is the slowest phase of the task as a package policy update takes a few seconds to be done. Doing it in parallel (with concurrency = 10) speeds up this phase, specially when there are a lot of policies. The concurrency value could be replaced by a config parameter as a safe option to decrease or increase the concurrency.

  • Introduces new fleet artifacts client functionality for bulk delete. It is used in the deleteArtifacts function in manifest manager but also in the cleanup function. Artifacts deletion are done always there is an artifact update as artifacts are not updated, instead, it does create + delete. The single delete artifact does a previous artifact verification by getting it by id and ensuring it is from the current package. This is not done now in the new bulk operation, but let me know if this is strongly needed. This was also increasing the number of calls we were doing into .fleet-artifacts index as an update/create/delete of a global exception might impact all the artifacts, so all artifacts being created and deleted.

  • Package policy ids where loaded (using a pagesize of 20) on each artifact generation (so for each policy and OS). This was causing a lot of .kibana index requests, specially for the cases with a lot of policies. Now all the policy ids are loaded at the beginning (with a 100 page size) and used, so a few .kibana index queries done for all the artifacts, policyIds an OS's.

  • Exceptions (Event Filters, Trusted Apps, endpoint exceptions, etc) were queried for each OS and policyId, this also ended up doing a lot of requests in .kibana index. Now it does a query by ListId + OS one time and stores the result in a cache which is reused for each policy. Now we do less queries and the performance has improved.

  • All unit and integration tests have been updated according to the changes.

The behaviour has changed with all these changes, but the end result should be the same.

TBD:

  • Adds new config param for package policy update concurrency. ✅
  • e2e testing comparing with the previous version of the task.
  • [...]

@dasansol92 dasansol92 added release_note:skip Skip the PR/issue when compiling release notes Team:Defend Workflows “EDR Workflows” sub-team of Security Solution v8.1.0 labels Jun 23, 2023
// Expect 2 exceptions, the first two calls returned the same exception list items
expect(resp.entries.length).toEqual(2);
// Expect 1 exceptions, the first two calls returned the same exception list items
expect(translated.entries.length).toEqual(1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is due to a bug fix (I hope) from the old version.
Before changes, it was doing deduplication on each exceptions request call, now it is done after all exceptions have been requested, so the deduplication is over all the exceptions, not on each pagination. Let me know if this was done on this way before for any specific reason I missed.

osOptions: BuildArtifactsForOsOptions
) {
const policySpecificArtifacts: Record<string, InternalArtifactCompleteSchema[]> = {};
await pMap(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this could be done in a single for loop. My concern here is that 2 concurrent buildArtifactsForOs writes to the cache at the same time. It shouldn't happen since the cache is filled before when doing the global artifacts. But let me know what you think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow, but is this still a concern?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it shouldn't be as the first one filling this cache will be exceptions for global artifacts, and those runs sequentially. So when this (by policy) is running, the cache is already populated.
See the globals in lines: 262, 284, 312, 338 and 367.

Let me know if I'm following this correctly

@dasansol92 dasansol92 marked this pull request as ready for review June 26, 2023 11:33
@dasansol92 dasansol92 requested review from a team as code owners June 26, 2023 11:33
@dasansol92 dasansol92 requested a review from gergoabraham June 26, 2023 11:33
@dasansol92 dasansol92 requested a review from paul-tavares June 27, 2023 11:07
Copy link
Contributor

@gergoabraham gergoabraham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quite comprehensive refactor, this is looking awesome! I left some suggestions

also tested the caching, with your modification there is only 15 fetches instead of 27 - I don't know why 27, but that's the measurement 🤷 💯

@dasansol92 dasansol92 requested a review from gergoabraham June 27, 2023 12:30
Copy link
Contributor

@gergoabraham gergoabraham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the modifications! 🚀

Copy link
Contributor

@paul-tavares paul-tavares left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my prior set of comments.
I reveiwed this PR again and left some more comments - only one is of real concern to me (call to .bulkUpdate() does not pass esClient throught). Let me know if you have any questions

const res = await withPackageSpan(`Bulk delete fleet artifacts`, () =>
esClient.bulk({
body,
refresh: 'wait_for',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks.


paging = (page - 1) * 20 + items.length < total;
paging = (page - 1) * 1000 + items.length < total;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes me confusion (it may just be me ). You are assuming that the page size that pageSupplier() will use is 1000, but really, you can't be sure of that because one could alter the callbasck given to this function. That makes this a bit fragile because someone could (unknownly) change how pageSupplier() works and not realize that this is assuming a page size of 1000 here - thus possibly causing items to be dropped or duplicated.

Suggestion: change pageSupplier() to take in as arguments the page and the perPage arguments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point! Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for (const currentBatch of updateBatches) {
const response = await this.packagePolicyService.bulkUpdate(
this.savedObjectsClient,
// @ts-expect-error TS2345
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any way we can get around this? Do we have an (internal) ESclient we can provide to the service method?

not doing so, when it is a required dependency of this service method, will likely leave us exposed to bugs in the future and possibly make fleet unstable. I checked the code in fleet for this method and it passes the esClient on to other services, the most important one (i think) is the one that bumps the revision number on the Agent policy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will revisit this and see if we can use any esClient from our service here. Note that this was done in this way before with the .update, but good point on ensuring this works to avoid future bugs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dasansol92 dasansol92 requested a review from paul-tavares June 27, 2023 15:04
Copy link
Contributor

@paul-tavares paul-tavares left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the changes.

🚢 it

@@ -82,3 +82,7 @@ export const uniqueIdFromArtifact = <
}: T): string => {
return `${packageName}:${identifier}-${decodedSha256}`;
};

export const uniqueIdFromId = (id: string, packageName: string): string => {
Copy link
Contributor

@juliaElastic juliaElastic Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: uniqueIdFromId is a bit confusing, is id the artifact id? We could rename the function to uniqueIdFromArtifactId

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The id of the artifact without the packageName, so it adds the packageName to the current id as an ES id: ${packageName}:${artifactId}.
Is the same thing uniqueIdFromArtifact does, but as we only have the id here I thought would be better to just create this new method instead of load the artifact to generate the unique Id with the existing method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also rename this to packagePrefixedId. That might be less confusing to read. 😅

Copy link
Contributor

@juliaElastic juliaElastic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fleet changes LGTM

@dasansol92
Copy link
Contributor Author

@juliaElastic what do you think about this comment in the description related to the new bulkDelete:

Introduces new fleet artifacts client functionality for bulk delete. It is used in the deleteArtifacts function in manifest manager but also in the cleanup function. Artifacts deletion are done always there is an artifact update as artifacts are not updated, instead, it does create + delete. The single delete artifact does a previous artifact verification by getting it by id and ensuring it is from the current package. This is not done now in the new bulk operation, but let me know if this is strongly needed. This was also increasing the number of calls we were doing into .fleet-artifacts index as an update/create/delete of a global exception might impact all the artifacts, so all artifacts being created and deleted.

Is it ok not loading the artifact first and checking the package name as it is done with the single delete artifact method?

@juliaElastic
Copy link
Contributor

I'm not that familiar with the artifacts API to answer this. It sounds as it should be fine to delete without loading. What happens if the artifact doesn't exist, do we catch the delete failures?

@dasansol92
Copy link
Contributor Author

I'm not that familiar with the artifacts API to answer this. It sounds as it should be fine to delete without loading. What happens if the artifact doesn't exist, do we catch the delete failures?

@juliaElastic Yes, we capture all errors here: https://github.com/elastic/kibana/pull/160387/files#diff-231337d0ab0963e2654e75eecdbdabf8a771c3e3be716a5a93333bf972a4f272R222

@dasansol92
Copy link
Contributor Author

@elasticmachine merge upstream

@kibana-ci
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Investigations - Security Solution Tests #2 / Detections : Page Filters Alert list is updated when the alerts are updated
  • [job] [logs] Investigations - Security Solution Tests #2 / Detections : Page Filters Impact of inputs should recover from invalide kql Query result
  • [job] [logs] Security Solution Tests #1 / Prebuilt rules Alerts rules, prebuilt rules "before each" hook for "Loads prebuilt rules"

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
fleet 1073 1075 +2
Unknown metric groups

API count

id before after diff
fleet 1189 1191 +2

ESLint disabled line counts

id before after diff
enterpriseSearch 14 16 +2
securitySolution 413 416 +3
total +5

References to deprecated APIs

id before after diff
securitySolution 614 606 -8

Total ESLint disabled count

id before after diff
enterpriseSearch 15 17 +2
securitySolution 492 495 +3
total +5

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Copy link
Member

@ashokaditya ashokaditya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! 🎉 I've a few nits and questions but nothing that's should block to 🚢


const res = await withPackageSpan(`Bulk delete fleet artifacts`, () =>
esClient.bulk({
body,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: body is deprecated over operations for bulk. You can rename body to operations. You can change the other bulk calls in the file as well.

Similarly, the create method (used in createArtifact) prefers document instead of a body in the request, and the search (used in listArtifacts) method accepts sort directly in the request instead of being nested inside a body.

@@ -82,3 +82,7 @@ export const uniqueIdFromArtifact = <
}: T): string => {
return `${packageName}:${identifier}-${decodedSha256}`;
};

export const uniqueIdFromId = (id: string, packageName: string): string => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also rename this to packagePrefixedId. That might be less confusing to read. 😅


paging = (page - 1) * 100 + response.data.length < response.total;
paging = (page - 1) * 1000 + response.data.length < response.total;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you could declare this size as a const as it is the same as perPage value.

perPage: 20,
kuery: 'ingest-package-policies.package.name:endpoint',
});
private async listEndpointPolicyIds() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Consider adding a return type to the function.

@@ -583,20 +664,29 @@ export class ManifestManager {
this.logger.info(`Committed manifest ${manifest.getSemanticVersion()}`);
}

private async listEndpointPolicies(page: number) {
private async listEndpointPolicies(page: number, perPage: number) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Consider adding a return type here.

os: string;
policyId?: string;
schemaVersion: string;
}) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a return type

currentBatch
);

// Parse errors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: // Update errors

allPolicyIds: string[],
supportedOSs: string[],
osOptions: BuildArtifactsForOsOptions
) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a return type

osOptions: BuildArtifactsForOsOptions
) {
const policySpecificArtifacts: Record<string, InternalArtifactCompleteSchema[]> = {};
await pMap(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow, but is this still a concern?

@dasansol92
Copy link
Contributor Author

@ashokaditya Thanks for the review and your suggestions, if it's ok to you, I will add all nits in a follow up pr I'm working on.

@dasansol92 dasansol92 merged commit 1a81529 into elastic:main Jun 29, 2023
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Jun 29, 2023
dasansol92 added a commit that referenced this pull request Jun 30, 2023
## Summary

- Reduces fleet artifact get calls by getting them all at one time
instead of requesting it one by one using id's
- Updates unit test.
- Adds duration time log to endpoint packager task.
- Adds missing return types to functions.

Follow up of: #160387
dasansol92 added a commit that referenced this pull request Aug 11, 2023
#163509)

## Summary

Removes pMap and uses a for loop instead when fetching exceptions (as it
was done before) since exceptions are cached and there is no need for
parallel code here.

Original PR with previous changes:
#160387

Co-authored-by: Kibana Machine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting release_note:skip Skip the PR/issue when compiling release notes Team:Defend Workflows “EDR Workflows” sub-team of Security Solution Team:Fleet Team label for Observability Data Collection Fleet team v8.10.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants