
[Security Solution][Detections Engine] Update search_after to use PIT to work with tie breakers and issues in @timestamp fields and overrides #103944

Closed
FrankHassanabad opened this issue Jun 30, 2021 · 5 comments
Labels: 8.2 candidate (considered, but not committed, for 8.2 release), SecuritySolution:QAAssist (part of QA testing process for release), Team:Detections and Resp (Security Detection Response Team), Team:Security Solution Platform (Security Solution Platform Team), technical debt (improvement of the software architecture and operational architecture)

Comments

FrankHassanabad (Contributor) commented Jun 30, 2021

Agents can produce multiple documents with the same @timestamp, event.ingested, or other sort field, since multiple events can occur within the same second, millisecond, or nanosecond. IoCs and lists have the same potential issue: people can add multiple list items with exactly the same timestamp. Without a tie-breaker, paging over these duplicates with search_after can skip documents and cause missed alerts/signals.

We also have the problem that a list can be mutated while a rule is running, or events can change during the window we are trying to detect over.

Luckily, Elasticsearch has implemented a new feature called Point in Time (PIT) that should solve these use cases nicely:
https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#search-after
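To illustrate why the tie-breaker matters, here is a minimal, self-contained sketch (not detection engine code; the `Doc` shape and both paging helpers are hypothetical) that simulates search_after semantics over documents sharing the same timestamp. Paging on timestamp alone skips the tied documents; adding a secondary sort key (as PIT's implicit `_shard_doc` tie-breaker does) keeps them:

```typescript
interface Doc {
  id: string;
  timestamp: number; // e.g. epoch millis of @timestamp
}

// Multiple events within the same millisecond produce duplicate timestamps.
const docs: Doc[] = [
  { id: 'a', timestamp: 1 },
  { id: 'b', timestamp: 2 },
  { id: 'c', timestamp: 2 },
  { id: 'd', timestamp: 2 },
  { id: 'e', timestamp: 3 },
];

// Page by timestamp only: the strictly-greater search_after cursor skips ties.
function pageByTimestampOnly(source: Doc[], pageSize: number): Doc[] {
  const out: Doc[] = [];
  let after = -Infinity;
  for (;;) {
    const page = source.filter((d) => d.timestamp > after).slice(0, pageSize);
    if (page.length === 0) return out;
    out.push(...page);
    after = page[page.length - 1].timestamp;
  }
}

// Page by (timestamp, id): the id acts as a tie-breaker, so no doc is skipped.
function pageWithTieBreaker(source: Doc[], pageSize: number): Doc[] {
  const out: Doc[] = [];
  let after: [number, string] = [-Infinity, ''];
  for (;;) {
    const page = source
      .filter((d) => d.timestamp > after[0] || (d.timestamp === after[0] && d.id > after[1]))
      .slice(0, pageSize);
    if (page.length === 0) return out;
    out.push(...page);
    const last = page[page.length - 1];
    after = [last.timestamp, last.id];
  }
}
```

With a page size of 2, the timestamp-only version returns `a, b, e` (missing `c` and `d`), while the tie-breaker version returns all five documents.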

We just have to implement and use this within the detection engine and anywhere else we use search_after, such as IoC lists.

The unknowns: if we have large volumes of data, long look-back times, or rules that run a very long time and hold a PIT open with a very large keep-alive, will that block ingest or slow things down, and would we have to "bend" this part a bit? We should attempt to implement and test this, and use it everywhere for users.

FrankHassanabad added the bug and Team:Detections and Resp labels on Jun 30, 2021
elasticmachine (Contributor):

Pinging @elastic/security-detections-response (Team:Detections and Resp)

andrewkroh added a commit to andrewkroh/kibana that referenced this issue Jul 1, 2021
The `event.ingested` field is added to all documents ingested via Fleet plus Agent. By removing the subseconds we get better compression of the values in Elasticsearch.

The primary user of `event.ingested` today is the Security Detection Engine, as a tie-breaker in search_after, but once it moves to using the point-in-time API the need for precision will be lessened, because PIT has an implicit tie-breaker.

Relates elastic#103944
Relates elastic/beats#22388
andrewkroh (Member):

On a related note, in #104044 I'm removing the sub-seconds from event.ingested for Fleet integrations.

andrewkroh added a commit that referenced this issue Aug 3, 2021
kibanamachine added a commit to kibanamachine/kibana that referenced this issue Aug 3, 2021
kibanamachine added a commit that referenced this issue Aug 3, 2021
streamich pushed a commit to vadimkibana/kibana that referenced this issue Aug 8, 2021
rylnd (Contributor) commented Aug 31, 2021

@MikePaquette RE your performance concerns about PIT (Keeping point in time alive):

Normally, the background merge process optimizes the index by merging together smaller segments to create new, bigger segments. Once the smaller segments are no longer needed they are deleted. However, open point-in-times prevent the old segments from being deleted since they are still in use.

So it sounds like the impact here would be on storage and open handles. Reasonably small values (1m in general, rule.interval for rule executions) should minimize the impact here.

peluja1012 added the Team:Security Solution Platform and technical debt labels on Sep 15, 2021
spong removed their assignment on Nov 10, 2021
peluja1012 added the 8.2 candidate label on Jan 19, 2022
yctercero (Contributor):

Ticket by Core tracking these changes: #93770

FrankHassanabad added a commit that referenced this issue Feb 15, 2022
… from saved objects to exception lists (#125182)

## Summary

Exposes the functionality of
* search_after
* point in time (pit)

From saved objects to the exception lists. This _DOES NOT_ expose these to the REST API just yet. Rather, it exposes them at the API level to start with, and changes code that had hard limits of 10k and other bounded loops. I batch 1k items at a time, which seemed like a decent guess, and I see other parts of the code changed to the same value. It's easy to change the 1k if we find we need to throttle back more as we get feedback from others.

See this PR where `PIT` and `search_after` were first introduced: #89915
See these 2 issues where we should be using more paging and PIT (Point in Time) with search_after: #93770 #103944

The new methods added to the `exception_list_client.ts` client class are:
* openPointInTime
* closePointInTime
* findExceptionListItemPointInTimeFinder
* findExceptionListPointInTimeFinder
* findExceptionListsItemPointInTimeFinder
* findValueListExceptionListItemsPointInTimeFinder

The areas of functionality that have been changed:
* Exception list exports
* Deletion of lists
* Getting exception list items when generating signals

Note that we currently loop over saved objects in our own ways throughout the codebase, such as the older way below, which works but was limited to 10k saved objects and did not use point in time (PIT).

Older way example (deprecated):
```ts
  let page = 1;
  let ids: string[] = [];
  let foundExceptionListItems = await findExceptionListItem({
    filter: undefined,
    listId,
    namespaceType,
    page,
    perPage: PER_PAGE,
    pit: undefined,
    savedObjectsClient,
    searchAfter: undefined,
    sortField: 'tie_breaker_id',
    sortOrder: 'desc',
  });
  while (foundExceptionListItems != null && foundExceptionListItems.data.length > 0) {
    ids = [
      ...ids,
      ...foundExceptionListItems.data.map((exceptionListItem) => exceptionListItem.id),
    ];
    page += 1;
    foundExceptionListItems = await findExceptionListItem({
      filter: undefined,
      listId,
      namespaceType,
      page,
      perPage: PER_PAGE,
      pit: undefined,
      savedObjectsClient,
      searchAfter: undefined,
      sortField: 'tie_breaker_id',
      sortOrder: 'desc',
    });
  }
  return ids;
```

But now that is replaced with this newer way using PIT:
```ts
  // Stream the results from the Point In Time (PIT) finder into this array
  let ids: string[] = [];
  const executeFunctionOnStream = (response: FoundExceptionListItemSchema): void => {
    const responseIds = response.data.map((exceptionListItem) => exceptionListItem.id);
    ids = [...ids, ...responseIds];
  };

  await findExceptionListItemPointInTimeFinder({
    executeFunctionOnStream,
    filter: undefined,
    listId,
    maxSize: undefined, // NOTE: This is unbounded when it is "undefined"
    namespaceType,
    perPage: 1_000,
    savedObjectsClient,
    sortField: 'tie_breaker_id',
    sortOrder: 'desc',
  });
  return ids;
```
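To make the streaming pattern above concrete, here is a minimal, self-contained sketch of how a point-in-time finder of this shape could work internally. This is not Kibana's implementation: `pointInTimeFinder`, `FetchPage`, and `Page` are hypothetical names, and `fetchPage` stands in for a saved-objects find call that carries the `pit` id and `searchAfter` cursor between pages:

```typescript
interface Page<T> {
  data: T[];
  // Sort values of the last hit, to be passed as search_after for the next page.
  searchAfter: unknown[] | undefined;
}

type FetchPage<T> = (searchAfter: unknown[] | undefined, perPage: number) => Promise<Page<T>>;

async function pointInTimeFinder<T>({
  fetchPage,
  perPage,
  maxSize,
  executeFunctionOnStream,
}: {
  fetchPage: FetchPage<T>;
  perPage: number;
  maxSize: number | undefined; // undefined => unbounded, mirroring the API above
  executeFunctionOnStream: (data: T[]) => void;
}): Promise<void> {
  let seen = 0;
  let searchAfter: unknown[] | undefined;
  for (;;) {
    const page = await fetchPage(searchAfter, perPage);
    if (page.data.length === 0) return;
    // Trim the final page if a maxSize bound was given.
    const remaining = maxSize === undefined ? page.data.length : maxSize - seen;
    const data = page.data.slice(0, remaining);
    executeFunctionOnStream(data);
    seen += data.length;
    if ((maxSize !== undefined && seen >= maxSize) || page.searchAfter === undefined) return;
    searchAfter = page.searchAfter;
  }
}
```

The design keeps memory bounded to one page at a time: the caller accumulates only what it needs in the stream callback, rather than asking for 10k results in a single request.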

We also have areas of code with perPage set to 10k, or to a constant representing 10k; this PR removes that in most (but not all) areas:
```ts
      const items = await client.findExceptionListsItem({
        listId: listIds,
        namespaceType: namespaceTypes,
        page: 1,
        pit: undefined,
        perPage: MAX_EXCEPTION_LIST_SIZE, // <--- Really bad to send in 10k per page at a time
        searchAfter: undefined,
        filter: [],
        sortOrder: undefined,
        sortField: undefined,
      });
```

That is now:
```ts
      // Stream the results from the Point In Time (PIT) finder into this array
      let items: ExceptionListItemSchema[] = [];
      const executeFunctionOnStream = (response: FoundExceptionListItemSchema): void => {
        items = [...items, ...response.data];
      };

      await client.findExceptionListsItemPointInTimeFinder({
        executeFunctionOnStream,
        listId: listIds,
        namespaceType: namespaceTypes,
        perPage: 1_000,
        filter: [],
        maxSize: undefined, // NOTE: This is unbounded when it is "undefined"
        sortOrder: undefined,
        sortField: undefined,
      });
```

Leftover areas will be handled in separate PRs because they fall under other teams' code ownership.

### Checklist
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
yctercero added the SecuritySolution:QAAssist label on Feb 28, 2022
yctercero (Contributor):

@MadameSheema this one will need a QA check, but I'm working on breaking it out into individual tickets for each section that needs updating, so it'll be clearer exactly what to test. Adding the SecuritySolutionPlatform:QAAssist tag here so that I don't forget.

peluja1012 removed the bug label on Apr 4, 2022
FrankHassanabad closed this as not planned on Feb 8, 2023