Memory efficient source filtering #25168

Closed
amir20001 opened this issue Jun 9, 2017 · 15 comments
Assignees
Labels
:Core/Infra/Scripting Scripting abstractions, Painless, and Mustache >enhancement :Search/Search Search-related issues that do not fall into other categories Team:Core/Infra Meta label for core/infra team Team:Search Meta label for search team

Comments

@amir20001

Example:

Using Twitter as an example, each user is a document, and each tweet is a document nested under the user. For active users, each document can end up with thousands of tweets and thus a single document can be a few megabytes in size.

{
  "userId": "1",
  "tweets": [
    {
      "id": 1,
      "message": "tweet 1"
    },
    {
      "id": 2,
      "message": "tweet 2"
    },
   ...
  ]
}

Use Case:
We want to find users that have used a specific hashtag in their tweets and view only those tweets. We use source filtering and nested inner hit queries to get back just the users and matching tweets.

Problem:
Even though we are using source filtering, Elasticsearch loads the entire document into memory before filtering the source. Because each document is so large, any real throughput causes constant garbage collection, which we see in the logs.

Feature Request:
Could the filtered source be loaded in a more memory-efficient manner, without first loading the entire source into memory?

@jpountz
Contributor

jpountz commented Jun 12, 2017

If the issue is about not loading the _source in memory, then this is a high hanging fruit. However, you mentioned that the issue is mostly about garbage collection in your case, which I think we could improve by avoiding going through a map of maps intermediate representation, which I suspect is the source of all that garbage.
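To make the "map of maps" concern concrete, here is a minimal sketch of map-based filtering using only `java.util` collections. The class `MapBasedFilterSketch` and its methods are hypothetical and are not Elasticsearch code; the sketch only assumes that the source is fully parsed into nested maps before any field is dropped, as described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not Elasticsearch code.
final class MapBasedFilterSketch {

    // Builds the fully parsed form of the example user document:
    // one map for the user plus one map per nested tweet.
    static Map<String, Object> parsedSource(int tweetCount) {
        List<Object> tweets = new ArrayList<>();
        for (int i = 1; i <= tweetCount; i++) {
            Map<String, Object> tweet = new LinkedHashMap<>();
            tweet.put("id", i);
            tweet.put("message", "tweet " + i);
            tweets.add(tweet);
        }
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("userId", "1");
        user.put("tweets", tweets);
        return user;
    }

    // Copies only the requested top-level fields into a new map.
    // The input map, including every tweet, already exists by the time this runs.
    static Map<String, Object> filter(Map<String, Object> source, Set<String> includes) {
        Map<String, Object> filtered = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            if (includes.contains(entry.getKey())) {
                filtered.put(entry.getKey(), entry.getValue());
            }
        }
        return filtered;
    }

    public static void main(String[] args) {
        Map<String, Object> full = parsedSource(1000); // allocates 1000 tweet maps
        System.out.println(filter(full, Set.of("userId")));
    }
}
```

Even though only `userId` survives the filter, the thousand tweet maps were allocated first and become garbage immediately afterwards.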

@jpountz
Contributor

jpountz commented Jun 12, 2017

Also relates to #9034.

@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories discuss >enhancement labels Jun 12, 2017
@osman

osman commented Jun 13, 2017

👍

@jaemoore

👍

@amir20001
Author

Yeah, it seems related to #9034. In this case, since the large items are all under a single nested field, it would also require each nested item to be stored separately.

@tlrx
Member

tlrx commented Jun 16, 2017

We discussed this offline in our "Fix-it Friday" meeting and agreed that we could still reduce the garbage collection issue by filtering the source of documents in a streaming fashion instead of using the current in-memory map implementation. We could reuse the same feature that response filtering uses for filter_path.

I'll give it a try in the next few weeks and update this issue.

@tlrx tlrx removed the discuss label Jun 16, 2017
@tlrx tlrx self-assigned this Jun 16, 2017
@tlrx
Member

tlrx commented Jul 4, 2017

So I finally looked at this. I created tlrx@3362a50, which uses a stream-based implementation of source filtering to replace the current in-memory map implementation. The results are similar to what I saw more than a year ago when I looked at this optimization, but at that time we didn't have Rally, so tests were hard to reproduce.

tl;dr
Benchmarks show that both implementations perform almost the same, because most of the time is spent loading and parsing the source, and those steps run regardless of how the filtering is done. Differences appear only in edge cases like the one described in this issue (i.e., a document with many fields where most of them are filtered out). The stream-based implementation puts less pressure on memory since it creates far fewer objects, so I think it is a good long-term solution. Sadly, the two filtering methods do not behave exactly the same, which makes the change non-trivial, and other features like inner hits, highlighting, or scripted fields require the source to be parsed as a map anyway. So I think we should investigate #9034 instead of optimizing for edge cases like this issue.

A new XContentHelper.filter(BytesReference, XContentType, String[], String[]) method has been added in https://github.com/tlrx/elasticsearch/tree/use-streamed-based-source-filtering. It uses Jackson's streaming filtering under the hood, and the implementation is quite straightforward. This method is used in the FetchSourceSubPhase to filter the source.
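As a rough illustration of the streaming idea (not the Jackson API and not the code in the branch above), the sketch below filters a pre-tokenized document one token at a time. The `Token` and `Kind` types, the `filter` method, and the restriction to top-level include names are all assumptions made for the example; a real implementation would pull tokens from a JSON parser and support path patterns.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the Jackson or Elasticsearch API.
final class StreamingFilterSketch {

    enum Kind { START_OBJECT, END_OBJECT, START_ARRAY, END_ARRAY, FIELD_NAME, VALUE }

    static final class Token {
        final Kind kind;
        final String text; // field name or scalar value; null for structural tokens

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // Copies tokens to the output, skipping every top-level field that is not included.
    // Skipped values are consumed token by token and never materialized as maps or lists.
    static List<Token> filter(Iterator<Token> in, Set<String> includes) {
        List<Token> out = new ArrayList<>();
        int depth = 0;
        while (in.hasNext()) {
            Token token = in.next();
            boolean topLevelField = depth == 1 && token.kind == Kind.FIELD_NAME;
            if (topLevelField && !includes.contains(token.text)) {
                skipValue(in);
                continue;
            }
            depth += depthChange(token);
            out.add(token);
        }
        return out;
    }

    // Consumes exactly one value: a scalar, or a whole object/array with its children.
    private static void skipValue(Iterator<Token> in) {
        int depth = 0;
        do {
            depth += depthChange(in.next());
        } while (depth > 0);
    }

    private static int depthChange(Token token) {
        switch (token.kind) {
            case START_OBJECT:
            case START_ARRAY:
                return 1;
            case END_OBJECT:
            case END_ARRAY:
                return -1;
            default:
                return 0;
        }
    }

    // Token form of {"userId":"1","tweets":[{"id":1,"message":"tweet 1"}]}.
    static List<Token> exampleTokens() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token(Kind.START_OBJECT, null));
        tokens.add(new Token(Kind.FIELD_NAME, "userId"));
        tokens.add(new Token(Kind.VALUE, "1"));
        tokens.add(new Token(Kind.FIELD_NAME, "tweets"));
        tokens.add(new Token(Kind.START_ARRAY, null));
        tokens.add(new Token(Kind.START_OBJECT, null));
        tokens.add(new Token(Kind.FIELD_NAME, "id"));
        tokens.add(new Token(Kind.VALUE, "1"));
        tokens.add(new Token(Kind.FIELD_NAME, "message"));
        tokens.add(new Token(Kind.VALUE, "tweet 1"));
        tokens.add(new Token(Kind.END_OBJECT, null));
        tokens.add(new Token(Kind.END_ARRAY, null));
        tokens.add(new Token(Kind.END_OBJECT, null));
        return tokens;
    }
}
```

Calling `StreamingFilterSketch.filter(StreamingFilterSketch.exampleTokens().iterator(), Set.of("userId"))` keeps only the enclosing object and the `userId` field; the `tweets` array is read past without building any intermediate collection for it.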

The implementation was tested with Rally, with JFR telemetry and memory profiling enabled, on our default benchmarks. Note that the JFR options have been changed to use a custom profile, and I created a new challenge containing only searches with source filtering.

It has been tested with multiple benchmarks, but pmc gave the clearest results because it contains a large body field that can be filtered out:

Rally results for the map-based filtering show a median throughput of 187.097 ops/s and a 99th percentile latency of 470.179 ms, compared to 195.384 ops/s and 196.684 ms for the streaming-based filtering.

Memory overview

Looking at the JFR records is interesting; they show less memory usage with streaming-based filtering:

Map based filtering

[image: master_overview]

Streaming based filtering

[image: test_overview]

Garbage collections

There are also fewer GCs with streaming-based filtering, which is expected.

Map based filtering

[image: gc_master]

Streaming based filtering

[image: gc_test]

Allocations

The allocation statistics also show far fewer allocations for streaming-based filtering (4949 allocations for 3.5 GB in TLAB, 7567 for 501 MB outside TLAB) compared to map-based filtering (9676 allocations for 6.5 GB in TLAB, 32428 for 2.96 GB outside TLAB).

Map based filtering

[image: master_allocs]

Streaming based filtering

[image: test_allocsq]

JFR records:

Other considerations

While investigating the change I noticed that our filtering methods do not behave the same, so I created #25491 so that all methods share the same set of tests. But there are still some differences: map-based filtering prints out empty objects (#4715) while the streaming-based implementation excludes them. Also, map-based filtering handles dots in field names as sub-objects (#20736); the streaming-based implementation does not work exactly like this and would require some non-trivial changes.

Also, some features (like highlighting or scripted fields) require the source to be parsed as a map in order to work. If they are combined with source filtering, we don't want to parse the source once as raw bytes for source filtering and a second time as a map for highlighting. Changing the way this works is not easy, and I think we should investigate other solutions like #9034 rather than optimize source filtering for edge cases like the one described in this issue.

I'd be happy to hear any thoughts or comments on this! I might have missed something...

@jpountz
Contributor

jpountz commented Jul 4, 2017

It is a pity that we managed to come up with different semantics for filtering values in a document. I'd be keen on switching to stream-based filtering even if that implies minor bw breaks.

@s1monw
Contributor

s1monw commented Jul 4, 2017

I'd be keen on switching to stream-based filtering even if that implies minor bw breaks.

++ Is there a chance we can stick with object-based parsing, based on the index created version or some setting, and remove it in 7.0?

@osman

osman commented Mar 1, 2018

Any updates on whether this may be included in 7.0?

@talevy talevy added :Search/Search Search-related issues that do not fall into other categories and removed :Search/Search Search-related issues that do not fall into other categories labels Mar 19, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@jpountz
Contributor

jpountz commented Mar 20, 2018

@osman No updates for now.

@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
@stu-elastic stu-elastic added the :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache label Jul 23, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Scripting)

@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Jul 23, 2020
@jtibshirani
Contributor

We use source filtering and nested inner hit queries to get back just the users and matching tweets.

I noticed that when using source filtering in inner_hits, we were reloading and reparsing the _source for each nested document. So we recently merged #60494 to only load and parse the _source once per root document. This doesn't address the memory consumption of source filtering itself, but could help here (if I'm understanding the use case right).
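The "load and parse once per root document" idea can be sketched as a small memoizing holder. This is a hypothetical illustration and not the code merged in #60494; the class name `RootSourceCacheSketch`, the `loadCount` counter, and the `loadsFor` demo method are assumptions made for the example.

```java
import java.util.function.Supplier;

// Hypothetical illustration only; not the code merged in #60494.
final class RootSourceCacheSketch<T> {
    private final Supplier<T> loader; // stands in for "load and parse the _source"
    private T parsed;
    private boolean loaded;
    int loadCount;                    // exposed only so the demo can count loads

    RootSourceCacheSketch(Supplier<T> loader) {
        this.loader = loader;
    }

    // Returns the parsed source, running the loader at most once.
    T get() {
        if (!loaded) {
            parsed = loader.get();
            loaded = true;
            loadCount++;
        }
        return parsed;
    }

    // Simulates fetching the given number of nested inner hits of one root document
    // and reports how many times the source was actually loaded.
    static int loadsFor(int nestedHits) {
        RootSourceCacheSketch<String> cache =
            new RootSourceCacheSketch<>(() -> "{\"userId\":\"1\"}");
        for (int i = 0; i < nestedHits; i++) {
            cache.get(); // each nested hit reuses the same parsed source
        }
        return cache.loadCount;
    }
}
```

With one holder per root document, a thousand nested hits trigger a single load rather than a thousand.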

@rjernst rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020
@stu-elastic stu-elastic removed the needs:triage Requires assignment of a team area label label Dec 9, 2020
nik9000 added a commit to nik9000/elasticsearch that referenced this issue Sep 1, 2021
I found myself needing support for something like `filter_path` on
`XContentParser`. It was simple enough to plug it in so I did. Then I
realized that it might offer more memory efficient source filtering
(elastic#25168) so I put together a quick benchmark comparing the source
filtering that we do in `_search`.

Filtering using the parser is about 33% faster than how we filter now
when you select a single field from a 300 byte document:
```
Benchmark                                          (excludes)  (includes)  (source)  Mode  Cnt     Score    Error  Units
FetchSourcePhaseBenchmark.filterObjects                           message     short  avgt    5  2360.342 ±  4.715  ns/op
FetchSourcePhaseBenchmark.filterXContentOnBuilder                 message     short  avgt    5  2010.278 ± 15.042  ns/op
FetchSourcePhaseBenchmark.filterXContentOnParser                  message     short  avgt    5  1588.446 ± 18.593  ns/op
```

The top line is the way we filter now. The middle line is adding a
filter to `XContentBuilder` - something we can do right now without any
of my plumbing work. The bottom line is filtering on the parser,
requiring all the new plumbing.

This isn't particularly impressive. 33% *sounds* great! But 700
nanoseconds per document isn't going to cut into anyone's search times.
If you fetch a thousand documents that's .7 milliseconds of savings.

But we mostly advise folks to use source filtering on fetch when the
source is large and you only want a small part of it. So I tried when
the source is about 4.3kb and you want a single field:
```
Benchmark                                          (excludes)  (includes)      (source)  Mode  Cnt     Score     Error  Units
FetchSourcePhaseBenchmark.filterObjects                           message  one_4k_field  avgt    5  5957.128 ± 117.402  ns/op
FetchSourcePhaseBenchmark.filterXContentOnBuilder                 message  one_4k_field  avgt    5  4999.073 ±  96.003  ns/op
FetchSourcePhaseBenchmark.filterXContentonParser                  message  one_4k_field  avgt    5  3261.478 ±  48.879  ns/op
```

That's 45% faster. Put another way, 2.7 microseconds a document. Not
bad!

But have a look at how things come out when you want a single field from
a 4 *megabyte* document:
```
Benchmark                                          (excludes)  (includes)      (source)  Mode  Cnt        Score        Error  Units
FetchSourcePhaseBenchmark.filterObjects                           message  one_4m_field  avgt    5  8266343.036 ± 176197.077  ns/op
FetchSourcePhaseBenchmark.filterXContentOnBuilder                 message  one_4m_field  avgt    5  6227560.013 ±  68306.318  ns/op
FetchSourcePhaseBenchmark.filterXContentonParser                  message  one_4m_field  avgt    5  1617153.472 ±  80164.547  ns/op
```

These documents are very large. I've encountered documents like them in
real life, but they've always been the outlier for me. But a 6.5
millisecond per document savings ain't anything to sneeze at.

Take a look at what you get when I turn on gc metrics:
```
FetchSourcePhaseBenchmark.filterObjects                          message  one_4m_field  avgt    5   7036097.561 ±  84721.312   ns/op
FetchSourcePhaseBenchmark.filterObjects:·gc.alloc.rate           message  one_4m_field  avgt    5      2166.613 ±     25.975  MB/sec
FetchSourcePhaseBenchmark.filterXContentOnBuilder                message  one_4m_field  avgt    5   6104595.992 ±  55445.508   ns/op
FetchSourcePhaseBenchmark.filterXContentOnBuilder:·gc.alloc.rate message  one_4m_field  avgt    5      2496.978 ±     22.650  MB/sec
FetchSourcePhaseBenchmark.filterXContentonParser                 message  one_4m_field  avgt    5   1614980.846 ±  31716.956   ns/op
FetchSourcePhaseBenchmark.filterXContentonParser:·gc.alloc.rate  message  one_4m_field  avgt    5         1.755 ±      0.035  MB/sec
```
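The percentages quoted in this commit message can be re-derived from the scores, and the `gc.alloc.rate` rows can be turned into an approximate allocation per operation. The helper below is a hypothetical sketch; it assumes a single benchmark thread, so that allocation rate multiplied by time per operation approximates the amount allocated per operation.

```java
// Hypothetical helper for re-deriving numbers from the JMH output above.
final class BenchmarkArithmetic {

    // Percentage by which time per operation drops from the baseline to the candidate.
    static double reductionPercent(double baselineNs, double candidateNs) {
        return (baselineNs - candidateNs) / baselineNs * 100.0;
    }

    // Approximate MB allocated per operation: allocation rate (MB/sec) times seconds per operation.
    static double allocatedMbPerOp(double allocRateMbPerSec, double nsPerOp) {
        return allocRateMbPerSec * nsPerOp / 1_000_000_000.0;
    }

    public static void main(String[] args) {
        // filterObjects versus parser-based filtering, scores copied from the tables above.
        System.out.printf("short:        %.1f%% less time%n", reductionPercent(2360.342, 1588.446));
        System.out.printf("one_4k_field: %.1f%% less time%n", reductionPercent(5957.128, 3261.478));
        System.out.printf("one_4m_field: %.1f%% less time%n", reductionPercent(8266343.036, 1617153.472));

        // Allocation per operation on the 4 MB document.
        System.out.printf("filterObjects: %.2f MB/op%n", allocatedMbPerOp(2166.613, 7036097.561));
        System.out.printf("on builder:    %.2f MB/op%n", allocatedMbPerOp(2496.978, 6104595.992));
        System.out.printf("on parser:     %.4f MB/op%n", allocatedMbPerOp(1.755, 1614980.846));
    }
}
```

Under that assumption, both existing approaches allocate roughly 15 MB per operation on the 4 MB document, while parser-based filtering allocates about 0.003 MB, which is where the memory saving relevant to this issue shows up.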
nik9000 added a commit that referenced this issue Sep 13, 2021
nik9000 added a commit to nik9000/elasticsearch that referenced this issue Sep 13, 2021
elasticsearchmachine pushed a commit that referenced this issue Sep 13, 2021
* Memory efficient xcontent filtering (backport of #77154)

* Fixup benchmark for 7.x
@romseygeek
Contributor

This has been implemented in #77154
