Use mmapfs as default store type #38157

Conversation

danielmitterdorfer (Member)

With this commit we switch the default store type from `hybridfs` to `mmapfs`.
While `hybridfs` is beneficial for random access workloads (think: updates and
queries) when the index size is much larger than the available page cache, it
incurs a performance penalty on smaller indices that fit into the page cache (or
are not much larger than that).

This performance penalty shows up not only for bulk updates or queries but also for
bulk indexing (without *any* conflicts) when an external document id is provided
by the client. For example, in the `geonames` benchmark this results in a
throughput reduction of roughly 17% compared to `mmapfs`. This reduction is
caused by document id lookups, which show up as the top contributor in the profile
when `hybridfs` is enabled. Below is an example stack trace captured by
async-profiler during a benchmarking trial; it shows that the overhead comes from
additional `read` system calls for document id lookups:

```
__GI_pread64
sun.nio.ch.FileDispatcherImpl.pread0
sun.nio.ch.FileDispatcherImpl.pread
sun.nio.ch.IOUtil.readIntoNativeBuffer
sun.nio.ch.IOUtil.read
sun.nio.ch.FileChannelImpl.readInternal
sun.nio.ch.FileChannelImpl.read
org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal
org.apache.lucene.store.BufferedIndexInput.refill
org.apache.lucene.store.BufferedIndexInput.readByte
org.apache.lucene.store.DataInput.readVInt
org.apache.lucene.store.BufferedIndexInput.readVInt
org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock
org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact
org.elasticsearch.common.lucene.uid.PerThreadIDVersionAndSeqNoLookup.getDocID
org.elasticsearch.common.lucene.uid.PerThreadIDVersionAndSeqNoLookup.lookupVersion
org.elasticsearch.common.lucene.uid.VersionsAndSeqNoResolver.loadDocIdAndVersion
org.elasticsearch.index.engine.InternalEngine.resolveDocVersion
org.elasticsearch.index.engine.InternalEngine.planIndexingAsPrimary
org.elasticsearch.index.engine.InternalEngine.indexingStrategyForOperation
org.elasticsearch.index.engine.InternalEngine.index
org.elasticsearch.index.shard.IndexShard.index
org.elasticsearch.index.shard.IndexShard.applyIndexOperation
org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary
[...]
```

For these reasons we are restoring `mmapfs` as the default store type.

Relates #36668
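
For context (not part of this change): the store type can also be pinned per index via the real `index.store.type` setting rather than relying on the default. Below is a minimal, hedged sketch using the Java API; the index name is illustrative and executing the request still requires a client.

```java
// Sketch only: building an index-creation request that pins the store type,
// instead of relying on whichever default the node uses. The setting key
// "index.store.type" is real; the index name "geonames" is just an example.
import org.elasticsearch.action.admin.indices.create.CreateIndexRequest;
import org.elasticsearch.common.settings.Settings;

public class ExplicitStoreType {
    public static CreateIndexRequest mmapfsIndex(String indexName) {
        Settings storeSettings = Settings.builder()
            .put("index.store.type", "mmapfs") // alternatives: "hybridfs", "niofs", "fs"
            .build();
        return new CreateIndexRequest(indexName).settings(storeSettings);
    }

    public static void main(String[] args) {
        System.out.println(mmapfsIndex("geonames").settings());
    }
}
```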
danielmitterdorfer added the >enhancement, v7.0.0, and :Distributed Indexing/Engine labels on Feb 1, 2019
elasticmachine (Collaborator)

Pinging @elastic/es-distributed

jpountz (Contributor) commented Feb 1, 2019

I am confused why `NIOFSDirectory` appears in the stack, since the terms dictionary is supposed to be opened with mmap?

danielmitterdorfer (Member, Author) commented Feb 1, 2019

After further investigation it turns out that this is due to Lucene's compound file format (`.cfs` files). Lucene writes these files to save file handles by combining multiple files into one, and it uses this approach for segments that are less than 10% of the index size. As `hybridfs` does not have special handling for this file type, such files get read via NIO instead of being memory-mapped.
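
To make the mechanism concrete, here is a minimal conceptual sketch (not the actual `hybridfs` implementation) using Lucene's `FileSwitchDirectory`, which routes files between two directories by extension. Any extension missing from the memory-mapped set, such as `cfs` here, is served by `NIOFSDirectory` and therefore by `pread` system calls, which matches the profile above. The extension set below is illustrative.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FileSwitchDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;

public class HybridStoreSketch {
    public static Directory open(Path indexPath) throws IOException {
        // Illustrative set of extensions routed to the memory-mapped directory.
        // Note that "cfs" is absent, so compound files fall through to NIO reads.
        Set<String> mmapExtensions = Set.of("tim", "tip", "dvd", "nvd", "kdd", "kdi");

        Directory mmap = new MMapDirectory(indexPath);
        Directory nio = new NIOFSDirectory(indexPath);

        // Files whose extension is in the set are opened via mmap; everything else via NIO.
        // The final flag closes both wrapped directories when this one is closed.
        return new FileSwitchDirectory(mmapExtensions, mmap, nio, true);
    }

    public static void main(String[] args) throws IOException {
        try (Directory dir = open(Paths.get("/tmp/example-index"))) {
            System.out.println(dir);
        }
    }
}
```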

We could add `.cfs` to the list of files that `hybridfs` memory-maps instead of reading them via NIO. While this would resolve the performance impact for small indices, it would neuter the positive effect of `hybridfs` on larger indices, because then we'd see page-cache thrashing again, and avoiding exactly that is the whole point of `hybridfs`. As an additional measure we could disallow the compound format on larger segments (for some definition of "large"). This would mean that:

  • Smaller segments keep using the compound format (`.cfs`). These files get memory-mapped and thus do not incur the performance penalty that we see in the profile above.
  • Larger segments no longer use the compound format (at the cost of Elasticsearch using more file handles). This means that we see individual files (e.g. `.tim`, `.tip`, ...) on the file system and can read each of them according to its expected data access pattern, either via NIO or memory-mapping.

We expect that this approach would provide good performance for both small and large indices, but we do not yet have experimental evidence to back up this hypothesis. Also, there might be other side effects (apart from the increased number of file handles) that we need to consider first.
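
As a hedged sketch, the second measure could be expressed with Lucene's stock merge-policy knobs for the compound format, `setNoCFSRatio` and `setMaxCFSSegmentSizeMB` (this is not what the PR implements, and the 1 GB threshold is an illustrative value, not a recommendation):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class CompoundFormatLimit {
    public static IndexWriterConfig configure() {
        TieredMergePolicy mergePolicy = new TieredMergePolicy();
        // Keep the compound format only for segments up to 10% of the index size
        // (this matches the 10% behaviour mentioned above).
        mergePolicy.setNoCFSRatio(0.1);
        // Additionally cap the absolute size of compound segments; larger merged
        // segments are written as individual .tim/.tip/... files instead.
        mergePolicy.setMaxCFSSegmentSizeMB(1024.0); // illustrative threshold

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setUseCompoundFile(true); // newly flushed (small) segments still use .cfs
        config.setMergePolicy(mergePolicy);
        return config;
    }
}
```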

As we first need to decide on the way forward, I have marked this PR as WIP, effectively putting it on hold for now.

jasontedor added the v8.0.0 label and removed the v7.0.0 label on Feb 6, 2019
danielmitterdorfer (Member, Author)

I have now run further experiments. Adding `.cfs` to the list of files to memory-map improves performance for both smaller and larger indices, so I am going to abandon this PR and instead open a follow-up where we add `.cfs`.

danielmitterdorfer (Member, Author)

I have opened #38940 instead, where I also present benchmark results.
