Improve realtime Lucene text index freshness/cpu/disk io usage #13503
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

@@            Coverage Diff             @@
##            master    #13503    +/-  ##
============================================
+ Coverage     61.75%    62.00%   +0.24%
+ Complexity      207       198       -9
============================================
  Files          2436      2554     +118
  Lines        133233    140552    +7319
  Branches      20636     21868    +1232
============================================
+ Hits          82274     87145    +4871
- Misses        44911     46765    +1854
- Partials       6048      6642     +594

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Is it possible to share some numbers on how this improvement has helped with freshness, along with the corresponding impact on CPU / mem / IO utilization / GC, etc.? I think it would be useful to analyze these numbers for a prod-like setting with a steady-state workload. IIRC, apart from freshness, there has also been a correctness concern with the way Lucene NRT works and the whole snapshot refresh business. Are we fixing that too?
Yes. For CPU/Mem/GC, we found the queue poll/offer pattern in such a tight loop caused ~1.2gbps allocations per thread (realizing now this was w/ less than 10ms delay). We use Generational ZGC, and this had an impact on % CPU spent on GC, especially when increasing the refresh thread count. I can't find the flame graphs for this, but the simple change to an ArrayList solves that, and the reduced allocations should be apparent even when profiling locally. For the Disk IO improvement, it is mostly from taking advantage of the

Here's an example of reduced delay w/ 10 threads/server. For reference, this is in a production cluster with hundreds of consuming partitions per node. The narrow spikes are mostly due to server restarts; the wide periods of narrow spikes are due to rollouts (I need to make a change to avoid emitting delay metrics if the server is still catching up). With a single queue, all tables are sensitive to ingestion spikes/data pattern changes in a single table. Partitioning helps reduce the 'noisy neighbor' indexes.

Here are some host metrics around the same time frame, showing no significant change in heap, a slight disk IO reduction, and increased CPU usage (since we went from 1 to 10 threads).
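To make the allocation point concrete, here is a rough sketch (hypothetical names, not the actual PR code) of the difference between the old queue poll/offer loop and iterating a per-thread list:

import java.util.List;
import java.util.Queue;

// Rough sketch of the allocation difference described above; names are illustrative.
public class RefreshLoopSketch {
  // Old pattern: each iteration polls an index off a shared queue and offers it back,
  // allocating a new queue node per offer - costly in a loop that runs every ~10ms.
  static void refreshOnceWithQueue(Queue<Object> searchers) {
    Object searcher = searchers.poll();
    if (searcher != null) {
      refresh(searcher);
      searchers.offer(searcher); // per-iteration allocation
    }
  }

  // New pattern: each refresh thread owns a plain list partition and iterates it in place,
  // so the steady-state loop allocates nothing per element.
  static void refreshOnceWithList(List<Object> partition) {
    for (Object searcher : partition) {
      refresh(searcher);
    }
  }

  private static void refresh(Object searcher) {
    // placeholder for refreshing the index's searcher (e.g. SearcherManager.maybeRefresh())
  }
}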
I think this is mostly a separate effort. As I understand it, the snapshot refresh business is done this way since it's inherently expensive to build/use Lucene-like structures in memory (especially since input is not necessarily ordered). For an entire segment, this is prohibitive, and it is part of the reason why the native text index's true real-time indexing is relatively resource intensive. By reducing the indexing delay, I think we can reduce the scope of the problem so that we only need to build/hold such a structure in memory for a very small portion of data (i.e., the portion that has not been refreshed yet). I opened an issue to track this and will share a doc there with more details, pending further testing. For now, I think this is a standalone feature that is good to have regardless, as it can reduce the amount of incorrect data. If you have any thoughts on this, I would love to continue the discussion there.
Force-pushed from 19e579f to cd8bdd0
/**
 * LuceneNRTCachingMergePolicy is a best-effort policy to generate merges for segments that are fully in memory,
 * at the time of SegmentInfo selection. It does not consider segments that have been flushed to disk eligible
 * for merging.
Add comments on why we want to introduce NRTCaching. For performance? reduced I/O?
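To illustrate the selection criterion the javadoc describes (a sketch only, assuming the policy sits on top of Lucene's NRTCachingDirectory; this is not the PR's actual implementation), a segment would count as a merge candidate only while all of its files are still in the RAM cache:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.store.NRTCachingDirectory;

// Hypothetical helper: a segment is eligible for merging only while every one of its files
// still lives in the NRTCachingDirectory RAM cache, i.e. nothing has been flushed to disk yet.
// Merging such segments while they are in memory avoids writing many tiny files on every refresh,
// which is the disk IO concern behind making refreshes more frequent.
final class InMemorySegmentCheck {
  private InMemorySegmentCheck() {
  }

  static boolean isFullyInMemory(SegmentCommitInfo segment, NRTCachingDirectory dir)
      throws IOException {
    Set<String> cached = new HashSet<>(Arrays.asList(dir.listCachedFiles()));
    return cached.containsAll(segment.files());
  }
}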
private final int _maxParallelism;
// delay between refresh iterations | ||
private int _delayMs; | ||
// partitioned lists of SearcherManagerHolders, each is gets its own thread for refreshing. SearcherManagerHolders |
each is --> each gets
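For context, here is a minimal sketch of the partitioning idea described in the comment above (hypothetical names; the real class distributes SearcherManagerHolders): searchers are spread round-robin across at most maxParallelism lists, and each list is refreshed by its own background thread.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: round-robin assignment of searchers to per-thread partitions,
// so a slow or hot index only delays the other indexes in its own partition.
final class SearcherPartitioner {
  static <T> List<List<T>> partition(List<T> searchers, int maxParallelism) {
    int numPartitions = Math.min(maxParallelism, Math.max(1, searchers.size()));
    List<List<T>> partitions = new ArrayList<>(numPartitions);
    for (int i = 0; i < numPartitions; i++) {
      partitions.add(new ArrayList<>());
    }
    for (int i = 0; i < searchers.size(); i++) {
      partitions.get(i % numPartitions).add(searchers.get(i));
    }
    return partitions;
  }
}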
  _partitionedListsOfSearchers = new ArrayList<>();
}

public static RealtimeLuceneIndexRefreshManager getInstance() { |
Is this class thread safe? Why not add synchronized, following the basic singleton pattern suggested here as a common approach?
https://stackoverflow.com/questions/11165852/java-singleton-and-synchronization
It's not the traditional singleton that can follow that pattern, since it accepts args (server configs). The init(..) method will be called first in the server starter, ensuring the instance is not null. The precondition below is added as a hint to others when writing UTs.
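For reference, a minimal sketch of the init-then-get pattern being described (illustrative, not the PR's exact code): the server starter calls init(..) once with the server configs before anything asks for the instance, so getInstance() only needs the precondition rather than lazy synchronization.

import com.google.common.base.Preconditions;

public class RealtimeLuceneIndexRefreshManager {
  private static RealtimeLuceneIndexRefreshManager _singletonInstance;

  private final int _maxParallelism;
  // delay between refresh iterations
  private int _delayMs;

  private RealtimeLuceneIndexRefreshManager(int maxParallelism, int delayMs) {
    _maxParallelism = maxParallelism;
    _delayMs = delayMs;
  }

  // Called once from the server starter with server configs, before any index registers itself
  public static synchronized void init(int maxParallelism, int delayMs) {
    _singletonInstance = new RealtimeLuceneIndexRefreshManager(maxParallelism, delayMs);
  }

  public static RealtimeLuceneIndexRefreshManager getInstance() {
    // Hint for unit tests: fail fast if init(..) was never called
    Preconditions.checkState(_singletonInstance != null, "init() must be called before getInstance()");
    return _singletonInstance;
  }
}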
This PR allows for better freshness/cpu/disk io usage for the realtime Lucene text index.

User facing changes (example config snippet below):
- pinot.server.lucene.min.refresh.interval.ms (default is 10, as 10ms was the previous behavior)
- pinot.server.lucene.max.refresh.threads (default is 1, as a single thread was the previous behavior)
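For illustration, the two new knobs could be set in the server config like so (the values below are hypothetical, not recommendations; the defaults preserve the previous behavior):

# hypothetical example values
pinot.server.lucene.min.refresh.interval.ms=100
pinot.server.lucene.max.refresh.threads=4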
Implementation changes:
- Use ScalingThreadPoolExecutor to allow for multiple background refresh threads to refresh Lucene indexes
- RealtimeLuceneTextIndex._searcherManager instances are evenly distributed between background refresh threads: 1 thread : 1 RealtimeLuceneTextIndex, up to the max threads configured, then each thread handles multiple RealtimeLuceneTextIndex
- If a thread has no RealtimeLuceneTextIndex to refresh, the thread will be removed
- Move RealtimeLuceneTextIndex-specific logic out of MutableSegmentImpl - the index itself registers itself with the refresh manager, and is removed once closed
- Add LuceneNRTCachingMergePolicy to perform best-effort merging of in-memory Lucene segments - each refresh causes a flush, and making refreshes more common would otherwise cause huge numbers of small files

With configs not set/default settings, we see lower cpu/disk io and slightly better index freshness. With more aggressive configs, we see much better index freshness (we have many tables w/ text index) at the same or similar resource usage.
For testing, we've had this deployed in some of our prod clusters for a bit without issues.
Sample improvement over 1 table (spikes are due to server restarts).
For resource improvements, see below.