
Improve realtime Lucene text index freshness/cpu/disk io usage #13503

Merged: 3 commits into apache:master on Jul 26, 2024

Conversation

itschrispeck (Collaborator) commented Jun 27, 2024

This PR improves index freshness and reduces CPU/disk IO usage for the realtime Lucene text index.

User-facing changes:

  1. Add config pinot.server.lucene.min.refresh.interval.ms (default is 10, as 10ms was the previous behavior)
  2. Add config pinot.server.lucene.max.refresh.threads (default is 1, as a single thread was the previous behavior)
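
For example, a server config that trades a little CPU for fresher indexes might look like this (illustrative values; only the two config keys above are added by this PR):

```properties
# Minimum delay between Lucene refresh iterations (default 10, the previous behavior)
pinot.server.lucene.min.refresh.interval.ms=10
# Maximum number of background refresh threads (default 1; raising it trades CPU for freshness)
pinot.server.lucene.max.refresh.threads=4
```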

Implementation changes:

  1. Use a scale-first ScalingThreadPoolExecutor to allow multiple background refresh threads to refresh Lucene indexes (see the sketch after this list)
    1. All RealtimeLuceneTextIndex._searcherManagers are distributed evenly across the background refresh threads.
    2. The refresh thread pool starts at one thread per RealtimeLuceneTextIndex, up to the configured maximum number of threads; beyond that, each thread handles multiple RealtimeLuceneTextIndex instances.
    3. If a table is deleted or a consuming segment rebalance leaves a thread with no RealtimeLuceneTextIndex to refresh, that thread is removed.
  2. Refactor RealtimeLuceneTextIndex-specific logic out of MutableSegmentImpl: the index registers itself with the refresh manager and is removed once it is closed.
  3. Add LuceneNRTCachingMergePolicy to perform best-effort merging of in-memory Lucene segments: each refresh causes a flush, so more frequent refreshes would otherwise create huge numbers of small files.
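
For illustration, here is a rough sketch of the partitioned refresh loop (simplified: it uses a fixed-size pool and plain SearcherManager lists instead of the PR's scale-first ScalingThreadPoolExecutor and SearcherManagerHolder bookkeeping; class and method names here are hypothetical):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.search.SearcherManager;

public class PartitionedRefreshSketch {
  private final List<List<SearcherManager>> _partitions = new ArrayList<>();
  private final ExecutorService _refreshPool;
  private final long _minRefreshIntervalMs;

  public PartitionedRefreshSketch(List<SearcherManager> searcherManagers, int maxRefreshThreads,
      long minRefreshIntervalMs) {
    _minRefreshIntervalMs = minRefreshIntervalMs;
    // One thread per index until the configured max, after which threads handle multiple indexes
    int numThreads = Math.max(1, Math.min(maxRefreshThreads, searcherManagers.size()));
    for (int i = 0; i < numThreads; i++) {
      _partitions.add(new ArrayList<>());
    }
    // Distribute the SearcherManagers evenly (round-robin) across the refresh threads
    for (int i = 0; i < searcherManagers.size(); i++) {
      _partitions.get(i % numThreads).add(searcherManagers.get(i));
    }
    _refreshPool = Executors.newFixedThreadPool(numThreads);
    for (List<SearcherManager> partition : _partitions) {
      _refreshPool.submit(() -> refreshLoop(partition));
    }
  }

  private void refreshLoop(List<SearcherManager> partition) {
    while (!Thread.currentThread().isInterrupted()) {
      for (SearcherManager searcherManager : partition) {
        try {
          // Opens a new near-real-time searcher only if the index changed since the last refresh
          searcherManager.maybeRefresh();
        } catch (IOException e) {
          // The real manager would log and track this per index; a sketch simply moves on
        }
      }
      try {
        TimeUnit.MILLISECONDS.sleep(_minRefreshIntervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  public void shutdown() {
    _refreshPool.shutdownNow();
  }
}
```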

With the configs unset (i.e., default settings), we see lower CPU/disk IO and slightly better index freshness. With more aggressive configs, we see much better index freshness (we have many tables with text indexes) at the same or similar resource usage.

For testing, we've had this deployed in some of our prod clusters for a bit without issues.

Sample improvement over 1 table (spikes are due to server restarts).

previous average max delay: ~25s
previous post restart spike: ~5-10min
new average max delay: ~2s
new post restart spike: ~70s
[screenshot: max ingestion delay, before vs. after]

for resource improvements, see below

codecov-commenter commented Jun 27, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 24 lines in your changes missing coverage. Please review.

Project coverage is 62.00%. Comparing base (59551e4) to head (624eae6).
Report is 795 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| ...vertedindex/RealtimeLuceneIndexRefreshManager.java | 81.39% | 9 Missing and 7 partials ⚠️ |
| .../pinot/server/starter/helix/BaseServerStarter.java | 0.00% | 7 Missing ⚠️ |
| ...mpl/invertedindex/LuceneNRTCachingMergePolicy.java | 92.30% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13503      +/-   ##
============================================
+ Coverage     61.75%   62.00%   +0.24%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2554     +118     
  Lines        133233   140552    +7319     
  Branches      20636    21868    +1232     
============================================
+ Hits          82274    87145    +4871     
- Misses        44911    46765    +1854     
- Partials       6048     6642     +594     
| Flag | Coverage Δ |
|---|---|
| custom-integration1 | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration1 | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration2 | 0.00% <0.00%> (ø) |
| java-11 | 61.94% <80.00%> (+0.23%) ⬆️ |
| java-21 | 61.87% <80.00%> (+0.25%) ⬆️ |
| skip-bytebuffers-false | 61.96% <80.00%> (+0.22%) ⬆️ |
| skip-bytebuffers-true | 61.85% <80.00%> (+34.12%) ⬆️ |
| temurin | 62.00% <80.00%> (+0.24%) ⬆️ |
| unittests | 61.99% <80.00%> (+0.25%) ⬆️ |
| unittests1 | 46.43% <0.00%> (-0.46%) ⬇️ |
| unittests2 | 27.78% <80.00%> (+0.05%) ⬆️ |

Flags with carried forward coverage won't be shown.

siddharthteotia (Contributor) commented Jul 5, 2024

Is it possible to share some numbers on how this improvement has helped with freshness, and at the same time the corresponding impact on CPU / mem / IO utilization / GC, etc.?

I think it may be useful to analyze these numbers for a prod-like setting with a steady-state workload.

IIRC, apart from freshness, there has also been a correctness concern with the way Lucene NRT works and the whole snapshot refresh business. Are we fixing that too?

itschrispeck (Collaborator, Author) commented Jul 5, 2024

> Is it possible to share some numbers on how this improvement has helped with freshness, and at the same time the corresponding impact on CPU / mem / IO utilization / GC, etc.?
>
> I think it may be useful to analyze these numbers for a prod-like setting with a steady-state workload.

Yes. For CPU/Mem/GC, we found the queue poll/offer pattern in such a tight loop caused ~1.2gbps allocations per thread (realizing now this was w/ less than 10ms delay). We use Generational ZGC and this had an impact on % CPU spent on GC, especially when increasing refresh thread count. I can't find the flame graphs for this, but the simple change to an ArrayList solves that and the reduced allocations should be apparent even profiling locally.
[screenshot]
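
The gist of that change (a simplified sketch, not the actual diff; the holder bookkeeping is omitted):

```java
import java.io.IOException;
import java.util.List;
import java.util.Queue;

import org.apache.lucene.search.SearcherManager;

public class RefreshIterationSketch {
  // Before: each pass pulls every entry off a queue and re-enqueues it, so the tight
  // refresh loop allocates new queue nodes on every poll/offer cycle.
  static void refreshOnceWithQueue(Queue<SearcherManager> queue) throws IOException {
    int size = queue.size();
    for (int i = 0; i < size; i++) {
      SearcherManager searcherManager = queue.poll();
      searcherManager.maybeRefresh();
      queue.offer(searcherManager);
    }
  }

  // After: iterate a plain ArrayList in place; no per-iteration allocations.
  static void refreshOnceWithList(List<SearcherManager> searcherManagers) throws IOException {
    for (SearcherManager searcherManager : searcherManagers) {
      searcherManager.maybeRefresh();
    }
  }
}
```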

The disk IO improvement comes mostly from the LuceneNRTCachingMergePolicy: we make a best-effort attempt to merge only segments that are entirely in memory, which reduces file descriptors and avoids most of the IO.
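
Conceptually, the candidate selection looks something like this (a simplified sketch, not the actual LuceneNRTCachingMergePolicy; it assumes NRTCachingDirectory#listCachedFiles() is available to list the files that are still only in the in-memory cache):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.NRTCachingDirectory;

public final class InMemorySegmentSelector {
  private InMemorySegmentSelector() {
  }

  /**
   * Returns the segments whose files all still live in the NRTCachingDirectory's in-memory cache.
   * Only these segments are considered merge candidates; anything already flushed to disk is skipped.
   */
  public static List<SegmentCommitInfo> selectFullyCachedSegments(SegmentInfos infos, NRTCachingDirectory dir)
      throws IOException {
    Set<String> cachedFiles = new HashSet<>(Arrays.asList(dir.listCachedFiles()));
    List<SegmentCommitInfo> candidates = new ArrayList<>();
    for (SegmentCommitInfo segment : infos) {
      if (cachedFiles.containsAll(segment.files())) {
        candidates.add(segment);
      }
    }
    return candidates;
  }
}
```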

Here's an example of reduced delay w/ 10 threads/server. For reference, this is in a production cluster with hundreds of consuming partitions per node. The narrow spikes are mostly due to server restart, the wide periods of narrow spikes are due to rollouts (I need to make a change to avoid emitting delay metrics if server is still catching up). With a single queue, we see all tables are sensitive to ingestion spikes/data pattern changes in a single table. Partitioning helps reduce the 'noisy neighbor' indexes.
[screenshot: ingestion delay across tables, 10 refresh threads per server]

Here's some host metrics around the same time frame, showing no significant change in heap, a slight disk IO reduction, and increased CPU usage (since we went from 1 to 10 threads).
[screenshot: host metrics: heap, disk IO, CPU]

> IIRC, apart from freshness, there has also been a correctness concern with the way Lucene NRT works and the whole snapshot refresh business. Are we fixing that too?

I think this is mostly a separate effort. As I understand it, the snapshot refresh approach exists because it's inherently expensive to build/use Lucene-like structures in memory (especially since input is not necessarily ordered). For an entire segment this is prohibitive, and it is part of the reason why the native text index's true real-time indexing is relatively resource intensive. By reducing the indexing delay, I think we can reduce the scope of the problem so that we only need to build/hold such a structure in memory for a very small portion of data (i.e., the portion that has not been refreshed yet).

I opened an issue to track this and will share a doc there with more details, pending further testing. For now, I think this is a standalone feature that is good to have regardless, as it can reduce the amount of incorrect data. If you have any thoughts on this, I would love to continue the discussion there.

/**
* LuceneNRTCachingMergePolicy is a best-effort policy to generate merges for segments that are fully in memory,
* at the time of SegmentInfo selection. It does not consider segments that have been flushed to disk eligible
* for merging.

Contributor:

Add comments on why we want to introduce NRTCaching. For performance? reduced I/O?

private final int _maxParallelism;
// delay between refresh iterations
private int _delayMs;
// partitioned lists of SearcherManagerHolders, each is gets its own thread for refreshing. SearcherManagerHolders

Contributor:

each is --> each gets

_partitionedListsOfSearchers = new ArrayList<>();
}

public static RealtimeLuceneIndexRefreshManager getInstance() {

Contributor:

Is this class thread safe? Why not add synchronized as the basic singleton method as suggested here as a common pattern?
https://stackoverflow.com/questions/11165852/java-singleton-and-synchronization

Collaborator Author:

It's not the traditional singleton that can follow that pattern since it accepts args (server configs). The init(..) method will be called first in the server starter, ensuring the instance is not null. The precondition below is added as a hint to others when writing UTs.
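
Roughly the pattern described above (a hypothetical sketch, not the actual class):

```java
public class RefreshManagerSingletonSketch {
  private static RefreshManagerSingletonSketch _instance;
  private final int _maxParallelism;
  private final int _delayMs;

  private RefreshManagerSingletonSketch(int maxParallelism, int delayMs) {
    _maxParallelism = maxParallelism;
    _delayMs = delayMs;
  }

  // Called once from the server starter, before any getInstance() caller runs
  public static synchronized RefreshManagerSingletonSketch init(int maxParallelism, int delayMs) {
    _instance = new RefreshManagerSingletonSketch(maxParallelism, delayMs);
    return _instance;
  }

  public static RefreshManagerSingletonSketch getInstance() {
    // Fails fast (e.g. in unit tests) if init() was not called first
    if (_instance == null) {
      throw new IllegalStateException("init() must be called before getInstance()");
    }
    return _instance;
  }
}
```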

@chenboat merged commit 8603164 into apache:master on Jul 26, 2024
20 checks passed
Labels: Configuration (config changes: addition/deletion/change in behavior), enhancement, performance, real-time, refactor