
Improve realtime Lucene text index freshness/cpu/disk io usage #13503

Merged: 3 commits into apache:master on Jul 26, 2024

Conversation

itschrispeck (Collaborator) commented Jun 27, 2024

This PR improves index freshness and reduces CPU/disk IO usage for the realtime Lucene text index.

User-facing changes:

  1. Add config pinot.server.lucene.min.refresh.interval.ms (default is 10, as 10ms was the previous behavior)
  2. Add config pinot.server.lucene.max.refresh.threads (default is 1, as a single thread was the previous behavior)
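
For example, a server config that trades a little CPU for fresher indexes might look like this (illustrative values; only the two config keys above are added by this PR):

```properties
# Minimum delay between Lucene refresh iterations (default 10, the previous behavior)
pinot.server.lucene.min.refresh.interval.ms=10
# Maximum number of background refresh threads (default 1; raising it trades CPU for freshness)
pinot.server.lucene.max.refresh.threads=4
```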

Implementation changes:

  1. Use a scale-first ScalingThreadPoolExecutor to allow multiple background refresh threads to refresh Lucene indexes (see the sketch after this list)
    1. All RealtimeLuceneTextIndex._searcherManagers are distributed evenly across the background refresh threads.
    2. The refresh thread pool starts at one thread per RealtimeLuceneTextIndex, up to the configured maximum number of threads; beyond that, each thread handles multiple RealtimeLuceneTextIndex instances.
    3. If a table is deleted or a consuming segment rebalance leaves a thread with no RealtimeLuceneTextIndex to refresh, that thread is removed.
  2. Refactor RealtimeLuceneTextIndex-specific logic out of MutableSegmentImpl: the index registers itself with the refresh manager and is removed once it is closed.
  3. Add LuceneNRTCachingMergePolicy to perform best-effort merging of in-memory Lucene segments: each refresh causes a flush, so more frequent refreshes would otherwise create huge numbers of small files.
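
For illustration, here is a rough sketch of the partitioned refresh loop (simplified: it uses a fixed-size pool and plain SearcherManager lists instead of the PR's scale-first ScalingThreadPoolExecutor and SearcherManagerHolder bookkeeping; class and method names here are hypothetical):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.search.SearcherManager;

public class PartitionedRefreshSketch {
  private final List<List<SearcherManager>> _partitions = new ArrayList<>();
  private final ExecutorService _refreshPool;
  private final long _minRefreshIntervalMs;

  public PartitionedRefreshSketch(List<SearcherManager> searcherManagers, int maxRefreshThreads,
      long minRefreshIntervalMs) {
    _minRefreshIntervalMs = minRefreshIntervalMs;
    // One thread per index until the configured max, after which threads handle multiple indexes
    int numThreads = Math.max(1, Math.min(maxRefreshThreads, searcherManagers.size()));
    for (int i = 0; i < numThreads; i++) {
      _partitions.add(new ArrayList<>());
    }
    // Distribute the SearcherManagers evenly (round-robin) across the refresh threads
    for (int i = 0; i < searcherManagers.size(); i++) {
      _partitions.get(i % numThreads).add(searcherManagers.get(i));
    }
    _refreshPool = Executors.newFixedThreadPool(numThreads);
    for (List<SearcherManager> partition : _partitions) {
      _refreshPool.submit(() -> refreshLoop(partition));
    }
  }

  private void refreshLoop(List<SearcherManager> partition) {
    while (!Thread.currentThread().isInterrupted()) {
      for (SearcherManager searcherManager : partition) {
        try {
          // Opens a new near-real-time searcher only if the index changed since the last refresh
          searcherManager.maybeRefresh();
        } catch (IOException e) {
          // The real manager would log and track this per index; a sketch simply moves on
        }
      }
      try {
        TimeUnit.MILLISECONDS.sleep(_minRefreshIntervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  public void shutdown() {
    _refreshPool.shutdownNow();
  }
}
```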

With the configs unset (i.e., default settings), we see lower CPU/disk IO and slightly better index freshness. With more aggressive configs, we see much better index freshness (we have many tables with text indexes) at the same or similar resource usage.

For testing, we've had this deployed in some of our prod clusters for a bit without issues.

Sample improvement over 1 table (spikes are due to server restarts).

previous average max delay: ~25s
previous post restart spike: ~5-10min
new average max delay: ~2s
new post restart spike: ~70s
[screenshot: max ingestion delay, before vs. after]

for resource improvements, see below

codecov-commenter commented Jun 27, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 24 lines in your changes missing coverage. Please review.

Project coverage is 62.00%. Comparing base (59551e4) to head (624eae6).
Report is 795 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| ...vertedindex/RealtimeLuceneIndexRefreshManager.java | 81.39% | 9 Missing and 7 partials ⚠️ |
| .../pinot/server/starter/helix/BaseServerStarter.java | 0.00% | 7 Missing ⚠️ |
| ...mpl/invertedindex/LuceneNRTCachingMergePolicy.java | 92.30% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13503      +/-   ##
============================================
+ Coverage     61.75%   62.00%   +0.24%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2554     +118     
  Lines        133233   140552    +7319     
  Branches      20636    21868    +1232     
============================================
+ Hits          82274    87145    +4871     
- Misses        44911    46765    +1854     
- Partials       6048     6642     +594     
| Flag | Coverage Δ |
|---|---|
| custom-integration1 | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration1 | <0.01% <0.00%> (-0.01%) ⬇️ |
| integration2 | 0.00% <0.00%> (ø) |
| java-11 | 61.94% <80.00%> (+0.23%) ⬆️ |
| java-21 | 61.87% <80.00%> (+0.25%) ⬆️ |
| skip-bytebuffers-false | 61.96% <80.00%> (+0.22%) ⬆️ |
| skip-bytebuffers-true | 61.85% <80.00%> (+34.12%) ⬆️ |
| temurin | 62.00% <80.00%> (+0.24%) ⬆️ |
| unittests | 61.99% <80.00%> (+0.25%) ⬆️ |
| unittests1 | 46.43% <0.00%> (-0.46%) ⬇️ |
| unittests2 | 27.78% <80.00%> (+0.05%) ⬆️ |

Flags with carried forward coverage won't be shown.

siddharthteotia (Contributor) commented Jul 5, 2024

Is it possible to share some numbers on how this improvement has helped with freshness, and at the same time the corresponding impact on CPU / mem / IO utilization / GC, etc.?

I think it may be useful to analyze these numbers for a prod-like setting with a steady-state workload.

IIRC, apart from freshness, there has also been a correctness concern with the way Lucene NRT works and the whole snapshot refresh business. Are we fixing that too?

itschrispeck (Collaborator, Author) commented Jul 5, 2024

> Is it possible to share some numbers on how this improvement has helped with freshness, and at the same time the corresponding impact on CPU / mem / IO utilization / GC, etc.?
>
> I think it may be useful to analyze these numbers for a prod-like setting with a steady-state workload.

Yes. For CPU/Mem/GC, we found the queue poll/offer pattern in such a tight loop caused ~1.2gbps allocations per thread (realizing now this was w/ less than 10ms delay). We use Generational ZGC and this had an impact on % CPU spent on GC, especially when increasing refresh thread count. I can't find the flame graphs for this, but the simple change to an ArrayList solves that and the reduced allocations should be apparent even profiling locally.
[screenshot]
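
The gist of that change (a simplified sketch, not the actual diff; the holder bookkeeping is omitted):

```java
import java.io.IOException;
import java.util.List;
import java.util.Queue;

import org.apache.lucene.search.SearcherManager;

public class RefreshIterationSketch {
  // Before: each pass pulls every entry off a queue and re-enqueues it, so the tight
  // refresh loop allocates new queue nodes on every poll/offer cycle.
  static void refreshOnceWithQueue(Queue<SearcherManager> queue) throws IOException {
    int size = queue.size();
    for (int i = 0; i < size; i++) {
      SearcherManager searcherManager = queue.poll();
      searcherManager.maybeRefresh();
      queue.offer(searcherManager);
    }
  }

  // After: iterate a plain ArrayList in place; no per-iteration allocations.
  static void refreshOnceWithList(List<SearcherManager> searcherManagers) throws IOException {
    for (SearcherManager searcherManager : searcherManagers) {
      searcherManager.maybeRefresh();
    }
  }
}
```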

The disk IO improvement comes mostly from the LuceneNRTCachingMergePolicy: we make a best-effort attempt to merge only segments that are entirely in memory, which reduces file descriptors and avoids most of the IO.
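
Conceptually, the candidate selection looks something like this (a simplified sketch, not the actual LuceneNRTCachingMergePolicy; it assumes NRTCachingDirectory#listCachedFiles() is available to list the files that are still only in the in-memory cache):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.NRTCachingDirectory;

public final class InMemorySegmentSelector {
  private InMemorySegmentSelector() {
  }

  /**
   * Returns the segments whose files all still live in the NRTCachingDirectory's in-memory cache.
   * Only these segments are considered merge candidates; anything already flushed to disk is skipped.
   */
  public static List<SegmentCommitInfo> selectFullyCachedSegments(SegmentInfos infos, NRTCachingDirectory dir)
      throws IOException {
    Set<String> cachedFiles = new HashSet<>(Arrays.asList(dir.listCachedFiles()));
    List<SegmentCommitInfo> candidates = new ArrayList<>();
    for (SegmentCommitInfo segment : infos) {
      if (cachedFiles.containsAll(segment.files())) {
        candidates.add(segment);
      }
    }
    return candidates;
  }
}
```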

Here's an example of reduced delay w/ 10 threads/server. For reference, this is in a production cluster with hundreds of consuming partitions per node. The narrow spikes are mostly due to server restart, the wide periods of narrow spikes are due to rollouts (I need to make a change to avoid emitting delay metrics if server is still catching up). With a single queue, we see all tables are sensitive to ingestion spikes/data pattern changes in a single table. Partitioning helps reduce the 'noisy neighbor' indexes.
[screenshot: ingestion delay across tables, 10 refresh threads per server]

Here's some host metrics around the same time frame, showing no significant change in heap, a slight disk IO reduction, and increased CPU usage (since we went from 1 to 10 threads).
[screenshot: host metrics: heap, disk IO, CPU]

> IIRC, apart from freshness, there has also been a correctness concern with the way Lucene NRT works and the whole snapshot refresh business. Are we fixing that too?

I think this is mostly a separate effort. As I understand it, the snapshot refresh approach exists because it's inherently expensive to build/use Lucene-like structures in memory (especially since input is not necessarily ordered). For an entire segment this is prohibitive, and it is part of the reason why the native text index's true real-time indexing is relatively resource intensive. By reducing the indexing delay, I think we can reduce the scope of the problem so that we only need to build/hold such a structure in memory for a very small portion of data (i.e., the portion that has not been refreshed yet).

I opened an issue to track this and will share a doc there with more details, pending further testing. For now, I think this is a standalone feature that is good to have regardless, as it can reduce the amount of incorrect data. If you have any thoughts on this, I would love to continue the discussion there.

/**
* LuceneNRTCachingMergePolicy is a best-effort policy to generate merges for segments that are fully in memory,
* at the time of SegmentInfo selection. It does not consider segments that have been flushed to disk eligible
* for merging.

Contributor:

Add comments on why we want to introduce NRTCaching. For performance? reduced I/O?

private final int _maxParallelism;
// delay between refresh iterations
private int _delayMs;
// partitioned lists of SearcherManagerHolders, each is gets its own thread for refreshing. SearcherManagerHolders

Contributor:

each is --> each gets

_partitionedListsOfSearchers = new ArrayList<>();
}

public static RealtimeLuceneIndexRefreshManager getInstance() {

Contributor:

Is this class thread safe? Why not add synchronized as the basic singleton method as suggested here as a common pattern?
https://stackoverflow.com/questions/11165852/java-singleton-and-synchronization

Collaborator Author:

It's not the traditional singleton that can follow that pattern since it accepts args (server configs). The init(..) method will be called first in the server starter, ensuring the instance is not null. The precondition below is added as a hint to others when writing UTs.
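
Roughly the pattern described above (a hypothetical sketch, not the actual class):

```java
public class RefreshManagerSingletonSketch {
  private static RefreshManagerSingletonSketch _instance;
  private final int _maxParallelism;
  private final int _delayMs;

  private RefreshManagerSingletonSketch(int maxParallelism, int delayMs) {
    _maxParallelism = maxParallelism;
    _delayMs = delayMs;
  }

  // Called once from the server starter, before any getInstance() caller runs
  public static synchronized RefreshManagerSingletonSketch init(int maxParallelism, int delayMs) {
    _instance = new RefreshManagerSingletonSketch(maxParallelism, delayMs);
    return _instance;
  }

  public static RefreshManagerSingletonSketch getInstance() {
    // Fails fast (e.g. in unit tests) if init() was not called first
    if (_instance == null) {
      throw new IllegalStateException("init() must be called before getInstance()");
    }
    return _instance;
  }
}
```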

@chenboat merged commit 8603164 into apache:master on Jul 26, 2024
20 checks passed
Labels: Configuration (config changes: addition/deletion/change in behavior), enhancement, performance, real-time, refactor