LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges. #633
Conversation
Force-pushed 4f2c84c to 8f6ef86
My plan is to defer adding the newly created merged segments to IndexWriter until all merges finish. MergeScheduler threads will do all the merge work right up to creating segment files, and return. In the addIndexes() API, we'll wait for all merges kicked off by the API to complete, and then add them to the IndexWriter.

Added code for this change to get some feedback on this approach. Tests are failing right now and I'm working on fixing them and adding new ones.
I really love this approach, of letting MergePolicy guide exactly how merging of addIndexes(CodecReader[]) will happen.

What remains to bring this out of WIP? I see a few // nocommits still to fix ... can you post the failing test details here maybe? This is certainly tricky code; thank you for tackling it!
@@ -352,6 +352,14 @@ public FieldDimensions(int dimensionCount, int indexDimensionCount, int dimensio
    this.softDeletesFieldName = softDeletesFieldName;
  }

  public void verifyFieldInfo(FieldInfo fi) {
Should this really be public? Or is it only for asserting, in which case maybe make it package private and rename to boolean assertFieldInfo or so?
This method makes sure that incoming index fields are compatible with the destination index, e.g. vectors have the same dimensions and use the same similarity function.
Changed access to package private. I retained the method name verifyFieldInfo(), as it internally calls verifySoftDeletedFieldName and verifySameSchema, which have a similar naming style. I can change it if assert*() is more in line with the Lucene convention; in which case, I also think that it should only return a condition and the asserts should be done by the caller. Let me know.
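The compatibility check being discussed (incoming fields must match the destination index's schema, e.g. same vector dimensions and similarity function) can be sketched with simplified stand-ins. Everything here is hypothetical: a plain string encodes the schema, and a map stands in for the destination's FieldInfos; this is not the actual Lucene code.

```java
import java.util.HashMap;
import java.util.Map;

public class FieldSchemaCheck {
    // Package-private style verify: throws if an incoming field's schema
    // conflicts with the schema already registered for that field name.
    // The schema string (e.g. "dims=128;sim=COSINE") is an illustrative
    // stand-in for FieldInfo attributes like vector dimension/similarity.
    static void verifyFieldInfo(Map<String, String> destFields, String fieldName, String incomingSchema) {
        String existing = destFields.get(fieldName);
        if (existing != null && !existing.equals(incomingSchema)) {
            throw new IllegalArgumentException(
                "cannot change field \"" + fieldName + "\" from " + existing + " to " + incomingSchema);
        }
        destFields.put(fieldName, incomingSchema);
    }

    public static void main(String[] args) {
        Map<String, String> dest = new HashMap<>();
        verifyFieldInfo(dest, "vec", "dims=128;sim=COSINE"); // first registration: ok
        verifyFieldInfo(dest, "vec", "dims=128;sim=COSINE"); // same schema: ok
        try {
            verifyFieldInfo(dest, "vec", "dims=256;sim=COSINE"); // dimension mismatch
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This shape also shows why a boolean assert-style variant would work equally well: the method could return the comparison result and let the caller assert.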
@@ -39,6 +40,11 @@ public MergeSpecification findMerges(
    return null;
  }

  @Override
  public MergeSpecification findMerges(List<CodecReader> readers) throws IOException {
    return null;
Hmm, should null even be allowed by this new API? Won't this throw NPE if someone tries to addIndexes?
In my current impl, addIndexes simply completes without adding anything if the API returns null. I now realize this is misleading for the user: callers of addIndexes may not know about the configured merge policy. I should probably throw an exception if I get a null here?

On similar lines, what should we do if the API call had some CodecReaders, but findMerges returns an empty spec? Should we error out? (Will add a log for this either way.)

Re: letting the API return null, I see that the other variants of findMerges in this policy all return null, and all callers of findMerges seem to do the null checks. Personally, I prefer disallowing null return values, but I wanted to keep this in sync with the other variants.
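The trade-off under discussion can be made concrete with a minimal, hypothetical sketch: plain lists of strings stand in for MergeSpecification and CodecReader, and the names here are illustrative, not the actual IndexWriter code. Throwing on a null or empty spec surfaces the misconfigured policy instead of silently adding nothing.

```java
import java.util.List;

public class FindMergesNullCheck {
    // Stand-in for a MergePolicy whose findMerges(readers) declines to merge.
    static List<List<String>> findMerges(List<String> readers) {
        return null;
    }

    // Sketch of the addIndexes side: reject a null/empty spec loudly so the
    // caller learns the configured merge policy produced no merges.
    static int addIndexes(List<String> readers) {
        List<List<String>> spec = findMerges(readers);
        if (spec == null || spec.isEmpty()) {
            throw new IllegalStateException(
                "merge policy returned no merges for " + readers.size() + " readers");
        }
        return spec.size();
    }
}
```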
There are consumers of NoMergePolicy that expect addIndexes(CodecReaders...) to work, like IndexRearranger. My understanding is that NoMergePolicy is meant to turn off any merges on existing segments. Since this API uses merges to add new readers to the index, I've changed it here to return the default base class impl.

It's a little off because the name is NoMergePolicy, but I think (hope) it aligns with the intent behind that class. I've added some comments to call this out. Let me know if we should instead create a different MergePolicy that only allows findMerges(readers) (it would be exactly like NoMergePolicy except for one API).
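The "default base class impl" referred to here — clustering every incoming reader into one merge, matching the historical addIndexes behavior — could be sketched like this (lists of strings stand in for Lucene's MergeSpecification and CodecReader types; this is an illustration, not the real implementation):

```java
import java.util.List;

public class SingletonMergeSpec {
    // All incoming readers go into a single merge: the behavior the
    // pre-existing blocking addIndexes(CodecReader...) call always had.
    static List<List<String>> findMerges(List<String> readers) {
        if (readers.isEmpty()) {
            return List.of(); // nothing to merge
        }
        return List.of(List.copyOf(readers));
    }
}
```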
Thanks for reviewing this, @mikemccand! I've addressed the comments that had minor changes, and am working on the larger ones.
There are 2 existing tests that are breaking:
- org.apache.lucene.index.TestAddIndexes.testAddIndicesWithSoftDeletes (:lucene:core)
- org.apache.lucene.index.TestIndexSorting.testAddIndexesWithDeletions (:lucene:core)

I've been looking at:

java.lang.AssertionError: pendingNumDocs 7 != 5 totalMaxDoc
    at __randomizedtesting.SeedInfo.seed([63C0554849BFDEC5:4F94AA0CFCBAFADD]:0)
    at org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2428)
    at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1334)
    at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1362)
    at org.apache.lucene.util.IOUtils.close(IOUtils.java:85)
    at org.apache.lucene.util.IOUtils.close(IOUtils.java:72)
    at org.apache.lucene.index.TestAddIndexes.testAddIndicesWithSoftDeletes(TestAddIndexes.java:1521)

Outside of fixing existing tests, I want to add some new tests, especially around cases where one of the merges fails, causing the whole API to fail (i.e. test that transactionality was retained).
Force-pushed edab6df to 84cfe00
I think this PR has reasonable test coverage and can come out of WIP now. Will continue to iterate on the PR comments and suggestions.
Is there a way to ensure all code paths in the random tests get executed? I want to run the tests that invoke
Alas, I don't think we have a way to ensure ALL paths get executed -- that would likely be combinatorially prohibitive? But sure would be nice to have some control over this. But we do have
Thanks @mikemccand. I figured that I had changed the exception signature for addIndexes() in my last revision. I was throwing a blanket MergePolicyException when addIndexes failed to merge all the readers in, and a lot of these tests were designed to look for specific exceptions that would've earlier gotten thrown from the SegmentMerger code running in the main addIndexes thread. This was causing the randomly invoked tests to fail. I changed that, to

I think this is better, because now the API throws the actual error seen by a thread doing the merge, instead of some blanket failure message.

In general, I have to say, I'm fascinated by the breadth of test cases around this API, and the code framework present to add more tests.
OK let's leave it as
Thanks @vigyasharma -- this is a super exciting change! It generalizes addIndexes to use MergePolicy/Scheduler so that incoming indices can be added concurrently. I think it's ready, but needs rebasing to resolve conflicts?
Thanks for reviewing these rather lengthy changes, @mikemccand! I've rebased this onto the latest main branch, and updated the doc string and exception message.
The build failure is on TestTaxonomyFacetAssociations.testFloatAssociationRandom. The failure (with randomization seed) is because of a mismatch in (SUM) aggregated multi-valued floats. I'm not sure if this is related to the PR's changes (I don't see any related APIs being invoked by the test). My hunch is that it is some floating point approximation error.

Failing Test:

./gradlew test --tests TestTaxonomyFacetAssociations.testFloatAssociationRandom \
  -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \
  -Dtests.timezone=Europe/Athens -Dtests.asserts=true -Dtests.file.encoding=UTF-8

Failure:

org.apache.lucene.facet.taxonomy.TestTaxonomyFacetAssociations > testFloatAssociationRandom FAILED
    java.lang.AssertionError: expected:<2409060.8> but was:<2409059.5>
        at __randomizedtesting.SeedInfo.seed([4DFBA8209AC82EB2:1EC25C9148A9BF04]:0)
        at junit@4.13.1/org.junit.Assert.fail(Assert.java:89)
        at junit@4.13.1/org.junit.Assert.failNotEquals(Assert.java:835)
        at junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:577)
        at junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:701)
        at org.apache.lucene.facet.taxonomy.TestTaxonomyFacetAssociations.validateFloats(TestTaxonomyFacetAssociations.java:445)
        at org.apache.lucene.facet.taxonomy.TestTaxonomyFacetAssociations.testFloatAssociationRandom(TestTaxonomyFacetAssociations.java:256)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        ...

Passing Test:

./gradlew test --tests TestTaxonomyFacetAssociations.testFloatAssociationRandom \
  -Dtests.slow=true -Dtests.locale=fr-VU -Dtests.timezone=Europe/Athens \
  -Dtests.asserts=true -Dtests.file.encoding=UTF-8
Thanks for digging into that failure @vigyasharma.
OK, so this is pre-existing! Could you open a separate spinoff issue and post the failure details there? It should not block this awesome change.
I restarted the
lucene/CHANGES.txt (Outdated)

@@ -89,6 +89,9 @@ New Features

Improvements
---------------------

* LUCENE-10216: Use MergePolicy to define and MergeScheduler to trigger the reader merges
Thanks for adding the CHANGES entry, but we usually append to the end of the section, not the start :)
Hi @vigyasharma -- I had a couple of questions about some changes to existing unit tests.

I think it would be worth implementing the new MergePolicy method in either MockRandomMergePolicy or AlcoholicMergePolicy, or maybe both, to exercise concurrent addIndexes across any tests today or future tests that use addIndexes?
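One way the suggestion could look, sketched with stand-in types (the real MockRandomMergePolicy lives in Lucene's test framework; this is only an illustration, with lists of strings in place of CodecReaders): randomly choose between one-reader-per-merge and a single merge of all readers, so tests exercise both the concurrent and the sequential paths.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class MockRandomAddIndexesMerges {
    // Randomly pick a merge clustering so randomized tests cover both the
    // maximally-concurrent (1-1) path and the single-big-merge path.
    static List<List<String>> findMerges(List<String> readers, Random random) {
        List<List<String>> spec = new ArrayList<>();
        if (random.nextBoolean()) {
            for (String reader : readers) {
                spec.add(List.of(reader)); // one merge per reader
            }
        } else {
            spec.add(List.copyOf(readers)); // everything in one merge
        }
        return spec;
    }
}
```

Either branch preserves every reader, so the resulting index is the same; only the merge concurrency differs.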
@@ -352,15 +352,15 @@ public void testAddIndexOnDiskFull() throws IOException {
        done = true;
      }

-    } catch (IllegalStateException | IOException e) {
+    } catch (IllegalStateException | IOException | MergePolicy.MergeException e) {
Hmm, can you give an example exception when we now throw MergePolicy.MergeException inside the try?
If the merge threads triggered for addIndexes readers fail, they throw a merge exception, which is silently recorded in that OneMerge object. After all threads join, we check if the merge passed on all threads, and if not, we rethrow this MergeException to surface the actual cause of the API failure. This is done here. If all threads passed, we proceed to register the new segment info.

testAddIndexOnDiskFull() simulates this by randomly filling up the disk and causing these merge threads to fail.
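The flow described above — run the merges concurrently, record any failure, and after all threads join either register the new segments or rethrow the real cause — might be sketched like this. Callable tasks stand in for merge threads and strings for segment names; this is a hypothetical model of the all-or-nothing behavior, not the actual IndexWriter code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TransactionalMerges {
    // Runs all "merges" concurrently. If any merge failed, rethrows that
    // merge's actual exception and registers nothing (all-or-nothing).
    static List<String> runMerges(List<Callable<String>> merges) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, merges.size()));
        try {
            List<Future<String>> futures = pool.invokeAll(merges); // waits for every merge to finish
            List<String> newSegments = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    newSegments.add(future.get());
                } catch (ExecutionException e) {
                    // surface the actual error seen by the merge thread,
                    // not a blanket "addIndexes failed" message
                    throw new RuntimeException("merge failed", e.getCause());
                }
            }
            return newSegments; // only now would the segments be registered
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A disk-full test in this model would make one Callable throw an IOException and assert that no segments were registered and that the IOException is the surfaced cause.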
Great idea @mikemccand, thanks for the input, and for reviewing this PR. I added the new merge policy to
Updated the

Kudos to Lucene's randomized testing for catching such edge cases!
Aborted pending merges were slipping through the merge exception check in the API, and getting caught later in the RuntimeException check.

I decided to go with the first approach of wrapping critical sections with a

Also rebased on the latest main.
Thanks @vigyasharma -- I'll try re-beasting. This is on the nightly Lucene benchmarking box (128 cores).
My beasting has not uncovered any more failures! I think this is ready! I'll try to merge soon, just to
…decReader[]) merges. (apache#633)

* Use merge policy and merge scheduler to run addIndexes merges
* wrapped reader does not see deletes - debug
* Partially fixed tests in TestAddIndexes
* Use writer object to invoke addIndexes merge
* Use merge object info
* Add javadocs for new methods
* TestAddIndexes passing
* verify field info schemas upfront from incoming readers
* rename flag to track pooled readers
* Keep addIndexes API transactional
* Maintain transactionality - register segments with iw after all merges complete
* fix checkstyle
* PR comments
* Fix pendingDocs - numDocs mismatch bug
* Tests with 1-1 merges and partial merge failures
* variable renaming and better comments
* add test for partial merge failures. change tests to use 1-1 findmerges
* abort pending merges gracefully
* test null and empty merge specs
* test interim files are deleted
* test with empty readers
* test cascading merges triggered
* remove nocommits
* gradle check errors
* remove unused line
* remove printf
* spotless apply
* update TestIndexWriterOnDiskFull to accept mergeException from failing addIndexes calls
* return singleton reader mergespec in NoMergePolicy
* rethrow exceptions seen in merge threads on failure
* spotless apply
* update test to new exception type thrown
* spotlessApply
* test for maxDoc limit in IndexWriter
* spotlessApply
* Use DocValuesIterator instead of DocValuesFieldExistsQuery for counting soft deletes
* spotless apply
* change exception message for closed IW
* remove non-essential comments
* update api doc string
* doc string update
* spotless
* Changes file entry
* simplify findMerges API, add 1-1 merges to MockRandomMergePolicy
* update merge policies to new api
* remove unused imports
* spotless apply
* move changes entry to end of list
* fix testAddIndicesWithSoftDeletes
* test with 1-1 merge policy always enabled
* please spotcheck
* tidy
* test - never use 1-1 merge policy
* use 1-1 merge policy randomly
* Remove concurrent addIndexes findMerges from MockRandomMergePolicy
* Bug Fix: RuntimeException in addIndexes. Aborted pending merges were slipping through the merge exception check in API, and getting caught later in the RuntimeException check.
* tidy
* Rebase on main. Move changes to 10.0
* Synchronize IW.AddIndexesMergeSource on outer class IW object
* tidy
…decReader[]) merges. (#1051)

Use merge policy and merge scheduler to run addIndexes merges. This is a backport of the following commits from main:
* LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges. (#633)
* LUCENE-10648: Fix failures in TestAssertingPointsFormat.testWithExceptions (#1012)
Description
The addIndexes(CodecReader...) API today merges all provided readers into a single merge, in one large blocking call. We want to add concurrency here by invoking multiple parallel merges on subsets of readers, in a way that is configurable by users. The merged segments created can later be merged further in the regular, non-blocking background merges triggered by Lucene. Currently, users are responsible for managing their API run times by invoking it multiple times with subsets of readers.
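Because the clustering of readers into merges is user-configurable, a policy could, for instance, split the incoming readers into fixed-size groups so several smaller merges run in parallel. A hypothetical sketch with lists of strings standing in for Lucene's CodecReader and MergeSpecification types (the groupSize parameter is illustrative, not part of any real API):

```java
import java.util.ArrayList;
import java.util.List;

public class GroupedAddIndexesPolicy {
    // Cluster incoming readers into groups of `groupSize`; each group
    // becomes one merge that the scheduler can run concurrently.
    static List<List<String>> findMerges(List<String> readers, int groupSize) {
        List<List<String>> spec = new ArrayList<>();
        for (int i = 0; i < readers.size(); i += groupSize) {
            spec.add(List.copyOf(readers.subList(i, Math.min(i + groupSize, readers.size()))));
        }
        return spec;
    }
}
```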
JIRA - https://issues.apache.org/jira/browse/LUCENE-10216
Solution
In this change, we leverage the configured MergeScheduler to invoke all merges required by the addIndexes API. MergePolicy also exposes a findMerges(List<CodecReader> readers) API that users can override to configure how their readers should be clustered into merges. We add a default implementation that retains the current behavior of adding all readers into a single merge.

Tests
Checklist
Please review the following and check all that apply:
- main branch
- ./gradlew check