TestLucene90DocValuesFormat fails with ArrayIndexOutOfBoundsException #13805

Closed
ChrisHegarty opened this issue Sep 18, 2024 · 21 comments · Fixed by #13812
Labels
blocker A severe issue that should be resolved before the release specified in its Milestone.

@ChrisHegarty
Contributor

ChrisHegarty commented Sep 18, 2024

ERROR: The following test(s) have failed:
  - org.apache.lucene.codecs.lucene90.TestLucene90DocValuesFormat.testSparseDocValuesVsStoredFields (:lucene:core)
    Test output: /opt/buildkite-agent/builds/bk-agent-prod-gcp-1726674638633683811/elastic/apache-lucene-nightly/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.codecs.lucene90.TestLucene90DocValuesFormat.txt
    Reproduce with: gradlew :lucene:core:test --tests "org.apache.lucene.codecs.lucene90.TestLucene90DocValuesFormat.testSparseDocValuesVsStoredFields" -Ptests.jvms=12 -Ptests.jvmargs= -Ptests.seed=175AB2293A24B66E -Ptests.nightly=true -Ptests.gui=true -Ptests.file.encoding=UTF-8 -Ptests.vectorsize=128 -Ptests.forceintegervectors=true
  - org.apache.lucene.backward_codecs.lucene80.TestBestSpeedLucene80DocValuesFormat.testSparseDocValuesVsStoredFields (:lucene:backward-codecs)
    Test output: /opt/buildkite-agent/builds/bk-agent-prod-gcp-1726674638633683811/elastic/apache-lucene-nightly/lucene/backward-codecs/build/test-results/test/outputs/OUTPUT-org.apache.lucene.backward_codecs.lucene80.TestBestSpeedLucene80DocValuesFormat.txt
    Reproduce with: gradlew :lucene:backward-codecs:test --tests "org.apache.lucene.backward_codecs.lucene80.TestBestSpeedLucene80DocValuesFormat.testSparseDocValuesVsStoredFields" -Ptests.jvms=12 -Ptests.jvmargs= -Ptests.seed=175AB2293A24B66E -Ptests.nightly=true -Ptests.gui=true -Ptests.file.encoding=UTF-8 -Ptests.vectorsize=128 -Ptests.forceintegervectors=true
 >     java.lang.ArrayIndexOutOfBoundsException: Index 3 out of bounds for length 3
   >         at __randomizedtesting.SeedInfo.seed([175AB2293A24B66E:43816E3AC0A89021]:0)
   >         at org.apache.lucene.util.packed.Packed64.get(Packed64.java:80)
   >         at org.apache.lucene.index.OrdinalMap$1.get(OrdinalMap.java:379)
   >         at org.apache.lucene.codecs.DocValuesConsumer$7$1.nextOrd(DocValuesConsumer.java:946)
   >         at org.apache.lucene.codecs.lucene90.Lucene90DocValuesConsumer$4$1.nextDoc(Lucene90DocValuesConsumer.java:808)
   >         at org.apache.lucene.codecs.lucene90.Lucene90DocValuesConsumer.writeValues(Lucene90DocValuesConsumer.java:201)
   >         at org.apache.lucene.codecs.lucene90.Lucene90DocValuesConsumer.doAddSortedNumericField(Lucene90DocValuesConsumer.java:705)
   >         at org.apache.lucene.codecs.lucene90.Lucene90DocValuesConsumer.addSortedSetField(Lucene90DocValuesConsumer.java:770)
   >         at org.apache.lucene.codecs.DocValuesConsumer.mergeSortedSetField(DocValuesConsumer.java:853)
   >         at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:148)
   >         at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:152)
   >         at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:188)
   >         at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:314)
   >         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:149)
   >         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5292)
   >         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4758)
   >         at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6581)
   >         at org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:38)
   >         at org.apache.lucene.index.IndexWriter.executeMerge(IndexWriter.java:2327)
   >         at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2322)
   >         at org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:6033)
   >         at org.apache.lucene.index.IndexWriter.maybeProcessEvents(IndexWriter.java:6023)
   >         at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1562)
   >         at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1847)
   >         at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1487)
   >         at org.apache.lucene.tests.index.RandomIndexWriter.addDocument(RandomIndexWriter.java:224)
   >         at org.apache.lucene.codecs.lucene90.TestLucene90DocValuesFormat.doTestSparseDocValuesVsStoredFields(TestLucene90DocValuesFormat.java:215)
   >         at org.apache.lucene.codecs.lucene90.TestLucene90DocValuesFormat.testSparseDocValuesVsStoredFields(TestLucene90DocValuesFormat.java:169)
   >         at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
   >         at java.base/java.lang.reflect.Method.invoke(Method.java:580)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
   >         at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
   >         at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
   >         at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   >         at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)
   >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
   >         at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
   >         at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
   >         at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
   >         at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
   >         at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
   >         at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   >         at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   >         at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   >         at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
   >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)
   >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   >         at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
   >         at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
   >         at java.base/java.lang.Thread.run(Thread.java:1583)
@ChrisHegarty ChrisHegarty added this to the 9.12.0 milestone Sep 18, 2024
@benwtrent
Member

Git bisect puts the blame at: 6634b41

#13686

@benwtrent
Member

git bisect might be lying, I don't see how that PR could cause this failure :(

@iverase
Contributor

iverase commented Sep 18, 2024

Probably we are not remapping the field ordinal properly when merging segments.

@iverase
Contributor

iverase commented Sep 18, 2024

See here:

// NOTE: we cannot just use the merged fieldInfo.number (instead of resolving to

@benwtrent benwtrent added blocker A severe issue that should be resolved before the release specified in its Milestone. labels Sep 18, 2024
@jpountz
Contributor

jpountz commented Sep 19, 2024

Argh, I remember carefully checking whether this PR could cause issues due to mismatched field infos, but apparently I missed something.

@rmuir
Member

rmuir commented Sep 19, 2024

Can we just revert the change for now? it does two things at once... one of those is using field.number instead of field.name which is historically unsafe: it was always the big risk of bulk merge.

It can't be done like this in all situations, and doing it across versions like this is especially YOLO and asking for corruption IMO. This is why stored fields never bulk merge across versions:

|| ((Lucene90CompressingStoredFieldsReader) candidate).getVersion() != VERSION_CURRENT) {
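
A minimal sketch of the risk described above, assuming a deliberately simplified model in which each segment numbers its fields in the order it first saw them. The class `FieldNumberMismatchSketch` and its helpers are hypothetical and do not use Lucene's `FieldInfos` API; the point is only that a field number is meaningful within a single segment, while a field name is stable across segments.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of per-segment field numbering.
// It does not use Lucene's FieldInfos API.
final class FieldNumberMismatchSketch {

    // Builds a name -> number map, numbering fields in the order given.
    static Map<String, Integer> segment(String... namesInNumberOrder) {
        Map<String, Integer> numberByName = new LinkedHashMap<>();
        for (int i = 0; i < namesInNumberOrder.length; i++) {
            numberByName.put(namesInNumberOrder[i], i);
        }
        return numberByName;
    }

    // Returns the field that owns `number` in the given segment, or null if none does.
    static String nameForNumber(Map<String, Integer> seg, int number) {
        for (Map.Entry<String, Integer> e : seg.entrySet()) {
            if (e.getValue() == number) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Integer> merged = segment("id", "title", "price");
        // A segment that saw the fields in a different order.
        Map<String, Integer> incoming = segment("price", "id");

        // Unsafe: reuse the merged number against the incoming segment.
        int mergedNumberOfId = merged.get("id");
        System.out.println("by number: " + nameForNumber(incoming, mergedNumberOfId));

        // Safe: resolve by name within the incoming segment first.
        int ownNumberOfId = incoming.get("id");
        System.out.println("by name:   " + nameForNumber(incoming, ownNumberOfId));
    }
}
```

In this toy setup the lookup by number lands on `price` while the lookup by name lands on `id`, which is the kind of silent cross-field mix-up the comment warns about.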

@ChrisHegarty
Contributor Author

I'm going to try reverting 6634b41.

@benwtrent
Member

@ChrisHegarty this makes me worried about all the other changes that swap field names for field numbers as well.

I am wondering if we should revert all of them, there are multiple PRs.

@ChrisHegarty
Contributor Author

The revert fixes the failures we see here and the other related test failures, seen in #13807 #13808.

@rmuir
Member

rmuir commented Sep 19, 2024

sounds like the safe bet is to back out any changes messing around with fieldinfos on merge.

Sorry for the short explanation, there is a long history of super-sneaky corruption bugs like this. always happening on some corner-case such as addIndexes(reader) or across different versions, or something like that. When they happen on merge it makes debugging them especially difficult. Mixing up data across fields because of field numbers happened more than once.

This is why, if you look at bulk merge code, you see crazy sysprop escape hatches and stuff like that: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingStoredFieldsWriter.java#L492-L508
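
For readers unfamiliar with the pattern, here is a rough sketch of what a system-property escape hatch around bulk merge can look like. The property name `example.enableBulkMerge` and the class `BulkMergeEscapeHatchSketch` are made up for illustration and are not taken from the linked Lucene source; the version comparison mirrors the idea of never bulk merging across format versions.

```java
// Hypothetical sketch of a system-property escape hatch.
// The property name below is made up for illustration.
final class BulkMergeEscapeHatchSketch {

    static final String PROP = "example.enableBulkMerge"; // hypothetical name

    // Bulk merge stays on unless the property is set to something other than "true".
    static boolean bulkMergeEnabled() {
        return Boolean.parseBoolean(System.getProperty(PROP, "true"));
    }

    // Picks a merge strategy: bulk copy only when enabled and the format versions match.
    static String chooseStrategy(int segmentVersion, int currentVersion) {
        if (bulkMergeEnabled() && segmentVersion == currentVersion) {
            return "bulk";
        }
        return "doc-by-doc";
    }

    public static void main(String[] args) {
        System.out.println(chooseStrategy(1, 1));
        // Operators can switch the optimization off without a code change.
        System.setProperty(PROP, "false");
        System.out.println(chooseStrategy(1, 1));
    }
}
```

The slow path is always correct, so the switch only trades speed for safety when a bulk-copy bug is suspected.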

@rmuir
Member

rmuir commented Sep 19, 2024

we didn't even have bulk merge in lucene at all for a couple of years because of field-number bugs like this. got bit too many times.

@ChrisHegarty
Contributor Author

I filed a meta issue to better track the reverts, #13809

@jpountz
Contributor

jpountz commented Sep 19, 2024

I don't mind reverting but I would also like to fix the root cause as this change only exposed an existing bug: someone is calling a doc-values producer with the wrong FieldInfo object.

@rmuir
Member

rmuir commented Sep 19, 2024

yeah it would be best to improve the tests: it is not good that it took this test, run many many times, to find it.

@jpountz
Contributor

jpountz commented Sep 19, 2024

I found the bug, it's the slow composite reader wrapper which is at fault here. I'll look into improving tests to detect such issues.

Separately, we may want to consider changing the DocValuesProducer API to take a String rather than a FieldInfo, like e.g. points, so that it is not tempted to trust the caller to resolve the FieldInfo object correctly.
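
As a rough illustration of that API idea (a sketch only, not Lucene's `DocValuesProducer`), the hypothetical `Producer` below accepts a field name and resolves its own field number internally, so a caller cannot hand it a number from another segment. All class and method names here are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not Lucene's DocValuesProducer API.
final class NameKeyedProducerSketch {

    // A per-segment producer whose data is stored under its own field numbers.
    static final class Producer {
        private final Map<String, Integer> numberByName = new HashMap<>();
        private final Map<Integer, long[]> valuesByNumber = new HashMap<>();

        void addField(String name, int number, long... values) {
            numberByName.put(name, number);
            valuesByNumber.put(number, values);
        }

        // Callers pass only the name; the producer resolves its own number.
        long[] getNumeric(String fieldName) {
            Integer number = numberByName.get(fieldName);
            return number == null ? null : valuesByNumber.get(number);
        }
    }

    public static void main(String[] args) {
        Producer segment = new Producer();
        segment.addField("price", 0, 10L, 20L);
        segment.addField("id", 1, 7L, 8L);
        // No field number crosses the API boundary.
        System.out.println(segment.getNumeric("id")[0]);
    }
}
```

The trade-off is an extra name lookup per call in exchange for removing a whole class of caller mistakes.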

@jpountz
Contributor

jpountz commented Sep 19, 2024

I found the root cause, it's here:

values = docValuesProducer.getSorted(fieldInfo);

The producer is called on fieldInfo instead of readerFieldInfo, as the other doc values types do. I'm working on tests that would have uncovered this problem.
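
To make the difference concrete, here is a simplified sketch of the two call patterns. Only the names `fieldInfo`, `readerFieldInfo`, and `getSorted` come from the comment above; the `FieldInfo` and `Producer` classes are hypothetical stand-ins, not Lucene's types, and the real merge code is more involved.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the merge-time lookup; not Lucene's actual classes.
final class MergedVsReaderFieldInfoSketch {

    static final class FieldInfo {
        final String name;
        final int number;

        FieldInfo(String name, int number) {
            this.name = name;
            this.number = number;
        }
    }

    // A segment-level producer that trusts the number on the FieldInfo it is given.
    static final class Producer {
        final Map<String, FieldInfo> fieldInfos = new HashMap<>();
        final Map<Integer, String> sortedByNumber = new HashMap<>();

        void addField(String name, int number, String payload) {
            fieldInfos.put(name, new FieldInfo(name, number));
            sortedByNumber.put(number, payload);
        }

        String getSorted(FieldInfo fieldInfo) {
            return sortedByNumber.get(fieldInfo.number);
        }
    }

    // Buggy pattern: pass the merged FieldInfo straight through.
    static String buggy(Producer producer, FieldInfo fieldInfo) {
        return producer.getSorted(fieldInfo);
    }

    // Fixed pattern: resolve the reader's own FieldInfo by name first.
    static String fixed(Producer producer, FieldInfo fieldInfo) {
        FieldInfo readerFieldInfo = producer.fieldInfos.get(fieldInfo.name);
        return readerFieldInfo == null ? null : producer.getSorted(readerFieldInfo);
    }

    public static void main(String[] args) {
        Producer segment = new Producer();
        segment.addField("color", 0, "sorted values of color");
        segment.addField("size", 1, "sorted values of size");

        // In the merged numbering, "size" happens to be field 0.
        FieldInfo merged = new FieldInfo("size", 0);
        System.out.println(buggy(segment, merged));
        System.out.println(fixed(segment, merged));
    }
}
```

In this model the buggy call returns the data for `color` when asked for `size`, and the fixed call returns the data for `size`.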

@ChrisHegarty
Contributor Author

Ok, reverts are prepared. @jpountz you wanna fix (and not revert), or revert for now?

@jpountz
Contributor

jpountz commented Sep 19, 2024

Give me some time to see how the fix and tests look, and let's think about whether/what to revert later on? I expect to have something by end of day. @ChrisHegarty Feel free to cut the branch in the meantime, we can backport to the 9.12 branch if necessary?

@ChrisHegarty
Contributor Author

Let's postpone the 9_12 branch cut until tomorrow, pending the outcome of this.

@bugmakerrrrrr
Contributor

Separately, we may want to consider changing the DocValuesProducer API to take a String rather than a FieldInfo, like e.g. points, so that it is not tempted to trust the caller to resolve the FieldInfo object correctly.

@jpountz +1, We have encountered the related issue in NormsProducer, and I worked around it by resolving the field info inside the NormsProducer.

jpountz added a commit to jpountz/lucene that referenced this issue Sep 19, 2024
This improves testing of mismatched field numbers by
 - improving `AssertingDocValuesProducer` to detect mismatched field numbers,
 - introducing a `MismatchedCodecReader` to actually test mismatched field
   numbers on `DocValuesProducer` (a `MismatchedLeafReader` wrapping a
   `SlowCodecReaderWrapper` doesn't work since `SlowCodecReaderWrapper` implicitly
   resolves the correct `FieldInfo` object),
 - introducing an explicit test for mismatched field numbers in
   `BaseDocValuesFormatTestCase`.

These new tests uncovered a bug when merging sorted doc values, which would
call the underlying doc values producer with the merged field info.

Closes apache#13805
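
The kind of check the commit message describes for an asserting wrapper could look roughly like the sketch below. This is an assumption-laden illustration, not the actual `AssertingDocValuesProducer` code: `AssertingProducerSketch`, its `FieldInfo` class, and `checkField` are hypothetical names, and the real implementation may verify different properties.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an asserting wrapper; not the actual AssertingDocValuesProducer.
final class AssertingProducerSketch {

    static final class FieldInfo {
        final String name;
        final int number;

        FieldInfo(String name, int number) {
            this.name = name;
            this.number = number;
        }
    }

    static final class AssertingProducer {
        // The field infos this segment was written with, keyed by name.
        private final Map<String, FieldInfo> own = new HashMap<>();

        void register(FieldInfo fieldInfo) {
            own.put(fieldInfo.name, fieldInfo);
        }

        // Fails fast if the caller passes a FieldInfo whose number differs from this segment's.
        void checkField(FieldInfo passed) {
            FieldInfo expected = own.get(passed.name);
            if (expected == null) {
                throw new IllegalStateException("unknown field: " + passed.name);
            }
            if (expected.number != passed.number) {
                throw new IllegalStateException(
                    "field " + passed.name + ": expected number " + expected.number
                        + " but got " + passed.number);
            }
        }
    }

    public static void main(String[] args) {
        AssertingProducer producer = new AssertingProducer();
        producer.register(new FieldInfo("size", 1));
        producer.checkField(new FieldInfo("size", 1)); // passes silently
        try {
            producer.checkField(new FieldInfo("size", 0)); // merged numbering leaked in
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

A check like this turns a silent read of the wrong field into an immediate test failure at the call site.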
@jpountz
Contributor

jpountz commented Sep 19, 2024

I have a fix and tests that would have found the bug at #13812.

jpountz added a commit that referenced this issue Sep 20, 2024
This improves testing of mismatched field numbers by
 - improving `AssertingDocValuesProducer` to detect mismatched field numbers,
 - introducing a `MismatchedCodecReader` to actually test mismatched field
   numbers on `DocValuesProducer` (a `MismatchedLeafReader` wrapping a
   `SlowCodecReaderWrapper` doesn't work since `SlowCodecReaderWrapper` implicitly
   resolves the correct `FieldInfo` object),
 - introducing an explicit test for mismatched field numbers for doc values, points,
   postings and knn vectors.

These new tests uncovered a bug when merging sorted doc values, which would
call the underlying doc values producer with the merged field info.

Closes #13805