[BUG] Unable to merge results from shards #875

Closed
eemmiirr opened this issue Aug 29, 2024 · 13 comments · Fixed by #877
Comments

@eemmiirr

Describe the bug

When performing a hybrid search in OpenSearch 2.16 that combines a lexical and a kNN query, you may encounter the error "cannot merge top docs because it does not have enough elements". Downgrading to version 2.15 returns results but may omit results from one shard. Running a force merge before the search, or disabling concurrent segment search, resolves the issue. The problem is hard to reproduce because it only occurs when tests run as part of a suite; a single test run in isolation on an empty index consistently succeeds. The issue seems to be related to concurrent segment search.
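
For reference, the two workarounds mentioned above could be issued roughly as follows. This is a minimal sketch and not taken from the thread: the class and method names are hypothetical, the base URL and index name are placeholders, and the setting name search.concurrent_segment_search.enabled and the _forcemerge endpoint are assumed from the OpenSearch 2.x documentation. The code only builds the requests; sending them (and any authentication) is left to the caller.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper; builds the two workaround requests without sending them.
final class WorkaroundRequests {

    // PUT /_cluster/settings to turn concurrent segment search off cluster-wide.
    // Assumption: the 2.x setting name is search.concurrent_segment_search.enabled.
    static HttpRequest disableConcurrentSegmentSearch(String baseUrl) {
        String body = "{\"persistent\":{\"search.concurrent_segment_search.enabled\":false}}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_cluster/settings"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    // POST /<index>/_forcemerge?max_num_segments=1 to collapse the index into one segment.
    static HttpRequest forceMerge(String baseUrl, String index) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_forcemerge?max_num_segments=1"))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
    }

    public static void main(String[] args) {
        // Placeholder host and index name.
        System.out.println(disableConcurrentSegmentSearch("http://localhost:9200").uri());
        System.out.println(forceMerge("http://localhost:9200", "my-index").uri());
    }
}
```

Either request can be passed to java.net.http.HttpClient#send. Both are workarounds that avoid the multi-segment code path rather than fix it.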

Related component

Search

Expected behavior

Returns results from all shards and segments

@eemmiirr added the bug and untriaged labels Aug 29, 2024
@jainankitk

@jed326 - Might be related to concurrent segment search. Tagging it to Concurrent Search

@jed326

jed326 commented Aug 29, 2024

Thanks for reporting this @eemmiirr. The exception you're seeing comes from here in the neural-search plugin:

public T[] merge(final T[] sourceScoreDocs, final T[] newScoreDocs, final Comparator<T> comparator, final boolean isSortEnabled) {
    if (Objects.requireNonNull(sourceScoreDocs, "score docs cannot be null").length < MIN_NUMBER_OF_ELEMENTS_IN_SCORE_DOC
        || Objects.requireNonNull(newScoreDocs, "score docs cannot be null").length < MIN_NUMBER_OF_ELEMENTS_IN_SCORE_DOC) {
        throw new IllegalArgumentException("cannot merge top docs because it does not have enough elements");
    }

Could you please share some sample queries (and index mappings, if possible) where you are encountering this issue? Depending on how documents are distributed across segments, the same query may return different results because of different scoring, and this is especially likely when there are only a few documents per segment. With these details we can better determine whether this is truly a bug or an issue with sparse data in the test.

@jed326

jed326 commented Aug 29, 2024

@opensearch-project/admin could you help transfer this to the neural-search repo for now? Let's start the investigation on that end.

FYI @sohami @Gankris96

@eemmiirr
Author

@jed326 I'll try to write an example with tests tomorrow to showcase the error.

@jed326

jed326 commented Aug 30, 2024

While we're waiting on @opensearch-project/admin to transfer the issue, @navneet1v @martin-gaievski could you help take a look at this as well?

@prudhvigodithi transferred this issue from opensearch-project/OpenSearch on Aug 30, 2024
@navneet1v
Collaborator

Adding @vibrantvarun

@eemmiirr
Author

eemmiirr commented Aug 30, 2024

@jed326 I created a showcase project that triggers the bug. You can check it out here: https://github.com/eemmiirr/opensearch-concurrent-segment-search-bug. All details can be found in the README.

@jed326

jed326 commented Aug 30, 2024

Thanks @eemmiirr. Looking at https://github.com/eemmiirr/opensearch-concurrent-segment-search-bug/blob/main/src/test/kotlin/com/github/eemmiirr/osshowcase/OpenSearchRepositoryTest.kt#L15-L36, the test case uses only 3 documents, which is the same value as the threshold that is checked before the exception you're seeing is thrown:

private static final int MIN_NUMBER_OF_ELEMENTS_IN_SCORE_DOC = 3;

It seems that whenever your index has more than 1 segment, you will hit that condition. You can quickly verify this by calling refreshIndexes() after each document you index and checking whether the test then fails 100% of the time.

I'll let @vibrantvarun or one of the other neural plugin folks confirm if this is the expected behavior.

@eemmiirr
Author

@jed326 I'm not sure I follow. Why would the number of documents matter? The test succeeds multiple times under the same conditions, but then it suddenly fails without any changes in those conditions. Regardless, I added a test with a configurable number of documents, and as expected, it also fails.

@martin-gaievski
Member

@eemmiirr I tried the Kotlin-based project you shared to reproduce the issue (https://github.com/eemmiirr/opensearch-concurrent-segment-search-bug), and my results are a bit different from yours.

I do not see the error "cannot merge top docs because it does not have enough elements". However, I do see both test cases failing, for example the one with 3 docs fails with the following:

OpenSearchRepositoryTest > trigger concurrent segment search bug - 3 docs()
       org.opentest4j.AssertionFailedError at OpenSearchRepositoryTest.kt:35

I wonder how you are getting the error you mentioned. Is something missing from the steps or the setup?

I also tried with a standalone OpenSearch 2.16 cluster, and I'm not seeing the issue. That seems expected, since you mentioned that the issue is transient and that you've seen it only with a certain combination of steps.

I also agree with your point that the number of documents in the index shouldn't matter. The number 3 is specific to the internal format we use for hybrid query results: we need 1 header element, 1 delimiter for the first query, and 1 footer element.
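
To make the header/delimiter/footer layout concrete, here is a minimal sketch of how such a flattened result could look, and why even a result with zero hits still has three elements. It is an illustration only, not the plugin's code: the class name, the sentinel values, and the use of a plain float array instead of Lucene score docs are all assumptions made for this example. Only the constant name, its value of 3, and the exception message come from the snippets quoted in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the layout: START, (DELIMITER, scores...) per sub-query, STOP.
final class HybridLayoutSketch {

    // Assumed sentinel values; the real marker values are not shown in this thread.
    static final float START_STOP = -1.0f;
    static final float DELIMITER = -2.0f;

    // Name and value as quoted earlier in the thread.
    static final int MIN_NUMBER_OF_ELEMENTS_IN_SCORE_DOC = 3;

    // Flattens the per-sub-query scores into a single array in the layout described above.
    static float[] encode(List<float[]> subQueryScores) {
        List<Float> out = new ArrayList<>();
        out.add(START_STOP);                      // header
        for (float[] scores : subQueryScores) {
            out.add(DELIMITER);                   // one delimiter per sub-query
            for (float score : scores) {
                out.add(score);
            }
        }
        out.add(START_STOP);                      // footer
        float[] result = new float[out.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = out.get(i);
        }
        return result;
    }

    // Same length check as the quoted merge() guard, applied to the simplified arrays.
    static void requireMergeable(float[] source, float[] incoming) {
        if (source.length < MIN_NUMBER_OF_ELEMENTS_IN_SCORE_DOC
            || incoming.length < MIN_NUMBER_OF_ELEMENTS_IN_SCORE_DOC) {
            throw new IllegalArgumentException("cannot merge top docs because it does not have enough elements");
        }
    }

    public static void main(String[] args) {
        float[] noHits = encode(List.of(new float[0]));
        float[] twoSubQueries = encode(List.of(new float[] {0.9f, 0.4f}, new float[] {0.7f}));
        System.out.println(noHits.length);        // header + delimiter + footer
        System.out.println(twoSubQueries.length);
        requireMergeable(noHits, twoSubQueries);  // passes: both arrays are well-formed
    }
}
```

Under these assumptions, an array shorter than three elements cannot be a well-formed hybrid result at all, which would suggest the guard fires on a malformed or unformatted input rather than on a small document count.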

@eemmiirr
Author

eemmiirr commented Sep 3, 2024

Hi @martin-gaievski. Thanks for having a look at it. I improved the tests so that they collect the errors and throw an exception. Both tests now fail with "cannot merge top docs because it does not have enough elements". Can you please pull the changes and re-run the tests?

@martin-gaievski
Member

I tried with your latest code @eemmiirr, and it's failing on the assertion for both test cases:

org.opentest4j.AssertionFailedError: 
expected: 
  ["6da959c0-0fcd-4b09-8023-481edf089680",
      "905a723a-7237-4208-84e1-2d1ee33c3c3b",
      "4e2ace3c-5bf1-497e-b051-7d0ac2eafc9c"]
 but was: 
  ["6da959c0-0fcd-4b09-8023-481edf089680", "4e2ace3c-5bf1-497e-b051-7d0ac2eafc9c"]
	at app//com.github.eemmiirr.osshowcase.OpenSearchRepositoryTest.trigger concurrent segment search bug - 3 docs(OpenSearchRepositoryTest.kt:35)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at app//org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:728)
	at app//org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
	at app//org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
	at app//org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:156)
	at app//org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:147)
	at app//org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:86)
	at app//org.junit.jupiter.engine.execution.InterceptingExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(InterceptingExecutableInvoker.java:103)
	at app//org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.lambda$invoke$0(InterceptingExecutableInvoker.java:93)
	at app//org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
	at app//org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
	at app//org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
	at app//org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
	at app//org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:92)
	at app//org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:86)
	at app//org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$7(TestMethodTestDescriptor.java:218)
	at app//org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
	at app//org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:214)
	at app//org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:139)
	at app//org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:69)
	at app//org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:151)
	at app//org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
	at app//org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
	at app//org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
	at app//org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
	at app//org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
	at app//org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
	at app//org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
	at app//org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41)
	at app//org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:155)
	at app//org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
	at app//org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
	at app//org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
	at app//org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
	at app//org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
	at app//org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
	at app//org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
	at app//org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41)
	at app//org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:155)
	at app//org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
	at app//org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
	at app//org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
	at app//org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
	at app//org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
	at app//org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
	at app//org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
	at app//org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.submit(SameThreadHierarchicalTestExecutorService.java:35)
	at app//org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.execute(HierarchicalTestExecutor.java:57)
	at app//org.junit.platform.engine.support.hierarchical.HierarchicalTestEngine.execute(HierarchicalTestEngine.java:54)
	at app//org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:107)
	at app//org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:88)
	at app//org.junit.platform.launcher.core.EngineExecutionOrchestrator.lambda$execute$0(EngineExecutionOrchestrator.java:54)
	at app//org.junit.platform.launcher.core.EngineExecutionOrchestrator.withInterceptedStreams(EngineExecutionOrchestrator.java:67)
	at app//org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:52)
	at app//org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:114)
	at app//org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:86)
	at app//org.junit.platform.launcher.core.DefaultLauncherSession$DelegatingLauncher.execute(DefaultLauncherSession.java:86)
	at org.gradle.api.internal.tasks.testing.junitplatform.JUnitPlatformTestClassProcessor$CollectAllTestClassesExecutor.processAllTestClasses(JUnitPlatformTestClassProcessor.java:119)
	at org.gradle.api.internal.tasks.testing.junitplatform.JUnitPlatformTestClassProcessor$CollectAllTestClassesExecutor.access$000(JUnitPlatformTestClassProcessor.java:94)
	at org.gradle.api.internal.tasks.testing.junitplatform.JUnitPlatformTestClassProcessor.stop(JUnitPlatformTestClassProcessor.java:89)
	at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.stop(SuiteTestClassProcessor.java:62)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
	at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
	at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33)
	at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:94)
	at jdk.proxy1/jdk.proxy1.$Proxy2.stop(Unknown Source)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker$3.run(TestWorker.java:193)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.executeAndMaintainThreadName(TestWorker.java:129)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:100)
	at org.gradle.api.internal.tasks.testing.worker.TestWorker.execute(TestWorker.java:60)
	at org.gradle.process.internal.worker.child.ActionExecutionWorker.execute(ActionExecutionWorker.java:56)
	at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:119)
	at org.gradle.process.internal.worker.child.SystemApplicationClassLoaderWorker.call(SystemApplicationClassLoaderWorker.java:66)
	at app//worker.org.gradle.process.internal.worker.GradleWorkerMain.run(GradleWorkerMain.java:69)
	at app//worker.org.gradle.process.internal.worker.GradleWorkerMain.main(GradleWorkerMain.java:74)

and

trigger concurrent segment search bug - multiple docs()
org.opentest4j.AssertionFailedError: 
expected: 
  ["d6282a09-a5d1-4cbc-ba8a-02c88f2bcdf0",
      "2c878556-6b4f-48c4-b8d4-e9b1392460e0",
      "f5828c4b-e669-4a8a-93ee-ea90ebccb328",
      "6ca6e65d-379a-43a6-8428-b05baeef9d8d",
      "d03add21-5ef7-4fe8-908e-c5e7ccc66591",
      "74ee6071-4181-4711-86db-e3192dc5887f",
      "426732a2-c155-447d-ba0c-9cc0daeed0fb",
      "8ef0abb1-60d3-4442-8272-eacbd7efb5c7",
      "4d75a936-eb9c-469e-89d7-52f3b24c4fd3",
      "492d48fe-3569-4d0f-832c-79867d903204",
      "4d8d9041-48ed-4c39-be9a-884592c55720",
      "91d39b2f-8c6b-452d-9650-13fb020f97cc",
      "fed28d25-4559-4834-bf8b-bb22dc36a33b",
      "98289b98-42ee-47d1-b238-1865941826b8",
      "da651546-4039-4713-820e-14f923c636e5",
      "207e7814-f28b-420d-860a-3876b871e872",
      "cd9fac00-52b4-48c2-915b-80c11558603a",
      "98f90cf3-4ba5-45cd-b75a-ab2b94522747",
      "caff1e1c-6e05-4112-9b5f-bdf361e21111",
      "5405765d-23ee-472f-b506-f44e3b75df67"]
 but was: 
  ["d6282a09-a5d1-4cbc-ba8a-02c88f2bcdf0",
      "2c878556-6b4f-48c4-b8d4-e9b1392460e0",
      "5405765d-23ee-472f-b506-f44e3b75df67",
      "cd9fac00-52b4-48c2-915b-80c11558603a",
      "6ca6e65d-379a-43a6-8428-b05baeef9d8d",
      "d03add21-5ef7-4fe8-908e-c5e7ccc66591",
      "74ee6071-4181-4711-86db-e3192dc5887f",
      "426732a2-c155-447d-ba0c-9cc0daeed0fb",
      "8ef0abb1-60d3-4442-8272-eacbd7efb5c7",
      "4d75a936-eb9c-469e-89d7-52f3b24c4fd3"]

It looks like not all docs are retrieved. I think that was the behavior in 2.15, but it should have been fixed in 2.16.

@martin-gaievski
Member

@eemmiirr I finally replicated the problem and will be working on the fix now.
