Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9953: Make FacetResult#value accurate for LongValueFacetCounts #131

Merged
merged 2 commits into from
May 18, 2021

Conversation

gsmiller
Copy link
Contributor

Description

LongValueFacetCounts may produce inaccurate counts for FacetResult#value in cases where docs are multi-valued. This fixes the bug. Note that I'm proposing this change be included in the 8x branch, so I've also included a separate pull request in the old repo: apache/lucene-solr#2491

Solution

Modify LongValueFacetCounts to only count each new document it sees once, instead of once-per-value.

Tests

Modified existing test cases in TestLongValueFacetCounts to reflect the correct behavior and ensured they pass.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

@@ -429,15 +429,15 @@ public void testRandomMultiValued() throws Exception {
int expectedChildCount = 0;
int expectedTotalCount = 0;
for (int i = 0; i < valueCount; i++) {
if (values[i] != null) {
if (values[i] != null && values[i].length > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we force the test case to randomly add a duplicate long value in a document (Line 398) so that the test case fails with the existing code and then passes with your PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gautamworah96 . I'm not sure I fully understand the suggestion to add a duplicate long value to test the change? The current bug is that, whenever a document has multiple values for a particular field (they don't need to be the same value), we'll count that document multiple times (once for each value it contains) in FacetResult#value. So I think it should be sufficient to just make sure documents contain multiple values to trip the bug, which is happening in Line 396 in this test case. The change I made to the test case reflects the proper logic for populating FacetResult#value, and does in fact fail if my change is not present.

Does this make sense? Maybe I'm missing your point. If so, I apologize!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if a single document contains the same numeric value more than once, indexed as a SortedNumericDocValuesField, Lucene will indeed index that number multiple times for that document, and then this facet counting implementation will count it multiple times for that one value, right?

Or does Lucene de-dup such duplicate numeric values in one document? I'm not certain :)

Ideally, it should only count that value once for that document, but it over-counts today.

However, this is a different bug than what this PR is fixing -- so let's keep this PR about the original bug of not overcounting the FacetResult.value?

Copy link
Contributor

@gautamworah96 gautamworah96 May 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if a single document contains the same numeric value more than once, indexed as a SortedNumericDocValuesField, Lucene will indeed index that number multiple times for that document, and then this facet counting implementation will count it multiple times for that one value, right?

Correct @mikemccand . I thought that this PR was for fixing that bug. I have replicated that bug here (see testLongValues)

So essentially, within a FacetResult you would want the value count for the label to be counted just once but that is not the case today

I'll create a JIRA for it.

Edit: Created LUCENE-9964

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, interesting @mikemccand / @gautamworah96. Right, so that's not the bug I was considering here, but that's a good one to dig into as well. Thanks for spinning off an issue @gautamworah96!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah thank you for reproducing this (separate) issue in test PR and opening a new issue @gautamworah96! Let's see if we can fix this one too :)

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thanks @gsmiller!

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gsmiller!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants