Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9953: Make FacetResult#value accurate for LongValueFacetCounts #131

Merged
merged 2 commits into from
May 18, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions lucene/CHANGES.txt
Original file line number Diff line number Diff line change
Expand Up @@ -390,6 +390,10 @@ Bug Fixes

* LUCENE-9958: Fixed performance regression for boolean queries that configure a
minimum number of matching clauses. (Adrien Grand, Matt Weber)

* LUCENE-9953: LongValueFacetCounts should count each document at most once when determining
the total count for a dimension. Prior to this fix, multi-value docs could contribute a > 1
count to the dimension count. (Greg Miller)

Other
---------------------
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -28,9 +28,12 @@ public final class FacetResult {
public final String[] path;

/**
* Total value for this path (sum of all child counts, or sum of all child values), even those not
* included in the topN.
* Total number of documents containing a value for this path, even those not included in the
* topN. If a document contains multiple values for the same path, it will only be counted once in
* this value.
*/
// TODO: This may not hold true for SSDV faceting, where docs can be counted more than
// once. We should fix this. See LUCENE-9952
public final Number value;

/** How many child labels were encountered. */
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -160,7 +160,9 @@ private void count(String field, List<MatchingDocs> matchingDocs) throws IOExcep

for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
int limit = multiValues.docValueCount();
totCount += limit;
if (limit > 0) {
totCount++;
}
for (int i = 0; i < limit; i++) {
increment(multiValues.nextValue());
}
Expand Down Expand Up @@ -204,7 +206,9 @@ private void countAll(IndexReader reader, String field) throws IOException {

while (multiValues.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
int limit = multiValues.docValueCount();
totCount += limit;
if (limit > 0) {
totCount++;
}
for (int i = 0; i < limit; i++) {
increment(multiValues.nextValue());
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -429,15 +429,15 @@ public void testRandomMultiValued() throws Exception {
int expectedChildCount = 0;
int expectedTotalCount = 0;
for (int i = 0; i < valueCount; i++) {
if (values[i] != null) {
if (values[i] != null && values[i].length > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we force the test case to randomly add a duplicate long value in a document (Line 398) so that the test case fails with the existing code and then passes with your PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gautamworah96 . I'm not sure I fully understand the suggestion to add a duplicate long value to test the change? The current bug is that, whenever a document has multiple values for a particular field (they don't need to be the same value), we'll count that document multiple times (once for each value it contains) in FacetResult#value. So I think it should be sufficient to just make sure documents contain multiple values to trip the bug, which is happening in Line 396 in this test case. The change I made to the test case reflects the proper logic for populating FacetResult#value, and does in fact fail if my change is not present.

Does this make sense? Maybe I'm missing your point. If so, I apologize!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if a single document contains the same numeric value more than once, indexed as a SortedNumericDocValuesField, Lucene will indeed index that number multiple times for that document, and then this facet counting implementation will count it multiple times for that one value, right?

Or does Lucene de-dup such duplicate numeric values in one document? I'm not certain :)

Ideally, it should only count that value once for that document, but it over-counts today.

However, this is a different bug than what this PR is fixing -- so let's keep this PR about the original bug of not overcounting the FacetResult.value?

Copy link
Contributor

@gautamworah96 gautamworah96 May 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if a single document contains the same numeric value more than once, indexed as a SortedNumericDocValuesField, Lucene will indeed index that number multiple times for that document, and then this facet counting implementation will count it multiple times for that one value, right?

Correct @mikemccand . I thought that this PR was for fixing that bug. I have replicated that bug here (see testLongValues)

So essentially, within a FacetResult you would want the value count for the label to be counted just once but that is not the case today

I'll create a JIRA for it.

Edit: Created LUCENE-9964

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, interesting @mikemccand / @gautamworah96. Right, so that's not the bug I was considering here, but that's a good one to dig into as well. Thanks for spinning off an issue @gautamworah96!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah thank you for reproducing this (separate) issue in test PR and opening a new issue @gautamworah96! Let's see if we can fix this one too :)

expectedTotalCount++;
for (long value : values[i]) {
Integer curCount = expected.get(value);
if (curCount == null) {
curCount = 0;
expectedChildCount++;
}
expected.put(value, curCount + 1);
expectedTotalCount++;
}
}
}
Expand Down Expand Up @@ -520,9 +520,9 @@ public void testRandomMultiValued() throws Exception {
expectedChildCount = 0;
int totCount = 0;
for (int i = minId; i <= maxId; i++) {
if (values[i] != null) {
if (values[i] != null && values[i].length > 0) {
totCount++;
for (long value : values[i]) {
totCount++;
Integer curCount = expected.get(value);
if (curCount == null) {
expectedChildCount++;
Expand Down