LUCENE-9950: New facet counting implementation for general string doc value fields #133

gsmiller · 2021-05-11T16:50:03Z

Description

Adds a new facet counting implementation that works against any string doc value field (i.e., SortedSetDocValues or SortedDocValues). It doesn't require "dimensions" to be encoded in the stored string, and the user doesn't need to rely on FacetsConfig. It's meant to complement LongValueFacetCounts, which allows facet counting on any long doc value field without any need for FacetsConfig.

Solution

Added a new facet counting implementation similar to SortedSetDocValuesFacetCounts, but without the assumption of a "dimension" being present in the strings.

Tests

Added new unit tests for the new facet counting implementation.

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.

mikemccand

This looks great, and would work for single and multi valued fields seamlessly, right? I left a few small comments. Thanks @gsmiller!

lucene/facet/src/java/org/apache/lucene/facet/StringValueFacetCounts.java

mikemccand · 2021-05-14T14:54:25Z

+  private final OrdinalMap ordinalMap;
+  private final SortedSetDocValues docValues;
+
+  private final int[] counts;


Hmm maybe a comment explaining what this array is? I think it is non-sparse, indexed by SSDV ordinal? We might want to (later optimization) better handle the (likely more common?) sparse case, e.g. using IntIntScatterMap or so from HPPC.

That's correct. I'll add some documentation. I considered having both sparse and dense approaches triggered by different thresholds, similar to what IntTaxonomyFacets does, but opted not to for now. There should at least be some fairly common cases where this counting is pretty dense, assuming most unique values end up being seen at least once for a given field on any given match set. For very restrictive queries though, this could certainly get sparse.

Anyway, maybe the most relevant reason I took this approach for now is that it's the existing approach used by SortedSetDocValuesFacetCounts, so it seemed like a reasonable starting place. But yes, optimization opportunities exist :)
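
To make the dense layout discussed above concrete, here is a minimal sketch. It is not the PR's code: the class `DenseOrdinalCounts`, the method `count`, and the `List<int[]>` input are hypothetical stand-ins for iterating doc value ordinals, and the sketch uses no Lucene types. It only shows the idea of one array slot per unique value, indexed by ordinal.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Lucene or of this PR.
final class DenseOrdinalCounts {

  /**
   * Counts how many matching docs carry each ordinal.
   * ordsPerDoc.get(d) holds the distinct ordinals of doc d: one entry for a
   * single-valued field, possibly several for a multi-valued field.
   */
  static int[] count(List<int[]> ordsPerDoc, int cardinality) {
    // Dense: one slot per unique value, whether or not it is ever hit.
    int[] counts = new int[cardinality];
    for (int[] ords : ordsPerDoc) {
      for (int ord : ords) {
        counts[ord]++;
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    // Three matching docs over a field with 4 unique values (ordinals 0..3).
    List<int[]> docs = List.of(new int[] {0, 2}, new int[] {2}, new int[] {1, 2});
    System.out.println(Arrays.toString(count(docs, 4))); // prints [1, 1, 3, 0]
  }
}
```

The array costs memory proportional to the field's cardinality even when only a handful of ordinals are hit, which is the sparse case raised in the review comment.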

lucene/facet/src/java/org/apache/lucene/facet/StringDocValuesReaderState.java

gsmiller · 2021-05-14T19:17:31Z

@mikemccand yeah, this works for both single- and multi-valued fields. In getDocValues() I'm relying on DocValues.getSortedSet(), which first tries to load the field as SortedSetDocValues and falls back to SortedDocValues if that fails. Pretty handy helper functionality. I cover this case in testBasicSingleValuedUsingSortedDoc to confirm.
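
The fallback described above can be modeled without Lucene as follows. This is a sketch under stated assumptions, not the library's implementation: `SortedSetFallback`, its `getSortedSet` method, and the two maps standing in for per-field storage are all hypothetical. The point is only the order of attempts: multi-valued first, then single-valued wrapped as one-element sets, then an empty view.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of the fallback; the maps stand in for doc value storage.
final class SortedSetFallback {

  /** Returns each doc's values as a list, whichever way the field was indexed. */
  static List<List<String>> getSortedSet(
      Map<String, List<List<String>>> multiValuedFields,
      Map<String, List<String>> singleValuedFields,
      String field) {
    List<List<String>> multi = multiValuedFields.get(field);
    if (multi != null) {
      return multi; // multi-valued storage exists: use it as-is
    }
    List<String> single = singleValuedFields.get(field);
    if (single == null) {
      return List.of(); // field not present in either form: empty view
    }
    // Single-valued storage: expose each doc's value as a one-element set.
    List<List<String>> wrapped = new ArrayList<>();
    for (String value : single) {
      wrapped.add(List.of(value));
    }
    return wrapped;
  }
}
```

With this shape, the counting code only ever has to handle the multi-valued view, which is why one implementation can serve both kinds of field.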

gsmiller · 2021-05-14T22:17:46Z

lucene/facet/src/java/org/apache/lucene/facet/StringValueFacetCounts.java

+        FacetsCollector.MatchingDocs hits = matchingDocs.get(i);
+
+        // Validate state before doing anything else:
+        validateState(hits.context);


Is checking this for every segment really necessary? I guess it's technically possible that there could be MatchingDocs instances in here with different top-level readers, but can that really happen in practice? I know that SortedSetDocValuesFacetCounts checks each one so I'm doing the same here, but I'm wondering if it would be enough to validate the first segment? Anyone have thoughts on this?

Hmm, segments can be shared across readers if a segment has not changed between refreshes.

But, I think the top-level reader (from the LeafReaderContext) must point to the new reader for all segments in the new reader, so I think you could indeed just check the first segment, and lose no safety. +1 to do that.

Thanks @mikemccand!
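
The first-segment check agreed on above can be sketched like this. The names are hypothetical (`FirstSegmentCheck`, `Leaf`, `validate`), and plain objects stand in for readers and leaf contexts; the real code works with Lucene's reader contexts instead.

```java
import java.util.List;

// Hypothetical illustration only; not the PR's validateState().
final class FirstSegmentCheck {

  /** Stand-in for a leaf context: a segment plus the top-level reader it was reached through. */
  static final class Leaf {
    final Object topLevelReader;
    final String segmentName;

    Leaf(Object topLevelReader, String segmentName) {
      this.topLevelReader = topLevelReader;
      this.segmentName = segmentName;
    }
  }

  /** Throws if the hits came from a different top-level reader than the state was built on. */
  static void validate(Object stateReader, List<Leaf> hits) {
    if (hits.isEmpty()) {
      return; // nothing was collected, so there is nothing to compare
    }
    // Assumption from the discussion: every leaf of one search points at the
    // same top-level reader, so checking the first leaf covers the rest.
    if (hits.get(0).topLevelReader != stateReader) {
      throw new IllegalStateException(
          "reader state was built on a different top-level reader than the hits");
    }
  }
}
```

The comparison is by identity rather than equality, since a segment shared between two readers is still reached through two distinct top-level reader instances.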

gsmiller · 2021-05-15T00:03:24Z

I went ahead and added a sparse counting approach since it wasn't complicated to do. I borrowed heuristics and some logic from IntTaxonomyFacets in doing so.
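
As a rough sketch of what choosing between sparse and dense counting can look like: the class below is hypothetical and is not the code from this PR. The two thresholds (a cardinality of at least 1024, and fewer hits than a tenth of the docs) are assumptions for illustration, and a `java.util.HashMap` stands in for a primitive int-to-int map such as the HPPC one mentioned earlier in the review.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; thresholds are assumed, not taken from the PR.
final class SparseOrDenseCounts {
  private final int[] dense;                  // non-null when most ordinals may be hit
  private final Map<Integer, Integer> sparse; // non-null when few ordinals are expected

  SparseOrDenseCounts(int cardinality, int totalHits, int totalDocs) {
    // Assumed heuristic: go sparse only for large value spaces and restrictive queries.
    boolean useSparse = cardinality >= 1024 && totalHits < totalDocs / 10;
    this.dense = useSparse ? null : new int[cardinality];
    this.sparse = useSparse ? new HashMap<>() : null;
  }

  boolean isSparse() {
    return sparse != null;
  }

  void increment(int ord) {
    if (sparse != null) {
      sparse.merge(ord, 1, Integer::sum); // only ordinals actually seen take space
    } else {
      dense[ord]++;
    }
  }

  int get(int ord) {
    return sparse != null ? sparse.getOrDefault(ord, 0) : dense[ord];
  }
}
```

Callers see the same `increment` and `get` either way, so the choice of representation stays an internal detail made once per counting pass.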

mikemccand

This looks awesome! Thanks @gsmiller! A new faceting implementation is born in Lucene :)

I'll try to push soon.

LUCENE-9950: New facet counting implementation for general string doc value fields

37df577

mikemccand reviewed May 14, 2021

Greg Miller added 2 commits May 14, 2021 11:51

PR feedback

7781f0d

spelling misses

c710043

gsmiller commented May 14, 2021

add sparse counting in addition to dense counting

3f3229e

only check state match for the first segment

b5b703d

mikemccand approved these changes May 18, 2021

mikemccand merged commit ade50f0 into apache:main May 18, 2021

gsmiller deleted the LUCENE-9950/pr branch May 31, 2021 13:12

asfimport mentioned this pull request Jun 23, 2021

Support both single- and multi-value string fields in facet counting (non-taxonomy based approaches) [LUCENE-9950] #10989

Closed
