-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-9950: New facet counting implementation for general string doc value fields #133
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, and would work for single and multi valued fields seamlessly, right? I left a few small comments. Thanks @gsmiller!
lucene/facet/src/java/org/apache/lucene/facet/StringValueFacetCounts.java
Outdated
Show resolved
Hide resolved
private final OrdinalMap ordinalMap; | ||
private final SortedSetDocValues docValues; | ||
|
||
private final int[] counts; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm maybe a comment explaining what this array is? I think it is non-sparse, indexed by SSDV
ordinal? We might want to (later optimization) better handle the (likely more common?) sparse case, e.g. using IntIntScatterMap
or so from HPPC
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct. I'll add some documentation. I considered having both sparse and dense approaches triggered by different thresholds, similar to what IntTaxonomyFacetCounts
does, but opted not to for now. There should at least be some fairly common cases where this counting is pretty dense, assuming most unique values end up being seen at least once for a given field on any given match set. For very restrictive queries though, this could certainly get sparse.
Anyway, maybe the most relevant reason I took this approach for now is that it's the existing approach used by SortedSetDocValueFacetCounts
, so seemed like a reasonable starting place. But yes, optimization opportunities exist :)
lucene/facet/src/java/org/apache/lucene/facet/StringValueFacetCounts.java
Show resolved
Hide resolved
lucene/facet/src/java/org/apache/lucene/facet/StringDocValuesReaderState.java
Outdated
Show resolved
Hide resolved
@mikemccand yeah, this works for both single- and multi-valued fields. In |
FacetsCollector.MatchingDocs hits = matchingDocs.get(i); | ||
|
||
// Validate state before doing anything else: | ||
validateState(hits.context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is checking this for every segment really necessary? I guess it's technically possible that there could be MatchingDocs
instances in here with different top-level readers, but can that really happen in practice? I know that SortedSetDocValuesFacetCollector
checks each one so I'm doing the same here, but I'm wondering if it would be enough to validate the first segment? Anyone have thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, segments can be shared across readers, if that segment had not changed in between refreshes.
But, I think the top-level reader (from the LeafReaderContext
) must point to the new reader for all segments in the new reader, so I think you could indeed just check the first segment, and lose no safety. +1 to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mikemccand!
I went ahead and added a sparse counting approach since it wasn't complicated to do. I borrowed heuristics and some logic from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks awesome! Thanks @gsmiller! A new faceting implementation is born in Lucene :)
I'll try to push soon.
Description
Adding a new implementation for facet counting that works against any string doc value field (i.e., SortedSetDocValues, SortedDocValues). This implementation doesn't require "dimensions" to be encoded in the stored string, or for the user to rely on FacetConfig. It's meant to complement LongValueFacetCounts, which allows facet counting on any long doc value field, without any need for FacetConfig, etc.
Solution
Added a new facet counting implementation similar to SortedSetDocValueFacetCounts, but without the assumption of a "dimension" being present in the strings.
Tests
Added new unit tests for testing the new faceting counting implementation.
Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.