Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-9950: New facet counting implementation for general string doc value fields #133

Merged
merged 5 commits into from
May 18, 2021

Conversation

gsmiller
Copy link
Contributor

@gsmiller gsmiller commented May 11, 2021

Description

Adding a new implementation for facet counting that works against any string doc value field (i.e., SortedSetDocValues, SortedDocValues). This implementation doesn't require "dimensions" to be encoded in the stored string, or for the user to rely on FacetConfig. It's meant to complement LongValueFacetCounts, which allows facet counting on any long doc value field, without any need for FacetConfig, etc.

Solution

Added a new facet counting implementation similar to SortedSetDocValueFacetCounts, but without the assumption of a "dimension" being present in the strings.

Tests

Added new unit tests for testing the new faceting counting implementation.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, and would work for single and multi valued fields seamlessly, right? I left a few small comments. Thanks @gsmiller!

private final OrdinalMap ordinalMap;
private final SortedSetDocValues docValues;

private final int[] counts;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm maybe a comment explaining what this array is? I think it is non-sparse, indexed by SSDV ordinal? We might want to (later optimization) better handle the (likely more common?) sparse case, e.g. using IntIntScatterMap or so from HPPC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. I'll add some documentation. I considered having both sparse and dense approaches triggered by different thresholds, similar to what IntTaxonomyFacetCounts does, but opted not to for now. There should at least be some fairly common cases where this counting is pretty dense, assuming most unique values end up being seen at least once for a given field on any given match set. For very restrictive queries though, this could certainly get sparse.

Anyway, maybe the most relevant reason I took this approach for now is that it's the existing approach used by SortedSetDocValueFacetCounts, so seemed like a reasonable starting place. But yes, optimization opportunities exist :)

@gsmiller
Copy link
Contributor Author

@mikemccand yeah, this works for both single- and multi-valued fields. In getDocValues() I'm relying on DocValues.getSortedSet() which will first try to load stored values as SortedSetDocValues but will fall back to trying SortedDocValues. Pretty handy helper functionality. I cover this case in testBasicSingleValuedUsingSortedDoc to confirm.

FacetsCollector.MatchingDocs hits = matchingDocs.get(i);

// Validate state before doing anything else:
validateState(hits.context);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is checking this for every segment really necessary? I guess it's technically possible that there could be MatchingDocs instances in here with different top-level readers, but can that really happen in practice? I know that SortedSetDocValuesFacetCollector checks each one so I'm doing the same here, but I'm wondering if it would be enough to validate the first segment? Anyone have thoughts on this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, segments can be shared across readers, if that segment had not changed in between refreshes.

But, I think the top-level reader (from the LeafReaderContext) must point to the new reader for all segments in the new reader, so I think you could indeed just check the first segment, and lose no safety. +1 to do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mikemccand!

@gsmiller
Copy link
Contributor Author

I went ahead and added a sparse counting approach since it wasn't complicated to do. I borrowed heuristics and some logic from IntTaxonomyFacets in doing so.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome! Thanks @gsmiller! A new faceting implementation is born in Lucene :)

I'll try to push soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants