Problems with field resolution in significant terms aggregations #5687

Closed
hkorte opened this issue Apr 4, 2014 · 2 comments

Comments

@hkorte
Contributor

hkorte commented Apr 4, 2014

Hi,

I noticed behavior that was unexpected (at least to me) in how the significant terms aggregation resolves field names. Prepending the document type to the field name, i.e. using

"significant_terms" : { "field" : "report.crime_type" }

instead of

"significant_terms" : { "field" : "crime_type" }

leads to an NPE in ES 1.1.0:

java.lang.NullPointerException
        at org.elasticsearch.search.aggregations.bucket.significant.SignificantTermsAggregatorFactory.getBackgroundFrequency(SignificantTermsAggregatorFactory.java:190)
        at org.elasticsearch.search.aggregations.bucket.significant.SignificantStringTermsAggregator.buildAggregation(SignificantStringTermsAggregator.java:87)
        at org.elasticsearch.search.aggregations.bucket.significant.SignificantStringTermsAggregator$WithOrdinals.buildAggregation(SignificantStringTermsAggregator.java:129)
        at org.elasticsearch.search.aggregations.AggregationPhase.execute(AggregationPhase.java:135)
        at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:136)
...

and in the current master this leads to "Infinity" scores:

"aggregations" : {
    "contentTerms" : {
      "doc_count" : 5,
      "buckets" : [ {
        "key" : "of",
        "doc_count" : 4,
        "score" : "Infinity",
        "bg_count" : 0
      }, {
        "key" : "a",
        "doc_count" : 3,
        "score" : "Infinity",
        "bg_count" : 0
      }, {
        "key" : "metals",
        "doc_count" : 5,
        "score" : "Infinity",
        "bg_count" : 0
      }
...
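One plausible reading of the output above is that a background count of zero turns a ratio-style score into a division by zero. The sketch below is only an illustration under that assumption: `ratio_score`, `ieee_divide`, and the background total of 100 are hypothetical and are not taken from the Elasticsearch source. Python raises on float division by zero, so the helper emulates Java's `double` behavior, where a positive value divided by zero is `Infinity`.

```python
import math


def ieee_divide(numerator: float, denominator: float) -> float:
    """Divide like a Java double: x / 0.0 is +/-Infinity (or NaN for 0 / 0)."""
    if denominator == 0.0:
        if numerator == 0.0:
            return math.nan
        return math.copysign(math.inf, numerator)
    return numerator / denominator


def ratio_score(fg_count: int, fg_total: int, bg_count: int, bg_total: int) -> float:
    """Hypothetical score: (fg_rate - bg_rate) * (fg_rate / bg_rate)."""
    fg_rate = fg_count / fg_total
    bg_rate = bg_count / bg_total
    return (fg_rate - bg_rate) * ieee_divide(fg_rate, bg_rate)


# Bucket "of" from the response above: doc_count 4 of 5, bg_count 0.
# The background total (100) is an assumed value for illustration.
print(ratio_score(4, 5, 0, 100))
# The same bucket with an assumed non-zero background count stays finite.
print(ratio_score(4, 5, 10, 100))
```

If this reading is right, the "Infinity" scores are a symptom of the background lookup finding no documents for the type-prefixed field name, not a property of the terms themselves.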

Without the document type it works fine in both ES versions. Here is a gist to reproduce it: https://gist.github.com/hkorte/9974567
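Since the unprefixed field name works in both versions, a client-side workaround is to strip the type prefix before building the request. This is a minimal sketch: the helper names and the aggregation name `significantCrimeTypes` are hypothetical, and only the `significant_terms` / `field` structure comes from the snippets above.

```python
import json


def strip_type_prefix(field: str, doc_type: str) -> str:
    """Return the field name without a leading '<doc_type>.' prefix, if present."""
    prefix = doc_type + "."
    return field[len(prefix):] if field.startswith(prefix) else field


def significant_terms_agg(agg_name: str, field: str) -> dict:
    """Build a request body fragment with one significant_terms aggregation."""
    return {"aggregations": {agg_name: {"significant_terms": {"field": field}}}}


# "report.crime_type" is the failing form from this issue; the stripped
# form "crime_type" is the one reported to work.
body = significant_terms_agg(
    "significantCrimeTypes",  # hypothetical aggregation name
    strip_type_prefix("report.crime_type", "report"),
)
print(json.dumps(body, indent=2))
```

The prefix check includes the trailing dot, so a field such as `reporter.name` is left untouched when the document type is `report`.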

@markharwood
Contributor

Thanks for raising this issue.
Adding support for doc-type prefixes in field names would probably bring the expectation that background frequencies are also filtered by doc type. For example, if an index contains tweet and email doc types, then tweet.text should filter out any counts for email.text occurrences in the single indexed text field held by the Elasticsearch index.
This scenario will be expensive, as we'll need to drop down a level into the postings to count docs that match a doc-type filter (ideally only if the index contains more than one doc type with that field).
I'm not sure what would happen if the query is on an unqualified text field (so querying both email and tweet doc types) but the significant_terms analysis is requested on a qualified email.text field. A quick test with a plain terms agg suggests the counts produced in this case are not filtered by doc type. So if the foreground stats obtained from the FieldData cache are unfiltered, then there is a case for making the background stats unfiltered too.
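The difference between filtered and unfiltered background counts can be illustrated with a toy in-memory model. This is only a sketch of the idea: the `Doc` tuple, the sample documents, and `background_count` are hypothetical, and nothing here reflects how Elasticsearch actually stores or reads postings.

```python
from collections import namedtuple

# A toy "index": both doc types share one indexed text field.
Doc = namedtuple("Doc", ["doc_type", "text"])

INDEX = [
    Doc("tweet", "metals prices rise"),
    Doc("tweet", "a report of metals"),
    Doc("email", "metals meeting moved"),
    Doc("email", "lunch at noon"),
]


def background_count(term, docs, doc_type=None):
    """Count docs containing term; restrict to doc_type when one is given."""
    return sum(
        1
        for d in docs
        if (doc_type is None or d.doc_type == doc_type) and term in d.text.split()
    )


# Unfiltered: every occurrence in the shared text field is counted.
print(background_count("metals", INDEX))
# Filtered: what a tweet.text request might be expected to count.
print(background_count("metals", INDEX, doc_type="tweet"))
```

In this model the filtered count has to inspect each matching document's type, which mirrors the extra cost described above, while the unfiltered count only needs the term's overall document frequency.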

@clintongormley
Contributor

Closing in favour of #8870
