Problems with field resolution in significant terms aggregations #5687

hkorte · 2014-04-04T13:44:44Z

Hi,

I noticed behavior that was unexpected (at least to me) in the field resolution of the significant_terms aggregation when the document type is prepended to the field name:

"significant_terms" : { "field" : "report.crime_type" }

instead of

"significant_terms" : { "field" : "crime_type" }

leads to an NPE in ES 1.1.0:

java.lang.NullPointerException
        at org.elasticsearch.search.aggregations.bucket.significant.SignificantTermsAggregatorFactory.getBackgroundFrequency(SignificantTermsAggregatorFactory.java:190)
        at org.elasticsearch.search.aggregations.bucket.significant.SignificantStringTermsAggregator.buildAggregation(SignificantStringTermsAggregator.java:87)
        at org.elasticsearch.search.aggregations.bucket.significant.SignificantStringTermsAggregator$WithOrdinals.buildAggregation(SignificantStringTermsAggregator.java:129)
        at org.elasticsearch.search.aggregations.AggregationPhase.execute(AggregationPhase.java:135)
        at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:136)
...

and in the current master this leads to "Infinity" scores:

"aggregations" : {
    "contentTerms" : {
      "doc_count" : 5,
      "buckets" : [ {
        "key" : "of",
        "doc_count" : 4,
        "score" : "Infinity",
        "bg_count" : 0
      }, {
        "key" : "a",
        "doc_count" : 3,
        "score" : "Infinity",
        "bg_count" : 0
      }, {
        "key" : "metals",
        "doc_count" : 5,
        "score" : "Infinity",
        "bg_count" : 0
      }
...

Without the document type it works fine in both ES versions. Here is a gist to reproduce it: https://gist.github.com/hkorte/9974567
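
One plausible reading of the "Infinity" scores above, sketched under assumptions: if the score is a JLH-style product of the absolute and relative change between the foreground and background term rates, then a bg_count of 0 makes the relative change a floating-point division by zero. The thread does not confirm the formula. The class and method names below are hypothetical, and the background total of 1000 is made up for the example; the foreground numbers (4 of 5) are taken from the "of" bucket.

```java
// Hypothetical sketch, not Elasticsearch code: a JLH-style significance score.
final class JlhScoreSketch {

    // fgCount/fgTotal: term hits and docs in the foreground (query) set.
    // bgCount/bgTotal: term hits and docs in the background (index) set.
    static double score(long fgCount, long fgTotal, long bgCount, long bgTotal) {
        if (fgTotal == 0 || bgTotal == 0) {
            return 0.0;
        }
        double fgRate = (double) fgCount / fgTotal;
        double bgRate = (double) bgCount / bgTotal;
        double absoluteChange = fgRate - bgRate;
        if (absoluteChange <= 0) {
            return 0.0;
        }
        // When bgCount is 0, bgRate is 0.0 and this division yields +Infinity.
        double relativeChange = fgRate / bgRate;
        return absoluteChange * relativeChange;
    }

    public static void main(String[] args) {
        // "of" bucket from the output above: doc_count 4 of 5, bg_count 0.
        // The background total of 1000 is an assumed value.
        System.out.println(score(4, 5, 0, 1000));  // prints "Infinity"
        // Same foreground with an assumed nonzero bg_count gives a finite score.
        System.out.println(score(4, 5, 10, 1000));
    }
}
```

Under this reading, the type-qualified field name fails to resolve to any background term frequency, so every bucket reports bg_count 0 and the score overflows to Infinity.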

markharwood · 2014-04-20T22:43:47Z

Thanks for raising this issue.
Adding support for doc type prefixes in field names probably brings the expectation that background frequencies are also filtered by doc type. For example, if an index contains tweet and email doc types, then tweet.text should filter out any counts for email.text occurrences in the single indexed text field held by the ES index.
This scenario will be expensive, as we'll need to drop down a level into postings to count docs that match a doc type filter (ideally only if the index contains more than one doc type with that field).
I'm not sure what would happen if the query is on an unqualified text field (so querying both email and tweet doc types) but the significant_terms analysis is requested on a qualified email.text field. A quick test with a plain terms agg suggests the counts produced in this case are not filtered by doc type. So if the foreground stats obtained from the FieldData cache are unfiltered, then there is a case for making the background stats unfiltered too.
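
To make the filtered versus unfiltered distinction concrete, here is a toy in-memory model. It is only a sketch, not Elasticsearch or Lucene code: the Doc class, backgroundCount, and the sample documents are all hypothetical. A null type filter stands in for the cheap index-wide document frequency, while a non-null filter stands in for the more expensive per-document check against the doc type.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model: tweet and email docs sharing one indexed "text" field.
final class BackgroundCountSketch {

    static final class Doc {
        final String type;
        final List<String> textTerms;

        Doc(String type, String... textTerms) {
            this.type = type;
            this.textTerms = Arrays.asList(textTerms);
        }
    }

    // Counts docs containing the term; a null typeFilter means "all doc types".
    static long backgroundCount(List<Doc> index, String term, String typeFilter) {
        long count = 0;
        for (Doc doc : index) {
            boolean typeMatches = typeFilter == null || typeFilter.equals(doc.type);
            if (typeMatches && doc.textTerms.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // Made-up sample data for the example.
    static List<Doc> sampleIndex() {
        return Arrays.asList(
            new Doc("tweet", "metals", "prices"),
            new Doc("tweet", "weather"),
            new Doc("email", "metals", "invoice"),
            new Doc("email", "metals", "meeting"));
    }

    public static void main(String[] args) {
        List<Doc> index = sampleIndex();
        System.out.println(backgroundCount(index, "metals", null));    // 3: unfiltered
        System.out.println(backgroundCount(index, "metals", "tweet")); // 1: tweet.text only
        System.out.println(backgroundCount(index, "metals", "email")); // 2: email.text only
    }
}
```

Whichever variant is chosen, the foreground and background counts need to use the same one, otherwise the two rates being compared describe different document sets.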

clintongormley · 2014-12-30T15:15:35Z

Closing in favour of #8870

clintongormley closed this as completed Dec 30, 2014
