significant terms: infrastructure for easily changing the significance heuristic

This commit adds the infrastructure to allow plugging in different
measures for computing the significance of a term.
Significance measures can be provided externally by overriding

- SignificanceHeuristic
- SignificanceHeuristicBuilder
- SignificanceHeuristicParser

closes #6561
brwe committed Jul 14, 2014
1 parent 4a89c0d commit 89838d8
Showing 26 changed files with 1,747 additions and 154 deletions.
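Before the diff, a quick illustration of the extension point: a heuristic maps four document counts to a score. The four-argument getScore shape below is taken from the significanceHeuristic.getScore(subsetDf, subsetSize, supersetDf, supersetSize) call in this diff; the class itself is a hypothetical sketch, omitting the SignificanceHeuristicBuilder and SignificanceHeuristicParser plumbing a real plugin would also provide.

[source,java]
--------------------------------------------------
// Hypothetical sketch, not part of this commit: a custom measure that scores
// a term by the relative change of its document frequency between the subset
// (foreground) and the superset (background).
public class RelativeChangeHeuristic {

    public double getScore(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
        if (subsetSize == 0 || supersetSize == 0 || supersetFreq == 0) {
            return 0; // degenerate sets: no meaningful comparison possible
        }
        double subsetProbability = (double) subsetFreq / subsetSize;
        double supersetProbability = (double) supersetFreq / supersetSize;
        return subsetProbability / supersetProbability;
    }
}
--------------------------------------------------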
@@ -194,10 +194,7 @@ where a simple `terms` aggregation would typically show the very popular "consta

.How are the scores calculated?
**********************************
The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than something easily understood by end users.
The scores are derived from the doc frequencies in _foreground_ and _background_ sets. The _absolute_ change in popularity (foregroundPercent - backgroundPercent) would favour
common terms whereas the _relative_ change in popularity (foregroundPercent/ backgroundPercent) would favour rare terms.
Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.
The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than something easily understood by end users. The scores are derived from the doc frequencies in _foreground_ and _background_ sets. In brief, a term is considered significant if there is a noticeable difference in the frequency in which a term appears in the subset and in the background. The way the terms are ranked can be configured, see "Parameters" section.
**********************************

@@ -282,7 +279,35 @@ However, the `size` and `shard size` settings covered in the next section provid

==== Parameters

===== JLH score

The scores are derived from the doc frequencies in _foreground_ and _background_ sets. The _absolute_ change in popularity (foregroundPercent - backgroundPercent) would favor common terms whereas the _relative_ change in popularity (foregroundPercent/ backgroundPercent) would favor rare terms. Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.
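
Spelled out as code, the blend described above is exactly the computation of getSampledTermSignificance, which this commit moves out of InternalSignificantTerms (see the removal further down in this diff) and behind the heuristic interface; a standalone restatement:

[source,java]
--------------------------------------------------
// Restatement of the JLH score described above; mirrors the logic of the
// removed getSampledTermSignificance method in this diff.
static double jlhScore(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
    if (subsetSize == 0 || supersetSize == 0) {
        return 0; // avoid any divide-by-zero issues
    }
    if (supersetFreq == 0) {
        supersetFreq = 1; // a foreground term may be missing from a non-superset background
    }
    double subsetProbability = (double) subsetFreq / subsetSize;
    double supersetProbability = (double) supersetFreq / supersetSize;
    double absoluteChange = subsetProbability - supersetProbability; // favours common terms
    if (absoluteChange <= 0) {
        return 0;
    }
    double relativeChange = subsetProbability / supersetProbability; // favours rare terms
    return absoluteChange * relativeChange; // the precision/recall sweet spot
}
--------------------------------------------------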

===== mutual information
added[1.3.0]

Mutual information as described in "Information Retrieval", Manning et al., Chapter 13.5.1 can be used as significance score by adding the parameter

[source,js]
--------------------------------------------------
"mutual_information": {
"include_negatives": true
}
--------------------------------------------------

Mutual information does not differentiate between terms that are descriptive for the subset or for documents outside the subset. The significant terms therefore can contain terms that appear more or less frequent in the subset than outside the subset. To filter out the terms that appear less often in the subset than in documents outside the subset, `include_negatives` can be set to `false`.
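
For reference, a sketch of the textbook computation over the same four counts (this follows the cited chapter 13.5.1 formula under the stated superset assumption; it is not necessarily byte-for-byte the commit's implementation, and the `include_negatives` filtering is omitted):

[source,java]
--------------------------------------------------
// Mutual information (Manning et al., ch. 13.5.1) over the four cells of the
// term/subset contingency table. Assumes the background is a superset.
static double mutualInformation(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
    double n = supersetSize;
    double n11 = subsetFreq;                // in subset, has term
    double n01 = subsetSize - subsetFreq;   // in subset, lacks term
    double n10 = supersetFreq - subsetFreq; // outside subset, has term
    double n00 = n - subsetSize - n10;      // outside subset, lacks term
    return cell(n11, supersetFreq, subsetSize, n)
            + cell(n01, n - supersetFreq, subsetSize, n)
            + cell(n10, supersetFreq, n - subsetSize, n)
            + cell(n00, n - supersetFreq, n - subsetSize, n);
}

// One cell's contribution: (nij / n) * log2(n * nij / (ni * nj)).
static double cell(double nij, double ni, double nj, double n) {
    if (nij == 0) {
        return 0; // by convention 0 * log(0) = 0
    }
    return (nij / n) * (Math.log(n * nij / (ni * nj)) / Math.log(2));
}
--------------------------------------------------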

Per default, the assumption is that the documents in the bucket are also contained in the background. If instead you defined a custom background filter that represents a different set of documents that you want to compare to, set

[source,js]
--------------------------------------------------
"background_is_superset": false
--------------------------------------------------
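
Under that setting the contingency table changes shape: the subset counts are no longer contained in the background counts. A sketch of the difference, reusing cell() from the sketch above (again illustrative, not the commit's exact code):

[source,java]
--------------------------------------------------
// With background_is_superset = false, the "outside" cells come straight
// from the separate background set rather than being derived by subtraction.
static double mutualInformationDisjoint(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
    double n = subsetSize + supersetSize;     // total docs across both sets
    double n11 = subsetFreq;                  // in subset, has term
    double n01 = subsetSize - subsetFreq;     // in subset, lacks term
    double n10 = supersetFreq;                // in background, has term
    double n00 = supersetSize - supersetFreq; // in background, lacks term
    return cell(n11, n11 + n10, subsetSize, n)
            + cell(n01, n01 + n00, subsetSize, n)
            + cell(n10, n11 + n10, supersetSize, n)
            + cell(n00, n01 + n00, supersetSize, n);
}
--------------------------------------------------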



===== Size & Shard Size

The `size` parameter can be set to define how many term buckets should be returned out of the overall terms list. By
@@ -338,7 +363,7 @@ Terms that score highly will be collected on a shard level and merged with the t

added[1.2.0] `shard_min_doc_count` parameter

The parameter `shard_min_doc_count` regulates the _certainty_ a shard has if the term should actually be added to the candidate list or not with respect to the `min_doc_count`. Terms will only be considered if their local shard frequency within the set is higher than the `shard_min_doc_count`. If your dictionary contains many low frequent words and you are not interested in these (for example misspellings), then you can set the `shard_min_doc_count` parameter to filter out candidate terms on a shard level that will with a resonable certainty not reach the required `min_doc_count` even after merging the local frequencies. `shard_min_doc_count` is set to `1` per default and has no effect unless you explicitly set it.
The parameter `shard_min_doc_count` regulates the _certainty_ a shard has if the term should actually be added to the candidate list or not with respect to the `min_doc_count`. Terms will only be considered if their local shard frequency within the set is higher than the `shard_min_doc_count`. If your dictionary contains many low frequent words and you are not interested in these (for example misspellings), then you can set the `shard_min_doc_count` parameter to filter out candidate terms on a shard level that will with a reasonable certainty not reach the required `min_doc_count` even after merging the local frequencies. `shard_min_doc_count` is set to `1` per default and has no effect unless you explicitly set it.



10 changes: 10 additions & 0 deletions src/main/java/org/elasticsearch/common/ParseField.java
@@ -55,6 +55,16 @@ public String getPreferredName(){
return underscoreName;
}

public String[] getAllNamesIncludedDeprecated() {
String[] allNames = new String[2 + deprecatedNames.length];
allNames[0] = camelCaseName;
allNames[1] = underscoreName;
for (int i = 0; i < deprecatedNames.length; i++) {
allNames[i + 2] = deprecatedNames[i];
}
return allNames;
}
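
A hypothetical usage sketch of the new accessor (the single-argument ParseField construction and the names here are assumptions for illustration; only getAllNamesIncludedDeprecated and withDeprecation come from this diff):

[source,java]
--------------------------------------------------
// Hypothetical illustration: collect every accepted spelling of a parameter,
// deprecated names included, e.g. when registering heuristic parser names.
ParseField heuristicName = new ParseField("mutual_information") // assumed constructor
        .withDeprecation("mutualInformation");
String[] allNames = heuristicName.getAllNamesIncludedDeprecated();
// -> camelCase name, underscore name, then any deprecated names
--------------------------------------------------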

public ParseField withDeprecation(String... deprecatedNames) {
return new ParseField(this.underscoreName, deprecatedNames);
}
3 changes: 2 additions & 1 deletion src/main/java/org/elasticsearch/search/SearchModule.java
@@ -27,6 +27,7 @@
import org.elasticsearch.index.search.morelikethis.MoreLikeThisFetchService;
import org.elasticsearch.search.action.SearchServiceTransportAction;
import org.elasticsearch.search.aggregations.AggregationModule;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificantTermsHeuristicModule;
import org.elasticsearch.search.controller.SearchPhaseController;
import org.elasticsearch.search.dfs.DfsPhase;
import org.elasticsearch.search.facet.FacetModule;
@@ -50,7 +51,7 @@ public class SearchModule extends AbstractModule implements SpawnModules {

@Override
public Iterable<? extends Module> spawnModules() {
return ImmutableList.of(new TransportSearchModule(), new FacetModule(), new HighlightModule(), new SuggestModule(), new FunctionScoreModule(), new AggregationModule());
return ImmutableList.of(new TransportSearchModule(), new FacetModule(), new HighlightModule(), new SuggestModule(), new FunctionScoreModule(), new AggregationModule(), new SignificantTermsHeuristicModule());
}

@Override
@@ -99,7 +99,7 @@ public SignificantStringTerms buildAggregation(long owningBucketOrdinal) {
// that are for this shard only
// Back at the central reducer these properties will be updated with
// global stats
spare.updateScore();
spare.updateScore(termsAggFactory.getSignificanceHeuristic());
if (spare.subsetDf >= bucketCountThresholds.getShardMinDocCount()) {
spare = (SignificantStringTerms.Bucket) ordered.insertWithOverflow(spare);
}
@@ -114,7 +114,7 @@ public SignificantStringTerms buildAggregation(long owningBucketOrdinal) {
list[i] = bucket;
}

return new SignificantStringTerms(subsetSize, supersetSize, name, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), Arrays.asList(list));
return new SignificantStringTerms(subsetSize, supersetSize, name, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), termsAggFactory.getSignificanceHeuristic(), Arrays.asList(list));
}

@Override
@@ -123,7 +123,7 @@ public SignificantStringTerms buildEmptyAggregation() {
ContextIndexSearcher searcher = context.searchContext().searcher();
IndexReader topReader = searcher.getIndexReader();
int supersetSize = topReader.numDocs();
return new SignificantStringTerms(0, supersetSize, name, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), Collections.<InternalSignificantTerms.Bucket>emptyList());
return new SignificantStringTerms(0, supersetSize, name, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), termsAggFactory.getSignificanceHeuristic(), Collections.<InternalSignificantTerms.Bucket>emptyList());
}

@Override
@@ -25,6 +25,7 @@
import org.elasticsearch.search.aggregations.Aggregations;
import org.elasticsearch.search.aggregations.InternalAggregation;
import org.elasticsearch.search.aggregations.InternalAggregations;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificanceHeuristic;

import java.util.*;

@@ -33,6 +34,7 @@
*/
public abstract class InternalSignificantTerms extends InternalAggregation implements SignificantTerms, ToXContent, Streamable {

protected SignificanceHeuristic significanceHeuristic;
protected int requiredSize;
protected long minDocCount;
protected Collection<Bucket> buckets;
@@ -42,7 +44,6 @@ public abstract class InternalSignificantTerms extends InternalAggregation imple

protected InternalSignificantTerms() {} // for serialization

// TODO updateScore call in constructor to be cleaned up as part of adding pluggable scoring algos
@SuppressWarnings("PMD.ConstructorCallsOverridableMethod")
public static abstract class Bucket extends SignificantTerms.Bucket {

@@ -53,7 +54,6 @@ public static abstract class Bucket extends SignificantTerms.Bucket {
protected Bucket(long subsetDf, long subsetSize, long supersetDf, long supersetSize, InternalAggregations aggregations) {
super(subsetDf, subsetSize, supersetDf, supersetSize);
this.aggregations = aggregations;
updateScore();
}

@Override
@@ -76,59 +76,8 @@ public long getSubsetSize() {
return subsetSize;
}

/**
* Calculates the significance of a term in a sample against a background of
* normal distributions by comparing the changes in frequency. This is the heart
* of the significant terms feature.
* <p/>
* TODO - allow pluggable scoring implementations
*
* @param subsetFreq The frequency of the term in the selected sample
* @param subsetSize The size of the selected sample (typically number of docs)
* @param supersetFreq The frequency of the term in the superset from which the sample was taken
* @param supersetSize The size of the superset from which the sample was taken (typically number of docs)
* @return a "significance" score
*/
public static double getSampledTermSignificance(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
if ((subsetSize == 0) || (supersetSize == 0)) {
// avoid any divide by zero issues
return 0;
}
if (supersetFreq == 0) {
// If we are using a background context that is not a strict superset, a foreground
// term may be missing from the background, so for the purposes of this calculation
// we assume a value of 1 for our calculations which avoids returning an "infinity" result
supersetFreq = 1;
}
double subsetProbability = (double) subsetFreq / (double) subsetSize;
double supersetProbability = (double) supersetFreq / (double) supersetSize;

// Using absoluteProbabilityChange alone favours very common words e.g. you, we etc
// because a doubling in popularity of a common term is a big percent difference
// whereas a rare term would have to achieve a hundred-fold increase in popularity to
// achieve the same difference measure.
// In favouring common words as suggested features for search we would get high
// recall but low precision.
double absoluteProbabilityChange = subsetProbability - supersetProbability;
if (absoluteProbabilityChange <= 0) {
return 0;
}
// Using relativeProbabilityChange tends to favour rarer terms e.g.mis-spellings or
// unique URLs.
// A very low-probability term can very easily double in popularity due to the low
// numbers required to do so whereas a high-probability term would have to add many
// extra individual sightings to achieve the same shift.
// In favouring rare words as suggested features for search we would get high
// precision but low recall.
double relativeProbabilityChange = (subsetProbability / supersetProbability);

// A blend of the above metrics - favours medium-rare terms to strike a useful
// balance between precision and recall.
return absoluteProbabilityChange * relativeProbabilityChange;
}

public void updateScore() {
score = getSampledTermSignificance(subsetDf, subsetSize, supersetDf, supersetSize);
public void updateScore(SignificanceHeuristic significanceHeuristic) {
score = significanceHeuristic.getScore(subsetDf, subsetSize, supersetDf, supersetSize);
}

@Override
@@ -162,13 +111,14 @@ public double getSignificanceScore() {
}
}

protected InternalSignificantTerms(long subsetSize, long supersetSize, String name, int requiredSize, long minDocCount, Collection<Bucket> buckets) {
protected InternalSignificantTerms(long subsetSize, long supersetSize, String name, int requiredSize, long minDocCount, SignificanceHeuristic significanceHeuristic, Collection<Bucket> buckets) {
super(name);
this.requiredSize = requiredSize;
this.minDocCount = minDocCount;
this.buckets = buckets;
this.subsetSize = subsetSize;
this.supersetSize = supersetSize;
this.significanceHeuristic = significanceHeuristic;
}

@Override
@@ -227,6 +177,7 @@ public InternalAggregation reduce(ReduceContext reduceContext) {
for (Map.Entry<String, List<Bucket>> entry : buckets.entrySet()) {
List<Bucket> sameTermBuckets = entry.getValue();
final Bucket b = sameTermBuckets.get(0).reduce(sameTermBuckets, reduceContext.bigArrays());
b.updateScore(significanceHeuristic);
if ((b.score > 0) && (b.subsetDf >= minDocCount)) {
ordered.insertWithOverflow(b);
}
@@ -18,6 +18,7 @@
*/
package org.elasticsearch.search.aggregations.bucket.significant;

import org.elasticsearch.Version;
import org.elasticsearch.common.Nullable;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
@@ -26,6 +27,7 @@
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.search.aggregations.AggregationStreams;
import org.elasticsearch.search.aggregations.InternalAggregations;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.*;
import org.elasticsearch.search.aggregations.support.format.ValueFormatter;
import org.elasticsearch.search.aggregations.support.format.ValueFormatterStreams;

@@ -92,12 +94,13 @@ Bucket newBucket(long subsetDf, long subsetSize, long supersetDf, long supersetS

private ValueFormatter formatter;

SignificantLongTerms() {} // for serialization
SignificantLongTerms() {
} // for serialization

public SignificantLongTerms(long subsetSize, long supersetSize, String name, @Nullable ValueFormatter formatter,
int requiredSize, long minDocCount, Collection<InternalSignificantTerms.Bucket> buckets) {
int requiredSize, long minDocCount, SignificanceHeuristic significanceHeuristic, Collection<InternalSignificantTerms.Bucket> buckets) {

super(subsetSize, supersetSize, name, requiredSize, minDocCount, buckets);
super(subsetSize, supersetSize, name, requiredSize, minDocCount, significanceHeuristic, buckets);
this.formatter = formatter;
}

@@ -109,7 +112,7 @@ public Type type() {
@Override
InternalSignificantTerms newAggregation(long subsetSize, long supersetSize,
List<InternalSignificantTerms.Bucket> buckets) {
return new SignificantLongTerms(subsetSize, supersetSize, getName(), formatter, requiredSize, minDocCount, buckets);
return new SignificantLongTerms(subsetSize, supersetSize, getName(), formatter, requiredSize, minDocCount, significanceHeuristic, buckets);
}

@Override
@@ -120,14 +123,17 @@ public void readFrom(StreamInput in) throws IOException {
this.minDocCount = in.readVLong();
this.subsetSize = in.readVLong();
this.supersetSize = in.readVLong();
significanceHeuristic = SignificanceHeuristicStreams.read(in);

int size = in.readVInt();
List<InternalSignificantTerms.Bucket> buckets = new ArrayList<>(size);
for (int i = 0; i < size; i++) {
long subsetDf = in.readVLong();
long supersetDf = in.readVLong();
long term = in.readLong();
buckets.add(new Bucket(subsetDf, subsetSize, supersetDf,supersetSize, term, InternalAggregations.readAggregations(in)));
Bucket readBucket = new Bucket(subsetDf, subsetSize, supersetDf,supersetSize, term, InternalAggregations.readAggregations(in));
readBucket.updateScore(significanceHeuristic);
buckets.add(readBucket);
}
this.buckets = buckets;
this.bucketMap = null;
@@ -141,6 +147,9 @@ public void writeTo(StreamOutput out) throws IOException {
out.writeVLong(minDocCount);
out.writeVLong(subsetSize);
out.writeVLong(supersetSize);
if (out.getVersion().onOrAfter(Version.V_1_3_0)) {
significanceHeuristic.writeTo(out);
}
out.writeVInt(buckets.size());
for (InternalSignificantTerms.Bucket bucket : buckets) {
out.writeVLong(((Bucket) bucket).subsetDf);