significant terms: infrastructure for easily changing the significance heuristic

This commit adds the infrastructure to allow plugging in different
measures for computing the significance of a term.
Significance measures can be provided externally by overriding

- SignificanceHeuristic
- SignificanceHeuristicBuilder
- SignificanceHeuristicParser

closes #6561
brwe committed Jul 14, 2014
1 parent 4a89c0d commit 89838d8
Showing 26 changed files with 1,747 additions and 154 deletions.
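Before the diff, a quick illustration of the extension point: a heuristic maps four document counts to a score. The four-argument getScore shape below is taken from the significanceHeuristic.getScore(subsetDf, subsetSize, supersetDf, supersetSize) call in this diff; the class itself is a hypothetical sketch, omitting the SignificanceHeuristicBuilder and SignificanceHeuristicParser plumbing a real plugin would also provide.

[source,java]
--------------------------------------------------
// Hypothetical sketch, not part of this commit: a custom measure that scores
// a term by the relative change of its document frequency between the subset
// (foreground) and the superset (background).
public class RelativeChangeHeuristic {

    public double getScore(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
        if (subsetSize == 0 || supersetSize == 0 || supersetFreq == 0) {
            return 0; // degenerate sets: no meaningful comparison possible
        }
        double subsetProbability = (double) subsetFreq / subsetSize;
        double supersetProbability = (double) supersetFreq / supersetSize;
        return subsetProbability / supersetProbability;
    }
}
--------------------------------------------------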
@@ -194,10 +194,7 @@ where a simple `terms` aggregation would typically show the very popular "consta

.How are the scores calculated?
**********************************
The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than something easily understood by end users.
The scores are derived from the doc frequencies in _foreground_ and _background_ sets. The _absolute_ change in popularity (foregroundPercent - backgroundPercent) would favour
common terms whereas the _relative_ change in popularity (foregroundPercent/ backgroundPercent) would favour rare terms.
Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.
The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than something easily understood by end users. The scores are derived from the doc frequencies in _foreground_ and _background_ sets. In brief, a term is considered significant if there is a noticeable difference in the frequency in which a term appears in the subset and in the background. The way the terms are ranked can be configured, see "Parameters" section.
**********************************

@@ -282,7 +279,35 @@ However, the `size` and `shard size` settings covered in the next section provid

==== Parameters

===== JLH score

The scores are derived from the doc frequencies in _foreground_ and _background_ sets. The _absolute_ change in popularity (foregroundPercent - backgroundPercent) would favor common terms whereas the _relative_ change in popularity (foregroundPercent/ backgroundPercent) would favor rare terms. Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.
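
Spelled out as code, the blend described above is exactly the computation of getSampledTermSignificance, which this commit moves out of InternalSignificantTerms (see the removal further down in this diff) and behind the heuristic interface; a standalone restatement:

[source,java]
--------------------------------------------------
// Restatement of the JLH score described above; mirrors the logic of the
// removed getSampledTermSignificance method in this diff.
static double jlhScore(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
    if (subsetSize == 0 || supersetSize == 0) {
        return 0; // avoid any divide-by-zero issues
    }
    if (supersetFreq == 0) {
        supersetFreq = 1; // a foreground term may be missing from a non-superset background
    }
    double subsetProbability = (double) subsetFreq / subsetSize;
    double supersetProbability = (double) supersetFreq / supersetSize;
    double absoluteChange = subsetProbability - supersetProbability; // favours common terms
    if (absoluteChange <= 0) {
        return 0;
    }
    double relativeChange = subsetProbability / supersetProbability; // favours rare terms
    return absoluteChange * relativeChange; // the precision/recall sweet spot
}
--------------------------------------------------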

===== mutual information
added[1.3.0]

Mutual information as described in "Information Retrieval", Manning et al., Chapter 13.5.1 can be used as significance score by adding the parameter

[source,js]
--------------------------------------------------
"mutual_information": {
"include_negatives": true
}
--------------------------------------------------

Mutual information does not differentiate between terms that are descriptive for the subset or for documents outside the subset. The significant terms therefore can contain terms that appear more or less frequent in the subset than outside the subset. To filter out the terms that appear less often in the subset than in documents outside the subset, `include_negatives` can be set to `false`.
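
For reference, a sketch of the textbook computation over the same four counts (this follows the cited chapter 13.5.1 formula under the stated superset assumption; it is not necessarily byte-for-byte the commit's implementation, and the `include_negatives` filtering is omitted):

[source,java]
--------------------------------------------------
// Mutual information (Manning et al., ch. 13.5.1) over the four cells of the
// term/subset contingency table. Assumes the background is a superset.
static double mutualInformation(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
    double n = supersetSize;
    double n11 = subsetFreq;                // in subset, has term
    double n01 = subsetSize - subsetFreq;   // in subset, lacks term
    double n10 = supersetFreq - subsetFreq; // outside subset, has term
    double n00 = n - subsetSize - n10;      // outside subset, lacks term
    return cell(n11, supersetFreq, subsetSize, n)
            + cell(n01, n - supersetFreq, subsetSize, n)
            + cell(n10, supersetFreq, n - subsetSize, n)
            + cell(n00, n - supersetFreq, n - subsetSize, n);
}

// One cell's contribution: (nij / n) * log2(n * nij / (ni * nj)).
static double cell(double nij, double ni, double nj, double n) {
    if (nij == 0) {
        return 0; // by convention 0 * log(0) = 0
    }
    return (nij / n) * (Math.log(n * nij / (ni * nj)) / Math.log(2));
}
--------------------------------------------------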

Per default, the assumption is that the documents in the bucket are also contained in the background. If instead you defined a custom background filter that represents a different set of documents that you want to compare to, set

[source,js]
--------------------------------------------------
"background_is_superset": false
--------------------------------------------------
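
Under that setting the contingency table changes shape: the subset counts are no longer contained in the background counts. A sketch of the difference, reusing cell() from the sketch above (again illustrative, not the commit's exact code):

[source,java]
--------------------------------------------------
// With background_is_superset = false, the "outside" cells come straight
// from the separate background set rather than being derived by subtraction.
static double mutualInformationDisjoint(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
    double n = subsetSize + supersetSize;     // total docs across both sets
    double n11 = subsetFreq;                  // in subset, has term
    double n01 = subsetSize - subsetFreq;     // in subset, lacks term
    double n10 = supersetFreq;                // in background, has term
    double n00 = supersetSize - supersetFreq; // in background, lacks term
    return cell(n11, n11 + n10, subsetSize, n)
            + cell(n01, n01 + n00, subsetSize, n)
            + cell(n10, n11 + n10, supersetSize, n)
            + cell(n00, n01 + n00, supersetSize, n);
}
--------------------------------------------------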



===== Size & Shard Size

The `size` parameter can be set to define how many term buckets should be returned out of the overall terms list. By
@@ -338,7 +363,7 @@ Terms that score highly will be collected on a shard level and merged with the t

added[1.2.0] `shard_min_doc_count` parameter

The parameter `shard_min_doc_count` regulates the _certainty_ a shard has if the term should actually be added to the candidate list or not with respect to the `min_doc_count`. Terms will only be considered if their local shard frequency within the set is higher than the `shard_min_doc_count`. If your dictionary contains many low frequent words and you are not interested in these (for example misspellings), then you can set the `shard_min_doc_count` parameter to filter out candidate terms on a shard level that will with a resonable certainty not reach the required `min_doc_count` even after merging the local frequencies. `shard_min_doc_count` is set to `1` per default and has no effect unless you explicitly set it.
The parameter `shard_min_doc_count` regulates the _certainty_ a shard has if the term should actually be added to the candidate list or not with respect to the `min_doc_count`. Terms will only be considered if their local shard frequency within the set is higher than the `shard_min_doc_count`. If your dictionary contains many low frequent words and you are not interested in these (for example misspellings), then you can set the `shard_min_doc_count` parameter to filter out candidate terms on a shard level that will with a reasonable certainty not reach the required `min_doc_count` even after merging the local frequencies. `shard_min_doc_count` is set to `1` per default and has no effect unless you explicitly set it.



10 changes: 10 additions & 0 deletions src/main/java/org/elasticsearch/common/ParseField.java
@@ -55,6 +55,16 @@ public String getPreferredName(){
return underscoreName;
}

public String[] getAllNamesIncludedDeprecated() {
String[] allNames = new String[2 + deprecatedNames.length];
allNames[0] = camelCaseName;
allNames[1] = underscoreName;
for (int i = 0; i < deprecatedNames.length; i++) {
allNames[i + 2] = deprecatedNames[i];
}
return allNames;
}
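
A hypothetical usage sketch of the new accessor (the single-argument ParseField construction and the names here are assumptions for illustration; only getAllNamesIncludedDeprecated and withDeprecation come from this diff):

[source,java]
--------------------------------------------------
// Hypothetical illustration: collect every accepted spelling of a parameter,
// deprecated names included, e.g. when registering heuristic parser names.
ParseField heuristicName = new ParseField("mutual_information") // assumed constructor
        .withDeprecation("mutualInformation");
String[] allNames = heuristicName.getAllNamesIncludedDeprecated();
// -> camelCase name, underscore name, then any deprecated names
--------------------------------------------------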

public ParseField withDeprecation(String... deprecatedNames) {
return new ParseField(this.underscoreName, deprecatedNames);
}
3 changes: 2 additions & 1 deletion src/main/java/org/elasticsearch/search/SearchModule.java
@@ -27,6 +27,7 @@
import org.elasticsearch.index.search.morelikethis.MoreLikeThisFetchService;
import org.elasticsearch.search.action.SearchServiceTransportAction;
import org.elasticsearch.search.aggregations.AggregationModule;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificantTermsHeuristicModule;
import org.elasticsearch.search.controller.SearchPhaseController;
import org.elasticsearch.search.dfs.DfsPhase;
import org.elasticsearch.search.facet.FacetModule;
@@ -50,7 +51,7 @@ public class SearchModule extends AbstractModule implements SpawnModules {

@Override
public Iterable<? extends Module> spawnModules() {
return ImmutableList.of(new TransportSearchModule(), new FacetModule(), new HighlightModule(), new SuggestModule(), new FunctionScoreModule(), new AggregationModule());
return ImmutableList.of(new TransportSearchModule(), new FacetModule(), new HighlightModule(), new SuggestModule(), new FunctionScoreModule(), new AggregationModule(), new SignificantTermsHeuristicModule());
}

@Override
@@ -99,7 +99,7 @@ public SignificantStringTerms buildAggregation(long owningBucketOrdinal) {
// that are for this shard only
// Back at the central reducer these properties will be updated with
// global stats
spare.updateScore();
spare.updateScore(termsAggFactory.getSignificanceHeuristic());
if (spare.subsetDf >= bucketCountThresholds.getShardMinDocCount()) {
spare = (SignificantStringTerms.Bucket) ordered.insertWithOverflow(spare);
}
@@ -114,7 +114,7 @@ public SignificantStringTerms buildAggregation(long owningBucketOrdinal) {
list[i] = bucket;
}

return new SignificantStringTerms(subsetSize, supersetSize, name, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), Arrays.asList(list));
return new SignificantStringTerms(subsetSize, supersetSize, name, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), termsAggFactory.getSignificanceHeuristic(), Arrays.asList(list));
}

@Override
@@ -123,7 +123,7 @@ public SignificantStringTerms buildEmptyAggregation() {
ContextIndexSearcher searcher = context.searchContext().searcher();
IndexReader topReader = searcher.getIndexReader();
int supersetSize = topReader.numDocs();
return new SignificantStringTerms(0, supersetSize, name, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), Collections.<InternalSignificantTerms.Bucket>emptyList());
return new SignificantStringTerms(0, supersetSize, name, bucketCountThresholds.getRequiredSize(), bucketCountThresholds.getMinDocCount(), termsAggFactory.getSignificanceHeuristic(), Collections.<InternalSignificantTerms.Bucket>emptyList());
}

@Override
@@ -25,6 +25,7 @@
import org.elasticsearch.search.aggregations.Aggregations;
import org.elasticsearch.search.aggregations.InternalAggregation;
import org.elasticsearch.search.aggregations.InternalAggregations;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.SignificanceHeuristic;

import java.util.*;

@@ -33,6 +34,7 @@
*/
public abstract class InternalSignificantTerms extends InternalAggregation implements SignificantTerms, ToXContent, Streamable {

protected SignificanceHeuristic significanceHeuristic;
protected int requiredSize;
protected long minDocCount;
protected Collection<Bucket> buckets;
@@ -42,7 +44,6 @@ public abstract class InternalSignificantTerms extends InternalAggregation imple

protected InternalSignificantTerms() {} // for serialization

// TODO updateScore call in constructor to be cleaned up as part of adding pluggable scoring algos
@SuppressWarnings("PMD.ConstructorCallsOverridableMethod")
public static abstract class Bucket extends SignificantTerms.Bucket {

@@ -53,7 +54,6 @@ public static abstract class Bucket extends SignificantTerms.Bucket {
protected Bucket(long subsetDf, long subsetSize, long supersetDf, long supersetSize, InternalAggregations aggregations) {
super(subsetDf, subsetSize, supersetDf, supersetSize);
this.aggregations = aggregations;
updateScore();
}

@Override
@@ -76,59 +76,8 @@ public long getSubsetSize() {
return subsetSize;
}

/**
* Calculates the significance of a term in a sample against a background of
* normal distributions by comparing the changes in frequency. This is the heart
* of the significant terms feature.
* <p/>
* TODO - allow pluggable scoring implementations
*
* @param subsetFreq The frequency of the term in the selected sample
* @param subsetSize The size of the selected sample (typically number of docs)
* @param supersetFreq The frequency of the term in the superset from which the sample was taken
* @param supersetSize The size of the superset from which the sample was taken (typically number of docs)
* @return a "significance" score
*/
public static double getSampledTermSignificance(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {
if ((subsetSize == 0) || (supersetSize == 0)) {
// avoid any divide by zero issues
return 0;
}
if (supersetFreq == 0) {
// If we are using a background context that is not a strict superset, a foreground
// term may be missing from the background, so for the purposes of this calculation
// we assume a value of 1 for our calculations which avoids returning an "infinity" result
supersetFreq = 1;
}
double subsetProbability = (double) subsetFreq / (double) subsetSize;
double supersetProbability = (double) supersetFreq / (double) supersetSize;

// Using absoluteProbabilityChange alone favours very common words e.g. you, we etc
// because a doubling in popularity of a common term is a big percent difference
// whereas a rare term would have to achieve a hundred-fold increase in popularity to
// achieve the same difference measure.
// In favouring common words as suggested features for search we would get high
// recall but low precision.
double absoluteProbabilityChange = subsetProbability - supersetProbability;
if (absoluteProbabilityChange <= 0) {
return 0;
}
// Using relativeProbabilityChange tends to favour rarer terms e.g.mis-spellings or
// unique URLs.
// A very low-probability term can very easily double in popularity due to the low
// numbers required to do so whereas a high-probability term would have to add many
// extra individual sightings to achieve the same shift.
// In favouring rare words as suggested features for search we would get high
// precision but low recall.
double relativeProbabilityChange = (subsetProbability / supersetProbability);

// A blend of the above metrics - favours medium-rare terms to strike a useful
// balance between precision and recall.
return absoluteProbabilityChange * relativeProbabilityChange;
}

public void updateScore() {
score = getSampledTermSignificance(subsetDf, subsetSize, supersetDf, supersetSize);
public void updateScore(SignificanceHeuristic significanceHeuristic) {
score = significanceHeuristic.getScore(subsetDf, subsetSize, supersetDf, supersetSize);
}

@Override
@@ -162,13 +111,14 @@ public double getSignificanceScore() {
}
}

protected InternalSignificantTerms(long subsetSize, long supersetSize, String name, int requiredSize, long minDocCount, Collection<Bucket> buckets) {
protected InternalSignificantTerms(long subsetSize, long supersetSize, String name, int requiredSize, long minDocCount, SignificanceHeuristic significanceHeuristic, Collection<Bucket> buckets) {
super(name);
this.requiredSize = requiredSize;
this.minDocCount = minDocCount;
this.buckets = buckets;
this.subsetSize = subsetSize;
this.supersetSize = supersetSize;
this.significanceHeuristic = significanceHeuristic;
}

@Override
@@ -227,6 +177,7 @@ public InternalAggregation reduce(ReduceContext reduceContext) {
for (Map.Entry<String, List<Bucket>> entry : buckets.entrySet()) {
List<Bucket> sameTermBuckets = entry.getValue();
final Bucket b = sameTermBuckets.get(0).reduce(sameTermBuckets, reduceContext.bigArrays());
b.updateScore(significanceHeuristic);
if ((b.score > 0) && (b.subsetDf >= minDocCount)) {
ordered.insertWithOverflow(b);
}
@@ -18,6 +18,7 @@
*/
package org.elasticsearch.search.aggregations.bucket.significant;

import org.elasticsearch.Version;
import org.elasticsearch.common.Nullable;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
@@ -26,6 +27,7 @@
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.search.aggregations.AggregationStreams;
import org.elasticsearch.search.aggregations.InternalAggregations;
import org.elasticsearch.search.aggregations.bucket.significant.heuristics.*;
import org.elasticsearch.search.aggregations.support.format.ValueFormatter;
import org.elasticsearch.search.aggregations.support.format.ValueFormatterStreams;

@@ -92,12 +94,13 @@ Bucket newBucket(long subsetDf, long subsetSize, long supersetDf, long supersetS

private ValueFormatter formatter;

SignificantLongTerms() {} // for serialization
SignificantLongTerms() {
} // for serialization

public SignificantLongTerms(long subsetSize, long supersetSize, String name, @Nullable ValueFormatter formatter,
int requiredSize, long minDocCount, Collection<InternalSignificantTerms.Bucket> buckets) {
int requiredSize, long minDocCount, SignificanceHeuristic significanceHeuristic, Collection<InternalSignificantTerms.Bucket> buckets) {

super(subsetSize, supersetSize, name, requiredSize, minDocCount, buckets);
super(subsetSize, supersetSize, name, requiredSize, minDocCount, significanceHeuristic, buckets);
this.formatter = formatter;
}

@@ -109,7 +112,7 @@ public Type type() {
@Override
InternalSignificantTerms newAggregation(long subsetSize, long supersetSize,
List<InternalSignificantTerms.Bucket> buckets) {
return new SignificantLongTerms(subsetSize, supersetSize, getName(), formatter, requiredSize, minDocCount, buckets);
return new SignificantLongTerms(subsetSize, supersetSize, getName(), formatter, requiredSize, minDocCount, significanceHeuristic, buckets);
}

@Override
@@ -120,14 +123,17 @@ public void readFrom(StreamInput in) throws IOException {
this.minDocCount = in.readVLong();
this.subsetSize = in.readVLong();
this.supersetSize = in.readVLong();
significanceHeuristic = SignificanceHeuristicStreams.read(in);

int size = in.readVInt();
List<InternalSignificantTerms.Bucket> buckets = new ArrayList<>(size);
for (int i = 0; i < size; i++) {
long subsetDf = in.readVLong();
long supersetDf = in.readVLong();
long term = in.readLong();
buckets.add(new Bucket(subsetDf, subsetSize, supersetDf,supersetSize, term, InternalAggregations.readAggregations(in)));
Bucket readBucket = new Bucket(subsetDf, subsetSize, supersetDf,supersetSize, term, InternalAggregations.readAggregations(in));
readBucket.updateScore(significanceHeuristic);
buckets.add(readBucket);
}
this.buckets = buckets;
this.bucketMap = null;
@@ -141,6 +147,9 @@ public void writeTo(StreamOutput out) throws IOException {
out.writeVLong(minDocCount);
out.writeVLong(subsetSize);
out.writeVLong(supersetSize);
if (out.getVersion().onOrAfter(Version.V_1_3_0)) {
significanceHeuristic.writeTo(out);
}
out.writeVInt(buckets.size());
for (InternalSignificantTerms.Bucket bucket : buckets) {
out.writeVLong(((Bucket) bucket).subsetDf);