Infrastructure for changing easily the significance terms heuristic #6561

Closed
wants to merge 22 commits into from

Conversation

brwe
Contributor

@brwe brwe commented Jun 19, 2014

...euristic

This commit adds the infrastructure to allow plugging in different
measures for computing the significance of a term.
Significance measures can be provided externally by overriding

  • SignificanceHeuristic
  • SignificanceHeuristicBuilder
  • SignificanceHeuristicParser

and registering Parser and Heuristic at the SignificantTermsHeuristicModule.

As a proof of concept, this commit also adds a second heuristic to the
already existing one (MutualInformation).

The scores are derived from the doc frequencies in _foreground_ and _background_ sets. The _absolute_ change in popularity (foregroundPercent - backgroundPercent) would favor common terms, whereas the _relative_ change in popularity (foregroundPercent / backgroundPercent) would favor rare terms. Rare vs common is essentially a precision vs recall balance, so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.
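
The multiplication described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the code from this pull request: the class name `JlhSketch`, the method `score`, its parameter names, and the choice to return zero for empty sets or for terms that did not gain popularity are all hypothetical.

```java
/** Hypothetical sketch of the score described above; not the PR's implementation. */
final class JlhSketch {

    /**
     * @param fgFreq docs containing the term in the foreground set
     * @param fgSize total docs in the foreground set
     * @param bgFreq docs containing the term in the background set
     * @param bgSize total docs in the background set
     */
    static double score(long fgFreq, long fgSize, long bgFreq, long bgSize) {
        if (fgSize <= 0 || bgSize <= 0 || bgFreq <= 0) {
            return 0.0; // assumption: nothing to compare against, so no significance
        }
        double foregroundPercent = (double) fgFreq / fgSize;
        double backgroundPercent = (double) bgFreq / bgSize;

        // Absolute change favors common terms.
        double absoluteChange = foregroundPercent - backgroundPercent;
        if (absoluteChange <= 0) {
            return 0.0; // assumption: terms that did not gain popularity score zero
        }
        // Relative change favors rare terms.
        double relativeChange = foregroundPercent / backgroundPercent;

        // The product balances the two.
        return absoluteChange * relativeChange;
    }
}
```

For example, a term found in 50 of 100 foreground documents and 100 of 10,000 background documents has an absolute change of 0.49 and a relative change of 50, so `JlhSketch.score(50, 100, 100, 10000)` is approximately 24.5.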

===== mutual information
added[1.2.3]
Contributor

I think this should not go into a bugfix release (i.e. s/1.2.3/1.3.0/)

@brwe brwe removed the v1.2.2 label Jun 19, 2014

protected static final String[] NAMES = {"mutual_information", Strings.toCamelCase("mutual_information")};

protected static final ParseField NAMES_FIELD = new ParseField(NAMES[0], NAMES[1]);
Contributor

Note that ParseField does the camel casing for you. The 2nd, 3rd, ... args to its constructor are actually deprecated names, so that in future we can run in "strict" mode and flag any client use of deprecated APIs.
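
The behaviour described in this comment can be illustrated with a small stand-in class. Everything here is hypothetical: `FieldNameSketch`, `toCamelCase`, `match`, and the boolean `strict` flag are invented for the sketch and are not the Elasticsearch `ParseField` API. The sketch only shows the idea of one preferred name, an automatically derived camel-case variant, and extra names that are accepted leniently but rejected in strict mode.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the idea behind ParseField; not the Elasticsearch class. */
final class FieldNameSketch {
    private final String preferredName;
    private final String camelCaseName;
    private final List<String> deprecatedNames;

    FieldNameSketch(String preferredName, String... deprecatedNames) {
        this.preferredName = preferredName;
        this.camelCaseName = toCamelCase(preferredName); // derived, never passed in
        this.deprecatedNames = Arrays.asList(deprecatedNames);
    }

    /** Converts e.g. "mutual_information" to "mutualInformation". */
    static String toCamelCase(String underscored) {
        StringBuilder sb = new StringBuilder();
        boolean upperNext = false;
        for (char c : underscored.toCharArray()) {
            if (c == '_') {
                upperNext = true;
            } else {
                sb.append(upperNext ? Character.toUpperCase(c) : c);
                upperNext = false;
            }
        }
        return sb.toString();
    }

    /** Returns true on a match; in strict mode a deprecated name is rejected. */
    boolean match(String candidate, boolean strict) {
        if (preferredName.equals(candidate) || camelCaseName.equals(candidate)) {
            return true;
        }
        if (deprecatedNames.contains(candidate)) {
            if (strict) {
                throw new IllegalArgumentException(
                    "Deprecated field [" + candidate + "], use [" + preferredName + "]");
            }
            return true;
        }
        return false;
    }
}
```

Under this reading, passing the camel-cased name as a second constructor argument, as in the line under review, would mark it as deprecated rather than as an equal alternative.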

@markharwood
Contributor

Looks great @brwe ! I have the Log Likelihood Ratio code from Mahout if you want to bundle that too?
I made a couple of tweaks with Ted's guidance as part of our tests.

@brwe
Contributor Author

brwe commented Jun 20, 2014

@markharwood I thought I should make a second pull request that adds that and also Chi square and all that? It is a lot of code already.

@jpountz
Contributor

jpountz commented Jun 21, 2014

+1 to split into several pull requests

@brwe
Contributor Author

brwe commented Jun 23, 2014

I added two commits to add the deprecated names checking, but only for the significant terms heuristics here. It seems to me that deprecated names are never checked in aggregations anywhere, unless I am missing something. I am now wondering if it would make more sense to add that to aggregations in a separate commit.

@areek areek assigned rmuir and unassigned rmuir Jun 23, 2014
@brwe
Contributor Author

brwe commented Jun 23, 2014

I am now wondering if it would make more sense to add that to aggregations in a separate commit.

Removed the strict parsing flag check again; it seems to make more sense to do that consistently in a different pull request.

@s1monw
Contributor

s1monw commented Jun 26, 2014

I only looked briefly at this, but can we add extensive unit tests for the individual heuristics? I think we should have those for these!

@@ -120,6 +122,7 @@ public void readFrom(StreamInput in) throws IOException {
this.minDocCount = in.readVLong();
this.subsetSize = in.readVLong();
this.supersetSize = in.readVLong();
significanceHeuristic = SignificanceHeuristicStreams.read(in);
Contributor

Should you only read it if the version is >= 1.3.0 and otherwise fall back to the default impl?

Contributor Author

Yes, I added a commit "check if version supports...".
I added a bwc test for this check, but on latest master it fails because of the combination of 6093 and 5659. Currently, the branch is based on e2da211. I'll rebase on latest master once this is resolved.

Contributor Author

tests run now, I rebased on master
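
The version check discussed in this thread follows a common pattern for backwards-compatible wire formats: only read a field if the sender was new enough to have written it. A minimal sketch, using plain `java.io` streams rather than Elasticsearch's `StreamInput`; the class name, the integer version encoding, and the default name `"jlh"` are assumptions made for illustration.

```java
import java.io.DataInputStream;
import java.io.IOException;

/** Hypothetical sketch of a version-gated read; names and wire format are assumptions. */
final class VersionedReadSketch {
    static final int V_1_3_0 = 10300;               // assumed numeric encoding of 1.3.0
    static final String DEFAULT_HEURISTIC = "jlh";  // assumed name of the default heuristic

    /** Reads the heuristic name only if the sender is new enough to have written it. */
    static String readHeuristic(DataInputStream in, int senderVersion) throws IOException {
        if (senderVersion >= V_1_3_0) {
            return in.readUTF();
        }
        // Older nodes never wrote this field, so nothing is consumed from the stream.
        return DEFAULT_HEURISTIC;
    }
}
```

The writing side needs the mirror-image check, otherwise an older receiver would find bytes in the stream that it does not expect.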

@brwe
Contributor Author

brwe commented Jul 1, 2014

Updated with new commits. @s1monw : I added unit tests in SignificantTermsUnitTests, is that extensive enough?
@markharwood: I added assertions to the score computation and had to change one of the tests (see commits "test score assertions and score" and "check for shard failures...") - is the new behavior OK?

/**
*
*/
@ElasticsearchIntegrationTest.SuiteScopeTest
Contributor

If it is an ElasticsearchLuceneTestCase you don't need @ElasticsearchIntegrationTest.SuiteScopeTest. I also think it should extend ElasticsearchTestCase rather than ElasticsearchLuceneTestCase, or do you use any Lucene-specific parts? Can we also name this test something other than ..UnitTest, maybe SignificanceHeuristicTest?

Contributor Author

done, commit is "make ElasticsearchTestCase and rename..."

@brwe
Contributor Author

brwe commented Jul 2, 2014

@markharwood About the assertions in the scoring function: I agree, we might not always want to rely on the strict superset property. However, for mutual information we sort of rely on the fact that it is strict, else the computations do not make sense.

Mutual information compares two sets and not so much foreground against background. I assumed that the two sets are the subset and the background without the subset. It therefore relies on knowing the frequency in the subset but also the frequency in the background set without the subset. Because currently I only get the background frequency, I have to do a subtraction of background frequency and foreground frequency to figure out how many are in the other set.

Now an example:
Background contains 2 documents, but foreground contains 3, because the strict superset property was violated or because the two sets are completely independent. Now, if the function gets passed foreground freq = 3 and background freq = 2, I know that one set contains 3, but I have no means to determine how many documents are actually in the other set, as I do not know the overlap of the two sets. Subtracting the foreground frequency from the background frequency is clearly wrong: I get a negative number and the computed value will have no meaning. Hence all the strict checks.

I will remove the assertions from the default score and only leave them in mutual information. Actually I am thinking I should replace the asserts by exceptions to make sure users are aware that whatever is computed is wrong...
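
The subtraction and the proposed exception can be sketched as follows. The class and method names are hypothetical and the message text is invented; the sketch only shows why a negative difference signals that the strict superset assumption does not hold.

```java
/** Hypothetical sketch of the strict-superset check; names are illustrative only. */
final class SupersetCheckSketch {

    /**
     * Docs containing the term in "background minus foreground".
     * Only meaningful if the background is a superset of the foreground.
     */
    static long frequencyOutsideForeground(long foregroundFreq, long backgroundFreq) {
        long outside = backgroundFreq - foregroundFreq;
        if (outside < 0) {
            // An exception instead of an assert: users see the problem even with assertions off.
            throw new IllegalArgumentException(
                "foreground frequency " + foregroundFreq
                    + " exceeds background frequency " + backgroundFreq
                    + "; the background is not a superset of the foreground");
        }
        return outside;
    }
}
```

With a foreground frequency of 2 and a background frequency of 3 the method returns 1; with the values swapped it throws instead of returning -1.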

@markharwood
Contributor

I find practical uses of these significance algos on free text are vastly improved if the foreground sample is devoid of the sorts of duplicate text introduced by retweets, email replies, copyright notices etc. we find in typical content. This is the area I am working on at the moment to efficiently strip out repetitions and this will only add to the fuzziness of the numbers presented (e.g. I count only half of the text in documents in a result set). This will mean the foreground sample under-reports word frequencies and any significance algos shouldn't be too thrown off by that.
I'm close to issuing a PR for this, so it may be useful to try some of these alternative significance algos in this context.

@brwe
Contributor Author

brwe commented Jul 2, 2014

@markharwood the problem does not arise from under-reporting of word frequencies but from the inability to clearly determine the frequencies in the two sets being compared.
The current heuristic compares one set vs a background, and the counts can be fuzzy, I agree. But mutual information compares two distinct sets, and the significance cannot be determined if the frequencies in each of the sets cannot be computed.

This will mean the foreground sample under-reports word frequencies and any significance algos shouldn't be too thrown off by that.

Maybe I am missing something, but I do not see how I should compare two sets if I cannot determine the frequencies within them?

@brwe
Contributor Author

brwe commented Jul 2, 2014

Maybe we could let the user give a hint for mutual information on whether the background is actually a strict superset or defines a completely different set. Something like

  "significant_terms": {
    "field": ...,
    ...
    "mutual_information": {
      "background_is_superset": true/false
    }
  }

and then derive the different frequencies depending on that? This way, the user would have all the flexibility.
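
One way to read "derive the different frequencies depending on that" is sketched below. It is an assumption, not the implementation in this pull request: the names `BackgroundFlagSketch`, `OtherSet`, and `otherSet` are hypothetical, and "a completely different set" is taken to mean that the background does not overlap the foreground.

```java
/** Hypothetical sketch of the proposed flag; not the PR's implementation. */
final class BackgroundFlagSketch {

    /** Term frequency and size of the set that the foreground is compared against. */
    static final class OtherSet {
        final long freq;
        final long size;

        OtherSet(long freq, long size) {
            this.freq = freq;
            this.size = size;
        }
    }

    static OtherSet otherSet(long fgFreq, long fgSize, long bgFreq, long bgSize,
                             boolean backgroundIsSuperset) {
        if (backgroundIsSuperset) {
            // The background contains the foreground: subtract the overlap.
            return new OtherSet(bgFreq - fgFreq, bgSize - fgSize);
        }
        // The background is an independent set: use its counts as they are.
        return new OtherSet(bgFreq, bgSize);
    }
}
```

With `fgFreq = 2`, `fgSize = 10`, `bgFreq = 5`, `bgSize = 100`, the flag set to true yields a comparison set with frequency 3 and size 90, while false yields frequency 5 and size 100.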

@markharwood
Contributor

Maybe I am missing something, but I do not see how I should compare two sets if I cannot determine the frequencies within them?

We potentially have a number of useful tools at our disposal in producing sample sets (background_filters, de-duping and doc "slicing") and they can all introduce some oddities into the numbers presented. Maybe it is a mistake to use words in the code like "subset" and "superset" to describe the numbers if certain algos expect that strict behaviour? Maybe foreground/background-sample are less charged words and your flag "is superset" helps clarify the position.

@brwe
Contributor Author

brwe commented Jul 3, 2014

I implemented all review comments I got so far, ready for the next review round.

@markharwood
Contributor

I just wanted to register the concern that this may well become a scoring function with custom params in future. Shouldn't be too hard to refactor if we choose to add this later.


private JLHScore() {};

@Override
Contributor

You must drop this equals method unless you have a corresponding hashCode impl. Yet since this is a singleton, you can just drop it.

Contributor Author

I removed it.

@s1monw
Contributor

s1monw commented Jul 8, 2014

Left a tiny comment. If you fix this you can push, i.e. LGTM.


protected static final ParseField IS_BACKGROUND = new ParseField("is_background");

protected static final String SCORE_ERROR_MESSAGE = ", does you background filter not include all documents in the bucket? If so and it is intentional, set \"" + IS_BACKGROUND.getPreferredName() + "\": false";
Contributor

typo: "does you". Curious why you moved away from is_superset as the parameter name? The new "is_background" says nothing about the required characteristics of the background. How about "background_is_superset"?

Contributor Author

"background_is_superset" is best. I'll change it to that.

@Override
public SignificanceHeuristic parse(XContentParser parser) throws IOException, QueryParsingException {
// move to the closing bracket
parser.nextToken();
Contributor

Check nextToken is Token.END_OBJECT and throw an appropriate error if not. Without this additional check the parser errors are somewhat confusing if the JSON contains a parameter.
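
The suggested check can be sketched without any Elasticsearch classes. The `Token` enum, the iterator-based "parser", and the error message below are stand-ins invented for the example; the real `XContentParser` API is not used here.

```java
import java.util.Iterator;

/** Hypothetical sketch of the suggested check; Token and the parser are stand-ins. */
final class EmptyObjectParserSketch {
    enum Token { START_OBJECT, FIELD_NAME, VALUE, END_OBJECT }

    /** Parses a heuristic that takes no parameters, i.e. its body must be "{}". */
    static void parseEmptyObject(Iterator<Token> tokens, String heuristicName) {
        // The caller has already consumed START_OBJECT; the next token must close the object.
        Token next = tokens.hasNext() ? tokens.next() : null;
        if (next != Token.END_OBJECT) {
            throw new IllegalArgumentException(
                "expected an empty object for [" + heuristicName + "] but got [" + next + "]");
        }
    }
}
```

Failing here names the heuristic and the unexpected token, instead of letting a later parsing step report an unrelated error.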

@markharwood
Contributor

Left a couple of small comments but otherwise looks great.

@s1monw s1monw removed the review label Jul 9, 2014
@brwe
Contributor Author

brwe commented Jul 11, 2014

implemented latest review comments

@s1monw
Contributor

s1monw commented Jul 14, 2014

LGTM +1 to push

brwe added a commit that referenced this pull request Jul 14, 2014
…e heuristic

This commit adds the infrastructure to allow plugging in different
measures for computing the significance of a term.
Significance measures can be provided externally by overriding

- SignificanceHeuristic
- SignificanceHeuristicBuilder
- SignificanceHeuristicParser

closes #6561
@brwe brwe closed this in 74927ad Jul 14, 2014
@clintongormley clintongormley changed the title significant terms: infrastructure for changing easily the significance h... Aggregations: Infrastructure for changing easily the significance terms heuristic Jul 16, 2014
@clintongormley clintongormley changed the title Aggregations: Infrastructure for changing easily the significance terms heuristic Infrastructure for changing easily the significance terms heuristic Jun 6, 2015