-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Infrastructure for changing easily the significance terms heuristic #6561
Conversation
The scores are derived from the doc frequencies in _foreground_ and _background_ sets. The _absolute_ change in popularity (foregroundPercent - backgroundPercent) would favor common terms whereas the _relative_ change in popularity (foregroundPercent/ backgroundPercent) would favor rare terms. Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall. | ||
|
||
===== mutual information | ||
added[1.2.3] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should not go into a bugfix release (ie. s/1.2.3/1.3.0/)
|
||
protected static final String[] NAMES = {"mutual_information", Strings.toCamelCase("mutual_information")}; | ||
|
||
protected static final ParseField NAMES_FIELD = new ParseField(NAMES[0], NAMES[1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that ParseField does the camel casing for you and the 2nd/3rd... args to its constructor are actually deprecated names so that in future we can run in "strict" mode and flag any client uses of deprecated APIs
@markharwood I thought I should make a second pull request that adds that and also Chi square and all that? It is lots of code already |
+1 to split into several pull requests |
I added two commits to add the deprecated names checking but only for the significant terms heuristics here. It seems to me that deprecated names are never checked in aggregations anywhere unless I am missing something. I am now wondering if would make more sense to add that to aggregations in a separate commit. |
Removed the strict parsing flag check again, seems to make more sense to do that consistently in a different pull request. |
I only looked briefly at this but can we add extensive unittests for the individual heuristics, I think we should add those for these! |
@@ -120,6 +122,7 @@ public void readFrom(StreamInput in) throws IOException { | |||
this.minDocCount = in.readVLong(); | |||
this.subsetSize = in.readVLong(); | |||
this.supersetSize = in.readVLong(); | |||
significanceHeuristic = SignificanceHeuristicStreams.read(in); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you only read it if the version is >= 1,3,0 and otherwise fall back to the default impl?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests run now, I rebased on master
Updated with new commits. @s1monw : I added unit tests in SignificantTermsUnitTests, is that extensive enough? |
/** | ||
* | ||
*/ | ||
@ElasticsearchIntegrationTest.SuiteScopeTest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it is an ElasticsearchLuceneTestCase
you don't need @ElasticsearchIntegrationTest.SuiteScopeTest
. I also think it should extend ElasticsearchTestCase
rather than ElasticsearchLuceneTestCase
or do you use any lucene specific parts? Can we also call this test not ..UnitTest
maybe SignificanceHeuristicTest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, commit is "make ElasticsearchTestCase and rename..."
@markharwood About the assertions in the scoring function: I agree, we might not always want to rely on the strict superset property. However, for mutual information we sort of rely on the fact that it is strict, else the computations do not make sense. Mutual information compares two sets and not so much foreground against background. I assumed that the two sets are the subset and the background without the subset. It therefore relies on knowing the frequency in the subset but also the frequency in the background set without the subset. Because currently I only get the background frequency, I have to do a subtraction of background frequency and foreground frequency to figure out how many are in the other set. Now an example: I will remove the assertions from the default score and only leave them in mutual information. Actually I am thinking I should replace the asserts by exceptions to make sure users are aware that whatever is computed is wrong... |
I find practical uses of these significance algos on free text are vastly improved if the foreground sample is devoid of the sorts of duplicate text introduced by retweets, email replies, copyright notices etc. we find in typical content. This is the area I am working on at the moment to efficiently strip out repetitions and this will only add to the fuzziness of the numbers presented (e.g. I count only half of the text in documents in a result set). This will mean the foreground sample under-reports word frequencies and any significance algos shouldn't be too thrown off by that. |
@markharwood the problem does not arise from under reports of word frequencies but from the inability to clearly distinguish what the frequencies in the two sets are that are compared.
Maybe I am missing something, but I do not see how I should compare two sets if I cannot determine the frequencies within them? |
Maybe we could let the user give a hint for mutual information if or if not the background is actually a strict superset or defines a completely different set. Something like
and then derive the different frequencies depending on that? This way, the user would have all the flexibility. |
We potentially have a number of useful tools at our disposal in producing sample sets (background_filters, de-duping and doc "slicing") and they can all introduce some oddities into the numbers presented. Maybe it is a mistake to use words in the code like "subset" and "superset" to describe the numbers if certain algos expect that strict behaviour? Maybe foreground/background-sample are less charged words and your flag "is superset" helps clarify the position. |
I implemented all review comments I got so far, ready for the next review round. |
I just wanted to register the concern that this may well become a scoring function with custom params in future. Shouldn't be too hard to refactor if we choose to add this later. |
|
||
private JLHScore() {}; | ||
|
||
@Override |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you must drop this equals method unless you have a corresponding hashCode impl Yet since this is a singleton you can just drop it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed it.
left a tiny comment - if you fix this you can push ie LGTM |
|
||
protected static final ParseField IS_BACKGROUND = new ParseField("is_background"); | ||
|
||
protected static final String SCORE_ERROR_MESSAGE = ", does you background filter not include all documents in the bucket? If so and it is intentional, set \"" + IS_BACKGROUND.getPreferredName() + "\": false"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: "does you". Curious why you moved away from is_superset
as the parameter name? The new "Is_background" says nothing about the required characteristics of the background. How about "background_is_superset" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"background_is_superset" is best. I'll change it to that.
@Override | ||
public SignificanceHeuristic parse(XContentParser parser) throws IOException, QueryParsingException { | ||
// move to the closing bracket | ||
parser.nextToken(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check nextToken is Token.END_OBJECT and throw appropriate error if not. Without this additional check the parser errors are somewhat confused if the JSON contains a parameter.
Left a couple of small comments but otherwise looks great. |
implemented latest review comments |
LGTM +1 to push |
…e heuristic This commit adds the infrastructure to allow pluging in different measures for computing the significance of a term. Significance measures can be provided externally by overriding - SignificanceHeuristic - SignificanceHeuristicBuilder - SignificanceHeuristicParser closes #6561
...euristic
This commit adds the infrastructure to allow pluging in different
measures for computing the significance of a term.
Significance measures can be provided externally by overriding
and registering Parser and Heuristic at the SignificantTermsHeuristicModule.
As a proof of concept, this commit also adds a second heuristic to the
already existing one (MutualInformation).