
[ML] Avoid very low p-values if the term is only a tiny fraction of the foreground set #76764

Merged

Conversation

tveasey
Contributor

@tveasey tveasey commented Aug 20, 2021

Whilst testing the p_value scoring heuristic for significant terms introduced in #75313, it became clear that, if the overall counts are high enough, we can assign arbitrarily low p-values to terms which constitute only a tiny fraction of the foreground set. Even if the difference in their frequency between the foreground and background sets is statistically significant, such terms don't explain the majority of the foreground cases and so are not of significant interest (certainly not in the use cases we have for this aggregation).
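To make the failure mode concrete, here is a minimal Python sketch. The one-sided two-proportion z-test is purely a stand-in for the aggregation's actual statistic, and the document and term counts are made up for illustration; none of this is the Elasticsearch implementation.

```python
import math

def one_sided_p_value(fg_count, fg_total, bg_count, bg_total):
    # One-sided two-proportion z-test that the foreground frequency exceeds
    # the background frequency; a stand-in for the aggregation's statistic.
    fg_freq, bg_freq = fg_count / fg_total, bg_count / bg_total
    pooled = (fg_count + bg_count) / (fg_total + bg_total)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / fg_total + 1.0 / bg_total))
    z = (fg_freq - bg_freq) / se
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# A term in only 0.01% of a one-million-document foreground set versus 0.001%
# of a ten-million-document background set: it explains almost none of the
# foreground, yet the counts are large enough that the raw p-value is on the
# order of 1e-90.
print(one_sided_p_value(100, 1_000_000, 100, 10_000_000))
```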

We already have some mitigation for the cases that 1. the term frequency is small on both the foreground and background sets, and 2. the term frequencies are very similar. These offset the actual term counts by a fixed small fraction of the background counts and make the foreground and background frequencies more similar by a small relative amount, respectively. This change simply applies the offsets to the term counts before making the frequencies more similar. For frequencies much less than the offset we therefore get equal frequencies on the foreground and background sets, and the p-value tends to 1. This retains the advantage of being a smooth correction to the p-value, so we get no strange discontinuities in the vicinity of the small absolute and difference thresholds for the frequency.
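Continuing the sketch above, here is one hedged way the correction could look. The 0.5% offset, the 5% "make more similar" factor, the order of the adjustments and the z-test are illustrative assumptions, not the constants or statistic used in the actual change.

```python
import math

def smoothed_p_value(fg_count, fg_total, bg_count, bg_total,
                     offset=5e-3, similarity=1.05):
    # Hypothetical sketch only: the 0.5% offset, the 5% similarity factor and
    # the two-proportion z-test are illustrative stand-ins, not the constants
    # or statistic used by Elasticsearch.
    #
    # 1. Offset the term frequencies by a small fixed amount *before* the
    #    similarity adjustment. Raw frequencies much smaller than the offset
    #    are dominated by it, so the two adjusted frequencies become nearly
    #    equal.
    fg_freq = fg_count / fg_total + offset
    bg_freq = bg_count / bg_total + offset
    # 2. Make the frequencies more similar by a small relative amount, here by
    #    inflating the background frequency so that tiny relative differences
    #    carry no evidence.
    bg_freq *= similarity
    # 3. One-sided test that the foreground frequency exceeds the background
    #    frequency, applied to the adjusted counts.
    fg_adj, bg_adj = fg_freq * fg_total, bg_freq * bg_total
    pooled = (fg_adj + bg_adj) / (fg_total + bg_total)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / fg_total + 1.0 / bg_total))
    z = (fg_freq - bg_freq) / se
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Same counts as in the previous sketch: both adjusted frequencies are
# dominated by the 0.5% offset, so the p-value rises to roughly 0.98 instead
# of ~1e-90 and the term is no longer reported as interesting.
print(smoothed_p_value(100, 1_000_000, 100, 10_000_000))
```

In this sketch the adjustments act directly and smoothly on the frequencies, mirroring the smooth-correction property described above.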

@tveasey tveasey added :ml Machine learning v8.0.0 labels Aug 20, 2021
@tveasey tveasey requested a review from benwtrent August 20, 2021 13:10
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Aug 20, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@benwtrent benwtrent added v7.16.0 >non-issue auto-backport Automatically create backport pull requests when merged labels Aug 20, 2021
Member

@benwtrent benwtrent left a comment


There are some minor formatting quibbles, but I think this is good. I pretty much had exactly this already in a local branch 😅

@tveasey
Contributor Author

tveasey commented Aug 20, 2021

Thanks...

I pretty much had exactly this already in a local branch 😅

and sorry (I guess it is remote pair programming).

@tveasey tveasey merged commit e511a25 into elastic:master Aug 20, 2021
@tveasey tveasey deleted the sig-terms-p-value-for-low-fraction-categories branch August 20, 2021 14:53
benwtrent pushed a commit to benwtrent/elasticsearch that referenced this pull request Aug 20, 2021
…he foreground set (elastic#76764)

benwtrent pushed a commit to benwtrent/elasticsearch that referenced this pull request Aug 20, 2021
…he foreground set (elastic#76764)

elasticsearchmachine pushed a commit that referenced this pull request Aug 20, 2021
…he foreground set (#76764) (#76773)

Co-authored-by: Tom Veasey <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Aug 23, 2021
…he foreground set (#76764) (#76772)

Co-authored-by: Tom Veasey <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
Labels
auto-backport (Automatically create backport pull requests when merged), :ml (Machine learning), >non-issue, Team:ML (Meta label for the ML team), v7.15.0, v7.16.0, v8.0.0-alpha2