
[ML] Avoid very low p-values if the term is only a tiny fraction of the foreground set #76764

Merged

Conversation

tveasey
Contributor

@tveasey tveasey commented Aug 20, 2021

Whilst testing the p_value scoring heuristic for significant terms introduced in #75313, it became clear that, if the overall counts are high enough, we can assign arbitrarily low p-values to terms which constitute only a tiny fraction of the foreground set. Even if the difference in their frequency between the foreground and background sets is statistically significant, such terms don't explain the majority of the foreground cases and so are not of significant interest (certainly not in the use cases we have for this aggregation).
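To make the failure mode concrete, here is a minimal Python sketch. The one-sided two-proportion z-test is purely a stand-in for the aggregation's actual statistic, and the document and term counts are made up for illustration; none of this is the Elasticsearch implementation.

```python
import math

def one_sided_p_value(fg_count, fg_total, bg_count, bg_total):
    # One-sided two-proportion z-test that the foreground frequency exceeds
    # the background frequency; a stand-in for the aggregation's statistic.
    fg_freq, bg_freq = fg_count / fg_total, bg_count / bg_total
    pooled = (fg_count + bg_count) / (fg_total + bg_total)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / fg_total + 1.0 / bg_total))
    z = (fg_freq - bg_freq) / se
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# A term in only 0.01% of a one-million-document foreground set versus 0.001%
# of a ten-million-document background set: it explains almost none of the
# foreground, yet the counts are large enough that the raw p-value is on the
# order of 1e-90.
print(one_sided_p_value(100, 1_000_000, 100, 10_000_000))
```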

We already have some mitigation for the cases that 1. the term frequency is small on both the foreground and background sets, and 2. the term frequencies are very similar. These offset the actual term counts by a fixed small fraction of the background counts and make the foreground and background frequencies more similar by a small relative amount, respectively. This change simply applies the offsets to the term counts before making the frequencies more similar. For frequencies much less than the offset we therefore get equal frequencies on the foreground and background sets, and the p-value tends to 1. This retains the advantage of being a smooth correction to the p-value, so we get no strange discontinuities in the vicinity of the small absolute and difference thresholds for the frequency.
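Continuing the sketch above, here is one hedged way the correction could look. The 0.5% offset, the 5% "make more similar" factor, the order of the adjustments and the z-test are illustrative assumptions, not the constants or statistic used in the actual change.

```python
import math

def smoothed_p_value(fg_count, fg_total, bg_count, bg_total,
                     offset=5e-3, similarity=1.05):
    # Hypothetical sketch only: the 0.5% offset, the 5% similarity factor and
    # the two-proportion z-test are illustrative stand-ins, not the constants
    # or statistic used by Elasticsearch.
    #
    # 1. Offset the term frequencies by a small fixed amount *before* the
    #    similarity adjustment. Raw frequencies much smaller than the offset
    #    are dominated by it, so the two adjusted frequencies become nearly
    #    equal.
    fg_freq = fg_count / fg_total + offset
    bg_freq = bg_count / bg_total + offset
    # 2. Make the frequencies more similar by a small relative amount, here by
    #    inflating the background frequency so that tiny relative differences
    #    carry no evidence.
    bg_freq *= similarity
    # 3. One-sided test that the foreground frequency exceeds the background
    #    frequency, applied to the adjusted counts.
    fg_adj, bg_adj = fg_freq * fg_total, bg_freq * bg_total
    pooled = (fg_adj + bg_adj) / (fg_total + bg_total)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / fg_total + 1.0 / bg_total))
    z = (fg_freq - bg_freq) / se
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Same counts as in the previous sketch: both adjusted frequencies are
# dominated by the 0.5% offset, so the p-value rises to roughly 0.98 instead
# of ~1e-90 and the term is no longer reported as interesting.
print(smoothed_p_value(100, 1_000_000, 100, 10_000_000))
```

In this sketch the adjustments act directly and smoothly on the frequencies, mirroring the smooth-correction property described above.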

@tveasey tveasey added :ml Machine learning v8.0.0 labels Aug 20, 2021
@tveasey tveasey requested a review from benwtrent August 20, 2021 13:10
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Aug 20, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@benwtrent benwtrent added v7.16.0 >non-issue auto-backport Automatically create backport pull requests when merged labels Aug 20, 2021
Member

@benwtrent benwtrent left a comment


There are some minor formatting quibbles, but I think this is good. I pretty much had exactly this already in a local branch 😅

@tveasey
Contributor Author

tveasey commented Aug 20, 2021

Thanks...

I pretty much had exactly this already in a local branch 😅

and sorry (I guess it is remote pair programming).

@tveasey tveasey merged commit e511a25 into elastic:master Aug 20, 2021
@tveasey tveasey deleted the sig-terms-p-value-for-low-fraction-categories branch August 20, 2021 14:53
benwtrent pushed a commit to benwtrent/elasticsearch that referenced this pull request Aug 20, 2021
…he foreground set (elastic#76764)

benwtrent pushed a commit to benwtrent/elasticsearch that referenced this pull request Aug 20, 2021
…he foreground set (elastic#76764)

elasticsearchmachine pushed a commit that referenced this pull request Aug 20, 2021
…he foreground set (#76764) (#76773)

Co-authored-by: Tom Veasey <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Aug 23, 2021
…he foreground set (#76764) (#76772)

Co-authored-by: Tom Veasey <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
Labels
auto-backport (Automatically create backport pull requests when merged), :ml (Machine learning), >non-issue, Team:ML (Meta label for the ML team), v7.15.0, v7.16.0, v8.0.0-alpha2