
[ML] adding new p_value scoring heuristic to significant terms aggregation #75313

Merged

Conversation

benwtrent
Member

This commit adds a new p_value score heuristic to significant terms.

The p_value is calculated assuming that the foreground set and the background set are independent Bernoulli trials, with the null hypothesis being that the probabilities are the same.

Example usage:

This calculates the p_value score for `user_agent.version` terms, given a foreground set of documents that "ended in failure" vs. a background set of those that did NOT end in failure.

NOTE: `"background_is_superset": false` indicates that the background set does not contain the counts of the foreground set, as we filter them out.

{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "event.outcome": "failure"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "2021-02-01",
              "lt": "2021-02-04"
            }
          }
        },
        {
          "term": {
            "service.name": {
              "value": "frontend-node"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "failure_p_value": {
      "significant_terms": {
        "field": "user_agent.version",
        "background_filter": {
          "bool": {
            "must_not": [
              {
                "term": {
                  "event.outcome": "failure"
                }
              }
            ],
            "filter": [
              {
                "range": {
                  "@timestamp": {
                    "gte": "2021-02-01",
                    "lt": "2021-02-04"
                  }
                }
              },
              {
                "term": {
                  "service.name": {
                    "value": "frontend-node"
                  }
                }
              }
            ]
          }
        },
        "p_value": {"background_is_superset": false}
      }
    }
  }
}
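Conceptually, the heuristic reduces to a one-sided test on binomial counts: how surprising is a term's foreground count under the null probability? Below is a minimal, self-contained sketch of that idea; it is not the plugin's implementation (the real code is the `LongBinomialDistribution` discussed in this PR and evaluates tails via regularized incomplete gamma/beta functions rather than direct summation, which only works for small `n`):

```java
// Sketch: one-sided binomial tail P(X >= k) for X ~ Binomial(n, p).
// Illustrative only; direct summation is fine for small n, but real
// implementations use the regularized incomplete beta/gamma functions.
final class BinomialTail {

    // Lanczos approximation of ln Gamma(x), accurate to ~13-15 digits for x > 0.
    static double lnGamma(double x) {
        final double[] c = {
            676.5203681218851, -1259.1392167224028, 771.32342877765313,
            -176.61502916214059, 12.507343278686905, -0.13857109526572012,
            9.9843695780195716e-6, 1.5056327351493116e-7
        };
        if (x < 0.5) { // reflection formula for small arguments
            return Math.log(Math.PI / Math.sin(Math.PI * x)) - lnGamma(1.0 - x);
        }
        x -= 1.0;
        double a = 0.99999999999980993;
        for (int i = 0; i < c.length; i++) {
            a += c[i] / (x + i + 1);
        }
        double t = x + 7.5;
        return 0.5 * Math.log(2 * Math.PI) + (x + 0.5) * Math.log(t) - t + Math.log(a);
    }

    // log of the binomial pmf, computed in log space to avoid overflow
    static double logPmf(long k, long n, double p) {
        double logChoose = lnGamma(n + 1.0) - lnGamma(k + 1.0) - lnGamma(n - k + 1.0);
        return logChoose + k * Math.log(p) + (n - k) * Math.log(1.0 - p);
    }

    /** P(X >= k): summing the upper tail directly avoids computing 1 - CDF. */
    static double survival(long n, double p, long k) {
        double sum = 0.0;
        for (long i = k; i <= n; i++) {
            sum += Math.exp(logPmf(i, n, p));
        }
        return sum;
    }
}
```

For example, `BinomialTail.survival(10, 0.5, 5)` returns P(X >= 5) for ten fair coin flips, i.e. 638/1024.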

@elasticmachine added the Team:ML (Meta label for the ML team) label on Jul 13, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@benwtrent
Member Author

@elastic/ml-docs Hello, I am not sure where to put the docs for this.

It's a new significance heuristic in the significant terms aggregation, provided by the ML plugin. Right now, there is no independent page for significance heuristics, nor a place to document plugin-provided ones.

@benwtrent
Member Author

@not-napoleon related to: #75264

I needed to move some files so that the ML plugin could access them. I failed to do that in the previous PR. The changes are rather small (mostly a package move).

*
* It expands its usage to allow `long` values instead of restricting to `int`
*/
public class LongBinomialDistribution {
Member Author


This code is mostly a copy-paste from Apache commons-math3. The only difference is that the parameter types are now `long` instead of `int`.
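The move from `int` to `long` matters because aggregated doc counts can exceed `Integer.MAX_VALUE`. A toy illustration (numbers invented) of the truncation an `int`-based API would force:

```java
// Illustration only: why the distribution needs long trial counts.
// An int-based API would force a narrowing cast on large doc counts,
// which silently overflows.
long docCount = 3_000_000_000L;  // plausible total doc count, > Integer.MAX_VALUE
int truncated = (int) docCount;  // wraps around to -1294967296
System.out.println(truncated < 0);  // true: unusable as a count
```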

public double survivalFunction(double x) {
    return x <= 0 ?
        0 :
        Gamma.regularizedGammaQ(gamma.getShape(), x / gamma.getScale());
}
Member Author


It is more accurate to use regularizedGammaQ directly instead of computing 1 - regularizedGammaP, as the latter can easily over/underflow for small values.
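The point generalises to any tail computation in floating point: once the lower tail rounds to 1.0, subtracting it from 1 loses the tail mass entirely. A tiny illustration in plain doubles:

```java
// Why evaluating the upper tail directly beats computing 1 - lowerTail:
// once the lower tail rounds to 1.0 in double precision, the subtraction
// returns exactly 0 and the tiny tail mass is lost.
double trueTail = 1e-18;            // genuine upper-tail probability
double lowerTail = 1.0 - trueTail;  // rounds to exactly 1.0 (machine eps ~ 2.2e-16)
double naiveTail = 1.0 - lowerTail; // 0.0 -- catastrophic cancellation
System.out.println(naiveTail == 0.0); // true
```

Evaluating the upper tail directly (as regularizedGammaQ does) keeps full relative precision even for very small tails.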

Contributor

@tveasey tveasey left a comment


Nice work! Everything looks good to me (except for one correction, which was a mistake in the prototype). I do think it would be good to allow key constants to be supplied as parameters, and I also made some suggestions for extra testing.

@benwtrent benwtrent requested a review from tveasey July 14, 2021 18:13
Contributor

@tveasey tveasey left a comment


I spotted one further simplification, since you pulled in the condition that the term frequency must be higher on the subset. Also, seeing the actual p-values, I realised that my suggested tests for the case where the fraction is within 5% were off by a factor.

@benwtrent
Member Author

Factored out the required agg test changes to this PR: #75452

@benwtrent
Member Author

run elasticsearch-ci/part-1

Member

@davidkyle davidkyle left a comment


LGTM


@Override
public void writeTo(StreamOutput out) throws IOException {
    out.writeBoolean(backgroundIsSuperset);
}
Member

@davidkyle davidkyle Jul 21, 2021


I would be tempted to use the super methods for writeTo, and the super StreamInput constructor, in case something changes in the base class. This class can't be constructed with includeNegatives == false, so that constraint is preserved.
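The suggestion can be sketched with plain java.io streams, since StreamOutput is Elasticsearch-internal: delegate to the base class so that any field it serializes (now or in the future) is written automatically. The class and field names below mirror the PR, but the wiring is illustrative only:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Generic analogue of the review suggestion; DataOutputStream stands in
// for Elasticsearch's StreamOutput here.
class BaseHeuristic {
    protected final boolean includeNegatives = true; // base-class state

    void writeTo(DataOutputStream out) throws IOException {
        out.writeBoolean(includeNegatives); // base class serializes its own fields
    }

    static byte[] toBytes(BaseHeuristic h) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            h.writeTo(new DataOutputStream(buf));
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class PValueHeuristic extends BaseHeuristic {
    private final boolean backgroundIsSuperset;

    PValueHeuristic(boolean backgroundIsSuperset) {
        this.backgroundIsSuperset = backgroundIsSuperset;
    }

    @Override
    void writeTo(DataOutputStream out) throws IOException {
        super.writeTo(out); // picks up any fields the base class adds later
        out.writeBoolean(backgroundIsSuperset);
    }
}
```

Serializing `new PValueHeuristic(false)` yields two bytes: the base-class boolean followed by the subclass one.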

@benwtrent
Member Author

run elasticsearch-ci/part-1

@benwtrent benwtrent merged commit 79c176c into elastic:master Jul 21, 2021
@benwtrent benwtrent deleted the feature/ml-p_value-sig-terms-heuristic branch July 21, 2021 15:39
elasticsearchmachine pushed a commit that referenced this pull request Jul 21, 2021
…aggregation (#75313) (#75597)

* [ML] adding new p_value scoring heuristic to significant terms aggregation (#75313)

* adjusting for backport
ywangd pushed a commit to ywangd/elasticsearch that referenced this pull request Jul 30, 2021
…ation (elastic#75313)

tveasey added a commit that referenced this pull request Aug 20, 2021
…he foreground set (#76764)

Whilst testing the p_value scoring heuristic for significant terms introduced
in #75313 it became clear that we can assign arbitrarily low p-values, if the overall
counts are high enough, to terms which constitute a very small fraction of the
foreground set. Even if the difference in their frequency on the foreground and
background sets is statistically significant, they don't explain the majority of the
foreground cases and so are not of significant interest (certainly not in the use
cases we have for this aggregation).

We already have some mitigations for the cases where (1) the term frequency is
small on both the foreground and background sets, and (2) the term frequencies are
very similar. These, respectively, offset the actual term counts by a fixed small
fraction of the background counts, and make the foreground and background
frequencies more similar by a small relative amount. This change simply applies the
offsets to the term counts before making the frequencies more similar. For frequencies
much less than the offset we therefore get equal frequencies on the foreground
and background sets, and the p-value tends to 1. This retains the advantage of being
a smooth correction to the p-value, so we get no strange discontinuities in the
vicinity of the small absolute and difference thresholds for the frequency.
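The effect of an additive offset can be pictured with a toy example (the constant and helper name here are invented for illustration; the plugin's actual offsets differ):

```java
// Hypothetical sketch, not the plugin's code: EPSILON and withOffset are
// made up to illustrate how an additive offset tames tiny frequencies.
final class OffsetDemo {
    static final double EPSILON = 1e-3; // assumed offset; the real constant differs

    // Frequencies far below EPSILON become indistinguishable after the
    // offset, while frequencies well above it are barely changed.
    static double withOffset(double frequency) {
        return frequency + EPSILON;
    }
}
```

With these numbers, a rare term at frequencies 1e-7 vs 5e-7 has an adjusted ratio of about 1.0004 despite a 5x raw ratio (so the p-value tends to 1), whereas 0.05 vs 0.2 keeps an adjusted ratio near 3.9 against a raw ratio of 4.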
benwtrent pushed a commit to benwtrent/elasticsearch that referenced this pull request Aug 20, 2021
…he foreground set (elastic#76764)

benwtrent pushed a commit to benwtrent/elasticsearch that referenced this pull request Aug 20, 2021
…he foreground set (elastic#76764)

elasticsearchmachine pushed a commit that referenced this pull request Aug 20, 2021
…he foreground set (#76764) (#76773)

Co-authored-by: Tom Veasey <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Aug 23, 2021
…he foreground set (#76764) (#76772)

Co-authored-by: Tom Veasey <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
Labels
>enhancement · :ml (Machine learning) · Team:ML (Meta label for the ML team) · v7.15.0 · v8.0.0-alpha1