Initialize finished counter at zero #23080

Merged: 1 commit, Apr 22, 2022

Conversation

bilbof
Contributor

@bilbof bilbof commented Apr 19, 2022

Sets initial count of task finished state to zero. This enables acquiring the rate from zero to one (particularly useful if you want to alert on task failures).

We're using the Prometheus statsd-exporter. Since counters are usually used with a PromQL function like rate, it's important
that counters are initialized at zero, otherwise when a task finishes the rate function will not have a previous value to compare the state count to.

For example, what we'd like to do, which tells us the failure rate of tasks over time:

sum by (dag_id, task_id) (rate(airflow_ti_finish{state='failed'}[1h])) > 0

Two useful posts on this subject:
https://www.robustperception.io/why-predeclare-metrics
https://www.section.io/blog/beware-prometheus-counters-that-do-not-begin-at-zero/
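To illustrate why `rate` needs a previous value to compare against, here is a minimal Python sketch. It is a simplified model, not Prometheus's actual algorithm: `simple_increase` is a hypothetical helper that only sums the deltas between consecutive samples in a window and ignores extrapolation and counter resets.

```python
from typing import List, Optional


def simple_increase(samples: List[Optional[float]]) -> float:
    """Sum the deltas between consecutive samples that are present.

    A sample of None means the series did not exist at that scrape.
    Simplified model: no extrapolation, no counter-reset handling.
    """
    present = [s for s in samples if s is not None]
    total = 0.0
    for prev, cur in zip(present, present[1:]):
        total += cur - prev
    return total


# The counter first appears already at 1: there is no earlier sample to diff against.
not_initialized = [None, None, 1, 1]
# The counter is pre-declared at 0, then one task fails.
initialized = [0, 0, 1, 1]

print(simple_increase(not_initialized))  # 0.0
print(simple_increase(initialized))      # 1.0
```

In this model the first failure is invisible unless the series already existed at zero, which is the situation the alert query above would run into.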

@bilbof bilbof requested review from kaxil, XD-DENG and ashb as code owners April 19, 2022 12:38
@boring-cyborg

boring-cyborg bot commented Apr 19, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst).
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally. It’s a heavy Docker image, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: [email protected]
    Slack: https://s.apache.org/airflow-slack

Member

@ashb ashb left a comment

This increments that counter by 0. That doesn't seem right: https://statsd.readthedocs.io/en/v3.3/reference.html#StatsClient.incr

@bilbof
Contributor Author

bilbof commented Apr 21, 2022

Incrementing the counter by zero tells the receiver that the metric exists (it sets it to zero), which is useful when you want to capture the rate change from 0 to 1. Initializing metrics at zero is best practice for Prometheus metrics (which is how we're consuming the StatsD metrics from Airflow jobs).

I've tested that this works with the StatsD Prometheus exporter locally:

# first shell
docker run -p 9102:9102 -p 9125:9125  -it prom/statsd-exporter
# second shell
echo "foo:0|c" | nc localhost 9125
curl localhost:9102/metrics | grep foo

# Result:
# TYPE foo counter
> foo 0

@ashb
Member

ashb commented Apr 21, 2022

Oh, somehow I thought this was changing the existing metric we emitted. Looking at it again I see it's not. Clearly not.

@github-actions github-actions bot added the "full tests needed" label (We need to run full set of tests for this PR to merge) Apr 21, 2022
@github-actions

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@bilbof bilbof force-pushed the bf/init-counters-at-zero branch from 15545b7 to 3595aa7 Compare April 21, 2022 15:57
@bilbof
Contributor Author

bilbof commented Apr 21, 2022

Thanks @ashb, I rebased to the latest main.

Sets initial count of task finished state to zero.
This enables acquiring the rate from zero to one
(particularly useful if you want to alert on any failures).

We're using the Prometheus statsd-exporter. Since counters
are usually used with a PromQL function like `rate`, it's important
that counters are initialized at zero, otherwise when a task
finishes the rate function will not have a previous value to compare
the state count to.

For example, what we'd like to do:

```
sum by (dag_id, task_id) (rate(airflow_ti_finish{state='failed'}[1h])) > 0
```

This tells us the failure rate of tasks over time.

What I've tried to do instead to ensure the metric captures the change
from zero to one:

```
(sum by (dag_id, task_id) (rate(airflow_ti_finish{state='failed'}[1h])) > 0) or sum by (dag_id, task_id) (airflow_ti_finish{state='failed'} != 0 unless (airflow_ti_finish{state='failed'} offset 1m))
```

Two useful posts on this subject:
https://www.robustperception.io/why-predeclare-metrics
https://www.section.io/blog/beware-prometheus-counters-that-do-not-begin-at-zero/
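
As a rough sketch of the pre-declaration idea described in this commit message (not the actual change in this PR), the Python below emits a zero increment for every finished state before any task completes. `RecordingStats`, `FINISHED_STATES`, `predeclare_finish_counters`, and the `ti.finish.<dag_id>.<task_id>.<state>` name layout are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical list of terminal states; the real set of state names may differ.
FINISHED_STATES = ["success", "failed", "skipped"]


class RecordingStats:
    """Stand-in for a StatsD client that records incr() calls instead of sending them."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, int]] = []

    def incr(self, stat: str, count: int = 1) -> None:
        self.calls.append((stat, count))


def predeclare_finish_counters(stats: RecordingStats, dag_id: str, task_id: str) -> None:
    """Emit a zero increment per state so each series exists before any task finishes."""
    for state in FINISHED_STATES:
        stats.incr(f"ti.finish.{dag_id}.{task_id}.{state}", count=0)


stats = RecordingStats()
predeclare_finish_counters(stats, "example_dag", "example_task")
print(stats.calls)
```

Every recorded call carries a count of 0, so existing totals are unchanged while the receiver learns that each series exists.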
@bilbof bilbof force-pushed the bf/init-counters-at-zero branch from 3db7bf0 to a272afe Compare April 21, 2022 22:42
@bilbof
Contributor Author

bilbof commented Apr 22, 2022

@ashb just a gentle nudge about this one - tests are passing now 😃

@ashb ashb merged commit 3b2ef88 into apache:main Apr 22, 2022
@boring-cyborg

boring-cyborg bot commented Apr 22, 2022

Awesome work, congrats on your first merged pull request!

@jedcunningham jedcunningham added the "type:improvement" label (Changelog: Improvements) Apr 25, 2022
@jedcunningham
Member

Thanks @bilbof, congrats on your first commit 👍

Labels
full tests needed (We need to run full set of tests for this PR to merge), type:improvement (Changelog: Improvements)