Fix dag task scheduled and queued duration metrics #37936
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
I do see that a few metrics are measured in seconds. At some point we will have to unify them.
@potiuk @dirrao |
@htpawel, |
@dirrao |
This would break the dashboard of any user currently monitoring that metric and their timers will suddenly show 1000x longer durations, right? |
@ferruzzi |
I never said don't fix it. It's just a matter of if we call it a bugfix and fix it now, creating a breaking change and sending users scrambling to figure out why their queues are suddenly 1000 times longer, or if we have to go through the deprecation process so users are informed. Since this is documented as |
A timedelta object is handled internally by statsd, which converts it to milliseconds. But there are some cases where we explicitly pass seconds instead of a timedelta object, and the value is then stored in statsd as seconds. So I added a fix with a different key name: the newly added key persists the data in milliseconds, while the older key stays for backward compatibility with a deprecation message. |
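To make the mismatch concrete, here is a minimal sketch of the conversion described above. `to_statsd_ms` is a hypothetical stand-in written for illustration, not the statsd client's actual code; it assumes only the behaviour stated in this thread (a timedelta is converted to milliseconds, a bare number is taken as milliseconds already).

```python
from datetime import timedelta


def to_statsd_ms(delta):
    """Hypothetical stand-in for how a statsd-style timing() reads its argument.

    Assumption from the discussion: a timedelta is converted to
    milliseconds, while a bare number is treated as milliseconds already.
    """
    if isinstance(delta, timedelta):
        return delta.total_seconds() * 1000.0
    return float(delta)


queued_for = timedelta(seconds=2.5)

# Passing the timedelta itself yields milliseconds.
correct = to_statsd_ms(queued_for)

# Passing raw seconds is silently accepted and labelled as milliseconds.
buggy = to_statsd_ms(queued_for.total_seconds())

print(f"timedelta passed: {correct:.1f}|ms")
print(f"seconds passed:   {buggy:.1f}|ms")
```

The two values differ by a factor of 1000, which is the jump dashboards would see once the call site is corrected.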
Hey @vandonr - since you are the original author of #30612 and it will likely be easier for you to get all the context back - maybe you can comment on that one? Is that the right fix? Or maybe @ferruzzi @o-nikolas - since you were around when it was implemented - is there any reason those durations should be kept as they are (and maybe there is a constructive proposal for how to handle that case)? It looks like the new metrics introduced in #30612 caused some long-lasting issues, and possibly we should find a way to address them before 2.9.0 rc2 is out? |
I don't have anything against converting this metric to milliseconds, I believe I wasn't aware of that statsd recommendation when I wrote that code. However, there are plenty of other timers that are emitted in seconds in airflow, and I think if we make a migration effort, we might as well migrate all at once rather than little by little. We must be aware of the high impact this can have for users as well: if they have threshold alerts on those metrics, they will certainly ring when they see the metric go x1000. |
@vandonr You probably didn't read the whole thread or misunderstood it. Airflow does emit most (or all) of its metrics in seconds, and these two should also be emitted in seconds, as you intended, as the documentation states, and as I (and I suppose everyone) want. Unfortunately, right now they are NOT. That is because you need to pass milliseconds or a timedelta object to Statsd timing (then it will emit the metric in seconds), but you are passing seconds, which is incorrect. Check all the other places in the Airflow code: a timedelta object is passed. This is an obvious bug, which the bugfix above fixes. |
Other than adding a unit test to ensure this doesn't happen again, I'm happy to approve the change.
I think we can safely call this a bugfix and not worry about back-compat (https://xkcd.com/1172/ etc etc).
Please check here. Statsd timing internally sends the stat in milliseconds. I am sharing this because you said that Statsd timing emits the metric in seconds. Does statsd set the data in milliseconds but emit it in seconds? |
Ok now I am not 100% sure it emits it in seconds (it will require further investigation), but anyway it is 100% clear that it expects time delta or milliseconds as input, not seconds (it is even said in the code you sent in comment), to work properly. |
Ok, I did further investigation and now I know everything :D |
Sounds like a breaking bug-fix to me. I.e. yes - we know it will be breaking, but it was broken in the first place anyway. @ferruzzi - WDYT? I'd merge it and add a significant note about it |
I have been using Airflow with statsd-exporter for quite a long time and, to be honest, while I have found the documentation about the timers sometimes inconsistent, the values exported by the Statsd setup we are using look good to me (except the task scheduled and queued duration metrics, as I have not used them yet) |
So |
KILOSECONDS? Why the heck would anyone use KILOSECONDS for anything? It looks like the consensus is that we don't have to deal with deprecation, so that's great. |
Agree. |
@Bowrna @HTRafal @tanvn - Does this proposed change look like it will address your comments? If you all confirm, I'll approve and we can get it merged. The code looks fine to me, I just want to confirm that it will fix what you are seeing. @htpawel - Can you think of some kind of unit test on (Stats.timer, statsd_logger.timer and/or otel_logger.timer) that asserts the expected output format? Maybe create a timer, sleep(1), and assert the duration is how we want it to look? I know this was a really small thing and it really dragged out, sorry about that. |
If you are trying to make sure the metric is emitted in seconds and not milliseconds, maybe start a timer, sleep(1), and assert that Stats.timer() was called with a value of 1 and not 1000? |
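One way such a check could look is sketched below. It is only an illustration: `emit_queued_duration` and the metric name `task.queued_duration` are hypothetical placeholders, not Airflow's real call site or metric key, and the stats client is a plain `unittest.mock.Mock`. Instead of sleeping, it uses fixed timestamps one second apart and asserts that the timing call receives a timedelta rather than a bare number of seconds.

```python
from datetime import datetime, timedelta
from unittest import mock


def emit_queued_duration(stats, queued_dttm, start_dttm):
    """Hypothetical emitter: hands the timedelta to timing(), never raw seconds."""
    stats.timing("task.queued_duration", start_dttm - queued_dttm)


def check_timing_receives_timedelta():
    stats = mock.Mock()  # stands in for the real stats client
    queued = datetime(2024, 1, 1, 12, 0, 0)
    started = queued + timedelta(seconds=1)

    emit_queued_duration(stats, queued, started)

    # Inspect the positional arguments of the single timing() call.
    stats.timing.assert_called_once()
    args, _kwargs = stats.timing.call_args
    name, value = args
    assert isinstance(value, timedelta), "timing() must not receive raw seconds"
    assert value == timedelta(seconds=1)
    return name, value


name, value = check_timing_receives_timedelta()
```

A test of this shape only guards the call site it covers; it does not stop another caller from passing seconds elsewhere.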
I've already made sure that the metric is emitted in milliseconds (if you use it correctly, i.e. pass milliseconds or a timedelta), but if someone passes seconds it will obviously emit seconds (with the ms suffix..) and there is no test / mechanism to prevent this automatically in any way.. Developers just must know that Statsd expects milliseconds or a timedelta object only, that's the convention. |
That's horrible. So we don't have any way of knowing, catching, or preventing this from happening again? Alright. I'll approve it as-is then. |
If this is the case, we should probably add a newsfragment that warns users and explains the change |
@ferruzzi who has write access and can merge this? |
@eladkal seems like once again there is some flaky test failing, not related to my change. Could someone override its result or rerun only it or fix? Otherwise we will never merge such a simple fix :/ |
Ok. Let's treat it as a bugfix and merge. |
Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions. |
(cherry picked from commit bffb7b0)
From the statsd documentation (calling-timing-manually):
The `dt` int time must be in milliseconds rather than seconds. You can also check other examples in the Airflow code. The statsd lib then converts it to seconds before exporting the metric.