Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIP-49] [OTel Integration] Add tagging to existing stats #30496

Merged
merged 10 commits into from
Apr 18, 2023

Conversation

ferruzzi
Copy link
Contributor

@ferruzzi ferruzzi commented Apr 5, 2023

Groundwork PR for the AIP-49 OpenTelemetry integration work. Many stats had tagging already, this PR cleans up some of them and adds tagging to the ones that were missing it.

cc @o-nikolas @vincbeck @vandonr-amz @syedahsn

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Apr 5, 2023
@ferruzzi
Copy link
Contributor Author

ferruzzi commented Apr 6, 2023

Interesting. I'll fix the tests

airflow/dag_processing/processor.py Outdated Show resolved Hide resolved
airflow/dag_processing/processor.py Outdated Show resolved Hide resolved
airflow/executors/base_executor.py Outdated Show resolved Hide resolved
@@ -540,6 +540,10 @@ def __init__(
# can be changed when calling 'run'
self.test_mode = False

@property
def stats_tags(self) -> dict[str, str]:
return prune_dict({"dag_id": self.dag_id, "run_id": str(self.run_id), "task_id": self.task_id})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't run_ID a meaningless autogenerated number ? If so, I think it'd add little value to the metric, while adding a lot of cardinality, making the costs 📈 for users

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't run_ID a meaningless autogenerated number ? If so, I think it'd add little value to the metric, while adding a lot of cardinality, making the costs 📈 for users

I don't think it's a meaningless, but yes - it is auto-generated. For example, scheduled__2023-04-11T19:10:00+00:00 but I have to agree, that it will generate unnecessary cardinality since each of these will attribute to a distinct time series and users would probably going to see dots instead of lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed it in 3c32c7d

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rethinking this, that was already in there (here). All I did was move it from an inline declaration to a property. So removing it now would/could/should be considered a breaking change. How strongly do we feel about that? I'm tempted to leave it in at this point unless there is a reason to drop it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though this may be a breaking change, after thinking about it, I believe adding run_id was not a good design decision as the presence of that label would not create a continuous time series, but rather a multiple metrics data that are going to become high-cardinality pattern.

My usual thought process for determining if something is a metric data or not, is whether the series could be something that we could connect it as a 'time series' (connecting dots to make it into the chart) or not. Having each data point with distinct run_id would not make it into time series, but just a series of distinct individual 'events' that would simply be plotted as dots in the chart (since each of them would have their own distinct run_id).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree.

airflow/stats.py Show resolved Hide resolved
@ferruzzi
Copy link
Contributor Author

Sorry for the delay, I was out of town celebrating my anniversary :P I will look at the failing tests and get to the comments today. Thanks for your patience!

@potiuk
Copy link
Member

potiuk commented Apr 11, 2023

Sorry for the delay, I was out of town celebrating my anniversary :P I will look at the failing tests and get to the comments today. Thanks for your patience!

Congrats! That must have been a celerbration! 🎉 🎉 🎉 🎉 🎉

@ferruzzi
Copy link
Contributor Author

Congrats! That must have been a celerbration

15 years! I can't believe anyone would put up with me that long!

@howardyoo
Copy link
Contributor

Congrats! That must have been a celerbration

15 years! I can't believe anyone would put up with me that long!

Congrats! Considering the fact that my daughter is 16 years old now, I think my marriage has lasted +1 year of that time, and I do feel the same about how understanding my wife has been with me :-) .

@ferruzzi ferruzzi force-pushed the ferruzzi/otel/m2-stat-tagging branch from 64c780f to 3c32c7d Compare April 12, 2023 02:27
@ferruzzi
Copy link
Contributor Author

ferruzzi commented Apr 12, 2023

Rebased and made a handful of adjustments based on feedback above. Still pending a decision on including the email address(es); I'll adjust the test and/or the method accordingly once we have some consensus there.

Also pending a deeper look at the tests which are failing because AttributeError: 'TaskInstance' object has no attribute 'stat_tags' when it has the @property here.

@ferruzzi
Copy link
Contributor Author

@o-nikolas mentioned that it was likely because the stats_tags wasn't be stored in the db, so I thought making it a property and assembling it there was a clever solution. I may have to inline it (8 uses) but I was hoping for a more DRY answer.

@o-nikolas
Copy link
Contributor

Also pending a deeper look at the tests which are failing because AttributeError: 'TaskInstance' object has no attribute 'stat_tags' when it has the @Property here.
@o-nikolas mentioned that it was likely because the stats_tags wasn't be stored in the db, so I thought making it a property and assembling it there was a clever solution. I may have to inline it (8 uses) but I was hoping for a more DRY answer.

The code link you added is for the DagRun object, but the exception is referring to a TaskInstance object, I think you're missing a similar property on the TI (if that's what was intended, or you're trying to get that property on a different object than you think you're using).

@ferruzzi
Copy link
Contributor Author

Nah, I'm just a dough-head and linked the wrong place. Fixed the link. It's in both places.

@ferruzzi
Copy link
Contributor Author

It's a typo. stats_tags vs stat_tags

@o-nikolas
Copy link
Contributor

It's a typo. stats_tags vs stat_tags

Nice, the easiest of fixes! 😃

@ferruzzi
Copy link
Contributor Author

Nice, the easiest of fixes! smiley

And among the more embarrassing ones :P Running static checks and double-checking it was only the one place I made that mistake and I'll have another commit pushed shortly.

That just leaves the email(s) question pending.

@ferruzzi
Copy link
Contributor Author

ferruzzi commented Apr 12, 2023

Alright, a whole slew of new ones failing now that the typo is out of the way. I have to update a handful of unit tests that were checking for called_with(). I'll worry about that tomorrow.

[DONE]

@ferruzzi
Copy link
Contributor Author

In this test we are specifically checking to make sure that there are no tags in the dagrun.{dag.dag_id}.first_task_scheduling_delay metric, but that they do exist in the dagrun.first_task_scheduling_delay metric. Is that actually the desired behavior or is that a side effect of the test that wasn't intended?

I was planning to add tagging to every metric, but perhaps I overdid it and it's not wanted or needed on some?

@ferruzzi ferruzzi force-pushed the ferruzzi/otel/m2-stat-tagging branch from 2070795 to 1c09535 Compare April 12, 2023 19:09
@ferruzzi
Copy link
Contributor Author

Sorry for the churn. Post-vacation brain melt, I guess :P All tests are currently passing.

Still pending discussion/decisions:

  1. Do we want emails in the tags (discussion here)
  2. I think I am going to add the dag_run_id back in as that was there before I started tinkering and removing it feels like a breaking change. (discussion here)

@howardyoo
Copy link
Contributor

Sorry for the churn. Post-vacation brain melt, I guess :P All tests are currently passing.

Still pending discussion/decisions:

  1. Do we want emails in the tags (discussion here)
  2. I think I am going to add the dag_run_id back in as that was there before I started tinkering and removing it feels like a breaking change. (discussion here)

I wrote a couple of comments, but they're currently in pending mode (so I guess only I can see them?) Not sure how to make them visible :-(

Copy link
Contributor

@howardyoo howardyoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left out a few comments on the topics.

  • I believe it is better to remove the email_id from the label (tag)
  • Also, it may be helpful if we could remove run_id from the label.

airflow/models/taskinstance.py Outdated Show resolved Hide resolved
airflow/dag_processing/processor.py Outdated Show resolved Hide resolved
@@ -540,6 +540,10 @@ def __init__(
# can be changed when calling 'run'
self.test_mode = False

@property
def stats_tags(self) -> dict[str, str]:
return prune_dict({"dag_id": self.dag_id, "run_id": str(self.run_id), "task_id": self.task_id})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though this may be a breaking change, after thinking about it, I believe adding run_id was not a good design decision as the presence of that label would not create a continuous time series, but rather a multiple metrics data that are going to become high-cardinality pattern.

My usual thought process for determining if something is a metric data or not, is whether the series could be something that we could connect it as a 'time series' (connecting dots to make it into the chart) or not. Having each data point with distinct run_id would not make it into time series, but just a series of distinct individual 'events' that would simply be plotted as dots in the chart (since each of them would have their own distinct run_id).

@ferruzzi
Copy link
Contributor Author

ferruzzi commented Apr 13, 2023

Alright, the last couple commits there should remove the email and run_id tags and update the unit tests accordingly.

@ferruzzi ferruzzi force-pushed the ferruzzi/otel/m2-stat-tagging branch from 8a29124 to 1d26c9c Compare April 13, 2023 20:46
@ferruzzi ferruzzi force-pushed the ferruzzi/otel/m2-stat-tagging branch from 1d26c9c to 6484b32 Compare April 13, 2023 20:48
@ferruzzi
Copy link
Contributor Author

@uranusjr @vincbeck @howardyoo - Are we good? Can I get a review/approval on this one when you get time? I think I've covered all comments.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @uranusjr ?

@uranusjr uranusjr merged commit 7e1dace into apache:main Apr 18, 2023
1 check passed
wookiist pushed a commit to wookiist/airflow that referenced this pull request Apr 19, 2023
@ephraimbuddy ephraimbuddy added this to the Airflow 2.7.0 milestone May 8, 2023
@ephraimbuddy ephraimbuddy added the type:new-feature Changelog: New Features label May 8, 2023
@ephraimbuddy ephraimbuddy modified the milestone: Airflow 2.7.0 May 8, 2023
@conorbev
Copy link

hey @ferruzzi I'm curious about some of these metrics and if you had further plans to change them. Like e.g.: Stats.incr(f"ti.start.{self.task.dag_id}.{self.task.task_id}", tags=self.stats_tags) <-- this metric would now have dag_id and task_id encoded both as part of the metric name and as separate tags ?

@ferruzzi
Copy link
Contributor Author

@conorbev "plans" may be a strong word, but I do intend to do more work on the OTel stuff as I get time.

The reason they still have the embedded values is because not all metrics backends support tagging, so the names themselves have to remain the same for compatibility. The tags are there to hep for backends which benefit from them (like OTel) but if we start changing the names of existing metrics we need to go through a whole deprecation cycle. Or we emit everything twice, once with the embedded values and again without, but then anyone using OTel or other tag-friendly backend will be seeing everything double. It's a bit of a no-win situation at the moment and would require a community discussion and consensus on how to proceed.

@ephraimbuddy ephraimbuddy added changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) and removed type:new-feature Changelog: New Features labels Aug 2, 2023
@ferruzzi ferruzzi changed the title [OTel Integration] Add tagging to existing stats [AIP-49] [OTel Integration] Add tagging to existing stats Oct 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants