[Feature] Setup dashboard for Airflow monitoring #10341
I have completed one issue about the agent, and I'm interested in the OAP. I think I can do this task; could you please assign it to me?
Assigned. Good luck.
OK, I will do my best.
@mufiye You could take one step at a time. Make metrics available for Airflow first, then move forward on logs.
Hello, @wu-sheng. I find that for all the data the OpenTelemetry collector receives, everything is encoded in the metric name, with no tag attributes, such as "airflow.ti.finish.tutorial.templated.up_for_reschedule" and "airflow.ti.finish.tutorial.print_date.shutdown". I have no idea how to write the MAL rules to process this data.
What tag do you need? Tags are not required. For describing the Airflow server, they could be set through the OTEL collector, as we did for the MySQL metrics.
I mean that all the info is contained in the metric name, such as <job_name>, <job_id>, <dag_id>, <task_id>, <operator_name>, and so on, but I have no way to filter and process it. Or should I just not consider these metrics? Because of the StatsD data format, the collected OTEL data will not have key-value-pair tag attributes.
Does the original StatsD have this metadata?
This metadata composes the name, such as "local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code>".
@potiuk Do you have time to help?
OK, if it does, we could write a small MAL script (Groovy-based) to split this metadata.
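To illustrate the shape such a script could take, here is a hypothetical sketch modeled on SkyWalking's existing otel-rules files. Everything in it is an assumption for illustration: the `Layer.AIRFLOW` layer does not exist yet (it is what this issue would add), the `meter_airflow` prefix and tag names are invented, and it presumes the name metadata has already been split into tags on the collector side.

```yaml
# Hypothetical MAL rule-file sketch (not a working rule).
# Assumes the OTEL collector has already split the StatsD metric name
# into tags such as dag_id/task_id before exporting to the OAP.
expSuffix: tag({tags -> tags.host_name = 'airflow::' + tags.host_name}).service(['host_name'], Layer.AIRFLOW)
metricPrefix: meter_airflow
metricsRules:
  - name: ti_finish
    exp: airflow_ti_finish.sum(['host_name', 'dag_id', 'task_id'])
```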
@mufiye Did you check the OTEL-side configurations? Is there a way to change their style? Otherwise, we may need to build a StatsD receiver.
I think a processor of the OTEL collector may be able to do this; I need to check that part.
I suspect you have mixed up the concepts of "Airflow job" and "OpenTelemetry job". We use the OpenTelemetry job name to distinguish data sources. As for the metric name like
I think I said something wrong. One could add key-value-pair tags to a StatsD message, but Airflow actually only uses the name to carry this metadata. I think using an OpenTelemetry processor to process the data may be a feasible method. About the "job": it is just the original Airflow metric name.
Not much time to help (catching up with some stuff), but for what it is worth: Airflow's StatsD is not the "best" to consume for SkyWalking. Unfortunately you'd indeed need to parse the metric name, and while I am not sure how an OTEL processor might work, a regexp approach might be a good idea. However, just to give you some perspective: Airflow's metrics are evolving. Quite recently (coming in the next version of Airflow, most likely 2.6), @hussein-awala improved the StatsD metrics with DataDog metadata tags (apache/airflow#28961), and maybe, rather than focusing on pure StatsD metrics, you could integrate those. Also, a bit more long-term: in Airflow we have already approved OpenTelemetry support (https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow), and we even have a chance to progress with the implementation: @feruzzi is looking into the integration and is even adding better support for testing Airflow's StatsD metrics in Breeze (the Airflow development environment) with Grafana and Prometheus (apache/airflow#29449). So maybe it could be nice teamwork.
I have done the research and started a new discussion in opentelemetry-collector-contrib. @wu-sheng @kezhenxu94
Could you use
I tried it, but it doesn't work, because the third argument of
Do you think this is too complex? In the transfer process, you should be able to hardcode most of them, right?
The tag key is static and hard-coded; a function such as replace_match would change everything, so it may not be a good fit. And I found the docs about list indexing: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/ottl/README.md#lists
I think I just said something wrong; I meant that it will change the value of the corresponding key.
I think we cannot get a single string out of the array. The doc says that "the grammar does not provide an accessor to individual list entries".
We could use this (replacing ConvertCase with another function) to set the metric name without the parameters. Meanwhile, could you check what I am missing?
Yes, you are right. I have tried this before with the config below. The most important thing, I think, is how to process the attributes in the processors:

```yaml
transform:
  metric_statements:
    - context: resource
      statements:
    - context: datapoint
      statements:
        - set(attributes["job_id"], metric.name)
    - context: metric
      statements:
        - replace_match(name, "system.*.cpu", "system.cpu")
```
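A possible extension of that config, sketching the regexp idea discussed above. This is hedged: `replace_pattern`, `IsMatch`, and `$$1` capture-group references are documented OTTL features, but the metric-name pattern and the `dag_id` attribute key here are assumptions for illustration, not tested rules.

```yaml
transform:
  metric_statements:
    - context: datapoint
      statements:
        # Copy the metric name, then cut it down to the <dag_id> segment of
        # names shaped like "airflow.ti.finish.<dag_id>.<task_id>.<status>".
        - set(attributes["dag_id"], metric.name) where IsMatch(metric.name, "^airflow\\.ti\\.finish\\..+")
        # "$$1" refers to the first regex capture group.
        - replace_pattern(attributes["dag_id"], "^airflow\\.ti\\.finish\\.([^.]+)\\..*$", "$$1")
```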
I just read https://opentelemetry.io/docs/reference/specification/metrics/data-model/#exponentialhistogram; it seems it is just the typical Prometheus histogram setup in practice. Back to your question
We should transfer this to our histogram, I think. You need to get the bucket conversion correct from
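For reference, the bucket boundaries in the ExponentialHistogram data model linked above are derived from the `scale` field:

```latex
% OTEL ExponentialHistogram: bucket i covers (base^i, base^(i+1)],
% where the base is derived from the scale parameter.
base = 2^{2^{-scale}}, \qquad bucket_i = \left( base^{\,i},\; base^{\,i+1} \right]
```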
I will try to do it later. And there are some other essential points that need to be discussed.
About <1>, the easiest way is,
How could the process count be negative? What does it mean originally?
Because the total number, which is the sum of the gauge values, means the number of currently running DAG parsing processes, a single delta value can be negative. "Originally" means we just show the gauge values, whether they are negative or positive.
Then, in this case, it seems we never get the absolute value, do we? Does it report the absolute value somehow?
Sorry, I don't get it. Could you explain your perspective a bit more?
If a time-series value is delta, let's say (-5, 4, 3, 1, -4), then unless we know the initial value is 10 (or any value), we can never know the exact value. So, do we have that number, or do we have the total number of processes? If not, we can only see the trend.
I think we can't get the total number of processes unless we add up every delta value.
There is no absolute value, then. Could you check how this works with StatsD? For example, check and try apache/airflow#29449?
You mean to check how Airflow collects metrics and sends out the StatsD data?
I am thinking about how they visualize this type, so I think we could try this on Prometheus/Grafana.
OK, I get it. I will check how they use their metrics.
I think tasks and pools are an inclusion relation, but the others are not. Furthermore, from the metric name we cannot tell which task is in which pool. Maybe making these components the same level is the only way.
I don't know as much as you do.
I think it is because of the counter definition in Prometheus metrics: a counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.
If Prometheus can identify/use it as a counter, why can't we? We converted it to delta because it isn't cumulative.
In my opinion, they always do the accumulation for counter metrics, whether the metrics have been stored or not. But we cannot do the accumulation for metrics that have already been stored.
If you could push a counter to the OAP, we could work on that. Your previous context said there is a delta only.
I think I can only push a "delta type counter" to the OAP via the OTEL collector. Maybe we could support accumulating a "delta type counter"? It may be complicated, but I can try to do it.
I think you need to check what a delta counter is. A counter is increasing or reset. How does delta apply to this case?
I think this dag_processing.process metric does not meet the Prometheus counter definition; it can decrease. I'm sure because I tested it. It is the PR I found.
That is my point in asking. Focus only on this metric: whether they show it, and how they show it.
OK, I get it.
@mufiye Any updates or blockers?
I think I should pause here temporarily. I am preparing to find an internship now and have no time to continue this issue in the next two weeks. You can unassign this issue from me.
Got it. Thanks for the feedback.
Search before asking
Description
This is an open issue for new contributors. Apache Airflow is a widely used workflow scheduler. We are encouraging someone new to the community to add a new level catalog (workflow) for Airflow.
Metrics
Airflow exposes metrics through StatsD, https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html.
We could use StatsD + the OpenTelemetry StatsD receiver (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/statsdreceiver/README.md) + the OpenTelemetry OTLP exporter to ship the metrics to the SkyWalking OTEL receiver.
Then use MAL to build metrics as well as a dashboard for those metrics. Notice: a new layer and a new UI menu should be added.
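A minimal sketch of that pipeline for the collector side. This is hedged: the hostnames and the StatsD listen port are assumptions; 11800 is the OAP's default gRPC port.

```yaml
receivers:
  statsd:
    endpoint: 0.0.0.0:8125     # point Airflow's statsd_host/statsd_port here (assumed port)
exporters:
  otlp:
    endpoint: oap:11800        # SkyWalking OAP OTEL receiver; the "oap" host is an assumption
    tls:
      insecure: true
service:
  pipelines:
    metrics:
      receivers: [statsd]
      exporters: [otlp]
```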
Logging
Airflow supports Fluentd to ship logs, https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/logging-architecture.html. SkyWalking already has Fluentd setup support, so we should be able to receive and catalog the logs.
Additionally, the task logs seem an interesting thing. We could use LAL (Log Analysis Language) to group the logs by task name (or ID) by treating tasks as endpoints (a SkyWalking concept).
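A hypothetical LAL sketch of that grouping. The rule name, the `task_id` tag key, and the layer are illustrative assumptions, not verified SkyWalking configuration:

```yaml
rules:
  - name: airflow-task-log
    layer: AIRFLOW            # hypothetical layer this issue would add
    dsl: |
      filter {
        extractor {
          // Treat the Airflow task id (assumed to arrive as a log tag) as an endpoint.
          endpoint tag("task_id")
        }
        sink {
        }
      }
```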
Use case
Add more observability for Airflow server.
Related issues
No response
Are you willing to submit a PR?
Code of Conduct