fix(ingestion/airflow-plugin): airflow remove old tasks #10485
Conversation
Commits ebd7762 to 49c689c
logger.debug("Initiating the cleanup of obsselete data from datahub") | ||
|
||
ingested_dataflow_urns = list( | ||
self.graph.get_urns_by_filter( |
I think you should filter for cluster as well; otherwise, if the user has multiple Airflow instances, you will delete DAGs which you shouldn't.
I am filtering on the entire URN, which already contains the cluster, i.e. urn:li:dataFlow:(airflow,simple_dag,prod). So do we still need to match/filter the cluster explicitly, or is my understanding wrong?
If you check here, we use cluster or env to generate the DataFlow URNs, so it is part of the URN. ->
self.urn = DataFlowUrn.create_from_ids(
This means that if the env or cluster is set and the user has multiple Airflow environments, like DEV and PROD, then your query will return the URNs for both PROD and DEV, which we don't want in this case, as these are different Airflow environments. You should add cluster/env as a filter parameter.
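A minimal sketch of what that filter could look like, assuming the acryl-datahub DataHubGraph client and that get_urns_by_filter accepts entity_types, platform, and env keyword arguments; the server URL and cluster value are placeholders:

# Sketch: restrict the cleanup query to this Airflow instance's own cluster/env.
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))  # placeholder server
cluster = "prod"  # the env/cluster this plugin instance emits, e.g. "prod" or "dev"

ingested_dataflow_urns = list(
    graph.get_urns_by_filter(
        entity_types=["dataFlow"],
        platform="airflow",
        env=cluster,  # without this, URNs from other Airflow environments come back too
    )
)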
airflow_job_urns: List = []

for dag in all_airflow_dags:
    flow_urn = builder.make_data_flow_urn(
The cluster should be passed in here if it exists.
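A small sketch of what that could look like, assuming the standard datahub.emitter.mce_builder.make_data_flow_urn helper (which accepts orchestrator, flow_id, and cluster) and Airflow's DagBag to enumerate DAGs; the cluster value and the airflow_flow_urns name are placeholders:

# Sketch: pass the configured cluster when rebuilding flow URNs for comparison.
from typing import List

import datahub.emitter.mce_builder as builder
from airflow.models import DagBag

cluster = "prod"  # taken from the plugin config when set; "prod" is only a placeholder

all_airflow_dags = DagBag(read_dags_from_db=True).dags.values()

airflow_flow_urns: List[str] = []
for dag in all_airflow_dags:
    airflow_flow_urns.append(
        builder.make_data_flow_urn(
            orchestrator="airflow",
            flow_id=dag.dag_id,
            cluster=cluster,  # keeps the rebuilt URN consistent with what was ingested
        )
    )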
Same as the other comment.
logger.debug("Initiating the cleanup of obsselete data from datahub") | ||
|
||
ingested_dataflow_urns = list( | ||
self.graph.get_urns_by_filter( |
…e tasks and pipelines based on the cluster
Checklist