-
Notifications
You must be signed in to change notification settings - Fork 14.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] Strict validation in Dataset URI in Airflow 2.9 breaks some DAGs #39486
Labels
affected_version:2.9
area:core
area:datasets
Issues related to the datasets feature
good first issue
kind:bug
This is a clearly a bug
Comments
tatiana
added
area:core
kind:bug
This is a clearly a bug
needs-triage
label for new issues that we didn't triage yet
labels
May 8, 2024
3 tasks
tatiana
changed the title
[Bug] Strict validation in Dataset URI in Airflow 2.9 break some DAGs
[Bug] Strict validation in Dataset URI in Airflow 2.9 breaks some DAGs
May 8, 2024
Tentatively targeting 2.9.2, I think we can argue this is a bug fix (the issue being backward incompatibility). |
@uranusjr I'll try to get the PR ready for today - I can't assign the ticket to myself, but feel free to assign it if possible. |
eladkal
added
area:datasets
Issues related to the datasets feature
affected_version:2.9
good first issue
and removed
needs-triage
label for new issues that we didn't triage yet
labels
May 11, 2024
@tatiana Any progress on this one? |
pankajkoti
pushed a commit
that referenced
this issue
May 17, 2024
…rflow 2.9 (#39670) Closes: #39486 # Context Valid DAGs that worked in Airflow 2.8.x and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 & Airflow 2.9.1, after #37005 was merged. This was a breaking change in an Airflow minor version. We should avoid this. Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour. The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration). # How to reproduce By running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example: ``` from datetime import datetime from airflow import DAG from airflow.datasets import Dataset from airflow.operators.empty import EmptyOperator with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag: task1 = EmptyOperator( task_id='empty_task1', dag=dag, outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")] ) task2 = EmptyOperator( task_id='empty_task2', dag=dag ) task1 >> task2 ``` Causes to the exception: ``` Broken DAG: [/usr/local/airflow/dags/example_issue.py] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri parsed = normalizer(parsed) ^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri raise ValueError("URI format postgres:// must contain database, schema, and table names") ValueError: URI format postgres:// must contain database, schema, and table names ``` # About the changes introduced This PR introduces the following: 1. A boolean configuration within `[core],` named `strict_dataset_uri_validation,` which should be `False` by default. 2. When this configuration is `False,` Airflow should raise a warning saying: ``` From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX. ``` 3. If this configuration is `True,` Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1 4. From Airflow 3.0, we change this configuration to be `True` by default.
ephraimbuddy
pushed a commit
that referenced
this issue
Jun 4, 2024
…rflow 2.9 (#39670) Closes: #39486 Valid DAGs that worked in Airflow 2.8.x and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 & Airflow 2.9.1, after #37005 was merged. This was a breaking change in an Airflow minor version. We should avoid this. Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour. The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration). By running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example: ``` from datetime import datetime from airflow import DAG from airflow.datasets import Dataset from airflow.operators.empty import EmptyOperator with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag: task1 = EmptyOperator( task_id='empty_task1', dag=dag, outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")] ) task2 = EmptyOperator( task_id='empty_task2', dag=dag ) task1 >> task2 ``` Causes to the exception: ``` Broken DAG: [/usr/local/airflow/dags/example_issue.py] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri parsed = normalizer(parsed) ^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri raise ValueError("URI format postgres:// must contain database, schema, and table names") ValueError: URI format postgres:// must contain database, schema, and table names ``` This PR introduces the following: 1. A boolean configuration within `[core],` named `strict_dataset_uri_validation,` which should be `False` by default. 2. When this configuration is `False,` Airflow should raise a warning saying: ``` From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX. ``` 3. If this configuration is `True,` Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1 4. From Airflow 3.0, we change this configuration to be `True` by default. (cherry picked from commit a07d799)
ephraimbuddy
pushed a commit
that referenced
this issue
Jun 5, 2024
…rflow 2.9 (#39670) Closes: #39486 Valid DAGs that worked in Airflow 2.8.x and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 & Airflow 2.9.1, after #37005 was merged. This was a breaking change in an Airflow minor version. We should avoid this. Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour. The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration). By running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example: ``` from datetime import datetime from airflow import DAG from airflow.datasets import Dataset from airflow.operators.empty import EmptyOperator with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag: task1 = EmptyOperator( task_id='empty_task1', dag=dag, outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")] ) task2 = EmptyOperator( task_id='empty_task2', dag=dag ) task1 >> task2 ``` Causes to the exception: ``` Broken DAG: [/usr/local/airflow/dags/example_issue.py] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri parsed = normalizer(parsed) ^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri raise ValueError("URI format postgres:// must contain database, schema, and table names") ValueError: URI format postgres:// must contain database, schema, and table names ``` This PR introduces the following: 1. A boolean configuration within `[core],` named `strict_dataset_uri_validation,` which should be `False` by default. 2. When this configuration is `False,` Airflow should raise a warning saying: ``` From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX. ``` 3. If this configuration is `True,` Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1 4. From Airflow 3.0, we change this configuration to be `True` by default. (cherry picked from commit a07d799)
41 tasks
romsharon98
pushed a commit
to romsharon98/airflow
that referenced
this issue
Jul 26, 2024
…rflow 2.9 (apache#39670) Closes: apache#39486 # Context Valid DAGs that worked in Airflow 2.8.x and had tasks with outlets with specific URIs, such as `Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")`, stopped working in Airflow 2.9.0 & Airflow 2.9.1, after apache#37005 was merged. This was a breaking change in an Airflow minor version. We should avoid this. Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour. The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration). # How to reproduce By running the following DAG with `apache-airflow==2.9.1` and `apache-airflow-providers-postgres==5.11.0`, as an example: ``` from datetime import datetime from airflow import DAG from airflow.datasets import Dataset from airflow.operators.empty import EmptyOperator with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag: task1 = EmptyOperator( task_id='empty_task1', dag=dag, outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")] ) task2 = EmptyOperator( task_id='empty_task2', dag=dag ) task1 >> task2 ``` Causes to the exception: ``` Broken DAG: [/usr/local/airflow/dags/example_issue.py] Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri parsed = normalizer(parsed) ^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri raise ValueError("URI format postgres:// must contain database, schema, and table names") ValueError: URI format postgres:// must contain database, schema, and table names ``` # About the changes introduced This PR introduces the following: 1. A boolean configuration within `[core],` named `strict_dataset_uri_validation,` which should be `False` by default. 2. When this configuration is `False,` Airflow should raise a warning saying: ``` From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX. ``` 3. If this configuration is `True,` Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1 4. From Airflow 3.0, we change this configuration to be `True` by default.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
affected_version:2.9
area:core
area:datasets
Issues related to the datasets feature
good first issue
kind:bug
This is a clearly a bug
Apache Airflow version
2.9.0 & 2.9.1
What happened?
Valid DAGs that worked in Airflow 2.8.x and had tasks with outlets with specific URIs, such as
Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")
, stopped working in Airflow 2.9.0, after #37005 was merged.What you think should happen instead?
This is a breaking change in an Airflow minor version. We should avoid this.
Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour.
The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).
How to reproduce
By running the following DAG, as an example:
Causes to the exception:
Operating System
All
Versions of Apache Airflow Providers
apache-airflow-providers-postgres==5.11.0
Deployment
Astronomer
Deployment details
This issue can be seen in any Airflow 2.9.0 or 2.9.1 deployment.
Anything else?
An apparent "quick-fix" could be to replace "." by "/" in the Dataset URI of the DAG. However, this would break any DAGs that use data-aware schedules and consume from this Dataset - in a silent way. For this reason, let's fix this in Airflow 2.9 and prepare users for a breaking change in Airflow 3.0.
Suggested approach to fix the problem:
Introduce a boolean configuration within
[core],
namedstrict_dataset_uri_validation,
which should beFalse
by default.When this configuration is
False,
Airflow should raise a warning saying:If this configuration is
True,
Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1From Airflow 3.0, we change this configuration to be True by default.
Are you willing to submit PR?
Code of Conduct
The text was updated successfully, but these errors were encountered: