Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAG serialization broken because of pendulum #20055

Closed
1 of 2 tasks
cansjt opened this issue Dec 5, 2021 · 6 comments
Closed
1 of 2 tasks

DAG serialization broken because of pendulum #20055

cansjt opened this issue Dec 5, 2021 · 6 comments
Labels
area:core kind:bug This is a clearly a bug

Comments

@cansjt
Copy link

cansjt commented Dec 5, 2021

Apache Airflow version

2.2.2 (latest released)

Operating System

Debian

Versions of Apache Airflow Providers

N/A

Deployment

Other 3rd-party Helm chart

Deployment details

N/A

What happened

After upgrading from 2.1.3 to 2.2.2, Airflow is no longer capable of serializing some dags. It fails with the exception:

Traceback (most recent call last):
  File \"/usr/local/lib/python3.9/site-packages/airflow/serialization/serialized_objects.py\", line 817, in serialize_dag
    serialize_dag = cls.serialize_to_json(dag, cls._decorated_fields)
  File \"/usr/local/lib/python3.9/site-packages/airflow/serialization/serialized_objects.py\", line 269, in serialize_to_json
    serialized_object[key] = _encode_timetable(value)
  File \"/usr/local/lib/python3.9/site-packages/airflow/serialization/serialized_objects.py\", line 151, in _encode_timetable
    return {\"__type\": importable_string, \"__var\": var.serialize()}
  File \"/usr/local/lib/python3.9/site-packages/airflow/timetables/interval.py\", line 155, in serialize
    return {\"expression\": self._expression, \"timezone\": encode_timezone(self._timezone)}
  File \"/usr/local/lib/python3.9/site-packages/airflow/serialization/serialized_objects.py\", line 116, in encode_timezone
    raise ValueError(f\"DAG timezone should be a pendulum.tz.Timezone, not {var!r}\")
ValueError: DAG timezone should be a pendulum.tz.Timezone, not datetime.timezone.utc

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File \"/usr/local/lib/python3.9/site-packages/airflow/models/dagbag.py\", line 591, in _serialize_dag_capturing_errors
    dag_was_updated = SerializedDagModel.write_dag(
  File \"/usr/local/lib/python3.9/site-packages/airflow/utils/session.py\", line 67, in wrapper
    return func(*args, **kwargs)
  File \"/usr/local/lib/python3.9/site-packages/airflow/models/serialized_dag.py\", line 136, in write_dag
    new_serialized_dag = cls(dag)
  File \"<string>\", line 4, in __init__
  File \"/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/state.py\", line 433, in _initialize_instance
    manager.dispatch.init_failure(self, args, kwargs)
  File \"/usr/local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py\", line 68, in __exit__
    compat.raise_(
  File \"/usr/local/lib/python3.9/site-packages/sqlalchemy/util/compat.py\", line 182, in raise_
    raise exception
  File \"/usr/local/lib/python3.9/site-packages/sqlalchemy/orm/state.py\", line 430, in _initialize_instance
    return manager.original_init(*mixed[1:], **kwargs)
  File \"/usr/local/lib/python3.9/site-packages/airflow/models/serialized_dag.py\", line 95, in __init__
    self.data = SerializedDAG.to_dict(dag)
  File \"/usr/local/lib/python3.9/site-packages/airflow/serialization/serialized_objects.py\", line 935, in to_dict
    json_dict = {\"__version\": cls.SERIALIZER_VERSION, \"dag\": cls.serialize_dag(var)}
  File \"/usr/local/lib/python3.9/site-packages/airflow/serialization/serialized_objects.py\", line 847, in serialize_dag
    raise SerializationError(f'Failed to serialize DAG {dag.dag_id!r}: {e}')
airflow.exceptions.SerializationError: Failed to serialize DAG 'preview_meta_eventad_registrations': DAG timezone should be a pendulum.tz.Timezone, not datetime.timezone.utc

The start date was provide to the DAG initializer as datetime.datetime(2020, 11, 1, 0, 0, tzinfo=datetime.timezone.utc), which, FYI, is the result of the parsing of a YAML file.

What you expected to happen

Airflow should be capable of serializing the DAG without any error.

How to reproduce

Create a DAG with a start date as follows:

from datetime import datetime, timezone
from airflow.models import DAG

DAG('whatever', start_date=datetime(2021,12,5, 0, 0, 0, tzinfo=timezone.utc, ...)

And enable DAG serialization in Airflow's configuration (though this is the default now, no more store_serialized_dags option in the configuration reference).

Anything else

This is similar but yet slightly different from #16613.
May also be related to #19450.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@cansjt cansjt added area:core kind:bug This is a clearly a bug labels Dec 5, 2021
@potiuk
Copy link
Member

potiuk commented Dec 5, 2021

I believe this is deliberate. The error tells what to do - change the timezone to use pendulum.

If you look here: https://airflow.apache.org/docs/apache-airflow/stable/timezone.html#time-zone-aware-dags, you are supposed to use pendulum timezones (and it's been there for quite a while when we expected pendulum not datetime). While previously it coudl accidentally work with datetime timezones, it was, I believe, not intended. So please convert your DAGs to use the pendulum timezones and it should be fixed.

@uranusjr - am I right? Should we close it?

@cansjt
Copy link
Author

cansjt commented Dec 5, 2021

As far as I am concerned if this is deliberate this is not a great choice. As mentioned this date comes from the output of the Parsing of a YAML file. For now we implemented a work around: intercept the output of the parsing before passing it to Airflow.

But, to me, using pendulum is a choice internal to Airflow. At it's interface Airflow should be capable of accepting anything that comes from Python standard library. And should not impose to use anything extra to deal with something as simple as dates.

To the least the DAG class should validate properly validate its input, as you put new constraints on the input it accepts.

Also this change is not really documented (still states here start_date should be a datetime.datetime), and all the examples I could find accept datetime as input. And again it was working fine. Something that worked, but does no more, should fit anyone's definition of a bug (or so I hope).

@cansjt
Copy link
Author

cansjt commented Dec 5, 2021

For those who might wonder, there is this documentation section. Maybe the documentation of the DAG class should be updated to reference this section so people are more aware of it.

@potiuk
Copy link
Member

potiuk commented Dec 5, 2021

Something that worked, but does no more, should fit anyone's definition of a bug (or so I hope).

That's a very simplistic view.. Sometimes things work accidentally - especially if the implementation doesn't follow the documentation. So this is by far not a universal definition of bug that anyone should follow. It's not "0-1" definition by far

But I am not sure how it was in this case, so I will revert to others to comment. I am not sure what was the case here - otherwise I'd have closed this or moved to a discussion straight away.

@kaxil
Copy link
Member

kaxil commented Dec 6, 2021

Airflow has a utility for it that you can use to make sure you don't need to use anything else.

from airflow.utils import timezone

now = timezone.utcnow()
a_date = timezone.datetime(2017, 1, 1)

https://airflow.apache.org/docs/apache-airflow/2.2.2/timezone.html#naive-and-aware-datetime-objects

Airflow have relied on Pendulum for as long as I can remember and we make it clear on https://airflow.apache.org/docs/apache-airflow/2.2.2/timezone.html on why we picked pendulum instead of pytz too.

However, if you have any suggestions on how we can make it clearer, we are happy to hear your thoughts

@potiuk
Copy link
Member

potiuk commented Dec 6, 2021

Yeah. Concur with @kaxil - maybe there are some ways we can make it clearer. I will convert this into discussion now - but maybe we can contine discussing what can be done to improve there ((and @cansjt -> maybe you can even directly create a PR to add clear the confusion in the examples you pointed out ? Airflow has > 1800 contributors, so that seems like an easy way to become one as well.

@apache apache locked and limited conversation to collaborators Dec 6, 2021
@potiuk potiuk converted this issue into discussion #20070 Dec 6, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
area:core kind:bug This is a clearly a bug
Projects
None yet
Development

No branches or pull requests

3 participants