Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AttributeError: 'datetime.timezone' object has no attribute 'name' #16551

Closed
ecerulm opened this issue Jun 20, 2021 · 14 comments · Fixed by #16599
Closed

AttributeError: 'datetime.timezone' object has no attribute 'name' #16551

ecerulm opened this issue Jun 20, 2021 · 14 comments · Fixed by #16599
Assignees
Labels
affected_version:2.0 Issues Reported for 2.0 area:core kind:bug This is a clearly a bug

Comments

@ecerulm
Copy link
Contributor

ecerulm commented Jun 20, 2021

Apache Airflow version: 2.0.2

Kubernetes version (if you are using kubernetes) (use kubectl version):

Environment:

  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

What happened:

In a DAG with datetime(2021, 5, 31, tzinfo=timezone.utc) it will raise an AttributeError: 'datetime.timezone' object has no attribute 'name' in the scheduler.

It seems that airflow relies on the tzinfo object to have a .name attribute so the "canonical" datetime.timezone.utc does not comply with that requirement.

AttributeError: 'datetime.timezone' object has no attribute 'name'
Process DagFileProcessor302-Process:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 184, in _run_file_processor
    result: Tuple[int, int] = dag_file_processor.process_file(
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/session.py", line 70, in wrapper
    return func(*args, session=session, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 648, in process_file
    dagbag.sync_to_db()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/session.py", line 70, in wrapper
    return func(*args, session=session, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/dagbag.py", line 556, in sync_to_db
    for attempt in run_with_db_retries(logger=self.log):
  File "/home/airflow/.local/lib/python3.8/site-packages/tenacity/__init__.py", line 390, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/home/airflow/.local/lib/python3.8/site-packages/tenacity/__init__.py", line 356, in iter
    return fut.result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/dagbag.py", line 570, in sync_to_db
    DAG.bulk_write_to_db(self.dags.values(), session=session)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/session.py", line 67, in wrapper
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/dag.py", line 1892, in bulk_write_to_db
    orm_dag.calculate_dagrun_date_fields(
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/dag.py", line 2268, in calculate_dagrun_date_fields
    self.next_dagrun, self.next_dagrun_create_after = dag.next_dagrun_info(most_recent_dag_run)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/dag.py", line 536, in next_dagrun_info
    next_execution_date = self.next_dagrun_after_date(date_last_automated_dagrun)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/dag.py", line 571, in next_dagrun_after_date
    next_start = self.following_schedule(now)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/dag.py", line 485, in following_schedule
    tz = pendulum.timezone(self.timezone.name)
AttributeError: 'datetime.timezone' object has no attribute 'name'

What you expected to happen:

If start_date or any other input parameter requires a tzinfo with a name attribute it should check for that in the DAG object and produce a more specific error message not AttributeError.

Also I guess this requirement should be explicitly mentioned in https://airflow.apache.org/docs/apache-airflow/stable/timezone.html with a comment like

  you can't use datetime.timezone.utc because it does not have a name attribute

Or even better it would be not to rely on the presence of a name attribute in the tzinfo....

How to reproduce it:

from datetime import timedelta, datetime, timezone
args = {
    "owner": "airflow",
    "retries": 3,
}
dag = DAG(
    dag_id="xxxx",
    default_args=args,
    start_date=datetime(2021, 5, 31, tzinfo=timezone.utc),
    schedule_interval="0 8 * * *",
    max_active_runs=1,
    dagrun_timeout=timedelta(minutes=60),
    catchup=False,
    description="xxxxx",
)

Anything else we need to know:

@ecerulm ecerulm added the kind:bug This is a clearly a bug label Jun 20, 2021
@uranusjr
Copy link
Member

I think the timezone argument was really designed to only except pendulum.tz.Timezone, not datetime.timezone (or dateutil.tz.* for that matter). We can probably fix this by converting start_date, end_date etc. on DAG and BaseOperator (any others? TaskGroup?) to pendulum.DateTime.

@potiuk
Copy link
Member

potiuk commented Jun 21, 2021

I honestly think we should completely get rid of Pendulum. I think it comes from the history where Airflow was also Python 2.7 compliant and Pendulum had many advantages over the standard datetime and friends. Right now with Python 3.6+ I see no reason why we should use Pendulum especially that it seems to be MUCH slower (like all the other non-native date parsing/handling libraries https://aboutsimon.com/blog/2016/08/04/datetime-vs-Arrow-vs-Pendulum-vs-Delorean-vs-udatetime.html). This is from 2016 so things might have changed since (and penduilum does have c extensions it seems), but I saw so many problems coming from us using it and mixing pendulum with native datetime.

I remember a problem I spend hours looking for where different behaviour of Pendulum when passed datatime object in the presence or lack of fold parameter (one of the problems which only manifested in Python 3.5).

@ecerulm
Copy link
Contributor Author

ecerulm commented Jun 21, 2021

I think the timezone argument was really designed to only except pendulum.tz.Timezone, not datetime.timezone (or dateutil.tz.* for that matter).

I can't guarantee it but I think this is a regression, I mean I had (always) used datetime(2021, 5, 31, tzinfo=timezone.utc) before and I didn't get this.

But anyway I agree that you that probably you can quickly solve this particular issue and prevent lots of confused users by start_date = pendulum.instance(start_date), etc in the constructors that take datetimes.

If checked and converting to pendulum datetime provides a tzinfo with a name:

pendulum.instance(datetime(2021, 5, 31, tzinfo=timezone.utc)).tzinfo.name # '+00:00'

@uranusjr
Copy link
Member

I mean I had (always) used datetime(2021, 5, 31, tzinfo=timezone.utc) before and I didn't get this.

Not sure when do you mean by “before”, but the tzinfo.name call is added three years ago so it’s not exactly recent. But I guess that’s subjective. (FWIW the previous implementation didn’t error because it didn’t checked the timezone at all and was buggy on non-fixed timezones).

I honestly think we should completely get rid of Pendulum.

That’s probably a good idea long term, but we’ll need some planning on this to re-write timezone functionalities in scheduling, which are heavily built around Pendulum right now. IMO unifying everything to Peendulum can likely help with the transition off Pendulum as well, if we only use the standard stuff (e.g. no Timezone.name, only tzinfo.tzname(), things like that).

@eladkal
Copy link
Contributor

eladkal commented Jun 21, 2021

I honestly think we should completely get rid of Pendulum.

This will be kinda painful I think since some of the most important macros are Pendulum and it's very common to do manipulation over it so this may require significant code changes per dag when upgrading version

@ecerulm
Copy link
Contributor Author

ecerulm commented Jun 21, 2021

If you assign this to me I can do a PR that converts the input datetime.datetime into pendulm equivalents.

@uranusjr
Copy link
Member

I can’t assign but feel free to go ahead.

@potiuk
Copy link
Member

potiuk commented Jun 21, 2021

assigned :)

@eladkal eladkal added area:core affected_version:2.0 Issues Reported for 2.0 labels Jun 21, 2021
@ecerulm
Copy link
Contributor Author

ecerulm commented Jun 22, 2021

Before I go any further:

The current error comes from tz = pendulum.timezone(self.timezone.name) at dag.py

If we just convert the start_date to pendulum with pendulum.instance(start_date) it fail instead with a pendulum.tz.zoneinfo.exceptions.InvalidTimezone: Invalid timezone "+00:00" the timezone name will be +00:00 and pendulum.timezone("+00:00") fails. This really surprised me.

So there are some alternatives:

  • Use pendulum.instance() , let the self.timezone be pendulum.tz.timezone.FixedOffset()/TimeZone('+00:00') and replace the tz = pendulum.timezone(self.timezone.name) with just tz = self.timezone
  • Just do self.timezone = pendulum._safe_timezone(self.timezone) that will convert timezone.utc into TimeZone('UTC'), but that is an undocumented/private function of pendulum.

@ecerulm
Copy link
Contributor Author

ecerulm commented Jun 22, 2021

So it seems that currently the requirements for the dag.timezone are

  • it has to have a name attribute , so it must be a pendulum.tz.timezone.*
  • It has to have a name that resolve to a "named timezone" like (Europe/Stockholm, UTC, etc) and it can't be a FixedOffset like +00:00, this show in serialization/deserialization for example. You can serialize the dag but it won't be able to deserialize because it can parse back the +00:00 timezone. (It will eventually call pendulum.tz.timezone('+00:00') and that raises InvalidTimezone

So I don't know, I can sure make it work for tzinfo=timezone.utc but not all tzinfo can be converted to a "named timezone".

What I can do it's to do a fail-fast approach in DAG.__init__() by doing the self.timezone = pendulum.timezone(self.timezone.name) there instead of waiting until following_schedule, previous_schedule or a deserialization to occur.

@uranusjr
Copy link
Member

If we just convert the start_date to pendulum with pendulum.instance(start_date) it fail instead with a pendulum.tz.zoneinfo.exceptions.InvalidTimezone: Invalid timezone "+00:00"

Where exactly is this raised? pendulum.instance() seems to work with datetime.timezone for me:

>>> import datetime, pendulum
>>> pendulum.instance(datetime.datetime.now().replace(tzinfo=datetime.timezone.utc))
DateTime(2021, 6, 23, 5, 54, 45, 240168, tzinfo=Timezone('+00:00'))
>>> pendulum.instance(datetime.datetime.now().replace(tzinfo=datetime.timezone(datetime.timedelta(seconds=0))))
DateTime(2021, 6, 23, 5, 55, 45, 900498, tzinfo=Timezone('+00:00'))

I’m using pendulum 2.1.2 (as specified in the official constraints).

@ecerulm
Copy link
Contributor Author

ecerulm commented Jun 22, 2021

Where exactly is this raised? pendulum.instance() seems to work with datetime.timezone for me:

It will be raised later at pendulum.timezone(self.time zone.name)

So the following will raise InvalidTimezone:

pendulum.timezone(pendulum.instance(datetime(2021,6,1,tzinfo=timezone.utc)).tzinfo.name)

because the pendulum's instance timezone will be Timezone("+00:00") with .name=="+00:00" and that's not a valid timezone name according to pendulum.timezone().

So in essence pendulum.timezone("+00:00") raises the exception.

But anyway I just removed the usage of pendulum.timezone(self.timezone.name) in the PR.

@uranusjr
Copy link
Member

uranusjr commented Jun 22, 2021

If we coerce the datetime object in __init__, we can remove the later timezone type coercion we have right now. pendulum.instance(start_date).tzinfo is already a pendulum.Timezone and don't need to be converted again.

(IOW you did the right thing removing it in the PR)

@ecerulm
Copy link
Contributor Author

ecerulm commented Jun 23, 2021

If we coerce the datetime object in __init__, we can remove the later timezone type coercion we have right now.

Yes, and you will get rid of InvalidTimezone at DAG.following_schedule() and DAG.previous_schedule() but it will still raise InvalidTimezone on dag deserialization. So it's actually detrimental to coerce to pendulum.

Just to clarify my point, in the current (pre-PR) codebase if you provide a pendulum datetime with a non-named pendulum.timezone that will raise InvalidTimezone in following_schedule, previous_schedule and in deserialization:

from datetime import datetime,timezone
import pendulum
from airflow.serialization import SerializedDAG
from airflow.models import DAG

start_date = datetime(2021,6,1,tzinfo=timezone.utc) 
start_date = pendulum.instance(start_date) #  DateTime(2021, 6, 1, 0, 0, 0, tzinfo=Timezone('+00:00'))
dag = DAG(dag_id='mydag', start_date=start_date, schedule_interval="@hourly")
serialized_dag = SerializedDAG.to_dict(dag)
serialized_dag['dag']['timezone'] # '+00:00'
dag = SerializedDAG.from_dict(serialized_dag) # InvalidTimezone raised here in deserialization
dag.following_schedule(start_date) # raises InvalidTimezone
dag.previous_schedule(start_date) # raises InvalidTimezone

The PR #16599 will get rid of the InvalidTimezone in the following_schedule, previous_schedule but not really help for the deserialization case. You still need to provide a serializable/deserializable datetime as start_date and there are many pendulum datetimes that are not (all the datetimes with non-named timezones like fixed offsets)

kaxil pushed a commit that referenced this issue Jun 23, 2021
… ``name`` (#16599)

closes: #16551

Previous implementation tried to force / coerce the provided timezone (from the dag's `start_date`) into a `pendulum.tz.timezone.*` that only worked if the provided timezone was already a pendulum's timezone and it specifically failed when with `datetime.timezone.utc` as timezone.
Jorricks pushed a commit to Jorricks/airflow that referenced this issue Jun 24, 2021
… ``name`` (apache#16599)

closes: apache#16551

Previous implementation tried to force / coerce the provided timezone (from the dag's `start_date`) into a `pendulum.tz.timezone.*` that only worked if the provided timezone was already a pendulum's timezone and it specifically failed when with `datetime.timezone.utc` as timezone.
jhtimmins pushed a commit that referenced this issue Aug 9, 2021
… ``name`` (#16599)

closes: #16551

Previous implementation tried to force / coerce the provided timezone (from the dag's `start_date`) into a `pendulum.tz.timezone.*` that only worked if the provided timezone was already a pendulum's timezone and it specifically failed when with `datetime.timezone.utc` as timezone.

(cherry picked from commit 86c2091)
jhtimmins pushed a commit that referenced this issue Aug 13, 2021
… ``name`` (#16599)

closes: #16551

Previous implementation tried to force / coerce the provided timezone (from the dag's `start_date`) into a `pendulum.tz.timezone.*` that only worked if the provided timezone was already a pendulum's timezone and it specifically failed when with `datetime.timezone.utc` as timezone.

(cherry picked from commit 86c2091)
kaxil pushed a commit that referenced this issue Aug 17, 2021
… ``name`` (#16599)

closes: #16551

Previous implementation tried to force / coerce the provided timezone (from the dag's `start_date`) into a `pendulum.tz.timezone.*` that only worked if the provided timezone was already a pendulum's timezone and it specifically failed when with `datetime.timezone.utc` as timezone.

(cherry picked from commit 86c2091)
jhtimmins pushed a commit that referenced this issue Aug 17, 2021
… ``name`` (#16599)

closes: #16551

Previous implementation tried to force / coerce the provided timezone (from the dag's `start_date`) into a `pendulum.tz.timezone.*` that only worked if the provided timezone was already a pendulum's timezone and it specifically failed when with `datetime.timezone.utc` as timezone.

(cherry picked from commit 86c2091)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affected_version:2.0 Issues Reported for 2.0 area:core kind:bug This is a clearly a bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants