Explain the users how they can check if python code is top-level (#34006)

Many users have problem with it. Adding the way how they can
check it easily.

(cherry picked from commit 9702a14)
potiuk authored and ephraimbuddy committed Sep 1, 2023
1 parent ae03ad5 commit b83fa2e
Showing 1 changed file with 90 additions and 0 deletions.
90 changes: 90 additions & 0 deletions docs/apache-airflow/best-practices.rst
@@ -176,6 +176,96 @@ Good example:
In the Bad example, NumPy is imported each time the DAG file is parsed, which will result in suboptimal performance in the DAG file processing. In the Good example, NumPy is only imported when the task is running.

Since it is not always obvious which code is "top-level" code, see the next section for an easy way to check.

How to check if my code is "top-level" code
-------------------------------------------

To understand whether your code is "top-level" or not, you need to understand many of the
intricacies of how Python parses files. In general, when Python parses a Python file, it executes
the code it sees, except (in general) the bodies of functions and methods, which run only when called.

There are a number of special cases that are not obvious. For example, top-level code also includes
any code that is used to determine the default values of function or method arguments.
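
The default-value case can be reproduced without Airflow. The following is a minimal sketch, not part of the original change: ``expensive_default`` and ``build_task`` are hypothetical names used only for illustration. Python evaluates a default expression once, when the ``def`` statement is executed, so the helper runs at parse time even though ``build_task`` is never called.

```python
calls = []


def expensive_default():
    # Hypothetical helper standing in for slow work (a query, a heavy import, ...).
    calls.append("expensive_default")
    return 42


def build_task(value=expensive_default()):
    # This body is NOT executed at parse time, but the default expression in the
    # signature was already evaluated when Python executed the ``def`` statement.
    calls.append("build_task")
    return value


# build_task has not been called yet, but its default value was already computed.
print(calls)
```

If the default must be computed lazily, a common pattern is to use ``None`` as the default and compute the value inside the function body.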

However, there is an easy way to check whether your code is "top-level" or not. You simply need to
parse your code and see if the piece of code gets executed.

Imagine this code:

.. code-block:: python

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    import pendulum


    def get_task_id():
        return "print_array_task"  # <- is that code going to be executed?


    def get_array():
        return [1, 2, 3]  # <- is that code going to be executed?


    with DAG(
        dag_id="example_python_operator",
        schedule=None,
        start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
        catchup=False,
        tags=["example"],
    ) as dag:
        operator = PythonOperator(
            task_id=get_task_id(),
            python_callable=get_array,
            dag=dag,
        )

To check it, add some print statements to the code you want to check and run
``python <my_dag_file>.py``.


.. code-block:: python

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    import pendulum


    def get_task_id():
        print("Executing 1")
        return "print_array_task"  # <- is that code going to be executed? YES


    def get_array():
        print("Executing 2")
        return [1, 2, 3]  # <- is that code going to be executed? NO


    with DAG(
        dag_id="example_python_operator",
        schedule=None,
        start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
        catchup=False,
        tags=["example"],
    ) as dag:
        operator = PythonOperator(
            task_id=get_task_id(),
            python_callable=get_array,
            dag=dag,
        )

When you execute that code you will see:

.. code-block:: bash

    root@cf85ab34571e:/opt/airflow# python /files/test_python.py
    Executing 1

This means that ``get_array`` is not executed as top-level code, but ``get_task_id`` is.
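
The same check can be tried without an Airflow installation. The sketch below is only an illustration under stated assumptions: ``FakeOperator``, ``SOURCE`` and ``top_level_output`` are hypothetical names, and ``FakeOperator`` merely mimics the shape of an operator call. It writes a small file, runs it with the standard library's ``runpy.run_path``, and captures what was printed while the file was executed.

```python
import contextlib
import io
import runpy
import tempfile
from pathlib import Path

# Hypothetical stand-in for a DAG file; FakeOperator is NOT part of Airflow.
SOURCE = '''
def get_task_id():
    print("Executing 1")
    return "print_array_task"


def get_array():
    print("Executing 2")
    return [1, 2, 3]


class FakeOperator:
    def __init__(self, task_id, python_callable):
        self.task_id = task_id
        self.python_callable = python_callable


# get_task_id is called here; get_array is only passed as a reference.
operator = FakeOperator(task_id=get_task_id(), python_callable=get_array)
'''


def top_level_output(source: str) -> str:
    """Run ``source`` as a script and return everything it printed."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "fake_dag.py"
        path.write_text(source)
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            runpy.run_path(str(path))
        return buffer.getvalue()


printed = top_level_output(SOURCE)
print(printed)
```

Only the message from ``get_task_id`` is captured, because passing ``get_array`` as ``python_callable`` hands over a reference to the function without calling it.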

.. _best_practices/dynamic_dag_generation:

Dynamic DAG Generation