From 114da69c521f1628e61f517f077d1a68fcc45977 Mon Sep 17 00:00:00 2001 From: Ephraim Anierobi Date: Mon, 25 Apr 2022 08:00:39 +0100 Subject: [PATCH] Update doc for DAG file processing We can now run the ``DagFileProcessorProcess`` in a separate process and it's not fully documented Update docs/apache-airflow/concepts/dagfile-processing.rst Co-authored-by: Tzu-ping Chung fixup! Update docs/apache-airflow/concepts/dagfile-processing.rst --- .../concepts/dagfile-processing.rst | 46 +++++++++++++++++++ docs/apache-airflow/concepts/index.rst | 1 + docs/apache-airflow/concepts/scheduler.rst | 25 ++-------- 3 files changed, 50 insertions(+), 22 deletions(-) create mode 100644 docs/apache-airflow/concepts/dagfile-processing.rst diff --git a/docs/apache-airflow/concepts/dagfile-processing.rst b/docs/apache-airflow/concepts/dagfile-processing.rst new file mode 100644 index 0000000000000..676fd7af78ad2 --- /dev/null +++ b/docs/apache-airflow/concepts/dagfile-processing.rst @@ -0,0 +1,46 @@ + .. Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + .. http://www.apache.org/licenses/LICENSE-2.0 + + .. Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + +DAG File Processing +------------------- + +DAG File Processing refers to the process of turning Python files contained in the DAGs folder into DAG objects that contain tasks to be scheduled. + +There are two primary components involved in DAG file processing. The ``DagFileProcessorManager`` is a process executing an infinite loop that determines which files need +to be processed, and the ``DagFileProcessorProcess`` is a separate process that is started to convert an individual file into one or more DAG objects. + +The ``DagFileProcessorManager`` runs user codes. As a result, you can decide to run it as a standalone process in a different host than the scheduler process. +If you decide to run it as a standalone process, you need to set this configuration: ``AIRFLOW__SCHEDULER__STANDALONE_DAG_PROCESSOR=True`` and +run the ``airflow dag-processor`` CLI command, otherwise, starting the scheduler process (``airflow scheduler``) also starts the ``DagFileProcessorManager``. + +.. image:: /img/dag_file_processing_diagram.png + +``DagFileProcessorManager`` has the following steps: + +1. Check for new files: If the elapsed time since the DAG was last refreshed is > :ref:`config:scheduler__dag_dir_list_interval` then update the file paths list +2. Exclude recently processed files: Exclude files that have been processed more recently than :ref:`min_file_process_interval` and have not been modified +3. Queue file paths: Add files discovered to the file path queue +4. Process files: Start a new ``DagFileProcessorProcess`` for each file, up to a maximum of :ref:`config:scheduler__parsing_processes` +5. Collect results: Collect the result from any finished DAG processors +6. Log statistics: Print statistics and emit ``dag_processing.total_parse_time`` + +``DagFileProcessorProcess`` has the following steps: + +1. Process file: The entire process must complete within :ref:`dag_file_processor_timeout` +2. The DAG files are loaded as Python module: Must complete within :ref:`dagbag_import_timeout` +3. Process modules: Find DAG objects within Python module +4. Return DagBag: Provide the ``DagFileProcessorManager`` a list of the discovered DAG objects diff --git a/docs/apache-airflow/concepts/index.rst b/docs/apache-airflow/concepts/index.rst index f4f0cb3b65f58..122dc760fe701 100644 --- a/docs/apache-airflow/concepts/index.rst +++ b/docs/apache-airflow/concepts/index.rst @@ -43,6 +43,7 @@ Here you can find detailed documentation about each one of Airflow's core concep taskflow ../executor/index scheduler + dagfile-processing pools timetable priority-weight diff --git a/docs/apache-airflow/concepts/scheduler.rst b/docs/apache-airflow/concepts/scheduler.rst index 420d464f1146f..0ee6724699466 100644 --- a/docs/apache-airflow/concepts/scheduler.rst +++ b/docs/apache-airflow/concepts/scheduler.rst @@ -63,29 +63,10 @@ In the UI, it appears as if Airflow is running your tasks a day **late** DAG File Processing ------------------- -The Airflow Scheduler is responsible for turning the Python files contained in the DAGs folder into DAG objects that contain tasks to be scheduled. - -There are two primary components involved in DAG file processing. The ``DagFileProcessorManager`` is a process executing an infinite loop that determines which files need -to be processed, and the ``DagFileProcessorProcess`` is a separate process that is started to convert an individual file into one or more DAG objects. - -.. image:: /img/dag_file_processing_diagram.png - -``DagFileProcessorManager`` has the following steps: - -1. Check for new files: If the elapsed time since the DAG was last refreshed is > :ref:`config:scheduler__dag_dir_list_interval` then update the file paths list -2. Exclude recently processed files: Exclude files that have been processed more recently than :ref:`min_file_process_interval` and have not been modified -3. Queue file paths: Add files discovered to the file path queue -4. Process files: Start a new ``DagFileProcessorProcess`` for each file, up to a maximum of :ref:`config:scheduler__parsing_processes` -5. Collect results: Collect the result from any finished DAG processors -6. Log statistics: Print statistics and emit ``dag_processing.total_parse_time`` - -``DagFileProcessorProcess`` has the following steps: - -1. Process file: The entire process must complete within :ref:`dag_file_processor_timeout` -2. Load modules from file: Uses Python imp command, must complete within :ref:`dagbag_import_timeout` -3. Process modules: Find DAG objects within Python module -4. Return DagBag: Provide the ``DagFileProcessorManager`` a list of the discovered DAG objects +You can have the Airflow Scheduler be responsible for starting the process that turns the Python files contained in the DAGs folder into DAG objects +that contain tasks to be scheduled. +Refer to :doc:`dagfile-processing` for details on how this can be achieved Triggering DAG with Future Date