Add a Docker Taskflow decorator #15330
Conversation
Force-pushed from af5bf38 to 379ecbb
Love this! I wonder if it would be better to use something like …
@xinbinhuang that's a great idea! re: kubernetes, once this gets merged the next step will be "@task.kubernetes", where a user can give a pod spec and launch it using the KPO :)
The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.
Force-pushed from 8e9205d to 3d31b27
The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.
Force-pushed from 6e34372 to 573085f
tests/decorators/test_docker.py
Outdated

    assert ti.xcom_pull() == {'number': test_number + 1, '43': 43}

    def test_call_20(self):
        """Test calling decorated function 21 times in a DAG"""
func name and doc string don't agree.
Force-pushed from 0af0cd7 to e478a9a
This needs some docs about it, including a giant warning about the downsides (that it ships source code by inspecting it, length limits, no closures, etc.).
I'm not sure this needs a whole example dag (each one we add slows down tests, as all example dags get loaded by some tests). At the least we should put it in a separate location so it doesn't get loaded by anything that loads the full dag bag.
airflow/example_dags/tutorial_taskflow_api_etl_docker_virtualenv.py
Outdated
The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.
Co-authored-by: Ash Berlin-Taylor <[email protected]>
Force-pushed from e396abe to 3d7b63a
When #15330 added task.docker, it also optimized how LazyDictWithCache replaces a callable with its result. LazyDictWithCache is used by the ProvidersManager to optimize access to hooks: a hook is only actually imported when it is accessed, which speeds up importing connection information. The optimization marked a key as resolved (in the _resolved set) after running its callable, but it missed the case where the callable returned None. Previously, when None was returned, the callable was not replaced and was simply called again. After the change, the _resolved set was updated with the key and None was returned; but since the stored value had not been replaced, the next retrieval of the same key returned the original callable instead of None. So if a callable returned None and the same key was retrieved twice, the second retrieval returned the callable rather than None. This PR fixes that by writing the value into the dictionary even when it is None. (cherry picked from commit 462df0d)
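To make the failure mode concrete, here is a minimal sketch of the cache semantics described above. This is an illustration, not the actual Airflow implementation: keys map to zero-argument callables, and the first access replaces the callable with its result.

```python
class LazyDictWithCache:
    """Sketch: values may be callables that are resolved lazily on first access."""

    def __init__(self):
        self._raw_dict = {}
        self._resolved = set()

    def __setitem__(self, key, value):
        self._raw_dict[key] = value
        self._resolved.discard(key)

    def __getitem__(self, key):
        value = self._raw_dict[key]
        if key not in self._resolved and callable(value):
            value = value()
            self._resolved.add(key)
            # The fix: store the result unconditionally, even when it is None.
            # The buggy version marked the key as resolved but skipped this
            # store for None, so a second lookup returned the callable itself.
            self._raw_dict[key] = value
        return value


d = LazyDictWithCache()
d["hook"] = lambda: None
assert d["hook"] is None  # first access resolves the callable
assert d["hook"] is None  # before the fix, this returned the lambda
```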
Add the ability to run @task.docker on a Python function, turning it into a DockerOperator that runs that Python function remotely.
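For context, here is a minimal sketch of how the decorator is meant to be used, loosely modeled on the example dag in this PR; the image name and exact parameters are illustrative assumptions, not a definitive API reference.

```python
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago


@dag(schedule_interval=None, start_date=days_ago(1), catchup=False)
def docker_taskflow_example():
    @task.docker(image="python:3.9-slim", multiple_outputs=True)
    def extract():
        # Runs inside the container: the decorator ships only this function's
        # source, so imports must live inside the body (no closures or
        # module-level state are available).
        import json

        return json.loads('{"1001": 301.27, "1002": 433.21}')

    extract()


example_dag = docker_taskflow_example()
```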
One notable aspect of this architecture is that we had to build it to make as few assumptions about user setups as possible. We could not share a volume between the worker and the container, as this would break if the user runs the Airflow worker in a Docker container. We also could not assume that users would have any specialized system libraries on their images (this implementation only requires Python 3 and bash).
To work within these requirements, we use base64 encoding to store a Jinja-generated Python file and its inputs (which are generated using the same functions used by the PythonVirtualenvOperator) in environment variables. Once the container starts, it uses these environment variables to deserialize the strings, run the function, and store the result in a file located at /tmp/script.out.
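A hedged sketch of that serialization round-trip using only the standard library; the environment variable names (__PYTHON_SCRIPT, __PYTHON_INPUT) are invented for illustration and are not necessarily what the operator uses.

```python
import base64
import pickle

# Worker side: encode the generated script and pickled arguments so they
# survive being passed as plain-string environment variables.
script_source = "print('hello from the container')"
env = {
    "__PYTHON_SCRIPT": base64.b64encode(script_source.encode("utf-8")).decode("ascii"),
    "__PYTHON_INPUT": base64.b64encode(pickle.dumps({"args": [], "kwargs": {}})).decode("ascii"),
}

# Container side: reverse the encoding before executing the script.
decoded_script = base64.b64decode(env["__PYTHON_SCRIPT"]).decode("utf-8")
decoded_input = pickle.loads(base64.b64decode(env["__PYTHON_INPUT"]))
exec(decoded_script)
```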
Once the function completes, the container enters a sleep loop until the DockerOperator retrieves the result via Docker's get_archive API. The result can then be deserialized using pickle and pushed to Airflow's XCom in the same fashion as a python or python_virtualenv result.
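Roughly, the retrieval step looks like the following sketch, assuming the docker-py client; the container id is a placeholder and the operator's actual code differs.

```python
import io
import pickle
import tarfile

import docker

client = docker.from_env()
container = client.containers.get("some-container-id")  # hypothetical id

# get_archive streams a tar archive containing the requested path.
bits, _stat = container.get_archive("/tmp/script.out")
buffer = io.BytesIO(b"".join(bits))
with tarfile.open(fileobj=buffer) as tar:
    member = tar.extractfile("script.out")
    result = pickle.loads(member.read())  # the operator pushes this to XCom
```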
Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.