Add PythonVirtualenvDecorator to Taskflow API #14761

Merged: dimberman merged 1 commit into apache:master from the docker-decorator branch on Apr 8, 2021

Conversation

@dimberman (Contributor) commented Mar 13, 2021

To improve the usability of the TaskFlow API, we will add the ability to
define virtualenv or docker environments so users can run tasks with
environments that do not match that of the Airflow system.

Example:

    # [START extract_virtualenv]
    @task.virtualenv(
        use_dill=True,
        system_site_packages=False,
        requirements=['funcsigs'],
    )
    def extract():
        """
        #### Extract task
        A simple Extract task to get data ready for the rest of the data
        pipeline. In this case, getting data is simulated by reading from a
        hardcoded JSON string.
        """
        # Import inside the function: the task body runs in a separate virtualenv.
        import json

        data_string = '{"1001": 301.27, "1002": 433.21, "1003": 502.22}'

        order_data_dict = json.loads(data_string)
        return order_data_dict

    # [END extract_virtualenv]

    # [START transform_docker]
    @task.docker(
        image="python:3.8.8-slim-buster",
        force_pull=True,
        docker_url="unix://var/run/docker.sock",
        network_mode="bridge",
        api_version='auto',
        multiple_outputs=True,
    )
    def transform(order_data_dict: dict):
        """
        #### Transform task
        A simple Transform task which takes in the collection of order data and
        computes the total order value.
        """
        total_order_value = 0

        for value in order_data_dict.values():
            total_order_value += value

        return {"total_order_value": total_order_value}
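One way to picture why a virtualenv task should carry its own imports: the function is executed by a separate interpreter, so names defined at module level in the DAG file are not available there. The sketch below is an illustration only, not the operator's implementation. `TASK_SOURCE` and `run_in_separate_interpreter` are hypothetical names, and `sys.executable` stands in for the virtualenv's Python.

```python
import json
import subprocess
import sys

# Source of the task function, as it might be shipped to the other interpreter.
TASK_SOURCE = '''
def extract():
    import json  # imported inside the function: the child process has no DAG-file globals
    data_string = '{"1001": 301.27, "1002": 433.21, "1003": 502.22}'
    return json.loads(data_string)
'''


def run_in_separate_interpreter(source: str, func_name: str):
    """Run `func_name` from `source` in a fresh Python process and return its result."""
    # The harness aliases its own json import so it does not leak into the task's names.
    script = f"{source}\nimport json as _json\nprint(_json.dumps({func_name}()))\n"
    completed = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)


order_data = run_in_separate_interpreter(TASK_SOURCE, "extract")
```

The result travels back as JSON here purely to keep the sketch short; the `use_dill=True` argument in the example above suggests the real decorator serializes values differently.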
        


@github-actions

The Workflow run is cancelling this PR. Building image for the PR has been cancelled

@turbaszek turbaszek self-requested a review March 13, 2021 19:45
@dimberman dimberman marked this pull request as ready for review March 15, 2021 15:21
packages and system libraries of the Airflow worker.

To use a docker image with the Taskflow API, change the decorator to ``@task.docker``
and add any needed arguments to correctly run the task. Please note that the docker
Member

How about listing some of the arguments instead of saying "any"? As a new user, I would not be sure what the arguments can be.

cap_add=cap_add,
extra_hosts=extra_hosts,
**kwargs,
)
Member

If I understand correctly, most of those arguments are for DockerOperator. I'm wondering whether it would make sense to generate the method automatically from the DockerOperator class. This would have two advantages:

  • we would be sure that any change in the DockerOperator signature is automatically reflected in task.docker
  • the mechanism can be reused for other operators/decorators

Of course, "explicit is better than implicit", but "missing" arguments that users would like to use are a common problem in the operators world. WDYT @dimberman?
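As a rough idea of what generating the decorator from the operator class could look like, the sketch below mirrors an operator's `__init__` keyword arguments onto a decorator factory with `inspect.signature`. It is not the PR's code: `FakeDockerOperator` and `decorator_from_operator` are hypothetical names, and the stand-in class only borrows a few argument names from the example in the description.

```python
import inspect


class FakeDockerOperator:
    """Hypothetical stand-in for an operator; not the real DockerOperator."""

    def __init__(self, *, task_id, image, command=None, network_mode=None, **kwargs):
        self.task_id = task_id
        self.image = image
        self.command = command
        self.network_mode = network_mode
        self.extra = kwargs


def decorator_from_operator(operator_cls, skip=("self", "task_id", "command")):
    """Build a decorator factory whose keyword arguments mirror operator_cls.__init__."""
    init_params = inspect.signature(operator_cls.__init__).parameters
    mirrored = inspect.Signature([p for name, p in init_params.items() if name not in skip])

    def factory(**operator_kwargs):
        # Raises TypeError for missing required arguments such as `image`.
        mirrored.bind(**operator_kwargs)

        def wrap(func):
            def build():
                # The decorated function becomes the operator's payload.
                return operator_cls(task_id=func.__name__, command=func, **operator_kwargs)

            return build

        return wrap

    # Expose the mirrored signature to help() and IDEs.
    factory.__signature__ = mirrored
    return factory


docker_task = decorator_from_operator(FakeDockerOperator)


@docker_task(image="python:3.8.8-slim-buster", network_mode="bridge")
def transform():
    return {"total_order_value": 0}


operator = transform()
```

With this shape, adding an argument to the operator's `__init__` changes the decorator's accepted keywords without touching the decorator code, which is the first advantage listed above. The cost is that the accepted arguments are no longer spelled out where the decorator is defined.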

Contributor Author

@turbaszek what would it look like to generate the function automatically? Ideally we would have a system that keeps parity with the DockerOperator.

Contributor Author

@turbaszek I'm moving docker into a separate PR

if not callable(python_callable):
    raise TypeError('`python_callable` param must be callable')
if 'self' in signature(python_callable).parameters.keys():
    raise AirflowException('@task does not support methods')
Member

What about class methods?
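To make the question concrete, here is a small experiment with the same name-based test as in the snippet above. `looks_like_method` and `Pipeline` are hypothetical names used only for this illustration.

```python
from inspect import signature


def looks_like_method(func) -> bool:
    """Same test as the snippet under review: is there a parameter named 'self'?"""
    return "self" in signature(func).parameters


class Pipeline:
    def instance_method(self, x):
        return x

    @classmethod
    def class_method(cls, x):
        return x

    def oddly_named(this, x):
        return x


checks = {
    # Plain function taken from the class: the 'self' parameter is visible.
    "instance_method": looks_like_method(Pipeline.instance_method),
    # Raw function behind the classmethod: first parameter is 'cls', not 'self'.
    "class_method_raw": looks_like_method(Pipeline.__dict__["class_method"].__func__),
    # Bound classmethod: 'cls' is already bound and dropped from the signature.
    "class_method_bound": looks_like_method(Pipeline.class_method),
    # Instance method whose first parameter is not called 'self'.
    "oddly_named": looks_like_method(Pipeline.oddly_named),
}
```

Only the first entry is `True`, so a check on the parameter name `self` lets class methods and unconventionally named instance methods through.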

Contributor Author

@turbaszek what do you mean? I imagine we want to keep parity with how the task decorator currently works

raise AirflowException(
    'Returned dictionary keys must be strings when using '
    f'multiple_outputs, found {key} ({type(key)}) instead'
)
Member

@turbaszek turbaszek Mar 21, 2021

I'm wondering if we cover this somewhere in the docs, or do users have to learn it the hard way? 😄

Contributor Author

@turbaszek isn't this part of the taskflow API docs?
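For readers who have not met this rule yet, the check behind the quoted error message can be sketched as follows. `check_multiple_outputs` is a hypothetical helper, `ValueError` stands in for `AirflowException`, and the handling of non-dict values is an assumption rather than something shown in the snippet.

```python
def check_multiple_outputs(return_value):
    """Validate a task's return value the way the quoted error message implies."""
    if not isinstance(return_value, dict):
        # Assumption: non-dict values are rejected as well.
        raise TypeError(f"multiple_outputs expects a dict, found {type(return_value)}")
    for key in return_value:
        if not isinstance(key, str):
            raise ValueError(
                "Returned dictionary keys must be strings when using "
                f"multiple_outputs, found {key} ({type(key)}) instead"
            )
    return return_value


# String keys pass through unchanged.
accepted = check_multiple_outputs({"total_order_value": 1236.7})

# An integer key triggers the error from the snippet above.
try:
    check_multiple_outputs({1001: 301.27})
except ValueError as err:
    rejected_message = str(err)
```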

:type op_args: list
:param op_kwargs: A dict of keyword arguments to pass to python_callable.
:type op_kwargs: dict
:param string_args: Strings that are present in the global var virtualenv_string_args,
Contributor

The purpose and usage of virtualenv_string_args isn't immediately obvious.

Contributor Author

@jhtimmins I'm not quite sure either, but I'm just copying the PythonVirtualenvOperator API

airflow/decorators/__init__.py (outdated)
:type cap_add: list[str]
"""
return _docker_task(
python_callable=python_callable,
Contributor

After the python callable, should we enforce keyword-only arguments?
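A minimal sketch of what enforcing keyword-only arguments would mean here: a bare `*` after `python_callable` makes every later parameter keyword-only. `docker_task_sketch` and the `task_options` attribute are hypothetical names, not the PR's code.

```python
def docker_task_sketch(python_callable=None, *, image=None, multiple_outputs=None, **kwargs):
    """Hypothetical decorator: everything after `python_callable` is keyword-only."""

    def wrap(func):
        # Record the options on the function so the example stays self-contained.
        func.task_options = {"image": image, "multiple_outputs": multiple_outputs, **kwargs}
        return func

    # Support both `@docker_task_sketch` and `@docker_task_sketch(image=...)`.
    return wrap(python_callable) if python_callable is not None else wrap


@docker_task_sketch(image="python:3.8.8-slim-buster", multiple_outputs=True)
def transform(order_data_dict: dict):
    return {"total_order_value": sum(order_data_dict.values())}


# Passing `image` positionally is rejected by Python itself.
try:
    docker_task_sketch(None, "python:3.8.8-slim-buster")
except TypeError as err:
    positional_error = err
```

The benefit is that arguments can later be added or reordered without silently changing the meaning of existing positional calls.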

airflow/decorators/base.py
Comment on lines +150 to +151
if not self.multiple_outputs:
    return return_value
Contributor

Suggested change
-if not self.multiple_outputs:
-    return return_value
+if self.multiple_outputs:

Since the return value isn't used, it isn't necessary to return.

@@ -107,7 +105,7 @@ class PythonOperator(BaseOperator):

     template_fields = ('templates_dict', 'op_args', 'op_kwargs')
     template_fields_renderers = {"templates_dict": "json", "op_args": "py", "op_kwargs": "py"}
-    ui_color = PYTHON_OPERATOR_UI_COLOR
+    ui_color = '#ffefeb'
Contributor

The color is currently a magic value. I'd assign it to a constant to indicate what color it is.

BLUE = '#ffefeb'
ui_color = BLUE

airflow/operators/python.py (outdated)
        num_paren = num_paren + 1
    elif current == ")":
        num_paren = num_paren - 1
return ''.join(after_decorator)
Contributor

It may not matter, but this will fail if any of the input args contains a parenthesis inside a string value, such as ":-)"
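The failure mode can be reproduced with a small stand-in for the counting loop. This is a reconstruction from the four lines visible above, so the function name and the setup before the loop are assumptions rather than the PR's actual code.

```python
from collections import deque


def balance_parens_naive(after_decorator: str) -> str:
    """Drop everything up to the parenthesis that closes the decorator call."""
    num_paren = 1
    chars = deque(after_decorator)
    chars.popleft()  # discard the opening "(" of the decorator call
    while num_paren:
        current = chars.popleft()
        if current == "(":
            num_paren = num_paren + 1
        elif current == ")":
            num_paren = num_paren - 1
    return "".join(chars)


# Ordinary arguments: the whole decorator call is stripped.
ok = balance_parens_naive('(requirements=["funcsigs"])\ndef f(): pass')

# A ")" inside a string literal ends the scan too early.
broken = balance_parens_naive('(label=":-)")\ndef f(): pass')
```

In the second call the counter reaches zero at the `)` inside the string, so the leftover `")` stays in front of the function definition and the resulting source no longer parses.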

Contributor

Alternative approach:

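def remove_task_decorator(python_source: str, task_decorator_name: str) -> str:
  """
  Remove the task decorator (e.g. @task.virtualenv) from the function source.

  :param python_source: source code of the decorated function
  :param task_decorator_name: decorator name to strip, without the leading "@"
  """
  func_start = python_source.find("def ")
  decorators = python_source[:func_start]
  decorated = "@".join(d for d in decorators.split("@") if not d.startswith(task_decorator_name))
  return decorated + python_source[func_start:]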

Honestly this doesn't matter but I wanted to see if there was a clear alternative way to do it and here it is.

Contributor Author

@jhtimmins unfortunately this doesn't seem to work if you look at the tests (e.g. if you're nesting decorators)
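Neither approach in this thread parses the source, which is where both the string-literal and the nested-decorator cases go wrong. For comparison, here is a sketch of a different technique that was not proposed in the PR: locating the decorator with the standard `ast` module and dropping the lines it occupies. The function names are hypothetical, and `end_lineno` requires Python 3.8 or newer.

```python
import ast


def _dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def remove_decorator_with_ast(python_source: str, decorator_name: str) -> str:
    """Remove every `@decorator_name` (with or without call arguments) from the source."""
    tree = ast.parse(python_source)
    lines = python_source.splitlines(keepends=True)
    drop = set()
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for dec in node.decorator_list:
            target = dec.func if isinstance(dec, ast.Call) else dec
            if _dotted_name(target) == decorator_name:
                # Decorators occupy whole lines, so drop their full line range.
                drop.update(range(dec.lineno - 1, dec.end_lineno))
    return "".join(line for i, line in enumerate(lines) if i not in drop)


source = (
    '@outer\n'
    '@task.virtualenv(\n'
    '    requirements=["funcsigs"],\n'
    '    label=":-)",\n'
    ')\n'
    '@inner(1)\n'
    'def extract():\n'
    '    return 1\n'
)
cleaned = remove_decorator_with_ast(source, "task.virtualenv")
```

The other decorators and the function body are left untouched, at the cost of parsing the whole source once.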

Using the Taskflow API with Docker or Virtual Environments
----------------------------------------------------------

As of Airflow <Airflow version>, you will have the ability to use the Taskflow API with either a
Contributor

TODO: Update the version

Contributor Author

@jhtimmins this unfortunately can't be done until we know which version we add it to.

tests/decorators/test_python_virtualenv.py (outdated)
@github-actions

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Mar 23, 2021
@github-actions

github-actions bot commented Apr 7, 2021

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@dimberman dimberman changed the title Add PythonVirtualenvDecorator and DockerDecorator to Taskflow API Add PythonVirtualenvDecorator to Taskflow API Apr 8, 2021
@dimberman dimberman force-pushed the docker-decorator branch 2 times, most recently from 88d85fd to e07483b Compare April 8, 2021 15:15
To improve the usability of the TaskFlow API, we will add the ability to
define virtualenv environments so users can run tasks with
environments that do not match that of the Airflow system
@github-actions

github-actions bot commented Apr 8, 2021

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@dimberman dimberman merged commit 5661273 into apache:master Apr 8, 2021
@dimberman dimberman deleted the docker-decorator branch April 8, 2021 17:34