Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-103] Allow jinja templates to be used in task params #1488

Closed
wants to merge 1 commit into from

Conversation

withnale
Copy link
Contributor

Dear Airflow Maintainers,

Please accept this PR that addresses the following issues:

Reminders for contributors:

@@ -1414,6 +1411,15 @@ def get_template_context(self, session=None):
'test_mode': self.test_mode,
}

# Allow task level param definitions to be rendered via jinja
# using the context available up until this point
if task.params:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done in the render_templates and apply only on strings (use isinstance(obj, basestring) to check)

Copy link
Member

@mistercrunch mistercrunch Jun 9, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment still holds. It hasn't been addressed or contested.

@mistercrunch
Copy link
Member

I think this can result in a change of behavior that could create some troubles or slow downs. If many tasks are referencing a params object (which is a common pattern), it will get "rendered" many times. This could result in either unexpected results or inefficiencies / regressions.

It'd be wise to add a new bool param to BaseOperator named render_params, set to False by default and only apply where necessary. I personally wouldn't approve this PR without this safeguard.

If double rendering was an issue (I think with render_params it's less of a concern), the DAG object could keep track of object ids (using Python's id(obj)) that have been templated and implement some logic in render_template to avoid re-rendering objects that have been rendered before.

@withnale
Copy link
Contributor Author

If double rendering was an issue (I think with render_params it's less of a concern), the DAG object could keep track of object ids (using Python's id(obj)) that have been templated and implement some logic in render_template to avoid re-rendering objects that have been rendered before.

Yes, I can see that effectively caching values would avoid the potential slowdown of multiple jinja renders. However, would it not be possible to just store the rendered params on first call within a TI and serve this value of subsequent calls rather than delving into the tracking individual ids walking through the params dict?

Also, I'm confused why you'd say it lives in render_template rather than get_template_context. I wouldn't like to treat it as just another field in template_fields, since you would always want the template_fields to be able to make use of the rendered params. Or are you suggesting a separate block in render_templates?

@mistercrunch
Copy link
Member

  1. well if it's the same object it's already been rendered in place, so all you want to do it to not re-render it . Maintaining a list of object ids that have been rendered would allow to skip double-rendering
  2. get_template_context returns the context used for rendering, it does not render anything. render_templates is the method that renders things, so params should be rendered in there along with everything else.

@withnale withnale force-pushed the jinja_task_params branch from a8309cf to 69be5e4 Compare May 24, 2016 13:10
@landscape-bot
Copy link

Code Health
Repository health increased by 0.11% when pulling 69be5e4 on withnale:jinja_task_params into ee24855 on apache:master.

@withnale withnale force-pushed the jinja_task_params branch from 69be5e4 to 4025764 Compare May 24, 2016 16:04
@withnale
Copy link
Contributor Author

Have updated the existing PR to include a BaseOperator parameter of render_params=False and tested as appropriate in get_template_context(). Still trying to work out whether the natural home of this should be render_templates() or get_template_context().

For context sake, here is an example operator from one of my DAGs. The config and steps params are long YAML files which define our EMR clusters and most of these are very similar hence the need to parameterise. Some of the values passed in via the parameters are dynamic and created from other TaskInstances.

create_cluster = EmrSensor(
    config='resources/emr/cluster/config.yaml',
    steps='resources/emr/steps/steps.yaml',
    params={
        'db_host': "{{ task_instance.xcom_pull(task_ids='create_db') }}",
        'year': "{{ (execution_date - macros.timedelta(days=18)).year }}",
        'month': "{{ (execution_date - macros.timedelta(days=18)).month }}"
    },
    task_id='create_cluster',
    dag=dag,
)

So if implemented in `render_templates()`` are you thinking of something like this (not tested as yet)...?

    def render_templates(self):
        task = self.task
        jinja_context = self.get_template_context()
        if hasattr(self, 'task') and hasattr(self.task, 'dag'):
            if self.task.dag.user_defined_macros:
                jinja_context.update(
                    self.task.dag.user_defined_macros)

        rt = self.task.render_template  # shortcut to method
        if self.render_params:
            jinja_context['params'] = rt('params', jinja_context['params'], jinja_context)

        for attr in task.__class__.template_fields:
            content = getattr(task, attr)
            if content:
                rendered_content = rt(attr, content, jinja_context)
                setattr(task, attr, rendered_content)

I guess it comes down to whether you view that the rendered params are special cases of normal templated variables or whether they are actually now the context for rendering - this was my original view, but after writing out what I think the updated render_templates would look like, it's a little more six of one and half a dozen of the other now. I'm happy to resubmit the PR in whichever form seems best...

NB: It seems that get_template_context() is already being called twice during the execution of TaskInstance.run(). If there is a worry about performance of the templating then perhaps we should make render_templates() accept an optional context to avoid the double call, something like...

        if not mark_success:
            context = self.get_template_context()

            task_copy = copy.copy(task)
            self.task = task_copy

            def signal_handler(signum, frame):
                '''Setting kill signal handler'''
                logging.error("Killing subprocess")
                task_copy.on_kill()
                raise AirflowException("Task received SIGTERM signal")
            signal.signal(signal.SIGTERM, signal_handler)

            self.render_templates(context=context)

    ...

    def render_templates(self, context=None):
        task = self.task
        jinja_context = context or self.get_template_context()

@landscape-bot
Copy link

Code Health
Repository health increased by 0.11% when pulling 4025764 on withnale:jinja_task_params into ee24855 on apache:master.

@withnale
Copy link
Contributor Author

@mistercrunch any chance of an update?

@@ -489,7 +489,8 @@ def to_csv(
schema='default',
delimiter=',',
lineterminator='\r\n',
output_header=True):
output_header=True,
fetch_size=1000):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have the potential to break existing implementations of HiveHook, if someone is expecting > 1000 rows? I think this should default to None.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is this related to the PR?

@jlowin
Copy link
Member

jlowin commented May 30, 2016

I think this looks good and the explicit render_params option should alleviate the concerns about performance. +1 from me on the params bit but @withnale not sure about the Hive Hook bit.

@mistercrunch are you ok with it?

@withnale
Copy link
Contributor Author

Ah. The hive-hook shouldn't be there. I think I must have included it from someone elses commit when doing a rebase. Will tidy it up and do a force push later today.

@withnale withnale force-pushed the jinja_task_params branch 2 times, most recently from 0d41e15 to 5548dbb Compare June 3, 2016 19:50
@withnale withnale force-pushed the jinja_task_params branch from 5548dbb to d1387b3 Compare June 3, 2016 19:50
@codecov-io
Copy link

codecov-io commented Jun 3, 2016

Current coverage is 67.78%

Merging #1488 into master will increase coverage by <.01%

@@             master      #1488   diff @@
==========================================
  Files           116        116          
  Lines          8285       8289     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           5611       5618     +7   
+ Misses         2674       2671     -3   
  Partials          0          0          

Powered by Codecov. Last updated by 89edb6f...0d41e15

@withnale
Copy link
Contributor Author

withnale commented Jun 3, 2016

A bit later than planned but the PR has been rebased again to current.

@bolkedebruin
Copy link
Contributor

@withnale Thanks for all the hard work! Sorry this one got overlooked a bit. As @mistercrunch felt a bit strongly about this I would like hime to chime in before merging. So I hope you still have a bit of patience with us ;).

@msumit
Copy link
Contributor

msumit commented Oct 15, 2016

Hi @withnale.. sorry for long delay in reviewing this PR. If you still wants to pursue this PR, please resolve the conflict and address the concern raised by @mistercrunch. You would also need to change the commit message, so that it doesn't cross the 50 char limit.

We've 1 weeks time before this PR gets marked as Abandoned and gets closed without being merged to master.

@asfgit asfgit closed this in dcef468 Oct 22, 2016
alekstorm pushed a commit to alekstorm/incubator-airflow that referenced this pull request Jun 1, 2017
mobuchowski pushed a commit to mobuchowski/airflow that referenced this pull request Jan 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants