Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP dvc: implement vars templating in multistage dvc file #4463

Closed
wants to merge 7 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions dvc/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@
plots,
remote,
remove,
render,
repro,
root,
run,
Expand Down Expand Up @@ -81,6 +82,7 @@
plots,
experiments,
check_ignore,
render,
]


Expand Down
62 changes: 62 additions & 0 deletions dvc/command/render.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
import argparse
import logging

from dvc.command.base import CmdBase, append_doc_link
from dvc.utils.serialize._yaml import dumps_yaml, render_dvc_template

logger = logging.getLogger(__name__)


class CmdRender(CmdBase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What use case do you see for a separate render command? I see that you've already implemented automatic rendering, so not quite sure what is the use case for a separate command.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback!

Ahhh, this is mainly for a user to debug their dvc.yaml. Just to check if the rendered stage and variables are doing what you want them to.

def run(self):
with open(self.args.path, encoding="utf-8") as fd:
text = fd.read()
dvc_dict, vars_dict = render_dvc_template(text)
vars_dict = {"vars": vars_dict}
vars_str = dumps_yaml(vars_dict)

if self.args.only_vars:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too many return points in one function. Can we make it more compact?

Copy link
Author

@tall-josh tall-josh Sep 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah definitely. This was tired Josh coding :-p

logger.info(vars_str)
return 0

if self.args.stage is not None:
dvc_dict = dvc_dict["stages"][self.args.stage]
dvc_str = dumps_yaml({"stages": {self.args.stage: dvc_dict}})

if self.args.only_stages:
logger.info(dvc_str)
return 0

logger.info("\n".join([vars_str, dvc_str]))
return 0


def add_parser(subparsers, parent_parser):
RENDER_HELP = "Render templated dvc.yaml"
render_parser = subparsers.add_parser(
"render",
parents=[parent_parser],
description=append_doc_link(RENDER_HELP, "render"),
help=RENDER_HELP,
formatter_class=argparse.RawDescriptionHelpFormatter,
)
render_parser.add_argument(
"--path", default="dvc.yaml", help="Path to dvc.yaml file",
)
render_parser.add_argument(
"--stage", "-s", default=None, help="Only render a specified stage",
)
render_parser.add_argument(
"--only-vars",
action="store_true",
default=False,
help="Only render the `vars` component of the dvc.yaml",
)
render_parser.add_argument(
"--only-stages",
action="store_true",
default=False,
help="Only render the `stages` component of the dvc.yaml",
)

render_parser.set_defaults(func=CmdRender)
70 changes: 67 additions & 3 deletions dvc/utils/serialize/_yaml.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
from contextlib import contextmanager

from funcy import reraise
from jinja2 import Template
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, not quite sure that jinja template is the best format we can use in dvc files, feels like it is still a bit unusual, at least coming from python. It is mature and powerful though, no doubt about it πŸ™‚

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I went with this because it was relatively easy to use and mature. We also touched on this in issue #3633. If you/anyone has any recommendations I'd be open to taking a crack at implementing it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @efiop

Would a custom syntax be more acceptable? I have been using this fork quite a lot recently, if we were to go down the path of a custom templating syntax I think the following features are important. I welcome any feedback :-)

  • Support for looping flagged command-line arguments. ie: I currently use this pattern frequently. (I'll show it in Jinja for now)
vars:
  repeated_arg:
    - thing
    - another_thing
    - another_other_thing
stages:
  foo:
    wdir: .
    cmd: >-
      python foo.py
      {% for x in repeated_arg %}
      -a "{{ x }}"
      {% endfor %}

# Renders as
vars:
  repeated_arg:
    - thing
    - another_thing
    - another_other_thing
stages:
  foo:
    wdir: .
    cmd: >-
      python foo.py
      -a "thing"
      -a "anothet_thing"
      -a "another_othert_thing"
  • Similarly, looping over key-value pairs would be handy. For example, if a command has optional flagged args you could template the function call in the stage and simply update the vars as need.
  • On a similar vein, supporting a combination of positional and flagged args would be handy. Potentially using a list of tuples, ie:
vars:
  mixed_args:
    - pos-arg-1
    - pos-arg-2
    - ['--flag-arg-1', 42]
    - ['--flag-arg-2', 3.14159]
  • The templated syntax should stand out so it is easy to identify at a glance
  • Templated dvc.yaml should still be valid yaml (not sure about this, but intuitively I like the idea and have been striving for it)
  • Be able to pickup the stdout of a stage and refer to it in subsequent stages. I have not done this but I have seen it in Kubeflow pipelines and like the idea. ie:
stages:
  one:
    wdir: .
    cmd: "echo Bailey"
  two:
    wdir: .
    cmd: "echo {{ stages.one.stdout }} is the greatest doggo in the universe!!!"

# After stage one completes stage two renders as

  two:
    wdir: .
    cmd: "echo Bailey is the greatest doggo in the universe!!!"

Though maybe this is getting a bit distracted from the initial goal


Also pulling in interested parties from #3633

@dsuess @dmpetrov @karajan1001 @skshetry @elgehelge @jcpsantiago

Copy link
Contributor

@karajan1001 karajan1001 Sep 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, I only had used jinja2 before, used in an inner annotation platform. It is a powerful tool, maybe a bit over powerful and complex here?

from ruamel.yaml import YAML
from ruamel.yaml.error import YAMLError

Expand All @@ -14,14 +15,77 @@ def __init__(self, path):
super().__init__(path, "YAML file structure is corrupted")


def load_yaml(path, tree=None):
return _load_data(path, parser=parse_yaml, tree=tree)
def recursive_render(tpl, values, max_passes=100):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for nested interpolation, right?

{{ var {{ var }} }}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. Not the most efficient thing, just trying to get something working. My team and I have been using this feature in our internal projects. It's been quite nice!

"""This is a bit of black magic to recursivly render a
template. Adaped from:

https://stackoverflow.com/questions/8862731/jinja-nested-rendering-on-variable-content

Args:
tpl: Template string
values: dict of values. Importantly this dict can contain
values that are themselves {{ placeholders }}
max_passes: Limits the number of times we loop over the
template.

Returns:
rendered template.
"""
prev = tpl
for _ in range(max_passes):
curr = Template(prev).render(**values)
if curr != prev:
prev = curr
else:
return curr
raise RecursionError("Max resursion depth reached")


def render_vars(dvc_dict):
Copy link
Member

@skshetry skshetry Sep 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tall-josh, this interpolates values in vars using data from vars itself, right?

vars:
  foo: 3
  bar: "{{ foo }}"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How important is this scenario?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. Yeah it does. I have used this a lot. ie:

vars:
  my_labeling_project: 123
  data_out: /storage/data/{{ my_labeling_project }}_data
  src: s3://labeled-data/projects/{{ my_labeling_project }}
stages:
  download:
    cmd: python download_data.py --src {{src}} --out  {{ data_out }}
    outs:
      - {{ data_out }}

Without this I would have to repeat the my_labeling_project

vars_dict = dvc_dict["vars"]
vars_template = dumps_yaml(vars_dict)

rendered_vars = recursive_render(vars_template, vars_dict)
return rendered_vars


def render_dvc(dvc_dict, vars_dict):
dvc_template = dumps_yaml(dvc_dict)
rendered_dvc = Template(dvc_template).render(**vars_dict)
return rendered_dvc


def render_dvc_template(text):

yaml = YAML(typ="safe")
dvc_dict = yaml.load(text) or {}
if "vars" in dvc_dict:
vars_dict = yaml.load(render_vars(dvc_dict))

del dvc_dict["vars"]
rendered_dvc = yaml.load(render_dvc(dvc_dict, vars_dict)) or {}

return rendered_dvc, vars_dict


def parse_yaml(text, path, typ="safe"):
yaml = YAML(typ=typ)
with reraise(YAMLError, YAMLFileCorruptedError(path)):
return yaml.load(text) or {}
result = yaml.load(text) or {}

if "vars" in result:
try:
result, _ = render_dvc_template(
text
) # yaml.load(text, Loader=SafeLoader) or {}
except Exception as exc:
raise YAMLFileCorruptedError(path) from exc

return result


def load_yaml(path, tree=None):
return _load_data(path, parser=parse_yaml, tree=tree)


def parse_yaml_for_update(text, path):
Expand Down