Using parameters on pipeline commands #4525

Closed · fredtcaroli opened this issue Sep 3, 2020 · 6 comments
Labels: enhancement (Enhances DVC), feature request (Requesting a new feature), p3-nice-to-have (It should be done this or next sprint)

@fredtcaroli
I currently have a couple of stages that use the same script. For example, I have a single script for fetching data from my data lake, but I call it twice: once for training data and another for testing data. There are a couple of ways I can accomplish that.

Naive way

# dvc.yaml
stages:
    fetch_training_data:
        cmd: python fetch_data.py 01-04-2020 01-05-2020
    fetch_test_data:
        cmd: python fetch_data.py 01-05-2020 07-05-2020

So my fetch_data.py receives a date range. Training data is the entire month of April, and testing data is the first week of May.

This works OK, but I'd like to run retraining every month, so these dates need to be parameterized somehow.

A better way

Let's set up a params.yaml file this time. We'll need to tell our script how to read those parameters, though:

# fetch_data.py
import sys
import yaml
import datetime

start_date_param = sys.argv[1]
end_date_param = sys.argv[2]
with open('params.yaml', 'r') as fin:
    config = yaml.safe_load(fin)
# dates in params.yaml are written as DD-MM-YYYY
start_date = datetime.datetime.strptime(config[start_date_param], '%d-%m-%Y').date()
end_date = datetime.datetime.strptime(config[end_date_param], '%d-%m-%Y').date()

fetch_data(start_date, end_date)
# params.yaml
start_date: '01-04-2020'
end_date: '01-05-2020'
eval_start_date: '01-05-2020'
eval_end_date: '07-05-2020'
# dvc.yaml
stages:
    fetch_training_data:
        cmd: python fetch_data.py start_date end_date
        params:
            - start_date
            - end_date
    fetch_test_data:
        cmd: python fetch_data.py eval_start_date eval_end_date
        params:
            - eval_start_date
            - eval_end_date

This is a bit better. There are definitely a couple of hoops to jump through, but it gets the job done.
You could also instead bake in a stage parameter that tells the script which parameters to read; that might be better for some cases (sketched below).
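For illustration, here's a minimal sketch of that alternative, assuming the stage name is passed as the script's only argument and params.yaml is reorganized into one block per stage; the layout and names below are my own, not anything DVC prescribes.

# params.yaml (hypothetical layout, one block per stage)
training:
    start_date: '01-04-2020'
    end_date: '01-05-2020'
test:
    start_date: '01-05-2020'
    end_date: '07-05-2020'
# fetch_data.py (reads the block named by its only argument)
import sys
import yaml
import datetime

stage = sys.argv[1]  # 'training' or 'test'
with open('params.yaml') as fin:
    config = yaml.safe_load(fin)[stage]

start_date = datetime.datetime.strptime(config['start_date'], '%d-%m-%Y').date()
end_date = datetime.datetime.strptime(config['end_date'], '%d-%m-%Y').date()

fetch_data(start_date, end_date)  # fetch_data() is the project's existing helper
# dvc.yaml (dotted paths select the nested params for each stage)
stages:
    fetch_training_data:
        cmd: python fetch_data.py training
        params:
            - training.start_date
            - training.end_date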

The bestest way

Now here's my feature request. Imagine a world where I could simply have the command read its arguments directly from the parameters file. Here's my suggestion:

# fetch_data.py
import sys
import datetime
start_date = datetime.datetime.strptime(sys.argv[1], '%d-%m-%Y').date()
end_date = datetime.datetime.strptime(sys.argv[2], '%d-%m-%Y').date()

fetch_data(start_date, end_date)
# params.yaml
start_date: '01-04-2020'
end_date: '01-05-2020'
eval_start_date: '01-05-2020'
eval_end_date: '07-05-2020'
# dvc.yaml
stages:
    fetch_training_data:
        cmd: python fetch_data.py {{start_date}} {{end_date}}
    fetch_test_data:
        cmd: python fetch_data.py {{eval_start_date}} {{eval_end_date}}

See what I did there? I wouldn't even need to write a params section, since it can be inferred from the cmd. That could also play nicely with other parameter files, like so:

cmd: python fetch_data.py {{data_params.yaml:start_date}} {{data_params.yaml:end_date}}
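To make the multi-file idea concrete, here is a sketch of how that could look end to end; data_params.yaml and the {{file:key}} syntax are part of the proposal, not existing DVC behavior.

# data_params.yaml (proposed: a dedicated params file for data stages)
start_date: '01-04-2020'
end_date: '01-05-2020'
# dvc.yaml (proposed syntax)
stages:
    fetch_training_data:
        cmd: python fetch_data.py {{data_params.yaml:start_date}} {{data_params.yaml:end_date}}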

So what do you think? That would greatly simplify my workflow, and I think it would be a great addition to this already awesome tool.

@pared
Contributor

pared commented Sep 4, 2020

The idea sounds like a nice addition to the current functionality.
A few things to consider:

  1. It would probably be best to start with the "Using params in run" scenario: we can already handle params, so we would just need to fill in the YAML variables before executing the command (see the sketch after this list).
  2. If we were to go with the bestest scenario, we could look at how Terraform handles unspecified values: it simply asks the user to provide them. This use case requires us to save params provided "by hand" (for reproducibility).
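A minimal sketch of what "filling in the YAML variables" could mean, assuming a simple {{name}} placeholder syntax; this is illustrative only, not DVC internals.

# sketch: expand {{name}} placeholders in a stage cmd using values from params.yaml
import re
import subprocess

import yaml

def expand_cmd(cmd, params_file='params.yaml'):
    with open(params_file) as fin:
        params = yaml.safe_load(fin)
    # replace every {{name}} with the corresponding value from the params file
    return re.sub(r'\{\{(\w+)\}\}', lambda m: str(params[m.group(1)]), cmd)

expanded = expand_cmd('python fetch_data.py {{start_date}} {{end_date}}')
subprocess.run(expanded, shell=True, check=True)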

@fredtcaroli
As a workaround for now, I would consider creating intermediate scripts that use the existing params functionality to call your script for a particular stage.

  1. Have params in params.yaml:
# params.yaml
start_date: '01-04-2020'
end_date: '01-05-2020'
eval_start_date: '01-05-2020'
eval_end_date: '07-05-2020'
  2. Create fetch_training.py:
import yaml

from fetch_data import fetch_data

with open("params.yaml") as fin:
    params = yaml.safe_load(fin)

fetch_data(params["start_date"], params["end_date"])

(similarly for eval) and wire it up with DVC:

dvc run -n fetch_train -p start_date,end_date -d fetch_training.py python fetch_training.py

@pared pared added feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint enhancement Enhances DVC labels Sep 4, 2020
@fredtcaroli
Author

> If we were to go with the bestest scenario, we could look at how Terraform handles unspecified values: it simply asks the user to provide them. This use case requires us to save params provided "by hand" (for reproducibility).

That could be the case if you're running a dvc run command, but I don't see why we'd ask for unspecified values on dvc repro, since everything should be specified by then and we're just reproducing results.

> As a workaround for now, I would consider creating intermediate scripts that use the existing params functionality to call your script for a particular stage.

I'm aware there are a few ways to work around this; all I'm saying is that none of them are simple, and they all require some amount of boilerplate code.

Let me tell you a bit about my context... I'm one of the people responsible for my (large) company's data science standards, and we have recently adopted DVC as our data and model versioning tool.
One of the big pluses of using DVC for us is that we are free to use whatever toolset we want, as long as we can write a cmd that runs our stages, right? Not quite. Although DVC tries to be completely decoupled from the source code that it's running, the params.yaml mechanism fails to parameterize stages without affecting how the stage is implemented.
That might be clearer with source code we don't own. Say we have a stage that runs the ffmpeg command. If we want to parameterize it, we need to write an extra layer of code just to run the command the way we want. At that point DVC, which was supposed to ease my data science workflow, is actually making me write more code for things that should be really simple to do.
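To make the ffmpeg point concrete, this is roughly the wrapper you're forced to write today just so tracked params can reach a tool you don't own; run_ffmpeg.py and the parameter names are hypothetical, and the ffmpeg flags are only an illustrative invocation.

# run_ffmpeg.py: exists only so DVC params can reach ffmpeg
import subprocess

import yaml

with open('params.yaml') as fin:
    params = yaml.safe_load(fin)

# forward the tracked parameters to the external command
subprocess.run(
    ['ffmpeg', '-i', params['input_video'],
     '-r', str(params['frame_rate']),
     params['output_video']],
    check=True,
)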

@pared
Contributor

pared commented Sep 4, 2020

> That could be the case if you're running a dvc run command, but I don't see why we'd ask for unspecified values on dvc repro, since everything should be specified by then and we're just reproducing results.

@fredtcaroli you are right: upon repro, we should not require the user to provide params that they already provided during run. That is why we would need to save those params somewhere when we first run the command.

> Although DVC tries to be completely decoupled from the source code that it's running, the params.yaml mechanism fails to parameterize stages without affecting how the stage is implemented.

You are right about this one, too. We do lack a straightforward way of passing parameters to the stage command. Allowing parameters to be passed the way (for example) GitHub Actions allows would be a nice enhancement.

@shcheklein
Member

I wonder if it can be done on top of PR #4463?

@fredtcaroli @pared WDYT?

@dmpetrov
Member

dmpetrov commented Sep 5, 2020

@fredtcaroli that's a great scenario! We had this discussion in #3633, and it seems like this issue is a duplicate of it. Your proposal (the bestest way) is close to the one we came up with during that discussion; I even like your description more (we just need to decide if Jinja is the best way to expand variables).

There are plans to implement this in the near future, together with more advanced dvc.yaml customization features like micro-batches/loops (#331).

@skshetry
Member

skshetry commented Dec 9, 2020

Closing, as the parametrization feature can also track the variables that are used. See #3633.
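For context, a sketch of how the original example reads with that parametrization feature, assuming the ${} templating syntax discussed in #3633 (check the current DVC docs for the exact behavior):

# params.yaml
start_date: '01-04-2020'
end_date: '01-05-2020'
eval_start_date: '01-05-2020'
eval_end_date: '07-05-2020'
# dvc.yaml
stages:
    fetch_training_data:
        cmd: python fetch_data.py ${start_date} ${end_date}
    fetch_test_data:
        cmd: python fetch_data.py ${eval_start_date} ${eval_end_date}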

@skshetry skshetry closed this as completed Dec 9, 2020