Using parameters on pipeline commands #4525

Closed · fredtcaroli opened this issue Sep 3, 2020 · 6 comments
Labels: enhancement (Enhances DVC), feature request (Requesting a new feature), p3-nice-to-have (It should be done this or next sprint)

@fredtcaroli
I currently have a couple of stages that use the same script. For example, I have a single script for fetching data from my data lake, but I call it twice: once for training data and another for testing data. There are a couple of ways I can accomplish that.

Naive way

# dvc.yaml
stages:
    fetch_training_data:
        cmd: python fetch_data.py 01-04-2020 01-05-2020
    fetch_test_data:
        cmd: python fetch_data.py 01-05-2020 07-05-2020

So my fetch_data.py receives a date range. Training data is the entire month of April, and testing data is the first week of May.

This works OK, but I'd like to run retraining every month, so these dates need to be parameterized somehow.

A better way

Let's set up a params.yaml file this time. We'll need to tell our script how to read those parameters, though:

# fetch_data.py
import sys
import yaml
import datetime

start_date_param = sys.argv[1]
end_date_param = sys.argv[2]
with open('params.yaml', 'r') as fin:
    config = yaml.safe_load(fin)
# dates in params.yaml are written as DD-MM-YYYY
start_date = datetime.datetime.strptime(config[start_date_param], '%d-%m-%Y').date()
end_date = datetime.datetime.strptime(config[end_date_param], '%d-%m-%Y').date()

fetch_data(start_date, end_date)
# params.yaml
start_date: '01-04-2020'
end_date: '01-05-2020'
eval_start_date: '01-05-2020'
eval_end_date: '07-05-2020'
# dvc.yaml
stages:
    fetch_training_data:
        cmd: python fetch_data.py start_date end_date
        params:
            - start_date
            - end_date
    fetch_test_data:
        cmd: python fetch_data.py eval_start_date eval_end_date
        params:
            - eval_start_date
            - eval_end_date

This is a bit better. There are definitely a couple of hoops to jump through, but it gets the job done.
You could also instead bake in a stage parameter that tells the script which parameters to read; that might be better for some cases (sketched below).
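For illustration, here's a minimal sketch of that alternative, assuming the stage name is passed as the script's only argument and params.yaml is reorganized into one block per stage; the layout and names below are my own, not anything DVC prescribes.

# params.yaml (hypothetical layout, one block per stage)
training:
    start_date: '01-04-2020'
    end_date: '01-05-2020'
test:
    start_date: '01-05-2020'
    end_date: '07-05-2020'
# fetch_data.py (reads the block named by its only argument)
import sys
import yaml
import datetime

stage = sys.argv[1]  # 'training' or 'test'
with open('params.yaml') as fin:
    config = yaml.safe_load(fin)[stage]

start_date = datetime.datetime.strptime(config['start_date'], '%d-%m-%Y').date()
end_date = datetime.datetime.strptime(config['end_date'], '%d-%m-%Y').date()

fetch_data(start_date, end_date)  # fetch_data() is the project's existing helper
# dvc.yaml (dotted paths select the nested params for each stage)
stages:
    fetch_training_data:
        cmd: python fetch_data.py training
        params:
            - training.start_date
            - training.end_date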

The bestest way

Now here's my feature request. Imagine a world where I could simply have the command read its arguments directly from the parameters file. Here's my suggestion:

# fetch_data.py
import sys
import datetime
start_date = datetime.datetime.strptime(sys.argv[1], '%d-%m-%Y').date()
end_date = datetime.datetime.strptime(sys.argv[2], '%d-%m-%Y').date()

fetch_data(start_date, end_date)
# params.yaml
start_date: '01-04-2020'
end_date: '01-05-2020'
eval_start_date: '01-05-2020'
eval_end_date: '07-05-2020'
# dvc.yaml
stages:
    fetch_training_data:
        cmd: python fetch_data.py {{start_date}} {{end_date}}
    fetch_test_data:
        cmd: python fetch_data.py {{eval_start_date}} {{eval_end_date}}

See what I did there? I wouldn't even need to write a params section, since it can be inferred from the cmd. That could also play nicely with other parameter files, like so:

cmd: python fetch_data.py {{data_params.yaml:start_date}} {{data_params.yaml:end_date}}
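To make the multi-file idea concrete, here is a sketch of how that could look end to end; data_params.yaml and the {{file:key}} syntax are part of the proposal, not existing DVC behavior.

# data_params.yaml (proposed: a dedicated params file for data stages)
start_date: '01-04-2020'
end_date: '01-05-2020'
# dvc.yaml (proposed syntax)
stages:
    fetch_training_data:
        cmd: python fetch_data.py {{data_params.yaml:start_date}} {{data_params.yaml:end_date}}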

So what do you think? That would greatly simplify my workflow, and I think it would be a great addition to this already awesome tool.

@pared
Contributor

pared commented Sep 4, 2020

The idea sounds like a nice addition to the current functionality.
A few things to consider:

  1. It would probably be best to start with the "Using params in run" scenario: we can already handle params, so we would just need to fill in the YAML variables before executing the command (see the sketch after this list).
  2. If we were to go with the bestest scenario, we could look at how Terraform handles unspecified values: it simply asks the user to provide them. This use case requires us to save params provided "by hand" (for reproducibility).
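A minimal sketch of what "filling in the YAML variables" could mean, assuming a simple {{name}} placeholder syntax; this is illustrative only, not DVC internals.

# sketch: expand {{name}} placeholders in a stage cmd using values from params.yaml
import re
import subprocess

import yaml

def expand_cmd(cmd, params_file='params.yaml'):
    with open(params_file) as fin:
        params = yaml.safe_load(fin)
    # replace every {{name}} with the corresponding value from the params file
    return re.sub(r'\{\{(\w+)\}\}', lambda m: str(params[m.group(1)]), cmd)

expanded = expand_cmd('python fetch_data.py {{start_date}} {{end_date}}')
subprocess.run(expanded, shell=True, check=True)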

@fredtcaroli
As a workaround for now, I would consider creating intermediate scripts that use the existing params functionality to call your script for a particular stage.

  1. Have params in params.yaml:
# params.yaml
start_date: '01-04-2020'
end_date: '01-05-2020'
eval_start_date: '01-05-2020'
eval_end_date: '07-05-2020'
  2. Create fetch_training.py:
import yaml

from fetch_data import fetch_data

with open("params.yaml") as fin:
    params = yaml.safe_load(fin)

fetch_data(params["start_date"], params["end_date"])

(similarly for eval) and wire it up with DVC:

dvc run -n fetch_train -p start_date,end_date -d fetch_training.py python fetch_training.py

@pared pared added feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint enhancement Enhances DVC labels Sep 4, 2020
@fredtcaroli
Author

> If we were to go with the bestest scenario, we could look at how Terraform handles unspecified values: it simply asks the user to provide them. This use case requires us to save params provided "by hand" (for reproducibility).

That could be the case if you're running a dvc run command, but I don't see why we'd ask for unspecified values on dvc repro, since everything should be specified by then and we're just reproducing results.

> As a workaround for now, I would consider creating intermediate scripts that use the existing params functionality to call your script for a particular stage.

I'm aware there are a few ways to work around this; all I'm saying is that none of them are simple, and they all require some amount of boilerplate code.

Let me tell you a bit about my context... I'm one of the people responsible for my (large) company's data science standards, and we have recently adopted DVC as our data and model versioning tool.
One of the big pluses of using DVC for us is that we are free to use whatever toolset we want, as long as we can write a cmd that runs our stages, right? Not quite. Although DVC tries to be completely decoupled from the source code that it's running, the params.yaml mechanism fails to parameterize stages without affecting how the stage is implemented.
That might be clearer with source code we don't own. Say we have a stage that runs the ffmpeg command. If we want to parameterize it, we need to write an extra layer of code just to run the command the way we want. At that point DVC, which was supposed to ease my data science workflow, is actually making me write more code for things that should be really simple to do.
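To make the ffmpeg point concrete, this is roughly the wrapper you're forced to write today just so tracked params can reach a tool you don't own; run_ffmpeg.py and the parameter names are hypothetical, and the ffmpeg flags are only an illustrative invocation.

# run_ffmpeg.py: exists only so DVC params can reach ffmpeg
import subprocess

import yaml

with open('params.yaml') as fin:
    params = yaml.safe_load(fin)

# forward the tracked parameters to the external command
subprocess.run(
    ['ffmpeg', '-i', params['input_video'],
     '-r', str(params['frame_rate']),
     params['output_video']],
    check=True,
)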

@pared
Contributor

pared commented Sep 4, 2020

> That could be the case if you're running a dvc run command, but I don't see why we'd ask for unspecified values on dvc repro, since everything should be specified by then and we're just reproducing results.

@fredtcaroli you are right: upon repro, we should not require the user to provide params that they already provided during run. That is why we would need to save those params somewhere when we first run the command.

> Although DVC tries to be completely decoupled from the source code that it's running, the params.yaml mechanism fails to parameterize stages without affecting how the stage is implemented.

You are right about this one, too. We do lack a straightforward way of passing parameters to the stage command. Allowing parameters to be passed the way (for example) GitHub Actions allows would be a nice enhancement.

@shcheklein
Member

I wonder if it can be done on top of PR #4463?

@fredtcaroli @pared WDYT?

@dmpetrov
Member

dmpetrov commented Sep 5, 2020

@fredtcaroli that's a great scenario! We had this discussion in #3633, and it seems like this issue is a duplicate of it. Your proposal (the bestest way) is close to the one we came up with during that discussion; I even like your description more (we just need to decide if Jinja is the best way to expand variables).

There are plans to implement this in the near future, together with more advanced dvc.yaml customization features like micro-batches/loops (#331).

@skshetry
Member

skshetry commented Dec 9, 2020

Closing, as the parametrization feature can also track the variables that are used. See #3633.
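For context, a sketch of how the original example reads with that parametrization feature, assuming the ${} templating syntax discussed in #3633 (check the current DVC docs for the exact behavior):

# params.yaml
start_date: '01-04-2020'
end_date: '01-05-2020'
eval_start_date: '01-05-2020'
eval_end_date: '07-05-2020'
# dvc.yaml
stages:
    fetch_training_data:
        cmd: python fetch_data.py ${start_date} ${end_date}
    fetch_test_data:
        cmd: python fetch_data.py ${eval_start_date} ${eval_end_date}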

@skshetry skshetry closed this as completed Dec 9, 2020