Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc run: pass dependencies and outputs as params #995

Closed
dmpetrov opened this issue Aug 10, 2018 · 11 comments
Closed

dvc run: pass dependencies and outputs as params #995

dmpetrov opened this issue Aug 10, 2018 · 11 comments
Labels
feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint research

Comments

@dmpetrov
Copy link
Member

From xiang0x48: https://discuss.dvc.org/t/is-there-any-elegant-way-of-passing-argument-d-and-o-to-command-run-by-dvc-run/66

"When using dvc run, argument of dvc run and true command to be run are always similar, thus
dvc run -d source.npy -o target.npy python some_process.py source.npy target.npy . As there are repeated file names, it might be a “smell of code”."

Solution: introduce --pass-params to add all the dependencies and output params in the same order.

Example:
Commnd dvc run -d source.npy -d proc.py -o target.npy --pass-params python proc.py myparam
DVC adds all the params in addition to the original one (myparam): myparam source.npy proc.py target.npy

@dmpetrov dmpetrov added the feature request Requesting a new feature label Aug 10, 2018
@efiop
Copy link
Contributor

efiop commented Aug 10, 2018

This is really hacky and not at all flexible. You only build your pipeline once, so it is not that inconvenient to introduce such a hack. Plus it defeats the purpose of our explicit dvc run syntax making your command really cryptic. I would rather not add this.

@dmpetrov
Copy link
Member Author

Something is needed to avoid the duplication which is a good source of mistakes.

This is the way to avoid the duplication with a minimum overhead - a single option instead of "flexible" set of options. Another alternative is to do this by param: @1 @3 or @dvcparam1 @dvcparam3. The first approach just the simplest way of doing this.

Any of these can work.

Yes, you build a pipeline once. But it still requires a way to avoid mistakes.
I agree that this is a quite hacky option and we should not use it in tutorials.

@efiop
Copy link
Contributor

efiop commented Feb 25, 2019

Related #1462

@efiop
Copy link
Contributor

efiop commented Apr 16, 2019

Related discussion on discord https://discordapp.com/channels/485586884165107732/563406153334128681/567801713948360725 proposing --parse option that would parse command by itself.

@ghost
Copy link

ghost commented Apr 22, 2019

[x-post from discord]

I don't know if DVC is the tool for this use case, but let's imagine that you want to specify a pipeline that depends on a directory (e.g. raw) and transform the content and put them on another directory (e.g. processed), so you have something like the following:

echo "hello world" > data/raw/greetings.en
echo "hola mundo" > data/raw/greetings.es
dvc add data/raw

dvc run \
  --file data/processed.dvc \
  --deps data/raw \
  --outs data/processed \
  'mkdir -p data/processed &&
  for raw in data/raw/*
  do
      processed="data/processed/$(basename $raw)"
      tr "[:lower:]" "[:upper:]" < $raw > $processed
  done'

currently, there's no way to add another file to the raw directory and compute only that last file instead of the whole thing:

echo "bonjour monde" > data/raw/greetings.fr

# this will reproduce everything again
dvc repro data/processed.dvc

have you thought about supporting this case? is this common on ML pipelines? (relying on this type of generic transformations applied to a bunch of files on a directory).

Also, we would need to have a way to iterate over a command, it could be with environment variables, like, let the user name the dependency or output, and then make them available to the shell using bash arrays listing every entry that needs to be computed

dvc run --deps raw=data/raw 'for file in $raw; do ...; done'

or the other one could be with an special syntax:

dvc run \
    -d data/raw \
    -o data/processed \
    transform.sh {input} > {output}

and DVC iterates internally, running that command, and doing a lot of implicit things, like {output} would be the same file name as {input} but in different directory

The former one looks easier to implement (naming variables), then we would need to identify if the path is a file or a directory, if it's a dir, then we would need to filter every file that hasn't been computed yet (I don't know how to do this... maybe using the state? ), and when it's time to run the command, inject an environment variable with the name that the user specified before. Although, if I remember correctly, bash arrays have a low limit of entries (and I'm not sure we can use the same strategy for windows).
Maybe, we write a temporary file with the entries that needs to be processed. I like more the file approach instead of the bash array :P

@efiop efiop added p3-nice-to-have It should be done this or next sprint research labels Jul 23, 2019
@jorgeorpinel
Copy link
Contributor

We should probably not call this new feature "params" anymore though, since we now have the dvc params command. Should we update the issue's title and description? How about --pass-deps meaning "pass dependencies as command arguments"?

@efiop
Copy link
Contributor

efiop commented Jun 19, 2020

@jorgeorpinel Let's leave it as is so we don't lose context. We will be getting to parametrization in the near future.

@jorgeorpinel
Copy link
Contributor

OK. What do you mean by parametrization?

@efiop
Copy link
Contributor

efiop commented Jun 23, 2020

@jorgeorpinel passing our deps/outs as parameters to the command we are running.

@jorgeorpinel
Copy link
Contributor

Thanks. So that's my point. Parametrization already means using -p --params in dvc run. We're overloading the term.

@dmpetrov
Copy link
Member Author

It is not relevant anymore with new variables in dvc.yaml (#4734) as well as params.yaml.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint research
Projects
None yet
Development

No branches or pull requests

3 participants