dvc run: pass dependencies and outputs as params #995
Comments
This is really hacky and not at all flexible. You only build your pipeline once, so it is not that inconvenient to introduce such a hack. Plus it defeats the purpose of our explicit |
Something is needed to avoid the duplication, which is a common source of mistakes. This is the way to avoid the duplication with minimal overhead: a single option instead of a "flexible" set of options. Another alternative is to do this by param: Any of these can work. Yes, you build a pipeline once, but you still need a way to avoid mistakes. |
Related #1462 |
Related discussion on discord https://discordapp.com/channels/485586884165107732/563406153334128681/567801713948360725 proposing |
[x-post from discord] I don't know if DVC is the tool for this use case, but let's imagine that you want to specify a pipeline that depends on a directory (e.g.

echo "hello world" > data/raw/greetings.en
echo "hola mundo" > data/raw/greetings.es
dvc add data/raw
dvc run \
--file data/processed.dvc \
--deps data/raw \
--outs data/processed \
'mkdir -p data/processed &&
for raw in data/raw/*
do
processed="data/processed/$(basename "$raw")"
tr "[:lower:]" "[:upper:]" < "$raw" > "$processed"
done'

Currently, there's no way to add another file to the raw directory and compute only that new file instead of the whole thing:

echo "bonjour monde" > data/raw/greetings.fr
# this will reproduce everything again
dvc repro data/processed.dvc

Have you thought about supporting this case? Is this common in ML pipelines (relying on this type of generic transformation applied to a bunch of files in a directory)?

Also, we would need a way to iterate over a command. It could be done with environment variables: let the user name the dependency or output, and then make them available to the shell as bash arrays listing every entry that needs to be computed:

dvc run --deps raw=data/raw 'for file in $raw; do ...; done'

Or it could be done with a special syntax:

dvc run \
-d data/raw \
-o data/processed \
transform.sh {input} > {output}

and DVC iterates internally, running that command and doing a lot of implicit things, like

The former (naming variables) looks easier to implement. We would then need to identify whether the path is a file or a directory; if it's a directory, we would need to filter every file that hasn't been computed yet (I don't know how to do this... maybe using the |
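The filtering step mentioned above (processing only the files that have not been computed yet) can be sketched in plain shell, independent of DVC. This is a minimal illustration and not a DVC feature: the function name `process_new` is invented here, the `tr` transformation is the one from the example in this comment, and the sketch only checks whether an output file already exists. It does not detect modified inputs, and it does not address how DVC itself tracks or restores outputs between runs.

```shell
# Hypothetical helper (not part of DVC): uppercase every file in SRC_DIR
# into OUT_DIR, skipping files whose output already exists.
# The body runs in a subshell, so its variables do not leak to the caller.
process_new() (
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for raw in "$src_dir"/*; do
        # Skip anything that is not a regular file (also covers an empty glob).
        [ -f "$raw" ] || continue
        processed="$out_dir/$(basename "$raw")"
        if [ -e "$processed" ]; then
            continue    # already computed, nothing to do
        fi
        tr '[:lower:]' '[:upper:]' < "$raw" > "$processed"
        echo "processed $raw"
    done
)
```

Calling `process_new data/raw data/processed` a second time after adding `data/raw/greetings.fr` would only transform that new file, which is the behavior the comment asks about.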
We should probably not call this new feature "params" anymore though, since we now have the |
@jorgeorpinel Let's leave it as is so we don't lose context. We will be getting to parametrization in the near future. |
OK. What do you mean by parametrization? |
@jorgeorpinel passing our deps/outs as parameters to the command we are running. |
Thanks. So that's my point. Parametrization already means using -p --params in |
It is not relevant anymore with new variables in dvc.yaml (#4734) as well as params.yaml. |
From xiang0x48: https://discuss.dvc.org/t/is-there-any-elegant-way-of-passing-argument-d-and-o-to-command-run-by-dvc-run/66
"When using dvc run, argument of dvc run and true command to be run are always similar, thus
dvc run -d source.npy -o target.npy python some_process.py source.npy target.npy
As there are repeated file names, it might be a “smell of code”."
Solution: introduce --pass-params to append all the dependencies and outputs as params, in the same order. Example:
Command:
dvc run -d source.npy -d proc.py -o target.npy --pass-params python proc.py myparam
DVC adds all the params in addition to the original one (myparam):
myparam source.npy proc.py target.npy
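The expansion described above can be sketched as a small shell function. This only illustrates the proposed behavior and is not DVC code: `--pass-params` is a proposal in this issue, and the function name `expand_pass_params` and its `--` separator convention are made up for the example. It prints the command line that would be run, with the dependencies and then the outputs appended after the original arguments.

```shell
# Hypothetical sketch of the proposed --pass-params expansion (not part of DVC).
# Usage: expand_pass_params [-d DEP]... [-o OUT]... -- COMMAND [ARGS]...
# The result is joined with spaces for display only, so paths containing
# spaces would be ambiguous in the printed output.
expand_pass_params() (
    deps=""
    outs=""
    # Collect -d/-o values in the order given, up to the "--" separator.
    while [ "$#" -gt 0 ]; do
        case "$1" in
            -d|-o)
                [ "$#" -ge 2 ] || { echo "missing value for $1" >&2; exit 1; }
                if [ "$1" = -d ]; then deps="$deps $2"; else outs="$outs $2"; fi
                shift 2 ;;
            --) shift; break ;;
            *)  echo "unexpected argument: $1" >&2; exit 1 ;;
        esac
    done
    # Original command and arguments first, then dependencies, then outputs.
    printf '%s\n' "$*$deps$outs"
)

expand_pass_params -d source.npy -d proc.py -o target.npy -- python proc.py myparam
# prints: python proc.py myparam source.npy proc.py target.npy
```

Appending dependencies before outputs mirrors the example in this issue. A real implementation would have to decide how to handle the order when -d and -o are interleaved on the command line.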