run: reference dependencies and outputs in command #2437

Closed
ghost opened this issue Aug 25, 2019 · 3 comments

ghost commented Aug 25, 2019

There have already been many threads discussing different ways to specify
outputs and dependencies in the command invocation.

This issue intends to summarize the different approaches so we can discuss
them and hopefully make a decision. (cc: @iterative/engineering, et al.)

Reducing duplication while defining the run command will help you to:

  • Make cmd more readable (by having meaningful names instead of relpaths)
  • Avoid mistakes (typos)
  • Feel that you are doing the right thing by "not repeating yourself" (DRY principle) 😬

Ideas

  1. Introduce --pass-params to add all the deps/outs in the same order:
     • Pros:
       • Minimal overhead
     • Cons:
       • Not flexible
       • Implicit dvc run command (cryptic)
  2. Introduce named deps/outs and pass them as env vars (e.g. dvc run -d raw=data/raw "for file in $raw ..."):
     • Pros:
       • Flexibility
     • Cons:
       • Not that straightforward to implement (maybe PathInfo could store it in an alias attribute)
       • The shell will try to expand the variable before dvc sees it, so we may need to implement a special syntax
  3. Add special syntax, expanding input and output (e.g. dvc run -d foo -o bar "cp {input} > {output}")
     • Pros:
       • Explicitness
     • Cons:
       • No way to split several outputs (e.g. dvc run -o 1.txt -o 2.txt)
  4. Introduce a build matrix (#1018) and use wildcards:
     • Pros:
       • This will solve some problems
     • Cons:
       • This will raise more questions 😅
  5. Makefile shenanigans (e.g. $<, $^, $%, rules, etc.)
  6. Use environment variables and let the shell do the job (e.g. raw=data/raw dvc run -d $raw "for file in $raw"):
     • Pros:
     • Cons:
       • Syntax is different for Windows users (%var% instead of $var)
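
To make the "named deps/outs passed as env vars" idea a bit more concrete, here is a rough Python sketch of how a wrapper could export name=path pairs to the command's environment. It is not how dvc works or would have to work; parse_named, build_env and run_stage are hypothetical names made up for this illustration.

```python
import os
import subprocess


def parse_named(spec):
    """Split 'name=path' into (name, path); a bare path gets no name.

    Assumption for this sketch: paths themselves do not contain '='.
    """
    name, sep, path = spec.partition("=")
    return (name, path) if sep else (None, spec)


def build_env(deps, outs, base_env=None):
    """Return a copy of the environment with one variable per named dep/out."""
    env = dict(os.environ if base_env is None else base_env)
    for spec in list(deps) + list(outs):
        name, path = parse_named(spec)
        if name is not None:
            env[name] = path
    return env


def run_stage(cmd, deps=(), outs=()):
    """Run cmd through the shell with the named deps/outs exported (hypothetical helper)."""
    env = build_env(deps, outs)
    return subprocess.run(cmd, shell=True, env=env, capture_output=True, text=True)
```

With something like this, run_stage('ls "$raw"', deps=["raw=data/raw"]) would let the child shell resolve $raw. The con about shell expansion still applies on the calling side: if the command is typed in double quotes, the outer shell substitutes $raw before the wrapper ever sees it, so the command has to be single-quoted or escaped.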
@ghost ghost added the research label Aug 25, 2019
@shcheklein
Member

@MrOutis amazing summary 🙏

@jorgeorpinel
Contributor

jorgeorpinel commented Aug 26, 2019

Enjoyed the summary also 🥂 I like point 3.

Add special syntax, expanding input and output (e.g. dvc run -d foo -o bar "cp {input} > {output}")
...
Cons: No way to split several outputs (e.g. dvc run -o 1.txt -o 2.txt)

Maybe something more like printf (widely known), e.g. dvc run "cp {%d} > {%o}" -d foo -o bar, could help overcome the mentioned con.
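
A minimal sketch of what such printf-like expansion could look like, in Python. The {%d} and {%o} placeholders come from the suggestion above; the numbered forms ({%o1}, {%o2}, ...) are an assumption added here to show one possible way of splitting several outputs, and expand is a hypothetical helper, not an existing dvc function.

```python
import re
import shlex

# {%d} / {%o} mean "all deps" / "all outs"; an optional 1-based number
# picks a single path (the numbered form is an assumption of this sketch).
_PLACEHOLDER = re.compile(r"\{%([do])(\d*)\}")


def expand(cmd, deps, outs):
    """Replace placeholders in cmd with shell-quoted dependency/output paths."""
    groups = {"d": list(deps), "o": list(outs)}

    def substitute(match):
        kind, index = match.group(1), match.group(2)
        paths = groups[kind]
        if index:
            n = int(index)
            if not 1 <= n <= len(paths):
                raise ValueError(f"placeholder {match.group(0)} is out of range")
            paths = [paths[n - 1]]
        # Quote each path so spaces survive the shell.
        return " ".join(shlex.quote(p) for p in paths)

    return _PLACEHOLDER.sub(substitute, cmd)
```

For example, expand("cp {%d} {%o}", ["foo"], ["bar"]) gives "cp foo bar", and with -o 1.txt -o 2.txt a command could address each output separately as {%o1} and {%o2}.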

BUT:

Are there any discussions open about the complexity of dvc run, though? I've heard mention of this, so if we decide to break it up into several commands, then maybe this whole issue would need to be revisited after that redesign happens.

Here's at least one discussion about this: https://discuss.dvc.org/t/simplifying-dvc-run-and-pipelines/199

@efiop
Contributor

efiop commented Oct 8, 2021

Closing as stale. We've switched to dvc.yaml-focused parametrization these days.

@efiop efiop closed this as completed Oct 8, 2021