DAG with more structured dependencies per stage #4228

Closed
jonilaserson opened this issue Jul 17, 2020 · 3 comments
Labels
awaiting response we are waiting for your reply, please respond! :)

Comments

@jonilaserson

Say I have a million files in the directory ./data/pre.

I have a Python script, process_dir.py, which goes over each file in ./data/pre, processes it, and creates a file with the same name in the directory ./data/post (if that file already exists, the script skips it).
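
For concreteness, here is a minimal sketch of what such a script might look like. The actual per-file processing is not described in this issue, so `process_file` is a hypothetical placeholder (it just upper-cases text), and the function names are assumptions made for illustration.

```python
from pathlib import Path


def process_file(src: Path, dst: Path) -> None:
    # Hypothetical placeholder: the real transformation is not given in the issue.
    dst.write_text(src.read_text().upper())


def process_dir(pre: Path, post: Path) -> int:
    """Process every file in `pre` into `post`, skipping existing outputs.

    Returns the number of files that were actually processed.
    """
    post.mkdir(parents=True, exist_ok=True)
    processed = 0
    for src in sorted(pre.iterdir()):
        if not src.is_file():
            continue
        dst = post / src.name
        if dst.exists():
            # An output with the same name already exists, so skip this input.
            continue
        process_file(src, dst)
        processed += 1
    return processed
```

In the setup described here, the script would call something like `process_dir(Path("data/pre"), Path("data/post"))`. The skip only helps while the old outputs are still on disk.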

I defined a pipeline:

dvc run -n process -d process_dir.py -d data/pre -o data/post python process_dir.py

Now let’s say I removed one file from data/pre.

When I run dvc repro, it will still unnecessarily reprocess all 999,999 remaining files, because it removes (by design) the entire contents of the ./data/post directory before running the process stage. Can we think of an elegant way to define the pipeline so that process_dir.py does not process the same file twice?

Suggestion: if we were able to define a rule that directly connects, in the DAG, each data/pre/X.txt to its data/post/X.txt in the context of the process stage, then we could adjust how the process stage runs in the pipeline as follows:

  1. identify which file-pairs haven't changed and move those files to a temp dir
  2. run the process stage as you normally would
  3. move the file-pairs from the temp dir back to their original locations.
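
The three steps above could be sketched outside of DVC roughly as follows. This is only an illustration under stated assumptions, not how DVC works or would implement it: the JSON manifest of input hashes stands in for whatever change tracking the tool already does, `run_stage` is a hypothetical callable that runs the stage (and may wipe the output directory first, as dvc repro does), and all function names are made up for this example.

```python
import hashlib
import json
import shutil
from pathlib import Path
from typing import Callable, Dict, List


def file_md5(path: Path) -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unchanged_names(pre: Path, post: Path, manifest: Dict[str, str]) -> List[str]:
    """File names whose input hash matches the manifest and whose output exists."""
    return sorted(
        p.name
        for p in pre.iterdir()
        if p.is_file()
        and manifest.get(p.name) == file_md5(p)
        and (post / p.name).is_file()
    )


def repro_incrementally(
    pre: Path,
    post: Path,
    stash: Path,
    manifest_path: Path,
    run_stage: Callable[[], None],
) -> List[str]:
    """Run the stage while protecting unchanged file-pairs. Returns the kept names."""
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    keep = unchanged_names(pre, post, manifest)

    # Step 1: move unchanged file-pairs to a temp dir.
    for sub in ("pre", "post"):
        (stash / sub).mkdir(parents=True, exist_ok=True)
    for name in keep:
        shutil.move(str(pre / name), str(stash / "pre" / name))
        shutil.move(str(post / name), str(stash / "post" / name))

    try:
        # Step 2: run the stage as usual; it only sees new or changed inputs.
        run_stage()
    finally:
        # Step 3: move the file-pairs back, even if the stage failed.
        pre.mkdir(parents=True, exist_ok=True)
        post.mkdir(parents=True, exist_ok=True)
        for name in keep:
            shutil.move(str(stash / "pre" / name), str(pre / name))
            shutil.move(str(stash / "post" / name), str(post / name))

    # Record the current input hashes for the next run (assumed manifest format).
    hashes = {p.name: file_md5(p) for p in pre.iterdir() if p.is_file()}
    manifest_path.write_text(json.dumps(hashes))
    return keep
```

With this shape, removing or editing one input means the stage only sees the inputs that are new or changed, while outputs for removed inputs still disappear because the stage is free to clear data/post before it runs.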
@shcheklein
Member

@jonilaserson this looks like a duplicate of #331, and there are some suggestions on how to fix it in #4213. Could you please move your excellent comment to one of those tickets?

@shcheklein added the "awaiting response" label (we are waiting for your reply, please respond! :)) on Jul 17, 2020
@jonilaserson
Author

Indeed. Moving it there.

@shcheklein
Member

Closing this for now then. Thanks @jonilaserson !
