Say I have a million files in the directory `./data/pre`.
I have a Python script `process_dir.py` which goes over each file in `./data/pre`, processes it, and creates a file with the same name in a directory `./data/post` (if such a file already exists, it skips processing it).
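A minimal sketch of what such a script might look like (the per-file `process()` logic is a placeholder assumption; only the skip-if-output-exists behavior matters here):

```python
#!/usr/bin/env python3
# Sketch of process_dir.py as described above; process() is a placeholder.
from pathlib import Path

PRE = Path("data/pre")
POST = Path("data/post")


def process(text: str) -> str:
    # Stand-in for the real per-file processing.
    return text.upper()


def main() -> None:
    POST.mkdir(parents=True, exist_ok=True)
    for src in PRE.iterdir():
        dst = POST / src.name
        if dst.exists():  # skip files that were already processed
            continue
        dst.write_text(process(src.read_text()))


if __name__ == "__main__":
    main()
```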
I defined a pipeline:
```sh
dvc run -n process -d process_dir.py -d data/pre -o data/post python process_dir.py
```
Now let’s say I removed one file from `data/pre`.
When I run `dvc repro`, it will still unnecessarily process all 999,999 remaining files again, because it removes (by design) the entire contents of the `./data/post` directory before running the `process` stage. Can we think of an elegant way to define the pipeline so that `process_dir.py` will not process the same file twice?
Suggestion: if we were able to define a rule that connects directly in the DAG pairs of `data/pre/X.txt` to `data/post/X.txt` in the context of the `process` stage, then we can adjust the `process` stage in the pipeline as follows (a sketch of this workflow follows the list):
1. Identify which file-pairs haven't changed and move those files to a temp dir.
2. Run the `process` stage as you normally would.
3. Move the file-pairs from the temp dir back to their original locations.
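DVC does not support this today; the following is only a rough, hypothetical emulation of the suggested behavior, written as a wrapper around `dvc repro` (since DVC deletes `data/post` before the stage command runs, the stashing has to happen before `dvc repro` is invoked). The `pair_is_unchanged` heuristic based on modification times is an assumption for illustration, not how DVC tracks changes:

```python
#!/usr/bin/env python3
# Hypothetical wrapper that emulates the suggested pair-aware behavior:
# stash unchanged pre/post pairs, run `dvc repro`, then restore the pairs.
import shutil
import subprocess
import tempfile
from pathlib import Path

PRE = Path("data/pre")
POST = Path("data/post")


def pair_is_unchanged(src: Path, dst: Path) -> bool:
    # Assumed heuristic: the output exists and is not older than the input.
    return dst.exists() and dst.stat().st_mtime >= src.stat().st_mtime


def main() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        tmp_pre, tmp_post = Path(tmp, "pre"), Path(tmp, "post")
        tmp_pre.mkdir()
        tmp_post.mkdir()

        # 1. Move unchanged file-pairs out of the way.
        stashed = []
        for src in PRE.iterdir():
            dst = POST / src.name
            if pair_is_unchanged(src, dst):
                shutil.move(str(src), str(tmp_pre / src.name))
                shutil.move(str(dst), str(tmp_post / src.name))
                stashed.append(src.name)

        # 2. Run the stage as you normally would; it only sees changed files.
        subprocess.run(["dvc", "repro"], check=True)

        # 3. Move the file-pairs back to their original locations.
        for name in stashed:
            shutil.move(str(tmp_pre / name), str(PRE / name))
            shutil.move(str(tmp_post / name), str(POST / name))


if __name__ == "__main__":
    main()
```

Note that restoring the files afterwards would still leave the stage looking changed from DVC's point of view, so this only illustrates the intent of the suggestion, not a working integration.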
@jonilaserson looks like a duplicate of #331, and there are some suggestions on how to fix it in #4213. Could you please move your excellent comment to one of those tickets?