Say I have a million files in the directory `./data/pre`.
I have a Python script `process_dir.py` which goes over each file in `./data/pre`, processes it, and creates a file with the same name in a directory `./data/post` (if such a file already exists, it skips processing it).
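A minimal sketch of what such a script might look like (the per-file `process()` logic is a placeholder assumption; only the skip-if-output-exists behavior matters here):

```python
#!/usr/bin/env python3
# Sketch of process_dir.py as described above; process() is a placeholder.
from pathlib import Path

PRE = Path("data/pre")
POST = Path("data/post")


def process(text: str) -> str:
    # Stand-in for the real per-file processing.
    return text.upper()


def main() -> None:
    POST.mkdir(parents=True, exist_ok=True)
    for src in PRE.iterdir():
        dst = POST / src.name
        if dst.exists():  # skip files that were already processed
            continue
        dst.write_text(process(src.read_text()))


if __name__ == "__main__":
    main()
```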
I defined a pipeline:
```sh
dvc run -n process -d process_dir.py -d data/pre -o data/post python process_dir.py
```
Now let’s say I removed one file from `data/pre`.
When I run `dvc repro`, it will still unnecessarily process all 999,999 remaining files again, because it removes (by design) the entire contents of the `./data/post` directory before running the `process` stage. Can we think of an elegant way to define the pipeline so that `process_dir.py` will not process the same file twice?
Suggestion: if we were able to define a rule that connects directly in the DAG pairs of `data/pre/X.txt` to `data/post/X.txt` in the context of the `process` stage, then we can adjust the `process` stage in the pipeline as follows (a sketch of this workflow follows the list):
1. Identify which file-pairs haven't changed and move those files to a temp dir.
2. Run the `process` stage as you normally would.
3. Move the file-pairs from the temp dir back to their original locations.
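DVC does not support this today; the following is only a rough, hypothetical emulation of the suggested behavior, written as a wrapper around `dvc repro` (since DVC deletes `data/post` before the stage command runs, the stashing has to happen before `dvc repro` is invoked). The `pair_is_unchanged` heuristic based on modification times is an assumption for illustration, not how DVC tracks changes:

```python
#!/usr/bin/env python3
# Hypothetical wrapper that emulates the suggested pair-aware behavior:
# stash unchanged pre/post pairs, run `dvc repro`, then restore the pairs.
import shutil
import subprocess
import tempfile
from pathlib import Path

PRE = Path("data/pre")
POST = Path("data/post")


def pair_is_unchanged(src: Path, dst: Path) -> bool:
    # Assumed heuristic: the output exists and is not older than the input.
    return dst.exists() and dst.stat().st_mtime >= src.stat().st_mtime


def main() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        tmp_pre, tmp_post = Path(tmp, "pre"), Path(tmp, "post")
        tmp_pre.mkdir()
        tmp_post.mkdir()

        # 1. Move unchanged file-pairs out of the way.
        stashed = []
        for src in PRE.iterdir():
            dst = POST / src.name
            if pair_is_unchanged(src, dst):
                shutil.move(str(src), str(tmp_pre / src.name))
                shutil.move(str(dst), str(tmp_post / src.name))
                stashed.append(src.name)

        # 2. Run the stage as you normally would; it only sees changed files.
        subprocess.run(["dvc", "repro"], check=True)

        # 3. Move the file-pairs back to their original locations.
        for name in stashed:
            shutil.move(str(tmp_pre / name), str(PRE / name))
            shutil.move(str(tmp_post / name), str(POST / name))


if __name__ == "__main__":
    main()
```

Note that restoring the files afterwards would still leave the stage looking changed from DVC's point of view, so this only illustrates the intent of the suggestion, not a working integration.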
@jonilaserson looks like a duplicate of #331, and there are some suggestions on how to fix it in #4213. Could you please move your excellent comment to one of those tickets?