dvc: support granularity for fetch/pull/push/status/checkout #3002
4d6a723 to 3025798
How about a new plan?
- collect outs not stages, i.e. drop `._collect_granular()`, make `.collect()` return a list of outs
- move needed logic from stage to out (out knows its stage, so should be straightforward)
- use `out.checkout()`, `out.get_files_number()`, ... where appropriate instead of same stage methods
- create `FilteredOutput(out, path_info)` wrapper to support granularity within out dirs, expose all needed methods. Make these in `Repo.collect()` when needed.
- drop no longer used Stage methods, things are done on output level now
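To make the `FilteredOutput(out, path_info)` idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `Output` is a hypothetical stand-in rather than DVC's real output class, `iter_paths()` is an invented helper, and `pathlib.PurePosixPath` replaces DVC's own path objects. Only the names `FilteredOutput` and `get_files_number()` come from the plan above.

```python
from pathlib import PurePosixPath


class Output:
    """Hypothetical stand-in for a DVC output; not the real DVC class."""

    def __init__(self, path, files=None):
        self.path_info = PurePosixPath(path)
        # For a directory output: relative paths of the files it contains.
        # For a single-file output the list stays empty.
        self.files = [PurePosixPath(f) for f in (files or [])]

    def iter_paths(self):
        """Yield the full path of every file tracked by this output."""
        if not self.files:
            yield self.path_info
            return
        for rel in self.files:
            yield self.path_info / rel

    def get_files_number(self):
        return sum(1 for _ in self.iter_paths())


class FilteredOutput:
    """View of an output restricted to path_info (a file or subdir in it)."""

    def __init__(self, out, path_info):
        self.out = out
        self.path_info = PurePosixPath(path_info)

    def iter_paths(self):
        for path in self.out.iter_paths():
            # Keep the filter path itself or anything nested under it.
            if path == self.path_info or self.path_info in path.parents:
                yield path

    def get_files_number(self):
        return sum(1 for _ in self.iter_paths())

    def __getattr__(self, name):
        # Anything not overridden here is delegated to the wrapped output.
        return getattr(self.out, name)


data = Output("data", ["a.csv", "subdir/b.csv", "subdir/deep/c.csv"])
subdir = FilteredOutput(data, "data/subdir")
print(data.get_files_number())    # 3
print(subdir.get_files_number())  # 2
```

The wrapper overrides only the methods whose result depends on the filter and delegates the rest, which is one way to "expose all needed methods" without duplicating them.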
I see some rwlock complications, but again rwlock is also output based ;)
In the end everything should look leaner than now not more complicated. I hope)
P.S. On rwlock complications. We might move it to out level, but that will have us constantly updating locks. The alternative is to move it to a higher level and lock everything at repo level: after we have resolved targets to outs/path_infos, we may lock them all at once before doing things. Not sure which one is better.
@Suor Have thought about that, but it will require quite a heavy refactoring. I would rather not get into it right now. Plus, outputs are not atomic, as we have directories and we need to support granularity for files inside of them, so FilteredOutput won't work as-is.
@efiop don't get why
Now a lot of methods on stage simply pass things into outs and back, this is the sign. And adding
I see what you mean. Yeah, could do that.
It has been a sign for a long time, this PR is not meant to deal with that. Need to create a separate ticket and deal with it as refactoring.
@Suor I'm still dealing with some checkout parts, will see if the refactoring you suggest will help with that. E.g.
OK, even without the big refactoring you can make Hmm. Actually we have two types of operations stage and out ones. It's repro and everything else correspondingly. So we might want Also, I don't perceive this as a giant refactoring, just a normal non-small one ))
fa493d6 to 1d52044
@Suor I appreciate the ideas and I have considered this while working on this PR, but I would prefer to begin with this change and then proceed with the refactoring, as putting it into this PR will only blow it out of proportion.
I feel how our debt is growing (
dvc/repo/__init__.py

    if out.scheme == "local" and (
        path_info == out.path_info or path_info.isin(out.path_info)
    ):
I wonder how much slower this is.
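One way to get a rough number is to time the containment check over many outputs. The sketch below is an assumption-laden stand-in: it uses `pathlib.PurePosixPath` instead of DVC's path objects, reads `isin` as "is located inside", and the helper `matches` and the 1000 synthetic outputs are invented for the measurement. Real timings depend on the actual path class and repository size.

```python
import timeit
from pathlib import PurePosixPath


def matches(path_info, out_path_info):
    """True if path_info is the output itself or lies inside it."""
    return path_info == out_path_info or out_path_info in path_info.parents


# Synthetic outputs data0 .. data999 and one granular target.
outs = [PurePosixPath("data%d" % i) for i in range(1000)]
target = PurePosixPath("data999/subdir/file.csv")

hits = [out for out in outs if matches(target, out)]
print(hits)

runs = 100
elapsed = timeit.timeit(
    lambda: [o for o in outs if matches(target, o)], number=runs
)
print("%.6f s per scan of %d outs" % (elapsed / runs, len(outs)))
```

The scan is linear in the number of outputs, so the cost of the extra check grows with repository size rather than with the depth of the target path alone.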
fbd7870 to cdb4749
@Suor Yeah 🙁 I think we could do better than FilteredOutput. It seems like our output dir and file concepts are a bit off and we actually need OutputDir and Output(File), which OutputDir will consist of. That way all our manipulations with
So there is a last question )
Using an output path or a subdir/subfile path within an output now works with `dvc push/pull/fetch/status -c` commands. Other commands don't support the same logic for now, as there are some questions about what commands like `dvc remove` should do when given a specific output path.

Example:

    dvc add data
    dvc pull data/subdir # will only pull files within data/subdir

Related to iterative#2458
❗ Have you followed the guidelines in the Contributing to DVC list?
📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.
iterative/dvc.org#886