Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc: support granularity for fetch/pull/push/status/checkout #3002

Merged
merged 1 commit into from
Dec 30, 2019

Conversation

efiop
Copy link
Contributor

@efiop efiop commented Dec 23, 2019

Using output path or a subdir/subfile path within an output now works
with dvc push/pull/fetch/status -c commands. Other commands don't
support the same logic for now, as there are some questions about what
should commands like dvc remove do when given a specific output path.

Example:
dvc add data
dvc pull data/subdir # will only pull files within data/subdir

Related to #2458

  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • 📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

iterative/dvc.org#886

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about a new plan?

  • collect outs not stages, i.e. drop ._collect_granular(), make .collect() return a list of outs
  • move needed logic from stage to out (out knows its stage, so should be straitforward)
  • use out.checkout(), out.get_files_number(), ... where appropriate instead of same stage methods
  • create FilteredOutput(out, path_info) wrapper to support granularity within out dirs, expose all needed methods. Make these in Repo.collect() when needed.
  • drop no longer used Stage methods, things are done on output level now

I see some rwlock complications, but again rwlock is also output based ;)

In the end everything should look leaner than now not more complicated. I hope)

P.S. On rwlock complications. We might move it to out level, but that will make us constantly updating locks. The alternative is to move it higher level and lock everything at repo level after we resolved targets to outs/path_infos we may lock them all at once before doing things. Not sure which one is better.

dvc/repo/checkout.py Outdated Show resolved Hide resolved
dvc/cache.py Outdated Show resolved Hide resolved
@efiop
Copy link
Contributor Author

efiop commented Dec 24, 2019

@Suor Have thought about that, but it will require quite a heavy refactoring. I would rather not get into it right now. Plus, outputs are not atomic, as we have directories and we need to support granularity for files inside of them, so FilteredOutput won't work as-is.

@Suor
Copy link
Contributor

Suor commented Dec 24, 2019

@efiop don't get why FilteredOutput() won't work, how is this connected with atomicity?

@Suor
Copy link
Contributor

Suor commented Dec 24, 2019

Now a lot of methods on stage simply pass things into outs and back, this is the sign. And adding path_info filter to Stage methods doesn't look pretty either.

@efiop
Copy link
Contributor Author

efiop commented Dec 24, 2019

@efiop don't get why FilteredOutput() won't work, how is this connected with atomicity?

I see what you mean. Yeah, could do that.

Now a lot of methods on stage simply pass things into outs and back, this is the sign. And adding path_info filter to Stage methods doesn't look pretty either.

It has been a sign for a long time, this PR is not meant to deal with that. Need to create a separate ticket and deal with it as refactoring.

@efiop
Copy link
Contributor Author

efiop commented Dec 24, 2019

@Suor I'm still dealing with some checkout parts, will see if the refactoring you suggest will help with that. E.g. get_files_number thingy might be replaced by a dynamic pbar update, which is more reasonable than the giant refactoring that you suggest 🙂

@Suor
Copy link
Contributor

Suor commented Dec 24, 2019

OK, even without the big refactoring you can make Repo.collect() return a list of outs, some of which will be filtered. Then do the out.stage.do_thing(..., path_info=out.path_info) when you actually want to do stuff, which doesn't have a method on out. This way you won't need pairs.

Hmm. Actually we have two types of operations stage and out ones. It's repro and everything else correspondingly. So we might want repo.collect_stages() and repo.collect_outs(). Or should we call them .resolve_stages(targets) and .resolve_outs(targets)?

Also, I don't perceive this as a giant refactoring, just a normal non-small one ))

@efiop efiop force-pushed the 2458 branch 3 times, most recently from fa493d6 to 1d52044 Compare December 24, 2019 14:16
@efiop
Copy link
Contributor Author

efiop commented Dec 24, 2019

@Suor I appreciate the ideas and I have considered this while working on this PR, but I would prefer to begin with this change and then proceed with the refactoring, as putting it into this PR will only blow it out of proportion.

@efiop efiop changed the title [WIP] dvc: support granularity for fetch/pull/push/status/checkout dvc: support granularity for fetch/pull/push/status/checkout Dec 24, 2019
dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel how our debt is growing (

dvc/output/base.py Outdated Show resolved Hide resolved
dvc/output/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
Comment on lines 442 to 444
if out.scheme == "local" and (
path_info == out.path_info or path_info.isin(out.path_info)
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how much slower is this.

dvc/repo/checkout.py Show resolved Hide resolved
dvc/stage.py Outdated Show resolved Hide resolved
dvc/stage.py Outdated Show resolved Hide resolved
tests/unit/test_repo.py Show resolved Hide resolved
@efiop efiop force-pushed the 2458 branch 4 times, most recently from fbd7870 to cdb4749 Compare December 25, 2019 17:48
dvc/output/base.py Outdated Show resolved Hide resolved
dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/repo/__init__.py Outdated Show resolved Hide resolved
@efiop efiop changed the title dvc: support granularity for fetch/pull/push/status/checkout [WIP] dvc: support granularity for fetch/pull/push/status/checkout Dec 26, 2019
@efiop
Copy link
Contributor Author

efiop commented Dec 26, 2019

I feel how our debt is growing (

@Suor Yeah 🙁 I think we could do better than FilteredOutput. It seems like our output dir and file concepts are a bit off and we actually need OutputDir and Output(File), that OutputDir will consist of. That way all our manipulations with get_files_number and progress bars would be much more straightforward, as checkout would be always dealing with Output(File)s, which are trivial to wrap in tqdm pbar. Also we would simply filter right away, instead of creating a special FilteredOutput class. Though unpacked tricks would probably cause some troubles there. That being said, working on a pre-collected list of output files, would also simplify parallelization and stuff like that, so we might get rid of unpacked_dir for good. Need to take a closer look.

@efiop efiop changed the title [WIP] dvc: support granularity for fetch/pull/push/status/checkout dvc: support granularity for fetch/pull/push/status/checkout Dec 27, 2019
@efiop efiop requested a review from pared December 27, 2019 09:05
@efiop efiop requested a review from a user December 27, 2019 09:05
Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there is a last question )

dvc/repo/__init__.py Outdated Show resolved Hide resolved
Using output path or a subdir/subfile path within an output now works
with `dvc push/pull/fetch/status -c` commands. Other commands don't
support the same logic for now, as there are some questions about what
should commands like `dvc remove` do when given a specific output path.

Example:
    dvc add data
    dvc pull data/subdir # will only pull files within data/subdir

Related to iterative#2458
@efiop efiop merged commit ef6018f into iterative:master Dec 30, 2019
@efiop efiop deleted the 2458 branch December 30, 2019 05:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants