Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc: update gc to remove unpacked dir #3054

Merged
merged 1 commit into from
Jan 6, 2020
Merged

Conversation

sharidas
Copy link

@sharidas sharidas commented Jan 4, 2020

In local remote its found that gc does not
remove the unpacked dir. This change set
helps to remove it.

Signed-off-by: Sujith H [email protected]

  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • 📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@@ -152,6 +152,9 @@ def remove(self, path_info):

if self.exists(path_info):
remove(path_info.fspath)
# Remove unpacked if dir checksum is true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this comment is necessary, the code is pretty self-explanatory :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the comment.

@pared
Copy link
Contributor

pared commented Jan 4, 2020

Ok, @sharidas, as you probably noticed, the tests do not pass. The reason for that is that you are trying to check is_dir_checksum(path_to_checksum) which will not work. I think this is a problem with naming on our side, but path_to_checksum works only for cache paths. What you could do is:

  1. use self.get_checksum to obtain checksum for path_info
  2. check if checksum is dir checksum using self.is_dir_checksum
  3. get unpacked dir path info using _get_unpacked_dir_path_info
  4. and then remove it
    Also, that will have to be done before actually removing path_info.

@sharidas
Copy link
Author

sharidas commented Jan 4, 2020

@pared Thanks for the support. I did implemented the way suggested here #3054 (comment). Unfortunately the tests are still failing. Need to look into it.

@@ -151,6 +151,9 @@ def remove(self, path_info):
raise NotImplementedError

if self.exists(path_info):
checksum = self.get_checksum(path_info)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sharidas, looks like on some tests, the state does not get load and get_checksum fails?

@pared, I wonder if this could be handled here:

dvc/dvc/remote/base.py

Lines 699 to 700 in 85c35f8

path_info = self.checksum_to_path_info(checksum)
self.remove(path_info)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skshetry Not sure. I was looking into one of the test, basically looking into the stack trace and found https://github.com/iterative/dvc/blob/master/tests/func/test_checkout.py#L320. And this test does not parse the gc code if I noticed the flow correctly.

Copy link
Member

@skshetry skshetry Jan 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated .remove() has .get_checksum() which gets checksum from .dvc/state AFAIK, but the test does not load the state and hence the exception (i assume as the state.cursor is None). You can load the state via with self.dvc.state: though. But, I'd try moving the logic after this line (though, a check if it's local needs to be added), but @iterative/dvc developers might have a different suggestion.

Copy link
Contributor

@efiop efiop Jan 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skshetry is right and that is also what we meant in the solution description in #2946 (comment) . This PR is trying to make a change to a generic method that doesn't really need to know anything about checksums and is often used in other places where it is not expected to have such a behaviour.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efiop Thanks for the feedback. I was trying to make the solution close to the remote local. Now I have updated the pr again by making change in base.py. Initially I made the change in the base.py as suggested in the issue. Later I thought it would be better to make the solution close to the local.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sharidas Makes sense, I guess my comment about "local" was confusing, sorry about that 🙂

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Please see a comment above. Also, please don't forget to include issue number in the PR or commit desc. #2946 🙂

@sharidas
Copy link
Author

sharidas commented Jan 5, 2020

@efiop I have added Fixes tag to the pr description with the issue number. Hope this sounds ok.

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Please see a few minor comments.

@@ -697,6 +697,9 @@ def gc(self, named_cache):
if checksum in used:
continue
path_info = self.checksum_to_path_info(checksum)
if self.is_dir_checksum(checksum):
if callable(getattr(self, "_get_unpacked_dir_path_info")):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably don't need callable 🙂 Also, this getattr will keep working normally even if we remove _get_unpacked_dir_path_info at some point. So your implementation is correct, but maybe we could clean it up a bit. How about we mimic the other unpacked_dir logic that we have in the code and define _remove_unpacked_dir method in base(as a noop) and in local, so that we could simply:

if self.is_dir_checksum(checksum):
    self._remove_unpacked_dir(checksum)

here, and won't have to do getattr tricks?

Also, please add a test for this to tests/func/test_gc.py. Something similar to the reproduction script in the issue would be just right 🙂

tests/func/test_gc.py Outdated Show resolved Hide resolved
tests/func/test_gc.py Outdated Show resolved Hide resolved
@sharidas
Copy link
Author

sharidas commented Jan 6, 2020

Updated the test as per the suggestion from #3054 (comment)

@@ -204,3 +205,15 @@ def test_gc_no_dir_cache(tmp_dir, dvc, repo_template):

def _count_files(path):
return sum(len(files) for _, _, files in os.walk(path))

def test_gc_no_unpacked_dir(tmp_dir, dvc, repo_template):
dir_stages = dvc.add("dir")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use stage, = tmp_dir.dvc_gen({"dir": {"file": "file_content"}})

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you choose to go this way there will be no need for tepo_template :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pared repo_template is fine too though.

@efiop
Copy link
Contributor

efiop commented Jan 6, 2020

@sharidas Looks like you didn't install pre-commit hooks described in our contribution guide, hence why the styling checks are failing.

In local remote its found that gc does not
remove the unpacked dir. This change set
helps to remove it.

Signed-off-by: Sujith H <[email protected]>
@sharidas
Copy link
Author

sharidas commented Jan 6, 2020

@sharidas Looks like you didn't install pre-commit hooks described in our contribution guide, hence why the styling checks are failing.

@efiop Yes pre-commit hooks were not installed in my env. Thanks for the pointer. Now I have installed it and rebased the PR again.

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@efiop efiop merged commit 498dcab into iterative:master Jan 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dvc gc does not remove files under dir.unpacked
5 participants