Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support adding directories in google cloud storage remote #2853

Merged
merged 19 commits into from
Dec 1, 2019

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Nov 26, 2019

  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • πŸ“– Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Description

I have reused test code from unit/test_s3.py by moving it to functional tests, as, I argue, moto server can be assumed as functional as real s3 (which is a selling point of moto itself). The same tests are run with different remotes (i.e. s3 and gs).

And, I have stolen most of the code from #2619.

The logic behind s3 and gs is almost similar, as they both are object storages. So, there is an opportunity for abstraction for ObjectStorageRemote, but, I think that it'll be over-engineering for now (what about azure?).

Todo

  • Reuse code instead of copy-pasting.
  • Check for empty_dir. Will that work?
  • Use prefix for the test, or custom bucket (might not be possible locally)
  • Check if tests for s3 and gs can be combined with different remotes.
  • Rebase with current master.
  • Run some manual tests with directories.
  • Run all tests locally 2-3 times with DVC_TEST_GCP set and make it pass.
  • Make this PR ready for review.

How was this tested?

  1. Upload folder in structure similar to the following:
$ gsutil ls -r gs://dvc-test/skshetry-test
gs://dvc-test/skshetry-test/:
gs://dvc-test/skshetry-test/data1.txt

gs://dvc-test/skshetry-test/data/:
gs://dvc-test/skshetry-test/data/data2.txt
gs://dvc-test/skshetry-test/data/data3.txt
gs://dvc-test/skshetry-test/data/data5.txt

gs://dvc-test/skshetry-test/data/subdir/:
gs://dvc-test/skshetry-test/data/subdir/data4.txt

gs://dvc-test/skshetry-test/data/subdir2/:
gs://dvc-test/skshetry-test/data/subdir2/data6.txt
gs://dvc-test/skshetry-test/data/subdir2/data7.txt

gs://dvc-test/skshetry-test/data/subdir3/:

gs://dvc-test/skshetry-test/data/subdir3/nested/:
gs://dvc-test/skshetry-test/data/subdir3/nested/data6.txt
$ git init
$ dvc init
$ dvc remote add gscache gs://dvc-test/skshetry-test
$ dvc config cache.gs gscache
  1. dvc add gs://dvc-test/skshetry-test/data (check gs://dvc-test/skshetry-test/data1.txt should not get uploaded).
  2. Reset cache, and dvc add gs://dvc-test/skshetry-test/data/subdir.
  3. Similar to step 4 with dvc add gs://dvc-test/skshetry-test/data/subdir2.
  4. Similar to step 4 with dvc add gs://dvc-test/skshetry-test/data/subdir3.
  5. Similar to step 4 with dvc add gs://dvc-test/skshetry-test and dvc add gs://dvc-test/skshetry-test/.
  6. Reset cache and dvc run -d remote://gs/data 'echo hello world'.

Fixes #2814

tests/func/test_gs.py Outdated Show resolved Hide resolved
@Suor
Copy link
Contributor

Suor commented Nov 26, 2019

Am I missing something? We agreed to not add this for now. The ticket #1654 is closed.

@skshetry
Copy link
Member Author

@Suor, there is another issue for gs only: #2814. Forgot to mention in the PR.

@efiop efiop changed the title Support adding directories in google cloud storage remote WIP Support adding directories in google cloud storage remote Nov 27, 2019
@efiop efiop changed the title WIP Support adding directories in google cloud storage remote [WIP] Support adding directories in google cloud storage remote Nov 27, 2019
@efiop
Copy link
Contributor

efiop commented Nov 27, 2019

FYI: added [WIP] prefix to make it more obvious ("draft" is fine, but prefix seems to be a bit easier to grasp, no biggie) πŸ™‚

@skshetry skshetry force-pushed the gs-external-deps branch 3 times, most recently from 73d6102 to 87ee935 Compare November 27, 2019 17:08
@skshetry skshetry marked this pull request as ready for review November 28, 2019 00:17
@skshetry skshetry changed the title [WIP] Support adding directories in google cloud storage remote Support adding directories in google cloud storage remote Nov 28, 2019
@shcheklein shcheklein requested a review from a user November 28, 2019 00:28
dvc/remote/gs.py Outdated Show resolved Hide resolved
@efiop
Copy link
Contributor

efiop commented Nov 28, 2019

@skshetry About empty dirs, have you seen #2678 ? Does gs have a concept of empty dirs? πŸ™‚

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, good start.

We need to unify this and some existing code both in remotes and tests, see below.

dvc/remote/base.py Outdated Show resolved Hide resolved
dvc/remote/gs.py Outdated Show resolved Hide resolved
dvc/remote/gs.py Outdated Show resolved Hide resolved
tests/func/test_remote_dir.py Outdated Show resolved Hide resolved
tests/func/test_remote_dir.py Outdated Show resolved Hide resolved
tests/func/test_remote_dir.py Outdated Show resolved Hide resolved
tests/func/test_remote_dir.py Outdated Show resolved Hide resolved
@skshetry
Copy link
Member Author

skshetry commented Nov 28, 2019

@skshetry About empty dirs, have you seen #2678 ? Does gs have a concept of empty dirs?

Yep, I saw that, and this also suffers from same issue.

➜ dvc add gs://dvc-test/skshetry-test/empty-dir/
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ|Add                                                                                                 1.00/1.00 [00:21<00:00,  21.1s/file]
ERROR: 'skshetry-test/empty-dir/.' doesn't exist in the cloud 

gsutil does not support creating empty directories, but Google Cloud Console seems to support this which I can't confirm as I don't have access to it. And, we can emulate this by creating blob with trailing slash (gsutil provides an illusion to this.)

I didn't act on it because of #2683 (comment), as I am not really sure what you mean by solving it properly. I guess, I was just waiting for how it will be solved for s3. But, what @MrOutis has done in the PR, same can be applied here char-by-char.

cc @efiop

EDIT: talked with efiop on chat

ignore empty dirs inside of the directory we are adding,
but not ignore it if the whole directory is empty.
That way it will have the same behaviour as in local empty dirs.

@efiop
Copy link
Contributor

efiop commented Nov 28, 2019

For the record, had an additional discussion on discord: https://discordapp.com/channels/485586884165107732/565699007037571084/649634041993101333

* master:
  address @efiop's suggestions
  s3: add support for top level empty directories
  remote: http: calculate length basing on response and not head call
  test: remote: http: clear test file name
  HTTPError: reformat error message
  Update dvc/remote/http.py
  remote: http: rename http error, refactor test
  remote: http: raise exception when download response with error status code
  Support GDrive as remote
  setup: don't forget contextlib2
  remote: refactor ssh ask password code
  remote: protect all remote client/session creation code with locks
  test: refactor & remove redundant test fixtures
  NoRemoteInExternalRepoError: dont pass cause of exception
  perf: optimize cache listing for local, ssh and hdfs
  remote: small .save_info()/.get_checksum() cleanup
@Suor
Copy link
Contributor

Suor commented Nov 30, 2019 via email

@efiop efiop requested a review from Suor November 30, 2019 18:17
@efiop
Copy link
Contributor

efiop commented Nov 30, 2019

For the record: windows tests are failing due to unrelated problems with chocolatey packages. The rest of the tests are fine. ;)

EDIT: sorry, accidentally closed πŸ™

@efiop efiop closed this Nov 30, 2019
@efiop efiop reopened this Nov 30, 2019
@Suor
Copy link
Contributor

Suor commented Nov 30, 2019

Wrote about listing for exists here - #2873 (comment). TLDR, the reason why we do listing instead of some Blob.exists() call is unknown. If we may do the latter then we won't have all those issues.

BTW, we will still need to list to check if dir exists, which is not good for cache batch exists, quoting from there:

Additional consideration about cache, we don't really want to check .isdir() for every checksum, but we want .exists() to work for both files and dirs for remotes supporting that. This contradiction might be resolved by calling .isfile() instead of .exists() inside .cache_exists(). We will need to make isfile() work properly though, now it is just a stab in RemoteBASE.

@skshetry
Copy link
Member Author

skshetry commented Dec 1, 2019

Thanks @Suor. I was too pinned on checking for the directory, that I missed it. I implemented isfile, but for the .cache_exists(), we could implement that in #2873 or in a separate PR.

* master:
  travis: windows: use python 3.7.5
  test: skip non supported remotes fast in api tests
dvc/remote/gs.py Outdated Show resolved Hide resolved
tests/func/test_api.py Outdated Show resolved Hide resolved
@@ -0,0 +1,337 @@
import os
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you postpone, then create an issue, add a link here and resolve this thread :)

tests/remotes.py Outdated Show resolved Hide resolved
tests/remotes.py Outdated Show resolved Hide resolved
tests/remotes.py Outdated Show resolved Hide resolved
tests/remotes.py Outdated Show resolved Hide resolved
tests/remotes.py Outdated Show resolved Hide resolved
tests/remotes.py Outdated Show resolved Hide resolved
Comment on lines +83 to 84
@pytest.mark.parametrize("remote", [S3Mocked], indirect=True)
def test_copy_preserve_etag_across_buckets(remote):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to create all those blobs for this test and parametrize looks a bit stretched. I suggest using @mock_s3 and creating both remotes simply with RemoteS3(...) and not using S3Mocked here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current approach is also fine, I guess.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping it this way for now, then. :)

@efiop efiop requested a review from Suor December 1, 2019 11:02
Comment on lines 111 to 130
@pytest.mark.parametrize("remote", [GCP], indirect=True)
def test_isfile(remote):
test_cases = [
(False, "empty_dir/"),
(True, "empty_file"),
(True, "foo"),
(True, "data/alice"),
(True, "data/alpha"),
(True, "data/subdir/1"),
(True, "data/subdir/2"),
(True, "data/subdir/3"),
(False, "data/subdir/empty_dir/"),
(True, "data/subdir/empty_file"),
(False, "something-that-does-not-exist"),
(False, "data/subdir/empty-file/"),
(False, "empty_dir"),
]

for expected, path in test_cases:
assert remote.isfile(remote.path_info / path) == expected
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test will need adjustment if #2873 gets merged.

Copy link
Contributor

@efiop efiop Dec 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's merge #2873 first, once we get one more approval πŸ˜‰ And then we'll adjust and merge this one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skshetry #2873 is merged, please adjust this PR accordingly πŸ™‚

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks way cleaner than average dvc code now) And thanks for starting a remote fixtures refactor.

dvc/remote/gs.py Outdated

eg: if `data/file.txt` exists, check for `data` should return True
"""
return self.isfile(path_info) or self.isdir(path_info / "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return self.isfile(path_info) or self.isdir(path_info / "")
return self.isfile(path_info) or self.isdir(path_info)

No need to make it twice.

@Suor
Copy link
Contributor

Suor commented Dec 1, 2019

BTW, need to create .cache_exists()/isfile() issue before closing it.

@skshetry
Copy link
Member Author

skshetry commented Dec 1, 2019

@Suor, I have created an issue: #2877. Regarding the suggestion #2853 (comment), I'll make a issue after this gets merged.

@skshetry skshetry requested a review from efiop December 1, 2019 12:12
@efiop efiop merged commit 3d70a1e into iterative:master Dec 1, 2019
@efiop
Copy link
Contributor

efiop commented Dec 1, 2019

Great stuff, thank you so much @skshetry ! πŸš€

For the record: windows tests are failing due to temporary problems with chocolatey. Ignoring that for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

gs: support directories as external dependencies/outputs
3 participants