
Unable to dvc add more than 1 file at a time in s3 bucket #2678

Closed
elleobrien opened this issue Oct 27, 2019 · 11 comments · Fixed by #2683
Labels
bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP. research

Comments


elleobrien commented Oct 27, 2019

Note: DVC version 0.66.1, installed via pip, on Ubuntu 16.04

I am having difficulty configuring DVC to track files on an s3 bucket with the structure

ellesdatabucket
       ├── data
       │     ├── img1.png
       │     ├── img2.png
       │     ├── ...
       └── cache

Specifically, I want to use DVC to version control *.png files stored in the data folder, and use the cache folder as DVC's cache.

Based on the docs provided here, I believe I've replicated exactly the provided steps. But I hit an error when I run dvc add:

$ git init
$ dvc init
$ dvc remote add myremote s3://ellesdatabucket/data
$ dvc remote add s3cache s3://ellesdatabucket/cache
$ dvc config cache.s3 s3cache
$ dvc add s3://ellesdatabucket/data

The output looks initially encouraging,

 29%|██▊       |Computing hashes (only done once691/2424 [00:03<00:07,    223md5/s]

But then I get this error message:

ERROR: s3://ellesdatabucket/data/. does not exist: An error occurred (404) when calling the HeadObject operation: Not Found

I'm positive that there are files in the data folder and can view them with aws s3 ls s3://ellesdatabucket/data. And if I try to run dvc add with only a single file at a time instead of a whole directory, the command completes successfully. Although a bash script could dvc add each file in a loop, I want to make sure there's not a better way. Issue #2647 seems to discuss a similar problem, but I can't figure out how to apply that code to my own example here.

Thank you for any help!

More context:

https://discordapp.com/channels/485586884165107732/485596304961962003/637836708184064010


elleobrien commented Oct 27, 2019

Also here are the contents of my .dvc/config file if it helps:

['remote "myremote"']
url = s3://ellesdatabucket/data
['remote "s3cache"']
url = s3://ellesdatabucket/cache
[cache]
s3 = s3cache

@shcheklein shcheklein assigned ghost Oct 27, 2019
@shcheklein shcheklein added bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP. labels Oct 27, 2019

efiop commented Oct 27, 2019

Hi @andronovhopf ! Which dvc --version are you using? The functionality you are looking for has been merged very recently, so I suspect that you just need to upgrade to the latest version, if you are not using it already 🙂

EDIT: sorry, missed that you've already provided the version and it is indeed the latest one. Looks like we have a bug, looking into it now...


efiop commented Oct 27, 2019

@andronovhopf Could you please post output of dvc add s3://ellesdatabucket/data -v?


efiop commented Oct 27, 2019

No need for logs. I am able to reproduce by creating an empty dir first. Working on a fix right now.

aws s3api put-object --bucket dvc-temp --key ruslan-test/data
➜  dvc git:(master) ✗ dvc add s3://dvc-temp/ruslan-test/data                        
100%|██████████|Add                                                                                                                                                                                             1/1 [00:06<00:00,  6.55s/file]
ERROR: s3://dvc-temp/ruslan-test/data/. does not exist: An error occurred (404) when calling the HeadObject operation: Not Found

@efiop efiop assigned efiop and unassigned ghost Oct 27, 2019

efiop commented Oct 27, 2019

For the record, full log:

➜  dvc git:(master) ✗ dvc add s3://dvc-temp/ruslan-test/data -v
DEBUG: PRAGMA user_version;
DEBUG: fetched: [(3,)]
DEBUG: CREATE TABLE IF NOT EXISTS state (inode INTEGER PRIMARY KEY, mtime TEXT NOT NULL, size TEXT NOT NULL, md5 TEXT NOT NULL, timestamp TEXT NOT NULL)
DEBUG: CREATE TABLE IF NOT EXISTS state_info (count INTEGER)
DEBUG: CREATE TABLE IF NOT EXISTS link_state (path TEXT PRIMARY KEY, inode INTEGER NOT NULL, mtime TEXT NOT NULL)
DEBUG: INSERT OR IGNORE INTO state_info (count) SELECT 0 WHERE NOT EXISTS (SELECT * FROM state_info)
DEBUG: PRAGMA user_version = 3;
100%|██████████|Add                                                                                                                                                                                             1/1 [00:05<00:00,  5.48s/file]
DEBUG: SELECT count from state_info WHERE rowid=?                                                                                                                                                                                             
DEBUG: fetched: [(7206,)]
DEBUG: UPDATE state_info SET count = ? WHERE rowid = ?
ERROR: s3://dvc-temp/ruslan-test/data/. does not exist: An error occurred (404) when calling the HeadObject operation: Not Found
------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/efiop/git/dvc/dvc/remote/s3.py", line 88, in get_head_object
    obj = s3.head_object(Bucket=bucket, Key=path, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 661, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

Traceback (most recent call last):
  File "/Users/efiop/git/dvc/dvc/remote/s3.py", line 88, in get_head_object
    obj = s3.head_object(Bucket=bucket, Key=path, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 661, in _make_api_call
    raise error_class(parsed_response, operation_name)

botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

Traceback (most recent call last):
  File "/Users/efiop/git/dvc/dvc/remote/s3.py", line 88, in get_head_object
    obj = s3.head_object(Bucket=bucket, Key=path, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 661, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/efiop/git/dvc/dvc/command/add.py", line 24, in run
    fname=self.args.file,
  File "/Users/efiop/git/dvc/dvc/repo/__init__.py", line 33, in wrapper
    ret = f(repo, *args, **kwargs)
  File "/Users/efiop/git/dvc/dvc/repo/scm_context.py", line 4, in run
    result = method(repo, *args, **kw)
  File "/Users/efiop/git/dvc/dvc/repo/add.py", line 51, in add
    stage.save()
  File "/Users/efiop/git/dvc/dvc/stage.py", line 711, in save
    out.save()
  File "/Users/efiop/git/dvc/dvc/output/base.py", line 244, in save
    if not self.changed():
  File "/Users/efiop/git/dvc/dvc/output/base.py", line 200, in changed
    status = self.status()
  File "/Users/efiop/git/dvc/dvc/output/base.py", line 191, in status
    if self.changed_checksum():
  File "/Users/efiop/git/dvc/dvc/output/base.py", line 173, in changed_checksum
    != self.remote.save_info(self.path_info)[
  File "/Users/efiop/git/dvc/dvc/remote/base.py", line 323, in save_info
    return {self.PARAM_CHECKSUM: self.get_checksum(path_info)}
  File "/Users/efiop/git/dvc/dvc/remote/base.py", line 312, in get_checksum
    checksum = self.get_dir_checksum(path_info)
  File "/Users/efiop/git/dvc/dvc/remote/base.py", line 228, in get_dir_checksum
    dir_info = self._collect_dir(path_info)
  File "/Users/efiop/git/dvc/dvc/remote/base.py", line 204, in _collect_dir
    new_checksums = self._calculate_checksums(not_in_state)
  File "/Users/efiop/git/dvc/dvc/remote/base.py", line 187, in _calculate_checksums
    checksums = dict(zip(file_infos, tasks))
  File "/usr/local/lib/python3.7/site-packages/tqdm/std.py", line 1081, in __iter__
    for obj in iterable:
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
    yield fs.pop().result()
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/efiop/git/dvc/dvc/remote/s3.py", line 82, in get_file_checksum
    return self.get_etag(self.s3, path_info.bucket, path_info.path)
  File "/Users/efiop/git/dvc/dvc/remote/s3.py", line 77, in get_etag
    obj = cls.get_head_object(s3, bucket, path)
  File "/Users/efiop/git/dvc/dvc/remote/s3.py", line 91, in get_head_object
    "s3://{}/{} does not exist".format(bucket, path), exc
dvc.exceptions.DvcException: s3://dvc-temp/ruslan-test/data/. does not exist
------------------------------------------------------------

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!


efiop commented Oct 27, 2019

Test

diff --git a/tests/unit/remote/test_s3.py b/tests/unit/remote/test_s3.py
index 7861fb5a..64cd0031 100644
--- a/tests/unit/remote/test_s3.py
+++ b/tests/unit/remote/test_s3.py
@@ -83,3 +83,11 @@ def test_walk_files(remote):
     ]
 
     assert list(remote.walk_files(remote.path_info / "data")) == files
+
+    # This is how some users are creating empty dirs. See [1] for more info.
+    # [1] https://github.com/iterative/dvc/issues/2678
+    remote.s3.put_object(Bucket="bucket", Key="data/")
+
+    files = [remote.path_info / "data/."] + files
+
+    assert list(remote.walk_files(remote.path_info / "data")) == files
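The regression test above encodes the expected behavior: a zero-byte `data/` marker object shows up in the listing. A minimal sketch of the kind of marker-key filtering a fix could apply (a hypothetical helper for illustration, not DVC's actual code from #2683) might look like:

```python
def drop_dir_markers(keys):
    # Hypothetical helper, not DVC's actual fix: skip zero-byte
    # "directory marker" objects (keys ending in "/") returned by an
    # S3 listing, so per-file operations like HeadObject never see them.
    return [key for key in keys if not key.endswith("/")]

listing = ["data/", "data/img1.png", "data/subdir/", "data/subdir/1"]
print(drop_dir_markers(listing))
# ['data/img1.png', 'data/subdir/1']
```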


efiop commented Oct 28, 2019

@andronovhopf Sorry for the delay, we are attending OS Summit in Lyon right now 🙂 A proper patch is coming, but the workaround for you would be to do:

aws s3api delete-object --bucket ellesdatabucket --key data/

Note the / at data/. That one is really important. Effectively we are deleting a so-called "empty directory" that was put there by something like

aws s3api put-object --bucket ellesdatabucket --key data/

Deleting it doesn't delete the data inside of it. Btw, how did you upload the data there? An ec2 instance through some s3fs? Or through the s3 API (e.g. Python's boto3 or the awscli CLI tool)? I'm wondering about it because tools usually don't create an "empty dir" for s3, since s3 doesn't actually have the concept of dirs, and prefixes are not required to be pre-created.
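To see why that marker object surfaces as data/. in the error, consider how a listed key relativizes against the prefix (a rough illustration of the mechanics, not DVC's actual internals):

```python
import posixpath

def relativize(keys, prefix):
    # Rough illustration (not DVC's internals): relativizing listed keys
    # against the prefix turns the zero-byte marker key "data/" into ".",
    # which is why the error mentions s3://.../data/.
    return [posixpath.relpath(key, prefix) for key in keys]

print(relativize(["data/", "data/img1.png"], "data"))
# ['.', 'img1.png']
```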

Thank you for your patience 🙂


elleobrien commented Oct 28, 2019

Hi @efiop thanks for your help!

I believe I created the bucket and the data "directory" through the AWS dashboard GUI and then uploaded a bunch of files from my local machine using

aws s3 sync --include="000016*" . s3://ellesdatabucket/data

In the future, what commands would you recommend using to create separate cache and data locations when preparing an s3 bucket for DVC?

EDIT: I've just learned that it is not necessary to create a cache "empty directory" before the first dvc add for buckets.


ghost commented Oct 28, 2019

Sorry for jumping late to the discussion (AFK during the weekend 🙈 )!

@efiop, I was already testing for empty directories but forgot to do it with walk_files:

s3.put_object(Bucket="bucket", Key="empty_dir/")

I would change your patch a little bit to the following:

diff --git a/tests/unit/remote/test_s3.py b/tests/unit/remote/test_s3.py
index 7861fb5a..fb8393a0 100644
--- a/tests/unit/remote/test_s3.py
+++ b/tests/unit/remote/test_s3.py
@@ -80,6 +80,8 @@ def test_walk_files(remote):
         remote.path_info / "data/subdir/1",
         remote.path_info / "data/subdir/2",
         remote.path_info / "data/subdir/3",
+        remote.path_info / "empty_file",
+        remote.path_info / "foo",
     ]
 
-    assert list(remote.walk_files(remote.path_info / "data")) == files
+    assert list(remote.walk_files(remote.path_info)) == files


ghost commented Oct 28, 2019

The patch was a small one, @efiop. I assumed you were busy with the OS Summit and submitted a PR that fixes the problem :)

#2683

I already tested it manually.
