Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

s3: ignore empty directories while walking files #2683

Merged
merged 2 commits into from Nov 29, 2019
Merged

s3: ignore empty directories while walking files #2683

merged 2 commits into from Nov 29, 2019

Conversation

ghost
Copy link

@ghost ghost commented Oct 28, 2019

@ghost ghost requested a review from efiop October 28, 2019 19:22
Suor
Suor previously requested changes Oct 29, 2019
dvc/remote/s3.py Outdated Show resolved Hide resolved
@ghost
Copy link
Author

ghost commented Oct 29, 2019

Had an offline discussion with @efiop where we argued if DVC should track empty directories or not.
I agreed on submitting a patch that supports empty directories for S3 but it is not straight forward.

git ignores empty directories, it is convention to add an empty stub file (e.g. .keep, .gitkeep) if you want to track it.

What do you think, @Suor , @efiop , sholud we track empty dirs?

@efiop
Copy link
Contributor

efiop commented Oct 29, 2019

@MrOutis It is not an empty dir, it is a special object that is used that way. I think we have to support it, as someone might have those outputs as empty dirs and have logic built on the assumption that that prefix already exists.

@ghost
Copy link
Author

ghost commented Oct 29, 2019

@efiop, what kind of logic?

@ghost
Copy link
Author

ghost commented Oct 29, 2019

@efiop , depending on existing directories breaks reproducibility (preferred way: dvc run -o data 'mkdir data && fill_data.sh') . In case you are depending on a directory it would cause overlapping stages:

$ dvc init --no-scm
$ mkdir data
$ echo 'foo' > data/foo
$ dvc add data
$ echo 'bar' > data/bar
$ dvc add data/bar

ERROR: Paths for outs:
'data'('data.dvc')
'data/bar'('data/bar.dvc')
overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

@ghost ghost dismissed Suor’s stale review October 29, 2019 19:08

changes done

@efiop
Copy link
Contributor

efiop commented Oct 29, 2019

@MrOutis The logic of something assuming that the prefix exists. For example your first stage is creating that directory that way by doing aws s3api put-object --bucket bucket --key empty_dir/ and then doesn't have anything to put in there. Next stage expects that prefix to exist, but won't find it if we don't cache and checkout it properly.

changes done

This patch still doesn't handle aws s3api put-object --bucket bucket --key empty_dir/.

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is not fixed yet.

@ghost
Copy link
Author

ghost commented Oct 29, 2019

Next stage expects that prefix to exist, but won't find it if we don't cache and checkout it properly.

@efiop , but why would the next stage expect that such prefix exist?

@efiop
Copy link
Contributor

efiop commented Oct 29, 2019

@MrOutis I have no idea, something created that empty dir in users example, so something is using it and I won't be surprised if something else would be expecting it to exist. I would think about it as creating an empty dir in local fs, but just adapted to s3 caveats.

@Suor
Copy link
Contributor

Suor commented Oct 30, 2019

I believe the original issue - "an empty dir" created by user making a mess - is fixed. The supposition that some later user code will rely on that "empty dir" to exist is a separate thing.

I am not sure it's actually a thing, the issue might, however, come up if a user is using "a dir file" as an actual file with contents. And this can't be fixed without giving up on the dir notion for S3 and using a prefix notion instead.

@efiop
Copy link
Contributor

efiop commented Oct 30, 2019

@Suor This PR is hiding the issue and we still don't support "empty dirs" here. We should solve it properly.

@efiop efiop changed the title s3: ignore empty directories while walking files [WIP] s3: ignore empty directories while walking files Nov 25, 2019
Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there are 2 parts to what we've discussed here in terms of the implementation.

  1. collection
    Same as with local directories, walk_files should ignore empty directories. But the top level directory shouldn't be ignored, and that can be implemented by making isdir return True on empty directories. That way get_checksum will also be able to properly call get_dir_checksum instead of calling get_file_checksum on an "empty_dir" object.

  2. checkout
    We shouldn't create embedded empty dirs, hence why we didn't collect them in 1. But we should create a top level empty dir. For the sake of implementation simplicity, we could simply implement makedirs for s3 to create empty_dir objects.

@ghost ghost changed the title [WIP] s3: ignore empty directories while walking files s3: ignore empty directories while walking files Nov 29, 2019
@ghost
Copy link
Author

ghost commented Nov 29, 2019

For the record, dvc add remote://s3/empty_dir already worked @{30.days.ago}.
It was just a matter of implementing makedirs

tests/unit/remote/test_s3.py Outdated Show resolved Hide resolved
dvc/remote/s3.py Outdated Show resolved Hide resolved
dvc/remote/s3.py Outdated Show resolved Hide resolved
@ghost ghost requested a review from efiop November 29, 2019 18:38
Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! 🎉

@efiop efiop merged commit 94f803e into iterative:master Nov 29, 2019
@Suor
Copy link
Contributor

Suor commented Nov 30, 2019

Sorry guys to revive this, but I still don't get the logic behind this change. If we state that an s3 dir is a collection of files with common prefix ending with /, say:

s3://bucket/path/dir/a
s3://bucket/path/dir/b
s3://bucket/path/dir/c

Or

s3://bucket/path/dir/
s3://bucket/path/dir/a
s3://bucket/path/dir/b
s3://bucket/path/dir/c

Then why do we actively creating dir files - the ones ending with /?

If we require dir file to be present then will dvc break when someone will try to dvc add the first list of files?

@efiop
Copy link
Contributor

efiop commented Nov 30, 2019

@Suor The / creation is just for a simplicity of implementation, this is how s3 creates them through web console, for example. dvc add won't break, we don't require / on collection, we just simplify the checkout implementation by creating /, so that if someone would to add dvc add s3://bucket/empty_dir, then we would be able to properly re-create it on checkout.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unable to dvc add more than 1 file at a time in s3 bucket
2 participants