-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
s3: ignore empty directories while walking files #2683
Conversation
Had an offline discussion with @efiop where we argued if DVC should track empty directories or not.
What do you think, @Suor , @efiop , sholud we track empty dirs? |
@MrOutis It is not an empty dir, it is a special object that is used that way. I think we have to support it, as someone might have those outputs as empty dirs and have logic built on the assumption that that prefix already exists. |
@efiop, what kind of logic? |
@efiop , depending on existing directories breaks reproducibility (preferred way: $ dvc init --no-scm
$ mkdir data
$ echo 'foo' > data/foo
$ dvc add data
$ echo 'bar' > data/bar
$ dvc add data/bar
ERROR: Paths for outs:
'data'('data.dvc')
'data/bar'('data/bar.dvc')
overlap. To avoid unpredictable behaviour, rerun command with non overlapping outs paths.
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help! |
@MrOutis The logic of something assuming that the prefix exists. For example your first stage is creating that directory that way by doing
This patch still doesn't handle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is not fixed yet.
@efiop , but why would the next stage expect that such prefix exist? |
@MrOutis I have no idea, something created that empty dir in users example, so something is using it and I won't be surprised if something else would be expecting it to exist. I would think about it as creating an empty dir in local fs, but just adapted to s3 caveats. |
I believe the original issue - "an empty dir" created by user making a mess - is fixed. The supposition that some later user code will rely on that "empty dir" to exist is a separate thing. I am not sure it's actually a thing, the issue might, however, come up if a user is using "a dir file" as an actual file with contents. And this can't be fixed without giving up on the dir notion for S3 and using a prefix notion instead. |
@Suor This PR is hiding the issue and we still don't support "empty dirs" here. We should solve it properly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there are 2 parts to what we've discussed here in terms of the implementation.
-
collection
Same as with local directories,walk_files
should ignore empty directories. But the top level directory shouldn't be ignored, and that can be implemented by makingisdir
returnTrue
on empty directories. That wayget_checksum
will also be able to properly callget_dir_checksum
instead of callingget_file_checksum
on an "empty_dir" object. -
checkout
We shouldn't create embedded empty dirs, hence why we didn't collect them in 1. But we should create a top level empty dir. For the sake of implementation simplicity, we could simply implementmakedirs
for s3 to create empty_dir objects.
For the record, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! 🎉
Sorry guys to revive this, but I still don't get the logic behind this change. If we state that an s3 dir is a collection of files with common prefix ending with
Or
Then why do we actively creating dir files - the ones ending with If we require dir file to be present then will dvc break when someone will try to |
@Suor The |
Fix #2678
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html