Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MacOs and Windows lowercases filenames #3318

Closed
pared opened this issue Feb 13, 2020 · 22 comments
Closed

MacOs and Windows lowercases filenames #3318

pared opened this issue Feb 13, 2020 · 22 comments
Assignees
Labels
p2-medium Medium priority, should be done, but less important research

Comments

@pared
Copy link
Contributor

pared commented Feb 13, 2020

Context: https://discordapp.com/channels/485586884165107732/485596304961962003/677178126916124702
The issue:
User created data dir which contained files of the same name, some were uppercase, some were lowercase.
Other user cloned the repository on MacOs and could not get images containing uppercase letters,
when those names ovelapped with other images having lowercase (the same) name.
Pulling on linux was fine.

Reproduction script:

#!/bin/bash

rm -rf repo storage git_repo erepo
mkdir erepo git_repo storage

MAIN=$(pwd)

pushd git_repo
git init --bare --quiet
popd
set -ex
pushd erepo
git init --quiet
dvc init -q

mkdir data_dir
echo data1>>data_dir/file
echo data2>>data_dir/File

dvc add data_dir
dvc remote add -d str $MAIN/storage
git add -A
git commit -am "initial"
git remote add origin $MAIN/git_repo

dvc push
git push origin master
popd

git clone git_repo repo

pushd repo
dvc pull

after pull, we should have file and File in target repo, which seems to not happen on MacOs.

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Feb 13, 2020
@pared pared added the bug Did we break something? label Feb 13, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Feb 13, 2020
@pared pared added p0-critical Critical issue. Needs to be fixed ASAP. triage Needs to be triaged labels Feb 13, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Feb 13, 2020
@pared pared changed the title MacOs lowercases files names MacOs lowercases filenames Feb 13, 2020
@pared
Copy link
Contributor Author

pared commented Feb 13, 2020

So, turns out that by default, Windows and MacOs do not support case sensitive file names.
Some resources:
https://superuser.com/questions/165975/are-all-versions-of-windows-case-insensitive
https://apple.stackexchange.com/questions/22297/is-bash-in-osx-case-insensitive

It seems that this filesystem setting, so one should be able to configure case sensitive support.
It doesn't seem to me that we can do something about it at dvc level.

@pared pared removed bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP. labels Feb 13, 2020
@pared pared changed the title MacOs lowercases filenames MacOs and Windows lowercases filenames Feb 13, 2020
@pared
Copy link
Contributor Author

pared commented Feb 13, 2020

We could probably warn the user about this behavior, though the detection of such use cases could be resource consuming.

@efiop efiop added p0-critical Critical issue. Needs to be fixed ASAP. research labels Feb 13, 2020
@pared
Copy link
Contributor Author

pared commented Feb 13, 2020

When such a situation might happen:

  1. pull repo created on Linux
  2. I could imagine some data processing step that would create processed data and name it automatically, so verification only on data sync operations is not enough.

@pared
Copy link
Contributor Author

pared commented Feb 13, 2020

Some research shows that we are not the only one affected by this.
Git users on mac might also be hit with this:
https://stackoverflow.com/questions/8904327/case-sensitivity-in-git

@efiop
Copy link
Contributor

efiop commented Feb 13, 2020

if there is no good way to solve it we should at least document it. Probably a separate article about Ensuring cross-platform compatibility would be nice, which then could also be used linked in the troubleshooting guide.

But need to look into the git first, @pared the link that you've shared seems to note that it now forks fine in git. Might be missing something.

@pared
Copy link
Contributor Author

pared commented Feb 14, 2020

Yes, it seems so, though situation described there is about moving the names, and not coexisting same (from case insensitive perspective) names. I will check how git behaves when we check out repository with "same" filenames.

@pared
Copy link
Contributor Author

pared commented Feb 14, 2020

Ok, so after discussion with @efiop I write down some summary of what happend to better pinpoint the problem.

  1. The user created a data repository on Linux (case sensitive) system.
    Simple tree displaying data dir could be shown as:
dir
├── file
└── FILE
  1. Other user tried to clone repo and pull the dir. What he received on output, was
    Could not create '{md5}.dir.unpacked' (originating from here)
    Besides that, pull was successful.

The problem was that the number of data files did not match (for MacOs with default setting file and FILE is the same, so checkout was unintentionally overriding one with another).
Why did unpacked dir creation fail? Our linking method fails if link exists, and from point of view of filesystem it did (check whether FILE exists was successfull for file that was created earlier). But unpacked creation failure does not cripple dvc critically so we just log a warning.

We could improve the detection of such cases if we would include filenames in the Exception that is thrown upon linking. In this case, the warning could hint us but was too generic, we should probably prepare our own error including both filenames, and print it upon unpacked creation failure, even if its only warning.

@efiop
Copy link
Contributor

efiop commented Feb 14, 2020

Btw, in case of dvc add file + dvc add FILE + dvc checkout, the issue won't be reproduced, because our checkout logic will see that file exists and will remove it before checking out FILE. Just a side note for myself.

@pared
Copy link
Contributor Author

pared commented Feb 17, 2020

@efiop, but in case of a case-insensitive system, you cannot create file and FILE.

@efiop
Copy link
Contributor

efiop commented Feb 17, 2020

@pared Did you close it intentionally?

@pared pared reopened this Feb 18, 2020
@pared
Copy link
Contributor Author

pared commented Feb 18, 2020

I don't recall closing this, don't know what happend, sorry.

@pared
Copy link
Contributor Author

pared commented Feb 19, 2020

Ok, so it seems that git is handling this problem by warning the user about collision.
Here is sample output:

Cloning into 'test_repo'...
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 4 (delta 0), reused 4 (delta 0), pack-reused 0
Unpacking objects: 100% (4/4), done.
warning: the following paths have collided (e.g. case-sensitive paths
on a case-insensitive filesystem) and only one from the same
colliding group is in the working tree:

  'FILE'
  'file'

I guess it would be reasonable for us to do the same, though we should think on how to do that to not slow down any checkout because simply checking all filenames on checkout is O(n^2)

@efiop
Copy link
Contributor

efiop commented Feb 20, 2020

@pared Amazing research! Thank you so much! 🙏 As to a reasonable solution, I suppose that git detects that collision while performing the checkout, right? From the wording, it seems like the ordering is completely random, so git doesn't care about it too much after all. We could do the same thing in dvc, for example, by playing around that defensive if self.exists() in _do_link, that has tipped us about this issue in the first place. Though I wonder if os.listdir returns case-sensitive filenames 🙂 , if it does then the implementation might be even simpler.

@pared
Copy link
Contributor Author

pared commented Feb 20, 2020

@efiop

Though I wonder if os.listdir returns case-sensitive filenames, if it does then the implementation might be even simpler.

I suppose it depends on the system. As of current state of research, default settings of Windows and MacOs make them "Case insensitive, but case aware", so
os.listdir will return "cased" paths but will treat FILE as if it was file.
I believe that implementing this logic in _do_link is the best way to go now because we do the check there anyway, so improving the message sounds like a zero-impact solution. What worries me here is that in the original issue, I did not see the message Link '{}' already exists. I will investigate why it happened.

@pared pared self-assigned this Feb 20, 2020
@pared
Copy link
Contributor Author

pared commented Feb 26, 2020

Some further research:
this test:

def test_file(tmp_dir, dvc):
    (tmp_dir / "data").mkdir()
    (tmp_dir / "data"/ "DATA").write_text("asdfsdf")

    DATA = os.path.join("data", "DATA")
    data = os.path.join("data", "data")

    from dvc.path_info import PathInfo
    assert dvc.cache.local.exists(PathInfo(DATA))
    assert not dvc.cache.local.exists(PathInfo(data))

passes on my Windows computer.
That means that even though filesystem considers data\data and data\DATA the same filename when writing,
os.path.lexists does not.

[EDIT]
It gets better:
this test does not pass not only for MacOs on travis, but also Windows. I guess that system configuration plays a part here.

@pared pared mentioned this issue Feb 26, 2020
3 tasks
@pared
Copy link
Contributor Author

pared commented Mar 3, 2020

So, it turns out that git is checking if files collide by accessing the inode of given file. Collision is detected, when the inode has been already taken by another file.

EDIT:
Here is where the check happens:

EDIT 2:
It seems to me that I was wrong, and deciphered logic wrongly, actually what git does, here: when checking out some cache entry, git is comparing its name with all other entries. The comparison is using _stricmp which is case insensitive.

EDIT 3:
In case of macOs inode check actually takes place

@pared
Copy link
Contributor Author

pared commented Mar 6, 2020

I think we could proceed with this one by implementing our own exists basing on access to inode from System. In the case of linux systems, it still boils down to calling lstat (that's what os.path.lexists does). In the case of NTFS we would change our approach to relying on getdirinfo. I need to still confirm whether this approach will be a viable solution for MacOs.

EDIT:
My latest build from #3411 seem to confirm my supposition.

My proposed solution on how to improve detection of collision when linking:
change the implementation of remote/local.exists to call dvc.system.System.inode instead of os.path.lexists. In the case of FileNotFoundError we can return False. In every other case, we return True.

@efiop Do you think I missed something?

@efiop
Copy link
Contributor

efiop commented Mar 9, 2020

@efiop Do you think I missed something?

Dunno, you are the one researching 🙂

In the case of linux systems, it still boils down to calling lstat (that's what os.path.lexists does).

So current exists calls lstat and now we need to check inode instead, which will need to call lstat. I guess I'm confused what's changing there. Or do you mean that it works fine on *nix but on windows cpython's lstat implementation is not sufficient?

@pared
Copy link
Contributor Author

pared commented Mar 13, 2020

One more summary to grasp current state of this issue:

Glossary:

case sensitive system: system that is able to:

  • Create and distinguish between files with same lowercase path, eg file and FILE
    eg: Ubuntu 18.04

case insensitive system:

  • A system that cannot handle task of creating those files as separate entities
    For example on my Windows with following dvc version output:
DVC version: 0.87.0+f9bebd.mod
Python version: 3.7.1
Platform: Windows-10-10.0.18362-SP0
Binary: False
Package: None
Filesystem type (workspace): ('NTFS', 'C:\\')

In some cases, macOS is also considered case insensitive, because usually its filesystem is configured to be case insensitive.
Worth noting that those systems usually allow you to create FILE, because they are case preserving, but they will override existing file/FiLe or any other combination, because, from the system point of view, they have the same name.

Problem

As of today (dvc version 0.87.0), we are unable to handle properly checkout of following data structure:

data
├── file
└── FILE

This situation happens when we create our data folder on case sensitive system and try to work with the project on case insensitive system.

  • What will happen upon checkout in such situation?
    dvc will log warning that it was unable to create unpacked directory. This will happen because Windows will be unable to create hardlink for one of the files because from system point of view it will already exist.
    As it's not an error, and merely warning, dvc will still checkout data dir but it will contain only one of (file, FILE).

What do we want?

Make the user aware that the situation with overlapping paths occurred.

But wait, we already check in _do_link whether path already exists, why there was no error when user was checking out? Because lexists that is used to check whether path exists is case sensitive

Potential solution

First potential solution (credit to @efiop) would be to detect such overlapping the same way as git does and warn the user about the problem.
Git is doing it by comparing checked out entries with all other existing entries. Comparison, in case of windows is done by case insensitive method. This check happens only on clone. In case of MacOs, the git will obtain stats for given path and make check with inode.

I wrote simple test checking whether same lowercase names will have same inode. It seems to be working on MacOs and Windows on Travis, but, for example on my windows computer this test does not pass (createFileW cannot find the target file). So trying to get inode cannot be relied on in all Windows systems.

It seems that what might work here would be to do as git does: check inode in case of macOs and on windows, when linking, we could check directory content and compare lowercase filenames to see if collision occurs.

The problem with such an approach is that if we will do it in _do_link we might end up slowing checkout execution time for windows.

@dmpetrov
Copy link
Member

@pared thank you for such a deep investigation!

This is an important problem and it would be nice if we can fix it but I’d like to understand the priority of the issue.

When we discuss pros/cons of this solution we should keep in mind another option - keep this responsibility on users shoulders. In many cases, this problem has an easy workaround for users (rename some files). The same happens in many multi-platform programming frameworks (even plain C) - you need to follow some conventions to make a framework work in different platforms (and filename is on of the conventions).

For now, the priority does not look like p0. Can we make it p1 or p2?
Please let me know if I missed something about the priority.

@efiop
Copy link
Contributor

efiop commented Mar 16, 2020

@dmpetrov p0 was the research, as we really needed to fully understand what is going on. The solution is lower in priority.

@efiop efiop added p2-medium Medium priority, should be done, but less important and removed p0-critical Critical issue. Needs to be fixed ASAP. labels Mar 16, 2020
@efiop
Copy link
Contributor

efiop commented May 3, 2021

Closing as stale.

@efiop efiop closed this as completed May 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p2-medium Medium priority, should be done, but less important research
Projects
None yet
Development

No branches or pull requests

3 participants