Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sideband discussion about replacing EvalSymlinks #4

Open
ericwj opened this issue Oct 21, 2020 · 9 comments
Open

Sideband discussion about replacing EvalSymlinks #4

ericwj opened this issue Oct 21, 2020 · 9 comments

Comments

@ericwj
Copy link
Owner

ericwj commented Oct 21, 2020

@rasky I don't know whether fixLongPath fixes many issues. The prefixes aren't specific to long paths. I don't have a complete oversight readily available for the issues with the API's, but some of those I see with Join are very common accross path/filepath.

There is a whole series of issues evidencing that there are string manipulations that are just wrong. Often the first \ is stripped off the prefix, for example. The \\.\ prefix isn't recognized at all and there are lots of cases where \\?\ is being ignored. After all, \\?\ means do not parse i.e. leave unmodified. I believe Clean is a major source of issues and that function is used in several other API's.

Fixing any part of EvalSymlinks is not an advantage imho. Better leave it broken and deprecate it.

Then also don't introduce a drop-in replacement that will suffer many of the same issues. There are much better ways of solving the problems at hand. I didn't mention any use of GetFinalPathNameByHandle will have to iterate flags until there is a result that is not 'Path not found' or another error and then still the developer might have to hint at how to call that API to get the correct result. On top of that, it is bad performance wise to have string arguments and do the retries for perhaps thousands of files in succession.

@rasky
Copy link

rasky commented Oct 21, 2020

I don't know whether fixLongPath fixes many issues.

It doesn't for path manipulations but it does as soon as those paths hit a syscall. If you are referring to pure string manipulation, then no it doesn't. Feel free to open issues about them and tag me (please just
don't coalesce too many bugs in a single GitHub issue -- one issue per filepath function is probably better).

I believe Clean is a major source of issues and that function is used in several other API's.

Thanks for these pointers, I'll have a look at Join and Clean.

Fixing any part of EvalSymlinks is not an advantage imho. Better leave it broken and deprecate it.

I know your opinion, I've read your very detailed issue about it. I'm disagreeing with it, but I hope we can still have a constructive discussion about it.

Some of the bugs you reported are fixed with my CL, as they relate to long paths. Others are fixed by another CL I have pending (EvalSymlinks is not calling fixLongPath before FindFirstFile).

My position you might disagree with is that these are just bugs and as such I think they should be fixed. We might still end up in a situation where EvalSymlinks does not have any good usage on Windows -- I'm not ruling that out, but I can't make myself convinced until we finish fixing the bugs.

Then also don't introduce a drop-in replacement that will suffer many of the same issues. There are much better ways of solving the problems at hand. I didn't mention any use of GetFinalPathNameByHandle will have to iterate flags until there is a result that is not 'Path not found' or another error and then still the developer might have to hint at how to call that API to get the correct result. On top of that, it is bad performance wise to have string arguments and do the retries for perhaps thousands of files in succession.

Well if you suggest that we deprecate EvalSymlinks and don't add anything else, then we're giving no options to Go programmers. It might be that you think that the world is a better place without any path canonicalization function -- I sympathize with that, but it contradicts what every single other programming language and framework in the world is doing. We can't have the perfect being enemy of the good -- we have to be pragmatic and realize that people leave in a world where they have to interoperable with other software that does indeed path canonicalization. That's a fundamental point where I don't think we can easily agree.

So, I would really love to hear your technical opinion on how you would go about solving the issue at hand: either proving a path resolution function, or providing a different set of functions to solve the same problems (like my proposal of filepath.SameFile), without pushing every request back.

@ericwj
Copy link
Owner Author

ericwj commented Oct 21, 2020

My position you might disagree with is that these are just bugs and as such I think they should be fixed.

In fact I agree. path/filepath should be fixed - just not EvalSymlinks. You can find in this repo what you need to test every function in path/filepath and get JSON and HTML reports for failures. It was once supposed to become part of the CI pipeline of the go project, were it not that they just don't have the right Windows VM's or people with appropriate Windows skills or interests.

Well if you suggest that we deprecate EvalSymlinks and don't add anything else

Sure there needs to be a replacement! It just shouldn't be making the world look like the stage for child play. Canonicalization can't be done on Windows. It will break in various simple situations that I have tested already and those are hardly all cases that need to be considered. And it doesn't definitively tell you whether one path is inside another, so it doesn't solve the author's problem. Path canonicalization is a wrong concept if it is to be ported to Windows. The perfect here is resolving all links! Don't do that and canonicalization is for about all apps good enough - you can do it with Rel.

There obviously needs to be functions that translate various kinds of indirections, such as drive mappings, volume access paths, junctions and reparse points, but those functions should adhere to all the objections against EvalSymlinks that I documented. On Unix too. Meaning the application will have to be explicit about which links it resolves. Meaning there should be a way to determine where there are links in a path, after which the application may decide or not to resolve that link. I understand there are historic reasons why *nix people want the Jip and Janneke version string Foo(string) but well now you know why that is bad.

@ericwj
Copy link
Owner Author

ericwj commented Oct 21, 2020

Your proposal for SameFile is the right direction imho. I mentioned that it could be made perfect using object identifiers on Windows and inode I believe it is called on *nix.

Perhaps it will become slightly more complex than just that single API though - there needs to be a way to represent identifiers, store them, compare them and ask for various kinds of things like whether this path is inside that directory with some identifier.

@rasky
Copy link

rasky commented Oct 21, 2020

Well, there's already os.SameFile which takes two os.FileInfo and is similar to what you are describing. There are reasons I think an API with a pathname is better -- but I haven't had time to write them down yet, so to hear your comments as well

@ericwj
Copy link
Owner Author

ericwj commented Oct 21, 2020

os.SameFile is documented to may be based on path names and does not allow reasoning except direct comparison. It probably also uses EvalSymlinks, seeing the documented difference between Lstat and Stat.

I can totally imagine Git LFS to have to keep track of some context to do these file system checks. That context may be obtained from a path string and the second path may be a string. Keeping a context may improve performance by at least a factor two, before considering types of caching that it may keep track of, which further improves performance. Those are the fundamentals proper software should use. A more convenient, slow API using strings only is a no brainer once that is in place.

@rasky
Copy link

rasky commented Oct 21, 2020

SameFile uses GetFileInformationByHandle and then compare VolumeSerialNumber and FileIndexHigh/Low.

https://github.com/golang/go/blob/612a363bef9ae29d190f6daa2a5a1623f78c874b/src/os/types_windows.go#L216
https://github.com/golang/go/blob/612a363bef9ae29d190f6daa2a5a1623f78c874b/src/os/types_windows.go#L152

It doesn't call fixLongPath though... so it won't work on most peculiar file paths, but I can fix that right away and add tests to make sure it works for any kind of Windows path.

We then could have filepath.SameFile(path1, path2 string) which is a thin wrapper over the same underlying code, but using pathnames instead, so that users don't have go through Stat explicitly, and also as a way to help users that are looking for a canonicalization function in filepath.

Then we need something like filepath.InTree(root, path string) that tells whether path is part of the tree starting at root. Do you have suggestion on the proper implementation on Windows? Because I would use GetFinalPathNameByHandle to do that, and then compare extended length format paths. Any better solution?

@ericwj
Copy link
Owner Author

ericwj commented Oct 22, 2020

BY_HANDLE_FILE_INFORMATION (fileapi.h) - Win32 apps | Microsoft Docs

nFileIndexHigh

The high-order part of a unique identifier that is associated with a file. For more information, see nFileIndexLow.

nFileIndexLow

The low-order part of a unique identifier that is associated with a file.

The identifier (low and high parts) and the volume serial number uniquely identify a file on a single computer. To determine whether two open handles represent the same file, combine the identifier and the volume serial number for each file and compare them.

The ReFS file system, introduced with Windows Server 2012, includes 128-bit file identifiers. To retrieve the 128-bit file identifier use the GetFileInformationByHandleEx function with FileIdInfo to retrieve the FILE_ID_INFO structure. The 64-bit identifier in this structure is not guaranteed to be unique on ReFS.

So the comparison needs to include a unique identifier for the server coughing up the information - which means resolving UNC paths and perhaps even canonicalizing the server name - and to be better safe than sorry, GetFileInformationByHandleEx needs to be used instead of GetFileInformationByHandle on operating system versions where it is available.

I don't immediately see whether this works for directories, although there is a comment to that effect - won't hurt to test that. The docs stick to calling the operand 'file' everywhere.

I really like that with this links can be left unresolved. If indeed links aren't resolved by the Stat call. If I understand correctly it won't? Through the use of the FILE_FLAG_OPEN_REPARSE_POINT flag?

However every comparison with just string arguments needs to call this twice, so for high-volume calling, the fundamentals are to use a variant that takes just one string. Worse for the InTree version, which I would implement to iterate parent directories until a match is found - cache those and it simplifies to one call. Like I said, the variant with two strings for infrequent use is then a no-brainer.

@rasky
Copy link

rasky commented Oct 22, 2020

BY_HANDLE_FILE_INFORMATION (fileapi.h) - Win32 apps | Microsoft Docs

nFileIndexHigh
The high-order part of a unique identifier that is associated with a file. For more information, see nFileIndexLow.
nFileIndexLow
The low-order part of a unique identifier that is associated with a file.
The identifier (low and high parts) and the volume serial number uniquely identify a file on a single computer. To determine whether two open handles represent the same file, combine the identifier and the volume serial number for each file and compare them.
The ReFS file system, introduced with Windows Server 2012, includes 128-bit file identifiers. To retrieve the 128-bit file identifier use the GetFileInformationByHandleEx function with FileIdInfo to retrieve the FILE_ID_INFO structure. The 64-bit identifier in this structure is not guaranteed to be unique on ReFS.

It's true that you could cook up a scenario where two disks have the same volume serial number and the combine a clash in the 64-bit identifier, but I think that situation is rare enough that I won't be happy to complicate the code more than that. If anything, one might decide to document this.

I really like that with this links can be left unresolved. If indeed links aren't resolved by the Stat call. If I understand correctly it won't? Through the use of the FILE_FLAG_OPEN_REPARSE_POINT flag?

No we actually do need to solve the links. We need a function that tells use that M:\FOO and \SERVER\SHARE\FOO is the same file if \SERVER\SHARE is mounted as M. That is the whole point of our discussion, isn't it? We're saying that we are not going to provide a canonicalized path because the concept itself is moot, but we need a way to assess that two files are indeed the same.

Worse for the InTree version, which I would implement to iterate parent directories until a match is found - cache those and it simplifies to one call. Like I said, the variant with two strings for infrequent use is then a no-brainer.

I was thinking of calling GetFinalPathNameByHandle with VOLUME_NAME_NT or VOLUME_NAME_GUID and compare the paths. Any case where this wouldn't work?

@ericwj
Copy link
Owner Author

ericwj commented Oct 23, 2020

but I think that situation is rare enough
No we actually do need to solve the links.

It is probably OK to be optimistic about this most of the time, but then why need to resolve links? If M:\ is \\SERVER\SHARE then the comparison will succeed, won't it? I tested that for this situation and it works. Then there you go, you have checked that two paths are pointing to the same thing and you don't need to resolve links. Plus even if links are resolved, SMB is the only exception where additional comparison would be needed, so only SMB mappings need to be resolved - however any link can be a link to a UNC location.

We need a function that tells use that M:\FOO and \SERVER\SHARE\FOO is the same file

I think that is the set of functions that will tell you M:\ is a link, perhaps even tells you it is an SMB mapping, plus the function which you can give M:\ but not M:\FOO and which yields \\SERVER\SHARE is the target. If M:\FOO is a link itself, then the first function yields M:\FOO and the second functions yields whatever it points to, in which case M:\ is never resolved unless the app specifically computes the parent of M:\FOO and provides that to these functions which equates to the first scenario.

I was thinking of calling GetFinalPathNameByHandle with VOLUME_NAME_NT or VOLUME_NAME_GUID and compare the paths.

I think it is pointless to do that and asking for bugs and edge cases. Why have SameFile with these identifiers and then walk into the same trap of comparing paths for parent directories? Going this route will be complex because you cannot know beforehand which flag will yield a result in the first place, it doesn't work for files accessed both locally and over an SMB share and probably other scenario's which I didn't test. E.g. I have tested this:

Path Flag Result
C:\Users\Eric\Source\Source.ico VOLUME_NAME_GUID \\?\Volume{607932d3-78af-41c0-9786-dd0177e78a39}\Source\Source.ico
N:\Source.ico VOLUME_NAME_GUID The system cannot find the path specified. (0x80070003)
C:\Users\Eric\Source\Source.ico VOLUME_NAME_NT \Device\HarddiskVolume15\Source\Source.ico
N:\Source.ico VOLUME_NAME_NT \Device\Mup\Z68\UncShare\Source.ico

Each of the paths I tested in each situation with GetFileInformationByHandleEx yields ID 26000000000001000000000000000000 and volume 900c88c20c88a4b6.

The situations where this might still not work is perhaps DFS or CSV or similar technologies, where multiple servers are able to cough up the same file from a shared location. You might get different ID's for the same file. That needs to be tested. And I don't think Google can do that during CI today.

By the way, these ID's are almost certainly not persistable. Apps that would need to do these operations across reboots would have to obtain NTFS object identifiers, which are persistable, but aren't available for FAT type of volumes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants