[Uri] Paths with Unicode/UTF-8 incorrectly parsed/reported by System.Uri #1061
Comments
Also, DOS-like and UNC paths are treated correctly on Windows: c:\üri -> file:///c:/üri
@g0dsCookie let's not mix multiple problems in the same issue. I don't think Windows paths are recognized in Uri on Linux. That is IMO by design - @rmkerr can confirm. Regarding your original report: can you please clarify (please edit the top post) what the actual result is on each platform and what is expected?
Actually, paths like "C:\üri" and "\\localhost\üri" are correctly parsed on Linux with .NET Core.
I updated my original report. Hope it's clearer now. It seems like there's a problem when parsing UTF-8/Unicode characters which causes the path to be doubled whenever it contains a UTF-8/Unicode character.
Thanks for the detailed report @g0dsCookie. This looks really interesting. I'm not going to be able to take a look at it immediately, but I think we should try to get this fixed for 3.0.
@tarekgh does this look like something that might be related to libicu?
@wtgodbe I cannot tell without looking :-) Does this issue not repro on Windows?
It doesn't repro on Windows; I get the same results as https://github.com/dotnet/corefx/issues/33557#issuecomment-439075084
On Windows this looks to be by design, as a Uri cannot start with a single '/' character: https://github.com/dotnet/corefx/blob/master/src/System.Private.Uri/src/System/Uri.cs#L3737 On Linux this can be a valid Uri according to https://github.com/dotnet/corefx/blob/master/src/System.Private.Uri/src/System/Uri.cs#L3663 I haven't looked at Linux yet to know why we are returning the result we are seeing here.
I have looked at the issue on Linux; the problem has nothing to do with ICU. Here is what happens: the Uri code detects that it is running on Linux and that the string starts with '/', which means it could be a valid file path, so it stores the original value "/üri" in the internal uri._string. Later, the code calls the method ParseRemaining, which calls EscapeUnescapeIri. EscapeUnescapeIri returns "/üri", and the code then concatenates this value to the original _string, so _string now stores "/üri/üri". After that, the code tries to get the host name. It decides the host name should be the first 4 characters, "/üri", and calls the EscapeString helper method to normalize this name, which returns "/%C3%BCri". That makes the whole Uri "file:///%C3%BCri/üri". Let me know if I can help with anything more.
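To make the sequence described above easier to follow, here is a minimal sketch that mimics the reported steps with plain string operations. It is not the actual Uri.cs code: DoublingSketch, Simulate and EscapePath are hypothetical names, and EscapePath only stands in for the internal EscapeString helper by percent-encoding each path segment with Uri.EscapeDataString.

```csharp
using System;
using System.Linq;

public static class DoublingSketch
{
    // Hypothetical stand-in for the internal escaping step:
    // percent-encodes every segment but keeps the '/' separators.
    static string EscapePath(string path) =>
        string.Join("/", path.Split('/').Select(s => Uri.EscapeDataString(s)));

    // Mimics the steps from the comment above; not the real implementation.
    public static string Simulate(string original)
    {
        // Step 1: the original string is stored as-is.
        string stored = original;

        // Step 2: the IRI pass returns the same text, which is appended
        // to the stored string instead of replacing it.
        string iriResult = original;
        stored += iriResult;

        // Step 3: only the leading part (the length of the original input)
        // is escaped; the appended copy is left untouched.
        string head = stored.Substring(0, original.Length);
        string tail = stored.Substring(original.Length);
        return "file://" + EscapePath(head) + tail;
    }

    public static void Main()
    {
        Console.WriteLine(Simulate("/üri"));
        Console.WriteLine(Simulate("/üri/üri"));
    }
}
```

Under these assumptions, the sketch yields the same strings as the ones reported in this issue for "/üri" and "/üri/üri", which suggests the appended copy is the source of the duplicated, unescaped tail.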
@wtgodbe please check that this is not a regression against 2.x
Repros for me on Windows. As an aside, I must say I'm really surprised that
File URIs are special because they handle OS-specific file path syntax. Ordinary schemes such as "http", etc. will behave consistently across platforms. However, the behavior you see with UriFormatException vs. the repeated word pattern (üri) is something we will investigate, since it seems like a bug.
I understand file path syntax can differ across OSes, but you might still want to handle Linux-style paths on Windows and vice versa, especially as this breaks e.g. serialization/deserialization across platforms.
Just some more data points because I'm running into this right now:

using System;

namespace UriTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // The same path expressed as an implicit file path, as a relative
            // reference against a file:// base, and as pre-escaped absolute URIs.
            var uris = new [] {
                new Uri ("/Source/Test#our codedir/smile😟/Program.cs"),
                new Uri ("/Source/Test#our codedir/smile😟/Program.cs", UriKind.Absolute),
                new Uri (new Uri ("file://"), "/Source/Test#our codedir/smile😟/Program.cs"),
                new Uri (new Uri ("file://localhost"), "Source/Test#our codedir/smile😟/Program.cs"),
                new Uri (new Uri ("file://localhost"), "/Source/Test#our codedir/smile😟/Program.cs"),
                new Uri (new Uri ("file://localhost"), new Uri ("/Source/Test#our codedir/smile😟/Program.cs")),
                new Uri ("file:///Source/Test%23our%20codedir/smile%F0%9F%98%9F/Program.cs"),
                new Uri ("file://localhost/Source/Test%23our%20codedir/smile%F0%9F%98%9F/Program.cs"),
                new Uri ("http://localhost/Source/Test%23our%20codedir/smile%F0%9F%98%9F/Program.cs"),
            };

            // Print every path-related property for each Uri so they can be compared.
            var i = 0;
            foreach (var uri in uris) {
                Console.WriteLine ($"{i++} XXXXXXXXXXXXXXXXXXXXXXXXXXXXX");
                Console.WriteLine ($"   AbsoluteUri: {uri.AbsoluteUri}");
                Console.WriteLine ($"  AbsolutePath: {uri.AbsolutePath}");
                Console.WriteLine ($"     LocalPath: {uri.LocalPath}");
                Console.WriteLine ($"    ToString(): {uri.ToString()}");
                Console.WriteLine ($"OriginalString: {uri.OriginalString}");
            }
        }
    }
}

Gives the output of:
There doesn't appear to be any way to construct a valid file URI that you can actually get the path back out of on Unix.
@MihaZupan can you please take a look at this one?
Paths with Unicode/UTF-8 incorrectly parsed/reported by System.Uri
General
As described by the title, paths with Unicode/UTF-8 characters are incorrectly parsed/reported by System.Uri, resulting in an invalid path. For example, the path "/üri" will result in a Uri like "file:///%C3%BCri/üri" (note the unescaped /üri at the end).
This also happens with other Unicode/UTF-8 characters like £, §, etc. You can replace ü with any other Unicode/UTF-8 character in my example and see the same result, i.e. the path is doubled.
Expected Result
Results collected using mono 5.14 on the same Linux machine.
Actual Result
Using the .NET core version mentioned below.
System Information
Code to reproduce
Above code prints "file:///%C3%BCri/üri" while "file:///%C3%BCri" is expected.
Using "/üri/üri" in the Uri ctor results in a path like "file:///%C3%BCri/%C3%BCri/üri/üri".
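A minimal sketch of a reproduction consistent with the description above, assuming the report boils down to passing a rooted path to the Uri constructor and reading AbsoluteUri. ReproSketch and Describe are hypothetical names used only for illustration. The catch block is there because, as discussed in the comments, the same input is rejected with a UriFormatException on Windows.

```csharp
using System;

public static class ReproSketch
{
    // Returns the AbsoluteUri for an implicit file path, or the exception
    // name when the platform does not accept the input as an absolute Uri.
    public static string Describe(string path)
    {
        try
        {
            return new Uri(path).AbsoluteUri;
        }
        catch (UriFormatException)
        {
            return "UriFormatException";
        }
    }

    public static void Main()
    {
        // On the affected Linux setup the issue reports a doubled path here;
        // the actual output depends on the platform and .NET version.
        Console.WriteLine(Describe("/üri"));
        Console.WriteLine(Describe("/üri/üri"));
    }
}
```

Running this on both Linux and Windows side by side is a quick way to compare the behaviors described in this issue.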