Paths with unicode characters in them? #6
Yes, they do break! When a package is created, entries (or headers, I don't know the proper terminology) look like gibberish. In fact, Here is an example:
And with the help of Emacs I can see:
Where |
@dcoutts, is a PR desirable, or can you fix it yourself? |
@mrkkrp this isn't a new problem right? It's never done unicode. Yes, a comprehensive fix would be welcome, but this isn't easy. It still has to work with arbitrary unix files which are not necessarily unicode. |
@dcoutts, I didn't know it's not supposed to work with Unicode. But well, it's 2016, Unicode is everywhere. And there are a lot of countries that use non-Latin scripts, so once you choose to work with tar archives in Haskell and have to deal with a non-Latin script, you have this problem. Oh, OK. Can you describe why exactly Unicode is so hard? All the tools for Also, where to look if I want to properly fix this? (I now either need to fix it or call external |
Can't we just use |
How do you know the paths are UTF8 encoded, and not something else? |
I don't see any problems here. We're talking about Now if we take just UTF-8, it's designed to be backward compatible with ASCII. This means that
For non-ASCII characters, however, it's not possible to represent them using only one byte per character, so there will be a difference and Unicode paths will be represented by longer I see the following problems, however:
Anyway, this change is a must, because otherwise the use of this library is very limited (as the Haskell community's tool to be used as part of applications used mostly by programmers). Even if you deal with the Latin alphabet only, there are various characters that can appear in paths, like quotes: “” (note, they are different from "", which cannot be in paths on Windows although they are in the ASCII range, whereas “” can, and they are the proper punctuation to use anyway); there are copyright signs ©, and a lot of punctuation that is not in the ASCII range. I can imagine you don't use these things in names of source files, but this doesn't mean other (possibly non-technical) people don't put Unicode in file names, and they may be direct users of some Haskell program that uses this library. |
Hmm, TAR officially doesn't support non-ASCII characters. Too bad, but I think I saw tar archives that contain paths with Unicode in them. Strange, I'll need to read more about the workaround and how it's generally done. |
Anyway, since the tar specification specifies the ASCII range explicitly, and UTF-8 and ASCII are the same in that range, I think the UTF-8 idea should be perfectly OK. |
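The claim that UTF-8 and ASCII coincide for ASCII-only paths can be checked with a small sketch. It assumes the `text` and `bytestring` packages; `encodePath` and `sameAsChar8` are hypothetical helper names used only for illustration and are not part of the tar library.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Encode a path as UTF-8 bytes (hypothetical helper, not tar's API).
encodePath :: FilePath -> BS.ByteString
encodePath = TE.encodeUtf8 . T.pack

-- | True when the UTF-8 bytes equal what Char8.pack produces.
-- Char8.pack emits exactly one byte per Char, while UTF-8 needs two or
-- more bytes for any Char above U+007F, so the two agree only for
-- ASCII-only paths.
sameAsChar8 :: FilePath -> Bool
sameAsChar8 p = encodePath p == BC.pack p
```

For an ASCII-only path such as `"src/Main.hs"` the two encodings are byte-identical, so existing ASCII archives would be unaffected by switching to UTF-8; any path containing a non-ASCII character yields a longer byte string under UTF-8.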
I'm waiting for @dcoutts' opinion. Perhaps I should just use a more modern archive format. It's unbelievable that it doesn't support anything but ASCII, what a flaw… So if it's the specification that's broken, then I suggest we close the issue, because this library implements the specification well. I'll just switch to zip; it will also be more familiar for my non-techy users. Sorry for the prolonged discussion. |
I'm not opposed to following whatever convention other tar impls use when it comes to unicode. But note that it isn't a trivial matter of sticking in a few to/fromUTF8 calls (remember that not all unix files are unicode but all windows/osx ones are). See for example https://docs.python.org/2/library/tarfile.html#tar-unicode I think a good time to tackle this problem is when we add pax support (issue #1). The posix pax standard explicitly supports file name encodings, and utf8 in particular. |
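For context on the pax remark: a pax extended header is a sequence of records of the form `<length> <keyword>=<value>\n`, where the decimal length counts the whole record including its own digits and the trailing newline, and `path` is the keyword for the file name. The following is only a rough sketch of building such a record with a UTF-8 value. `paxRecord` and `paxPathRecord` are hypothetical names, not functions of the tar library, and the sketch ignores everything else a real pax writer needs (the surrounding header block, other keywords, `hdrcharset`).

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Build one pax extended-header record "<len> <key>=<value>\n".
-- The length field counts the entire record, including its own digits,
-- so it is found by iterating until the value stops changing.
paxRecord :: BS.ByteString -> BS.ByteString -> BS.ByteString
paxRecord key value = BS.append (BC.pack (show total)) body
  where
    body  = BS.concat [BC.pack " ", key, BC.pack "=", value, BC.pack "\n"]
    total = fixLength (BS.length body)
    fixLength n =
      let n' = BS.length body + length (show n)
      in if n' == n then n else fixLength n'

-- | A "path" record whose value is the UTF-8 encoding of the file path.
paxPathRecord :: FilePath -> BS.ByteString
paxPathRecord = paxRecord (BC.pack "path") . TE.encodeUtf8 . T.pack
```

Because the length is measured in bytes, a non-ASCII character contributes more than one to the count, which is why the value has to be encoded before the length is computed.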
I'm surprised by the discussion here. There is a very simple solution which is unambiguously the right thing to do: use EDIT: OK, I'll retract this. If you followed my suggestion, then if you used |
Can't we make this case an error instead of silently accepting? Current behaviour causes problems for users: |
I support erroring. The truncation from Char8.pack is basically never right, IMO. |
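To make the truncation concrete: `Data.ByteString.Char8.pack` keeps only the low 8 bits of each `Char`, so for example U+044F (`'я'`) silently becomes the byte `0x4F` (`'O'`). A minimal sketch of the "error instead of silently accepting" idea follows. `packAscii` is a hypothetical name, and rejecting everything above `0x7F` is an assumption based on the ASCII-only reading of the tar spec earlier in this thread, not the library's actual behaviour.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import Data.Char (ord)

-- | Pack a path only if every character is in the ASCII range.
-- Otherwise report the first offending character instead of letting
-- Char8.pack truncate it to its low 8 bits.
-- (Hypothetical helper; the 0x7F limit is an assumption.)
packAscii :: FilePath -> Either String BS.ByteString
packAscii path =
  case filter ((> 0x7F) . ord) path of
    []      -> Right (BC.pack path)
    (c : _) -> Left ("non-ASCII character " ++ show c ++ " in path " ++ show path)
```

A caller that wants UTF-8 support could handle the `Left` case by encoding the path explicitly rather than failing; the point of the sketch is only that the lossy case becomes visible.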
Also, is there an interface for passing |
What's the status of this? The current implementation is breaking filenames. All filepaths should be ByteString (aka RawFilePath). This is a low-level library, if someone wants to add a String or Text interface on top, that's fine. EDIT: afais gnu tar specifies:
But this probably isn't portable to Mac OS and Windows... EDIT2: I think I'll create a |
This is what tar-conduit does: https://github.com/snoyberg/tar-conduit/blob/81283887aaa9771c0f2db53cb4e86700da4c2d9e/src/Data/Conduit/Tar/Types.hs#L151 It encodes and decodes as UTF-8. I'd say that's a pretty good bet. For unpacking, we could provide a version that allows setting the encoding... or we could make use of something like https://hackage.haskell.org/package/charsetdetect-ae |
There are some non-trivial parts there, because although the tar spec demands unix semantics, the library also works on windows (see |
Yes, I'd assume UTF-8 on Windows. |
Would it be possible for I've written something for Stack that works around EDIT: In the interim, I've realised I can convert the FilePath back into a ByteString, and start again:
fromTarPath :: TarPath -> FilePath
fromTarPath = T.unpack . T.decodeUtf8Lenient . BS.Char8.pack . Tar.fromTarPath |
@mpilgrem I recommend against #78 is a way forward. |
@Bodigrim, thanks for the warning. My second attempt below makes use of
fromTarPath :: TarPath -> FilePath
fromTarPath tp = if isUTF8Encoded rawFilePath
  then
    T.unpack $ T.decodeUtf8Lenient $ BS.Char8.pack rawFilePath
  else
    -- A future version of Tar.fromTarPath may itself assume that 'TarPath' is
    -- UTF8 encoded.
    rawFilePath
 where
  rawFilePath = Tar.fromTarPath tp |
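The same idea can be sketched without the tar dependency, using strict decoding as the validity check. This is an illustrative variant, not the code used in Stack: `recoverUtf8` is a hypothetical name, and it assumes the input `String` holds one raw byte per `Char` (which is why repacking with `Char8.pack` recovers the original bytes).

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Reinterpret a String whose Chars are really raw bytes as UTF-8.
-- If any Char is above '\255' the string has evidently been decoded
-- already, and if the bytes are not valid UTF-8 there is nothing safe
-- to do, so in both cases the input is returned unchanged.
recoverUtf8 :: String -> String
recoverUtf8 raw
  | any (> '\255') raw = raw
  | otherwise =
      case TE.decodeUtf8' (BC.pack raw) of
        Right txt -> T.unpack txt
        Left _    -> raw
```

Using `decodeUtf8'` means invalid input falls back to the raw string rather than being filled with replacement characters, and the guard on `'\255'` keeps the function harmless if the upstream function starts returning properly decoded paths.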
PR here: #88 |
Unicode filenames should work now, after aa683b0. I switched |
Happened to see this, don't know if it actually matters or not. But StringTable.construct calls ByteString.Char8.pack, which throws away a lot of information. Paths with unicode characters will probably break?