Paths with unicode characters in them? #6

edsko · 2015-11-19T13:28:24Z

Happened to see this, don't know if it actually matters or not. But StringTable.construct calls ByteString.Char8.pack, which throws away a lot of information. Paths with unicode characters will probably break?

mrkkrp · 2015-12-24T13:01:18Z

Yes, they do break! When a package is created, the entries (or headers, I don't know the proper terminology) look like gibberish. In fact, tar doesn't think it's a proper tar archive at all. I spent about an hour looking for where my app was doing something wrong, but it turns out that the library has a bug.

Here is an example:

~/Downloads $ tar -xvf foo.tar
/usr/bin/tar: This does not look like a tar archive
/usr/bin/tar: Skipping to next header
/usr/bin/tar: Exiting with failure status due to previous errors

And with help of Emacs I can see:

-rw-r--r--       0/0       82492644 01 �>65 E@0=8 :>@>;O!.flac

Where � is an unprintable character. This must be fixed ASAP.
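The gibberish above is consistent with each character being cut down to its lowest 8 bits. A minimal sketch of that effect, assuming only the `bytestring` package (the helper name `truncatedBytes` is made up for illustration and is not part of the tar library):

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word8)

-- | The bytes that Char8.pack produces for a String.
-- Char8.pack keeps only the low 8 bits of every Char, so any
-- code point above U+00FF silently loses information.
truncatedBytes :: String -> [Word8]
truncatedBytes = BS.unpack . BC.pack

-- For the Cyrillic word "Боже" (U+0411 U+043E U+0436 U+0435):
--   truncatedBytes "Боже" == [0x11, 0x3E, 0x36, 0x35]
-- 0x11 is an unprintable control byte, and 0x3E 0x36 0x35 are the
-- ASCII characters ">65", which matches the start of the listing above.
```

Decoding the rest of the listing the same way suggests the original file name was Cyrillic text, though that is an inference from the bytes, not something stated in the report.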

mrkkrp · 2015-12-24T13:11:37Z

@dcoutts, is a PR desirable, or can you fix it yourself?

dcoutts · 2016-01-04T10:04:19Z

@mrkkrp this isn't a new problem right? It's never done unicode.

Yes, a comprehensive fix would be welcome, but this isn't easy. It still has to work with arbitrary unix files which are not necessarily unicode.

mrkkrp · 2016-01-04T10:18:01Z

@dcoutts, I didn't know it's not supposed to work with Unicode. But it's 2016, Unicode is everywhere. And there are a lot of countries that use non-Latin scripts, so once you choose to work with tar archives in Haskell and have to deal with a non-Latin script, you hit this problem.

Oh, OK. Can you describe why exactly Unicode is so hard? All the tools for ByteString encoding/decoding are available, and UTF-8 is the same as ASCII if the text doesn't contain non-ASCII characters.

Also, where should I look if I want to fix this properly? (I now need to either fix it or call an external tar application instead, which is not very pretty.)

mrkkrp · 2016-01-04T14:07:44Z

Can't we just use utf8-string for example and replace some calls to pack/unpack from Data.ByteString.Char8 with calls to fromString / toString from Data.ByteString.UTF8? That should work for file paths without Unicode characters as well as for those with Unicode characters in them. Am I missing something important here?

edsko · 2016-01-04T14:08:27Z

How do you know the paths are UTF8 encoded, and not something else?

mrkkrp · 2016-01-04T16:39:11Z

I don't see any problems here. We're talking about FilePath, which is a synonym for String, a list of Chars. A Char is not a byte, but something that can already represent any Unicode code point.

Now if we take just UTF-8, it's designed to be backward compatible with ASCII. This means that a ByteString representing a UTF-8-encoded string of ASCII characters is the same as a ByteString representing the ASCII string (one byte per character, which is how it currently works, as I understand). So no regression will happen if we switch: for this limited set of characters, things will stay the same.

As Wikipedia puts it:

Backward compatibility: One-byte codes are used only for the ASCII values 0 through 127. In this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes is always 0. This means that ASCII text is valid UTF-8, and UTF-8 can be used for parsers expecting 8-bit extended ASCII even if they are not designed for UTF-8.

For non-ASCII characters, however, it's not possible to use only one byte per character, so there will be a difference and Unicode paths will be represented by longer ByteStrings. I don't see any problem here either: just put that sequence of bytes into the string table and decode it as UTF-8 when extracting.
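The proposal in this comment can be sketched roughly as follows. This is only an illustration using the `text` package rather than utf8-string; the names `encodePathUtf8`, `decodePathUtf8` and `agreesWithChar8` are hypothetical and do not exist in the tar library:

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Encode a path as UTF-8 bytes.
-- Note: T.pack replaces lone surrogate code points with U+FFFD.
encodePathUtf8 :: FilePath -> BS.ByteString
encodePathUtf8 = TE.encodeUtf8 . T.pack

-- | Decode UTF-8 bytes back into a path; Nothing if the bytes are not valid UTF-8.
decodePathUtf8 :: BS.ByteString -> Maybe FilePath
decodePathUtf8 = either (const Nothing) (Just . T.unpack) . TE.decodeUtf8'

-- | True when the UTF-8 encoding is byte-for-byte what Char8.pack produces.
-- This holds for pure ASCII paths and fails as soon as a non-ASCII character appears.
agreesWithChar8 :: FilePath -> Bool
agreesWithChar8 path = encodePathUtf8 path == BC.pack path
```

For an ASCII path such as `"foo/bar.txt"` the two encodings agree, while a single Cyrillic letter like U+0411 becomes the two bytes `0xD0 0x91`. The sketch does not address the question raised later in the thread: bytes read from an existing archive are not guaranteed to be UTF-8 in the first place.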

I see the following problems, however:

1. I don't know if there is any standard for which encoding should be used. I mean, is this OS-dependent? Linux uses UTF-8 everywhere, but Windows does not. How should a tar application know how to interpret file names? I don't know. I guess if we go with UTF-8 it will be much better than truncated characters anyway.
2. As I understand from a quick reading of the source code, file paths are limited in length. If this is a limitation from the tar format specification, then Unicode paths that can be put into a tar archive will be shorter than non-Unicode ones.
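The length concern can be made concrete with a small sketch. It assumes the 100-byte `name` field of the ustar header and ignores the separate 155-byte `prefix` field, so it is a simplification and not how the library splits paths; all names here are hypothetical:

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Size in bytes of the ustar "name" header field (assumption: prefix field ignored).
ustarNameFieldSize :: Int
ustarNameFieldSize = 100

-- | Number of bytes a path occupies once encoded as UTF-8.
utf8ByteLength :: FilePath -> Int
utf8ByteLength = BS.length . TE.encodeUtf8 . T.pack

-- | Whether the UTF-8 encoding of a path fits into the name field alone.
fitsInNameField :: FilePath -> Bool
fitsInNameField path = utf8ByteLength path <= ustarNameFieldSize
```

A name of 100 ASCII characters fits exactly, while 60 Cyrillic characters need 120 bytes in UTF-8 and do not, which is the shrinkage described in the second point.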

Anyway, this change is a must, because otherwise use of this library is very limited (as Haskell community's tool to be used as part of applications used mostly by programmers).

Even if you deal with Latin alphabet only, there are various characters that can be in paths, like quotes: “” (note, they are different from "", which cannot be in paths in Windows, although they are in ASCII range, but “” on the other hand can, and they are proper punctuation to be used anyway), there are copyright signs ©, a lot of punctuation that is not in ASCII range.

I can imagine you don't use these things in names of source files, but this doesn't mean other (possibly non-technical) people don't put Unicode in names of files, and they may be direct users of some Haskell program that uses this library.

mrkkrp · 2016-01-04T16:46:30Z

Hmm, TAR officially doesn't support non-ASCII characters. Too bad, but I think I've seen tar archives that contain paths with Unicode in them. Strange, I'll need to read more about workarounds and how it's generally done.

mrkkrp · 2016-01-04T16:49:50Z

Anyway, since the tar specification specifies the ASCII range explicitly, and UTF-8 and ASCII are the same in that range, I think the UTF-8 idea should be perfectly OK.

mrkkrp · 2016-01-04T16:53:45Z

I'm waiting for @dcoutts's opinion. Perhaps I should just use a more modern archive format. It's unbelievable that it doesn't support anything but ASCII, what a flaw…

So if it's the specification that's broken, then I suggest we close the issue, because this library implements the specification well. I'll just switch to zip, which will also be more familiar to my non-techy users. Sorry for the prolonged discussion.

dcoutts · 2016-01-10T00:41:14Z

I'm not opposed to following whatever convention other tar impls use when it comes to unicode. But note that it isn't a trivial matter of sticking in a few to/fromUTF8 calls (remember that not all unix files are unicode but all windows/osx ones are). See for example https://docs.python.org/2/library/tarfile.html#tar-unicode

I think a good time to tackle this problem is when we add pax support (issue #1). The POSIX pax standard explicitly supports file name encodings, and UTF-8 in particular.

ezyang · 2016-09-02T10:09:50Z

I'm surprised by the discussion here. There is a very simple solution which is unambiguously the right thing to do: use withFilePath from System.Posix.Internals (in base) to encode a FilePath into the OS-specific encoding, and then blast that straight into the tarball. The point is that people expect tar to work like how an invocation of the tar program on the filesystem would work, and the convention is that you just preserve the raw encoding of the data directly.

EDIT: OK, I'll retract this. If you followed my suggestion, then if you used tar on Windows, all of the files would be blasted into the tarball using UTF-16 encoding. Which will totally do the right thing on Windows (Unicode will be supported properly) and also totally miss the point, if you were hoping to pass the tarball on to someone else. Ouch.

23Skidoo · 2016-09-02T12:09:59Z

Can't we make this case an error instead of silently accepting? Current behaviour causes problems for users:

haskell/cabal#3758
commercialhaskell/stack#2557

ezyang · 2016-09-02T19:26:25Z

I support erroring. The truncation from Char8.pack is basically never right, IMO.
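A rough sketch of what erroring instead of truncating could look like. The function `packAsciiPath` is hypothetical and is not the library's API; it only illustrates rejecting input that Char8.pack would otherwise corrupt:

```haskell
import Data.Char (isAscii)
import qualified Data.ByteString.Char8 as BC

-- | Pack a path only if every character is ASCII.
-- For ASCII input Char8.pack is lossless, so the Right case is safe;
-- anything else is reported instead of being silently truncated.
packAsciiPath :: FilePath -> Either String BC.ByteString
packAsciiPath path =
  case filter (not . isAscii) path of
    []  -> Right (BC.pack path)
    bad -> Left ("non-ASCII characters in path " ++ show path ++ ": " ++ show bad)
```

Callers then have to handle the `Left` case explicitly, which surfaces the problem at archive creation time rather than when someone tries to unpack the result.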

ezyang · 2016-09-02T21:41:35Z

Also, is there an interface for passing tar direct ByteString encodings of the desired file paths? This would at least let end users make a decision what encoding they want.

hasufell · 2020-01-18T12:02:09Z

What's the status of this? The current implementation is breaking filenames. All filepaths should be ByteString (aka RawFilePath). This is a low-level library, if someone wants to add a String or Text interface on top, that's fine.

EDIT: afais gnu tar specifies:

The name, linkname, magic, uname, and gname are null-terminated character strings. All other fields are zero-filled octal numbers in ASCII.

But this probably isn't portable for Mac OS and windows...

EDIT2: I think I'll create a tar-bytestring fork that is specifically targeted for POSIX platforms. At least that fixes half of the problem.

EDIT3: https://hackage.haskell.org/package/tar-bytestring

hasufell · 2021-04-10T10:46:43Z

This is what tar-conduit does: https://github.com/snoyberg/tar-conduit/blob/81283887aaa9771c0f2db53cb4e86700da4c2d9e/src/Data/Conduit/Tar/Types.hs#L151

It encodes and decodes as UTF-8. I'd say that's a pretty good bet. For unpacking, we could provide a version that allows setting the encoding... or we make use of something like https://hackage.haskell.org/package/charsetdetect-ae
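One possible shape for an unpacking function with a configurable encoding is sketched below. It is not taken from tar-conduit or from this library; the names are hypothetical, and the Latin-1 fallback is just one assumption a client might choose:

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Decode raw path bytes as UTF-8, handing the bytes to a
-- caller-supplied decoder when they are not valid UTF-8.
decodePathWith :: (BS.ByteString -> FilePath) -> BS.ByteString -> FilePath
decodePathWith fallback raw =
  case TE.decodeUtf8' raw of
    Right txt -> T.unpack txt
    Left _    -> fallback raw

-- | Example instantiation: treat undecodable bytes as Latin-1,
-- which is what Char8.unpack does (one byte per Char).
decodePathLatin1Fallback :: BS.ByteString -> FilePath
decodePathLatin1Fallback = decodePathWith BC.unpack
```

With this sketch the bytes `0xD0 0x91` decode to U+0411, while a lone `0xE9` byte, which is invalid UTF-8, falls back to U+00E9. A charset detector could be plugged in as the fallback instead.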

Bodigrim · 2023-11-18T12:10:31Z

I pushed 423e6af, prohibiting non-ASCII file names. At the very least, we should not silently corrupt Unicode data. A strategic solution would be to migrate to PosixPath and leave encoding questions to clients.

hasufell · 2023-11-18T12:25:47Z

> A strategic solution would be to migrate to PosixPath and leave encoding questions to clients.

There are some non-trivial parts there, because although the tar spec demands unix semantics, the library also works on windows (see toTarPath). Since we use the FilePath representation currently, we don't have to convert the filenames between the platforms (just the separators are changed). With OsPath, it seems we would need a way to convert between PosixPath and WindowsPath. So we kinda have to assume utf8 here too at least on windows?

Bodigrim · 2023-11-18T12:27:37Z

Yes, I'd assume UTF-8 on Windows.

mpilgrem · 2023-12-10T20:46:16Z

Would it be possible for Codec.Archive.Tar.Entry to export the data constructor of TarPath?

I've written something for Stack that works around fromTarPath using BS.Char8.unpack (Stack needs that to be (T.unpack . T.decodeUtf8Lenient)), but the code needs access to the data constructor.

EDIT: In the interim, I've realised I can convert the FilePath back into a ByteString, and start again:

fromTarPath :: TarPath -> FilePath
fromTarPath = T.unpack . T.decodeUtf8Lenient . BS.Char8.pack . Tar.fromTarPath

Bodigrim · 2023-12-10T21:23:24Z

@mpilgrem I recommend against T.unpack . T.decodeUtf8Lenient . BS.Char8.pack . Tar.fromTarPath: if tar ever learns to support Unicode so that Tar.fromTarPath returns a Unicode-enabled String, then BS.Char8.pack could turn a seemingly innocent path without any dots or slashes into something like ../../Windows/System32/Kernel.dll and corrupt your system files.
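The hazard can be reproduced in isolation. In this sketch the input string is made up for illustration: it uses U+012E and U+012F, whose low bytes are `0x2E` ('.') and `0x2F` ('/'), so it contains no real dots or slashes until Char8.pack truncates it:

```haskell
import qualified Data.ByteString.Char8 as BC

-- | A String with no '.' or '/' in it: U+012E U+012E U+012F, twice, then "etc".
innocentLooking :: String
innocentLooking = "\x12E\x12E\x12F\x12E\x12E\x12F" ++ "etc"

-- | What a round trip through Char8 does to a String:
-- every Char is reduced to its low 8 bits.
char8RoundTrip :: String -> String
char8RoundTrip = BC.unpack . BC.pack

-- char8RoundTrip innocentLooking == "../../etc"
```

A path check that runs before the truncation would see nothing suspicious, which is why re-packing an already decoded String is risky.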

#78 is a way forward.

mpilgrem · 2023-12-10T23:43:08Z

@Bodigrim, thanks for the warning. My second attempt below makes use of isUTF8Encoded from the utf8-string package:

fromTarPath :: TarPath -> FilePath
fromTarPath tp = if isUTF8Encoded rawFilePath
  then
    T.unpack $ T.decodeUtf8Lenient $ BS.Char8.pack rawFilePath
  else
    -- A future version of Tar.fromTarPath may itself assume that 'TarPath' is
    -- UTF8 encoded.
    rawFilePath
 where
  rawFilePath = Tar.fromTarPath tp

hasufell · 2023-12-11T15:48:29Z

PR here: #88

Bodigrim · 2023-12-22T02:14:26Z

Unicode filenames should work now, after aa683b0. I switched TarPath to PosixString; since it's not exposed, this is not a breaking change.

ezyang mentioned this issue Sep 2, 2016

Lack of support for non-ASCII characters in cabal sdist haskell/cabal#3758

Open

phadej mentioned this issue Apr 26, 2020

cabal init incorrectly writes non-ASCII file names in library:exposed-modules haskell/cabal#6507

Open

individual-it mentioned this issue Nov 19, 2021

tar archive encodes file-names wrongly cs3org/reva#2255

Closed

gbaz mentioned this issue Jul 28, 2022

cabal sdist corrupted when using Unicode haskell/cabal#2558

Open

Bodigrim added a commit that referenced this issue Nov 18, 2023

Check that filenames are ASCII instead of silent corruption (see #6)

5240d39

Bodigrim added a commit that referenced this issue Nov 18, 2023

Check that filenames are ASCII instead of silent corruption (see #6)

f64cff1

Bodigrim added a commit that referenced this issue Nov 18, 2023

Check that filenames are ASCII instead of silent corruption (see #6)

423e6af

hasufell mentioned this issue Nov 18, 2023

convert to PosixPath #78

Open

This was referenced Dec 10, 2023

Fix #9507 Describe accurately acceptable package names haskell/cabal#9508

Open

stack sdist fails, given a Unicode package name commercialhaskell/stack#6372

Closed

Bodigrim closed this as completed Dec 22, 2023
