bytestring parser #65

Closed
clinty opened this issue Jan 5, 2014 · 22 comments
Comments

@clinty

clinty commented Jan 5, 2014

I'd like to parse bytestrings from System.Posix.Env.ByteString.getArgs instead of converting from String.

@pcapriotti
Owner

How would this work, exactly? The arguments would still need to be decoded and converted into a text data type (String or possibly Text). Since the API is all String-based, it doesn't seem like there is any way to improve on the total number of conversions.

@clinty
Author

clinty commented Aug 4, 2014

I want to be able to handle arguments that aren't valid Unicode for things that don't require valid Unicode. The String conversion interferes with this.

@pcapriotti
Owner

It doesn't have to be Unicode, but it must be some encoding. After all, optparse-applicative needs to split on spaces, recognise certain special characters, etc, so it can't work on pure ByteStrings.

@pcapriotti
Owner

I'm closing this for now, as I'm not sure what you're asking exactly.

@xaverdh

xaverdh commented Dec 14, 2019

Actually I sort of have a use case for this. The issue is that a cli program may get passed a raw string (as in ByteString), say via an option, which it then has to pass unmodified to another process.
Currently this does not work with optparse-applicative, since the string would have to be converted in a potentially lossy fashion to a String.

While it is true that optparse-applicative needs to have some encoding to work with, that does not mean that it has to receive its input in an already decoded form.
A solution might be to decode parts of the input on the fly with a user-supplied function, touching only those parts which need to be decoded: lazy decoding, in a sense.

@xaverdh

xaverdh commented Dec 14, 2019

To state this a little more clearly (hopefully):
I envision the user supplying a list of ByteString values and a function ByteString -> Maybe String (which would actually get a sane default). The library (optparse-applicative) would then convert the values on the fly as needed. When Nothing occurs the input should be considered invalid. Then option parsers could return the raw ByteString without having to decode it at all.
This works, because the data is already structured as a list and most of the time we don't actually have to look into the values.
That works for "--opt val" style options, but obviously does not work for "--opt=val".
That would require extending the user-supplied function to a richer type (something like ByteString -> [(String, ByteString)], i.e. splitting off the possible prefixes of the decoded input).
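
A minimal sketch of the "--opt val" case described above, to make the idea concrete. All names here (Decode, decodeAscii, lookupRawOption) are hypothetical and not part of optparse-applicative. As a simplification, an argument that fails to decode is treated as "not this option" rather than as invalid input.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Hypothetical user-supplied decoder: Nothing means "cannot be decoded".
type Decode = B.ByteString -> Maybe String

-- One possible default (an assumption): accept pure ASCII only.
decodeAscii :: Decode
decodeAscii bs
  | B.all (< 0x80) bs = Just (BC.unpack bs)
  | otherwise = Nothing

-- Find "--name" in the argument list and return the argument that
-- follows it as raw bytes. Only candidate option names are decoded;
-- the returned value is never decoded.
lookupRawOption :: Decode -> String -> [B.ByteString] -> Maybe B.ByteString
lookupRawOption decode name = go
  where
    go (flag : value : rest)
      | decode flag == Just ("--" ++ name) = Just value
      | otherwise = go (value : rest)
    go _ = Nothing
```

With this shape, a value such as the bytes 0xFF 0xFE following "--name" comes back unchanged, even though it is not valid UTF-8.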

@xaverdh

xaverdh commented Dec 15, 2019

OK, looking at the actual code, a saner way to do this might be to allow the input to be anything and let the user supply both a decoding function, InputType -> Maybe String, and the function parseWord :: InputType -> Maybe OptWord, where data OptWord = OptWord OptName (Maybe InputType).
This would make it possible to implement a ByteString variant as described above.
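
A rough sketch of what such a parseWord could look like when InputType is ByteString. The OptName and OptWord types below are local stand-ins modelled on the description above, not the library's own definitions, and asciiOnly and parseWordBytes are hypothetical names. It assumes an ASCII-superset encoding, so that the bytes for '-' and '=' can be matched without decoding.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Local stand-ins for illustration only.
data OptName = OptShort Char | OptLong String
  deriving (Eq, Show)

data OptWord input = OptWord OptName (Maybe input)
  deriving (Eq, Show)

-- Hypothetical decoder: accept pure ASCII names only.
asciiOnly :: B.ByteString -> Maybe String
asciiOnly bs
  | B.all (< 0x80) bs = Just (BC.unpack bs)
  | otherwise = Nothing

-- Split "--name=value" or "-xvalue" at the byte level. Only the option
-- name is decoded; the attached value stays a raw ByteString.
parseWordBytes
  :: (B.ByteString -> Maybe String)
  -> B.ByteString
  -> Maybe (OptWord B.ByteString)
parseWordBytes decode word
  | Just long <- B.stripPrefix (BC.pack "--") word =
      let (name, rest) = BC.break (== '=') long
          value = if B.null rest then Nothing else Just (B.drop 1 rest)
      in if B.null name
           then Nothing
           else (\n -> OptWord (OptLong n) value) <$> decode name
  | Just short <- B.stripPrefix (BC.pack "-") word
  , Just (c, rest) <- BC.uncons short
  , c < '\x80' =
      Just (OptWord (OptShort c) (if B.null rest then Nothing else Just rest))
  | otherwise = Nothing
```

This covers the "--opt=val" form without the richer prefix-splitting decoder type, at the cost of hard-coding the ASCII assumption.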

@hasufell

> How would this work, exactly? The arguments would still need to be decoded and converted into a text data type

This is a very common misconception. No, they don't.

> needs to split on spaces, recognise certain special characters, etc, so it can't work on pure ByteStrings

This works fine. You can split on spaces and special ascii characters just fine with bytestring. What exactly is the problem? Anything you don't need to parse, you don't parse.


For example, filepaths on unix are ByteString (that's what you get back from the syscalls). If you convert them to String, you lose the underlying encoding (and the bytestring representation is potentially different). Now everything further you do with said filepath in your codebase is going to be potentially wrong, including simple things like ==.

This means in fact I can't trust optparse-applicative to deal with filepaths passed in by a user.

@luke-clifton

So, on Unix, filepaths are ByteStrings; however, that does not mean that Haskell's String type can't handle them. GHC uses various TextEncodings when it deals with the outside world. Of particular interest here is the file system encoding, which, on non-Windows systems, is used for argument reading as well (see argvEncoding in the same file).

By default, the file system encoding uses "UTF-8b", which actually embeds invalid utf-8 sequences using "surrogate escaping". As long as you use the same encoding when converting back to bytes, it is actually perfectly possible to round trip arbitrary data through a Haskell String. It should go without saying that System.IO.openFile and friends all do the right thing here. So, if your plan is to just open a file, using the standard Haskell interfaces, everything should just work.

On the other hand, if you want to pass it to a function expecting a ByteString (especially all the ones in System.Posix.ByteString), then you need to convert it to a ByteString correctly.

Now, there is no official way to actually convert such a String to a ByteString as far as I am aware. Your best bet is probably:

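import GHC.Foreign                       -- withCStringLen (encoding-aware)
import GHC.IO.Encoding                   -- argvEncoding
import qualified Data.ByteString as B

-- | Re-encode a FilePath obtained from the program arguments back into
-- the bytes it was decoded from, using the same encoding GHC used.
filepathToByteString :: FilePath -> IO B.ByteString
filepathToByteString path = do
    encoding <- argvEncoding
    -- Marshal the String with that encoding, then copy the bytes out.
    GHC.Foreign.withCStringLen encoding path B.packCStringLen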

Which will reverse the process correctly.
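
To illustrate the round trip, here is a small sketch that decodes bytes to a String and encodes them back. The helper names are made up for this example. It uses an explicit "UTF-8//ROUNDTRIP" encoding obtained from mkTextEncoding instead of argvEncoding, so that the result does not depend on the process's locale. That substitution is an assumption made for the demo.

```haskell
import qualified Data.ByteString as B
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (TextEncoding, mkTextEncoding)

-- Decode raw bytes into a String using the given encoding.
decodeWith :: TextEncoding -> B.ByteString -> IO String
decodeWith enc bytes = B.useAsCStringLen bytes (GHC.peekCStringLen enc)

-- Encode a String back into raw bytes using the given encoding.
encodeWith :: TextEncoding -> String -> IO B.ByteString
encodeWith enc str = GHC.withCStringLen enc str B.packCStringLen

-- Check whether the given bytes survive bytes -> String -> bytes
-- under UTF-8 with surrogate escaping.
roundTrips :: B.ByteString -> IO Bool
roundTrips bytes = do
  enc <- mkTextEncoding "UTF-8//ROUNDTRIP"
  str <- decodeWith enc bytes
  back <- encodeWith enc str
  pure (back == bytes)
```

For example, roundTrips (B.pack [0x66, 0x6f, 0xff, 0x6f]) should report True even though 0xFF is never valid in UTF-8. The check only says something about this one encoding.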

More about encodings not really related to option parsing:

Reading FilePaths from stdin using getLine will not work because the file system encoding is not used for this handle; instead the locale encoding is used. You can change the encoding being used with hSetEncoding stdin. If you set it to the file system encoding, then you can use getLine to read filepaths from stdin using the same encoding as is used for arguments, and passing it to openFile will Just Work. (Ignore files with newlines for now; this all applies to using \0 delimited names as well, but getLine is easy for demo purposes.)
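
A sketch of that, assuming newline-delimited names. The function name readPathLines is made up for this example, and it takes the handle as a parameter so it can be tried on something other than stdin.

```haskell
import GHC.IO.Encoding (getFileSystemEncoding)
import System.IO

-- Switch the handle to the file system encoding, then read it line by
-- line so that each line can be passed to openFile and friends.
readPathLines :: Handle -> IO [FilePath]
readPathLines h = do
  enc <- getFileSystemEncoding
  hSetEncoding h enc
  let loop acc = do
        eof <- hIsEOF h
        if eof
          then pure (reverse acc)
          else hGetLine h >>= \line -> loop (line : acc)
  loop []
```

Typical use would be readPathLines stdin at the start of main, before anything else reads from the handle.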

As a side note: if you see the FilePath type alias, you should be assuming that this is encoded using the file system encoding, which on unix can handle arbitrary data. The documentation for FilePath even says

> File and directory names are values of type String, whose precise meaning is operating system dependent.

Which is suggesting that there is something strange going on with the encoding. The fact that FilePath is not a distinct type from String causes a lot of grief here.

@hasufell

hasufell commented Jun 1, 2022

> By default, the file system encoding uses "UTF-8b", which actually embeds invalid utf-8 sequences using "surrogate escaping". As long as you use the same encoding when converting back to bytes, it is actually perfectly possible to round trip arbitrary data through a Haskell String.

You mean as long as the file system encoding (which can be set to arbitrary things) permits that? Hence I'm not sure what you mean by "default" here.

Also note the two caveats mentioned here: https://hackage.haskell.org/package/base-4.14.0.0/docs/GHC-IO-Encoding.html#v:mkTextEncoding

> In theory, this mechanism allows arbitrary data to be roundtripped via a String with no loss of data. In practice, there are two limitations to be aware of:
>
>   1. This only stands a chance of working for an encoding which is an ASCII superset, as for security reasons we refuse to escape any bytes smaller than 128. Many encodings of interest are ASCII supersets (in particular, you can assume that the locale encoding is an ASCII superset) but many (such as UTF-16) are not.
>   2. If the underlying encoding is not itself roundtrippable, this mechanism can fail. Roundtrippable encodings are those which have an injective mapping into Unicode. Almost all encodings meet this criteria, but some do not. Notably, Shift-JIS (CP932) and Big5 contain several different encodings of the same Unicode codepoint.

So this approach is definitely not total.

@HuwCampbell
Collaborator

We're in a hard place here.

If we were to try and use System.Posix.Env.ByteString.getArgs internally, we would, for a start, have to drop Windows support, or heavily CPP the library and provide different APIs on Windows and Unix. This isn't really acceptable in my opinion.

Cribbing from the Rust Docs:

  • On Unix systems, strings are often arbitrary sequences of non-zero bytes, in many cases interpreted as UTF-8.
  • On Windows, strings are often arbitrary sequences of non-zero 16-bit values, interpreted as UTF-16 when it is valid to do so.

But obviously on Unix systems there are other locale-specific encodings, which, as you said, may or may not be supersets of ASCII.

If Haskell had a std::ffi::OsString-like equivalent in base we might be able to do something better, but even then, if we were to read arguments as raw bytes, we would still need to interpret them in order to read things like hyphens, flag names, and command name matches.

@hasufell

hasufell commented Jun 1, 2022

> If Haskell had a std::ffi::OsString-like equivalent in base we might be able to do something better, but even then, if we were to read arguments as raw bytes, we would still need to interpret them in order to read things like hyphens, flag names, and command name matches.

We're in luck. I worked on this for the past year:

@HuwCampbell
Collaborator

Interesting. I'm open to opening a branch on top of these proposals.

I'm pretty conservative with changes here because optparse is a 10 yo project people just use and expect to work. But if base is moving, yeah, we can and should too.

@hasufell

hasufell commented Jun 1, 2022

> Interesting. I'm open to opening a branch on top of these proposals.
>
> I'm pretty conservative with changes here because optparse is a 10 yo project people just use and expect to work. But if base is moving, yeah, we can and should too.

Well, base is not moving any time soon, but some boot libraries will support this additional API. So we won't break backwards compatibility that quickly.

@hasufell

@pcapriotti can you please re-open this ticket?

@hasufell

Demonstration of unsoundness of current roundtripping techniques: https://gist.github.com/hasufell/c600d318bdbe010a7841cc351c835f92

HF tech proposal outlining the current affairs: haskellfoundation/tech-proposals#35

@Merivuokko

Hello,

I am writing Unix utilities in Haskell, and I need support for OsPath and OsString from the new filepath library.

I understand that optparse-applicative has been built on type String = [Char] and it would require major changes to be able to support plain octet streams instead of encoded text as input.

Therefore I'm kindly asking: are there any plans for optparse-applicative to transition to using OsStrings (or ByteStrings) in the foreseeable future (I could use some ad hoc CLI parser in the meantime), or should I consider planning a new CLI parsing library?

@hasufell

hasufell commented Oct 1, 2023

@Merivuokko first, we need to implement the following function properly and in a multiplatform way:

getArgs :: IO [OsString]

This is currently blocked because I can't see that the GetCommandLineW function is implemented or exposed in the Win32 package. So you'd need to create a wrapper yourself.

For unix, it's fairly trivial:

module Main where

import Data.ByteString.Short (ShortByteString)         -- bytestring
import System.OsString.Internal.Types                  -- filepath
import qualified System.Posix.Env.PosixString as Posix -- unix

main :: IO ()
main = do
  args <- getArgs
  print args

getArgs :: IO [OsString]
getArgs = fmap OsString <$> Posix.getArgs

In optparse, getArgs is used only here:

-- | Run a program description with custom preferences.
customExecParser :: ParserPrefs -> ParserInfo a -> IO a
customExecParser pprefs pinfo
  = execParserPure pprefs pinfo <$> getArgs
  >>= handleParseResult

@hasufell

hasufell commented Oct 7, 2023

haskell/win32#221

@Merivuokko

Merivuokko commented Oct 11, 2023 via email

@hasufell

@hasufell

There are two more steps needed:

7 participants