ACP: Method to split OsStr into (str, OsStr)
#114
Comments
Proposed implementation: jmillikin/upstream__rust@c0ea2a4 Docs screenshots: |
The same for |
rust-lang/rust#95290 should be relevant. Once the underlying bytes are exposed this could be implemented by user code. |
PR rust-lang/rust#95290 (exposing raw WTF-8 bytes on Windows) isn't appealing to me, and even if such an API would be created I would want to have |
Gentle ping -- I'm still interested in this, is there any interest from the libs-api team regarding OsStr unicode prefix splitting? |
Once you get the raw bytes you can use regular string APIs such as |
I might be misunderstanding, but I don't think that would solve the problem because (to the best of my knowledge) there is no way to convert a In other words, how would you implement |
Not only that. Even if I could convert to and from |
@mina86 For trimming prefixes and suffixes from an This ACP is tracking a different request, which is the ability to obtain the valid Unicode portion of an |
The motivation for this ACP is argument parsing. I don’t see how to_str_split is a good solution for that. For example, assuming |
The output type of the tokenizer step function looks approximately like this:

    pub struct Arg<'a>(
        #[cfg(all(target_family = "unix", not(feature = "std")))]
        &'a [u8],
        #[cfg(all(target_family = "windows", not(feature = "std")))]
        &'a str,
        #[cfg(feature = "std")]
        Cow<'a, OsStr>,
    );

    pub enum Token<'a, FlagId> {
        Arg(Arg<'a>),
        Flag(FlagName<'a>, FlagId, Arg<'a>),
        FlagUnary(FlagName<'a>, FlagId),
        FlagHelp(Option<&'a str>),
    }

To separate a single When parsing an |
FYI: I created a PR for the proposed implementation branch: rust-lang/rust#111059 I'm not sure what the exact ordering is of ACP <-> unstable PR, but maybe having the implementation code available for review will help when reading the ACP. |
Proposal
Add a method to std::ffi::OsStr that splits it into (&str /* prefix */, &OsStr /* suffix */), where the prefix contains valid Unicode and the suffix is the portion of the original input that is not valid Unicode.
Problem statement
The OsStr type is designed to represent a platform-specific string, which might contain non-Unicode content. It has a to_str(&self) -> Option<&str> method to check whether the string is valid Unicode, but this method operates only on the entire string. It's not currently possible to check that a portion of the string is valid Unicode in an OS-independent way.
This proposal would add a method that lets an OsStr be split into a prefix of valid Unicode and a suffix of remaining non-Unicode content in the platform encoding.
Motivation, use-cases
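To make the limitation from the problem statement concrete, here is a minimal Unix-only sketch. The helper name `whole_arg_as_str` and the sample inputs are hypothetical; the only standard-library calls used are `OsStr::to_str` and the Unix `OsStrExt::from_bytes`.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: lets us build a non-UTF-8 OsStr

/// Hypothetical helper: returns the argument as text only when *all* of it
/// is valid Unicode, which is the only question `to_str` can answer.
pub fn whole_arg_as_str(arg: &OsStr) -> Option<&str> {
    arg.to_str()
}

/// Hypothetical sample argument: a readable name followed by the byte 0xFF,
/// which can never appear in well-formed UTF-8.
pub fn sample_non_unicode_arg() -> &'static OsStr {
    OsStr::from_bytes(b"--name=\xFF")
}
```

For the sample argument, `to_str` yields `None` even though the leading `--name=` part is plain ASCII, so the caller learns nothing about the readable prefix.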
Command-line options (long flags)
One of the common use cases for OsStr is parsing command-line options, a possible format of which has a prefix (the option name) and a suffix (the option value):
Unix syntax:
Windows syntax:
When parsing CLI options, the user wants to match option names against values provided by the program, but preserve option values as they are for use with OS APIs.
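As an illustration of matching an option name while preserving the value untouched, here is a minimal Unix-only sketch. It assumes a `--name=value` syntax, and the function `long_flag_value` is a hypothetical name introduced for this example; neither is taken from the proposal.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: OsStr <-> &[u8] without copying

/// Hypothetical helper: if `arg` has the assumed form `--<name>=<value>`,
/// return the value exactly as the OS supplied it (it may not be Unicode).
pub fn long_flag_value<'a>(arg: &'a OsStr, name: &str) -> Option<&'a OsStr> {
    // Compare raw bytes against the UTF-8 bytes of the flag name.
    let rest = arg.as_bytes().strip_prefix(b"--")?;
    let rest = rest.strip_prefix(name.as_bytes())?;
    let value = rest.strip_prefix(b"=")?;
    // The value is never validated or re-encoded.
    Some(OsStr::from_bytes(value))
}
```

The value is handed back as an `OsStr`, so it can be passed on to OS APIs even when it contains bytes that are not valid UTF-8.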
This function is easy to implement on Unix because std::os::unix::ffi::OsStrExt provides free conversion to and from &[u8], which can be compared with the UTF-8 bytes of the flag name.
However, on Windows, it's basically impossible to implement in safe Rust -- the Windows variant of OsStrExt provides Iterator<Item = u16>, and has no mechanism for constructing a non-Unicode OsStr at all.
Command-line options (short flags)
A less ubiquitous but still commonly used format for "short options" on Unix systems allows multiple flags to be put into one option:
Supporting this requires being able to obtain a str with prefix "-xyz". The args library then uses flag definitions to tokenize it into one of ["-x", "-y", "-z"], ["-x", "-y", "z"], or ["-x", "yz"].
Solution sketches
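As a rough picture of the prefix/suffix split described in the proposal, here is a Unix-only sketch built on `std::str::from_utf8` and `Utf8Error::valid_up_to`. The free function `split_valid_prefix` is a hypothetical stand-in, not the proposed implementation, and it assumes the suffix is simply everything from the first invalid byte onward, which may differ from the proposal's exact rules.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: OsStr <-> &[u8] without copying

/// Hypothetical Unix-only stand-in for the proposed method: returns the
/// longest valid UTF-8 prefix and whatever follows it.
pub fn split_valid_prefix(s: &OsStr) -> (&str, &OsStr) {
    let bytes = s.as_bytes();
    match std::str::from_utf8(bytes) {
        // Entire input is valid Unicode: the suffix is empty.
        Ok(text) => (text, OsStr::new("")),
        Err(err) => {
            // `valid_up_to` is the length of the longest valid UTF-8 prefix.
            let (valid, rest) = bytes.split_at(err.valid_up_to());
            // `valid` was already checked above, so this cannot fail.
            let prefix = std::str::from_utf8(valid).expect("prefix is valid UTF-8");
            (prefix, OsStr::from_bytes(rest))
        }
    }
}
```

This only works where the raw bytes are reachable through safe APIs, which is why it does not carry over to Windows as written.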
Add a method OsStr::to_str_split() (or whatever name folks prefer) that returns the valid Unicode prefix and the non-Unicode remainder.
Rules for the new function:
Note that the calling code would be responsible for handling inputs where the flag value itself is partial Unicode, for example on Unix all absolute paths start with the ASCII character '/'.
Links and related work
The general topic of examining OsStr for prefixes comes up often. A selection of related issues/PRs:
OsStr and CStr up to par with str rfcs#900
There is an open issue for supporting the Pattern API on OsStr, but (1) it's a significantly larger amount of implementation work, and (2) doesn't appear to allow extracting the prefix without unwrap.
What happens now?
This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.