Add starts_with and ends_with to OsStr #26499
(rust_highfive has picked a reviewer for you, use r? to override)
cc @rust-lang/libs
So this code seems good to me, but I'd prefer to have someone from @rust-lang/libs land it. I'm going to switch to r? @alexcrichton
Can this hold off on adding Other than that I think that this is fine, the only question being about how we want to pursue these methods into the future. Do we want to duplicate the API surface area of strings onto OS strings immediately, or incrementally? What we have today is kinda the "bare minimum" to get by with the plan to "expand if necessary". I'm somewhat against adding apis incrementally over time as it's bound to just be surprising that OsStr doesn't implement a method or two, so it may be worth taking time to plan out what the final API of OS strings might look like in the long run. That being said I do think it's fine for these to enter in an unstable fashion, for now. If it turns out that everyone really wants these two methods then this is as good a way as any to test the waters. |
    /// Returns true if the `other` is a prefix of the `OsStr`.
    #[unstable(feature = "os_str_compare", reason = "recently added")]
    pub fn starts_with<S: AsRef<OsStr>>(&self, other: S) -> bool {
        self.bytes().starts_with(other.as_ref().bytes())
This is correct for plain bytes and UTF-8, but what about other encodings? If OsStr is WTF-8, does that allow false positives? Say, for example, that a single-byte encoded sequence for a code point exists that is a prefix of a multibyte sequence (UTF-8 doesn't have this).
@SimonSapin? I think these should be fine. If I'm reading the "spec" correctly, WTF-8 is a strict subset of "generalized UTF-8" (https://simonsapin.github.io/wtf-8/#generalized-utf_8) and, from what I can tell, generalized UTF-8 is a prefix-free code.
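One way to see why the per-code-point encoding is prefix-free: the lead byte alone fixes the length of the sequence, so no encoded code point can be a proper prefix of another. The following is a minimal sketch that checks this for every `char`; the helper names `seq_len` and `lead_byte_determines_length` are made up for illustration and are not part of the PR or of std. It says nothing about sequences that span more than one code point.

```rust
/// Sequence length implied by a lead byte in the UTF-8 bit pattern.
/// Returns None for continuation bytes (0x80..=0xBF) and bytes >= 0xF8.
/// (0xC0 and 0xC1 match the two-byte pattern even though strict UTF-8 never emits them.)
fn seq_len(lead: u8) -> Option<usize> {
    match lead {
        0x00..=0x7F => Some(1),
        0xC0..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF7 => Some(4),
        _ => None,
    }
}

/// Checks, for every Unicode scalar value, that the first byte of its
/// UTF-8 encoding predicts the full length of that encoding.
fn lead_byte_determines_length() -> bool {
    let mut buf = [0u8; 4];
    (0..=0x10FFFFu32)
        .filter_map(char::from_u32) // skips the surrogate range
        .all(|c| {
            let bytes = c.encode_utf8(&mut buf).as_bytes();
            seq_len(bytes[0]) == Some(bytes.len())
        })
}
```

Surrogates are skipped here because `char` cannot hold them; in generalized UTF-8 they would use the same three-byte pattern with lead byte 0xED.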
Yep, ok that was simple.
Yes, WTF-8 preserves the nice properties of UTF-8 like self-synchronization.
Just a note for people from the future: No, this is not fine, because if `self` contains a non-BMP code point (4-byte sequence) and `other` contains just a high surrogate (3-byte sequence), this implementation will always return `false`, contradicting the result when given two WTF-16 sequences.
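The mismatch described in this note can be reproduced with plain byte and `u16` slices. This is only a sketch of the encodings involved, not of how std stores `OsStr` internally; `encode_3byte`, `utf16_starts_with_unit`, and `bytes_start_with_unit` are hypothetical helpers written for this illustration.

```rust
/// Encode a code point in U+0800..=U+FFFF as three bytes of generalized UTF-8.
/// Unlike strict UTF-8, this accepts surrogates (U+D800..=U+DFFF).
fn encode_3byte(cp: u16) -> [u8; 3] {
    assert!(cp >= 0x0800, "code points below U+0800 need fewer than three bytes");
    [
        0xE0 | (cp >> 12) as u8,
        0x80 | ((cp >> 6) & 0x3F) as u8,
        0x80 | (cp & 0x3F) as u8,
    ]
}

/// Does the UTF-16 form of `s` start with the single code unit `unit`?
fn utf16_starts_with_unit(s: &str, unit: u16) -> bool {
    s.encode_utf16().next() == Some(unit)
}

/// Does the UTF-8 form of `s` start with the three-byte encoding of `unit`?
fn bytes_start_with_unit(s: &str, unit: u16) -> bool {
    s.as_bytes().starts_with(&encode_3byte(unit))
}
```

For U+1F4A9 the UTF-16 form is the surrogate pair 0xD83D 0xDCA9, so `utf16_starts_with_unit("\u{1F4A9}", 0xD83D)` holds. The byte form is F0 9F 92 A9, while the lone high surrogate encodes as ED A0 BD, so `bytes_start_with_unit("\u{1F4A9}", 0xD83D)` does not.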
@alexcrichton I started with starts_with/ends_with because I wanted to know if a filename started with '.' and realized that the only reliable way to do this was to use Without distinguishing between utf8 and wtf8, I can implement the following:
Everything else depends on whether we're dealing with WTF-8 or UTF-8. I can write up an RFC if you want.
@Stebalien Since you mention UTF-8 and WTF-8, on Linux there is no encoding for paths in general; they are just arbitrary byte strings.
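To make the "arbitrary byte strings" point concrete, here is a Unix-only sketch of the dotfile check that motivated the PR, using the existing `std::os::unix::ffi::OsStrExt` extension trait. The function name `is_hidden` is an assumption for illustration; nothing in this thread proposes it.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: exposes the raw bytes of an OsStr

/// Returns true if the file name begins with an ASCII '.'.
/// Works on names that are not valid UTF-8, since it never decodes them.
fn is_hidden(name: &OsStr) -> bool {
    name.as_bytes().first() == Some(&b'.')
}

/// Builds a file name containing the byte 0xFF, which can never appear in UTF-8.
fn non_utf8_dotfile() -> &'static OsStr {
    OsStr::from_bytes(b".\xFFconfig")
}
```

Here `non_utf8_dotfile().to_str()` returns `None`, so going through `&str` would lose the name entirely, while `is_hidden(non_utf8_dotfile())` still returns `true`. The same code does not compile on Windows, which is part of why a portable method on `OsStr` was being asked for.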
@bluss good point. That rules out methods like
I'm going to close this for now and write an RFC.
@Stebalien we actually discussed this today in some libs triage, and we ended up reaching the same conclusion! After a quick overview of the string/osstr apis, we concluded that the "end goal" for what OsStr exposes may end up just being the Pattern methods on strings. Most other methods didn't seem to apply, and the pattern ones all generally seemed to fit nicely (including There may be a possibility of generalizing the |
Hm, right now I see two possible ways to generalize the Pattern API to OS strings:
The current algorithm for &str in &str search supports any "ordered alphabet" so I assume it can be adapted to any |
Yeah I'm not sure if it's actually possible to use the same |
I want to call attention to a point that @Kimundi made in passing: a generalization of the Pattern API should, ideally, apply to |
(also inline all non-generic public APIs)
Note: this commit doesn't use the Pattern API because it's &str specific.
Also note: this commit doesn't introduce OsStr::contains because I'm lazy.
I'd be happy to write an RFC if you want to explore other possibilities like expanding the Pattern API (although I really don't think we should).