Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add starts_with and ends_with to OsStr #26499

Closed
wants to merge 1 commit into from

Conversation

Stebalien
Copy link
Contributor

(also inline all non-generic public APIs)

Note: this commit doesn't use the Pattern API because it's &str specific.
Also Note: this commit doesn't introduce OsStr::contains because I'm lazy.

I'd be happy to write an RFC if you want to explore other possibilities like
expanding the Pattern API (although I really don't think we should).

@rust-highfive
Copy link
Collaborator

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

@nikomatsakis
Copy link
Contributor

cc @rust-lang/libs

@nikomatsakis
Copy link
Contributor

So this code seems good to me, but i'd prefer to have someone from @rust-lang/libs land it. I'm going to switch to r? @alexcrichton

@alexcrichton
Copy link
Member

Can this hold off on adding #[inline] for now? Right now it tends to incur a little too much compile-time overhead for not enough runtime win. For example I've never seen OsStr manipulation at the top of any profile.

Other than that I think that this is fine, the only question being about how we want to pursue these methods into the future. Do we want to duplicate the API surface area of strings onto OS strings immediately, or incrementally? What we have today is kinda the "bare minimum" to get by with the plan to "expand if necessary". I'm somewhat against adding apis incrementally over time as it's bound to just be surprising that OsStr doesn't implement a method or two, so it may be worth taking time to plan out what the final API of OS strings might look like in the long run.

That being said I do think it's fine for these to enter in an unstable fashion, for now. If it turns out that everyone really wants these two methods then this is as good a way as any to test the waters.

@alexcrichton alexcrichton added I-needs-decision Issue: In need of a decision. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Jun 29, 2015
/// Returns true if the `other` is a prefix of the `OsStr`.
#[unstable(feature = "os_str_compare", reason = "recently added")]
pub fn starts_with<S: AsRef<OsStr>>(&self, other: S) -> bool {
self.bytes().starts_with(other.as_ref().bytes())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct for plain bytes and UTF-8, but what about other encodings? If OsStr is WTF-8, does that allow false positives (say for example that a single byte encoded sequence for a codepoint exists that's a prefix of a multibyte sequence. UTF-8 doesn't have this).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SimonSapin? I think these should be fine. If I'm reading the "spec" correctly, WTF-8 is a strict subset of "generalized UTF-8" (https://simonsapin.github.io/wtf-8/#generalized-utf_8) and, from what I can tell, generalized UTF-8 is a prefix-free code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, ok that was simple.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, WTF-8 preserves the nice properties of UTF-8 like self-synchronization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Just a note for people from the future: No this is not fine because if self contains a non-BMP code point (4-byte sequence), and other contains just a high-surrogate (3-byte sequence), this implementation will always return false, contradicting the result when given two WTF-16 sequences.

@Stebalien
Copy link
Contributor Author

@alexcrichton I started with starts_with/ends_with because I wanted to know if a filename started with '.' and realized that the only reliable way to do this was to use to_os_string_lossy().starts_with('.').

Without distinguishing between utf8 and wtf8, I can implement the following:

  • starts_with/ends_with (done)
  • contains (done)
  • is_empty/len (done)
  • lines/lines_any (in-progress)

Everything else depends on whether we're dealing with WTF-8 or UTF-8. I can write up an RFC if you want.

@bluss
Copy link
Member

bluss commented Jun 30, 2015

@Stebalien Since you mention UTF-8 and WTF-8, on linux there is no encoding for paths in general, they are just arbitrary byte strings.

@Stebalien
Copy link
Contributor Author

@bluss good point. That rules out methods like split replace, and trim (methods that look at individual characters) because they will behave differently on unix/windows. For example, "OsStr::new("Ünіcôdé").split(OsStr::new("")) act like bytes() on unix and chars() on windows.

find and the slicing methods should still be doable however.

@Stebalien
Copy link
Contributor Author

I'm going to close this for now and write an RFC.

@Stebalien Stebalien closed this Jun 30, 2015
@alexcrichton
Copy link
Member

@Stebalien we actually discussed this today in some libs triage, and we ended up reaching the same conclusion! After a quick overview of the string/osstr apis, we concluded that the "end goal" for what OsStr exposes may end up just being the Pattern methods on strings. Most other methods didn't seem to apply, and the pattern ones all generally seemed to fit nicely (including starts_with and ends_with).

There may be a possibility of generalizing the Pattern trait to work for OS strings as well (cc @Kimundi, the author of the trait), but we can also start out conservatively and assume that a "pattern" is "something that can be an OsStr" to start off with.

@Kimundi
Copy link
Member

Kimundi commented Jul 1, 2015

Hm, right now I see two possible ways to generalize the pattern API to Osstrings:

  • Make the slice type a generic paramter of the trait, and add seperate impls for Pattern<'a, str> and Pattern<'a OsStr> (and maybe Pattern<'a, [T]> as well). That should be relatively easy to do, but makes the API of the actual methods more complicated to read (where P: Pattern<'a> -> where P: Pattern<'a, str>). That said, it doesn't look that terrible either. Would be nice if we could do type Pattern<'a> = GeneralPattern<'a, str> I guess.
  • Change the Pattern API to work with any kind somehow searchable slice, and have wrapper around that that exposes it for str, OsStr and maybe plain [T]. Might reduce code duplication somewhat, but might also just be a more complicated variations of option 1. Also, its unclear if we could have specialized search algorithms as easily with this.

@bluss
Copy link
Member

bluss commented Jul 1, 2015

The current algorithm for &str in &str search supports any "ordered alphabet" so I assume it can be adapted to any [T] where T: Ord or at least any integer types. I think we want to make sure we can still specialize the [u8] case since we want to have all possibilities open to further optimizations (memchr, memcmp and even pcmpestri).

@alexcrichton
Copy link
Member

Yeah I'm not sure if it's actually possible to use the same Pattern trait (or if we even want to), but it's certainly a possibility to explore!

@aturon
Copy link
Member

aturon commented Jul 1, 2015

I want to call attention to a point that @Kimundi made in passing: a generalization of the Pattern API should, ideally, apply to [T] as well. (That would in particular bring Vec<u8> largely up to parity with String as a first-class string type.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I-needs-decision Issue: In need of a decision. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants