Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OsStrExt3::from_bytes should be unsafe #1594

Closed
KillTheMule opened this issue Nov 3, 2019 · 10 comments · Fixed by #1852
Closed

OsStrExt3::from_bytes should be unsafe #1594

KillTheMule opened this issue Nov 3, 2019 · 10 comments · Fixed by #1852
Milestone

Comments

@KillTheMule
Copy link

KillTheMule commented Nov 3, 2019

I'm opening this as a result from rust-secure-code/safety-dance#49. The mentioned method uses unsafe, but is not marked unsafe itself. I think that is not right, and while I've found no safety violations here, there's the possibility of a future bug in safe code if the necessary properties aren't upheld. @Shnatsel remarked

I think this might cause some actual memory unsafety down the line, potentially turning into a privilege escalation vulnerability.

and while I don't understand the details of this claim, it's probably worth looking at.

There's some prior art at #1054 and #1058, but the issue of from_bytes was left without action. @H2CO3 said

However, the implementation using mem::transmute() is identical to that of the stdlib, so this shouldn't be too bad, after all.

I'm uncertain about this, as far as I can see, this implementation is only valid on systems where OsStr is an arbitrary byte sequence, e.g. linux, but not on Windows, where OsStr is backed by a struct called Wtf8, where the corresponding method is marked unsafe indeed.

My takeaway is that this seems like an error we should fix. There are quite some usages which are basically trivial to fix by introducing a from_str method on OsStr, like here. There are other usages though which seem harder to fix since slicing an OsStr at arbitray indices doe not work on windows. As far as I can see those usages are bugs and will turn into problems when called with the "right" arguments. They are all in osstringext.rs, so maybe they can be fixed without any change outside that file.

I'm open to sending a PR, but I'd need some guidance on what seems acceptable to you. Also, I'm on linux, so I probably can't run proper benchmarks to see if the changes have a bad effect (note: That's just for osstringext.rs, a potential from_str method should be doable without any performance loss).

@kbknapp
Copy link
Member

kbknapp commented Nov 3, 2019

Thank you for the detailed report!

Without having looking into each use, I think this stems from workarounds that no longer need to exist. I.e. much of what clap needs is just sequences of bytes and not UTF-8 encoded strings. Originally this meant shoehorning OsStr into places (which ultimately led to the issues you've pointed out). However, I think switching to use something like bstr would potentially alleviate some of these issues.

I'll start looking deeper and see if I can come up with a list, including the locations you mentioned, that we should look into fix/replacing.

As far as Windows is concerned, I think this will require some sort of cfg-able abstraction layer over the underlying bstr or whatever we end up using. Some of the coming internal changes I plan on landing over the coming months may make this much easier. Although not really related to this issue, I'd like to break out the common structs clap uses into components that handle their own logic, which will allow us to push much of the related code into smaller components instead of using god-objects. Since this is just internal representation, it shouldn't affect the public API at all.

@KillTheMule
Copy link
Author

I'll happily provide you a list later on. Would you still be interested in a PR that cleans that up a bit short-term?

Theres one thing I want to point out which I don't fully understand, maybe you can shed some light (and maybe it's a point of note anyways): How is this safe? You are making arg out of args_os by lossy conversion, then you find a certain char to split it at, and then you use the byte index of that char to finally split arg. Is there some property of the conversion that makes this work? If arg is wt8, but not utf8 at the start, this all seems to go haywire.

That said, because of this everything will panic in the non-utf8 case on windows anyways, so maybe we should embrace this for now?

@Shnatsel
Copy link
Contributor

Shnatsel commented Nov 6, 2019

We have quite a few levels of abstraction here, so I'll try to go through each one and explain why I think this issue might be a problem in practice.

  1. impl OsStrExt3 for OsStr {
    fn from_bytes(b: &[u8]) -> &Self {
    use std::mem;
    unsafe { mem::transmute(b) }
    }
    allows creating an OsStr from an arbitrary byte sequence.
  2. On Windows OsStr is actually backed by a WTF-8 string. Multiple places in WTF-8 implementation assume that WTF-8 is well-formed UTF-8 except for possibility of encountering surrogates, and call str::from_utf8_unchecked() on the byte slices between the surrogates (example)
  3. Many functions implemented on &str rely on &str only ever containing valid UTF-8, but we have just constructed an &str via str::from_utf8_unchecked() without validating that they actually are valid UTF-8 (other than removing unpaired surrogates). From here on in it should be possible to trigger more or less arbitrary memory out-of-bounds access through invoking internally unsafe functions relying on &str validity invariant being upheld.

The above holds if truly arbitrary byte sequences are allowed to enter OsStrExt3::from_bytes. Windows might or might not restrict the incoming byte sequences sufficiently to mitigate this. Even if it's not exploitable now, it is a very fragile primitive and needs to be fixed.

@Shnatsel
Copy link
Contributor

Shnatsel commented Nov 6, 2019

Upon further discussion, since this applies to &str and not String, the worst we can see from this is information disclosure, not arbitrary code execution, so this is nowhere near as critical as I initially thought.

@dylni
Copy link
Contributor

dylni commented Apr 18, 2020

Hi! I created OsStr Bytes a while ago, but today I removed all unsafe code. This issue came up on that thread, so I was wondering if you would be open to a PR that uses this crate to perform the conversions.

If so, there are some design questions I have about the implementation:

  • Conversions can no longer be zero-cost, so OsStrExt2 methods would be more expensive. I'm thinking of solving this by creating struct similar to OsStr/OsString inside clap that would only need to perform the conversion once. Does this seem reasonable, or do you prefer another way to approach this?
  • If I understand this line correctly, I think the + 1 should be + c.len_utf8(). The split_at call later uses this index, so I want to be sure my belief is correct that it's always valid to split there.
  • From what I've seen, I believe all methods of OsStrExt2 are being used correctly to work with any UTF-8-compatible encoding. However, there are no assertions on these methods to ensure that remains true. If I added these for the new struct, should they be assert! or debug_assert!? I would prefer the former, but I'm not sure how concerned you are about the performance of these methods. Of course, I would use debug_assert! for anything too expensive.

Would this be something you'd want to see added?

@Dylan-DPC-zz
Copy link

@dylni ya sure, you can go ahead and make a PR :) Safety comes first

@dylni
Copy link
Contributor

dylni commented Apr 19, 2020

Great :) I'm a bit busy recently, but I'll send a PR when I can find some time. Plus, that will give this time to gather opinions on implementation.

@Dylan-DPC-zz
Copy link

That works for us :) Thanks

@CreepySkeleton
Copy link
Contributor

@dylni We are totally open to push entrust this job to someone else! 😃 You and your PR are very welcome here!

  1. That would be a bit sad, but I'm seconding Dylan here: correctness is more important than performance, especially in a command line parsing library. Send your PR and, as long as the regression is even a bit reasonable (i.e it's not something crazy as + 10 seconds), we will accept it .

  2. Congratulations! You've found a bug, but this bug needs the stars to be aligned in a very specific way, so virtually nobody triggers it in practice. Must be fixed anyway, of course.

    Conditions:

    • The app has a short option with a non-ASCII name (it's byte length must be > 1). This is why nobody noticed it so far: those names are always ASCII.
    • There must be no space between the option and the value -фBOOM

    Reproducer: (Playground)

    use clap::{App, Arg};
    
    fn main() {
        let opts = App::new("app")
            // 'ф' is a cyrillic character, its byte lenght is 2 in utf-8 
            // a Chinese hieroglyph would more fancy, but I don't have hieroglyphs
            // on my keyboard :)
            .arg(Arg::with_name("opt").short("ф").takes_value(true))
            // no space between the option and the value so the split will trigger
            .get_matches_from(&["test", "-фNOOOOOO"]);
        println!("{:?}", opts.value_of("opt").unwrap());
    }

    Clap 2.x righteously panics, master prints gibberish.

    Changing this line to let i = p[0].as_bytes().len() + c.len_utf8(); will fix it.

  3. Let's just go with assert!. I believe CPU's branch predictor will discard those branches as trivially non-taken ones.

@dylni
Copy link
Contributor

dylni commented Apr 19, 2020

@CreepySkeleton Thanks for the responses and interest! I don't expect the performance hit to be large, but I am interested to see how measurable it is. Some basic testing has shown the methods to be as fast as the standard library's implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants