Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Relocate and improve c_str #435

Closed
wants to merge 6 commits into from

Conversation

mzabaluev
Copy link
Contributor

Relocation and interface changes for the Rust library module c_str.
For the time being, the redesigned module is available as a
standalone project.

  • Move the c_str module out of std to free the standard library from
    public type dependencies on libc.
  • CString is made generic over a destructor type, allowing arbitrary
    deallocation functions.
  • Methods to convert CString to Rust slice types are named
    parse_as_bytes and parse_as_utf8 to reflect the scanning cost and
    the possibility of failure (in the UTF-8 case).
  • Bring the constructors in line with the current conventions in the Rust
    library.
  • Return Result for conversions that may fail, with descriptive error types.
  • Remove the Clone implementation on CString due to lack of purpose.
  • Remove .as_mut_ptr() due to its potential for convenient evil.
  • Add an adaptor type CStrArg for passing string data to foreign functions
    accepting null-terminated strings.
  • Add a trait IntoCStr to enable optimized conversions to CStrArg.

Rejected ideas

Left here as context to the comments below:

  • Split the current CString into a low-level type CStrBuf and
    a length-aware CString to make computation costs explicit.
  • CString had dynamically assigned destructors as arbitrary boxed closures.

Prior issues

@canndrew
Copy link
Contributor

canndrew commented Nov 4, 2014

One thing to note: This wouldn't remove std's dependency on libc and as long as std depends on libc ,libc can't depend on std. As such this work would either have to stay in std or be moved into it's own seperate crate.

@thestinger
Copy link

I don't think this should be in the libc crate. It's not part of the C standard library bindings. If anything, c_vec and c_vec should be in an experimental crate together.

@canndrew
Copy link
Contributor

canndrew commented Nov 4, 2014

Also, as discussed in those issues, it might be best to split CStrBuf and CString into mutable and non-mutable types. One of which allows deallocating the buffer, mutably iterating over it and returning it as a *mut libc::char and the other of which doesn't.

@canndrew
Copy link
Contributor

canndrew commented Nov 4, 2014

A few more thoughts...

It exposes C character type signatures defined in libc as public interface in std, which may create portability problems for Rust code. At present the libc crate consistently defines c_char as i8 even on ARM where the C char is traditionally unsigned, but it's not explained as a conscious break from the C standard, so it might be considered subject to change. In some exotic ABIs, a C char is not even a byte. Note how the type aliases from libc are resolved to their underlying primitive types in the documentation generated for std, leaving the readers unaware that the string pointer type may vary with the architecture.

Always defining c_char as i8 is a bug in liblibc and resolving libc::c_char to it's underlying type is a bug in rustdoc.

There is no support for strings allocated by other means than the standard C malloc.

I think in general we're going to want a way of using other allocators with Rust types including standard types like Vec and String. So it might be an idea to leave the new_with_dtor stuff until we've figured out how we're going to do that. (Having said that I can't find any open issues related to this). I'm also slightly uneasy about running closures in destructors unless there's a good fix for these issues: rust-lang/rust#14875 rust-lang/rust#16135

@mzabaluev
Copy link
Contributor Author

@canndrew: No problem with std depending on libc for the implementation, but I want to move out of std the public dependency on a type alias defined by libc. I agree that libc cannot depend on std, so if c_str and c_vec would be moved to libc they'd have to use the core crates directly, just as they do in Rust source. But as @thestinger suggests this may not be a good idea: the C/POSIX API bindings (the purpose of libc) are going to be much more stable than our experiments with Rust's C glue.

@thestinger
Copy link

The fact that the libc crate isn't currently auto-generated is a major flaw. It only covers a small fraction of the C standard library and POSIX functions / types / constants and it is hard-wired to support only a few C standard library implementations. It would be a very bad idea to continue down this path by adding organization and features that wouldn't be possible to auto-generate from C headers. It is a low-level binding to the C standard library, not a dumping ground for abstractions related to C.

@mzabaluev
Copy link
Contributor Author

I don't see much value in mutable wrappers. For collection-wise mutations one can convert to a String or a Vec and get the full menu of standard Rust library facilities for those, and you'd likely need to allocate a new buffer anyway. To modify characters in place, a mutable pointer can be obtained as unsafe { cstr.as_ptr() as *mut libc::char }, with a visible dangerous cast.

In my understanding, the primary purpose of c_str is to provide adapters between Rust string data (or byte arrays) and foreign libraries working with C strings. There is additional utility, mainly for performance, in being able to keep foreign-allocated C strings around, use them as e.g. map keys without copying, and pass them back to foreign functions with no conversion cost. There is no need, however, to develop full-fledged collections around C strings. Perhaps someone with a good history of working with CString provide more insight on common use cases. Servo?

Quoting Daniel Micay:

    [libc] is a low-level binding to the C standard library,
    not a dumping ground for abstractions related to C.

rust-lang#435 (comment)
@canndrew
Copy link
Contributor

canndrew commented Nov 5, 2014

I've worked with CString a bit and I think there's definitely value in having mutable and non-mutable varieties. Being able to make guarantees about deallocation is enough of a reason. For example, some C functions may return a pointer to a static string as a const char * whereas others may return a char * that I'm supposed to free when I no-longer need. This difference is statically defined in the type of the function so it makes sense to use two different types to deal with the two different cases.

@mzabaluev
Copy link
Contributor Author

@canndrew You make a good point in that the constructors could follow the C ownership conventions regarding pointer constness. So the ownership-taking constructors would take a *mut libc::c_char to prevent accidental freeing of static or unowned data, like in your PR. That said, I still don't see a need for mutable wrapper types.

@ben0x539
Copy link

ben0x539 commented Nov 5, 2014

@thestinger the current organization of libc with all those nested modules already seems hard to reproduce automatically. Do you consider that a bug?

@thestinger
Copy link

@ben0x539: Yes, I think any organization or design decisions in libc that aren't present in the original C headers is a bug. I don't think splitting it up by standard or whatever else makes sense. It should correspond directly to what the platforms provide.

@ben0x539
Copy link

ben0x539 commented Nov 5, 2014

@thestinger System headers seem to feature-gate declarations using macros etc, should we bother translating that into our bindings?

@mzabaluev
Copy link
Contributor Author

@thestinger, @ben0x539: perhaps the idea was to be able to switch off parts of the standards that are missing on some platform. The real libc API variation is probably more fractured, anyway.

The issues of binding to libc should be discussed separately, I think.

@thestinger
Copy link

perhaps the idea was to be able to switch off parts of the standards that are missing on some platform. The real libc API variation is probably more fractured, anyway.

The APIs are inherently non-portable because they use very platform-specific types. For example, c_long is 32-bit on x86_64 Windows and 64-bit on x86_64 *nix. The only guarantee about the C int type is that it's at least 16 bits. There are many portability hazards involved in trying to translate from these types to the saner types defined by Rust.

The APIs are generally not fully specified to there are many differences between platforms in that sense too. For example, malloc(0) returns NULL on some platforms and a unique, valid memory allocation on others.

The issues of binding to libc should be discussed separately, I think.

You're proposing that we put stuff into the C bindings that aren't provided by the C standard library. I think it's appropriate to discuss my opinion on that here. I don't think it makes sense to move libc further away from being auto-generated / matching the C headers.

@mzabaluev
Copy link
Contributor Author

@thestinger I have already removed the question on moving to libc from the PR as a result of the discussion above, thank you :)

mzabaluev added a commit to mzabaluev/rust-c-str that referenced this pull request Nov 5, 2014
The convention in C APIs is to use constness to designate absence
of ownership transfer, so using a `*mut libc::char` in the
ownership-taking constructors may help prevent accidental misuse.
Destructor functions on the C side normally take pointers to
non-const data as well.

rust-lang/rfcs#435 (comment)
@mzabaluev
Copy link
Contributor Author

I have a nagging idea that representation of an unowned buffer could be laid onto CChars, which would implement all the traits and methods that CStrBuf currently provides. Then CStrBuf and CString would only wrap buffers that need a destructor, resulting in some internal optimization: no need for an Option and its associated branching. But I'm not sure it's a good idea to adorn an iterator type with methods and traits unrelated to iteration.

@mzabaluev
Copy link
Contributor Author

A quick census of CString::new in the Rust source tree suggests that CString is mostly used in unowned mode, on stack, to convert to a string slice or bytes. If CChars grows a public constructor, .as_str(), .as_bytes() and the like, it can be used for those cases instead, avoiding the RAII machinery redundant in this case, and adding explicit lifetime control. CChars would also be handy to wrap strings owned by some foreign object. If the object is RAII-wrapped in struct Foo and provides a foreign function to obtain the inner string, a lifetime-safe zero-copy way to get at the inner string would be something like:

impl Foo {
    fn get_inner_string<'a>(&'a self) -> CChars<'a> {
        unsafe { CChars::wrap(raw::foo_get_inner_string(self)) }
    }
}

@mzabaluev
Copy link
Contributor Author

As an experiment, I have added a branch for c_compat with added lifetime-bound type CStrRef as a transient wrapper for non-copying conversions.

@aturon aturon assigned alexcrichton and unassigned aturon Nov 20, 2014
@mzabaluev
Copy link
Contributor Author

CStrRef has become BorrowedCString to go along with the borrow theme in std.

Now I'm getting doubts on whether there's real need for three types to represent a C string. BorrowedCString can resolve most of the concern for hidden length calculations by caching the length in the borrowed value. Perhaps CString can lose the cached length member and merge back with CStrBuf. Storing the calculated length can be useful if one uses the string contents repeatedly, but saving the cost of one copy to an owned std type is likely to be micro-optimization in this case.

@alexcrichton
Copy link
Member

@mzabaluev sorry I should have posted here sooner, but I wanted to give you an update on my progress as well. I've taken over this RFC from @aturon and have been giving this some thought recently. After thinking about this for awhile, I think that all our usage of C strings can be lumped into one of three categories:

  1. I have a Rust type with some bytes, and I want to hand it over to C.
  2. C hands me a pointer, but I do not own the pointer. I do, however, want to look at the pointer!
  3. C hands me a pointer, and I own the pointer. I also want to be able to look at the pointer.

I, like you, have been questioning the fundamental utility of CString as implemented today. I've been sketching out some ideas (just in a few local branches currently), and here's the API I was thinking of:

#[deriving(Clone, PartialEq, PartialOrd, Eq, Ord, Hash)]
pub struct CString {
    inner: Vec<libc::c_char>
}

impl CString
    pub fn from_slice(bytes: &[u8]) -> CString;
    pub fn from_vec(vec: Vec<u8>) -> CString;
    pub unsafe fn from_vec_unchecked(vec: Vec<u8>) -> CString;

    pub fn as_slice_with_nul(&self) -> &[libc::c_char];
    pub fn as_bytes(&self) -> &[u8];
    pub fn as_bytes_with_nul(&self) -> &[u8];
}

impl Deref<[libc::c_char]> for CString
// note lack of DerefMut

impl Show for CString 

pub unsafe fn from_raw_buf<'a>(raw: &'a *const libc::c_char) -> &'a [u8];
pub unsafe fn from_raw_buf_with_nul<'a>(raw: &'a *const libc::c_char) -> &'a [u8];

The key idea here is that CString is no longer an "unsafe type" by any means, it's purely a static assertion that its inner vector has no interior 0 bytes, but has a terminating null byte. Additionally it only provides a few methods to view into itself, predominately gaining methods from the Deref implementation. This CString is intended so solve use case (1) I listed above.

For use case (2), the two conversion functions to be located in the c_str module should cover our bases.

For use case (3) I'm still noodling out some ideas, but I'm thinking of a smart-pointer-style vector which will basically be what CVec turns into as well.

I'm also thinking of deprecating all these functions (in modules other than c_str)

  • String::from_raw_buf
  • String::from_raw_buf_len
  • str::from_c_str

What do you think of this? In general I don't want to have a huge proliferation of types to work with here, and I'd rather leverage byte slices as much as possible (that's basically what a C string is).

@mzabaluev
Copy link
Contributor Author

@alexcrichton I think case (1) is currently addressed primarily by ToCStr, which provides room for polymorphic optimization:

fn frob<T: ToCStr>(s: T) {
    s.with_c_str(|p| {
        unsafe { foreign::frob(p); }
    }
}

Note that with_c_str has a fast path for short slices, and trivial non-copying implementations for the C string wrapper types, as seen in my library.

The pointer-to-slice conversion functions will probably address most needs, except the efficient passing back to C as described above. C strings are byte slices with a twist, which can be made use of statically.

Deref seems useful, I'll look into adding the implementations.

In your design, CString is repurposed to only manage Rust-allocated strings, so another type would need to cover case (3). I don't see Rust-side allocation as a sufficient reason for a separate type; I think one RAII type can cover all dynamically allocated strings.

To summarize my proposed solutions for the cases you have listed:

  1. ToCStr implementations to pass strings to C.
  2. BorrowedCString<'a> to wrap unowned strings obtained from C.
  3. CString to manage owned zero-terminated strings of any origin.

@alexcrichton
Copy link
Member

I was hoping to totally remove the ToCStr trait because all it really is saying is "I'm a byte slice" which we already have other traits like BytesContainer for. I agree that the small-string optimization is nice, but it's not impossible if we only have a separate CString type. One drawback today, for example, is that RAII is far more common of an idiom than using a with function and a closure.

Ah yes, I didn't go into too much detail for (3)! I ended up writing a proto-RFC today about this topic, and I'll probably open it up soon-ish. I'd be very curious to see this feedback folded into this RFC though, it's always nice to have a few options to choose from!

@mzabaluev
Copy link
Contributor Author

@alexcrichton Thanks for the document, I'll be happy to provide feedback on your proposal.

The problem I see with trading ToCStr for some inner complexity in CString (such as an enum taking either an in-place short array or a slice-like pointer) is removing a possibility for compile-time optimization in favor of branching which may or may not be eliminated statically.
ToCStr looks like it took some practical experimentation by people who started out with just constructing CString everywhere but found that a lot of the program time goes into allocating short strings. Sure with_c_str is cumbersome, but for those who prefer clearer code and do not care about extra copies there is also to_c_str. I feel it would be bad to give up this degree of compile-time control without some benchmarking. Pickaxe finds @erickt who wrote the existing benchmarks and the micro-optimization.

Getting owned strings under CVec has pretty much the same issue: loss of compile-time information about the trailing zero, which may result in pessimized code paths.

@alexcrichton
Copy link
Member

@mzabaluev indeed! This is a bit of a tradeoff one way or another. I wouldn't expect the case of handing an owned string into Rust and then back into C would be too common, but using CVec would definitely limit that transfer of information.

I'm fairly confident that we can regain a small string optimization, but I would prefer to stress a clear API in this area rather than micro-optimize from the start. We could add helper functions later always, but I think that using an inline vector of ~64 bytes or so would be pretty close to today's small string optimization.

I've now opened the RFC as well: #494

@mzabaluev
Copy link
Contributor Author

I have redesigned the library, bringing it closer to @alexcrichton 's #494 and bringing it up to date with Rust library conventions. CString is still the RAII wrapper, now with generic destructors. There are utility functions to get a byte/string slice directly from the raw pointer. For passing strings from Rust to C, there is type CStrArg and trait IntoCStr.

@aturon
Copy link
Member

aturon commented Jan 2, 2015

@mzabaluev The competing #494 has just been accepted. You can see some details in the comment I added, but the basic point is that #494 takes a more conservative approach and should be extendable to the design you've proposed here over time, if a clear need to do so emerges. We also plan to grow std::ffi over time, hopefully through new modules that begin their life in the crates.io ecosystem.

Thanks again for your work drawing attention to this issue and proposing a concrete design!

@aturon aturon closed this Jan 2, 2015
@mzabaluev
Copy link
Contributor Author

Fair enough, I can see the concerns.
Nevertheless, the crate I've been developing remains usable, and I should publish it on crates.io, perhaps after some renaming.

Thanks to all participants for your work on improving Rust!

@mzabaluev
Copy link
Contributor Author

After some more development, I have published the crate as c_string.

let reaction = c_str!("Awesome!");
unsafe { libc::puts(reaction.as_ptr()) };

wycats added a commit to wycats/rust-rfcs that referenced this pull request Mar 5, 2019
Forwarding Element Modifiers with "Splattributes"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants