ACP: `CStr::bytes` iterator #135

clarfonthey · 2022-11-15T14:09:09Z

Proposal

Problem statement

Many methods of CStr mention that methods like to_bytes will eventually require a scan of the string, even though the string is currently represented internally as a slice and has efficient implementations now. However, no existing methods offer a way to access the underlying string in a future-compatible way.

Motivation, use-cases

In my particular case, I was trying to create a method that was generic over strings, byte slices, and C strings, and used iterators to achieve this. I noticed that CString didn't have any of its own iterator types and wanted to rectify this.

Solution sketches

The proposal is to add the following API:

impl CStr {
    pub fn bytes(&self) -> CStrBytes<'_> { /* ... */ }
}
pub struct CStrBytes<'a> {
    ptr: NonNull<u8>,
    phantom: PhantomData<&'a u8>,
}
impl<'a> CStrBytes<'a> {
    pub fn as_c_str(&self) -> &'a CStr { unsafe { self.ptr.ref() } }
}
impl AsRef<CStr> for CStrBytes<'_> {
    fn as_ref(&self) -> &CStr { unsafe { self.ptr.ref() } }
}
impl Iterator for CStrBytes<'_> {
    type Item = u8;

    /* ... */
}
impl FusedIterator for CStrBytes<'_> {}

Future extensions

In the future, an IntoIterator impl could be added to &CStr that returns this iterator, although that's left out for now.

Iterator trait impls

The proposal explicitly recommends not implementing DoubleEndedIterator or ExactSizeIterator since the assumption is that they will not be efficient for the long-term representation of C strings. Additionally, it recommends adding AsRef and an as_c_str method to mirror the API offered by slice::Iter.

Item type

The iterator returns u8 to mirror the items returned by str::bytes. Additionally, yielding &u8 might encourage callers to cast this to a pointer rather than using the as_c_str method to give a reference to the string itself, which has methods dedicated to this purpose.

As an alternative, NonZeroU8 could be used instead, since we know we're not including the nul (see below). However, this would counteract the return value of to_bytes, which may be less than desired. There's always room for a to_nonzero_bytes method, though.

Nul byte

The iterator will not include the nul byte at the end of the string. From the perspective of including the bytes inside the string, that nul byte is not actually part of the string, merely past the end of it.

From an implementation perspective, including the nul byte could also be less efficient. Since the iterator will only store a single pointer, it cannot actually progress past the end of the string, since it would have no way of knowing without dereferencing that potentially invalid address. So, there would have to be either additional data inside the iterator indicating that it's "done," or the pointer could be nulled out when the iterator is done. In both cases, this would incur extra checks during iteration in addition to checking the byte itself to see if it's zero.

Additionally, if we were to null out the pointer after the iterator was finished, calling as_c_str might have to return a reference to an empty string with an address outside the bounds of the original string. This isn't a super big deal, but it's another way it would differ from the slice::Iter implementation.

Iterator type name

The additional ACP #134 is created in an attempt to resolve this issue.

If that ACP is not accepted, naming the type Bytes and exporting it in the core::ffi module would be unacceptable, since OsStr could theoretically have a bytes iterator in the future. So, the type is named CStrBytes to enforce the fact that it's the bytes of a CStr, and not an OsStr.

Links and related work

ffi submodule ACP: ACP: std::ffi::c and std::ffi::os submodules #134
Implementation PR: Add CStr::bytes iterator rust#104353

What happens now?

This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.

The text was updated successfully, but these errors were encountered:

SUPERCILEX · 2022-12-05T07:10:50Z

phantom: PhantomData<&'a u8>

This should probably be [u8] since the phantom "owns" the slice as opposed to a single byte. Though I don't really understand phantoms so I could be totally wrong.

Letting you get back a CStr from the Iterator seems reasonable to me. You can imagine someone iterating until they find a path separator and then getting the remaining CStr.

I kind of like the idea of using NonZeroU8 but also kind of hate it lol. Non zero types are just so dang hard to work with. Agree that this should be punted.

Disagree about the NUL byte stance. I don't know how to resolve the performance implications, but the current API has two methods for returning slices: one with the NUL byte and one without. I don't see a compelling reason for the iterator API to differ.

Overall seems like a good idea.

ChrisDenton · 2022-12-05T14:46:01Z

For what it's worth, I made something a bit similar for wide strings. This were for use in parsing the command line so admittedly it was a single use implementation. I did use NonNull because NonNull::new was pretty convenient to use in the implementation and, iirc, made it simple to show correctness (your mileage may vary on that). The downside was basically needing to use constants for characters otherwise it'd be very verbose. But then I preferred to use constants anyway because we don't have wide character literals (which is not an issue that CStr faces).

clarfonthey · 2022-12-05T18:31:37Z

This should probably be [u8] since the phantom "owns" the slice as opposed to a single byte. Though I don't really understand phantoms so I could be totally wrong.

Yeah, I also don't 100% understand if there's even a difference here. The most important part is making sure the lifetime is there.

I kind of like the idea of using NonZeroU8 but also kind of hate it lol. Non zero types are just so dang hard to work with. Agree that this should be punted.

Yeah, IMHO we should be providing more APIs that have NonZeroU8 for CStr but we currently don't, so 🤷🏻.

Not super in favour of mixed APIs (where whether we use it depends on when it was implemented) but I would definitely like to see nonzero versions. Could potentially offer implementations for From on slices of NonZero types to make this more ergonomic.

Disagree about the NUL byte stance. I don't know how to resolve the performance implications, but the current API has two methods for returning slices: one with the NUL byte and one without. I don't see a compelling reason for the iterator API to differ.

To me, including the zero byte in the slice makes a lot of sense because you'd otherwise have to allocate an entirely new array for it, whereas an iterator can accomplish this by just doing .chain(once(0)), even though this will likely be much slower. We can still potentially expose two iterators anyway, but the actual ergonomics aren't really comparable since iterators are much more flexible than slices.

SUPERCILEX · 2022-12-05T18:58:55Z

Could potentially offer implementations for From on slices of NonZero types to make this more ergonomic.

Or honestly it's probably not that bad to have to write .map(|n| n.get()) if you want to go back to the u8. But then again, it's possible everyone will always do that cuz I'm not seeing a use case for NonZeroU8s from a string.

To me, including the zero byte

Did you mean excluding? Otherwise I'm confused by your sentence. And yeah, that's actually a very good point about chaining. That said, I think including the NUL byte could be a common case. For example, say I want to join a CString and CStr. I would convert the CString to a Vec and extend it with the CStr iterator which would be convenient if it included the NUL byte. I think this could come down to performance: if it differs significantly, then my example above should use the non NUL iterator and then manually push a 0 and that would be fine.

I guess there's nothing that would prevent us from adding an iterator with NUL in the future, so I think it makes sense for this ACP to exclude it for now.

dtolnay · 2023-05-30T18:40:05Z

Seconded. I am on board with a CStrBytes iterator whose item type is u8.

However please stick to just the API surface present already on core::str::Bytes. That type has no equivalent of your proposed as_c_str nor an AsRef impl. It would be fine to propose those for Bytes separately from this ACP, and if accepted, would be easy to justify mirroring on CStrBytes.

I will separately look at #134 (not necessarily right away) but I think this single iterator type would be all right to add independent of that.

clarfonthey · 2023-05-30T19:46:21Z

My main justification for adding the AsRef and as_c_str methods was because I was mirroring slice::Iter rather than str::Bytes. I can remove them from the PR if you still would prefer a separate ACP, I just figured that they would be useful and there's already precedent from another existing API.

Tracking issue opened & PR updated.

…iper Add CStr::bytes iterator See rust-lang/libs-team#135 for an ACP. Since rust-lang/libs-team#134 was also accepted, this type is now `core::ffi::c_str::Bytes` instead of `core::ffi::CStrBytes`.

Rollup merge of rust-lang#104353 - clarfonthey:cstr-bytes-iter, r=cuviper Add CStr::bytes iterator See rust-lang/libs-team#135 for an ACP. Since rust-lang/libs-team#134 was also accepted, this type is now `core::ffi::c_str::Bytes` instead of `core::ffi::CStrBytes`.

Add CStr::bytes iterator See rust-lang/libs-team#135 for an ACP. Since rust-lang/libs-team#134 was also accepted, this type is now `core::ffi::c_str::Bytes` instead of `core::ffi::CStrBytes`.

clarfonthey added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Nov 15, 2022

This was referenced Nov 15, 2022

ACP: std::ffi::c and std::ffi::os submodules #134

Closed

Add CStr::bytes iterator rust-lang/rust#104353

Merged

clarfonthey mentioned this issue Dec 7, 2022

ACP: Additional NonZero conversions #145

Closed

dtolnay closed this as completed May 30, 2023

dtolnay added the ACP-accepted API Change Proposal is accepted (seconded with no objections) label May 30, 2023

clarfonthey mentioned this issue May 30, 2023

Tracking Issue for CStr::bytes rust-lang/rust#112115

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ACP: `CStr::bytes` iterator #135

ACP: `CStr::bytes` iterator #135

clarfonthey commented Nov 15, 2022 •

edited

Loading

SUPERCILEX commented Dec 5, 2022

ChrisDenton commented Dec 5, 2022

clarfonthey commented Dec 5, 2022

SUPERCILEX commented Dec 5, 2022

dtolnay commented May 30, 2023

clarfonthey commented May 30, 2023 •

edited

Loading

ACP: CStr::bytes iterator #135

ACP: CStr::bytes iterator #135

Comments

clarfonthey commented Nov 15, 2022 • edited Loading

Proposal

Problem statement

Motivation, use-cases

Solution sketches

Future extensions

Iterator trait impls

Item type

Nul byte

Iterator type name

Links and related work

What happens now?

SUPERCILEX commented Dec 5, 2022

ChrisDenton commented Dec 5, 2022

clarfonthey commented Dec 5, 2022

SUPERCILEX commented Dec 5, 2022

dtolnay commented May 30, 2023

clarfonthey commented May 30, 2023 • edited Loading

ACP: `CStr::bytes` iterator #135

ACP: `CStr::bytes` iterator #135

clarfonthey commented Nov 15, 2022 •

edited

Loading

clarfonthey commented May 30, 2023 •

edited

Loading