Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ACP: CStr::bytes iterator #135

Closed
clarfonthey opened this issue Nov 15, 2022 · 6 comments
Closed

ACP: CStr::bytes iterator #135

clarfonthey opened this issue Nov 15, 2022 · 6 comments
Labels
ACP-accepted API Change Proposal is accepted (seconded with no objections) api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@clarfonthey
Copy link

clarfonthey commented Nov 15, 2022

Proposal

Problem statement

Many methods of CStr mention that methods like to_bytes will eventually require a scan of the string, even though the string is currently represented internally as a slice and has efficient implementations now. However, no existing methods offer a way to access the underlying string in a future-compatible way.

Motivation, use-cases

In my particular case, I was trying to create a method that was generic over strings, byte slices, and C strings, and used iterators to achieve this. I noticed that CString didn't have any of its own iterator types and wanted to rectify this.

Solution sketches

The proposal is to add the following API:

impl CStr {
    pub fn bytes(&self) -> CStrBytes<'_> { /* ... */ }
}
pub struct CStrBytes<'a> {
    ptr: NonNull<u8>,
    phantom: PhantomData<&'a u8>,
}
impl<'a> CStrBytes<'a> {
    pub fn as_c_str(&self) -> &'a CStr { unsafe { self.ptr.ref() } }
}
impl AsRef<CStr> for CStrBytes<'_> {
    fn as_ref(&self) -> &CStr { unsafe { self.ptr.ref() } }
}
impl Iterator for CStrBytes<'_> {
    type Item = u8;

    /* ... */
}
impl FusedIterator for CStrBytes<'_> {}

Future extensions

In the future, an IntoIterator impl could be added to &CStr that returns this iterator, although that's left out for now.

Iterator trait impls

The proposal explicitly recommends not implementing DoubleEndedIterator or ExactSizeIterator since the assumption is that they will not be efficient for the long-term representation of C strings. Additionally, it recommends adding AsRef and an as_c_str method to mirror the API offered by slice::Iter.

Item type

The iterator returns u8 to mirror the items returned by str::bytes. Additionally, yielding &u8 might encourage callers to cast this to a pointer rather than using the as_c_str method to give a reference to the string itself, which has methods dedicated to this purpose.

As an alternative, NonZeroU8 could be used instead, since we know we're not including the nul (see below). However, this would counteract the return value of to_bytes, which may be less than desired. There's always room for a to_nonzero_bytes method, though.

Nul byte

The iterator will not include the nul byte at the end of the string. From the perspective of including the bytes inside the string, that nul byte is not actually part of the string, merely past the end of it.

From an implementation perspective, including the nul byte could also be less efficient. Since the iterator will only store a single pointer, it cannot actually progress past the end of the string, since it would have no way of knowing without dereferencing that potentially invalid address. So, there would have to be either additional data inside the iterator indicating that it's "done," or the pointer could be nulled out when the iterator is done. In both cases, this would incur extra checks during iteration in addition to checking the byte itself to see if it's zero.

Additionally, if we were to null out the pointer after the iterator was finished, calling as_c_str might have to return a reference to an empty string with an address outside the bounds of the original string. This isn't a super big deal, but it's another way it would differ from the slice::Iter implementation.

Iterator type name

The additional ACP #134 is created in an attempt to resolve this issue.

If that ACP is not accepted, naming the type Bytes and exporting it in the core::ffi module would be unacceptable, since OsStr could theoretically have a bytes iterator in the future. So, the type is named CStrBytes to enforce the fact that it's the bytes of a CStr, and not an OsStr.

Links and related work

What happens now?

This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.

@clarfonthey clarfonthey added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Nov 15, 2022
@SUPERCILEX
Copy link

phantom: PhantomData<&'a u8>

This should probably be [u8] since the phantom "owns" the slice as opposed to a single byte. Though I don't really understand phantoms so I could be totally wrong.


Letting you get back a CStr from the Iterator seems reasonable to me. You can imagine someone iterating until they find a path separator and then getting the remaining CStr.


I kind of like the idea of using NonZeroU8 but also kind of hate it lol. Non zero types are just so dang hard to work with. Agree that this should be punted.


Disagree about the NUL byte stance. I don't know how to resolve the performance implications, but the current API has two methods for returning slices: one with the NUL byte and one without. I don't see a compelling reason for the iterator API to differ.


Overall seems like a good idea.

@ChrisDenton
Copy link
Member

For what it's worth, I made something a bit similar for wide strings. This were for use in parsing the command line so admittedly it was a single use implementation. I did use NonNull because NonNull::new was pretty convenient to use in the implementation and, iirc, made it simple to show correctness (your mileage may vary on that). The downside was basically needing to use constants for characters otherwise it'd be very verbose. But then I preferred to use constants anyway because we don't have wide character literals (which is not an issue that CStr faces).

@clarfonthey
Copy link
Author

This should probably be [u8] since the phantom "owns" the slice as opposed to a single byte. Though I don't really understand phantoms so I could be totally wrong.

Yeah, I also don't 100% understand if there's even a difference here. The most important part is making sure the lifetime is there.

I kind of like the idea of using NonZeroU8 but also kind of hate it lol. Non zero types are just so dang hard to work with. Agree that this should be punted.

Yeah, IMHO we should be providing more APIs that have NonZeroU8 for CStr but we currently don't, so 🤷🏻.

Not super in favour of mixed APIs (where whether we use it depends on when it was implemented) but I would definitely like to see nonzero versions. Could potentially offer implementations for From on slices of NonZero types to make this more ergonomic.

Disagree about the NUL byte stance. I don't know how to resolve the performance implications, but the current API has two methods for returning slices: one with the NUL byte and one without. I don't see a compelling reason for the iterator API to differ.

To me, including the zero byte in the slice makes a lot of sense because you'd otherwise have to allocate an entirely new array for it, whereas an iterator can accomplish this by just doing .chain(once(0)), even though this will likely be much slower. We can still potentially expose two iterators anyway, but the actual ergonomics aren't really comparable since iterators are much more flexible than slices.

@SUPERCILEX
Copy link

Could potentially offer implementations for From on slices of NonZero types to make this more ergonomic.

Or honestly it's probably not that bad to have to write .map(|n| n.get()) if you want to go back to the u8. But then again, it's possible everyone will always do that cuz I'm not seeing a use case for NonZeroU8s from a string.

To me, including the zero byte

Did you mean excluding? Otherwise I'm confused by your sentence. And yeah, that's actually a very good point about chaining. That said, I think including the NUL byte could be a common case. For example, say I want to join a CString and CStr. I would convert the CString to a Vec and extend it with the CStr iterator which would be convenient if it included the NUL byte. I think this could come down to performance: if it differs significantly, then my example above should use the non NUL iterator and then manually push a 0 and that would be fine.

I guess there's nothing that would prevent us from adding an iterator with NUL in the future, so I think it makes sense for this ACP to exclude it for now.

@dtolnay
Copy link
Member

dtolnay commented May 30, 2023

Seconded. I am on board with a CStrBytes iterator whose item type is u8.

However please stick to just the API surface present already on core::str::Bytes. That type has no equivalent of your proposed as_c_str nor an AsRef impl. It would be fine to propose those for Bytes separately from this ACP, and if accepted, would be easy to justify mirroring on CStrBytes.

I will separately look at #134 (not necessarily right away) but I think this single iterator type would be all right to add independent of that.

@dtolnay dtolnay closed this as completed May 30, 2023
@dtolnay dtolnay added the ACP-accepted API Change Proposal is accepted (seconded with no objections) label May 30, 2023
@clarfonthey
Copy link
Author

clarfonthey commented May 30, 2023

My main justification for adding the AsRef and as_c_str methods was because I was mirroring slice::Iter rather than str::Bytes. I can remove them from the PR if you still would prefer a separate ACP, I just figured that they would be useful and there's already precedent from another existing API.

Tracking issue opened & PR updated.

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Mar 13, 2024
…iper

Add CStr::bytes iterator

See rust-lang/libs-team#135 for an ACP.

Since rust-lang/libs-team#134 was also accepted, this type is now `core::ffi::c_str::Bytes` instead of `core::ffi::CStrBytes`.
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Mar 13, 2024
…iper

Add CStr::bytes iterator

See rust-lang/libs-team#135 for an ACP.

Since rust-lang/libs-team#134 was also accepted, this type is now `core::ffi::c_str::Bytes` instead of `core::ffi::CStrBytes`.
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Mar 13, 2024
…iper

Add CStr::bytes iterator

See rust-lang/libs-team#135 for an ACP.

Since rust-lang/libs-team#134 was also accepted, this type is now `core::ffi::c_str::Bytes` instead of `core::ffi::CStrBytes`.
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Mar 14, 2024
…iper

Add CStr::bytes iterator

See rust-lang/libs-team#135 for an ACP.

Since rust-lang/libs-team#134 was also accepted, this type is now `core::ffi::c_str::Bytes` instead of `core::ffi::CStrBytes`.
jhpratt added a commit to jhpratt/rust that referenced this issue Mar 14, 2024
…iper

Add CStr::bytes iterator

See rust-lang/libs-team#135 for an ACP.

Since rust-lang/libs-team#134 was also accepted, this type is now `core::ffi::c_str::Bytes` instead of `core::ffi::CStrBytes`.
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Mar 14, 2024
…iper

Add CStr::bytes iterator

See rust-lang/libs-team#135 for an ACP.

Since rust-lang/libs-team#134 was also accepted, this type is now `core::ffi::c_str::Bytes` instead of `core::ffi::CStrBytes`.
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Mar 14, 2024
…iper

Add CStr::bytes iterator

See rust-lang/libs-team#135 for an ACP.

Since rust-lang/libs-team#134 was also accepted, this type is now `core::ffi::c_str::Bytes` instead of `core::ffi::CStrBytes`.
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Mar 14, 2024
Rollup merge of rust-lang#104353 - clarfonthey:cstr-bytes-iter, r=cuviper

Add CStr::bytes iterator

See rust-lang/libs-team#135 for an ACP.

Since rust-lang/libs-team#134 was also accepted, this type is now `core::ffi::c_str::Bytes` instead of `core::ffi::CStrBytes`.
bors pushed a commit to rust-lang/miri that referenced this issue Mar 15, 2024
Add CStr::bytes iterator

See rust-lang/libs-team#135 for an ACP.

Since rust-lang/libs-team#134 was also accepted, this type is now `core::ffi::c_str::Bytes` instead of `core::ffi::CStrBytes`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ACP-accepted API Change Proposal is accepted (seconded with no objections) api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api
Projects
None yet
Development

No branches or pull requests

4 participants