-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve &str <-> &CStr16 lifecycle #73
Comments
I started working on this, eliminating some code duplication between Latin-1 and UCS-2 as a first step. Thinking about it further, this work might also be a good occasion to fix a tricky issue in the way we split text in our Warning: Unicode discussion ahead, please momentarily disregard your brain's sense of sanity. As currently designed, our Code points may not be the right boundary, however, because even UCS-2 has combining character sequences. For example, one way to encode the "â" French letter is a ['a', '^'] code point sequences. Splitting between these code points means that our next string will start with a lone '^' combining character, which the UEFI firmware is almost guaranteed to ignore or mis-render, so we don't want to split those sequences across printed string boundaries. Another example of a char sequence that we probably don't want to split in order to keep UEFI's task easy is "\r\n". To keep this one together too, we need to split our strings on grapheme cluster boundaries, a slightly coarser granularity than combining character sequences which the no_std-compatible If we wanted to be really defensive about UEFI messing up cursive scripts such as Arabic and Devanagari, we could even go one step further and split on word boundaries. However, that would cause problems with long words (and non-word entities such as ASCII art) overflowing our buffer and clogging the output, so I'm willing to trust UEFI to handle split words correctly, and wait for buggy firmware reports from users of such scripts before taking any further defensive action. All things considered, I think best equilibrium between simplicity, handling of diverse inputs, and defensive programming against diverging Unicode interpretation in UEFI firmware, is to split our output on grapheme cluster boundaries. I'll try to implement that as I extract the current |
I now have a satisfactory encoder (UTF-8 -> {Latin1, UCS2}), so I started experimenting with a decoder... and found an interesting issue. Here are the options that I can think about, in order of decreasing preference:
These approaches can be combined, for example by implementing fix (1) in unicode_segmentation and workaround (2) in uefi_rs, with the aim of eventually moving to (1) when the next release of unicode_segmentation comes out. This is probably what I'll try doing unless you have a better idea. |
I think exposing this in the It definitely feels like we should maintain the UTF-8 <-> UCS-2 conversions in a separate crate to simplify testing and development. I don't know if there are other legacy APIs which still use UCS-2, but perhaps other Rust users will find a use for the code. I already have a |
I like the idea of extracting the legacy text handling to a crate, but not the scope of the ucs-2 crate:
With this in mind, I would propose to build a more general crate for fixed-width-char string handling, with an initial focus on Latin1/UCS-2 strings with Windows line endings. Would you like that? |
It sounds like an extensive project, but if you already have an idea for a more general crate then I see no issue with using it for |
On the grapheme clustering front, I reached the conclusion that the So I'm probably going back to plan 2: encode a chunk of the UCS-2 input to UTF-8 and run the grapheme segmentation algorithm on this incomplete output. |
Okay, 6+ weeks have elapsed and I have not managed to make any significant progress on this, or any of my other In case anyone else feels like taking over this effort, I pushed a branch called |
Summarizing the current status a bit. Of the utilities mentioned into the issue description, some of them now exist: Cstr16::from_str_with_buf It be good to go through and fill in some additional conversions, like It looks like there are still some implicit For the rest of the discussion about incorrectly splitting combining characters, that seems like its own big can of worms :) |
Since `CStr16` implements `Display`, the [`ToString`](https://doc.rust-lang.org/alloc/string/trait.ToString.html) trait is automatically implemented. That provides `to_string` as long as `alloc` is enabled, so `as_string` isn't needed. rust-osdev#73
Since `CStr16` implements `Display`, the [`ToString`](https://doc.rust-lang.org/alloc/string/trait.ToString.html) trait is automatically implemented. That provides `to_string` as long as `alloc` is enabled, so `as_string` isn't needed. #73
Changed it to take a `&CStr16` rather than implicitly converting from a `&str`. #73
Changed it to take a `&CStr16` rather than implicitly converting from a `&str`. #73
The file info types are now constructed directly from a `&CStr16` instead of internally converting from a `&str`. rust-osdev#73
The file info types are now constructed directly from a `&CStr16` instead of internally converting from a `&str`. rust-osdev#73
The file info types are now constructed directly from a `&CStr16` instead of internally converting from a `&str`. This also removes `FileInfoCreationError::InvalidChar`, since that error is no longer possible. rust-osdev#73
The file info types are now constructed directly from a `&CStr16` instead of internally converting from a `&str`. This also removes `FileInfoCreationError::InvalidChar`, since that error is no longer possible. #73
The spec describes `LoadOptions` as "A pointer to the image’s binary load options. See the OptionalData parameter in [section 3.1.3 Load Options] of the Boot Manager chapter for information on the source of the LoadOptions data." The `OptionalData` field is described as "The remaining bytes in the load option descriptor are a binary data buffer that is passed to the loaded image." So there's no guarantee made about the contents of `LoadOptions`; it's arbitrary binary data as far as the UEFI spec is concerned. So it's not necessarily safe to treat the load_options as a `Char16` pointer, since it might not be aligned and might not be null-terminated. That said, the data is *usually* a null-terminated UCS-2 string. The UEFI Shell spec specifies that command-line parameters are passed that way. To safely support both cases, change the `load_options` field to a `u8` pointer, drop the `load_options` method that converts the data to a `&str`, and add two new methods: `load_options_as_bytes` and `load_options_as_cstr16`. The `set_load_options` has also been changed to take a `u8` pointer instead of a `Char16` pointer. rust-osdev#73
The spec describes `LoadOptions` as "A pointer to the image’s binary load options. See the OptionalData parameter in [section 3.1.3 Load Options] of the Boot Manager chapter for information on the source of the LoadOptions data." The `OptionalData` field is described as "The remaining bytes in the load option descriptor are a binary data buffer that is passed to the loaded image." So there's no guarantee made about the contents of `LoadOptions`; it's arbitrary binary data as far as the UEFI spec is concerned. So it's not necessarily safe to treat the load_options as a `Char16` pointer, since it might not be aligned and might not be null-terminated. That said, the data is *usually* a null-terminated UCS-2 string. The UEFI Shell spec specifies that command-line parameters are passed that way. To safely support both cases, change the `load_options` field to a `u8` pointer, drop the `load_options` method that converts the data to a `&str`, and add two new methods: `load_options_as_bytes` and `load_options_as_cstr16`. The `set_load_options` has also been changed to take a `u8` pointer instead of a `Char16` pointer. rust-osdev#73
The spec describes `LoadOptions` as "A pointer to the image’s binary load options. See the OptionalData parameter in [section 3.1.3 Load Options] of the Boot Manager chapter for information on the source of the LoadOptions data." The `OptionalData` field is described as "The remaining bytes in the load option descriptor are a binary data buffer that is passed to the loaded image." So there's no guarantee made about the contents of `LoadOptions`; it's arbitrary binary data as far as the UEFI spec is concerned. So it's not necessarily safe to treat the load_options as a `Char16` pointer, since it might not be aligned and might not be null-terminated. That said, the data is *usually* a null-terminated UCS-2 string. The UEFI Shell spec specifies that command-line parameters are passed that way. To safely support both cases, change the `load_options` field to a `u8` pointer, drop the `load_options` method that converts the data to a `&str`, and add two new methods: `load_options_as_bytes` and `load_options_as_cstr16`. The `set_load_options` has also been changed to take a `u8` pointer instead of a `Char16` pointer. rust-osdev#73
The spec describes `LoadOptions` as "A pointer to the image’s binary load options. See the OptionalData parameter in [section 3.1.3 Load Options] of the Boot Manager chapter for information on the source of the LoadOptions data." The `OptionalData` field is described as "The remaining bytes in the load option descriptor are a binary data buffer that is passed to the loaded image." So there's no guarantee made about the contents of `LoadOptions`; it's arbitrary binary data as far as the UEFI spec is concerned. So it's not necessarily safe to treat the load_options as a `Char16` pointer, since it might not be aligned and might not be null-terminated. That said, the data is *usually* a null-terminated UCS-2 string. The UEFI Shell spec specifies that command-line parameters are passed that way. To safely support both cases, change the `load_options` field to a `u8` pointer, drop the `load_options` method that converts the data to a `&str`, and add two new methods: `load_options_as_bytes` and `load_options_as_cstr16`. The `set_load_options` has also been changed to take a `u8` pointer instead of a `Char16` pointer. rust-osdev#73
The spec describes `LoadOptions` as "A pointer to the image’s binary load options. See the OptionalData parameter in [section 3.1.3 Load Options] of the Boot Manager chapter for information on the source of the LoadOptions data." The `OptionalData` field is described as "The remaining bytes in the load option descriptor are a binary data buffer that is passed to the loaded image." So there's no guarantee made about the contents of `LoadOptions`; it's arbitrary binary data as far as the UEFI spec is concerned. So it's not necessarily safe to treat the load_options as a `Char16` pointer, since it might not be aligned and might not be null-terminated. That said, the data is *usually* a null-terminated UCS-2 string. The UEFI Shell spec specifies that command-line parameters are passed that way. To safely support both cases, change the `load_options` field to a `u8` pointer, drop the `load_options` method that converts the data to a `&str`, and add two new methods: `load_options_as_bytes` and `load_options_as_cstr16`. The `set_load_options` has also been changed to take a `u8` pointer instead of a `Char16` pointer. #73
I did some research on the combining-characters issue and, tl;dr I think we can safely ignore it.
I did some tests with combining characters, and the two UEFI implementations I tried them on always ignored the combining character. For example, 0x61 is a lower case I tried running the test on real hardware as well. Test machine was a Lenovo Thinkpad X1 Carbon 9th Gen, and it also did not render the accent when written as a combining character. Click to show test codefn test_char_split(image: Handle, st: &SystemTable<Boot>) {
let bt = st.boot_services();
let output_handle = bt.find_handles::<Output>().unwrap_success()[0];
let output = bt
.open_protocol::<Output>(
OpenProtocolParams {
handle: output_handle,
agent: image,
controller: None,
},
OpenProtocolAttributes::Exclusive,
)
.unwrap_success();
let output = unsafe { &mut *output.interface.get() };
#[rustfmt::skip]
let s = CStr16::from_u16_with_nul(&[
// à
0x00e0,
// a
0x0061,
// combining accent
0x0300,
// CR
0x0d,
// LF
0x0a,
// Null
0x0000
]).unwrap();
output.output_string(s).unwrap_success();
bt.stall(10_000_000);
} Of course that's just two UEFI implementations -- it is entirely possible that some other implementations do handle combining characters, but in the absence of evidence of such implementations I don't think we need to bother with a more complex implementation of I also found evidence that conforming UCS-2 implementations are explicitly allowed to ignore combining characters. There are three implementation levels defined, and at level 1 all combining characters can be ignored. I couldn't find anything in the UEFI spec itself indicating what level is supposed to be supported, so it seems conforming UEFI firmware is fine using level 1. It might be useful in some applications to normalize a string such that combining characters are converted to composite characters, but I think that doesn't need to be directly handled in uefi-rs itself. Assuming I haven't gotten my reasoning above wrong, I think everything here is done. I've grepped the source for |
When interfacing Rust with UEFI, one of the API mismatches that we need to take care of is that Rust uses UTF-8 Pascal-style strings while UEFI uses UCS-2 null-terminated strings.
Currently, we have...
CStr16
type for handling borrowed UEFI strings more conveniently and safely.&CStr16
.&str
and silently convert to UCS-2.File::open
), this restricts possible inputs due to finite buffer size.I expect interoperability with UCS-2 to be a common problem in realistic UEFI application, so we may want to improve upon this situation by 1/standardizing conversions between UTF-8 and UCS-2 and 2/eliminating "hidden" conversions between the two whenever possible.
Coming from this perspective, here are some ideas at the core uefi-rs layer...
&str
and an&mut [u16]
as an input, and returns a&CStr16
, a char conversion error, or a "buffer full" error as an output.&CStr16
+&mut [u8]
to&str
or similar errors as above....and more ideas for the convenience
uefi-exts
layer, where allocations are allowed:TryFrom<&CStr16>
implementation forString
.CString16
type with aTryFrom<&str>
implementation.Once we have that, automagic conversion from
&str
to UCS-2 should be eliminated whenever possible, in favor of taking a&CStr16
.If the user has enabled uefi-services conveniences, then converting from
&str
to&CStr16
is just a.try_from()
away. And if the user wants to stick with the core crate, calling the underlying conversion utilities is more involved, but not terribly difficult.The text was updated successfully, but these errors were encountered: