-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Make \xNN mean utf8 code unit, not unicode codepoint. #2800
Comments
I think Unicode code points are much more intuitive to work with, not having to deal with implementation details of some specific encoding. |
Our strings definitely are utf8, it's not just "some specific encoding". We're very much exposing that and expecting programmers to know what that means. As they know what 2s complement integers (not auto-expanding-to-bignums) and 754 floating point (not rationals) are and what they do. If you want an array of unicode codepoints, that's That said, I'm somewhat sympathetic to the arguments about which way to do this. Followup is on the list, over in the thread that created this bug: https://mail.mozilla.org/pipermail/rust-dev/2012-July/002024.html Yes, this is part of the "utf8 monoculture" some people despise, but I am somewhat unrepentant about it. I think it's as stable, flexible and long-term an encoding as we're likely to see for years; the only plausible competitor on the horizon is GB18030 and it even covers different codepoints, so it's not really fair to consider it a "different encoding", it's a whole different charset. And, in any case, my experience is that the harm done to language users, especially systems-language users, by being ambiguous about the in-memory meaning of literals in program text far outweighs the harm done by picking some particular unambiguous interpretation. IOW on this topic I think the risk of underspecification is higher than the risk of overspecification. It would be more useful to support multiple-explicit-encodings -- even permit tagging a whole file as written-to-a-different-default-encoding -- if that every becomes a real concern, than to throw our hands up about the encoding and say "strings are implementation-specific!" Incidentally, it should be trivial to write a syntax extension that maps from encoding-to-encoding at compile time, i.e. one that lets you write |
Thank you for the explanation, it makes a lot of sense. |
Definitely. Some machinery exists in core for handling basic tasks associated with strings in the various operating-system-required encodings; more will wind up in libstd, likely a binding to libicu. |
Nominated for backwards compatible |
accepted for backwards-compatible milestone |
I agree that |
cc me |
I disagree. I don’t see a reason to use code units in (Unicode) literal strings. Why would you want If the concern is that If \xNN is still changed to represent code units, literals that contain invalid UTF-8 like |
I agree with that, @SimonSapin. The reason I wanted it was for byte string literals, but at the time |
Yes, I also want byte literals (and found this when searching for that.) |
If |
I guess the strongest argument in favor of code units is Behdad's, that it would make our string escapes compatible with C and Python. |
Behdad's full argument: Here: "\xHH, \uHHHH, \UHHHHHHHH Unicode escapes", I strongly suggest that |
I don’t know about C, but Behdad’s argument does not apply to Python. Python (both in 2.x and 3.x) has two types of strings: byte strings, where \xHH is a byte and \uHHHH is not an escape sequence; and Unicode strings where \xHH is a code point and |
There doesn't seem to be a definitive argument for either side, and since changing these to be code units makes their validation slightly harder, I'm inclined to just leave as-is and call it done. |
@SimonSapin indeed, I think graydon said the same thing in his initial response to Behdad, along with providing a more complete table of what different languages do here (Though that table is missing C# it seems.) So is Rust going to be more like python and scheme, or more like perl, go, C, C++, ruby... ? |
(having said that, I'm fine with brson's suggestion to leave things as they are.) |
In today's meeting we decided to leave this as is. |
Extends the function contract functionality with a `modifies` clause. The design is different from rust-lang#2594 but serves a similar purpose. The `modifies` clause allows the user to specify which parts of a structure a function may assign to. Essentially refining the `mut` annotation. We allow arbitrary (side-effect free) expressions in the `modifies` clause. The expressions are evaluated as part of the preconditions and passed to the function-under-verification as additional arguments. CBMC is then instructed to check that those locations are assigned. Aliasing means that this also adds the location in the original structure to the write set. Each expression must return a pointer to a value that implements `Arbitrary`. On replacement we then simply assign `*ptr = kani::any()`, relying again on aliasing to update the original structure. Additional tests for the new functionality are provided. Resolves rust-lang#2594 ## Open Questions ### API divergence from CBMC (accepted) The current design goes roughly as follows: We start with a `modifies` annotation on a function ```rs #[modifies(obj.some_expr())] fn target(obj: ...) { ... } ``` And from this we generate code to the effect of (simplified here) ```rs fn target_check(obj: ...) { // Undo the lifetime entanglements let modified_1 = std::mem::transmute::<&'a _, &'b _>(obj.some_expr()); target_wrapper(obj, modified_1); } #[cbmc::assigns(*modified_1)] fn target_wrapper(obj: ..., modified_1: &impl kani::Arbitrary) { ... } ``` Unlike CBMC we expect `obj.some_expr()` to be of a **pointer type** (`*const`, `*mut`, `&mut` or `&`) that points to the object which is target of the modification. So if we had a `t : &mut T` that was modified, CBMC would expect its assigns clause to say `*t`, but we expect `t` (no dereference). The reason is that the code we generate uses the workaround of creating an alias to whichever part of `obj` is modified and registers the alias with CBMC (thereby registering the original also). If we generated code where the right side of `let modified_1 =` is not of pointer type, then the object is moved to the stack and the aliasing destroyed. The open questions is whether we are happy with this change in API. (Yes) ### Test cases when expressions are used in the clause. With more complex expressions in the modifies clause it becomes hard to define good test cases because they reference generated code as in this case: ```rs #[kani::requires(**ptr < 100)] #[kani::modifies(ptr.as_ref())] fn modify(ptr: &mut Box<u32>) { *ptr.as_mut() += 1; } ``` This passes (as it should) and when commenting out the `modifies` clause we get this error: ``` Check 56: modify_wrapper_895c4e.assigns.2 - Status: FAILURE - Description: "Check that *var_2 is assignable" - Location: assigns_expr_pass.rs:8:5 in function modify_wrapper_895c4e ``` The information in this error is very non-specific, hard to read and brittle. How should we define robust "expected" test cases for such errors? ### Corner Cases / Future Improvements - rust-lang#2907 - rust-lang#2908 - rust-lang#2909 ## TODOs - [ ] Test Cases where the clause contains - [x] `Rc` + (`RefCell` or `unsafe`) (see rust-lang#2907) - [x] Fields - [x] Statement expressions - [x] `Vec` (see rust-lang#2909) - [ ] Fat pointers - [ ] update contracts documentation - [x] Make sure the wrapper arguments are unique. - [x] Ensure `nondet-static-exclude` always uses the correct filepath (relative or absolute) - [ ] Test case for multiple `modifies` clauses. By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 and MIT licenses. --------- Co-authored-by: Zyad Hassan <[email protected]> Co-authored-by: Felipe R. Monteiro <[email protected]>
There's not a lot of consensus on this between languages, but the C and C++ paths (also perl, go, and at least python3 'bytes' literals, though not 'string') treat this escape as a code unit, not a codepoint.
The text was updated successfully, but these errors were encountered: