Are unicode prefixes in string literals necessary? #668

msadeqhe · 2023-09-09T06:51:09Z

Currently we have unicode prefixes in string literals:

name: = u8"Someone";

What would be wrong if we could call functions or constructors instead of unicode prefixes?

// `u8` is a function or a type alias (std::u8string)
name: = u8("Someone");

Or even with UFCS:

// `u8` is a function or a type alias (std::u8string)
name: = "Someone".u8();

Or directly within declaration:

// `u8` is a type alias (std::u8string)
name: u8 = "Someone";

Or simply with : syntax:

// `u8` is a type alias (std::u8string)
info: = call(: u8 = "Someone");

Essentially, Cpp2 may be able to drop all Unicode prefixes, which would reduce the concept count.

JohelEGP · 2023-09-09T11:47:41Z

Remember that Herb doesn't want to get his hands dirty with Unicode yet.

u8 already means cpp2::u8 (std::uint8_t),
so none of your suggestions can work as-is.
Maybe we can use utf𝘕 instead.

We could make the Unicode prefixes work like the built-in integer literals.
But then, how do you translate Cpp1 u8""s to Cpp2?
s is already a UDL for std::string,
but the user wants to specify the input string literal as UTF-8.
Cpp2 would have to invent new syntax divergent from Cpp1 (e.g., (""utf8)s),
which goes against its goals.
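
To make the `u8""s` concern concrete, here is a small Cpp1 sketch (assuming C++17 or later; the `u8` check only applies where the library provides `std::u8string`, i.e. C++20). It shows that in Cpp1 the encoding prefix, together with the `s` suffix, decides which `std::basic_string` specialization the literal produces.

```cpp
#include <string>
#include <type_traits>

using namespace std::string_literals;

// The same `s` suffix yields a different string type for each encoding prefix.
static_assert(std::is_same_v<decltype("text"s),  std::string>);
static_assert(std::is_same_v<decltype(u"text"s), std::u16string>);
static_assert(std::is_same_v<decltype(U"text"s), std::u32string>);

#if defined(__cpp_lib_char8_t)
// Only checked where std::u8string exists (C++20 library support).
static_assert(std::is_same_v<decltype(u8"text"s), std::u8string>);
#endif
```

Any Cpp2 replacement for the prefixes would need a way to express both pieces of information at once: the encoding and the user-defined literal.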

Apparently, this is the mapping (https://en.cppreference.com/w/cpp/language/string_literal):

encoding-prefix	Kind of string literal
(none)	Ordinary
`L`	Wide
`u8`	UTF-8
`u`	UTF-16
`U`	UTF-32
`R`	Raw (not an encoding prefix; it can be combined with any of the above)
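
As a quick check of this mapping, the following Cpp1 sketch (assuming C++17 or later) asserts the array type each prefix produces. The names `raw_size` and `cooked_size` are only illustrative. The `u8` case is guarded because its element type changed from `char` to `char8_t` in C++20.

```cpp
#include <type_traits>

// A string literal is an lvalue array, so decltype yields a reference to array.
static_assert(std::is_same_v<decltype("x"),  const char (&)[2]>);      // ordinary
static_assert(std::is_same_v<decltype(L"x"), const wchar_t (&)[2]>);   // wide
static_assert(std::is_same_v<decltype(u"x"), const char16_t (&)[2]>);  // UTF-16
static_assert(std::is_same_v<decltype(U"x"), const char32_t (&)[2]>);  // UTF-32

#if defined(__cpp_char8_t)
static_assert(std::is_same_v<decltype(u8"x"), const char8_t (&)[2]>);  // C++20 and later
#else
static_assert(std::is_same_v<decltype(u8"x"), const char (&)[2]>);     // before C++20
#endif

// R only disables escape processing and combines with any encoding prefix.
constexpr auto raw_size    = sizeof(R"(a\n)");  // three characters plus the terminator
constexpr auto cooked_size = sizeof("a\n");     // two characters plus the terminator
static_assert(raw_size == 4 && cooked_size == 3);
static_assert(std::is_same_v<decltype(UR"(x)"), const char32_t (&)[2]>);
```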

According to https://tzlaine.github.io/unicode_cppnow_2023/#/0/42/2,
from Applying the Lessons of std::ranges to Unicode in the C++ Standard Library - Zach Laine CppNow 2023,
L means UTF-16 on Windows.
The talk provides hints for further simplifications,
but that is still not part of the C++ Standard library,
which is still settling its handling of Unicode.

0 replies

msadeqhe · 2023-09-09T12:08:02Z

My thinking is that we can already declare integer variables without suffixes:

x: i32 = 10;
y: longlong = 10;

Why can we not just declare string variables without prefixes?

//       * const char8_t = "text";
// basic_string<char8_t> = "text";
x: u8string = "text";

//       * const char = "text";
// basic_string<char> = "text";
y: string = "text";

We could make the Unicode prefixes work like the built-in integer literals. But then, how do you translate Cpp1 u8""s to Cpp2? s is already a UDL for std::string, but the user wants to specify the input string literal as UTF-8. Cpp2 would have to invent new syntax divergent from Cpp1 (e.g., (""utf8)s), which goes against its goals.

I mean, can we just write string(utf8("")) or directly (with UFCS) "".u8string()?

Cppfront can translate this Cpp2 code:

a: u8string = "text";

to this Cpp1 code:

u8string a{u8"text"};

In this way, it won't need a new syntax divergent from Cpp1.
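
As a point of comparison, the idea that the declared type picks the prefix can be roughly emulated in Cpp1 today with a well-known macro trick. This is only a sketch: `pick_literal`, `TYPED_LITERAL`, and `make_name` are hypothetical names, not part of cppfront or the standard library, and `u8` is left out so that the code also compiles as C++17.

```cpp
#include <string>
#include <type_traits>

// Hypothetical helper: returns the spelling of the literal that matches Char.
template <class Char>
constexpr const Char* pick_literal(const char* narrow, const wchar_t* wide,
                                   const char16_t* utf16, const char32_t* utf32) {
    if constexpr (std::is_same_v<Char, char>) { return narrow; }
    else if constexpr (std::is_same_v<Char, wchar_t>) { return wide; }
    else if constexpr (std::is_same_v<Char, char16_t>) { return utf16; }
    else {
        static_assert(std::is_same_v<Char, char32_t>, "unsupported character type");
        return utf32;
    }
}

// Token pasting spells the same text with every prefix; the template keeps one.
#define TYPED_LITERAL(Char, text) pick_literal<Char>(text, L##text, u##text, U##text)

// Usage: the target character type, not a prefix at the call site, selects the encoding.
inline std::u16string make_name() { return TYPED_LITERAL(char16_t, "Someone"); }
```

The proposal in this comment would let the compiler do this selection itself, based on the declared type, without a macro.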

4 replies

JohelEGP Sep 9, 2023

That sounds like #45, but jumping straight into std::basic_string.

AbhinavK00 Sep 9, 2023

I read the conversation on that issue, and the comment by willray was something I really agree with, but it got no response. Cpp2 should improve language arrays, but that's unrelated to this issue. Maybe do something similar to #637 and then discuss how to have string literals be just an array of UTF-8 chars (I'm not an expert on Unicode, so I have no suggestion about the latter).

msadeqhe Sep 12, 2023
Author

@JohelEGP, The underlying type can be anything. I just used basic_string as an example. So in general:

a: * const char8_t = "text";
b: basic_string<char8_t> = "text";
c: u8string = "text";

func: (a: * const char) = {
    print("char");
}

func: (a: * const char8_t) = {
    print("char8_t");
}

func("text"); // prints "char"

utf8: type == * const char8_t;
func(utf8("text")); // prints "char8_t"
func("text".utf8()); // prints "char8_t"
func(: utf8 = "text"); // prints "char8_t"
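
For comparison, this is roughly how the same overload selection looks in Cpp1 today, where the prefix on the literal is what picks the overload. It is a sketch assuming C++17 or later; the `Encoding` enum is a hypothetical stand-in for the `print` calls so the result can be checked at compile time, and the `char8_t` overload is guarded because it needs C++20.

```cpp
// Hypothetical result type so each overload can report which one was chosen.
enum class Encoding { ordinary, utf8, utf16, utf32 };

constexpr Encoding func(const char*)     { return Encoding::ordinary; }
constexpr Encoding func(const char16_t*) { return Encoding::utf16; }
constexpr Encoding func(const char32_t*) { return Encoding::utf32; }

// The literal's prefix fixes its character type, which selects the overload.
static_assert(func("text")  == Encoding::ordinary);
static_assert(func(u"text") == Encoding::utf16);
static_assert(func(U"text") == Encoding::utf32);

#if defined(__cpp_char8_t)
constexpr Encoding func(const char8_t*) { return Encoding::utf8; }
static_assert(func(u8"text") == Encoding::utf8);
#endif
```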

JohelEGP Sep 12, 2023

Oh, so it's still a (Cpp1) string literal.
Somehow, I thought you intended to make them std::string literals.

SebastianTroy · 2023-09-12T07:17:52Z

If I were using a unicode editor, and I put unicode characters between the quotes, I'd expect the compiler to warn me that my char* string literal contained illegal characters, but my unicode string literal did not. I'm not sure if this is possible, but if it is, would the prefix on the string be required to ascertain the string type without the compiler needing wider context?

In your examples it is easy, but perhaps there are less direct assignments that could require context? Like using the same literal in an if statement to assign to different class members which have different string types? Or a templated out parameter? Or an automatically deduced type based on a literal with only unicode characters between the quotes (was this intentional? Or is the user using a unicode text editor and they just want the char equivalent string?)

Not that I don't like your idea, just trying to find faults for discussion.
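
The kind of check described here can at least be approximated in Cpp1 user code. The sketch below (assuming C++17 or later) uses a hypothetical `is_ascii` helper that rejects an ordinary `char` literal containing any byte above 0x7F at compile time. It is an illustration only, not something cppfront or any compiler is known to do for you.

```cpp
#include <cstddef>

// Hypothetical helper: true if every byte of the literal is 7-bit ASCII.
template <std::size_t N>
constexpr bool is_ascii(const char (&text)[N]) {
    for (std::size_t i = 0; i < N; ++i) {
        if (static_cast<unsigned char>(text[i]) > 0x7F) { return false; }
    }
    return true;
}

static_assert(is_ascii("plain text"));
// The bytes C3 A9 are the UTF-8 encoding of U+00E9, written as escapes
// so that the example does not depend on the source file's encoding.
static_assert(!is_ascii("caf\xC3\xA9"));
```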

1 reply

msadeqhe Nov 8, 2023
Author

If I were using a unicode editor, and I put unicode characters between the quotes, I'd expect the compiler to warn me that my char* string literal contained illegal characters, but my unicode string literal did not. I'm not sure if this is possible, but if it is, would the prefix on the string be required to ascertain the string type without the compiler needing wider context?

The IDE, code editor, and compiler can warn me if I directly construct a string type with the wrong string literal:

s32: type == std::u32string;

a: std::u8string = "UTF-32 text"; // OK. No warning!
b: std::u8string = s32("UTF-32 text"); // ERROR!

Also it works with any string type:

s32: type == * const char32_t;

a: std::u8string = "UTF-32 text"; // OK. No warning!
b: std::u8string = s32("UTF-32 text"); // ERROR!

That's the same behavior when we use unicode prefixes:

b: std::u8string = U"UTF-32 text"; // ERROR!
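
The mismatch error with prefixes can be checked in Cpp1 with a type trait. This sketch assumes C++17 or later; `accepts` is a hypothetical helper name, and the `std::u8string` checks are guarded because that type needs C++20 library support.

```cpp
#include <string>
#include <type_traits>

// Hypothetical helper: can String be constructed from a pointer of type CharPtr?
template <class String, class CharPtr>
constexpr bool accepts = std::is_constructible_v<String, CharPtr>;

static_assert( accepts<std::u32string, const char32_t*>);  // matching encoding
static_assert(!accepts<std::u16string, const char32_t*>);  // mismatched encoding
static_assert(!accepts<std::u32string, const char*>);      // unprefixed literal

#if defined(__cpp_lib_char8_t)
static_assert(!accepts<std::u8string, const char32_t*>);   // the U"..." case above
static_assert(!accepts<std::u8string, const char*>);       // also an error in Cpp1 today
#endif
```

Note the last line: in Cpp1 an unprefixed literal cannot initialize a `std::u8string` either, so the "OK. No warning!" cases in this comment describe the proposed Cpp2 behavior, not what Cpp1 does now.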

In your examples it is easy, but perhaps there are less direct assignments that could require context? Like using the same literal in an if statement to assign to different class members which have different string types? Or a templated out parameter?

If a member function is overloaded on different string types, we can call the constructor directly to select the appropriate overload:

s8: type == std::u8string;

// if obj.member("UTF-8 text".s8()) {
if obj.member(s8("UTF-8 text")) {
    // ...
}

But if the member function has only one declaration with one string type, and we don't care about the encoding of the string literal, we can write:

if obj.member("UTF-8 text") {
    // ...
}

Or an automatically deduced type based on a literal with only unicode characters between the quotes (was this intentional? Or is the user using a unicode text editor and they just want the char equivalent string?)

No, I think this approach has bad side-effects as you've explained.

Not that I don't like your idea, just trying to find faults for discussion.

Thanks.
