Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GDExtension function to construct StringName directly from char* #78580

Merged

Conversation

Bromeon
Copy link
Contributor

@Bromeon Bromeon commented Jun 22, 2023

So far, an indirection via String was necessary, causing at least 2 allocations and copies (String; String inside StringName). Since StringNames often refer to string literals, this allows them to be directly constructed from C strings.

It also provides the is_static flag: when the source has static storage duration, no copy/allocation will be needed. However, the extension developer needs to uphold this lifetime guarantee.

Some things need to be discussed:

  1. Does this actually support UTF-8? I've only seen ASCII StringNames in the wild...
  2. I adopted the name from the String constructor function, feel free to suggest a better one (especially if 1 doesn't hold).
  3. Are the memory assumptions correct?
    • If a static StringName is created, its destructor should not be invoked?
    • Non-static ones are reference-counted as usual?

@dsnopek
Copy link
Contributor

dsnopek commented Jun 23, 2023

Thanks! The overall idea here looks great :-)

  1. Does this actually support UTF-8? I've only seen ASCII StringNames in the wild...

That's a really good question. Looking through the code, it seems as though StringNames created from char * may be interpreted as latin-1? It definitely doesn't seem like utf-8, though - String::hash() is used to find the right data in the table, and it's basically just casting the chars to uint32_t (as opposed to trying to interpret the multi-byte sequences from utf-8).

  1. I adopted the name from the String constructor function, feel free to suggest a better one (especially if 1 doesn't hold).

Maybe gdextension_string_name_new_from_chars()? Or gdextension_string_name_new_from_latin1_chars() if it's really latin-1 and not just plain ASCII.

If a static StringName is created, its destructor should not be invoked?

I'm really not sure about this one. I think it may still be OK to invoke the destructor? I'm actually quite confused about how StringName::_Data::static_count is being used. This needs more investigation or an expert opinion.


/**
* @name string_name_new_with_utf8_chars
* @since 4.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably actually Godot 4.2 material -- 4.1 is about to go into RC next week!

@Bromeon Bromeon force-pushed the feature/gdextension-stringname-ctor branch 2 times, most recently from 10c9fad to 131fda4 Compare July 30, 2023 13:47
@Bromeon
Copy link
Contributor Author

Bromeon commented Jul 30, 2023

It looks like the encoding depends on whether the StringName is constructed statically or not. string_name_new_with_chars is now named more generally to accomodate for both encodings.

I mentioned the semantics in the docs, but would appreciate a 2nd opinion:

If is_static is true, then:

  • The StringName will reuse the p_contents buffer instead of copying it.
    You must guarantee that the buffer remains valid for the duration of the application (e.g. string literal).
  • The encoding is limited to Latin-1 instead of UTF-8.
  • You must not call a destructor for the StringName.

is_static is purely an optimization and can easily introduce undefined behavior if used wrong. In case of doubt, set it to false.

Since the semantics vary wildly (encoding, lifetime, destruction), would it make sense to split this into two functions?

  • string_name_new_with_utf8_chars
  • string_name_new_static_with_latin1_chars

(the "with_xy_chars" naming follows the string_new_* convention)

@Bromeon
Copy link
Contributor Author

Bromeon commented Jul 30, 2023

@dsnopek

If a static StringName is created, its destructor should not be invoked?

I'm really not sure about this one. I think it may still be OK to invoke the destructor? I'm actually quite confused about how StringName::_Data::static_count is being used. This needs more investigation or an expert opinion.

Regarding that, last time I ran into problems when destroying static StringNames, and IIRC there weren't any memory leaks reported by LSan when not destroying it. Would need to check again, was more than a month ago...


@AThousandShips

Note that a StringName constructed from a utf8 string that contains non-ascii characters will not match equality with one created from utf32, unless the utf8 one converts the buffer to utf32 before creating

From UTF-32, so you mean converted from a String? Because the char* source here is either UTF-8 or Latin-1 (or even ASCII).

But that's an interesting limitation... Why is the buffer not normalized to one format?
Or are you talking about the static case (but then it would be Latin-1/ASCII)?

@AThousandShips
Copy link
Member

AThousandShips commented Jul 30, 2023

Meant char* with utf8, if it is utf8 it would have to be converted to utf32 to be matching with the hash, or done for the hash calculation

It would match with Latin-1/ASCII, but with anything in utf8 outside the ASCII range the hash won't match unless the hash function is altered

If it creates the StringName by creating a String then everything is good

@Bromeon
Copy link
Contributor Author

Bromeon commented Jul 30, 2023

But doesn't Godot use that internally with SNAME all the time?

Or that just happens to work because SNAME is used everywhere and not actually converted from String? But that is also hard to imagine, as user provided input like class names comes from String...

@AThousandShips
Copy link
Member

If you use SNAME it is not guaranteed to work if it contains non-ASCII

@AThousandShips
Copy link
Member

AThousandShips commented Jul 30, 2023

Reason being:

StringName::StringName(const char *p_name, bool p_static) {
_data = nullptr;
ERR_FAIL_COND(!configured);
if (!p_name || p_name[0] == 0) {
return; //empty, ignore
}
MutexLock lock(mutex);
uint32_t hash = String::hash(p_name);

godot/core/string/ustring.cpp

Lines 2765 to 2775 in 75f9c97

uint32_t String::hash(const char *p_cstr) {
uint32_t hashv = 5381;
uint32_t c = *p_cstr++;
while (c) {
hashv = ((hashv << 5) + hashv) + c; /* hash * 33 + c */
c = *p_cstr++;
}
return hashv;
}

vs:

godot/core/string/ustring.cpp

Lines 2816 to 2826 in 75f9c97

uint32_t String::hash(const char32_t *p_cstr) {
uint32_t hashv = 5381;
uint32_t c = *p_cstr++;
while (c) {
hashv = ((hashv << 5) + hashv) + c; /* hash * 33 + c */
c = *p_cstr++;
}
return hashv;
}

@AThousandShips
Copy link
Member

Causing for example "£" to be hashed as if it was "0xC2 0xA3" instead of "0xA3"

@Bromeon
Copy link
Contributor Author

Bromeon commented Jul 30, 2023

Note that a StringName constructed from a utf8 string that contains non-ascii characters will not match equality with one created from utf32, unless the utf8 one converts the buffer to utf32 before creating

So getting back to this, is it even a good idea to expose the UTF-8 based construction to GDExtension? You can be sure that users will run into this trap sooner or later...

Should we maybe just start by exposing the static ASCII constructor?
(I'm btw still not sure if ASCII or Latin-1...)

Or allow UTF-8 construction and convert to UTF-32 in the presence of non-ASCII chars (and leaving it otherwise)? Might still be fast enough for the average case where StringName is pure ASCII.

@AThousandShips
Copy link
Member

AThousandShips commented Jul 30, 2023

I'd say to err on the side of caution and do conversion, arguably unconditionally, unless the performance benefits of having the shorter encoding is worth it

So either:

  1. Clarify that it only supports Latin-1 (encoded as such)
  2. Check for encoding and convert conditionally
  3. Convert unconditionally (i.e. just use a String conversion in the creation step):
    memnew_placement(r_dest, StringName(String::utf8(p_contents), static_cast<bool>(is_static)));

The difference here for ASCII/Latin-1 is that if it is encoded in Latin-1 it'll work, but if it's Latin-1 encoded as utf8 it'll break

While it won't cause problems with built-in StringNames as they are all reliably in the valid range, it will clash with ones created from GDScript if they contain non-ASCII characters

@Bromeon
Copy link
Contributor Author

Bromeon commented Jul 30, 2023

Going always through StringName (3) would however defeat the point of this addition, as that's already possible with the existing API. The main idea behind this PR is to allow faster StringName construction, especially from literals.

Option (1) could still provide an is_static flag and could look as follows:

string_name_new_with_latin1_chars(dest, c_string, is_static);

It would then be the caller's responsibility to make sure that c_string is Latin-1.
This would be symmetric with the existing constructor for String, namely string_new_with_latin1_chars.

Later (or now?), we could then provide a constructor from UTF-8 that would unconditionally convert, just like the String constructor.

@AThousandShips
Copy link
Member

Sounds good!

@dsnopek
Copy link
Contributor

dsnopek commented Jul 31, 2023

So either:

  1. Clarify that it only supports Latin-1 (encoded as such)
  2. Check for encoding and convert conditionally
  3. Convert unconditionally (i.e. just use a String conversion in the creation step):
    memnew_placement(r_dest, StringName(String::utf8(p_contents), static_cast<bool>(is_static)));

I'd personally vote for option nr 1

StringNames are meant to be identifiers, not proper strings. So, in my opinion, sticking with ASCII is fine, it just happens that Godot treats C-strings as Latin-1 (as far as I can tell, at least) which is a mostly harmless superset of ASCII.

Later (or now?), we could then provide a constructor from UTF-8 that would unconditionally convert, just like the String constructor.

I'd vote for later. Given that it's already possible to convert a String to StringName, I don't think we're blocking anyone. We could wait to see if anyone actually asks for this.

(UPDATE: Although, if you really want to address it now, I'd be fine with that too :-))

@Bromeon
Copy link
Contributor Author

Bromeon commented Aug 1, 2023

So, in my opinion, sticking with ASCII is fine, it just happens that Godot treats C-strings as Latin-1 (as far as I can tell, at least) which is a mostly harmless superset of ASCII.

I think this is where things break: while StringNames constructed from ASCII char* compare fine with the StringNames constructed through String, this doesn't hold for Latin-1 char*. See also AThousandShips' comment here and the 2 messages preceding it.

Maybe the hash functions should be aligned? If yes, that would become blocking.

Or we only support ASCII and validate it. But in these constructor functions, there is currently no way to report errors. We could add a GDExtensionBool return type, but it would be inconsistent with other constructors like string_new_*.


I'd vote for later. Given that it's already possible to convert a String to StringName, I don't think we're blocking anyone. We could wait to see if anyone actually asks for this.

(UPDATE: Although, if you really want to address it now, I'd be fine with that too :-))

I already implemented it in the draft 🙂 still needs some testing though.

@dsnopek
Copy link
Contributor

dsnopek commented Aug 7, 2023

I think this is where things break: while StringNames constructed from ASCII char* compare fine with the StringNames constructed through String, this doesn't hold for Latin-1 char*. See also AThousandShips' comment here and the 2 messages preceding it.

Hm, looking at the code, I guess I still don't see why the hash is different for Latin-1 vs UTF-32.

The example of "£" is 0xA3 in Latin-1 and 0x000000A3 in UTF-32. I had thought this worked for all the Latin-1 characters? But I'm not an expert on this, so I could be wrong.

Looking at the hash function (for Latin-1 data):

uint32_t String::hash(const char *p_cstr) { 
 	uint32_t hashv = 5381; 
 	uint32_t c = *p_cstr++; 
  
 	while (c) { 
 		hashv = ((hashv << 5) + hashv) + c; /* hash * 33 + c */ 
 		c = *p_cstr++; 
 	} 
  
 	return hashv; 
 } 

It's casting each character to a uint32_t before operating on it, which I think effectively converts the Latin-1 character to UTF-32 (assuming I'm correct above, about Latin-1 being the same as the first set of unicode characters).

Of course, if you passed UTF-8 data into String::hash(const char *p_cstr) then you'd get a very different hash, but I think this is meant to take Latin-1 data.

Or we only support ASCII and validate it. But in these constructor functions, there is currently no way to report errors. We could add a GDExtensionBool return type, but it would be inconsistent with other constructors like string_new_*.

However, if we operate under the assumption that I'm wrong about the hashing above - which very well may be the case :-) - and we can only correctly handle ASCII, I think it'd be fine documenting that it only takes ASCII and naming it accordingly (ie. string_name_new_with_ascii_chars()), I don't think we necessarily need to validate it.

The binding will have more knowledge about the data coming from the developer, and may already have validated that the data is ASCII (perhaps the language even only supports ASCII?) and adding extra validation in the GDExtension interface is unnecessary. There's already definitely parts of the GDExtension interface that expect you to pass correct data in with no validation, so this wouldn't be anything new.

@AThousandShips
Copy link
Member

AThousandShips commented Aug 7, 2023

To clarify what I was saying, Latin-1, as in the encoding, should work correctly, but I don't know what encodings are used in source as C++ largely uses utf8 AFAIK, so that's my only point of concern, but assuming it is encoded as Latin-1 it should be fully safe

@dsnopek
Copy link
Contributor

dsnopek commented Aug 7, 2023

Ah, ok, thanks for clarifying!

The encoding used by the end-developer is going to depend on their specific programming language, so, in my opinion, it should be up to the language binding to ensure that it's passing data with the right encoding to GDExtension.

@Bromeon Bromeon force-pushed the feature/gdextension-stringname-ctor branch from 2dcb5f8 to 83ff151 Compare August 7, 2023 20:05
@dsnopek
Copy link
Contributor

dsnopek commented Aug 11, 2023

Discussed at the GDExtension meeting, and we think this API looks great! The only remaining question is if there is still an encoding issue. If not, this is probably good to move forward. If so, limiting the API to UTF-8 and ASCII (not Latin-1) would also be fine.

@Bromeon
Copy link
Contributor Author

Bromeon commented Sep 24, 2023

Found the reason of the discrepancy between the hashes:

  1. StringName constructed from char* (Latin-1)

    • uses the uint32_t String::hash(const char *p_cstr) method
    • based on char, which has range [-128, 127] on Windows
    • '±' has value -79
    • assignment to uint32_t leads to value 4294967217
  2. StringName constructed from existing String (UTF-8)

    • uses the uint32_t String::hash() const method
    • based on char32_t
    • U'±' has value 177 (== 256 - 79)

How should we proceed?

  • Make sure the hashes are computed the same for all strings.
    • Would need to be careful that the one which is used for method registrations doesn't change, otherwise all method hashes are invalidated.
    • Any other breaking potential?
  • Make sure that StringName at least uses a cosistent hash function for itself.

I'll try with interpreting char* as uint8_t*...

@Bromeon Bromeon force-pushed the feature/gdextension-stringname-ctor branch from 83ff151 to 13fb823 Compare September 24, 2023 14:16
@Bromeon
Copy link
Contributor Author

Bromeon commented Sep 24, 2023

Thanks for the review, will change parameter names.

Added a 2nd commit that would modify the char* hash and makes the two compare equal in my Rust tests, but we need to see if it doesn't break anything else.

We also need to see how we would do it for wchar_t*, as it essentially suffers from the same problem. It's a bit more difficult, because unlike char, it's size can vary, so I'm not sure if truncating to uint16_t would always give deterministic results.

@Bromeon Bromeon force-pushed the feature/gdextension-stringname-ctor branch from 13fb823 to 1fe778b Compare September 24, 2023 14:22
So far, an indirection via String was necessary, causing at least 2 allocations and copies (String; String inside StringName).
Since StringNames often refer to string literals, this allows them to be directly constructed from C strings.

There are two formats: Latin-1 and UTF-8.

The Latin-1 constructor also provides the `p_is_static` flag: when the source has static storage duration, no copy/allocation will be needed.
However, the extension developer needs to uphold this lifetime guarantee.
@Bromeon Bromeon force-pushed the feature/gdextension-stringname-ctor branch from 1fe778b to f284d55 Compare September 24, 2023 16:27
@Bromeon
Copy link
Contributor Author

Bromeon commented Sep 24, 2023

Added a UTF-8 constructor that takes length, in addition to the one relying on null-termination. Should greatly enhance interoperability with other string types. (I can also remove the one expecting null-terminated char*, in case that's preferred).

Furthermore, I noticed the docs are imprecise -- the p_size argument sometimes means "number of bytes" (Latin-1, UTF-8) and sometimes "number of characters" (Wide, UTF-16, UTF-32). I renamed the latters to p_char_count and updated parameter docs accordingly. Does that make sense?

What's left is mostly the question whether the hash for wchar_t should also be fixed, and how (as it might be 2 or 4 bytes wide).

@Bromeon Bromeon marked this pull request as ready for review September 24, 2023 16:40
@Bromeon Bromeon requested review from a team as code owners September 24, 2023 16:40
@dsnopek
Copy link
Contributor

dsnopek commented Sep 24, 2023

Found the reason of the discrepancy between the hashes

Aha, great catch! I wouldn't have thought to consider the signedness of character type.

Would need to be careful that the one which is used for method registrations doesn't change, otherwise all method hashes are invalidated.

Yes, this is important, I don't want to have to map more hashes :-)

Furthermore, I noticed the docs are imprecise -- the p_size argument sometimes means "number of bytes" (Latin-1, UTF-8) and sometimes "number of characters" (Wide, UTF-16, UTF-32). I renamed the latters to p_char_count and updated parameter docs accordingly. Does that make sense?

Sounds good to me!

What's left is mostly the question whether the hash for wchar_t should also be fixed, and how (as it might be 2 or 4 bytes wide).

Could we do something with constexpr? For example:

    uint32_t val;
    if constexpr (sizeof(wchar_t) == 2) {
        val = static_cast<uint16_t>(input_char);
    } else {
        val = static_cast<uint32_t>(input_char);
    }

@Bromeon Bromeon force-pushed the feature/gdextension-stringname-ctor branch from f284d55 to 2608410 Compare September 24, 2023 17:32
@Bromeon
Copy link
Contributor Author

Bromeon commented Sep 24, 2023

Could we do something with constexpr?

Great idea! I actually ended up using std::conditional_t since the conversion was used in several places, and this adds much less code.

Yes, this is important, I don't want to have to map more hashes :-)

CI would report incompatibility if that happened, correct?

@dsnopek
Copy link
Contributor

dsnopek commented Sep 24, 2023

Great idea! I actually ended up using std::conditional_t since the conversion was used in several places, and this adds much less code.

This code looks great to me!

However, we'll have to see how the core and production teams feel about using std::conditional_t, since it isn't used anywhere in Godot yet, and we try to limit how much of the standard library and newer C++ features we use. (We do already use constexpr in a few places, so that could be a fallback, if necessary - but your version is much shorter and easier to follow, so I'd personally prefer yours.)

CI would report incompatibility if that happened, correct?

Yes, it would, and the tests have passed, so I think we're good :-)

Since char/wchar_t can be either signed or unsigned, its conversion to uint32_t leads to different values depending on platform.
In particular, the same string represented as char* (Latin-1; StringName direct construction) or uint32_t (UTF-8; constructed
via String) previously resulted in different hashes.
@Bromeon Bromeon force-pushed the feature/gdextension-stringname-ctor branch from 2608410 to c770937 Compare September 24, 2023 18:59
@Bromeon
Copy link
Contributor Author

Bromeon commented Sep 24, 2023

However, we'll have to see how the core and production teams feel about using std::conditional_t, since it isn't used anywhere in Godot yet, and we try to limit how much of the standard library and newer C++ features we use.

I see! I changed it to std::conditional<...>::type, which is already used in several places and should thus be less controversial.

@akien-mga akien-mga modified the milestones: 4.x, 4.2 Sep 24, 2023
@akien-mga akien-mga merged commit 40b48b1 into godotengine:master Sep 24, 2023
@akien-mga
Copy link
Member

Thanks!

@Bromeon Bromeon deleted the feature/gdextension-stringname-ctor branch September 25, 2023 08:11
@akien-mga akien-mga changed the title Add GDExtension function to construct StringName directly from char* Add GDExtension function to construct StringName directly from char * Oct 3, 2023
@akien-mga akien-mga changed the title Add GDExtension function to construct StringName directly from char * Add GDExtension function to construct StringName directly from char* Oct 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants