
CFStringRef#stringValue buffer needs space for 4 UTF8 bytes #1345

Merged · Apr 25, 2021

Conversation

@dbwiddis (Contributor)

Fixes #1342 (again)

macOS incorrectly uses a 3-byte-per-character conversion when determining the maximum byte size of a UTF-8 encoded string. Changing the buffer size calculation to use 4 bytes per character (plus a null byte) produces an appropriately sized buffer.
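As an illustration only, a minimal sketch of the revised sizing rule (a hypothetical helper; the actual JNA code calls CFStringGetMaximumSizeForEncoding rather than multiplying by hand):

// Hypothetical helper mirroring the sizing described above, not the JNA source.
final class Utf8BufferSizing {
    // 4 bytes per character reported by CFStringGetLength, plus 1 null byte.
    static long bufferSize(long cfStringLength) {
        return cfStringLength * 4 + 1;
    }
}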

@matthiasblaesing (Member)

So Apple manages to screw up calculating the correct byte width for a character string, and every library out there has to reinvent the wheel? Great ...

But yes, the change looks fine.

@dbwiddis (Contributor, Author)

Apple is not alone in this.

@dbwiddis dbwiddis merged commit 402d4c5 into java-native-access:master Apr 25, 2021
@dbwiddis dbwiddis deleted the utf8is4bytes branch April 25, 2021 18:10
@dbwiddis (Contributor, Author)

OK, so maybe it WAS okay, and the test case I ran when I first tried this may have omitted the null byte.

I reported this as a bug to Apple and received this response:

This behaves as expected, and it looks like an issue for you to resolve.

The behavior of the function is correct. The length parameter is specified as the UTF-16 length of the string (as are all lengths with CFString/NSString).

It is correct that in a UTF-8 encoding a character may be encoded with up to 4 bytes. However, in the cases where a character is encoded with 4 bytes in UTF-8, that character will always be encoded as two UTF-16 characters, so in those cases the length would be 2 and not 1, and CFStringGetMaximumSizeForEncoding would return a value that would appropriately fit the resulting UTF-8 encoded data.
If you have a specific example where CFStringGetMaximumSizeForEncoding does not return the correct amount, please provide that and we can investigate further.

I tried to reproduce my earlier error and failed. I may have forgotten to add the null byte in my original failing test case.

It seems that some 3-byte UTF-8 characters are only 2 bytes in UTF-16 (such as the alaf character in the earlier test case), but 4-byte UTF-8 characters always consume 4 bytes in UTF-16 as well, as a surrogate pair of two code units, so the "length" determined even for a single displayed "character" will be sufficient.

Oddly, for a single 4-byte character (as displayed), the Java String length reports the number of chars required, which is 2.
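A quick Java check of this behavior (a hypothetical example using U+1F600, not a character from the original test case):

// U+1F600 needs 4 bytes in UTF-8 but is a surrogate pair (two chars) in Java.
String s = new String(Character.toChars(0x1F600));
System.out.println(s.length());                      // 2 UTF-16 code units
System.out.println(s.codePointCount(0, s.length())); // 1 code point
System.out.println(s.getBytes(java.nio.charset.StandardCharsets.UTF_8).length); // 4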

So I will likely be rolling back this PR to the previous version, but I'm going to take some time to try to fully understand Unicode and Java strings before I do anything else.

Sorry for the churn.

@matthiasblaesing (Member)

Ok - my take on this:

  • Unicode is the specification that lists symbols
  • UTF-X are the encodings that map those symbols to byte representations
  • a single Java char can only represent a subset of all Unicode symbols, as it is only 16 bits long - only the Basic Multilingual Plane is supported. For the full Unicode range, two chars are required, which form a surrogate pair. An int can be used to represent any Unicode code point (see the sketch below).
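A sketch of that last point (a hypothetical example, JDK 8+):

// Iterating by code point (int) covers the full Unicode range; a lone char cannot.
String s = "a" + new String(Character.toChars(0x1F600)); // 'a' plus a surrogate pair
s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp)); // U+0061, then U+1F600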

I don't know what you reported, but considering the original code:

CFIndex length = INSTANCE.CFStringGetLength(this);
if (length.longValue() == 0) {
    return "";
}

CFStringGetLength "[r]eturns the number (in terms of UTF-16 code pairs) of Unicode characters in a string" (quote from the specification). This number is equal to or higher than the actual number of Unicode symbols in the string. So length either holds the real length of the string in Unicode symbols, or a larger number.

// Calculate maximum possible size in UTF8 bytes
CFIndex maxSize = INSTANCE.CFStringGetMaximumSizeForEncoding(length, kCFStringEncodingUTF8);
if (maxSize.intValue() == kCFNotFound) {
    throw new StringIndexOutOfBoundsException("CFString maximum number of bytes exceeds LONG_MAX.");
}

CFStringGetMaximumSizeForEncoding takes as input "the number of Unicode characters to evaluate" and returns "the maximum number of bytes that could be needed to represent length number of Unicode characters with the string encoding encoding, or kCFNotFound if the number exceeds LONG_MAX".

There are two interpretations possible here:

  • either "Unicode characters" here means Unicode symbols, in which case a worst case for UTF-8 would yield 4 bytes
  • or it means UTF-16 code units (single 16-bit values), in which case a worst case for UTF-8 would yield 3 bytes

So my take on this: the estimation this method performs for a string should overestimate the required memory.
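A worked check of that claim under the stricter 3-bytes-per-code-unit reading (hypothetical values, not from the PR):

// Even at 3 bytes per UTF-16 code unit, a 4-byte UTF-8 character is covered,
// because CFStringGetLength-style counting reports it as two code units.
String emoji = new String(Character.toChars(0x1F600));
int reported = emoji.length();  // 2 code units
int estimate = reported * 3;    // 6 bytes estimated
int actual = emoji.getBytes(java.nio.charset.StandardCharsets.UTF_8).length; // 4
assert estimate >= actual;      // the overestimate is safe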

@dbwiddis (Contributor, Author)

"I don't know what you reported"

In addition to noting that UTF-8 characters can be up to 4 bytes, I did provide the sequence of function calls used:

CFIndex length = CFStringGetLength(aString);
CFIndex maxSize = CFStringGetMaximumSizeForEncoding(length, kCFStringEncodingUTF8) + 1;
char *buffer = (char *)malloc(maxSize);
if (CFStringGetCString(aString, buffer, maxSize, ...)) {
    // ...
}

The way I interpret the reply, CFStringGetLength() will always return a length of at least 2 "Unicode characters" for a 4-byte UTF-8 character.

"a single java char can only represent a subset of all unicode symbols as it is only 16 bit long"

Right. This is a distraction from the root question here, but it is relied upon in the JNA implementation and tests, so it is relevant. The native method CFStringCreateWithCharacters is used inside CFStringRef#createCFString() and converts a String to a char[] using toCharArray(). In addition, in my test case, I do an assert using the string's length().

Based on my testing, it always appears to create a String with length 2 and a char[2] array.
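A hypothetical check of that observation:

// toCharArray() on a single 4-byte-UTF-8 symbol yields a char[2] surrogate pair.
char[] chars = new String(Character.toChars(0x1F600)).toCharArray();
System.out.println(chars.length);                        // 2
System.out.println(Character.isHighSurrogate(chars[0])); // true
System.out.println(Character.isLowSurrogate(chars[1]));  // true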

"So my take on this, the estimation for a string done by this method should overestimate the required memory."

It does not hurt to overestimate, which multiplying by 4 does. My only concern is whether the tests for string length and/or character array length should be changed.

@dbwiddis (Contributor, Author)

Rereading this, I think I muddled the issue. Here are my thoughts, more succinctly:

  • Regardless of how we create the String, the sequence of calling CFStringGetLength() and using the result in CFStringGetMaximumSizeForEncoding(), adding 1 for the null terminator, should work. This is how it was before I submitted this PR.
  • The current version (this PR) multiplies by 4 instead of 3, which is even more of an overestimate and probably isn't harmful. But there is no need for it.
  • I'm still not sure whether the method I use to convert the original Java String to a CFString is correct, but it appears to be.

@dbwiddis (Contributor, Author) commented May 3, 2021

OK, I've convinced myself that Java always adds a second char to the String for 4-byte Unicode characters; these can be detected with Character.isSurrogate(c) in JDK 7+. This Guava method is one of several online examples that calculate 3 bytes per char except when 4 are needed for a surrogate pair. I'll roll back this change and rework (and add comments to) the test cases.
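For reference, a sketch of that per-char counting approach (an illustration in the spirit of the linked Guava method, assuming well-formed input; not the code being merged here):

// Counts UTF-8 bytes per char: 1, 2, or 3, except surrogates, where each
// half of a well-formed pair contributes 2 bytes (4 total for the pair).
static int utf8Length(CharSequence s) {
    int bytes = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c < 0x80) bytes += 1;
        else if (c < 0x800) bytes += 2;
        else if (Character.isSurrogate(c)) bytes += 2; // pair totals 4
        else bytes += 3;
    }
    return bytes;
}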

Linked issue: CoreFoundation's CFStringRef#stringValue doesn't add space for terminating null