
CFStringRef#stringValue buffer needs space for 4 UTF8 bytes #1345

Merged · Apr 25, 2021

Conversation

@dbwiddis (Contributor)

Fixes #1342 (again)

macOS incorrectly uses a 3-byte-per-character conversion when determining the maximum byte size of a UTF-8 encoded string. Changing the buffer size calculation to use 4 bytes per character (plus a null byte) produces an appropriately sized buffer.
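As an illustration only, a minimal sketch of the revised sizing rule (a hypothetical helper; the actual JNA code calls CFStringGetMaximumSizeForEncoding rather than multiplying by hand):

// Hypothetical helper mirroring the sizing described above, not the JNA source.
final class Utf8BufferSizing {
    // 4 bytes per character reported by CFStringGetLength, plus 1 null byte.
    static long bufferSize(long cfStringLength) {
        return cfStringLength * 4 + 1;
    }
}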

@matthiasblaesing (Member)

So Apple manages to screw up calculating the correct byte width for a character string, and every library out there has to reinvent the wheel? Great ...

But yes, the change looks fine.

@dbwiddis (Contributor, Author)

Apple is not alone in this.

@dbwiddis dbwiddis merged commit 402d4c5 into java-native-access:master Apr 25, 2021
@dbwiddis dbwiddis deleted the utf8is4bytes branch April 25, 2021 18:10
@dbwiddis (Contributor, Author)

OK, so maybe it WAS okay, and the test case I ran when I first tried this may have omitted the null byte.

I reported this as a bug to Apple and received this response:

This behaves as expected, and it looks like an issue for you to resolve.

The behavior of the function is correct. The length parameter is specified as the UTF-16 length of the string (as are all lengths with CFString/NSString).

It is correct that in a UTF-8 encoding a character may be encoded with up to 4 bytes. However, in the cases where a character is encoded with 4 bytes in UTF-8, that character will always be encoded as two UTF-16 characters, so in those cases the length would be 2 and not 1, and CFStringGetMaximumSizeForEncoding would return a value that would appropriately fit the resulting UTF-8 encoded data.
If you have a specific example where CFStringGetMaximumSizeForEncoding does not return the correct amount, please provide that and we can investigate further.

I tried to reproduce my earlier error and failed. I may have forgotten to add the null byte in my original failing test case.

It seems that some 3-byte UTF-8 characters are only 2 bytes in UTF-16 (such as the alaf character in the earlier test case), but 4-byte UTF-8 characters always consume 4 bytes in UTF-16 as well, as a surrogate pair of two code units, so the "length" determined even for a single displayed "character" will be sufficient.

Oddly, for a single 4-byte character (as displayed), the Java String length reports the number of chars required, which is 2.
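A quick Java check of this behavior (a hypothetical example using U+1F600, not a character from the original test case):

// U+1F600 needs 4 bytes in UTF-8 but is a surrogate pair (two chars) in Java.
String s = new String(Character.toChars(0x1F600));
System.out.println(s.length());                      // 2 UTF-16 code units
System.out.println(s.codePointCount(0, s.length())); // 1 code point
System.out.println(s.getBytes(java.nio.charset.StandardCharsets.UTF_8).length); // 4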

So I will likely be rolling back this PR to the previous version, but I'm going to take some time to try to fully understand Unicode and Java strings before I do anything else.

Sorry for the churn.

@matthiasblaesing (Member)

Ok - my take on this:

  • Unicode is the specification that lists symbols
  • UTF-X are the encodings that map those symbols to byte representations
  • a single Java char can only represent a subset of all Unicode symbols, as it is only 16 bits long - only the Basic Multilingual Plane is supported. For the full Unicode range, two chars are required, which form a surrogate pair. An int can be used to represent any Unicode code point (see the sketch below).
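A sketch of that last point (a hypothetical example, JDK 8+):

// Iterating by code point (int) covers the full Unicode range; a lone char cannot.
String s = "a" + new String(Character.toChars(0x1F600)); // 'a' plus a surrogate pair
s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp)); // U+0061, then U+1F600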

I don't know what you reported, but considering the original code:

CFIndex length = INSTANCE.CFStringGetLength(this);
if (length.longValue() == 0) {
    return "";
}

CFStringGetLength "[r]eturns the number (in terms of UTF-16 code pairs) of Unicode characters in a string" (quote from the specification). This number is equal to or higher than the actual number of Unicode symbols in the string. So length either holds the real length of the string in Unicode symbols, or a larger number.

// Calculate maximum possible size in UTF8 bytes
CFIndex maxSize = INSTANCE.CFStringGetMaximumSizeForEncoding(length, kCFStringEncodingUTF8);
if (maxSize.intValue() == kCFNotFound) {
    throw new StringIndexOutOfBoundsException("CFString maximum number of bytes exceeds LONG_MAX.");
}

CFStringGetMaximumSizeForEncoding takes as input "the number of Unicode characters to evaluate" and returns "the maximum number of bytes that could be needed to represent length number of Unicode characters with the string encoding encoding, or kCFNotFound if the number exceeds LONG_MAX".

There are two interpretations possible here:

  • either "Unicode characters" here means Unicode symbols, in which case a worst case for UTF-8 would yield 4 bytes
  • or it means UTF-16 code units (single 16-bit values), in which case a worst case for UTF-8 would yield 3 bytes

So my take on this: the estimation this method performs for a string should overestimate the required memory.
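A worked check of that claim under the stricter 3-bytes-per-code-unit reading (hypothetical values, not from the PR):

// Even at 3 bytes per UTF-16 code unit, a 4-byte UTF-8 character is covered,
// because CFStringGetLength-style counting reports it as two code units.
String emoji = new String(Character.toChars(0x1F600));
int reported = emoji.length();  // 2 code units
int estimate = reported * 3;    // 6 bytes estimated
int actual = emoji.getBytes(java.nio.charset.StandardCharsets.UTF_8).length; // 4
assert estimate >= actual;      // the overestimate is safe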

@dbwiddis (Contributor, Author)

"I don't know what you reported"

In addition to noting that UTF-8 characters can be up to 4 bytes, I did provide the sequence of function calls used:

CFIndex length = CFStringGetLength(aString);
CFIndex maxSize = CFStringGetMaximumSizeForEncoding(length, kCFStringEncodingUTF8) + 1;
char *buffer = (char *)malloc(maxSize);
if (CFStringGetCString(aString, buffer, maxSize, ...)) {
    // ...
}

The way I interpret the reply, CFStringGetLength() will always return a length of at least 2 "Unicode characters" for a 4-byte UTF-8 character.

"a single java char can only represent a subset of all unicode symbols as it is only 16 bit long"

Right. This is a distraction from the root question here, but it is relied upon in the JNA implementation and tests, so it is relevant. The native method CFStringCreateWithCharacters is used inside CFStringRef#createCFString() and converts a String to a char[] using toCharArray(). In addition, in my test case, I do an assert using the string's length().

Based on my testing, it always appears to create a String with length 2 and a char[2] array.
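A hypothetical check of that observation:

// toCharArray() on a single 4-byte-UTF-8 symbol yields a char[2] surrogate pair.
char[] chars = new String(Character.toChars(0x1F600)).toCharArray();
System.out.println(chars.length);                        // 2
System.out.println(Character.isHighSurrogate(chars[0])); // true
System.out.println(Character.isLowSurrogate(chars[1]));  // true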

"So my take on this, the estimation for a string done by this method should overestimate the required memory."

It does not hurt to overestimate, which multiplying by 4 does. My only concern is whether the tests for string length and/or character array length should be changed.

@dbwiddis (Contributor, Author)

Rereading this, I think I muddled the issue. Here are my thoughts, more succinctly:

  • Regardless of how we create the String, the sequence of calling CFStringGetLength() and using the result in CFStringGetMaximumSizeForEncoding(), adding 1 for the null terminator, should work. This is how it was before I submitted this PR.
  • The current version (this PR) multiplies by 4 instead of 3, which is even more of an overestimate and probably isn't harmful. But there is no need for it.
  • I'm still not sure whether the method I use to convert the original Java String to a CFString is correct, but it appears to be.

@dbwiddis (Contributor, Author) commented May 3, 2021

OK, I've convinced myself that Java always adds a second char to the String for 4-byte Unicode characters; these can be detected with Character.isSurrogate(c) in JDK 7+. This Guava method is one of several online examples that calculate 3 bytes per char except when 4 are needed for a surrogate pair. I'll roll back this change and rework (and add comments to) the test cases.
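For reference, a sketch of that per-char counting approach (an illustration in the spirit of the linked Guava method, assuming well-formed input; not the code being merged here):

// Counts UTF-8 bytes per char: 1, 2, or 3, except surrogates, where each
// half of a well-formed pair contributes 2 bytes (4 total for the pair).
static int utf8Length(CharSequence s) {
    int bytes = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c < 0x80) bytes += 1;
        else if (c < 0x800) bytes += 2;
        else if (Character.isSurrogate(c)) bytes += 2; // pair totals 4
        else bytes += 3;
    }
    return bytes;
}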

Linked issue: CoreFoundation's CFStringRef#stringValue doesn't add space for terminating null