-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix string conversion issues with emoji characters #120
Conversation
@@ -46,6 +46,10 @@ namespace pybind11::detail { | |||
PyObject* strPtr = | |||
PyUnicode_Check(objPtr) ? PyUnicode_AsUTF8String(objPtr) : objPtr; | |||
|
|||
if (!strPtr) { | |||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is the fix for the crashes but can you tell what was in objPtr
when this occurs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a string object, but the string format returned when there are emojis fails to properly convert back into a QString.
This is described in more detail in #plugins-dev.
@@ -68,8 +72,7 @@ namespace pybind11::detail { | |||
handle /* parent */) | |||
{ | |||
static_assert(sizeof(QChar) == 2); | |||
return PyUnicode_FromKindAndData(PyUnicode_2BYTE_KIND, src.constData(), | |||
src.length()); | |||
return PyUnicode_FromString(src.toStdString().c_str()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to change this? AFAIK, Unicode string in Python can be stored as UTF-16 so this adds an extra conversion from UTF-16 to UTF-8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Emoji characters can't be directly converted back to QStrings in the previous function. The current workaround is to encode to UTF-16 with a surrogatepass and decode again. This allows us to directly use QStrings from C++ -> Python -> C++ without having reencode the strings again in Python.
See #plugins-dev.
I made some tests to try to understand the root cause. The problem is that So basically, in I think we should fix the Python > C++ code by checking the kinds - This is a better fix than checking the return of Regarding the C++ > Python "fix":
|
Something to bear in mind is that the Win32 and NT APIs don't care about the sequence of code points in a path, just the sequence of 16-bit values (it's a similar story on Unix, except it's eight-bit values), so getting rid of dodgy surrogates before conversion might mean you end up with a mod name that prints the same and compares equal using certain functions, but makes Windows attempt to access a different folder. It shouldn't have been possible to create a mod name with dodgy surrogates in the first place - both the MO2 GUI and Windows Explorer should be encoding strings in canonical form when creating a mod. It might be worth looking into whether they're not actually doing that, or whether the folder name is being subtly mangled when we load it. Also, the Python docs are confusing. UCS2 and UCS4 are both fixed-width encodings, with UCS2 being the subset of UTF-16 that can be represented with one code unit per code point, and UCS4 is the same thing as UTF-32. UCS1 isn't a real thing, so it's unclear if Python's come up with their own thing that's the subset of UTF-8 that can be represented with one code unit per code point (basically equivalent to 7-bit ASCII padded to eight bits), or they're calling UTF-8 UCS1 and calling UTF-16 UCS2 even though they're different. If the former, then that would imply that Python Unicode objects are always fixed-width, and can only express emoji and other high-value code points in UCS4 mode, which would explain why we were getting dodgy surrogates when telling it to use UCS2 mode. The existence of the |
I think a more robust Python -> C++ conversion is fine. In that case I don't think the change to the C++ -> Python is necessary. But you're saying these strings can still display improperly on the python side in some cases? In that case I think using PyUnicode_FromString may make more sense so that plugin authors don't have to potentially reencode strings for random edge cases. |
https://peps.python.org/pep-0393/ seems to confirm my theory from earlier, i.e. that Python stores Unicode strings in whichever it deems to be the narrowest fixed-width encoding that will hold all the code points it contains. That means we either need to do all our string interop with the explicitly-UTF-8 functions, or the explicitly-UCS4 ones, as a UTF-16 string isn't necessarily bit-compatible with the UCS2 functions. The one exception is coming back from Python to C++ - if the function to query a Python string's internal representation says it's already UCS2, we can safely reinterpret that as UTF-16. |
This is what I have now: /**
* Conversion part 1 (Python->C++): convert a PyObject into a QString
* instance or return false upon failure. The second argument
* indicates whether implicit conversions should be applied.
*/
bool type_caster<QString>::load(handle src, bool)
{
PyObject* objPtr = src.ptr();
if (PyBytes_Check(objPtr)) {
value = QString::fromUtf8(PyBytes_AsString(objPtr));
return true;
}
else if (PyUnicode_Check(objPtr)) {
switch (PyUnicode_KIND(objPtr)) {
case PyUnicode_1BYTE_KIND:
value = QString::fromUtf8(PyUnicode_AsUTF8(objPtr));
break;
case PyUnicode_2BYTE_KIND:
value = QString::fromUtf16(
reinterpret_cast<char16_t*>(PyUnicode_2BYTE_DATA(objPtr)),
PyUnicode_GET_LENGTH(objPtr));
break;
case PyUnicode_4BYTE_KIND:
value = QString::fromUcs4(
reinterpret_cast<char32_t*>(PyUnicode_4BYTE_DATA(objPtr)),
PyUnicode_GET_LENGTH(objPtr));
break;
default:
return false;
}
return true;
}
else {
return false;
}
}
/**
* Conversion part 2 (C++ -> Python): convert an QString instance into
* a Python object. The second and third arguments are used to
* indicate the return value policy and parent object (for
* ``return_value_policy::reference_internal``) and are generally
* ignored by implicit casters.
*/
handle type_caster<QString>::cast(QString src, return_value_policy /* policy */,
handle /* parent */)
{
return PyUnicode_DecodeUTF16(reinterpret_cast<const char*>(src.utf16()),
2 * src.length(), nullptr, 0);
} The initial issue is that The In |
That looks pretty good to me. Would you prefer I update this PR? Or do you want to just open a new one? |
You can update this PR. |
Alright, I can get to that in a few hours. |
@Silarn I pushed the change with some extra unit tests. I let you test locally with rootbuilder (I only have checked that the unit tests are passing). |
This change appears to work. The strings are being passed in both directions without errors or mangled characters. |
This should prevent crashes and seems to fix the issue with emoji characters.
I'm unsure if this might cause other problems, however, so a review is very appreciated.