Fix string conversion issues with emoji characters #120

Silarn · 2023-11-29T06:11:38Z

This should prevent crashes and seems to fix the issue with emoji characters.

I'm unsure if this might cause other problems, however, so a review is very appreciated.

Holt59 · 2023-11-29T08:38:28Z

src/pybind11-qt/pybind11_qt_basic.cpp

@@ -46,6 +46,10 @@ namespace pybind11::detail {
        PyObject* strPtr =
            PyUnicode_Check(objPtr) ? PyUnicode_AsUTF8String(objPtr) : objPtr;

+        if (!strPtr) {
+            return false;


I guess this is the fix for the crashes but can you tell what was in objPtr when this occurs?

It's a string object, but the string format returned when there are emojis fails to properly convert back into a QString.

This is described in more detail in #plugins-dev.

Holt59 · 2023-11-29T08:38:31Z

src/pybind11-qt/pybind11_qt_basic.cpp

@@ -68,8 +72,7 @@ namespace pybind11::detail {
                                      handle /* parent */)
    {
        static_assert(sizeof(QChar) == 2);
-        return PyUnicode_FromKindAndData(PyUnicode_2BYTE_KIND, src.constData(),
-                                         src.length());
+        return PyUnicode_FromString(src.toStdString().c_str());


Any reason to change this? AFAIK, Unicode string in Python can be stored as UTF-16 so this adds an extra conversion from UTF-16 to UTF-8.

Emoji characters can't be directly converted back to QStrings in the previous function. The current workaround is to encode to UTF-16 with a surrogatepass and decode again. This allows us to directly use QStrings from C++ -> Python -> C++ without having reencode the strings again in Python.

See #plugins-dev.

Holt59 · 2023-11-30T10:10:26Z

I made some tests to try to understand the root cause. The problem is that PyUnicode_FromKindAndData(PyUnicode_2BYTE_KIND) creates a unicode object encoded as UCS2, which is valid Python, but cannot then be decoded per PyUnicode_AsUTF8String due to the surrogate pair.

So basically, in master, the C++ > Python conversion is valid because having a UCS2 encoded Unicode object is valid Python, but the Python > C++ code is not. The "proper" fix is to check the KIND of the Unicode object and use the proper function to extract it, which is the opposite of what this fix does (or partially) - I'm not saying this PR is not the one we should use, read below.

I think we should fix the Python > C++ code by checking the kinds - This is a better fix than checking the return of PyUnicode_AsUTF8String as this would handle Python strings encoded as UCS2 and UCS4 instead of simply failing to convert. This is a bit more verbose but I can provide the fix (I have it on my home computer, just forgot to push it yesterday... ). Note that I have yet to find a way to create a UCS2 or UCS4 encoded string in Python that does not come from a C bindings, so this may be "overkill" but should not add overhead.

Regarding the C++ > Python "fix":

If we use this PR fix, we should be safe to do whatever we want on the Python side with the converted QString and pass it back to C++ safely, but that incurs extra UTF-16 / UTF-8 conversions - I don't know how Qt handles surrogate pairs, but I trust that this is ok since this seems to work.
If we keep the master code, we save the extra conversions (because we create a UCS2 Unicode that can be directly put back in a QString when calling C++ from Python) BUT, even if UCS2 Unicode object are valid Python, there are many places where this can fail in Python (e.g., try to print a UCS2 Unicode object with a surrogate pair).

AnyOldName3 · 2023-11-30T13:17:36Z

Something to bear in mind is that the Win32 and NT APIs don't care about the sequence of code points in a path, just the sequence of 16-bit values (it's a similar story on Unix, except it's eight-bit values), so getting rid of dodgy surrogates before conversion might mean you end up with a mod name that prints the same and compares equal using certain functions, but makes Windows attempt to access a different folder. It shouldn't have been possible to create a mod name with dodgy surrogates in the first place - both the MO2 GUI and Windows Explorer should be encoding strings in canonical form when creating a mod. It might be worth looking into whether they're not actually doing that, or whether the folder name is being subtly mangled when we load it.

Also, the Python docs are confusing. UCS2 and UCS4 are both fixed-width encodings, with UCS2 being the subset of UTF-16 that can be represented with one code unit per code point, and UCS4 is the same thing as UTF-32. UCS1 isn't a real thing, so it's unclear if Python's come up with their own thing that's the subset of UTF-8 that can be represented with one code unit per code point (basically equivalent to 7-bit ASCII padded to eight bits), or they're calling UTF-8 UCS1 and calling UTF-16 UCS2 even though they're different. If the former, then that would imply that Python Unicode objects are always fixed-width, and can only express emoji and other high-value code points in UCS4 mode, which would explain why we were getting dodgy surrogates when telling it to use UCS2 mode. The existence of the PyUnicode_MAX_CHAR_VALUE hints that this might be how it works.

Silarn · 2023-11-30T17:22:02Z

I think a more robust Python -> C++ conversion is fine.

In that case I don't think the change to the C++ -> Python is necessary. But you're saying these strings can still display improperly on the python side in some cases?

In that case I think using PyUnicode_FromString may make more sense so that plugin authors don't have to potentially reencode strings for random edge cases.

AnyOldName3 · 2023-11-30T19:42:45Z

https://peps.python.org/pep-0393/ seems to confirm my theory from earlier, i.e. that Python stores Unicode strings in whichever it deems to be the narrowest fixed-width encoding that will hold all the code points it contains. That means we either need to do all our string interop with the explicitly-UTF-8 functions, or the explicitly-UCS4 ones, as a UTF-16 string isn't necessarily bit-compatible with the UCS2 functions. The one exception is coming back from Python to C++ - if the function to query a Python string's internal representation says it's already UCS2, we can safely reinterpret that as UTF-16.

Holt59 · 2023-12-01T20:06:13Z

This is what I have now:

    /**
     * Conversion part 1 (Python->C++): convert a PyObject into a QString
     * instance or return false upon failure. The second argument
     * indicates whether implicit conversions should be applied.
     */
    bool type_caster<QString>::load(handle src, bool)
    {
        PyObject* objPtr = src.ptr();

        if (PyBytes_Check(objPtr)) {
            value = QString::fromUtf8(PyBytes_AsString(objPtr));
            return true;
        }
        else if (PyUnicode_Check(objPtr)) {
            switch (PyUnicode_KIND(objPtr)) {
            case PyUnicode_1BYTE_KIND:
                value = QString::fromUtf8(PyUnicode_AsUTF8(objPtr));
                break;
            case PyUnicode_2BYTE_KIND:
                value = QString::fromUtf16(
                    reinterpret_cast<char16_t*>(PyUnicode_2BYTE_DATA(objPtr)),
                    PyUnicode_GET_LENGTH(objPtr));
                break;
            case PyUnicode_4BYTE_KIND:
                value = QString::fromUcs4(
                    reinterpret_cast<char32_t*>(PyUnicode_4BYTE_DATA(objPtr)),
                    PyUnicode_GET_LENGTH(objPtr));
                break;
            default:
                return false;
            }

            return true;
        }
        else {
            return false;
        }
    }

    /**
     * Conversion part 2 (C++ -> Python): convert an QString instance into
     * a Python object. The second and third arguments are used to
     * indicate the return value policy and parent object (for
     * ``return_value_policy::reference_internal``) and are generally
     * ignored by implicit casters.
     */
    handle type_caster<QString>::cast(QString src, return_value_policy /* policy */,
                                      handle /* parent */)
    {
        return PyUnicode_DecodeUTF16(reinterpret_cast<const char*>(src.utf16()),
                                     2 * src.length(), nullptr, 0);
    }

The initial issue is that PyUnicode_FromKindAndData(PyUnicode_2BYTE_KIND) does not create a valid Python unicode object since the string (with emojis) contains characters that cannot be encoded on a single UCS2 character. Once the invalid Python unicode object is created, it creates tons of issue and in particular PyUnicode_AsUTF8 returns a null pointer.

The PyUnicode_DecodeUTF16 in cast avoid the creation of the intermediate std::string, and let Python decide the better storage (it actually uses UCS1 for string with ASCII or simple accents and UCS4 for the emoji).

In load, QString::fromUtf8(PyUnicode_AsUTF8(objPtr)) should be valid regardless of the kind since PyUnicode_AsUTF8 creates a new UTF-8 encoded string embedded in the Unicode object (if not already present). The other two switch cases avoid useless conversion through UTF-8 when possible (might be unnoticeable in practice here).

Silarn · 2023-12-01T20:10:57Z

That looks pretty good to me. Would you prefer I update this PR? Or do you want to just open a new one?

Holt59 · 2023-12-01T20:11:28Z

That looks pretty good to me. Would you prefer I update this PR? Or do you want to just open a new one?

You can update this PR.

Silarn · 2023-12-01T20:12:46Z

Alright, I can get to that in a few hours.

Holt59 · 2023-12-02T16:32:05Z

@Silarn I pushed the change with some extra unit tests. I let you test locally with rootbuilder (I only have checked that the unit tests are passing).

Silarn · 2023-12-02T20:46:56Z

This change appears to work. The strings are being passed in both directions without errors or mangled characters.

Fix string conversion issues with emoji characters

f4ee4da

Silarn requested a review from Holt59 November 29, 2023 06:11

Holt59 reviewed Nov 29, 2023

View reviewed changes

Fix load/cast of QString to avoid issues with surrogate pairs.

aac872c

Holt59 merged commit cc1bc4c into master Dec 3, 2023
2 of 3 checks passed

Holt59 deleted the emoji_fix branch August 9, 2024 20:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix string conversion issues with emoji characters #120

Fix string conversion issues with emoji characters #120

Silarn commented Nov 29, 2023

Holt59 Nov 29, 2023

Silarn Nov 29, 2023 •

edited

Loading

Holt59 Nov 29, 2023

Silarn Nov 29, 2023 •

edited

Loading

Holt59 commented Nov 30, 2023

AnyOldName3 commented Nov 30, 2023

Silarn commented Nov 30, 2023

AnyOldName3 commented Nov 30, 2023

Holt59 commented Dec 1, 2023 •

edited

Loading

Silarn commented Dec 1, 2023

Holt59 commented Dec 1, 2023

Silarn commented Dec 1, 2023

Holt59 commented Dec 2, 2023

Silarn commented Dec 2, 2023

Fix string conversion issues with emoji characters #120

Fix string conversion issues with emoji characters #120

Conversation

Silarn commented Nov 29, 2023

Holt59 Nov 29, 2023

Choose a reason for hiding this comment

Silarn Nov 29, 2023 • edited Loading

Choose a reason for hiding this comment

Holt59 Nov 29, 2023

Choose a reason for hiding this comment

Silarn Nov 29, 2023 • edited Loading

Choose a reason for hiding this comment

Holt59 commented Nov 30, 2023

AnyOldName3 commented Nov 30, 2023

Silarn commented Nov 30, 2023

AnyOldName3 commented Nov 30, 2023

Holt59 commented Dec 1, 2023 • edited Loading

Silarn commented Dec 1, 2023

Holt59 commented Dec 1, 2023

Silarn commented Dec 1, 2023

Holt59 commented Dec 2, 2023

Silarn commented Dec 2, 2023

Silarn Nov 29, 2023 •

edited

Loading

Silarn Nov 29, 2023 •

edited

Loading

Holt59 commented Dec 1, 2023 •

edited

Loading