Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix string conversion issues with emoji characters #120

Merged
merged 2 commits into from
Dec 3, 2023
Merged

Conversation

Silarn
Copy link
Member

@Silarn Silarn commented Nov 29, 2023

This should prevent crashes and seems to fix the issue with emoji characters.

I'm unsure if this might cause other problems, however, so a review is very appreciated.

@Silarn Silarn requested a review from Holt59 November 29, 2023 06:11
@@ -46,6 +46,10 @@ namespace pybind11::detail {
PyObject* strPtr =
PyUnicode_Check(objPtr) ? PyUnicode_AsUTF8String(objPtr) : objPtr;

if (!strPtr) {
return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is the fix for the crashes but can you tell what was in objPtr when this occurs?

Copy link
Member Author

@Silarn Silarn Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a string object, but the string format returned when there are emojis fails to properly convert back into a QString.

This is described in more detail in #plugins-dev.

@@ -68,8 +72,7 @@ namespace pybind11::detail {
handle /* parent */)
{
static_assert(sizeof(QChar) == 2);
return PyUnicode_FromKindAndData(PyUnicode_2BYTE_KIND, src.constData(),
src.length());
return PyUnicode_FromString(src.toStdString().c_str());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to change this? AFAIK, Unicode string in Python can be stored as UTF-16 so this adds an extra conversion from UTF-16 to UTF-8.

Copy link
Member Author

@Silarn Silarn Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emoji characters can't be directly converted back to QStrings in the previous function. The current workaround is to encode to UTF-16 with a surrogatepass and decode again. This allows us to directly use QStrings from C++ -> Python -> C++ without having reencode the strings again in Python.

See #plugins-dev.

@Holt59
Copy link
Member

Holt59 commented Nov 30, 2023

I made some tests to try to understand the root cause. The problem is that PyUnicode_FromKindAndData(PyUnicode_2BYTE_KIND) creates a unicode object encoded as UCS2, which is valid Python, but cannot then be decoded per PyUnicode_AsUTF8String due to the surrogate pair.

So basically, in master, the C++ > Python conversion is valid because having a UCS2 encoded Unicode object is valid Python, but the Python > C++ code is not. The "proper" fix is to check the KIND of the Unicode object and use the proper function to extract it, which is the opposite of what this fix does (or partially) - I'm not saying this PR is not the one we should use, read below.

I think we should fix the Python > C++ code by checking the kinds - This is a better fix than checking the return of PyUnicode_AsUTF8String as this would handle Python strings encoded as UCS2 and UCS4 instead of simply failing to convert. This is a bit more verbose but I can provide the fix (I have it on my home computer, just forgot to push it yesterday... ). Note that I have yet to find a way to create a UCS2 or UCS4 encoded string in Python that does not come from a C bindings, so this may be "overkill" but should not add overhead.

Regarding the C++ > Python "fix":

  • If we use this PR fix, we should be safe to do whatever we want on the Python side with the converted QString and pass it back to C++ safely, but that incurs extra UTF-16 / UTF-8 conversions - I don't know how Qt handles surrogate pairs, but I trust that this is ok since this seems to work.
  • If we keep the master code, we save the extra conversions (because we create a UCS2 Unicode that can be directly put back in a QString when calling C++ from Python) BUT, even if UCS2 Unicode object are valid Python, there are many places where this can fail in Python (e.g., try to print a UCS2 Unicode object with a surrogate pair).

@AnyOldName3
Copy link
Member

Something to bear in mind is that the Win32 and NT APIs don't care about the sequence of code points in a path, just the sequence of 16-bit values (it's a similar story on Unix, except it's eight-bit values), so getting rid of dodgy surrogates before conversion might mean you end up with a mod name that prints the same and compares equal using certain functions, but makes Windows attempt to access a different folder. It shouldn't have been possible to create a mod name with dodgy surrogates in the first place - both the MO2 GUI and Windows Explorer should be encoding strings in canonical form when creating a mod. It might be worth looking into whether they're not actually doing that, or whether the folder name is being subtly mangled when we load it.

Also, the Python docs are confusing. UCS2 and UCS4 are both fixed-width encodings, with UCS2 being the subset of UTF-16 that can be represented with one code unit per code point, and UCS4 is the same thing as UTF-32. UCS1 isn't a real thing, so it's unclear if Python's come up with their own thing that's the subset of UTF-8 that can be represented with one code unit per code point (basically equivalent to 7-bit ASCII padded to eight bits), or they're calling UTF-8 UCS1 and calling UTF-16 UCS2 even though they're different. If the former, then that would imply that Python Unicode objects are always fixed-width, and can only express emoji and other high-value code points in UCS4 mode, which would explain why we were getting dodgy surrogates when telling it to use UCS2 mode. The existence of the PyUnicode_MAX_CHAR_VALUE hints that this might be how it works.

@Silarn
Copy link
Member Author

Silarn commented Nov 30, 2023

I think a more robust Python -> C++ conversion is fine.

In that case I don't think the change to the C++ -> Python is necessary. But you're saying these strings can still display improperly on the python side in some cases?

In that case I think using PyUnicode_FromString may make more sense so that plugin authors don't have to potentially reencode strings for random edge cases.

@AnyOldName3
Copy link
Member

https://peps.python.org/pep-0393/ seems to confirm my theory from earlier, i.e. that Python stores Unicode strings in whichever it deems to be the narrowest fixed-width encoding that will hold all the code points it contains. That means we either need to do all our string interop with the explicitly-UTF-8 functions, or the explicitly-UCS4 ones, as a UTF-16 string isn't necessarily bit-compatible with the UCS2 functions. The one exception is coming back from Python to C++ - if the function to query a Python string's internal representation says it's already UCS2, we can safely reinterpret that as UTF-16.

@Holt59
Copy link
Member

Holt59 commented Dec 1, 2023

This is what I have now:

    /**
     * Conversion part 1 (Python->C++): convert a PyObject into a QString
     * instance or return false upon failure. The second argument
     * indicates whether implicit conversions should be applied.
     */
    bool type_caster<QString>::load(handle src, bool)
    {
        PyObject* objPtr = src.ptr();

        if (PyBytes_Check(objPtr)) {
            value = QString::fromUtf8(PyBytes_AsString(objPtr));
            return true;
        }
        else if (PyUnicode_Check(objPtr)) {
            switch (PyUnicode_KIND(objPtr)) {
            case PyUnicode_1BYTE_KIND:
                value = QString::fromUtf8(PyUnicode_AsUTF8(objPtr));
                break;
            case PyUnicode_2BYTE_KIND:
                value = QString::fromUtf16(
                    reinterpret_cast<char16_t*>(PyUnicode_2BYTE_DATA(objPtr)),
                    PyUnicode_GET_LENGTH(objPtr));
                break;
            case PyUnicode_4BYTE_KIND:
                value = QString::fromUcs4(
                    reinterpret_cast<char32_t*>(PyUnicode_4BYTE_DATA(objPtr)),
                    PyUnicode_GET_LENGTH(objPtr));
                break;
            default:
                return false;
            }

            return true;
        }
        else {
            return false;
        }
    }

    /**
     * Conversion part 2 (C++ -> Python): convert an QString instance into
     * a Python object. The second and third arguments are used to
     * indicate the return value policy and parent object (for
     * ``return_value_policy::reference_internal``) and are generally
     * ignored by implicit casters.
     */
    handle type_caster<QString>::cast(QString src, return_value_policy /* policy */,
                                      handle /* parent */)
    {
        return PyUnicode_DecodeUTF16(reinterpret_cast<const char*>(src.utf16()),
                                     2 * src.length(), nullptr, 0);
    }

The initial issue is that PyUnicode_FromKindAndData(PyUnicode_2BYTE_KIND) does not create a valid Python unicode object since the string (with emojis) contains characters that cannot be encoded on a single UCS2 character. Once the invalid Python unicode object is created, it creates tons of issue and in particular PyUnicode_AsUTF8 returns a null pointer.

The PyUnicode_DecodeUTF16 in cast avoid the creation of the intermediate std::string, and let Python decide the better storage (it actually uses UCS1 for string with ASCII or simple accents and UCS4 for the emoji).

In load, QString::fromUtf8(PyUnicode_AsUTF8(objPtr)) should be valid regardless of the kind since PyUnicode_AsUTF8 creates a new UTF-8 encoded string embedded in the Unicode object (if not already present). The other two switch cases avoid useless conversion through UTF-8 when possible (might be unnoticeable in practice here).

@Silarn
Copy link
Member Author

Silarn commented Dec 1, 2023

That looks pretty good to me. Would you prefer I update this PR? Or do you want to just open a new one?

@Holt59
Copy link
Member

Holt59 commented Dec 1, 2023

That looks pretty good to me. Would you prefer I update this PR? Or do you want to just open a new one?

You can update this PR.

@Silarn
Copy link
Member Author

Silarn commented Dec 1, 2023

Alright, I can get to that in a few hours.

@Holt59
Copy link
Member

Holt59 commented Dec 2, 2023

@Silarn I pushed the change with some extra unit tests. I let you test locally with rootbuilder (I only have checked that the unit tests are passing).

@Silarn
Copy link
Member Author

Silarn commented Dec 2, 2023

This change appears to work. The strings are being passed in both directions without errors or mangled characters.

@Holt59 Holt59 merged commit cc1bc4c into master Dec 3, 2023
2 of 3 checks passed
@Holt59 Holt59 deleted the emoji_fix branch August 9, 2024 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants