gh-119182: Add PyUnicodeWriter_DecodeUTF8Stateful() #120639

vstinner · 2024-06-17T13:36:48Z

Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful() functions.

Issue: [C API] Add an efficient public PyUnicodeWriter API #119182

📚 Documentation preview 📚: https://cpython-previews--120639.org.readthedocs.build/

vstinner · 2024-06-17T13:37:39Z

PR to discuss extensions to the PyUnicodeWriter API:

PyAPI_FUNC(int) PyUnicodeWriter_WriteWideChar(
    PyUnicodeWriter *writer,
    wchar_t *str,
    Py_ssize_t size);

PyAPI_FUNC(int) PyUnicodeWriter_DecodeUTF8Stateful(
    PyUnicodeWriter *writer,
    const char *string,         /* UTF-8 encoded string */
    Py_ssize_t length,          /* size of string */
    const char *errors,         /* error handling */
    Py_ssize_t *consumed);      /* bytes consumed */
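To illustrate what the `consumed` output is for, here is a minimal standalone sketch of the "stateful" idea: complete UTF-8 sequences are accepted, a truncated sequence at the very end is left unconsumed so the caller can retry once more bytes arrive, and `consumed` is reset to 0 on error. The names `utf8_seq_len` and `utf8_consumable` are hypothetical and this is not the CPython decoder; it skips overlong and surrogate checks for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: expected sequence length for a UTF-8 lead byte,
   or 0 if the byte cannot start a sequence. */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;  /* continuation byte or invalid lead byte */
}

/* Hypothetical sketch (not CPython code): store in *consumed the number of
   leading bytes that form complete sequences. Returns 0 on success and -1
   on an invalid byte, in which case *consumed is 0. */
int utf8_consumable(const char *buf, size_t len, size_t *consumed)
{
    assert(buf != NULL && consumed != NULL);
    size_t i = 0;
    *consumed = 0;
    while (i < len) {
        int n = utf8_seq_len((unsigned char)buf[i]);
        if (n == 0) {
            return -1;
        }
        size_t avail = len - i;
        size_t check = ((size_t)n < avail) ? (size_t)n : avail;
        /* Validate the continuation bytes that are actually present. */
        for (size_t k = 1; k < check; k++) {
            if (((unsigned char)buf[i + k] & 0xC0) != 0x80) {
                return -1;
            }
        }
        if ((size_t)n > avail) {
            break;  /* truncated trailing sequence: leave it unconsumed */
        }
        i += (size_t)n;
    }
    *consumed = i;
    return 0;
}
```

With the input "euro=" followed by only the first two bytes of the euro sign, this sketch reports 5 bytes consumed, so a caller would keep the last two bytes for the next chunk.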

vstinner · 2024-06-17T15:57:04Z

cc @serhiy-storchaka @malemburg @zooba

malemburg · 2024-06-19T08:22:10Z

Objects/unicodeobject.c

+    if (size < 0) {
+        size = wcslen(str);
+    }
+    PyObject *obj = PyUnicode_FromWideChar(str, size);


Since this API will be used a lot to build Python Unicode objects from wchar_t input, I think it's better to try to optimize it and avoid creating a temporary object.

PyUnicode_FromWideChar() could be refactored to use a private helper shared by both PyUnicode_FromWideChar() and this PyUnicodeWriter_WriteWideChar(), which would make this possible: https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L1794

Ok. I optimized PyUnicodeWriter_WriteWideChar(). I ran a benchmark on _testcapi.test_unicodewriter_widechar():

$ env/bin/python -m pyperf timeit -s 'from _testcapi import test_unicodewriter_widechar' 'test_unicodewriter_widechar()' -o ref.json -v
(...)
$ python3 -m pyperf compare_to ref.json optim.json
Mean +- std dev: [ref] 203 ns +- 9 ns -> [optim] 150 ns +- 3 ns: 1.35x faster

It's about 1.35x faster, so it's worth it. It saves around 53 ns across 3 calls to PyUnicodeWriter_WriteWideChar().

Avoid a temporary Unicode object, write directly into the writer.
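For readers who want the shape of the optimization, here is a rough standalone sketch of "write directly into the writer": the wide characters are copied straight into the writer's buffer, with no intermediate string object, and a negative size means the input is null-terminated. `ToyWriter` and `toy_write_widechar` are hypothetical stand-ins, not the CPython implementation, and surrogate pairs are not combined here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical fixed-capacity writer standing in for PyUnicodeWriter. */
typedef struct {
    uint32_t data[64];
    size_t len;
} ToyWriter;

/* Hypothetical sketch: append str directly to the writer's buffer.
   size < 0 means "str is null-terminated, measure it with wcslen()".
   Returns 0 on success, -1 if the buffer has no room. */
int toy_write_widechar(ToyWriter *w, const wchar_t *str, ptrdiff_t size)
{
    assert(w != NULL && str != NULL);
    size_t n = (size < 0) ? wcslen(str) : (size_t)size;
    size_t capacity = sizeof(w->data) / sizeof(w->data[0]);
    if (n > capacity - w->len) {
        return -1;
    }
    /* Copy code units straight into the destination, no temporary. */
    for (size_t i = 0; i < n; i++) {
        w->data[w->len + i] = (uint32_t)str[i];
    }
    w->len += n;
    return 0;
}
```

The saving in the real code presumably comes from skipping the allocation and copy of the temporary object; the toy version only shows the control flow.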

vstinner · 2024-06-19T10:22:24Z

@malemburg: Is PyUnicodeWriter_DecodeUTF8Stateful() the API that you wanted?

malemburg · 2024-06-19T13:04:45Z

@malemburg: Is PyUnicodeWriter_DecodeUTF8Stateful() the API that you wanted?

Yes, thanks for adding that.

serhiy-storchaka · 2024-06-20T08:07:45Z

Objects/unicodeobject.c

-PyObject *
-PyUnicode_FromWideChar(const wchar_t *u, Py_ssize_t size)
+static inline int
+unicode_fromwidechar(const wchar_t *u, Py_ssize_t size,


It seems that more than half of this function is now specific to the caller. This is a mess. I wonder whether it would be simpler to write it as two different functions, each specialized for its own case.

I refactored PyUnicode_FromWideChar() and PyUnicodeWriter_WriteWideChar(): I added unicode_write_widechar() and removed unicode_convert_wchar_to_ucs4(). Does it look better?

Remove unicode_convert_wchar_to_ucs4(). Refactor PyUnicode_FromWideChar() and PyUnicodeWriter_WriteWideChar().

serhiy-storchaka

There is also unicode_fromformat_write_wcstr. Are you leaving it for the next PR?

serhiy-storchaka · 2024-06-20T12:30:41Z

Objects/unicodeobject.c

+        // This code assumes that unicode can hold one more code point than
+        // wstr characters for a terminating null character.


I think this is no longer true, after adding the (iter+1) < end check.
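A standalone sketch of why the bounds check removes that assumption: once the low surrogate is read only when `(iter + 1) < end` holds, the loop never looks past the input, so no terminating null is needed. This is an illustration under my own assumptions (16-bit code units in a `uint16_t` array, a hypothetical `utf16_to_ucs4` name), not the code from this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: convert UTF-16 code units to code points.
   dst must have room for len elements. Returns the number of code
   points written. Lone surrogates are copied through unchanged. */
size_t utf16_to_ucs4(const uint16_t *src, size_t len, uint32_t *dst)
{
    assert(src != NULL && dst != NULL);
    const uint16_t *iter = src;
    const uint16_t *end = src + len;
    size_t n = 0;
    while (iter < end) {
        /* Only read iter[1] when it is inside the input. */
        if ((iter + 1) < end
            && iter[0] >= 0xD800 && iter[0] <= 0xDBFF
            && iter[1] >= 0xDC00 && iter[1] <= 0xDFFF)
        {
            uint32_t high = (uint32_t)(iter[0] - 0xD800);
            uint32_t low = (uint32_t)(iter[1] - 0xDC00);
            dst[n++] = 0x10000 + ((high << 10) | low);
            iter += 2;
        }
        else {
            dst[n++] = *iter++;
        }
    }
    return n;
}
```

A high surrogate in the last position falls into the `else` branch instead of peeking at a byte beyond the buffer.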

serhiy-storchaka · 2024-06-20T13:09:54Z

Modules/_testcapi/unicode.c

+
+    // consumed is 0 if write fails
+    consumed = 12345;
+    assert(PyUnicodeWriter_DecodeUTF8Stateful(writer, "invalid\xFF", -1, NULL, &consumed) < 0);


This does nothing in a non-debug build.

Assertions are always built in _testcapi.c: the NDEBUG macro is undefined early in parts.h.
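The mechanism can be shown in isolation: `assert()` expands to nothing when `NDEBUG` is defined at the point where `<assert.h>` is included, so undefining the macro first keeps the checked expression alive. The sketch below is generic C, not the contents of parts.h; `failing_call`, `run_checks` and `checked_call_count` are hypothetical names.

```c
/* Undefine NDEBUG before including <assert.h> so that assert() is
   always active in this file, even in a release build. */
#undef NDEBUG
#include <assert.h>

/* Counts how many times the asserted expression was evaluated. */
int checked_call_count = 0;

/* Hypothetical stand-in for an API call that is expected to fail. */
int failing_call(void)
{
    checked_call_count++;
    return -1;
}

void run_checks(void)
{
    /* With NDEBUG defined, this call would be compiled out entirely. */
    assert(failing_call() < 0);
}
```

Calling a function with side effects inside `assert()` is only safe under this arrangement, which is the concern raised in the review comment above.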

serhiy-storchaka · 2024-06-20T13:21:57Z

Modules/_testcapi/unicode.c

+    if (PyUnicodeWriter_WriteWideChar(writer, L"-", 1) < 0) {
+        goto error;
+    }
+    if (PyUnicodeWriter_WriteWideChar(writer, L"euro=\u20AC", -1) < 0) {


Also test surrogate pairs and non-BMP characters.

Since the code depends on the kind of the buffer string, you need to test different combinations: write different strings after writing a UCS2 or UCS4 string.

I suggest implementing a C function which creates a PyUnicodeWriter, writes the first argument as a Python string, then converts the second argument to a wchar_t* string and writes it with the size given by an optional third argument, and returns the result. This helper function can be called from Python code with different arguments. The result will be checked even in a non-debug build, and you can test many more cases.
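As background for the "kind of the buffer string" remark, my understanding is that CPython stores a string with 1, 2 or 4 bytes per character depending on its largest code point (PEP 393), which is why the order of writes matters for test coverage. A minimal standalone sketch of that width selection, with a hypothetical `min_char_width` name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: smallest per-character width (in bytes) able to
   hold every code point in cps[0..n), following the PEP 393 thresholds. */
int min_char_width(const uint32_t *cps, size_t n)
{
    assert(cps != NULL || n == 0);
    uint32_t maxchar = 0;
    for (size_t i = 0; i < n; i++) {
        if (cps[i] > maxchar) {
            maxchar = cps[i];
        }
    }
    if (maxchar <= 0xFF) {
        return 1;   /* Latin-1 range */
    }
    if (maxchar <= 0xFFFF) {
        return 2;   /* Basic Multilingual Plane */
    }
    return 4;       /* non-BMP characters */
}
```

Under that assumption, the interesting cases are the ones where a later write needs a wider buffer than the earlier ones, for example ASCII followed by U+20AC, or U+20AC followed by a non-BMP character.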

Co-authored-by: Serhiy Storchaka <[email protected]>

vstinner · 2024-06-20T14:05:00Z

@serhiy-storchaka: I tried to address most of your review comments. Would you mind reviewing the updated PR?

As for the tests, they are really complicated to write in C. I think I will try to expose the PyUnicodeWriter C API in Python, so that tests can be written in Python, in a follow-up PR. I wanted to do that at the beginning, but it was quicker to start with C. Now the C test suite of PyUnicodeWriter is already quite big!

@serhiy-storchaka:

There is also unicode_fromformat_write_wcstr. Do you leave it to the next PR?

Right, I prefer to leave it as it is for now and write a follow-up PR.

vstinner · 2024-06-21T17:33:47Z

Ok, I merged this PR as a starting point. I will rework tests in a follow-up PR.

Thanks @serhiy-storchaka and @malemburg for your reviews.

vstinner · 2024-06-21T17:50:13Z

I will rework tests in a follow-up PR.

Rewrite tests in Python: #120845

bedevere-app bot mentioned this pull request Jun 17, 2024

[C API] Add an efficient public PyUnicodeWriter API #119182

Closed

vstinner mentioned this pull request Jun 17, 2024

Add PyUnicodeWriter API capi-workgroup/decisions#27

Closed

gh-119182: Add PyUnicodeWriter_DecodeUTF8Stateful()

8aa73b7

Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful() functions.

vstinner force-pushed the WIP_unicode_writer_more branch from 7c4cc95 to 8aa73b7 Compare June 17, 2024 15:56

vstinner changed the title ~~[WIP] gh-119182: Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful()~~ gh-119182: Add PyUnicodeWriter_DecodeUTF8Stateful() Jun 17, 2024

vstinner marked this pull request as ready for review June 17, 2024 15:56

bedevere-app bot added the awaiting core review label Jun 17, 2024

doc: fix typo

788a85f

vstinner added the skip news label Jun 17, 2024

malemburg reviewed Jun 19, 2024

Optimize PyUnicodeWriter_WriteWideChar()

e67a8b4

Avoid a temporary Unicode object, write directly into the writer.

serhiy-storchaka reviewed Jun 19, 2024

Update Objects/unicodeobject.c

de56475

serhiy-storchaka self-requested a review June 19, 2024 14:55

Fix compiler warning

e48eec7

serhiy-storchaka reviewed Jun 20, 2024

Add unicode_write_widechar()

75fa8ba

Remove unicode_convert_wchar_to_ucs4(). Refactor PyUnicode_FromWideChar() and PyUnicodeWriter_WriteWideChar().

serhiy-storchaka reviewed Jun 20, 2024

vstinner and others added 3 commits June 20, 2024 15:40

Update Doc/c-api/unicode.rst

3f284f8

Co-authored-by: Serhiy Storchaka <[email protected]>

Address Serhiy's review

1e018d2

Add more tests

6f29c53

vstinner merged commit 4123226 into python:main Jun 21, 2024
36 checks passed

vstinner deleted the WIP_unicode_writer_more branch June 21, 2024 17:33

bedevere-app bot removed the awaiting core review label Jun 21, 2024
