-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-119182: Add PyUnicodeWriter_DecodeUTF8Stateful() #120639
Conversation
PR to discuss extensions to the PyAPI_FUNC(int) PyUnicodeWriter_WriteWideChar(
PyUnicodeWriter *writer,
wchar_t *str,
Py_ssize_t size);
PyAPI_FUNC(int) PyUnicodeWriter_DecodeUTF8Stateful(
PyUnicodeWriter *writer,
const char *string, /* UTF-8 encoded string */
Py_ssize_t length, /* size of string */
const char *errors, /* error handling */
Py_ssize_t *consumed); /* bytes consumed */ |
Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful() functions.
7c4cc95
to
8aa73b7
Compare
Objects/unicodeobject.c
Outdated
if (size < 0) { | ||
size = wcslen(str); | ||
} | ||
PyObject *obj = PyUnicode_FromWideChar(str, size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this API will be used a lot to build Python Unicode objects from wchar_t input, I think it's better to try to optimize it and avoid creating a temporary object.
The PyUnicode_FromWideChar() could be refactored using a private helper shared by both PyUnicode_FromWideChar () and this PyUnicodeWriter_WriteWideChar() to make this possible: https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L1794
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I optimized PyUnicodeWriter_WriteWideChar(). I ran a benchmark on _testcapi.test_unicodewriter_widechar()
:
$ env/bin/python -m pyperf timeit -s 'from _testcapi import test_unicodewriter_widechar' 'test_unicodewriter_widechar()' -o ref.json -v
(...)
$ python3 -m pyperf compare_to ref.json optim.json
Mean +- std dev: [ref] 203 ns +- 9 ns -> [optim] 150 ns +- 3 ns: 1.35x faster
It's a 1.4x faster, so it's worth it. It saves around 53 ns for 3 calls to PyUnicodeWriter_WriteWideChar()
.
Avoid a temporary Unicode object, write directly into the writer.
@malemburg: Is |
Yes, thanks for adding that. |
Objects/unicodeobject.c
Outdated
PyObject * | ||
PyUnicode_FromWideChar(const wchar_t *u, Py_ssize_t size) | ||
static inline int | ||
unicode_fromwidechar(const wchar_t *u, Py_ssize_t size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that more than a half of this function is now specific to the caller. This is a mess. I wonder, would not it be simpler if write it as two different functions specialized for their case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored PyUnicode_FromWideChar() and PyUnicodeWriter_WriteWideChar(): I added unicode_write_widechar() and removed unicode_convert_wchar_to_ucs4(). Does it look better?
Remove unicode_convert_wchar_to_ucs4(). Refactor PyUnicode_FromWideChar() and PyUnicodeWriter_WriteWideChar().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also unicode_fromformat_write_wcstr
. Do you leave it to the next PR?
Objects/unicodeobject.c
Outdated
// This code assumes that unicode can hold one more code point than | ||
// wstr characters for a terminating null character. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is no longer true, after adding the (iter+1) < end
check.
|
||
// consumed is 0 if write fails | ||
consumed = 12345; | ||
assert(PyUnicodeWriter_DecodeUTF8Stateful(writer, "invalid\xFF", -1, NULL, &consumed) < 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This do nothing in non-debug build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assertions are always built in _testcapi.c: the NDEBUG macro is undefined early in parts.h.
if (PyUnicodeWriter_WriteWideChar(writer, L"-", 1) < 0) { | ||
goto error; | ||
} | ||
if (PyUnicodeWriter_WriteWideChar(writer, L"euro=\u20AC", -1) < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also test surrogate pairs and non-BMP characters.
Since the code depends on the kind of the buffer string, you need to test different combinations: write different strings after writing a UCS2 or UCS4 string.
I suggest to implement in C a function which creates a PyUnicodeWriter, write the first argument as a Python string, then covert the second argument to the wchar_t*
string and write it with size specified as optional third argument, and return the result. This helper function can be called in Python code with different arguments. The result will be checked even in non-debug build. You can test much more cases.
Co-authored-by: Serhiy Storchaka <[email protected]>
@serhiy-storchaka: I tried to address most of your reviews. Would you mind to review the updated PR? For tests, it's really complicated to write tests in C. I think that I will try to expose the C API PyUnicodeWriter in Python to write tests in Python in a following PR. I wanted to do that at the beginning, but it was quicker to start with C. Now the C test suite of PyUnicodeWriter is already quite big!
Right, I prefer to leave it as it is for now and write a following PR. |
Ok, I merged this PR as a starting point. I will rework tests in a follow-up PR. Thanks @serhiy-storchaka and @malemburg for your reviews. |
Rewrite tests in Python: #120845 |
) Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful() functions. Co-authored-by: Serhiy Storchaka <[email protected]>
) Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful() functions. Co-authored-by: Serhiy Storchaka <[email protected]>
) Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful() functions. Co-authored-by: Serhiy Storchaka <[email protected]>
Add PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful() functions.
📚 Documentation preview 📚: https://cpython-previews--120639.org.readthedocs.build/