[C API] Add an efficient public PyUnicodeWriter API #119182

Closed
vstinner opened this issue May 19, 2024 · 32 comments
Labels
topic-C-API type-feature A feature request or enhancement

Comments

@vstinner
Member

vstinner commented May 19, 2024

Feature or enhancement

Creating a Python string object in an efficient way is complicated. Python has a private _PyUnicodeWriter API, which is used by these projects:

Affected projects (5):

  • Cython (3.0.9)
  • asyncpg (0.29.0)
  • catboost (1.2.3)
  • frozendict (2.4.0)
  • immutables (0.20)

I propose making the API public to promote it and to help C extension maintainers write more efficient code to create Python string objects.

API:

typedef struct PyUnicodeWriter PyUnicodeWriter;

PyAPI_FUNC(PyUnicodeWriter*) PyUnicodeWriter_Create(void);
PyAPI_FUNC(void) PyUnicodeWriter_Discard(PyUnicodeWriter *writer);
PyAPI_FUNC(PyObject*) PyUnicodeWriter_Finish(PyUnicodeWriter *writer);

PyAPI_FUNC(void) PyUnicodeWriter_SetOverallocate(
    PyUnicodeWriter *writer,
    int overallocate);

PyAPI_FUNC(int) PyUnicodeWriter_WriteChar(
    PyUnicodeWriter *writer,
    Py_UCS4 ch);
PyAPI_FUNC(int) PyUnicodeWriter_WriteUTF8(
    PyUnicodeWriter *writer,
    const char *str,  // decoded from UTF-8
    Py_ssize_t len);  // use strlen() if len < 0
PyAPI_FUNC(int) PyUnicodeWriter_Format(
    PyUnicodeWriter *writer,
    const char *format,
    ...);

// Write str(obj)
PyAPI_FUNC(int) PyUnicodeWriter_WriteStr(
    PyUnicodeWriter *writer,
    PyObject *obj);

// Write repr(obj)
PyAPI_FUNC(int) PyUnicodeWriter_WriteRepr(
    PyUnicodeWriter *writer,
    PyObject *obj);

// Write str[start:end]
PyAPI_FUNC(int) PyUnicodeWriter_WriteSubstring(
    PyUnicodeWriter *writer,
    PyObject *str,
    Py_ssize_t start,
    Py_ssize_t end);

The internal writer buffer is overallocated by default. PyUnicodeWriter_Finish() truncates the buffer to the exact size if the buffer was overallocated.

Overallocation reduces the quadratic cost of appending short strings in a loop. Use PyUnicodeWriter_SetOverallocate(writer, 0) to disable overallocation just before the last write.
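
The effect of overallocation can be illustrated with a toy model. This is only a sketch, not CPython's implementation: the function name copied_chars(), the 25% growth factor and the assumption that every reallocation moves the buffer are all hypothetical choices made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (not CPython code): count how many characters are copied
 * when n_writes one-character strings are appended to a buffer that is
 * assumed to move on every reallocation. */
static size_t
copied_chars(size_t n_writes, int overallocate)
{
    size_t capacity = 0;   /* allocated characters */
    size_t length = 0;     /* used characters */
    size_t copied = 0;     /* characters moved by reallocations */

    for (size_t i = 0; i < n_writes; i++) {
        size_t needed = length + 1;
        if (needed > capacity) {
            /* Worst case: the block moves, so the content is copied. */
            copied += length;
            /* Hypothetical policy: 25% headroom when overallocating,
             * exact fit otherwise. */
            capacity = overallocate ? needed + needed / 4 : needed;
        }
        length = needed;
    }
    assert(length == n_writes);
    return copied;
}
```

With an exact fit every write reallocates, so the copied total grows with the square of the number of writes. With headroom, reallocations become rarer as the buffer grows and the total stays roughly proportional to the final length.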

The writer takes care of the internal buffer kind: Py_UCS1 (latin1), Py_UCS2 (BMP) or Py_UCS4 (full Unicode Character Set). It also implements an optimization if a single write is made using PyUnicodeWriter_WriteStr(): it returns the string unchanged without any copy.
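
A minimal sketch of how the buffer kind can follow the widest character written so far. The thresholds match the three storage widths named above (1, 2 or 4 bytes per character); the helper names kind_for_maxchar() and kind_after_write() are hypothetical and are not part of the proposed API.

```c
#include <assert.h>
#include <stdint.h>

/* Width in bytes of the narrowest storage unit that can hold maxchar. */
static int
kind_for_maxchar(uint32_t maxchar)
{
    assert(maxchar <= 0x10FFFF);   /* highest valid Unicode code point */
    if (maxchar <= 0xFF) {
        return 1;                  /* Py_UCS1: latin1 */
    }
    if (maxchar <= 0xFFFF) {
        return 2;                  /* Py_UCS2: BMP */
    }
    return 4;                      /* Py_UCS4: full range */
}

/* Kind needed after writing ch: the buffer only ever gets wider. */
static int
kind_after_write(int current_kind, uint32_t ch)
{
    int needed = kind_for_maxchar(ch);
    return needed > current_kind ? needed : current_kind;
}
```

A real writer would also have to convert the characters already stored when the kind widens; that step is left out here.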


Example of usage (simplified code from Python/unionobject.c):

static PyObject *
union_repr(PyObject *self)
{
    unionobject *alias = (unionobject *)self;
    Py_ssize_t len = PyTuple_GET_SIZE(alias->args);

    PyUnicodeWriter *writer = PyUnicodeWriter_Create();
    if (writer == NULL) {
        return NULL;
    }

    for (Py_ssize_t i = 0; i < len; i++) {
        if (i > 0 && PyUnicodeWriter_WriteUTF8(writer, " | ", 3) < 0) {
            goto error;
        }
        PyObject *p = PyTuple_GET_ITEM(alias->args, i);
        if (PyUnicodeWriter_WriteRepr(writer, p) < 0) {
            goto error;
        }
    }
    return PyUnicodeWriter_Finish(writer);

error:
    PyUnicodeWriter_Discard(writer);
    return NULL;
}

Linked PRs

@vstinner vstinner added type-feature A feature request or enhancement topic-C-API labels May 19, 2024
vstinner added a commit to vstinner/cpython that referenced this issue May 19, 2024
Move the private _PyUnicodeWriter API to the internal C API.
@vstinner
Member Author

Benchmark using:

bench_concat: Mean +- std dev: 2.07 us +- 0.03 us
bench_writer: Mean +- std dev: 894 ns +- 13 ns

PyUnicodeWriter is 2.3x faster than PyUnicode_Concat()+PyUnicode_Append().

The difference comes from overallocation: if I add PyUnicodeWriter_SetOverallocate(writer, 0); after PyUnicodeWriter_Create(), PyUnicodeWriter has the same performance as PyUnicode_Concat()+PyUnicode_Append(). Overallocation avoids str += str quadratic complexity (well, at least, it reduces the complexity).

The PyUnicodeWriter API makes overallocation easy to use.

cc @serhiy-storchaka

@vstinner
Member Author

By the way, PyPy provides __pypy__.builders.StringBuilder for "Fast String Concatenation": https://doc.pypy.org/en/latest/__pypy__-module.html#fast-string-concatenation to work around the str += str quadratic complexity.

$ pypy3.9 
>>>> import __pypy__
>>>> b=__pypy__.builders.StringBuilder()
>>>> b.append('x')
>>>> b.append('=')
>>>> b.append('value')
>>>> b.build()
'x=value'

@vstinner
Member Author

Article about this performance problem in Python: https://lwn.net/Articles/816415/

@gvanrossum
Member

Curious if this warrants a further API PyUnicodeWriter_WriteStr(writer, obj) which appends repr(obj) (just as WriteStr(writer, obj) can be seen to append str(obj)), and eventually the development of a new type slot that writes the repr or str of an object to a writer rather than returning a string object. (And maybe even an "WriteAscii" to write ascii(obj)) and WriteFormat to do something with formats. :-)

I know, I know, hyper-generalization, yet this is what the Union example is screaming for... I suppose we can add those later.

How long has the internal writer API existed?

Would these be in the Stable ABI / Limited API from the start? (API-wise these look stable.)

@vstinner
Member Author

Curious if this warrants a further API PyUnicodeWriter_WriteStr(writer, obj) which appends repr(obj)

I suppose that you mean PyUnicodeWriter_WriteRepr().

Curious if this warrants a further API PyUnicodeWriter_WriteStr(writer, obj) which appends repr(obj) (just as WriteStr(writer, obj) can be seen to append str(obj)), and eventually the development of a new type slot that writes the repr or str of an object to a writer rather than returning a string object. (And maybe even an "WriteAscii" to write ascii(obj)) and WriteFormat to do something with formats. :-)

There is already a collection of helper functions accepting a writer and I find this really cool. It's not "slot-based", since each function has many formatting options.

extern int _PyLong_FormatWriter(
    _PyUnicodeWriter *writer,
    PyObject *obj,
    int base,
    int alternate);

extern int _PyLong_FormatAdvancedWriter(
    _PyUnicodeWriter *writer,
    PyObject *obj,
    PyObject *format_spec,
    Py_ssize_t start,
    Py_ssize_t end);

extern int _PyFloat_FormatAdvancedWriter(
    _PyUnicodeWriter *writer,
    PyObject *obj,
    PyObject *format_spec,
    Py_ssize_t start,
    Py_ssize_t end);

extern int _PyComplex_FormatAdvancedWriter(
    _PyUnicodeWriter *writer,
    PyObject *obj,
    PyObject *format_spec,
    Py_ssize_t start,
    Py_ssize_t end);

extern int _PyUnicode_FormatAdvancedWriter(
    _PyUnicodeWriter *writer,
    PyObject *obj,
    PyObject *format_spec,
    Py_ssize_t start,
    Py_ssize_t end);

extern Py_ssize_t _PyUnicode_InsertThousandsGrouping(
    _PyUnicodeWriter *writer,
    Py_ssize_t n_buffer,
    PyObject *digits,
    Py_ssize_t d_pos,
    Py_ssize_t n_digits,
    Py_ssize_t min_width,
    const char *grouping,
    PyObject *thousands_sep,
    Py_UCS4 *maxchar);

These functions avoid memory copies. For example, _PyLong_FormatWriter() writes digits directly into the writer buffer, without needing a temporary buffer.
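
The idea of writing digits straight into the destination can be sketched in plain C. This is not the code of _PyLong_FormatWriter(): write_decimal() is a hypothetical helper that works on a caller-provided char buffer instead of a writer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: append the decimal digits of value at buf[*len].
 * Returns 0 on success, -1 if the buffer has not enough room left. */
static int
write_decimal(char *buf, size_t cap, size_t *len, unsigned long value)
{
    assert(*len <= cap);

    /* First pass: count the digits to know how much space is needed. */
    size_t ndigits = 1;
    for (unsigned long v = value; v >= 10; v /= 10) {
        ndigits++;
    }
    if (ndigits > cap - *len) {
        return -1;
    }

    /* Second pass: fill the reserved area backwards, starting with the
     * least significant digit. No temporary buffer is involved. */
    char *p = buf + *len + ndigits;
    do {
        *--p = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);

    *len += ndigits;
    return 0;
}
```

Counting the digits first is what makes the in-place write possible: the final position of each digit is known before anything is stored.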

How long has the internal writer API existed?

12 years: I added it in 2012.

commit 202fdca133ce8f5b0c37cca1353070e0721c688d
Author: Victor Stinner <[email protected]>
Date:   Mon May 7 12:47:02 2012 +0200

    Close #14716: str.format() now uses the new "unicode writer" API instead of the
    PyAccu API. For example, it makes str.format() from 25% to 30% faster on Linux.

I wrote this API to fix the major performance regression after PEP 393 – Flexible String Representation was implemented. After my optimization work, many string operations on Unicode objects became faster than Python 2 operations on bytes! Especially when treating only ASCII characters, which is the most common case. I mostly optimized str.format() and str % args, which are powerful but complex.

In 2016, I wrote an article about the two "writer" APIs that I wrote to optimize: https://vstinner.github.io/pybyteswriter.html

Would these be in the Stable ABI / Limited API from the start? (API-wise these look stable.)

I would prefer to not add it to the limited C API directly, but wait one Python version to see how it goes.

@gvanrossum
Member

(Yes, I meant WriteRepr.) I like these other helpers -- can we just add them all to the public API? Or are there issues with any of them?

@vstinner
Member Author

vstinner commented May 20, 2024

(Yes, I meant WriteRepr.) I like these other helpers -- can we just add them all to the public API? Or are there issues with any of them?

I added the following function which should fit most of these use cases:

PyAPI_FUNC(int) PyUnicodeWriter_FromFormat(
    PyUnicodeWriter *writer,
    const char *format,
    ...);

Example to write repr(obj):

PyUnicodeWriter_FromFormat(writer, "%R", obj);

Example to write str(obj):

PyUnicodeWriter_FromFormat(writer, "%S", obj);

It's the same format as PyUnicode_FromFormat(). Example:

PyUnicodeWriter_FromFormat(writer, "Hello %s, %i.", "Python", 123);

@encukou
Member

encukou commented May 21, 2024

Thank you, this looks very useful!

I see that PyUnicodeWriter_Finish frees the writer. That's great; it allows optimizations we can also use in other writers/builders in the future. (Those should have a consistent API.)
One thing to note is that PyUnicodeWriter_Finish should free the writer even when an error occurs.
Maybe PyUnicodeWriter_Free should be named e.g. PyUnicodeWriter_Discard to emphasize that you should only call it if you didn't Finish.
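
The ownership rule discussed here (Finish always consumes the writer, Discard is only for a writer that was never finished) can be sketched with a toy byte-string builder. ToyWriter and the toy_writer_*() functions are hypothetical and only mirror the shape of the proposed API; in a real header the struct would stay opaque.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Defined here only to keep the sketch in one file. */
typedef struct ToyWriter ToyWriter;
struct ToyWriter {
    char *buf;
    size_t len;
    size_t cap;
};

static ToyWriter *
toy_writer_create(void)
{
    return calloc(1, sizeof(ToyWriter));
}

/* Release a writer that will not be finished. Accepts NULL. */
static void
toy_writer_discard(ToyWriter *w)
{
    if (w == NULL) {
        return;
    }
    free(w->buf);
    free(w);
}

static int
toy_writer_write(ToyWriter *w, const char *s)
{
    assert(w != NULL && s != NULL);
    size_t n = strlen(s);
    if (w->len + n + 1 > w->cap) {
        size_t cap = (w->len + n + 1) * 2;   /* overallocate */
        char *p = realloc(w->buf, cap);
        if (p == NULL) {
            return -1;   /* the writer stays valid; caller discards it */
        }
        w->buf = p;
        w->cap = cap;
    }
    memcpy(w->buf + w->len, s, n);
    w->len += n;
    return 0;
}

/* Consume the writer and return a NUL-terminated string that the caller
 * must free(), or NULL on error. The writer is freed in both cases. */
static char *
toy_writer_finish(ToyWriter *w)
{
    char *result = realloc(w->buf, w->len + 1);   /* trim to exact size */
    if (result != NULL) {
        result[w->len] = '\0';
    }
    else {
        free(w->buf);
    }
    free(w);
    return result;
}
```

Because finish releases the writer even when it fails, the caller never has to decide whether a discard is still needed after calling it.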

The va_arg function is problematic for non-C languages, but it's possible to get the functionality with other functions – especially if we add a number-writing helper, so I'm OK with adding it.

The proposed API is nice and minimal. My bet about what users will ask for next goes to PyUnicodeWriter_WriteUTF8String (for IO) & PyUnicodeWriter_WriteUTF16String (for Windows or Java interop).

Name bikeshedding:

  • PyUnicodeWriter_WriteUCS4Char rather than PyUnicodeWriter_WriteChar -- character is an overloaded term, let's be specific.
  • PyUnicodeWriter_WriteFormat (or WriteFromFormat?) rather than PyUnicodeWriter_FromFormat -- it's writing, not creating a writer.

I see the PR hides underscored API that some existing projects use. I thought we weren't doing that any more.

@vstinner
Member Author

PyUnicodeWriter_WriteUCS4Char rather than PyUnicodeWriter_WriteChar -- character is an overloaded term, let's be specific.

"WriteChar" name comes from PyUnicode_ReadChar() and PyUnicode_WriteChar() names. I don't think that mentioning UCS4 is useful.

PyUnicodeWriter_WriteFormat (or WriteFromFormat?) rather than PyUnicodeWriter_FromFormat -- it's writing, not creating a writer.

I would prefer just "PyUnicodeWriter_Format()". I prefer to not support str.format() which is more a "Python API" than a C API. It's less convenient to use in C. If we don't support str.format(), "PyUnicodeWriter_Format()" is fine for the "PyUnicode_FromFormat()" variant.

@encukou
Member

encukou commented May 21, 2024

Yeah, PyUnicodeWriter_Format sounds good. It avoids the PyX_FromY scheme we use for constructing new objects.

I think that using unqualified Char for a UCS4 codepoint was a mistake we shouldn't continue, but I'm happy to be outvoted on that.

@vstinner
Member Author

The proposed API is nice and minimal. My bet about what users will ask for next goes to PyUnicodeWriter_WriteUTF8String (for IO) & PyUnicodeWriter_WriteUTF16String (for Windows or Java interop).

I propose to add PyUnicodeWriter_WriteString() which decodes from UTF-8 (in strict mode).

PyUnicodeWriter_WriteASCIIString() has an undefined behavior if the string contains non-ASCII characters. Maybe it should be removed in favor of PyUnicodeWriter_WriteString() which is safer (well defined behavior for non-ASCII characters: decode them from UTF-8).
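
One way to avoid undefined behavior in an ASCII-only function is to validate the input first and report an error otherwise. A minimal sketch of such a check follows; is_ascii() is a hypothetical helper, not a function of the proposed API.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if all len bytes of str are in the range 0x00-0x7F, else 0. */
static int
is_ascii(const char *str, size_t len)
{
    assert(str != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        /* Cast first: plain char may be signed on some platforms. */
        if ((unsigned char)str[i] > 0x7F) {
            return 0;
        }
    }
    return 1;
}
```

The check costs one pass over the input, which is the trade-off against a function that simply trusts its caller.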

vstinner added a commit to vstinner/cpython that referenced this issue May 22, 2024
Add unicode_decode_utf8_writer() to write directly characters into a
_PyUnicodeWriter writer. Optimize PyUnicode_FromFormat() by using the
new unicode_decode_utf8_writer().

Rename unicode_fromformat_write_cstr() to
unicode_fromformat_write_utf8().

Microbenchmark on the code:

    return PyUnicode_FromFormat(
        "%s %s %s %s %s.",
        "format", "multiple", "utf8", "short", "strings");

Result: 620 ns +- 8 ns -> 382 ns +- 2 ns: 1.62x faster.
vstinner added a commit to vstinner/cpython that referenced this issue May 22, 2024
Add unicode_decode_utf8_writer() to write directly characters into a
_PyUnicodeWriter writer: avoid the creation of a temporary string.
Optimize PyUnicode_FromFormat() by using the new
unicode_decode_utf8_writer().

Rename unicode_fromformat_write_cstr() to
unicode_fromformat_write_utf8().

Microbenchmark on the code:

    return PyUnicode_FromFormat(
        "%s %s %s %s %s.",
        "format", "multiple", "utf8", "short", "strings");

Result: 620 ns +- 8 ns -> 382 ns +- 2 ns: 1.62x faster.
@serhiy-storchaka
Member

The main problem with the current private _PyUnicodeWriter C API is that it requires allocating the _PyUnicodeWriter value on the stack, but its layout is an implementation detail, and exposing such API would prevent future changes. The proposed new C API allocates the data in dynamic memory, which makes it more portable and future proof. But this can add additional overhead. Also, if we use dynamic memory, why not make PyUnicodeWriter a subclass of PyObject? Then Py_DECREF could be used to destroy it, we could store multiple writers in a collection, and we can even provide Python interface for it.

@vstinner
Member Author

The proposed new C API allocates the data in dynamic memory, which makes it more portable and future proof. But this can add additional overhead.

I ran benchmarks and using the proposed public API remains interesting in terms of performance: see benchmarks below.

Also, if we use dynamic memory, why not make PyUnicodeWriter a subclass of PyObject? Then Py_DECREF could be used to destroy it, we could store multiple writers in a collection, and we can even provide Python interface for it.

Adding a Python API is appealing, but I prefer to restrict this discussion to a C API and only discuss later the idea of exposing it at the Python level.

For the C API, I don't think that Py_DECREF() semantics and inheriting from PyObject are really worth it.

@vstinner
Member Author

I renamed functions:

  • PyUnicodeWriter_WriteString() => PyUnicodeWriter_WriteUTF8(): API with const char *str.
  • PyUnicodeWriter_WriteStr() => PyUnicodeWriter_WriteString(): API with PyObject *str.
  • PyUnicodeWriter_FromFormat() => PyUnicodeWriter_Format().

@vstinner
Member Author

@encukou:

I see the PR hides underscored API that some existing projects use. I thought we weren't doing that any more.

Right, I would like to hide/remove the internal API from the public C API in Python 3.14 while adding the new public C API. The private _PyUnicodeWriter API exposes the _PyUnicodeWriter structure (members). Its API is more complicated and more error-prone.

I prepared a PR for pythoncapi-compat to check that it's possible to implement the new API on Python 3.6-3.13: python/pythoncapi-compat#95

@serhiy-storchaka
Member

There is some confusion with names. The String suffix usually means the C string (const char *) argument. Str is only used in PyObject_Str() which is the C analogue of the str() function.

So, for consistency we should use PyUnicodeWriter_WriteString() for writing the C string. This left us with the question what to do with Python strings. PyUnicodeWriter_WriteStr() implies that str() is called for argument. Even if we add such API, it is also worth having a more restricted function which fails if a non-string is passed by accident.

@vstinner
Member Author

This left us with the question what to do with Python strings.

We can refer to them as "Unicode", such as: PyUnicodeWriter_WriteUnicode(). Even if the Python type is called "str", in C, it's the PyUnicodeObject: https://docs.python.org/dev/c-api/unicode.html

mrahtz pushed a commit to mrahtz/cpython that referenced this issue Jun 30, 2024
…ython#120809)

The public PyUnicodeWriter API enables overallocation by default and
so is more efficient. It also makes the code simpler and shorter.
mrahtz pushed a commit to mrahtz/cpython that referenced this issue Jun 30, 2024
)

Add PyUnicodeWriter_WriteWideChar() and
PyUnicodeWriter_DecodeUTF8Stateful() functions.

Co-authored-by: Serhiy Storchaka <[email protected]>
mrahtz pushed a commit to mrahtz/cpython that referenced this issue Jun 30, 2024
Use PyUnicodeWriter_WriteWideChar() in PyUnicode_FromFormat()
mrahtz pushed a commit to mrahtz/cpython that referenced this issue Jun 30, 2024
noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024
noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024
…120799)

The public PyUnicodeWriter API enables overallocation by default and
so is more efficient.

Benchmark:

python -m pyperf timeit \
    -s 't = list[int, float, complex, str, bytes, bytearray, ' \
                 'memoryview, list, dict]' \
    'str(t)'

Result:

1.49 us +- 0.03 us -> 1.10 us +- 0.02 us: 1.35x faster
noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024
…on#120797)

The public PyUnicodeWriter API enables overallocation by default and
so is more efficient.

Benchmark:

python -m pyperf timeit \
    -s 't = int | float | complex | str | bytes | bytearray' \
       ' | memoryview | list | dict' \
    'str(t)'

Result:

1.29 us +- 0.02 us -> 1.00 us +- 0.02 us: 1.29x faster
noahbkim pushed a commit to hudson-trading/cpython that referenced this issue Jul 11, 2024
Use strchr() and ucs1lib_find_max_char() to optimize the code path
formatting sub-strings between '%' formats.
@vstinner
Member Author

See also #121710 : [C API] Add PyBytesWriter API.

vstinner added a commit to vstinner/cpython that referenced this issue Dec 4, 2024
4 participants