Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-121710: Add PyBytesWriter API #121726

Closed
wants to merge 5 commits into from
Closed

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Jul 13, 2024

Doc/c-api/bytes.rst Show resolved Hide resolved
Modules/_testcapi/bytes.c Show resolved Hide resolved
Objects/bytesobject.c Show resolved Hide resolved
Objects/bytesobject.c Show resolved Hide resolved
Objects/bytesobject.c Show resolved Hide resolved
Objects/bytesobject.c Show resolved Hide resolved
Objects/bytesobject.c Outdated Show resolved Hide resolved
Objects/bytesobject.c Show resolved Hide resolved
Objects/bytesobject.c Show resolved Hide resolved
@vstinner
Copy link
Member Author

@picnixz: Such coding style review is not helping. It's the first version of the PR, we didn't discuss the API yet. I would prefer comments about the API rather than 7 comments about spaces and PEP 7. It's too early for that.

@picnixz
Copy link
Contributor

picnixz commented Jul 14, 2024

Oh ok my bad (I'll try to reduce those kind of reviews in the future). Then, I have a suggestion about maybe having the possibility to expose something for byte substrings (similar to PyUnicodeWriter_WriteSubstring) as well as format strings like PyUnicodeWriter_Format and possibly 1-char bytes. While all those functions are equivalent PyBytesWriter_WriteBytes calls, if I were to ask for a specific one, it would be PyUnicodeWriter_Format which could be very useful for a public API (I don't think the V version is needed though since it's not exposed in the Unicode API).

So, ideally:

  • PyBytesWriter_Format(PyBytesWriter *, char **, const char *, ...)
  • PyBytesWriter_WriteBytes(PyBytesWriter *, char **, PyObject *) which accepts PyObject * arguments
  • PyBytesWriter_WriteBuffer(PyBytesWriter *, char **, const void *, Py_ssize_t) which accepts const void * arguments (the current WriteBytes)
  • PyBytesWriter_WriteChar(PyBytesWriter *, char **, char) -- efficient specialization for size = 1 (maybe not a char, but perhaps Py_UCS*`)
  • PyBytesWriter_WriteSubBytes(PyBytesWriter *, char **, PyObject *, Py_ssize_t, Py_ssize_t) (or any other name) for writing a sub-range of bytes object.

(Again, sorry for my nitpicky review).

@vstinner
Copy link
Member Author

@picnixz: I would prefer to make the API as small as possible. Where do these use cases come from? Did you see usage in the current C code of CPython with the _PyBytesWriter API?

In your list, the most appealing is PyBytesWriter_Format() :-)

PyBytesWriter_Format(PyBytesWriter *, char **, const char *, ...)

While it's tempting to add as many features as PyUnicodeWriter API, I'm not convinced that this function is needed right now.

PyBytesWriter_WriteBytes(PyBytesWriter *, char **, PyObject *) which accepts PyObject * arguments

You can already call PyBytesWriter_WriteBytes(writer, &str, PyBytes_AsString(bytes), PyBytes_Size(bytes)). I never had to "write" a Python bytes object when using the _PyBytesWriter API. I'm not convinced that it's a common use case.

PyBytesWriter_WriteSubBytes(PyBytesWriter *, char **, PyObject *, Py_ssize_t, Py_ssize_t) (or any other name) for writing a sub-range of bytes object.

You can already call PyBytesWriter_WriteBytes(writer, &str, PyBytes_AsString(bytes) + start, PyBytes_Size(bytes) - skip) to write a substring. Again, I'm not convinced that it's common to write Python bytes objects.

PyBytesWriter_WriteChar(PyBytesWriter *, char *, char) -- efficient specialization for size = 1 (maybe not a char, but perhaps Py_UCS`)

PyBytesWriter is not PyUnicodeWriter. The intended usage is to allocate enough bytes and then just use str pointer. Example:

    ascii_data = PyBytesWriter_Prepare(&writer, 10);
    *ascii_data++ = ' ';
    memcpy(ascii_data, "abc", 3); ascii_data += 3;
    *ascii_data++ = '-';
    ...

WriteChar() would be way slower.

@picnixz
Copy link
Contributor

picnixz commented Jul 14, 2024

Errr.. my idea was to have a parity with the PyUnicodeWriter API, but if you want to make it as small as possible, I'm fine with it (I never had to use the private API to write a bytes string either and I didn't check the code to see occurrences of it actually). I think the PyBytesWriter_Format(PyBytesWriter *, char **, const char *, ...) would be then the only one I would advocate for.

@vstinner
Copy link
Member Author

Errr.. my idea was to have a parity with the PyUnicodeWriter API

Look at the current usage of _PyBytesWriter. You will see that the code is very different. The hot code doesn't use the writer but just "str" pointer. The writer is just there to manage memory.

PyUnicodeWriter is used for other use cases and so it deserves a different API.

Don't call _PyBytesWriter_Dealloc() on error.
The caller must check str. Any additional check has a negative impact
on performance.
@vstinner
Copy link
Member Author

PR rebased on main to fix a merge conflict.

@vstinner
Copy link
Member Author

vstinner commented Aug 5, 2024

I created a C API Working Group issue: capi-workgroup/decisions#39

@vstinner
Copy link
Member Author

vstinner commented Aug 7, 2024

Microbenchmark comparing bytearray to PyBytesWriter API:

Result:

$ env/bin/python bench.py
bytearray: Mean +- std dev: 93.0 ns +- 1.4 ns
byteswriter: Mean +- std dev: 23.4 ns +- 0.9 ns

PyBytesWriter is 4x faster (93.0 => 23.4 ns) than using bytearray.

@encukou
Copy link
Member

encukou commented Aug 8, 2024

It seems that all the needed information could fit inside the memory of a PyBytesObject:

  • buffer -> _Py_CAST(PyBytesObject*, writer)->ob_sval
  • allocated -> Py_SIZE(writer)
  • min_size -> unused: the user can be track this separately
  • overallocate - unused (always 1 in this PR)
  • use_small_buffer, small_buffer - unused: small buffer is just an initial overallocation
  • use_bytearray - unused
  • the “current position” is tracked separately in str.

With that we could avoid allocating the big struct. With a very quick and dirty proof of concept, I get reasonable results, which suggest this could be useful for alternative implementations or future CPython versions:

This PR:
byteswriter: Mean +- std dev: 15.4 ns +- 0.6 ns
My proof of concept:
byteswriter: Mean +- std dev: 17.3 ns +- 1.0 ns

For the API, it would mean that PyBytesWriter_Prepare and PyBytesWriter_WriteBytes can reallocate the writer itself, and would take a PyBytesWriter **. That shouldn't hurt with the _PyBytesWriter implementation.

@vstinner
Copy link
Member Author

vstinner commented Aug 9, 2024

min_size -> unused: the user can be track this separately

It's used if you call Prepare() multiple times.

use_small_buffer, small_buffer - unused: small buffer is just an initial overallocation

It's used. It does defer the Python bytes object until Finish() which can avoid having to call the inefficient _PyBytes_Resize() function.

With that we could avoid allocating the big struct.

Why do you want to avoid that? I would like to reuse the existing private _PyBytesWriter code which is well tested. According to benchmarks, allocating PyBytesWriter on the heap memory has no significant cost.

@encukou
Copy link
Member

encukou commented Aug 21, 2024

Why do you want to avoid that?

It would be nice to keep the possibility open for other implementations, if possible.

I would like to reuse the existing private _PyBytesWriter code which is well tested.

But, this is not exposing all of the private interface, and I'm not sure if the existing tests apply to what's exposed here. In particular, the public API doesn't have:

  • disabling overallocation
  • a writable min_size field
  • bytearray support

An alternate implementation would not need the machinery for those features, and could do without min_size and a separate small_buffer.

@vstinner
Copy link
Member Author

vstinner commented Oct 7, 2024

I decided to reject this API for now. It's too low-level and too error prone: capi-workgroup/decisions#39 (comment)

@vstinner vstinner closed this Oct 7, 2024
@vstinner vstinner deleted the bytes_writer branch October 7, 2024 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants