(I'll include the problem here, as the "problems" repo seems "done" now that PEP-733 is up).
Traditional C APIs take zero-terminated strings, which means that Python strings with embedded NUL bytes appear truncated. There are many ways to get such a char*: converted directly using PyUnicode_AsUTF8, encoded and accessed via PyBytes_AsString, or accessed with something like PyUnicode_AsUTF8AndSize while ignoring the size.
Many APIs that convert to char* raise an error on embedded NUL bytes. On that:
This is safe, but it needs an extra O(n) search, which is not necessary for all tasks.
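To make the trade-off concrete, here is a minimal sketch in plain C, with no Python headers. The helper names `c_visible_length` and `has_embedded_nul` are hypothetical; they stand in for whatever check a converting API performs. The buffer and its size are assumed to come from something like PyUnicode_AsUTF8AndSize.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes a zero-terminated C API would see
   in a buffer of `size` bytes. Scans at most `size` bytes. */
static size_t c_visible_length(const char *data, size_t size)
{
    const char *nul = memchr(data, '\0', size);
    return nul != NULL ? (size_t)(nul - data) : size;
}

/* Hypothetical helper: the extra O(n) search. Returns 1 if a NUL byte
   occurs inside the first `size` bytes, 0 otherwise. */
static int has_embedded_nul(const char *data, size_t size)
{
    return memchr(data, '\0', size) != NULL;
}
```

For the 19 bytes of `utf8\0spamspamspaaam`, `c_visible_length` reports 4 and `has_embedded_nul` reports 1. A caller that already works with the size never needs the second scan.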
Soft-deprecate PyUnicode_AsUTF8, nudging people toward PyUnicode_AsUTF8AndSize. (It's still possible to ignore the size, but it's much less likely to do so on purpose -- unless we encourage people to mechanically replace PyUnicode_AsUTF8(s) with PyUnicode_AsUTF8AndSize(s, NULL)).
In CPython, use the "pointer+size" representation more --- only use "pointer only" for working with external APIs or for backwards compatibility. This might help find APIs we might want to expose.
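As a rough illustration of the "pointer+size" idea, the sketch below compares a whole buffer against a known name instead of stopping at the first NUL. The `str_view` type and `str_view_equals` function are hypothetical names invented for this example, not existing CPython APIs.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical "pointer+size" string view. */
typedef struct {
    const char *data;
    size_t size;
} str_view;

/* Compare the whole view, embedded NUL bytes included, against a
   zero-terminated literal. Returns 1 on an exact match, 0 otherwise. */
static int str_view_equals(str_view v, const char *literal)
{
    size_t n = strlen(literal);
    return v.size == n && memcmp(v.data, literal, n) == 0;
}
```

With this representation, a view over `utf8\0spam` (9 bytes) does not compare equal to `utf8`, whereas a "pointer only" comparison with `strcmp` treats the two as the same string.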
In APIs that look up names and take aliases (codec names, hash algorithm names, timezone names, etc.), the embedded NUL is not a security issue. For example, I don't see a problem with UTF-8, utf8 and utf8\0spamspamspaaam all naming the same encoding. (The fact that some APIs will reject the latter string, and others will not, is unfortunate but not terrible.)
In error/warning messages, we might want to filter out newlines, backspaces, terminal escape sequences and the like. If we're not doing that, there's not much additional harm in allowing an “end of message” control character. (FWIW, PyObject_Repr is very useful for arbitrary strings, though we shouldn't call it “safe” as it still passes Unicode lookalikes or BIDI characters through.)
The new API is not always appropriate as a replacement for PyUnicode_AsUTF8, and when it is appropriate it only replaces 2 lines (plus declarations/error handling).
It does nothing to nudge people away from PyUnicode_AsUTF8
IMO, it encourages going in the wrong direction -- "pointer only" representation rather than "pointer+size"
On making PyUnicode_AsUTF8 itself raise an error on embedded NUL bytes, see the reverted change "[C API] Change PyUnicode_AsUTF8() to return NULL on embedded null characters" (python/cpython#111089).
The suggestions to soft-deprecate PyUnicode_AsUTF8 and to use the "pointer+size" representation more are things we could do. The remarks on name lookups and on error/warning messages are notes on some of the issues @vstinner collected in python/cpython#111656 (comment).