-
-
Notifications
You must be signed in to change notification settings - Fork 30.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-125196: Use PyUnicodeWriter in error handlers #125262
Conversation
PyCodec_ReplaceErrors() and PyCodec_BackslashReplaceErrors() now use the public PyUnicodeWriter API.
@@ -700,27 +700,38 @@ PyObject *PyCodec_IgnoreErrors(PyObject *exc) | |||
} | |||
|
|||
|
|||
static inline int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be better to move this inline function into the header file? That way it can omit the declaration in the header file and integrate well. It looks like a public API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, I implemented #125201 which should replace this static inline function.
Benchmark: import codecs
import pyperf
runner = pyperf.Runner()
replace = codecs.lookup_error('replace')
backslashreplace = codecs.lookup_error('backslashreplace')
for LEN in (1, 1000):
text = "\xff" * LEN
exc = UnicodeEncodeError("ascii", text, 0, len(text), "reason")
runner.bench_func(f"replace encode len={LEN}", replace, exc)
text = "\xff" * LEN
exc = UnicodeTranslateError(text, 0, len(text), "reason")
runner.bench_func(f"replace translate len={LEN}", replace, exc)
data = b"\x20\x80\xff" * LEN
exc = UnicodeDecodeError("ascii", data, 0, len(data), "reason")
runner.bench_func(f"backslashreplace decode len={LEN}", backslashreplace, exc)
text = "\x20\xff\u20ac\U0010ffff" * LEN
exc = UnicodeEncodeError("ascii", text, 0, len(text), "reason")
runner.bench_func(f"backslashreplace encode len={LEN}", backslashreplace, exc)
text = "\x20\xff\u20ac\U0010ffff" * LEN
exc = UnicodeTranslateError(text, 0, len(text), "reason")
runner.bench_func(f"backslashreplace translate len={LEN}", backslashreplace, exc) Results, Python built with
It's way slower :-( This PR has a naive PyUnicodeWriter_Fill() implementation calling PyUnicodeWriter_WriteChar() in a loop. I wrote PR gh-125201 which is way more efficient:
|
For the benchmark, I had to call directly the error handler functions. Using ASCII doesn't work (it doesn't measure my change), since |
Let's keep the private API for now, since it's faster. |
PyCodec_ReplaceErrors() and PyCodec_BackslashReplaceErrors() now use the public PyUnicodeWriter API.