gh-91924: Optimize unicode_check_encoding_errors() #93200
Conversation
Avoid _PyCodec_Lookup() and PyCodec_LookupError() for the most common built-in encodings and error handlers, so as not to create a temporary Unicode string object, since these encodings and error handlers are known to be valid.
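For illustration, here is a minimal standalone sketch of the fast-path idea, not the actual unicodeobject.c code (the set of names checked is illustrative): compare the encoding name against a handful of known built-in names with plain strcmp(), and only fall back to the full codec lookup for everything else.

#include <stdio.h>
#include <string.h>

/* Sketch: recognize a few common built-in encoding names without
   any object allocation or codec registry lookup. */
static int
is_common_builtin_encoding(const char *encoding)
{
    return (strcmp(encoding, "utf-8") == 0
            || strcmp(encoding, "utf8") == 0
            || strcmp(encoding, "ascii") == 0
            || strcmp(encoding, "latin1") == 0);
}

int main(void)
{
    const char *names[] = {"utf-8", "ascii", "cp1252"};
    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        /* Only "cp1252" would take the full _PyCodec_Lookup() path. */
        printf("%s -> %s\n", names[i],
               is_common_builtin_encoding(names[i])
                   ? "fast path" : "full codec lookup");
    }
    return 0;
}

A chain of strcmp() calls on short literal names is cheap compared with building a temporary string object and consulting the codec registry, which is the trade-off this sketch is meant to show.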
It is slower than the corresponding checks in PyUnicode_Decode() and PyUnicode_AsEncodedString(). And it is always called in those functions, so you do double work. And it is called before _Py_normalize_encoding(), so you do triple work.
It seems like there is a misunderstanding here. My change is about the unicode_check_encoding_errors() function, which is always called by PyUnicode_AsEncodedString() when Python is built in debug mode. The purpose of this PR is to make a Python debug build "less slow". Microbenchmark on the following patch (it encodes an empty string, so the measured time is dominated by the encoding and error-handler checks rather than by the encoding work itself):

diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index 3bc776140a..264e419d82 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -5832,6 +5832,33 @@ settrace_to_record(PyObject *self, PyObject *list)
Py_RETURN_NONE;
}
+static PyObject *
+bench_encode(PyObject *self, PyObject *loops_obj)
+{
+ Py_ssize_t loops = PyLong_AsSsize_t(loops_obj);
+ if (loops == -1 && PyErr_Occurred()) {
+ return NULL;
+ }
+
+ PyObject *str = PyUnicode_FromString("");
+ if (str == NULL) {
+ return NULL;
+ }
+
+ _PyTime_t t1 = _PyTime_GetPerfCounter();
+ for (Py_ssize_t i=0; i < loops; i++) {
+ PyObject *obj = PyUnicode_AsEncodedString(str, "utf-8", "strict");
+ Py_DECREF(obj);
+ }
+ _PyTime_t t2 = _PyTime_GetPerfCounter();
+
+ Py_DECREF(str);
+
+ double dt = _PyTime_AsSecondsDouble(t2 - t1);
+ return PyFloat_FromDouble(dt);
+
+}
+
static PyObject *negative_dictoffset(PyObject *, PyObject *);
static PyObject *test_buildvalue_issue38913(PyObject *, PyObject *);
static PyObject *getargs_s_hash_int(PyObject *, PyObject *, PyObject*);
@@ -6122,6 +6149,7 @@ static PyMethodDef TestMethods[] = {
{"get_feature_macros", get_feature_macros, METH_NOARGS, NULL},
{"test_code_api", test_code_api, METH_NOARGS, NULL},
{"settrace_to_record", settrace_to_record, METH_O, NULL},
+ {"bench_encode", bench_encode, METH_O, NULL},
{NULL, NULL} /* sentinel */
};
Script:

import pyperf
import _testcapi

runner = pyperf.Runner()
runner.bench_time_func('bench', _testcapi.bench_encode)

Result:
Extract of the PR:

_PyCodec_Lookup() calls normalizestring() + PyUnicode_InternInPlace() + PyDict_GetItemWithError(). normalizestring() calls PyUnicode_FromString(): it decodes the encoding name from UTF-8 and allocates a memory block on the heap. It's cheap, but it has a significant impact on performance (see my benchmark) when we know in advance that the encoding name is valid.
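For illustration, a standalone sketch assuming only the behavior described above (this is a hypothetical stand-in, not CPython's actual normalizestring() code): the name is copied, lower-cased, into a freshly heap-allocated buffer on every call, which is the per-call cost the fast path skips.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the normalization step described above:
   every call allocates a new buffer on the heap, even for a name
   such as "utf-8" that is already in normalized form. */
static char *
normalize_encoding_name(const char *name)
{
    size_t len = strlen(name);
    char *out = malloc(len + 1);    /* heap allocation on every call */
    if (out == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < len; i++) {
        out[i] = (char)tolower((unsigned char)name[i]);
    }
    out[len] = '\0';
    return out;
}

int main(void)
{
    char *norm = normalize_encoding_name("UTF-8");
    if (norm == NULL) {
        return 1;
    }
    printf("%s\n", norm);   /* prints "utf-8" */
    free(norm);
    return 0;
}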