gh-91924: Optimize unicode_check_encoding_errors() #93200

vstinner · 2022-05-25T02:47:07Z

Avoid _PyCodec_Lookup() and PyCodec_LookupError() for the most common
built-in encodings and error handlers, which are known to be valid.
This avoids creating a temporary Unicode string object.

vstinner · 2022-05-25T02:47:15Z

cc @serhiy-storchaka

serhiy-storchaka

It is slower than the corresponding checks in PyUnicode_Decode() and PyUnicode_AsEncodedString(). And it is always called in those functions, so you do double work. And it is called before _Py_normalize_encoding(), so you do triple work.

vstinner · 2022-05-25T10:30:54Z

> It is slower than the corresponding checks in PyUnicode_Decode() and PyUnicode_AsEncodedString(). And it is always called in those functions, so you do double work. And it is called before _Py_normalize_encoding(), so you do triple work.

It seems like there is a misunderstanding here. My change is about the unicode_check_encoding_errors() function, which PyUnicode_AsEncodedString() always calls when Python is built in debug mode.

The purpose of this PR is to make a Python debug build "less slow".
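
Since the change only matters on debug builds, it can help to confirm which kind of interpreter a benchmark is running on. A best-effort sketch: `sys.gettotalrefcount()` is documented as available only in debug builds, so its presence is a reasonable signal. The helper name `is_debug_build` is made up for this example.

```python
import sys


def is_debug_build() -> bool:
    """Best-effort check for a CPython debug build."""
    # sys.gettotalrefcount() is only provided by debug builds of CPython,
    # so its presence is used here as a proxy for a debug build.
    return hasattr(sys, "gettotalrefcount")


if __name__ == "__main__":
    print("debug build:", is_debug_build())
```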

Microbenchmark on utf-8 encoding and strict error handler:

diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index 3bc776140a..264e419d82 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -5832,6 +5832,33 @@ settrace_to_record(PyObject *self, PyObject *list)
     Py_RETURN_NONE;
 }
 
+static PyObject *
+bench_encode(PyObject *self, PyObject *loops_obj)
+{
+    Py_ssize_t loops = PyLong_AsSsize_t(loops_obj);
+    if (loops == -1 && PyErr_Occurred()) {
+        return NULL;
+    }
+
+    PyObject *str = PyUnicode_FromString("");
+    if (str == NULL) {
+        return NULL;
+    }
+
+    _PyTime_t t1 = _PyTime_GetPerfCounter();
+    for (Py_ssize_t i=0; i < loops; i++) {
+        PyObject *obj = PyUnicode_AsEncodedString(str, "utf-8", "strict");
+        Py_DECREF(obj);
+    }
+    _PyTime_t t2 = _PyTime_GetPerfCounter();
+
+    Py_DECREF(str);
+
+    double dt = _PyTime_AsSecondsDouble(t2 - t1);
+    return PyFloat_FromDouble(dt);
+
+}
+
 static PyObject *negative_dictoffset(PyObject *, PyObject *);
 static PyObject *test_buildvalue_issue38913(PyObject *, PyObject *);
 static PyObject *getargs_s_hash_int(PyObject *, PyObject *, PyObject*);
@@ -6122,6 +6149,7 @@ static PyMethodDef TestMethods[] = {
     {"get_feature_macros", get_feature_macros, METH_NOARGS, NULL},
     {"test_code_api", test_code_api, METH_NOARGS, NULL},
     {"settrace_to_record", settrace_to_record, METH_O, NULL},
+    {"bench_encode", bench_encode, METH_O, NULL},
     {NULL, NULL} /* sentinel */
 };

Script:

import pyperf
import _testcapi
runner = pyperf.Runner()
runner.bench_time_func('bench', _testcapi.bench_encode)
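
For readers unfamiliar with `bench_time_func()`: pyperf calls the given function with a loop count and expects the elapsed time in seconds as the return value, which is what `bench_encode()` does in C. A rough pure-Python analogue, under the hypothetical name `bench_encode_py`, might look like the sketch below. It only illustrates the calling contract: the interpreter's loop overhead would dominate, so it will not reproduce the C-level numbers.

```python
import time


def bench_encode_py(loops: int) -> float:
    """Encode an empty string `loops` times and return the elapsed seconds."""
    text = ""
    t1 = time.perf_counter()
    for _ in range(loops):
        # Same encoding and error handler as the C microbenchmark.
        text.encode("utf-8", "strict")
    t2 = time.perf_counter()
    return t2 - t1
```

It could be passed to the runner the same way, for example `runner.bench_time_func('bench_py', bench_encode_py)`.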

Result:

pydebug, gcc -O0: Mean +- std dev: [gcc_O0_ref] 1.95 us +- 0.03 us -> [gcc_O0_pr] 147 ns +- 4 ns: 13.20x faster
pydebug, gcc -Og: Mean +- std dev: [gcc_Og_ref] 651 ns +- 7 ns -> [gcc_Og_pr] 35.6 ns +- 0.8 ns: 18.29x faster
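
As a sanity check, the speedups can be recomputed from the displayed means. This sketch uses the rounded values printed above, so a small difference from the reported ratio is expected (presumably pyperf works from unrounded means). The function name `speedup` is only illustrative.

```python
def speedup(ref_ns: float, new_ns: float) -> float:
    """Return how many times faster the new timing is than the reference."""
    return ref_ns / new_ns


# gcc -O0: 1.95 us = 1950 ns before, 147 ns after.
o0 = speedup(1950.0, 147.0)   # about 13.27, reported as 13.20x
# gcc -Og: 651 ns before, 35.6 ns after.
og = speedup(651.0, 35.6)     # about 18.29, reported as 18.29x

print(f"-O0: {o0:.2f}x faster, -Og: {og:.2f}x faster")
```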

vstinner · 2022-05-25T10:32:47Z

Extract of the PR:

        // Fast path for the most common built-in encodings. Even if the codec
        // is cached, _PyCodec_Lookup() decodes the bytes string from UTF-8 to
        // create a temporary Unicode string (the key in the cache).

_PyCodec_Lookup() calls normalizestring() + PyUnicode_InternInPlace() + PyDict_GetItemWithError().

normalizestring() calls PyUnicode_FromString(): it decodes the encoding name from UTF-8 and allocates a memory block on the heap. Each call is cheap, but the cost adds up and has a significant impact on performance (see my benchmark), and it is unnecessary when we know in advance that the encoding name is valid.
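
The idea of the fast path can be modelled in Python: compare the name against a short list of known-valid names first, and only fall back to the registry lookup when it is not in the list. This is a sketch of the technique, not the C code from the PR. The contents of `FAST_ENCODINGS` and `FAST_ERRORS` are assumptions chosen for illustration, and `check_encoding_errors` is a hypothetical name.

```python
import codecs

# Assumed fast-path names; the exact names special-cased by the PR may differ.
FAST_ENCODINGS = frozenset({"utf-8", "utf8", "ascii"})
FAST_ERRORS = frozenset({"strict", "ignore", "replace"})


def check_encoding_errors(encoding=None, errors=None) -> int:
    """Validate the names and return how many registry lookups were needed."""
    lookups = 0
    if encoding is not None and encoding not in FAST_ENCODINGS:
        # Slow path: goes through the codec registry.
        # Raises LookupError for an unknown encoding.
        codecs.lookup(encoding)
        lookups += 1
    if errors is not None and errors not in FAST_ERRORS:
        # Slow path: raises LookupError for an unknown error handler.
        codecs.lookup_error(errors)
        lookups += 1
    return lookups
```

With these assumed lists, `check_encoding_errors("utf-8", "strict")` performs no lookup at all, while a less common name such as `"latin-1"` still goes through the registry and is validated as before.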

gh-91924: Optimize unicode_check_encoding_errors()

622d301

vstinner added the skip news label May 25, 2022

bedevere-bot added the awaiting core review label May 25, 2022

serhiy-storchaka reviewed May 25, 2022

vstinner merged commit 5f8c3fb into python:main May 26, 2022

bedevere-bot removed the awaiting core review label May 26, 2022

vstinner deleted the unicode_check_encoding_errors branch May 26, 2022 22:40
