gh-91924: Optimize unicode_check_encoding_errors() #93200

Merged 1 commit on May 26, 2022

Conversation

vstinner (Member) commented

Skip _PyCodec_Lookup() and PyCodec_LookupError() for the most common
built-in encodings and error handlers. These names are known to be
valid, so there is no need to create a temporary Unicode string object
just to look them up.
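
The idea can be sketched in Python, where codecs.lookup() and codecs.lookup_error() are the Python-level counterparts of the two C lookups. This is only an illustration under stated assumptions: the function name check_encoding_errors and the two sets of names are made up for this sketch, since the conversation does not list which names the PR special-cases.

```python
import codecs

# Illustrative sets only (assumption): the names the PR actually
# special-cases are not listed in this conversation.
KNOWN_ENCODINGS = frozenset({"utf-8", "utf8", "ascii"})
KNOWN_ERROR_HANDLERS = frozenset({"strict", "ignore", "replace"})


def check_encoding_errors(encoding=None, errors=None):
    """Raise LookupError if the encoding or the error handler is unknown."""
    # Fast path: a name already known to be valid needs no registry lookup.
    if encoding is not None and encoding not in KNOWN_ENCODINGS:
        codecs.lookup(encoding)        # slow path: codec registry lookup
    if errors is not None and errors not in KNOWN_ERROR_HANDLERS:
        codecs.lookup_error(errors)    # slow path: error handler lookup


check_encoding_errors("utf-8", "strict")                # fast path only
check_encoding_errors("latin-1", "xmlcharrefreplace")   # falls back to lookups
```

The fast path changes no behavior for valid names; it only skips work whose result is already known.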

vstinner (Member Author) commented

cc @serhiy-storchaka

serhiy-storchaka (Member) left a comment

It is slower than the corresponding checks in PyUnicode_Decode() and PyUnicode_AsEncodedString(). And it is always called in those functions, so you do double work. And it is called before _Py_normalize_encoding(), so you do triple work.

vstinner (Member Author) commented May 25, 2022

> It is slower than the corresponding checks in PyUnicode_Decode() and PyUnicode_AsEncodedString(). And it is always called in those functions, so you do double work. And it is called before _Py_normalize_encoding(), so you do triple work.

It seems like there is a misunderstanding here. My change is about the unicode_check_encoding_errors() function, which is always called by PyUnicode_AsEncodedString() if Python is built in debug mode.

The purpose of this PR is to make a Python debug build "less slow".
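
Since the change targets debug builds, it helps to confirm which kind of interpreter is running before trying to reproduce the numbers. A minimal sketch, assuming the usual CPython conventions; the helper name is_debug_build is made up for this example:

```python
import sys
import sysconfig


def is_debug_build() -> bool:
    """Return True if this interpreter looks like a pydebug build."""
    # sys.gettotalrefcount() only exists when reference debugging is
    # compiled in, which a --with-pydebug build enables.
    if hasattr(sys, "gettotalrefcount"):
        return True
    # Py_DEBUG may be absent from the build configuration (None).
    return bool(sysconfig.get_config_var("Py_DEBUG"))


print("debug build:", is_debug_build())
```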

Microbenchmark using the utf-8 encoding and the strict error handler, added to _testcapi with this patch:

diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index 3bc776140a..264e419d82 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -5832,6 +5832,33 @@ settrace_to_record(PyObject *self, PyObject *list)
     Py_RETURN_NONE;
 }
 
+static PyObject *
+bench_encode(PyObject *self, PyObject *loops_obj)
+{
+    Py_ssize_t loops = PyLong_AsSsize_t(loops_obj);
+    if (loops == -1 && PyErr_Occurred()) {
+        return NULL;
+    }
+
+    PyObject *str = PyUnicode_FromString("");
+    if (str == NULL) {
+        return NULL;
+    }
+
+    _PyTime_t t1 = _PyTime_GetPerfCounter();
+    for (Py_ssize_t i=0; i < loops; i++) {
+        PyObject *obj = PyUnicode_AsEncodedString(str, "utf-8", "strict");
+        Py_DECREF(obj);
+    }
+    _PyTime_t t2 = _PyTime_GetPerfCounter();
+
+    Py_DECREF(str);
+
+    double dt = _PyTime_AsSecondsDouble(t2 - t1);
+    return PyFloat_FromDouble(dt);
+
+}
+
 static PyObject *negative_dictoffset(PyObject *, PyObject *);
 static PyObject *test_buildvalue_issue38913(PyObject *, PyObject *);
 static PyObject *getargs_s_hash_int(PyObject *, PyObject *, PyObject*);
@@ -6122,6 +6149,7 @@ static PyMethodDef TestMethods[] = {
     {"get_feature_macros", get_feature_macros, METH_NOARGS, NULL},
     {"test_code_api", test_code_api, METH_NOARGS, NULL},
     {"settrace_to_record", settrace_to_record, METH_O, NULL},
+    {"bench_encode", bench_encode, METH_O, NULL},
     {NULL, NULL} /* sentinel */
 };
 

Script:

import pyperf
import _testcapi
runner = pyperf.Runner()
runner.bench_time_func('bench', _testcapi.bench_encode)
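
bench_time_func() expects a function that takes the number of loops and returns the total elapsed time in seconds, which is the contract bench_encode() implements in C. A rough pure-Python analogue, for readers without a patched _testcapi, might look like the following. The name bench_encode_py is made up here, str.encode() is assumed to reach PyUnicode_AsEncodedString(), and the Python loop overhead will likely dominate the measurement, so this is not a substitute for the C benchmark.

```python
import time


def bench_encode_py(loops: int) -> float:
    """Encode an empty string `loops` times and return the elapsed seconds."""
    text = ""
    t1 = time.perf_counter()
    for _ in range(loops):
        text.encode("utf-8", "strict")
    t2 = time.perf_counter()
    return t2 - t1


# With pyperf installed, it would be registered like the C helper:
#     runner = pyperf.Runner()
#     runner.bench_time_func('bench_py', bench_encode_py)
print(bench_encode_py(1000))
```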

Result:

  • pydebug, gcc -O0: Mean +- std dev: [gcc_O0_ref] 1.95 us +- 0.03 us -> [gcc_O0_pr] 147 ns +- 4 ns: 13.20x faster
  • pydebug, gcc -Og: Mean +- std dev: [gcc_Og_ref] 651 ns +- 7 ns -> [gcc_Og_pr] 35.6 ns +- 0.8 ns: 18.29x faster
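
The ratios can be cross-checked from the displayed means. The sketch below recomputes them; the small gap from the reported 13.20x on the -O0 line presumably comes from the displayed means being rounded.

```python
def speedup(ref_ns: float, pr_ns: float) -> float:
    """Return how many times faster the PR build is than the reference."""
    return ref_ns / pr_ns


# Displayed means from the results above, converted to nanoseconds.
results = {
    "gcc -O0": (1950.0, 147.0),
    "gcc -Og": (651.0, 35.6),
}

for build, (ref_ns, pr_ns) in results.items():
    saved = ref_ns - pr_ns
    print(f"{build}: {speedup(ref_ns, pr_ns):.2f}x faster, "
          f"{saved:.0f} ns saved per call")
```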

vstinner (Member Author) commented

Extract of the PR:

        // Fast path for the most common built-in encodings. Even if the codec
        // is cached, _PyCodec_Lookup() decodes the bytes string from UTF-8 to
        // create a temporary Unicode string (the key in the cache).

_PyCodec_Lookup() calls normalizestring(), PyUnicode_InternInPlace() and PyDict_GetItemWithError(), in that order.

normalizestring() calls PyUnicode_FromString(), which decodes the encoding name from UTF-8 and allocates a memory block on the heap. It's cheap in absolute terms, but it has a significant impact on performance (see my benchmark), and the work is unnecessary when we know in advance that the encoding name is valid.
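
Why a cache hit still costs something can be shown with a simplified Python model of those three steps. This is not CPython's implementation: the names normalize and cached_lookup, the cache dict, and the normalization rule are all assumptions made for the sketch.

```python
import codecs
import sys

_search_cache = {}  # stand-in for the interpreter's codec search cache


def normalize(name: bytes) -> str:
    # Decoding allocates a new str object on every call, hit or miss.
    return name.decode("utf-8").lower().replace(" ", "-")


def cached_lookup(name: bytes, search):
    """Return the cached result for name, calling search() only on a miss."""
    key = sys.intern(normalize(name))   # the key must be built before the cache is consulted
    try:
        return _search_cache[key]       # hit: the key was still built above
    except KeyError:
        result = _search_cache[key] = search(key)
        return result


first = cached_lookup(b"UTF-8", codecs.lookup)
second = cached_lookup(b"utf-8", codecs.lookup)
print(first is second)  # True: the second call is a cache hit
```

Both calls pay for normalize() and sys.intern() even though the second one never reaches search(). That per-call cost is what a direct string comparison against known names avoids.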

vstinner merged commit 5f8c3fb into python:main on May 26, 2022
vstinner deleted the unicode_check_encoding_errors branch on May 26, 2022 at 22:40