Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization of valid index calculation #498

Open
eendebakpt opened this issue Nov 18, 2022 · 6 comments
Open

Optimization of valid index calculation #498

eendebakpt opened this issue Nov 18, 2022 · 6 comments

Comments

@eendebakpt
Copy link

In listobject.c there is an optimization to check whether an index is valid (e.g. 0 <= index < N) using a single comparison.
The same optimization is not used in other files such as tupleobject.c (except for _collectionsmodule.c)

  • Is there a reason the optimization is used only for the list and not a tuple?
  • Would a PR applying this optimization be acceptable? (either only for the tuple, or for all locations where we can do this)

I created a branch with the idea for the tupleobject.c since that might be the most performance critical:

python/cpython@main...eendebakpt:cpython:list_tuple

I could not measure significant performance improvements on my system, but perhaps there are other systems/compilers where there is a measurable difference.

@brandtbucher
Copy link
Member

If we do this, we should just define a nice, readable macro for it in pymacro.h and stop reinventing the wheel everywhere. We do similar moves, for example, all over bytecodes.c.

@eendebakpt
Copy link
Author

If we do this, we should just define a nice, readable macro for it in pymacro.h and stop reinventing the wheel everywhere. We do similar moves, for example, all over bytecodes.c.

@brandtbucher Do you mean bytesarray.c or bytesobject.c? In those I can find index checks, but not in bytecodes.c.

Putting a macro in pymacro.h seems like a good idea. A real macro, or an inline function then? Such a change would make the code more consistent, but it would also mean many (very small) changes in the codebase.

@gvanrossum
Copy link
Collaborator

I think Brandt is referring to things like

            Py_ssize_t signed_magnitude = Py_SIZE(sub);
            DEOPT_IF(((size_t)signed_magnitude) > 1, BINARY_SUBSCR);

which I found at line 386 in Python/bytecodes.c.

@eendebakpt
Copy link
Author

That is indeed a similar trick! I found 3 occurrences in bytecodes.c, the code then changes from:

DEOPT_IF(((size_t)signed_magnitude) > 1, BINARY_SUBSCR);

to

DEOPT_IF(invalid_index(signed_magnitude, 2), BINARY_SUBSCR);

@JelleZijlstra
Copy link
Contributor

Does this actually make a difference in practice? Can't the compiler already do this optimization?

@eendebakpt
Copy link
Author

eendebakpt commented Nov 25, 2022

Does this actually make a difference in practice? Can't the compiler already do this optimization?

The PR is created is more about consistency than optimization. With optimizations on there is a difference in the generated code (an extra test %rsi,%rsi, see details below), but I do not think it makes a large difference (perhaps only in the tuple/list indexing or the usage in bytecodes.c)

Output of `gdb -batch -ex "disassemble/rs tupleitem" ./python`

Main build with optimizations on:

Dump of assembler code for function tupleitem:
Objects/tupleobject.c:
363	{
   0x0000000000244e50 <+0>:	f3 0f 1e fa	endbr64 

364	    if (i < 0 || i >= Py_SIZE(a)) {
   0x0000000000244e54 <+4>:	48 85 f6	test   %rsi,%rsi
   0x0000000000244e57 <+7>:	78 10	js     0x244e69 <tupleitem+25>

./Include/object.h:
144	    return var_ob->ob_size;
   0x0000000000244e59 <+9>:	48 3b 77 10	cmp    0x10(%rdi),%rsi
   0x0000000000244e5d <+13>:	7d 0a	jge    0x244e69 <tupleitem+25>

Objects/tupleobject.c:
368	    return Py_NewRef(a->ob_item[i]);
   0x0000000000244e5f <+15>:	48 8b 44 f7 18	mov    0x18(%rdi,%rsi,8),%rax

./Include/object.h:
641	    Py_INCREF(obj);
   0x0000000000244e64 <+20>:	48 83 00 01	addq   $0x1,(%rax)

642	    return obj;
   0x0000000000244e68 <+24>:	c3	retq   

Objects/tupleobject.c:
365	        PyErr_SetString(PyExc_IndexError, "tuple index out of range");
   0x0000000000244e69 <+25>:	50	push   %rax
   0x0000000000244e6a <+26>:	48 8b 3d 8f 52 28 00	mov    0x28528f(%rip),%rdi        # 0x4ca100 <PyExc_IndexError>
   0x0000000000244e71 <+33>:	48 8d 35 fa e5 12 00	lea    0x12e5fa(%rip),%rsi        # 0x373472
   0x0000000000244e78 <+40>:	e8 13 17 05 00	callq  0x296590 <PyErr_SetString>

366	        return NULL;
   0x0000000000244e7d <+45>:	31 c0	xor    %eax,%eax
   0x0000000000244e7f <+47>:	5a	pop    %rdx
   0x0000000000244e80 <+48>:	c3	retq   
End of assembler dump.

PR build with optimizations on:

Dump of assembler code for function tupleitem:
Objects/tupleobject.c:
363	{
   0x0000000000245820 <+0>:	f3 0f 1e fa	endbr64 

./Include/pymacro.h:
171	    return (size_t)i >= (size_t)limit;
   0x0000000000245824 <+4>:	48 39 77 10	cmp    %rsi,0x10(%rdi)
   0x0000000000245828 <+8>:	76 0a	jbe    0x245834 <tupleitem+20>

Objects/tupleobject.c:
368	    return Py_NewRef(a->ob_item[i]);
   0x000000000024582a <+10>:	48 8b 44 f7 18	mov    0x18(%rdi,%rsi,8),%rax

./Include/object.h:
646	    Py_INCREF(obj);
   0x000000000024582f <+15>:	48 83 00 01	addq   $0x1,(%rax)

647	    return obj;
   0x0000000000245833 <+19>:	c3	retq   

Objects/tupleobject.c:
365	        PyErr_SetString(PyExc_IndexError, "tuple index out of range");
   0x0000000000245834 <+20>:	50	push   %rax
   0x0000000000245835 <+21>:	48 8b 3d c4 48 28 00	mov    0x2848c4(%rip),%rdi        # 0x4ca100 <PyExc_IndexError>
   0x000000000024583c <+28>:	48 8d 35 2f dc 12 00	lea    0x12dc2f(%rip),%rsi        # 0x373472
   0x0000000000245843 <+35>:	e8 68 17 05 00	callq  0x296fb0 <PyErr_SetString>

366	        return NULL;
   0x0000000000245848 <+40>:	31 c0	xor    %eax,%eax
   0x000000000024584a <+42>:	5a	pop    %rdx
   0x000000000024584b <+43>:	c3	retq   
End of assembler dump.

System: linux, gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants