Split micro-ops that have different behavior depending on low bit of oparg. #115457

Closed
markshannon opened this issue Feb 14, 2024 · 1 comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@markshannon
Member

markshannon commented Feb 14, 2024

Splitting these micro-ops will improve performance by reducing the number of branches, the size of code generated, and the number of holes in the JIT stencils. There is no real downside; the increase in complexity at runtime is negligible and there isn't much increased complexity in the tooling.

Take _LOAD_ATTR_INSTANCE_VALUE as an example, since it is the most common of these micro-ops dynamically.

    op(_LOAD_ATTR_INSTANCE_VALUE, (index/1, owner -- attr, null if (oparg & 1))) {
        ...

can be split into

    op(_LOAD_ATTR_INSTANCE_VALUE_0, (index/1, owner -- attr)) {
        assert((oparg & 1) == 0);
        ...

and

    op(_LOAD_ATTR_INSTANCE_VALUE_1, (index/1, owner -- attr, null)) {
        assert((oparg & 1) == 1);
        ...

Each of these is simpler, thus smaller and faster than the base version.
We can always choose one of the two split versions when projecting the trace, so we don't need an implementation of the base version at all. This means that the tier 2 interpreter and stencils aren't much bigger than before.
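
The choice made during trace projection can be sketched roughly as follows. This is a minimal Python model, not CPython's actual optimizer code; the helper name `select_split_uop` and the use of plain strings for uop names are assumptions made for illustration.

```python
def select_split_uop(base_name: str, oparg: int) -> str:
    """Pick the split variant of a uop from the low bit of oparg.

    Hypothetical helper: it models the decision taken while projecting
    a trace, so the unsplit base uop never needs its own implementation.
    """
    return f"{base_name}_{oparg & 1}"


# Low bit clear: the variant that pushes only `attr`.
print(select_split_uop("_LOAD_ATTR_INSTANCE_VALUE", 4))
# Low bit set: the variant that also pushes `null`.
print(select_split_uop("_LOAD_ATTR_INSTANCE_VALUE", 5))
```

Because the low bit is known when the trace is projected, each variant can assert its expected value instead of branching on it at runtime.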


@markshannon markshannon added the performance Performance or resource usage label Feb 14, 2024
@erlend-aasland erlend-aasland added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Feb 14, 2024
@markshannon markshannon self-assigned this Feb 15, 2024
@markshannon
Member Author

It makes sense to do replication at the same time as splitting.

By replication, I mean creating a copy of the replicated uop for each oparg in a given set.
This is not such an obvious win, as we need multiple stencils, but each stencil can be significantly smaller than the original.
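
As a rough illustration of replication, the sketch below picks a per-oparg copy when one exists. It is a Python model under stated assumptions, not the real tooling: the helper name `select_replicated_uop` and the example set of replicated opargs are hypothetical, and it assumes that opargs outside the set fall back to the generic uop.

```python
def select_replicated_uop(base_name: str, oparg: int,
                          replicated: frozenset) -> str:
    """Return the oparg-specific copy of a uop if one was generated.

    Hypothetical helper: `replicated` is the set of opargs for which a
    dedicated copy (and therefore a dedicated stencil) exists.
    """
    if oparg in replicated:
        return f"{base_name}_{oparg}"
    # Assumption: anything outside the set keeps using the generic uop.
    return base_name


# Assumed example set; the real set is chosen per uop.
SMALL_OPARGS = frozenset(range(4))

print(select_replicated_uop("_LOAD_FAST", 0, SMALL_OPARGS))
print(select_replicated_uop("_LOAD_FAST", 9, SMALL_OPARGS))
```

Each copy sees its oparg as a constant, which is why its stencil can be smaller than the generic one, at the cost of having more stencils overall.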

The best example is _INIT_CALL_PY_EXACT_ARGS, which has a stencil of 753 bytes, whereas the stencil for _INIT_CALL_PY_EXACT_ARGS_0 is only 308 bytes, a saving of roughly 60%.

_LOAD_FAST shows a smaller saving, from 45 bytes down to 31 bytes for _LOAD_FAST_0.
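
For reference, the relative savings implied by the stencil sizes quoted above can be checked with a few lines of Python. The helper name `saving` is an assumption; the byte counts are the ones given in this comment.

```python
def saving(original: int, specialized: int) -> float:
    """Fraction of stencil bytes saved by the specialized copy."""
    return 1 - specialized / original


# 753 bytes down to 308 bytes.
print(f"{saving(753, 308):.1%}")  # 59.1%
# 45 bytes down to 31 bytes.
print(f"{saving(45, 31):.1%}")    # 31.1%
```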
