
Tier 2 optimizer. #557

Open
markshannon opened this issue Mar 9, 2023 · 15 comments
Labels
epic-tier2-optimizer Linear code region optimizer for 3.13 and beyond.

Comments

@markshannon
Member

This is the top level issue for the tier 2 optimizer for CPython 3.13 and beyond.

Try to keep discussion here at a high level and discuss the details on the sub-issues.

The tier 2 optimizer has always promised to optimize larger regions than the tier 1 (PEP 659) optimizer. But we have been a bit vague as to what those regions would be.

In an earlier discussion, I referred to them as "projected short traces".
The term "trace" is a bit misleading, as it suggests some sort of recording of program execution.

The optimization I propose is more akin to basic block versioning (BBV) than the trace recording of PyPy.
However, instead of basic blocks, we would be optimizing dynamic superblocks.
The extent of the superblocks would be determined at runtime from profiling data gathered by the tier 1 interpreter.

The term "superblocks" might also be a bit misleading, as they might include inlined calls, but it's the best name I could come up with for now. We could call the tier 2 optimization "superblock versioning", as we intend to handle polymorphism in much the same way as BBV.

For this to work, we need to be able to do the following:

  • Modify the tier 1 interpreter to detect "hotspots" to determine where to start the dynamic superblocks.
  • Create the superblocks
  • Optimize the superblocks (it might make sense to merge some or all of this with the creation of the superblocks)
  • Deoptimize the superblocks (to enable more speculation in the optimizer)
  • Manage the superblocks, discarding cold ones and preventing memory use from growing too large.
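The first step above, hotspot detection, can be sketched in miniature. Here is a toy counter-based detector; the threshold, class, and method names are all invented for illustration and are not CPython's actual mechanism:

```python
# Hypothetical sketch of counter-based hotspot detection, loosely modelled on
# how a tier 1 interpreter might decide where to start a dynamic superblock.
# HOT_THRESHOLD and all names here are invented for illustration.

HOT_THRESHOLD = 16  # executions of a backward jump before projecting a superblock

class HotspotCounters:
    def __init__(self):
        self.counts = {}  # instruction offset -> execution count

    def record_backedge(self, offset):
        """Called on each backward jump. Returns True when the jump target
        is hot enough to be the entry of a new superblock."""
        n = self.counts.get(offset, 0) + 1
        self.counts[offset] = n
        if n >= HOT_THRESHOLD:
            self.counts[offset] = 0  # reset so we don't re-project immediately
            return True
        return False

counters = HotspotCounters()
# Simulate 40 executions of the same backward jump; projection fires
# every HOT_THRESHOLD-th execution.
fired = [i for i in range(40) if counters.record_backedge(100)]
print(fired)  # [15, 31]
```

A real implementation would also apply exponential backoff after a failed projection, but a plain counter is enough to show the shape of the idea.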
@markshannon markshannon added the epic-tier2-optimizer Linear code region optimizer for 3.13 and beyond. label Mar 9, 2023
@Fidget-Spinner
Collaborator

What's the general timeline you're looking for implementing all these?

@markshannon
Member Author

> What's the general timeline you're looking for implementing all these?

3.13, except the compiler, which will probably be 3.14.

@Fidget-Spinner
Collaborator

Fidget-Spinner commented Mar 10, 2023

Sorry I'm confused: which compiler are you referring to? The one used for the optimization passes on the tier 2 code?

@markshannon
Member Author

What is a "superblock"?

A superblock is a linear piece of code with one entry and multiple exits. It differs from an extended (basic) block in that it may duplicate some code.

Take this CFG:

[diagram: basic_blocks.gv]

With extended blocks the CFG would look like this:
[diagram: extended_blocks.gv]

Whereas with superblocks, we get this:
[diagram: super_blocks.gv]

Why use superblocks?

When considering regions for optimization, we want regions as large as possible, without excessive duplication.
But we also want regions that we can optimize quickly and effectively.
Linear sequences of code can be optimized quickly, but basic blocks are too small to be optimized effectively.
So we choose superblocks.

In the example above, the duplication of D allows it to be specialized for the context.
Obviously we don't want excessive duplication of code, but used judiciously it can help optimization considerably.
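The projection idea above can be shown with a toy example: starting from a hot entry block, follow the most likely successor at each branch, duplicating blocks as needed. The CFG and edge weights below are invented; real projection would use profiling data from the tier 1 interpreter:

```python
# Toy sketch of superblock projection over a small CFG.
# Each block maps to its successors with observed taken-counts
# (all numbers invented for illustration).
cfg = {
    "A": [("B", 90), ("C", 10)],
    "B": [("D", 90)],
    "C": [("D", 10)],
    "D": [],
}

def project_superblock(entry, cfg, max_len=8):
    """Build a linear region by always following the hottest successor.
    The cold path (here C) would get its own duplicated copy of D
    in a later, separate projection."""
    trace = [entry]
    block = entry
    while cfg[block] and len(trace) < max_len:
        # Pick the successor with the highest observed count.
        block = max(cfg[block], key=lambda edge: edge[1])[0]
        trace.append(block)
    return trace

print(project_superblock("A", cfg))  # ['A', 'B', 'D']
```

The copy of D reached via the hot path A -> B can then be specialized for that context, which is exactly the duplication the diagram shows.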

@markshannon
Member Author

> Sorry I'm confused: which compiler are you referring to? The one used for the optimization passes on the tier 2 code?

The compiler to machine code. Nothing else in this tier is referred to as a "compiler", to avoid any more confusion.
It is already confusing enough that we have the source -> bytecode compiler, the C compiler and, in the future, a bytecode -> machine code compiler.

[Accidentally edited your comment earlier]

@Fidget-Spinner
Collaborator

Fidget-Spinner commented Mar 10, 2023

> Whereas with superblocks, we get this: [diagram: super_blocks.gv]

The basic blocks in the lazy basic block versioning paper generate this, not the former (in fact, it's exactly the same as the diagram you drew). The paper's name is a little misleading: the only part of it that is about "basic blocks" is that type propagation occurs at basic block boundaries; the code actually generated is superblocks as well.

In fact, this is how it would look:
[diagram]

@markshannon
Member Author

markshannon commented Mar 10, 2023

I don't see how they have the type information available to compile the specialized D' when compiling A.
They want the final code to look like the above, but it is built a basic block at a time, isn't it?
They cannot know whether B or C is the most likely successor of A, just which happens to execute first immediately after A is compiled.

Compiling one block at a time allows propagation of type information, but can prevent effective partial evaluation.

@Fidget-Spinner
Collaborator

Fidget-Spinner commented Mar 10, 2023

> They cannot know whether B or C is the most likely successor of A, just which happens to execute first immediately after A is compiled.

Yes, their argument is that the code will naturally be generated like that, since it is generated according to the path taken at runtime.

> They want the final code to look like the above, but it is built a basic block at a time, isn't it?

Yes. The code object is like a stack, so the next generated block is written into the empty space following the previous block. That illustration shows the state after both branches have been generated.

> Compiling one block at a time allows propagation of type information, but can prevent effective partial evaluation.

If we're emitting tier 2 bytecode, the final machine code compiler can take the entire chunk generated by basic block versioning and do the same large-scale partial evaluation and other optimisations that this proposal suggests, right?

So like

Tier 1 --(n times)--> Tier 2 bytecode --(x times)--> Machine code

By the time the machine code is generated, assuming the tier 2 bytecode is executed many times, there will be enough basic blocks generated to do the same large scale optimisations you're suggesting.

@markshannon
Member Author

We don't want to have to convert to machine code to do partial evaluation.

Creating larger regions and optimizing those is the job of the tier 3 optimizer.

@markshannon
Member Author

The optimizer pipeline should look something like this:

[diagram: optimize_pipeline.gv]

@maximecb

> Yes, their argument is that the code will naturally be generated like that, since it is generated according to the path taken at runtime.

When considering branches that are taken for dynamic type tests, it happens very often that these branches are heavily biased in one direction (or even only ever go one direction). So, being able to see which branch is executed first at run time can give you a nice linear sequence of blocks. It's a very good heuristic in practice.

@GoEdgar

GoEdgar commented Jun 19, 2023

Am I correct that peephole optimizations, like removing unnecessary stack operations, will be performed on the uops that were formed a bit earlier by the tracer? If so, how do we convert the optimized uops back to super-instructions? It looks like we would need to return to the interpreter with a super-instruction whose logic could be any of a huge number of combinations of uops, and in that case the interpreter would need to know the particular combination in order to interpret it.

@diohabara

Hi @markshannon!
I have a question about the tier 2 optimizer. If this feature is implemented, will the format of future .pyc files be impacted, or will this change influence only the runtime bytecode format?

@markshannon
Member Author

This all happens at runtime. The format of .pyc files will not be changed.
We might change the format of .pyc files for other reasons, like faster imports, though.

@gvanrossum
Collaborator

@GoEdgar

> Am I correct that peephole optimizations, like removing unnecessary stack operations, will be performed on the uops that were formed a bit earlier by the tracer? If so, how do we convert the optimized uops back to super-instructions? It looks like we would need to return to the interpreter with a super-instruction whose logic could be any of a huge number of combinations of uops, and in that case the interpreter would need to know the particular combination in order to interpret it.

Maybe there's some terminological confusion here. "Super-instructions" are the exact opposite of what we emit to superblocks (for the latter, we emit micro-ops, usually called uops). For an example of a superblock, see this comment: python/cpython#106529 (comment)

Also, we're not using tracing to form superblocks -- we're using projection.
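The uop/super-instruction distinction can be made concrete with a small sketch: each tier 1 instruction expands into a sequence of micro-ops, and a projected superblock is the concatenation of those sequences. The instruction and uop names below are illustrative only, not CPython's actual uop set:

```python
# Illustrative sketch: projecting a linear tier 1 instruction sequence
# into a uop superblock. All names are invented for illustration.

UOP_EXPANSION = {
    "LOAD_FAST": ["_LOAD_FAST"],
    # A specialized binary add splits into a type guard plus the work.
    "BINARY_ADD": ["_GUARD_BOTH_INT", "_BINARY_ADD_INT"],
    "STORE_FAST": ["_STORE_FAST"],
}

def project_to_uops(tier1_code):
    """Expand each tier 1 instruction into its micro-ops and
    concatenate them into one linear superblock."""
    uops = []
    for instr in tier1_code:
        uops.extend(UOP_EXPANSION[instr])
    # Every superblock ends with an exit back to the tier 1 interpreter.
    uops.append("_EXIT_TRACE")
    return uops

trace = project_to_uops(["LOAD_FAST", "LOAD_FAST", "BINARY_ADD", "STORE_FAST"])
print(trace)
```

A super-instruction would go the other way: fusing several tier 1 instructions into one handler, rather than splitting one instruction into many uops as here.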
