dead store elimination in the compiler #104635

carljm · 2023-05-18T23:19:35Z

There are several code patterns that can cause multiple stores to the same locals index in near succession, without any intervening instructions that can possibly use the stored value. One such code pattern is e.g. a, a, b = x, y, z. Another (since PEP 709) is a = [_ for a in []]. There are likely others.

In such cases, all but the final store can be replaced with POP_TOP. This alone isn't a huge gain, but this can also allow apply_static_swaps to take effect in these cases (because it can't safely reorder writes to the same location, but it can safely reorder POP_TOPs at will), removing SWAPs, which shrinks code size and reduces runtime overhead.

Not sure how much of a gain this will be in realistic cases, but it's also not hard to implement, so it seems worth doing as a marginal improvement to compiler output.

Note that in order to maintain tracing / debugger compatibility, we can only do this if line number is the same for all the store instructions. And we definitely can't eliminate the final store, even if it also appears to be unused, since it can always be visible to tracers/debuggers/locals()/etc.

Linked PRs

The text was updated successfully, but these errors were encountered:

corona10 · 2023-05-28T13:24:16Z

I just experimented with my curiosity. (please understand that I'm not an optimization expert)
With my super naive implementation, it shows 14% slower than baseline.

I think that this is because of the non-existence of super instruction(POP_TOP__POP_TOP / POP_TOP__STORE_FAST)
while baseline has super instruction(STORE_FAST__STORE_FAST).

If we try to implement such optimization, do you think we must also implement the super instruction?
Or do you think that that scenario is not realistic? (I meant the 14% slower scenario only happens at microbenchmark?)

Please let me know if I understood your suggestion wrongly :)

baseline

  1           0 RESUME                   0

  2           2 LOAD_CONST               1 ((1, 2, 3, 4))
              4 UNPACK_SEQUENCE          4
              8 STORE_FAST               0 (a)
             10 STORE_FAST               0 (a)
             12 STORE_FAST               1 (b)
             14 STORE_FAST               1 (b)
             16 RETURN_CONST             0 (None)

suggestion

  1           0 RESUME                   0

  2           2 LOAD_CONST               1 ((1, 2, 3, 4))
              4 UNPACK_SEQUENCE          4
              8 POP_TOP
             10 STORE_FAST               0 (a)
             12 POP_TOP
             14 STORE_FAST               1 (b)
             16 RETURN_CONST             0 (None)

benchmark

import pyperf

runner = pyperf.Runner()
runner.timeit(name="bench dead_store",
              stmt="""
a, a, b, b = 1, 2, 3, 4
""")

corona10 · 2023-05-28T13:35:22Z

With super-instruction: 6-7% faster
: corona10@a96ed91


Mean +- std dev: [base] 10.7 ns +- 0.2 ns -> [opt] 12.2 ns +- 0.1 ns: 1.14x slower # Only removing dead store
Mean +- std dev: [base] 10.7 ns +- 0.2 ns -> [super] 10.1 ns +- 0.1 ns: 1.06x faster # Removing dead store + super_instruction

For more realistic example

import pyperf

runner = pyperf.Runner()
runner.timeit(name="bench dead_store",
              stmt="""
_, _, _, a = 1, 2, 3, 4
""")


Mean +- std dev: [base] 10.8 ns +- 0.2 ns -> [super] 10.1 ns +- 0.1 ns: 1.07x faster

corona10 · 2023-05-29T09:05:25Z

pyperformance benchmark: https://gist.github.com/corona10/2cfd404ac0cc9d9a61c71d4365e7c5bb
1.01x faster (might be noise)

carljm · 2023-05-30T21:34:09Z

Ah, I hadn't considered the possibility that we'd need new super-instructions to make this a win instead of a regression. Thanks for trying it out!

Not sure if there are reasons to avoid adding new super-instructions, but apart from that issue, this still seems worth it to me. Made some comments on the draft PR.

markshannon · 2023-05-31T09:38:45Z

OOI what are the advantages of this over the superoptimizer approach?
#102869

The superoptimizer would need to understand stores, but that doesn't seem too complex.

corona10 · 2023-05-31T16:10:45Z

OOI what are the advantages of this over the superoptimizer approach?

In my opinion, the two optimizations are responsible for optimization in different layers. I believe that flowgraph-based optimization can optimize with lower cost compared to simulating stack side effects, and the superoptimizer can optimize a wider window that a flowgraph cannot handle. Therefore, I think these two optimizations complement each other, and if flowgraph-based optimization does not interfere with the optimization of the superoptimizer, it is acceptable to add additional optimizations.

Maybe @carljm has a different opinion from me.

carljm · 2023-05-31T17:01:41Z

@markshannon This issue is just tracking the fact that we can (and should) replace dead STORE_FAST with POP_TOP, and if we do so, it will allow better static elimination of SWAP instructions (and maybe in some cases allow eliminating a corresponding load and the POP_TOP as well). AFAIK that isn't currently explicitly tracked or mentioned anywhere; the currently proposed superoptimizer design explicitly leaves stores out of scope.

I don't have strong opinions about whether we do this via the existing optimize_basic_block infra, or via an expanded superoptimizer design. But I think it's quite simple to do via the existing infra, so I don't see any reason to block doing it now on the basis that someday we might do it in a superoptimizer instead.

The main wrinkle is that it seems we need to add some POP_TOP based super-instructions in order to keep micro-benchmarks positive with this change (otherwise they lose the benefit of the existing STORE_FAST super-instructions), but that will be true whether we do this in super-optimizer or standalone.

iritkatriel · 2023-05-31T18:01:13Z

... multiple stores to the same locals index in near succession, without any intervening instructions that can possibly use the stored value.

Also need to make sure there isn't something between them that could raise an exception. We don't currently have this, we probably need to add this in the opcode_metadata.

markshannon · 2023-06-05T08:40:02Z

Are there many dead stores in Python, given that two writes must occur on the same line for the first to count as dead?

I suspect that they are quite rare. Has anyone tried to measure how many?

corona10 · 2023-06-05T13:35:22Z

Are there many dead stores in Python, given that two writes must occur on the same line for the first to count as dead?
I suspect that they are quite rare. Has anyone tried to measure how many?

Assuming that super instruction is applied when the corresponding code pattern is used, it is true that it is a small proportion when observed indirectly.
#105040 (comment)

However, this only means that super instruction does not need to be applied, and whether reducing the opcodes for (_, _, a = some_tuple) is a separate issue.

corona10 · 2023-06-05T14:32:35Z

However, this only means that super instruction does not need to be applied, and whether reducing the opcodes for (_, _, a = some_tuple) is a separate issue.

Okay we don't need to add extra super instructions for this optimization after Mark's PR is merged.
It becomes 3-4% faster.
cc @carljm

Benchmarks

import pyperf

runner = pyperf.Runner()
runner.timeit(name="bench dead_store",
              stmt="""
_, _, _, a = 1, 2, 3, 4
""")

Mean +- std dev: [base] 11.6 ns +- 0.1 ns -> [opt] 11.2 ns +- 0.1 ns: 1.03x faster

import pyperf

runner = pyperf.Runner()
runner.timeit(name="bench static swaps",
              stmt="""
_ = f(True, False)
""",
setup="""
def f(x, y):
    a, a = x, y
    return a
""")


Mean +- std dev: [static_base] 27.1 ns +- 0.3 ns -> [static_opt] 26.0 ns +- 0.2 ns: 1.04x faster

…ompiler

carljm · 2023-06-06T19:04:06Z

Are there many dead stores in Python

I also suspect it's not common. But given that the latest PR shows a clear improvement for this case without the need to add superinstructions, and with just a few lines in the compiler, it seems almost pure win to do it, even if the case is not common.

…ompiler

gh-105320)

…106583)

carljm added the performance Performance or resource usage label May 18, 2023

carljm mentioned this issue May 18, 2023

compiler can incorrectly optimize a run of stores to the same name preceded by a SWAP #104615

Closed

bedevere-bot mentioned this issue May 28, 2023

gh-104635: Eliminate redundant STORE_FAST instructions in the compiler #105040

Closed

corona10 added a commit to corona10/cpython that referenced this issue Jun 5, 2023

pythongh-104635: Eliminate redundant STORE_FAST instructions in the c…

1a8881b

…ompiler

bedevere-bot mentioned this issue Jun 5, 2023

gh-104635: Eliminate redundant STORE_FAST instructions in the compiler #105320

Merged

corona10 added a commit to corona10/cpython that referenced this issue Jun 5, 2023

pythongh-104635: Eliminate redundant STORE_FAST instructions in the c…

6b51678

…ompiler

corona10 added a commit to corona10/cpython that referenced this issue Jun 7, 2023

pythongh-104635: Eliminate redundant STORE_FAST instructions in the c…

c693438

…ompiler

corona10 added a commit that referenced this issue Jun 7, 2023

gh-104635: Eliminate redundant STORE_FAST instructions in the compiler (

aa5b762

gh-105320)

corona10 added a commit to corona10/cpython that referenced this issue Jul 4, 2023

pythongh-104635: add NO_SIDE_EFFECT flag to opcode metadata

ffbd4ef

corona10 added a commit to corona10/cpython that referenced this issue Jul 4, 2023

pythongh-104635: add NO_SIDE_EFFECT flag to opcode metadata

b8df695

corona10 added a commit to corona10/cpython that referenced this issue Jul 4, 2023

pythongh-104635: add NO_SIDE_EFFECT flag to opcode metadata

24a2aa0

corona10 added a commit to corona10/cpython that referenced this issue Jul 4, 2023

pythongh-104635: Add NO_EXCEPTION flag to opcode metadata

11cf8ec

bedevere-bot mentioned this issue Jul 4, 2023

gh-104635: Add NO_EXCEPTION flag to opcode metadata #106394

Closed

corona10 added a commit to corona10/cpython that referenced this issue Jul 9, 2023

pythongh-104635: Extend optimization for dead store elimination

f3b8d7e

bedevere-bot mentioned this issue Jul 9, 2023

gh-104635: Expand optimization for dead store elimination #106571

Closed

2 tasks

corona10 added a commit to corona10/cpython that referenced this issue Jul 10, 2023

pythongh-104635: Add a test case for variables that have a dependency.

e92bc67

bedevere-bot mentioned this issue Jul 10, 2023

gh-104635: Add a test case for variables that have a dependency. #106583

Merged

corona10 added a commit that referenced this issue Jul 11, 2023

gh-104635: Add a test case for variables that have a dependency. (gh-…

115df84

…106583)

iritkatriel added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Nov 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dead store elimination in the compiler #104635

dead store elimination in the compiler #104635

carljm commented May 18, 2023 •

edited by bedevere-bot

Loading

corona10 commented May 28, 2023 •

edited

Loading

corona10 commented May 28, 2023 •

edited

Loading

corona10 commented May 29, 2023 •

edited

Loading

carljm commented May 30, 2023

markshannon commented May 31, 2023

corona10 commented May 31, 2023

carljm commented May 31, 2023 •

edited

Loading

iritkatriel commented May 31, 2023

markshannon commented Jun 5, 2023

corona10 commented Jun 5, 2023

corona10 commented Jun 5, 2023 •

edited

Loading

carljm commented Jun 6, 2023

dead store elimination in the compiler #104635

dead store elimination in the compiler #104635

Comments

carljm commented May 18, 2023 • edited by bedevere-bot Loading

Linked PRs

corona10 commented May 28, 2023 • edited Loading

baseline

suggestion

benchmark

corona10 commented May 28, 2023 • edited Loading

corona10 commented May 29, 2023 • edited Loading

carljm commented May 30, 2023

markshannon commented May 31, 2023

corona10 commented May 31, 2023

carljm commented May 31, 2023 • edited Loading

iritkatriel commented May 31, 2023

markshannon commented Jun 5, 2023

corona10 commented Jun 5, 2023

corona10 commented Jun 5, 2023 • edited Loading

Benchmarks

carljm commented Jun 6, 2023

carljm commented May 18, 2023 •

edited by bedevere-bot

Loading

corona10 commented May 28, 2023 •

edited

Loading

corona10 commented May 28, 2023 •

edited

Loading

corona10 commented May 29, 2023 •

edited

Loading

carljm commented May 31, 2023 •

edited

Loading

corona10 commented Jun 5, 2023 •

edited

Loading