Increase fraction of code executed by tier 2. #118093

Open
2 of 3 tasks
Tracked by #654
markshannon opened this issue Apr 19, 2024 · 2 comments
Labels
3.14 new features, bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@markshannon
Member

markshannon commented Apr 19, 2024

According to stats and profiling only about 40% of bytecode instructions are executed by tier 2 and the remaining 60% by tier 1.

With the expected improvements to the JIT and tier 2 optimizer, we expect tier 2 (with JIT) to be significantly faster than tier 1.
It therefore makes sense to get the fraction of instructions executed by tier 2 up from 40% to nearer 90%.

To do that we need to:

Linked PRs

@markshannon markshannon added performance Performance or resource usage interpreter-core (Objects, Python, Grammar, and Parser dirs) 3.13 bugs and security fixes labels Apr 19, 2024
@brandtbucher
Member

brandtbucher commented Jul 17, 2024

By my count we're currently hovering around 54% of code executed in tier two (our benchmarks run about 266 billion tier one instructions on normal builds and 122 billion tier one instructions on JIT builds). I've identified a few strategies for improving this (based on stats and tracing through how we execute a bunch of the benchmarks) and will start landing PRs soon. No magic bullets here, just chipping away at things:

  • Add specializations for CALL_KW and CALL_FUNCTION_EX. We probably don't need to do anything too crazy here early on in terms of optimizing calls... we can start by just adding a handful of specializations that allow us to trace through them instead of ending the trace.
  • Add tier two support to several other instructions that are prematurely ending traces. Some of these are easy (CALL_LIST_APPEND, IMPORT_NAME, LOAD_NAME, BUILD_SET, SEND_GEN, and IMPORT_FROM), and some are harder (LOAD_ATTR_PROPERTY, BINARY_SUBSCR_GETITEM, CALL_ALLOC_AND_ENTER_INIT, RAISE_VARARGS, and BINARY_OP_INPLACE_ADD_UNICODE).
  • Specialize SEND_ASYNC_GEN_ANEXT, using a similar shim frame as CALL_ALLOC_AND_ENTER_INIT.
  • Handle underflow, either dynamically (with DYNAMIC_EXIT) or statically (by using the current stack when projecting to infer callers). A more radical idea could be to start recording traces instead of projecting them. That simplifies a lot of things, but is a big rewrite of some pretty core stuff.
  • Turn more DEOPT_IFs into EXIT_IFs for better handling of control flow and polymorphism. _FOR_ITER_TIER_TWO is an obvious candidate here, but there are others, too.
  • Allow shorter traces (it's tricky to do this while still requiring progress, but doable).
  • Better handling of polymorphism (our current progress requirement inhibits this, but that can be relaxed without too much trouble).
  • Remove invalid traces from side exits (currently we just remove them from the bytecode, and side exits not only keep the invalid trace alive and continually deopting, but also prevent new traces from taking their place).
  • Be better about closing loops in one trace, by allowing a single jump to occur anywhere in a trace, rather than always at the start.
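
As a rough illustration of the second bullet, the sketch below statically counts a few of the "easy" instructions in a piece of source code using the standard `dis` module. This is not how the stats in this thread were gathered: it only inspects unspecialized bytecode, so specialized forms such as CALL_LIST_APPEND (which appear only after the interpreter warms up) are invisible to it. The names `WATCHED`, `count_watched_opcodes`, and `SAMPLE` are made up for this example.

```python
import dis
from collections import Counter

# Unspecialized opcode names taken from the "easy" list above (assumption:
# these names exist in the Python version running this script).
WATCHED = {"IMPORT_NAME", "IMPORT_FROM", "LOAD_NAME", "BUILD_SET"}


def count_watched_opcodes(source: str) -> Counter:
    """Count WATCHED opcodes in the module-level code compiled from source."""
    code = compile(source, "<example>", "exec")  # compiles only, never runs
    return Counter(
        instr.opname
        for instr in dis.get_instructions(code)
        if instr.opname in WATCHED
    )


# Hypothetical sample: two imports, a name load, and a non-constant set.
SAMPLE = """
import os
from os import path
x = 1
s = {x, x}
"""

counts = count_watched_opcodes(SAMPLE)
for name in sorted(counts):
    print(name, counts[name])
```

A scan like this only shows where such instructions occur in the source; whether they actually end a trace depends on the run-time behavior described in the list above.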

My motivation for this is to make JIT improvements more pronounced. We currently spend less than 10% of our time in the JIT (vs ~25% of our time in tier one), which means that we need to improve the performance of JIT code by over 10% just to see a 1% improvement on the benchmarks. My (probably ambitious) goal is to get the fraction of code executed in tier two up to around 80% (meaning, in the neighborhood of 25%-30% of the total time spent running the benchmarks) in the next couple of weeks. Then the improvements can be easier to measure and iterate on.
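
The arithmetic behind these numbers can be sketched as follows. This assumes the 54% figure is derived as one minus the ratio of the two tier one instruction counts quoted earlier, reads "improve the performance of JIT code by 10%" as JIT code running 1.10x faster, and applies Amdahl's law. The time fractions 0.10 and 0.275 are illustrative stand-ins for "less than 10%" and "25%-30%"; the function names are made up for this example.

```python
def tier_two_fraction(tier_one_normal: float, tier_one_jit: float) -> float:
    """Fraction of instructions no longer executed in tier one on JIT builds."""
    return 1.0 - tier_one_jit / tier_one_normal


def overall_speedup(time_fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only `time_fraction` of the
    run time becomes `local_speedup` times faster."""
    return 1.0 / ((1.0 - time_fraction) + time_fraction / local_speedup)


# Instruction counts quoted earlier in this comment.
print(f"tier two share: {tier_two_fraction(266e9, 122e9):.1%}")  # 54.1%

# 10% of time in JIT code, JIT code made 1.10x faster (assumed values).
print(f"gain at 10% of time: {overall_speedup(0.10, 1.10) - 1:.2%}")  # 0.92%

# 27.5% of time in JIT code (midpoint of the stated 25%-30% goal).
print(f"gain at 27.5% of time: {overall_speedup(0.275, 1.10) - 1:.2%}")  # 2.56%
```

Under these assumptions, the same 10% improvement to JIT code is worth roughly three times as much on the benchmarks once the time share reaches the target range.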

@brandtbucher
Member

It's also worth noting that our stats are currently broken on benchmarks that use C extensions or spawn subprocesses. So the actual numbers may vary a bit right now, but probably aren't heavily biased one way or another.

@brandtbucher brandtbucher added 3.14 new features, bugs and security fixes and removed 3.13 bugs and security fixes labels Jul 17, 2024
jeremyhylton pushed a commit to jeremyhylton/cpython that referenced this issue Aug 19, 2024
markshannon added a commit that referenced this issue Aug 20, 2024
…123140)

* Convert CALL_ALLOC_AND_ENTER_INIT to micro-ops such that tier 2 supports it

* Allow inexact arguments for CALL_ALLOC_AND_ENTER_INIT.
blhsing pushed a commit to blhsing/cpython that referenced this issue Aug 22, 2024
pythonGH-123140)

* Convert CALL_ALLOC_AND_ENTER_INIT to micro-ops such that tier 2 supports it

* Allow inexact arguments for CALL_ALLOC_AND_ENTER_INIT.
markshannon pushed a commit that referenced this issue Aug 22, 2024
…_GENERAL` (GH-123212)

Specialize classes without vectorcall as CALL_NON_PY_GENERAL