
Faster startup -- Experiment E -- "deep-freeze" code objects as static C data structures #84

Closed
gvanrossum opened this issue Aug 15, 2021 · 25 comments

@gvanrossum commented Aug 15, 2021

This idea was @markshannon's; it's a variant of something first proposed by Jeethu Rao of Facebook in https://bugs.python.org/issue34690. I'm just writing it up (and I may try to execute it). The name "deep-freeze" is Mark's.

We could write a Python script that generates C code for a code object (including nested code objects). This would completely avoid the need for unmarshalling frozen modules (but only those). It would replace the current approach to freezing the marshalled code objects (which also generates C code, but it's just an array of bytes).

Suppose a simple code object has 4 bytes of data, "ABCD". We can then generate something like this:

static PyBytesObject co_code_1 = {
    .ob_refcnt = 999999999,
    .ob_type = &PyBytes_Type,
    .ob_size = 4,
    .ob_shash = -1,  // To be filled in dynamically
    .ob_sval = "ABCD"
};

static PyCodeObject code_1 = {
    ...
    .co_code = (PyObject *)&co_code_1,
    ...
};

(Lots of details left out, including the restructuring of object headers.)

An immediate concern here is the intended separation of static objects for multiple interpreters (@ericsnowcurrently). We're introducing something here that would be tricky to clone per interpreter. So maybe that kills the idea immediately?

Another concern is the initializer for ob_sval -- this field is actually declared as an array of 1 char, and the compiler will presumably balk if we put more in there. I think there's a solution though by declaring such bytes objects as a struct containing a modified header (omitting ob_sval) and a separate array of characters:

static struct {
    PyBytesObjectWithoutObSval head;  // PyBytesObject without ob_sval field
    char ob_sval[4+1];
} co_code_1 = {
    .head = {
        .ob_refcnt = 999999999,
        .ob_type = &PyBytes_Type,
        .ob_size = 4,
        .ob_shash = -1,
    },
    .ob_sval = "ABCD"
};
@ericsnowcurrently commented Aug 17, 2021

We could write a Python script that generates C code for a code object (including nested code objects).

I like it!

This would completely avoid the need for unmarshalling frozen modules (but only those).

That certainly reduces the utility of the approach, though it would still clearly benefit startup performance. I suppose the question is: would the startup benefits make it worth adding this second way to encode modules (in addition to marshal, i.e. .pyc), with the associated maintenance costs?

FWIW, here's an idea to mitigate those costs (somewhat):

Considering that the generated code would be a relatively small subset of C, we do have some options for compiling source modules (or even .pyc) at runtime. A number of decent embeddable compilers exist (e.g. tcc).

This would require shipping with such a compiler, so it probably wouldn't make sense to ship as part of the CPython runtime. However, it could make sense as an extension module (e.g. published on PyPI), which would register an import hook.
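
To sketch what such an import hook might look like (everything here is hypothetical, in particular compile_module_to_c_extension(), which would drive the embedded compiler):

import importlib.abc
import importlib.machinery
import importlib.util
import sys

class CompileToCFinder(importlib.abc.MetaPathFinder):
    """Compile a module to a C extension at import time (hypothetical)."""

    def find_spec(self, fullname, path, target=None):
        ext_path = compile_module_to_c_extension(fullname, path)  # hypothetical helper
        if ext_path is None:
            return None  # defer to the normal .py/.pyc finders
        loader = importlib.machinery.ExtensionFileLoader(fullname, ext_path)
        return importlib.util.spec_from_file_location(fullname, ext_path, loader=loader)

sys.meta_path.insert(0, CompileToCFinder())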

It would replace the current approach to freezing the marshalled code objects (which also generates C code, but it's just an array of bytes).

+1

In addition to the performance benefits, it will be easier to read (and debug) than the marshal format. The only downside I see is that the generated text will be substantially (?) bigger. However, the difference likely won't be big enough to matter.

An immediate concern here is the intended separation of static objects for multiple interpreters (@ericsnowcurrently). We're introducing something here that would be tricky to clone per interpreter. So maybe that kills the idea immediately?

No PyObject * is safely shareable in a world where interpreters no longer share a GIL, so using those code objects as-is wouldn't work. However, I wouldn't say it's a non-starter. Here are some options (off the top of my head):

  • memcpy() once for each interpreter (perhaps use the generated one as-is in the main interpreter) -- a rough sketch of this option follows the list
  • use structs that map to PyCodeObject, etc., instead of generating the various PyObject directly
    • creating the PyCodeObject from it is still much lighter than unmarshalling
    • the whole object graph could be allocated with a single malloc() (using computed offsets for each needed object)
  • use a (semi) proxy type that wraps the generated PyCodeObject
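
A minimal sketch of that first option, assuming a hypothetical clone_for_interpreter() helper; it glosses over nested objects (co_consts, co_names, etc.) and over how the copy is eventually freed:

#include <Python.h>
#include <string.h>

/* Hypothetical: copy a statically generated template object into
   interpreter-local memory so no state is shared across interpreters. */
static PyObject *
clone_for_interpreter(PyObject *template_obj, Py_ssize_t total_size)
{
    PyObject *copy = (PyObject *)PyMem_Malloc(total_size);
    if (copy == NULL) {
        return PyErr_NoMemory();
    }
    memcpy(copy, template_obj, total_size);
    Py_SET_REFCNT(copy, 1);   /* the copy gets a normal, per-interpreter refcount */
    return copy;
}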

@gvanrossum commented Aug 18, 2021

Brief status update:

  • I can generate static data for 'hello world'
  • It compiles
  • It links, with a small dummy extension module that exports the code object
  • It segfaults when I try to run the code object (import hello; exec(hello.get_toplevel()))

I got it to work on Windows and Mac, but the C compilers are a bit different. In the end on Windows (for MSVC) I had to pretend the generated code was C++ to get the initializers to compile. Also, MSVC insists that the fields in a code object are initialized in the order in which they are declared. It seems clang is not so picky. (This is a pain because the fields have been reordered for improved performance.)
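
To illustrate the ordering constraint with a toy struct (this reflects my understanding of the behavior; it's not the real code object layout):

struct Example { int first; int second; };

/* Accepted by clang/gcc as C99, and by MSVC: */
struct Example in_order = { .first = 1, .second = 2 };

/* Accepted by clang/gcc when compiled as C, but rejected when the file is
   compiled as C++ with MSVC, where designated initializers must follow
   declaration order: */
struct Example out_of_order = { .second = 2, .first = 1 };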

Definitely still missing:

  • More object types, in particular int, float, complex, frozenset
  • Fill in several more fields of code objects (e.g. co_localsplus, which I have to reconstruct from co_varnames and friends); I suspect herein lies the cause of the segfault
  • Hook it up with the frozen import infrastructure

I think we may still need the old frozen import infrastructure; that generates C code for frozen files from a C program, but the new code generator is too complex to write in C. That means we need a complete working Python interpreter to run the generator. And it must be the Python we're targeting; it can't be an older version, because it compiles the source and introspects the resulting code object as the input for the code generator.

Also, Eric and I talked it through and it looks like we'll be able to combine this with his "freeze all the startup modules" project, and then it will eliminate all of the unmarshalling cost at startup.

A concern is the size of the generated code -- "hello world" turns into 300 lines of C code.

@gvanrossum commented Aug 19, 2021

Another status report. I think I've got all fields of the code object filled in now -- a simple test program using cells seems to work.

Still to do:

  • float
  • complex
  • bool
  • non-ASCII strings
  • frozenset(*)
  • multi-digit integers
  • Ellipsis
  • Integrate with freeze infrastructure
  • Reduce output size by sharing constants
  • Perf measurements
  • Integrate with Eric's "freeze 80 modules"
  • Add fixups so we can compile as C on Windows
  • Generate Windows project files
  • Ignore compiled code if MAGIC has changed

(*) For now, I hacked around frozenset by generating a tuple. This works for "if x in {2, 3, 5}" and that's the only context where a frozenset can occur in a code object, AFAICT (it's put there by the AST optimizer).
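
For reference, the AST optimizer's handiwork shows up in the disassembly; on a 3.10-era interpreter the output looks roughly like this (exact formatting varies by version):

>>> import dis
>>> dis.dis("x in {2, 3, 5}")
  1           0 LOAD_NAME                0 (x)
              2 LOAD_CONST               0 (frozenset({2, 3, 5}))
              4 CONTAINS_OP              0
              6 RETURN_VALUE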

Currently the generated code is about 50x the size of the Python source (counting lines).

@dpgeorge commented:

In case you are interested, and maybe it leads to some inspiration: in MicroPython there is the concept of "frozen" code which is generated through the process: source -> compiled bytecode -> C data structure (.py -> .mpy -> .c). The final C data structures contain compiled Python functions, all their nested functions/classes, and all their nested constant objects (str, bytes, long int, float). They are all const so can be placed in ROM and don't need any fixup at load/import time.

The tool that converts .mpy (similar to .pyc) to .c is https://github.com/micropython/micropython/blob/master/tools/mpy-tool.py . An example generated .c file is attached:
frozen_content.c.txt (that one is pretty small; usually they are 10k+ lines, even up to 100k+). Freezing code like this is the main way that MicroPython is used in production.

@gvanrossum commented Aug 20, 2021

Oh, that's very similar -- to the point where I'm glad MicroPython is also open source else we'd have to license the patent off you. :-) The main differences are:

  • I'm using the compile() builtin instead of reading the compiled bytecode from a file
  • CPython objects have in-line reference counts so the data can't be const or ROM-able

Also I expect that eventually we'll need some fixup, to support multiple interpreters (see discussion above). And to handle frozenset.

@markshannon commented Aug 20, 2021

I think we could find some prior art 🙂

One thing I found worked well doing this for HotPy was using standardized names. That way, if you reference the string "hello", you don't need to record where it went, just that it has been/will be emitted.
Reference to string "hello": &str_hello.
Reference to the tuple ("hi", 1): &tuple2__str_hi__int_1 (make sure to escape underscores in strings).
etc, etc.

It also makes duplicates and missing references really easy to debug.
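
A toy sketch of the scheme (the function here is invented for illustration, not HotPy's actual code):

def const_name(value):
    """Derive a stable C identifier from a constant's value."""
    if isinstance(value, str):
        return "str_" + value.replace("_", "__")        # escape underscores
    if isinstance(value, int):
        return "int_" + str(value).replace("-", "neg")
    if isinstance(value, tuple):
        items = "__".join(const_name(v) for v in value)
        return f"tuple{len(value)}__{items}"
    raise NotImplementedError(type(value))

assert const_name("hello") == "str_hello"
assert const_name(("hi", 1)) == "tuple2__str_hi__int_1"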

@brandtbucher commented:

For now, I hacked around frozenset by generating a tuple. This works for "if x in {2, 3, 5}" and that's the only context where a frozenset can occur in a code object, AFAICT (it's put there by the AST optimizer).

There's one more place that I know of: constant set displays longer than two elements. It's done in the compiler, in starunpack_helper:

>>> import dis
>>> dis.dis("{0, 1, 2}")
  1           0 BUILD_SET                0
              2 LOAD_CONST               0 (frozenset({0, 1, 2}))
              4 SET_UPDATE               1
              6 RETURN_VALUE

...though tuples probably work fine there as well.

@gvanrossum commented:

One thing I found worked well doing this for HotPy, was using standardized names. That way if you reference the string "hello", you don't need to record where it went, just that it has been/will be emitted.
Reference to string "hello": &str_hello.
Reference to the tuple ("hi", 1): &tuple2__str_hi__int_1 (make sure to escape underscores in strings).
etc, etc.

I was thinking of using a similar mechanism as used by the bytecode compiler and by marshal to merge duplicates -- just keep a dict whose keys are (type, value) and whose values are (in this case) the name of the data structure generated for that value. Since everything is a DAG we can just generate things bottom up.
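
Roughly like this (a sketch only; generate_c_definition() stands in for the real code emitter):

_cache = {}   # (type, value) -> name of the C variable already generated

def emit_constant(value):
    key = (type(value), value)   # include the type so 1, 1.0 and True stay distinct
    if key in _cache:
        return _cache[key]
    name = f"const_{len(_cache)}"
    generate_c_definition(name, value)   # hypothetical: emits the static C struct
    _cache[key] = name
    return name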

It may be nice though to have some indication of the value in the variable name -- currently the variable name just reflects where it is in the structure (e.g. toplevel_consts_44_consts_3).

In any case that's all just niceties. I've hooked everything up on UNIX (*) and it seems to work, although I haven't shown anything is faster yet. (Since I only have importlib/_bootstrap and importlib/_bootstrap_external working, based on Eric's flamegraph I'd expect only a 3-5% improvement, which is hard to measure on my Mac.)

A new concern is popping up in my head regarding bootstrapping: the code generator relies on the built Python interpreter -- it doesn't work with an older version, since the various fields of code objects have different meanings. But now suppose someone is changing the code object (for example, changing some bytecode instruction). We need to be able to run the generator without depending on the code it generated previously. The best thing I can think of for now is to have an environment variable that means "don't use generated code objects". Instead, it will fall back on the old frozen marshal data.


(*) I don't want to create the MSVC project files until I have to, plus I doubt that we can link with C++ files.

@gvanrossum commented:

Quick note (also @ericsnowcurrently): On Windows, since everything's linked statically into the core Python DLL, it suffices to add all the .c files to PCbuild\pythoncore.vcxproj. There's a long list of <ClCompile .../> items to which we can add these.

@gvanrossum commented:

Another quick note: according to Steve Dower, this was proposed a few years ago by Instagram, and at a core sprint Larry made an implementation that saw a 20-30% improvement (in startup time, I presume). Alas, the PR was rejected because people wanted to be able to edit the .py files and have those changes take effect immediately. Apparently Larry solved this by also searching for the module on sys.path, but that brought the improvement down to a meager 1% (not surprising, because stat() is almost as expensive as open(), and read() is virtually free in comparison).

So we may have to deal with this somehow. For changes to the bytecode interpreter we can bake the magic number into the generated code and check it upon use (falling back on the disk version if there's no match), but for changes to the .py files users would at least have to remember to type "make". Hm.
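
Something like this guard is what I have in mind (a sketch; deepfrozen_magic and its value are made up):

#include <Python.h>

/* Hypothetical: the magic number recorded when the C code was generated. */
static const long deepfrozen_magic = 3439;   /* example value only */

static int
deepfrozen_code_is_usable(void)
{
    /* If the bytecode format changed since generation time, ignore the
       deep-frozen code object and fall back to the .py/.pyc on disk. */
    return deepfrozen_magic == PyImport_GetMagicNumber();
}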

@gvanrossum commented:

I have things kind of working with a variant on Eric's freeze 80 modules, but on my noisy Mac laptop I can't see much of a difference yet; the total time for "python -S -c pass" is ~16 ms each way, and the noise is 2-3 ms. :-(

The binary size tripled from 4.5 MiB to 13.4 MiB though, so I have to work on that (reducing it will also improve times).

@gvanrossum changed the title from "Faster startup -- Experiment E -- 'freeze' code objects as static C data structures" to "Faster startup -- Experiment E -- 'deep-freeze' code objects as static C data structures" on Aug 25, 2021
@gvanrossum commented Aug 25, 2021

I ended up removing the <encodings> line from freeze_modules.py, because freezing the entire encodings package added 123 more frozen files and 122 more deep-frozen files for little benefit -- I think startup is actually faster without these.

Here are some results (in summary: startup is 19% faster):

Platform: darwin, N=100
  python3.10 ...
    100 python3.10 runs in 1.561 sec
  Programs/_bootstrap_python ...
    100 Programs/_bootstrap_python runs in 1.514 sec
  ./python.exe ...
    100 ./python.exe runs in 1.271 sec
__TEXT  __DATA  __OBJC  others  dec     hex
2736128 229376  0       4295901184      4298866688      1003b8000       Programs/_bootstrap_python
2916352 1015808 0       4296638464      4300570624      100558000       ./python.exe

The last two lines show the segment sizes; to summarize, the frozen and deep-frozen files together add about 180 KiB text and 800 KiB of data (I don't know what the other segments are). But we get the data (at least the part representing deep-frozen files, not the part representing frozen marshal data) back in a corresponding reduction in heap size.

The "Programs/_bootstrap_python" is a special Python binary that contains no deep-frozen files and only the three traditionally frozen files with marshal data (importlib/_bootstrap, importlib/_bootstrap_external, and zipimport). It should correspond to python.exe on the main branch, before this work. The "python3.10" binary used for the test is the one from the PSF installer, it's built differently (using frameworks, IIUC) but its optimization level is the same (PGO etc.).

(I didn't bother showing results with the (deep-)frozen encodings package, but it was significantly slower, around 1.49 sec.)

@gvanrossum commented Aug 25, 2021

For comparison, here's the same test script run in the 3.10 branch (with _bootstrap_python commented out):

Platform: darwin, N=100
  python3.10 ...
    100 python3.10 runs in 1.554 sec
  ./python.exe ...
    100 ./python.exe runs in 1.405 sec
__TEXT  __DATA  __OBJC  others  dec     hex
2785280 229376  0       4296409088      4299423744      100440000

This shows that the installed python3.10 is slower than the built Python (perhaps due to it loading Python from a DLL?), but my built Python with deep-frozen code is still faster.

@gvanrossum commented Aug 25, 2021

Starting a new TODO list:

  • Add fixups so we can compile as C on Windows
  • Generate Windows project files
  • Ignore compiled code if MAGIC has changed
  • Reintegrate with Eric's freeze_modules.py script once it's ready

@ericsnowcurrently commented Aug 26, 2021

FWIW, with just the existing freezing (see #82) I see a roughly 15% improvement of total startup speed.

I ended up removing the <encodings> line from freeze_modules.py, because freezing the entire encodings package added 123 more frozen files and 122 more deep-frozen files for little benefit -- I think startup is actually faster without these.

If I exclude encodings from freezing it's a little slower.

@nascheme commented:

I have to wonder how this is considered Mark's idea. The Instagram people did a similar thing years ago. I did quite a lot of work on it, starting with Larry's PR. I described the work in this project; see below:

#32 (comment)

nascheme/cpython@87b2bf3

@gvanrossum commented:

I have to wonder how this is considered Mark's idea.

I’m sorry, I wasn’t aware of the prior art when I started this issue.

The Instagram people did a similar thing years ago. I did quite a lot of work on it, starting with Larry's PR. I described the work in this project; see below:

#32 (comment)

nascheme/cpython@87b2bf3

Thanks, I will read that. Did you get it to work on Windows?

@gvanrossum commented:

Okay, now I feel even sillier, since that approach is very similar to mine, and I responded to your comment. But I don't get why the serializer has to be written in C; mine is pure Python (generating C, of course). I do see similar timing results. I'm also curious which tests didn't pass for you (I didn't get to that yet).

@nascheme commented:

Did you get it to work on Windows?

I never tested it so I suppose there could be problems. I'm not sure if Jeethu or Larry ever tested it either.

@nascheme commented:

It seems to me there is no reason the serializer has to be in C. I suspect Jeethu found it easier to write that way. I didn't change much in how the serializer worked from the original patch, just fixed some problems with newer versions of Python and made it easier to re-generate and build.

I can't recall many details about the issues; possibly I fixed nearly all test failures. One remaining issue was that _collections_abc cannot be frozen. I think that could be due to some subtle import/package system behavior.

@gvanrossum commented:

Weird, I managed to freeze _collections_abc without problems.

Jeethu’s serializer does frozen sets, which have a lot of internal fields. I punted by replacing them with tuples; those work fine in the places where bytecode uses them. :-)

I also punt on hashes; those get computed on first use if you set them to -1.

I set all refcounts to 999999999 and don’t worry about missing refs to None; these objects are immortal.

The Windows (MSVC) C compiler doesn’t like references to external (or maybe all other) objects; you have to use C++ mode…

@markshannon commented:

As various people seem keen to claim that this was their idea, I would point out that freezing the state of an executable is much older than any of the above claims, dating back to the 1980s at least.

As for applying it to Python, the original HotPy was built with freezing back in 2009, but the RPython tool chain could be considered an elaborate version of this idea and predates that by a few years.

@gvanrossum commented Sep 14, 2021

(Status update: I'm waiting for Eric Snow to finish several changes to the frozen modules infrastructure. Once that has landed and is deemed stable I will re-integrate my code with that infrastructure and do proper benchmarks to determine whether to proceed.)

(Further update:) Landed the UNIX version.

@gvanrossum commented Nov 17, 2021

Working on Windows now. This requires:

  • Manually generate deep-frozen .c files
  • Add lines like <ClCompile Include="..\Python\deepfreeze\os.c" /> to pythoncore.vcxproj
  • Figure out what to do about codecs.c (there's already Python\codecs.c)
  • Fix an issue where on Windows, ntpath is substituted for posixpath in frozen.c (bpo-45272, bpo-45273)
  • Update deepfreeze.py to run with any old Python (it must read the .h file written by _freeze_module.exe and then use a pure-Python marshal.loads() implementation); a rough sketch of that step follows this list
  • Update _freeze_module.vcxproj to run the deepfreeze.py script (the <Exec> element is an implicit loop over whatever's given as %(None.ModName) etc.)
  • Profile and speed up umarshal.py
  • Update freeze_modules.py to do the project file updates
  • Cleanup: Add proper unit tests for umarshal.py
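
A rough sketch of that header-reading step (the exact layout of the generated .h file and the umarshal module are assumptions):

import re

def read_frozen_header(path):
    # The header written by _freeze_module contains roughly:
    #   const unsigned char _Py_M__os[] = { 99, 0, 0, 0, ... };
    with open(path, encoding="utf-8") as f:
        text = f.read()
    body = text.partition("{")[2].partition("}")[0]
    return bytes(int(n) for n in re.findall(r"\d+", body))

data = read_frozen_header("Python/frozen_modules/os.h")
# code = umarshal.loads(data)   # pure-Python unmarshalling, per the item above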

Draft PR: python/cpython#29648

We can then also update the UNIX Makefile to get rid of _bootstrap_python (but there will still be dependencies).

@gvanrossum commented:

Despite the open items I consider this done.
