Faster startup -- Experiment E -- "deep-freeze" code objects as static C data structures #84
Comments
I like it!
That certainly reduces the utility of the approach, though it would benefit startup performance. I suppose the question is: would the startup benefits make it worth adding this second way to encode modules (in addition to marshal, i.e. .pyc), with the associated maintenance costs?

FWIW, here's an idea to mitigate those costs (somewhat): considering that the generated code would be a relatively small subset of C, we do have some options for compiling source modules (or even .pyc) at runtime. A number of decent embeddable compilers exist (e.g. tcc). This would require shipping with such a compiler, so it probably wouldn't make sense to ship as part of the CPython runtime. However, it could make sense as an extension module (e.g. published on PyPI) that registers an import hook.
+1. In addition to the performance benefits, it will be easier to read (and debug) than the marshal format. The only downside I see is that the generated text will be substantially (?) bigger. However, the difference likely won't be large enough to matter.
Brief status update:
I got it to work on Windows and Mac, but the C compilers are a bit different. In the end on Windows (for MSVC) I had to pretend the generated code was C++ to get the initializers to compile. Also, MSVC insists that the fields in a code object are initialized in the order in which they are declared. It seems clang is not so picky. (This is a pain because the fields have been reordered for improved performance.) Definitely still missing:
I think we may still need the old frozen import infrastructure: that generates C code for frozen files from a C program, but the new code generator is too complex to write in C. That means we need a complete working Python interpreter to run the generator. And it must be the Python we're targeting -- it can't be an older version, because it compiles the source and introspects the resulting code object as the input for the code generator.

Also, Eric and I talked it through and it looks like we'll be able to combine this with his "freeze all the startup modules" project, and then it will eliminate all the unmarshalling cost of startup. A concern is the size of the generated code -- "hello world" turns into 300 lines of C code.
Another status report. I think I've got all fields of the code object filled in now -- a simple test program using cells seems to work. Still to do:
(*) For now, I hacked around frozenset by generating a tuple. This works for "if x in {2, 3, 5}", and that's the only context where a frozenset can occur in a code object, AFAICT (it's put there by the AST optimizer). Currently the generated code is about 50x the size of the Python source (counting lines).
In case you are interested, and maybe it leads to some inspiration: in MicroPython there is the concept of "frozen" code, which is generated through the process source -> compiled bytecode -> C data structure (.py -> .mpy -> .c). The final C data structures contain compiled Python functions, all their nested functions/classes, and all their nested constant objects (str, bytes, long int, float). The tool that converts .mpy (similar to .pyc) to .c is https://github.com/micropython/micropython/blob/master/tools/mpy-tool.py .
Oh, that's very similar -- to the point where I'm glad MicroPython is also open source, else we'd have to license the patent off you. :-) The main differences are:
Also I expect that eventually we'll need some fixup to support multiple interpreters (see discussion above), and to handle frozenset.
I think we could find some prior art 🙂 One thing I found worked well doing this for HotPy was using standardized names. That way, if you reference the string "hello", you don't need to record where it went, just that it has been/will be emitted. It also makes duplicates and missing references really easy to debug.
There's one more place that I know of: constant set displays longer than two elements. It's done in the compiler:

>>> import dis
>>> dis.dis("{0, 1, 2}")
  1           0 BUILD_SET                0
              2 LOAD_CONST               0 (frozenset({0, 1, 2}))
              4 SET_UPDATE               1
              6 RETURN_VALUE

...though tuples probably work fine there as well.
I was thinking of using a similar mechanism as used by the bytecode compiler and by marshal to merge duplicates -- just keep a dict whose keys are ... It may be nice, though, to have some indication of the value in the variable name -- currently the variable name just reflects where it is in the structure.

In any case that's all just niceties. I've hooked everything up on UNIX (*) and it seems to work, although I haven't shown anything is faster yet. (Since I only have importlib/_bootstrap and importlib/_bootstrap_external working, based on Eric's flamegraph I'd expect only a 3-5% improvement, which is hard to measure on my Mac.)

A new concern is popping up in my head regarding bootstrapping: the code generator relies on the built Python interpreter -- it doesn't work with an older version, since the various fields of code objects have different meanings. But now suppose someone is changing the code object (for example, changing some bytecode instruction). We need to be able to run the generator without depending on the code it generated previously. The best thing I can think of for now is to have an environment variable that means "don't use generated code objects"; instead, it will fall back on the old frozen marshal data.

(*) I don't want to create the MSVC project files until I have to, plus I doubt that we can link with C++ files.
Quick note (also @ericsnowcurrently): On Windows, since everything's linked statically into the core Python DLL, it suffices to add all the .c files to ...
Another quick note: According to Steve Dower, this was proposed a few years ago by Instagram, and at a core sprint Larry made an implementation that saw 20-30% improvement (in startup time, I presume). Alas, the PR was rejected because people wanted to be able to edit the .py files and have those changes take effect immediately. Apparently Larry solved this by also searching for the module on sys.path, but that brought the improvement down to a meager 1% (not surprising, because stat() is almost as expensive as open(), and read() is virtually free in comparison). So we may have to deal with this somehow. For changes to the bytecode interpreter we can bake the magic number into the generated code and check it upon use (falling back on the disk version if there's no match), but for changes to the .py files users would at least have to remember to type "make". Hm.
I have things kind of working with a variant on Eric's freeze of 80 modules, but on my noisy Mac laptop I can't see much of a difference yet; the total time for "python -S -c pass" is ~16 ms each way, and the noise is 2-3 ms. :-( The binary size tripled from 4.5 MiB to 13.4 MiB, though, so I have to work on that (reducing it will also improve times).
I ended up removing the line ... Here are some results (in summary: startup is 19% faster):
The last two lines show the segment sizes; to summarize, the frozen and deep-frozen files together add about 180 KiB of text and 800 KiB of data (I don't know what the other segments are). But we get the data back (at least the part representing deep-frozen files, not the part representing frozen marshal data) in a corresponding reduction in heap size.

The "Programs/_bootstrap_python" is a special Python binary that contains no deep-frozen files and only the three traditionally frozen files with marshal data (importlib/_bootstrap, importlib/_bootstrap_external, and zipimport). It should correspond to python.exe on the main branch, before this work.

The "python3.10" binary used for the test is the one from the PSF installer; it's built differently (using frameworks, IIUC) but its optimization level is the same (PGO etc.). (I didn't bother showing results with the (deep-)frozen encodings package, but it was significantly slower, around 1.49 sec.)
For comparison, here's the same test script run in the 3.10 branch (with _bootstrap_python commented out):
This shows the installed Python is slower than the built Python (perhaps due to it loading Python from a DLL?), but my built Python with deep-frozen code is still faster.
Starting a new TODO list:
FWIW, with just the existing freezing (see #82) I see a roughly 15% improvement in total startup speed.
If I exclude ...
I have to wonder how this is considered Mark's idea. The Instagram people did a similar thing years ago. I did quite a lot of work on it, starting with Larry's PR. I described the work in this project, see below:
I’m sorry, I wasn’t aware of the prior art when I started this issue.
Thanks, I will read that. Did you get it to work on Windows?
Okay, now I feel even sillier, since that approach is very similar to mine, and I responded to your comment. But I don’t get why the serializer has to be written in C -- mine is pure Python (generating C, of course). I do see similar timing results. I’m also curious what tests didn’t pass for you (I didn’t get to that yet).
I never tested it so I suppose there could be problems. I'm not sure if Jeethu or Larry ever tested it either. |
It seems to me there is no reason the serializer has to be in C. I suspect Jeethu found it easier to write that way. I didn't change much in how the serializer worked from the original patch; I just fixed some problems with newer versions of Python and made it easier to re-generate and build. I can't recall many details about the issues; it's possible that I fixed nearly all test failures. One remaining issue was that ...
Weird, I managed to freeze _collections_abc without problems. Jeethu’s serializer does frozen sets, which have a lot of internal fields. I punted by replacing them with tuples; those work fine in the places where bytecode uses them. :-) I also punt on hashes; those get computed on first use if you set them to -1. I set all refcounts to 999999999 and don’t worry about missing refs to None, since these objects are immortal. The Windows (MSVC) C compiler doesn’t like references to external (or maybe all other) objects; you have to use C++ mode…
As various people seem keen to claim that this was their idea, I would point out that freezing the state of an executable is much older than any of the above claims, dating back to the 1980s at least. As for applying it to Python: the original HotPy was built with freezing back in 2009, and the RPython toolchain could be considered an elaborate version of this idea, predating that by a few years.
(Status update: I'm waiting for Eric Snow to finish several changes to the frozen modules infrastructure. Once that has landed and is deemed stable, I will re-integrate my code with that infrastructure and do proper benchmarks to determine whether to proceed.)

(Further update:) Landed the UNIX version.
Working on Windows now. This requires:
Draft PR: python/cpython#29648. We can then also update the UNIX ...
Despite the open items I consider this done.
This is a variant of something first proposed by Jeethu Rao of Facebook in https://bugs.python.org/issue34690; the name "deep-freeze" is @markshannon's. I'm just writing it up (and I may try to execute it).

We could write a Python script that generates C code for a code object (including nested code objects). This would completely avoid the need for unmarshalling frozen modules (but only those). It would replace the current approach to freezing the marshalled code objects (which also generates C code, but just an array of bytes).
Suppose a simple code object has 4 bytes of data, "ABCD". We can then generate something like this: ... (Lots of details left out, including the restructuring of object headers.)
An immediate concern here is the intended separation of static objects for multiple interpreters (@ericsnowcurrently). We're introducing something here that would be tricky to clone per interpreter. So maybe that kills the idea immediately?
Another concern is the initializer for ob_sval -- this field is actually declared as an array of 1 char, and the compiler will presumably balk if we put more in there. I think there's a solution, though, by declaring such bytes objects as a struct containing a modified header (omitting ob_sval) and a separate array of characters: ...