-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: pclntab is too big #36313
Comments
Also discussed in #27266 And there are some interesting suggestions in https://news.ycombinator.com/item?id=21907144 |
I spent some time looking at this earlier this year (see e.g. CLs 172079 and 171820), and then stalled. The major impediment was the inscrutability of the linker. I hope that the new linker will open doors. There is definitely lots more room for improvement. I have had private conversations with a few Go team members about improving the string encoding in pclntab. Most strings in the pclntab never actually get used by the program, and thus could be materialized lazily, perhaps from traditional compression or from a trie-ish encoding. This isn't trivial but you could do something like: Set up an appropriately-sized bss symbol with known offsets for all materialized strings, and when you need to materialize a string, decompress/assemble it into the correct spot for all other future users (atomically or with locks somehow). I did a very rough calculation and estimated that this could shrink pclntab by about 10%. |
@jeremyfaller has also been brainstorming about this. The concern I would have about such a scheme (putting on my devil's advocate hat) is that rodata can be shared across multiple instances of an executable across a given system, whereas if you go the BSS route, each process has to have it's own chunk of virtual memory. On the other hand, it's not at all clear whether people care about that (who out there is running many instances of the same Go program on the same machine?). |
I think there's some confusion in the blog post. Sorry if I'm the one confused instead. The blog post talks about Has anyone checked which tables are actually the problem? I think all the commentary above is discussing the file/line tables. |
Speaking for myself, I've looked at all of them; see CL 171760 for a recent tweak. I think the most attention is going to the file/line tables because that's where the most obvious inefficiencies are. (A few years ago I experimented with using a nybble-based instead of byte-based varint encoding for pc/value pairs. It helped, but not very much.) But all that doesn't foreclose the possibility of me being confused. :) |
Yes, as @heschik said, there are several tables in pclntab. Here is a breakdown:
The strings are not large portion of the pclntab. I have another confusion: the "growth" is growing over time, or the output size grows as the input size grows? I.e. are CockroachDB v19.1 and v1.0 built with same version of Go? For the former, there are some growth over time, as we added more kinds of tables including e.g. inline tables, stack objects, and register maps, as mentioned in #27266. For the latter, it seems growing linearly, at least from the table above. |
As @thanm mentioned, @jeremyfaller has been looking at this problem recently and has done some preliminary analysis. He'll be back Thursday. He was going to prototype splitting the strings into two or three components (e.g., package+type+symbol), so each component can be deduped. I'm not sure what the current state of that is. As @cherrymui points out, the strings aren't a whole lot of the pclntab, but that also seems like low-hanging fruit. @jeremyfaller has also been looking at more typical binaries that have much longer package paths than cmd/compile, and hence more string data. @cherrymui, what tool did you use to generate that breakdown? It would be good to also see the further breakdown of "pcdata", since there are several tables in that. An important misconception in the blog post is that pclntab only contains traceback symbolization data. It also contains a lot of critical data for the garbage collector. I did some experiments a while back with compressing stack maps. At the time, I noted that Huffman-coding the PCDATA delta stream roughly halved the size of the stack maps. However, that depended on the linker generating an optimal Huffman table for the whole binary, which would increase link time. Here's the code for those experiments, though it's probably bitrotted. |
@aclements I just added print statements in the linker. I can do that for the pcdata tables. |
As @thanm mentions, I was taking a look at starting to better encode pclntab. His tradeoffs are accurate -- trading some read-only-mappable memory for BSS. I was pulled to thinking about it because of internal Google binaries, where path names are quite long, and some experimentation showed that as much as 50% of pcln was used for strings. I started a design encoding paths by stripping common roots, and for function names, stripping module and type names. It's still early days, but I thought we'd see reductions in binary size > 5%. As @cherrymui points out, that gain will be minimal for the majority of users. Right now, I was writing that document, and a change to cmd/objdump to generate stats to help us figure out where we need to focus our efforts. |
The title of this bug (as well as the blog post) claims that pclntab grows super-linearly in the number of functions in the binary, but it's completely unclear to me what that claim is based on. It seems that they're varying both the Cockroach version and the Go version, so not only did Cockroach get bigger, but we also added more pclntab tables for the garbage collector since Go 1.2. That aside, the table in the blog post that compares Cockroach in 2017 to 2019 has exactly two data points in it, with the rest "projected", so I don't know how you get any non-linear projection from that. The pclntab is large, a significant fraction of Go binaries, and its size is starting to cause issues for cutting-edge users, so this is worth working on. I just don't understand the basis for the super-linear growth claim. |
@aclements here is a breakdown of the pcdata:
The register maps are the most of them. |
Great. I'm planning on getting rid of those. :) Of all the pc tables, the pcline is actually the biggest. I wonder if that's because of instruction interleaving and inlining. I bet that's also wasting a lot of bits writing down very small deltas. Since we only use that table for symbolization, it doesn't have to be extremely efficient to decode, so we could imagine using a more compact encoding of it. |
That's what I was hinting at with a "nybble-based varint encoding" above. In case it helps, here are my notes on this topic, from Feb 2016 (some things may have changed since then!):
|
Here's some evidence against that claim. I ran these commands in my
I dropped the results in a spreadsheet and plotted "% of binary that is pclntab" (y-axis) vs "binary size" (x-axis). Results: Looks pretty non-superlinear to me. So now we can get back to the task of shrinking it. :) |
Thanks for sending your notes on that. I was thinking of more compact but harder-to-decode formats for pcline in particular, such as Huffman coding (probably with a fixed table we generate from some corpus; maybe a per-package table generated by the compiler) or possibly Golomb-Rice coding (since pcline in particular is probably roughly geometrically distributed, though this wouldn't extend well to other tables).
Awesome. Thanks for running that experiment! |
One interesting point mentioned on ycombinator is that there are platform-specific ways in which the pclntab could be stored, such as in DWARF format on ELF based OS, or, possibly externally, in .pdb files on Windows. In case this information is not available because it was stripped, the pclntab information needed could be calculated at run time. I don't know how feasible this idea would be for go, though, but I consider it worth mentioning. |
If I understand correctly, the dead code elimination pass of the linker is quite conservative about removing methods implementing interfaces. It keeps the methods if their name matches any method of an interface. If the average per-function overhead of the GC metadata is low (like ~5% ), it's probably worth investigating how those redundant symbols would cost. I imagine it helps for projects like GUI toolkits where big interfaces are implemented by many widgets, while only a small set would actually be used by the user application. |
So, I finally got a chance to put together a CL (cl/213830) that breaks pclntab up, looking at the sizes of each individual piece. It slightly overcounts (as there's deduplication in most of the pcln table in the linker). Sizes (percentage of pcln) for each section of all go tools for darwin are:
As previously suspected, the filenames aren't nearly the problem in typical code they've been seen to be in some Google binaries. From this data, and an offline discussion with @aclements, it seems to me that the course of action should be:
|
It's also required to reduce the size of a binary even in a compressed state. For example sometimes it's required to add an application to an xz-compressed initrd on a very small storage (like 8MiB). So in my extremely humble opinion it would be more useful if the first there will be considered ways to reduce data instead of compressing it. Moreover sometimes performance is not important at all and size is very important. It would be nice to have an ability to choose the priorities (like |
You probably know this, but one sticking point with these ideas is that there is public API that exposes some of these things as strings. For example, |
@josharian Yeah. @thanm mentioned it above, and the idea was to decode these strings into a space allocated in BSSNOPTR. It should be a relatively simple matter of preallocating the BSSNOPTR space at load time, and copying them out when they're used. The problem (again as Than pointed out) is that this decreases binary size at the expense of BSS. (Probably not an issue for most binaries, but still a tradeoff we'd make.) |
One group of people who really want small binaries are Go developers targetting iOS. In certain iOS contexts (Network Extensions, Notification Extensions at least), the OS limits you to 15 MiB of total disk + memory. (That is, your binary is pinned into memory and can't page in from flash). So small binaries are critical to having enough memory left over to actually use. So any optimizations that reduce binary size don't help (or might hurt!) if they require too much runtime memory allocation to decompress data. |
FWIW, I see in one binary here that 10.28% of the bytes of func name strings are bytes found in the prefix of the lexicographically following func name. (sometimes funcs are named prefixes of each other...
So with (ptr/offset, len) encoding, in this binary with 19378 redundant bytes of func prefixes, and with 5878 funcs, we could only win (barely) with a uint16 max string length. 19378 - (5878 * 2 byte uint16) = 7,622 bytes of savings. Likely not worth it, at least with that representation. But encoding closure names differently would help a lot:
(Btw, I'm using https://github.com/bradfitz/shotizam to analyze Go binary sizes, slurping their pcln tables into sqlite) |
That’s easy enough to implement, if we can settle on an encoding. The current one has the virtue of making it easy to know where to find the closure; I’ve used that a fair amount. Want to suggest an alternative? (A counter per package?) |
Sorry, I meant if we went with something like ropes (mentioned above) or were otherwise fine allocating at runtime for https://golang.org/pkg/runtime/#Func.Name then we could encode the same user-visible name in a much more compact form. |
I've just reached the limit of 1 MiB Flash in STM32L476RG. pclntab is 345 KiB (34%) $ nm --size-sort -S shell.elf|tail -5 In my case I can't trade any RAM for smaller pclntab. The above STM32L476RG MCU has only 128 KiB of RAM. The most popular STM32F405/7 has "huge" 192 KiB RAM. |
@embeddedgo Would it be possible to add a compiler flag that would generate a separate registry file, and then have pclntab only contain references into the registry file? In other words, it would not contain any kinds of strings. It would make interpreting log output harder, as it would require looking up log entries in the registry. In the case of IoT and other embedded scenarios, I would argue that could be well worth the reduction in binary size, however. |
I'm working on merging go1.16 changes to Embedded Go which is more about porting my noos/thumb and noos/riscv64 targets to this new Go version. Thre are a lot of chanegs in 1.16 pclntab code. One of my use cases relies on smaller pclntab so I need to preserve this feature. I'll keep your proposal in mind while working. |
Here's the 2021 update for the original link: https://dr-knz.net/go-executable-size-visualization-with-d3-2021.html In summary, the bytes previously in |
An update on the article that originated this issue was posted recently: https://www.cockroachlabs.com/blog/go-file-size-update/ Archive link: https://web.archive.org/web/20210414213416/https://www.cockroachlabs.com/blog/go-file-size-update/ |
This followup article is generating more heat than light. Locking this issue to collaborators only. |
This article contains results from applying a visualization tool:
https://www.cockroachlabs.com/blog/go-file-size/
If the analysis is correct, pclntab needs a major redesign.
Separated from #6853 to keep the conversation focused.
The text was updated successfully, but these errors were encountered: