core::fmt machinery optimized for size #18
Comments
Maybe it's good to (also) write guidelines recommending people avoid |
What would the guideline suggest to use instead of core::fmt? Other than don't use formatting at all. |
Exactly "don't use formatting at all." For example, when a function returns an error, don't avoid using |
I would be interested in seeing the equivalent results for C and C++
(printf vs std::cout) for the same target before I worried unduly.
|
Sounds like a good, general (as in, not embedded specific) recommendation to me.
I guess my worry is most types will implement
I want run time to be compared as well, and I want to see these numbers for an 8-bit (e.g. AVR) micro. AFAICT, the current implementation does almost no inlining because the core::fmt implementation uses trait objects (a lot?), so there are a bunch of fmt-related functions in the final binary and all those function calls affect the run time. |
In my case:
So I think 2k bytes of code is not really a big deal compared to the 20k needed for floating point. And people are OK with 30k+ binaries in the C world (apparently). The thing I'm worried about is the fact that Rust does a 1 KB memset each time you print a float: |
Is this (a) optimized and (b) the increase for formatting an integer? or is the increase the same regardless of the kind of formatting one does?
The float formatting code is pretty hardcoded. It prints floats with "full" precision AIUI. I wonder if we could have some sort of "pluggable" formatting where one can bring their own formatting code that makes different trade-offs between code size and precision. And by pluggable, I mean that you can recycle the deriving infrastructure for your formatting traits, i.e. the existing |
Yes, it is always the same, because libc comes precompiled with gcc toolchain, so it does not matter if I use optimizations or integers instead of floats. I even tried |
Not only is it a 1K memset, it's 1K of stack space that's required. On embedded targets wasting 1K of stack space like this is atrocious. |
In the embedded world, formatting is well known to require a large amount of code. See for example section 3.15.2.1 of http://sdcc.sourceforge.net/doc/sdccman.pdf. I've seen tables like those for other compilers but I can't seem to find them right now.
Is that amount actually necessary? The maximum length needed to represent an f64 is approximately "-1.4503599627370496e-308".len()==24 |
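The 24-byte estimate above can be checked on a host machine. This is only a minimal sketch: `FixedBuf` and `sci_len` are hypothetical names, and the code still goes through `core::fmt`, so it measures the length of the output, not the stack space the formatter uses internally.

```rust
use core::fmt::{self, Write};

/// Hypothetical fixed-capacity sink; not part of `core`.
struct FixedBuf<const N: usize> {
    bytes: [u8; N],
    len: usize,
}

impl<const N: usize> FixedBuf<N> {
    fn new() -> Self {
        FixedBuf { bytes: [0; N], len: 0 }
    }
}

impl<const N: usize> Write for FixedBuf<N> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        let end = self.len + s.len();
        if end > N {
            return Err(fmt::Error); // the text would overflow the buffer
        }
        self.bytes[self.len..end].copy_from_slice(s.as_bytes());
        self.len = end;
        Ok(())
    }
}

/// Formats `value` in scientific notation into a 24-byte buffer.
/// Returns the number of bytes used, or `None` if the text did not fit.
fn sci_len(value: f64) -> Option<usize> {
    let mut buf = FixedBuf::<24>::new();
    write!(buf, "{:e}", value).ok()?;
    Some(buf.len)
}
```

Calling `sci_len` with values such as `f64::MAX` or `-f64::MIN_POSITIVE` shows whether the shortest round-trip representation fits in 24 bytes.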
Maybe I'm late in the discussion but I'm the initial author of flt2dec formatting code. I think the stack space can be reduced as long as the formatting itself requests only that amount of digits. In other words, the stack usage for
I don't think that the stack usage can be reduced to some hundred bytes, however, precisely because in the worst case (at about 0.x% probability) fp formatting requires seven bignums, each weighing 160 bytes. Three of them can be discarded at the expense of performance, but the remaining four cannot, so 640 bytes + output buffer is the absolute minimum.
If you have to optimize past this point your only bet is to use Grisu2, which is correct (the output will remap to the original value when parsed) and shortest (there is no other shorter output that is correct) but not closest (there may be other outputs of the same length closer to the original value). To this end dtoa might be desirable; Serde uses it for fast fp formatting. |
For formatting of integers, I recommend numtoa; it generates fast and small code IME. You'll have to put together different buffers into a human readable message though. Another option is just to send everything using a binary format. The receiver end will have to parse the binary stuff and format it properly for human consumption. This uses less resources on the device side. These days I use byteorder for binary serialization but I wish we had better options in the ecosystem. |
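To illustrate the two options in the comment above, here is a minimal hand-rolled sketch. It does not use or reproduce the numtoa or byteorder APIs; `u32_to_dec` and `u32_to_wire` are hypothetical helper names, and the code relies only on the standard `to_le_bytes` method.

```rust
/// Writes `n` in decimal into the tail of `buf` and returns the used part.
/// A `u32` has at most 10 decimal digits, so 10 bytes always suffice.
fn u32_to_dec(mut n: u32, buf: &mut [u8; 10]) -> &[u8] {
    let mut i = buf.len();
    loop {
        i -= 1;
        buf[i] = b'0' + (n % 10) as u8; // least significant digit first
        n /= 10;
        if n == 0 {
            break;
        }
    }
    &buf[i..]
}

/// Binary alternative: ship the raw little-endian bytes and let the
/// receiving end do the human-readable formatting.
fn u32_to_wire(n: u32) -> [u8; 4] {
    n.to_le_bytes()
}
```

The decimal routine returns ASCII bytes rather than a `&str`, which avoids a UTF-8 check and an `unwrap` on the device side.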
Indeed, I think a lot of the use cases for formatting could be handled by a procedural macro that replaced uses of
let x: u32 = 0;
let y: u32 = 0;
write!(&mut out, "variable x is {}", x);
write!(&mut out, "variable y is {}", y);
could be replaced by
write!(&mut out, "1");
write!(&mut out, &x as &[u8; 4]);
write!(&mut out, "2");
write!(&mut out, &y as &[u8; 4]);
And then a document describing that message "1" is "variable x is {}" followed by a little-endian u32 (assuming the code is compiled for a little endian CPU). A program could use that document to translate the output into the desired human-readable output. All the formatting and the storage of the constant strings would be done on some other computer. I think something like this could make for a pretty slick low-overhead method of debug printing. |
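The idea in the comment above could be sketched roughly as follows. Everything here is an assumption made for illustration: the 5-byte frame layout, the numeric ids 0 and 1 (instead of the strings "1" and "2"), and the names `TEMPLATES`, `encode`, and `decode`. Using `to_le_bytes` pins the byte order regardless of the CPU.

```rust
/// Hypothetical "document" shared with the host out of band:
/// index = message id, value = the text that precedes the number.
const TEMPLATES: [&str; 2] = ["variable x is ", "variable y is "];

/// Device side: one id byte followed by the value as little-endian bytes.
/// No strings and no formatting code are needed here.
fn encode(id: u8, value: u32) -> [u8; 5] {
    let v = value.to_le_bytes();
    [id, v[0], v[1], v[2], v[3]]
}

/// Host side: look up the template and do the actual formatting.
/// Returns `None` if the id is not in the table.
fn decode(frame: &[u8; 5]) -> Option<String> {
    let template = TEMPLATES.get(frame[0] as usize)?;
    let value = u32::from_le_bytes([frame[1], frame[2], frame[3], frame[4]]);
    Some(format!("{}{}", template, value))
}
```

Only `encode` would be compiled into the firmware; `TEMPLATES` and `decode` live in the host tool.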
Yeah, that sounds like an autogenerated enum that sits in a crate common to both the device program and the host side. Something like this:
// common code
enum Message {
X(u32),
Y(u32),
}
impl Message {
fn deserialize(buffer: &[u8; 5]) -> Result<Self> { .. }
fn serialize(&self, buffer: &mut [u8; 5]) { .. }
}
impl fmt::Display for Message { .. }
// device side
Message::X(42).serialize(&mut buffer);
sink.write_all(&buffer); // sends [0u8, 42u32] through the wire
Message::Y(24).serialize(&mut buffer);
sink.write_all(&buffer); // sends [1u8, 24u32] through the wire
// host side
let message = Message::deserialize(&buffer).unwrap();
println!("{}", message); // prints "variable x is 42"
// ..
let message = Message::deserialize(&buffer).unwrap();
println!("{}", message); // prints "variable y is 24"
Now we just need a macro wizard to implement the procedural macro (or macro 1.1) that takes care of all the boilerplate. |
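For reference, the elided method bodies in the outline above could be written by hand like this. It is a sketch under assumptions, not the output of any existing macro: the tag values 0 and 1, the little-endian payload, and the use of `Option` instead of `Result` for `deserialize` are choices made here for brevity.

```rust
use std::fmt;

enum Message {
    X(u32),
    Y(u32),
}

impl Message {
    /// Writes a tag byte followed by the payload as little-endian bytes.
    fn serialize(&self, buffer: &mut [u8; 5]) {
        let (tag, value) = match *self {
            Message::X(v) => (0u8, v),
            Message::Y(v) => (1u8, v),
        };
        buffer[0] = tag;
        buffer[1..].copy_from_slice(&value.to_le_bytes());
    }

    /// Returns `None` for an unknown tag byte.
    fn deserialize(buffer: &[u8; 5]) -> Option<Self> {
        let value = u32::from_le_bytes([buffer[1], buffer[2], buffer[3], buffer[4]]);
        match buffer[0] {
            0 => Some(Message::X(value)),
            1 => Some(Message::Y(value)),
            _ => None,
        }
    }
}

// Only the host needs this impl; the device never formats text.
impl fmt::Display for Message {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Message::X(v) => write!(f, "variable x is {}", v),
            Message::Y(v) => write!(f, "variable y is {}", v),
        }
    }
}
```

In a shared crate, the `Display` impl would probably sit behind a host-only feature so that it never pulls `core::fmt` into the firmware.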
An alternative to |
@japaric I think your numbers are not quite up-to-date anymore. The overhead keeps growing and growing. A simple
into a dummy Buffer creates a 2744 bytes .text section on thumbv6m nowadays. Just yesterday's changes caused the code generated for Yet the compiler/linker leaves plenty of dead code hanging around, like And then there's another issue. The code is so fat that using fmt (or fast_fmt or just numtoa) lets me constantly bump into this:
|
Is there any recognized way around this? Even when eliminating Debug from my code, I still get a lot of bloat from the various assertions/messages in slice and other places throughout the library. I wish there was a way to conditionally compile them out for release builds. It's hard to fit things into 16K :) |
@jonwingfield Not sure if this addresses your issue, but @dhylands wrote a related crate awhile back:
|
Relevant libcore PR: Cut down on number formatting code size. |
Noting that this is still an issue, but tagging @diondokter to possibly link to more relevant upstream tracking issues. We may decide to close this in favor of just tracking the upstream tracking issue. |
Yes, so the major thing happening right now is the addition of the In lots of places the std does this:
In these places it's really easy to cfg out the fast algo and get some savings.
In any case, there is a tracking issue in the Rust repo: rust-lang/rust#125612
Note though that this flag can't solve all fmt woes. It doesn't allow us to change the public API of std, so the base design has to stay the same. We can now control the algorithms used, but we can't change to a pluggable fmt API.
I recommend we close this issue. Anything that would be discussed here should probably move to the linked tracking issue. |
Closing as part of the 2024 triage - further discussion to be had at rust-lang/rust#125612 |
This
generates a .text section of 94 bytes
This
generates a .text section of 314 bytes
But this
generates a .text section of 2340 (!) bytes
Everything (including core) was compiled with opt-level=3 + LTO + panic=abort. FWIW, opt-levels s and z make little difference.
The question is: Can we improve binary sizes? (and run time?)
Steps To Reproduce
Then tweak examples/minimal.rs accordingly.
Observations
iprintln definition
write_str implementation
Looking at the disassembly of print-with-fmt.rs, it seems that the compiler is not inlining anything inside core::fmt. The probable cause is that the fmt::Write trait makes use of trait objects in its implementation.
Possible solutions
Reduce the number of panic!s and unwraps in the core::fmt implementation.
Create an alternative core::fmt implementation that uses generics instead of trait objects to incentivize inlining and better optimizations.
Create an alternative core::fmt implementation that provides fewer formatting options.
The last two solutions will require a new fmt::Write trait.
Meta
Tagging with the community label as this may require an out of tree re-implementation of the fmt::Write trait that could end up living in crates.io.
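As a rough illustration of the second possible solution (generics instead of trait objects), a replacement trait pair might look like the sketch below. `SmallWrite` and `SmallDisplay` are hypothetical names invented here, not an existing API, and the `String` sink is only a host-side stand-in for a real device sink.

```rust
/// Hypothetical size-oriented replacement for `fmt::Write`.
trait SmallWrite {
    fn write_str(&mut self, s: &str);
}

/// Hypothetical replacement for `fmt::Display`. The method is generic over
/// the sink, so every call is statically dispatched and can be inlined.
trait SmallDisplay {
    fn fmt<W: SmallWrite>(&self, w: &mut W);
}

impl SmallDisplay for str {
    fn fmt<W: SmallWrite>(&self, w: &mut W) {
        w.write_str(self);
    }
}

impl SmallDisplay for u32 {
    fn fmt<W: SmallWrite>(&self, w: &mut W) {
        // A `u32` has at most 10 decimal digits.
        let mut buf = [0u8; 10];
        let mut i = buf.len();
        let mut n = *self;
        loop {
            i -= 1;
            buf[i] = b'0' + (n % 10) as u8;
            n /= 10;
            if n == 0 {
                break;
            }
        }
        // Only ASCII digits were written, so the conversion succeeds.
        if let Ok(s) = core::str::from_utf8(&buf[i..]) {
            w.write_str(s);
        }
    }
}

/// Host-side sink used only to exercise the traits.
impl SmallWrite for String {
    fn write_str(&mut self, s: &str) {
        self.push_str(s);
    }
}
```

The trade-off is the usual one for monomorphization: each sink type gets its own copy of the formatting code, which helps inlining but can grow the binary if many sink types are used.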