Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] CORE write() formatting control #34

Closed
Atmelfan opened this issue Jan 11, 2020 · 16 comments
Closed

[FEATURE] CORE write() formatting control #34

Atmelfan opened this issue Jan 11, 2020 · 16 comments
Assignees
Labels
enhancement New feature or request

Comments

@Atmelfan
Copy link

Problem

I'm trying to implement a protocol which does not accept the scientific format in all places. It would be useful to control if the decimal output is written in normal or scientific format.

The number of significant digits would also be nice to have some degree of control over. Rounding the number to the desired before doesn't help if the rounded value isn't representable (example 1.2f32 -> 1.2000000476837158).

Solution

A extra write function which can take formatting hints, possibly write_format(n, format, significant_digits, bytes) where:

  • n - Value to be written
  • format - An enum for desired format
  • significant_digits - A usize of maximum number of significant digits, 0 could mean "Don't care".
  • bytes - Output buffer

Prerequisites

If applicable to the feature request, here are a few things you should provide to help me understand the issue:

  • Rust version : rustc 1.39.0-nightly (4295eea90 2019-08-30)
  • lexical version : 0.6.2
  • lexical compilation features used: lexical-core: features=["radix"], default-features=false

Alternatives

  • String manipulation - Expensive and buggy, needs to check the result to figure out how to cut it up.
@Atmelfan Atmelfan added the enhancement New feature or request label Jan 11, 2020
@Alexhuszagh
Copy link
Owner

This would be tricky, since we currently defer float formatting to 1 of 3 different backends, user configurable.

Only the unoptimized Grisu algorithm I would have any control over, so it would need to be implemented there, which is considerably slower than Ryu (~2x) and relatively a bit slower than dtoa.

However, this would certainly be doable, at least in the internal Grisu algorithm, since the generation of digits and their emission to the output buffer are done in subsequent steps. It may be faster to contact the maintainer of ryu and dtoa to ask for a similar feature, since this may be a feature required upstream for optimal performance.

There are, however, a few assumptions I would likely need to modify in that case (these are all fairly trivial):
1). The maximum number of digits a float may write is current hardcoded at 64 (without radix), or 256 (with radix), which without scientific notation may increase dramatically. If we control the number of significant digits and the presence or absence of scientific notation, that may change drastically (up to 325 digits without the radix feature, and up to 1075 digits with the radix feature), and would change depending on the user-provided options.

I may consider forking ryu and seeing if implementing this is possible without negatively impacting performance, since this might be the ideal approach.

@Atmelfan
Copy link
Author

Atmelfan commented Jan 11, 2020

Is it possible to estimate the size reasonably before conversion? If so the return type could be a result, returning an error if the buffer is insufficient and leave dealing with that for the user.

@Alexhuszagh
Copy link
Owner

Alexhuszagh commented Jan 11, 2020

@Atmelfan Not really because of radix changes, although with the parameters it would be easy to estimate the upper bound on the number of digits.

If we were serializing a binary number to hex, it would be easy to estimate the number of digits because all values in binary are representible in hex. However, quickly estimating the number of digits a binary number might require isn't so simple, and is complicated by more modern serializing algorithms (Dragon4, Ryu, Grisu3, etc.), which all aim to efficiently calculate the minimum number of digits required to accurately represent a number in decimal. I'm only going to mention Grisu here, because I haven't implemented Dragon4 or Ryu, and therefore don't know the internal logic of it (and might get something wrong).

For Grisu, it first normalizes the floating point number (in an extended-representation, for stability, where a 64-bit integer is used for the significant digits or mantissa, and 16-bits is used for the binary exponent), and then calculates the upper and lower boundary for the value. You can safely approximate this with the following Python code (requires NumPy), using a well-known boundary condition (0.1):

Please note the below code is a way of approximating the process logically, it doesn't actually reflect any real Grisu code, but aims to clearly explain what the code is doing and finding the boundary on values it can and cannot represent.

Example

import numpy as np

ONE = np.uint64(1)
SIGN_MASK = np.uint64(0x8000000000000000)
INFINITY_BITS = np.uint64(0x7FF0000000000000)
NEGATIVE_INFINITY_BITS = INFINITY_BITS | SIGN_MASK
ZERO = np.float64(0.0)
# Maximum number of digits in decimal that can
# contribute to a number.
# This is calculated via:
#   `−emin + p2 + ⌊(emin + 1) log(2, b) − log(1 − 2^(−p2), b)⌋`
#       Where:
#           b = radix
#           emin = -1022
#           p2 = 53
MAX_DIGITS = 768

def to_bits(double):
    '''Get float bits as an unsigned, 64-bit integer.'''
    return np.frombuffer(double.tobytes(), dtype=np.uint64)[0]

def from_bits(bits):
    '''Create double from bit representation as an integer.'''
    return np.frombuffer(bits.tobytes(), dtype=np.float64)[0]

def is_sign_negative(double):
    '''Determine if the sign bit is negative.'''
    return to_bits(double) & SIGN_MASK != 0

def is_sign_positive(double):
    '''Determine if the sign bit is positive.'''
    return not is_sign_negative(double)

def next_double(double):
    '''Get the next greater float.'''

    bits = to_bits(double)
    if is_sign_negative(double) and double == ZERO:
        # -0.0
        return ZERO
    elif bits == INFINITY_BITS:
        return from_bits(INFINITY_BITS)
    elif is_sign_negative(double):
        return from_bits(bits - ONE)
    else:
        return from_bits(bits + ONE)


def previous_double(double):
    '''Get the previous greater float.'''

    bits = to_bits(double)
    if is_sign_positive(double) and double == ZERO:
        # +0.0
        return -ZERO
    elif bits == NEGATIVE_INFINITY_BITS:
        return from_bits(NEGATIVE_INFINITY_BITS)
    elif is_sign_negative(double):
        return from_bits(bits + ONE)
    else:
        return from_bits(bits - ONE)

def format_double(double):
    '''Format double with all possibly contributing significant digits.'''
    return '{{:0.{}f}}'.format(MAX_DIGITS).format(double)

halfway = np.float64(0.1)
next = next_double(halfway)
previous = previous_double(halfway)
print('0.1={}'.format(format_double(halfway)))
print('next(0.1)={}'.format(format_double(next)))
print('previous(0.1)={}'.format(format_double(previous)))

Output

0.1=0.100000000000000005551115123125782702118158340454101562500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
next(0.1)=0.100000000000000019428902930940239457413554191589355468750000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
previous(0.1)=0.099999999999999991673327315311325946822762489318847656250000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

As you can see, the boundary shows that it cannot differentiate between 0.1 and the actual stored value of x, so the Grisu algorithm emits 0.1. In order to generate these digits for the upper and lower bound, it must effectively divide by powers of 10 (although computationally it uses tricks that are much faster).

In short, I don't believe there's any short circuit possible, at least not easily. However, it should be doable if the calculation of the number of significant digits is integrated into the actual formatting routine, but would have to be tightly coupled to it.

Does that make more sense @Atmelfan, I understand this is a lot more complicated than it may initially seem.

@Alexhuszagh
Copy link
Owner

That said, it's very easy to estimate the upper bound on the number of digits written before conversion:

1). Number of significant digits.
2). Max size of the exponent (either number of digits in the maximum or minimum exponent, for example, '-324' would be 5, or the number of digits required to write out that exponent, for example, '-324' would be 325)

This would then create a trivial max bound on the number of digits very easily.

@MartinKavik
Copy link

Hi @Alexhuszagh!
I think your library would be very useful for Wasm, especially for Wasm in browsers.

I'm writing a Rust front-end framework and I've tried many float-to-string libraries and only lexical produces a reasonable binary size.

Optimized Wasm file of an example app with one f64 -> String conversion:

  • std: 101 KB
  • lexical: 87 KB
  • other libs: cca 100 - 500 KB

However I can't use it now because of missing features in formatting / write API. The API doesn't have to be super-nice because I can write a higher-level wrapper for it.
I would like to write an alternative for cases like

format!("{:.2}", num)

I've found a newer API:

lexical::to_string_with_options(num, &options)

and it looks like you are just working on it. So.. is there chance that support for these types of formatting corresponds with your roadmap?

Thanks!

P.S.
The smallest binary is created with

lexical = { git = "...rust-lexical", rev = "...", features = ["std"], default_features = false }

and the biggest with Ryu (96 KB). So I think only the internal / Grisu backend is interesting for Wasm / me.

@Alexhuszagh
Copy link
Owner

Alexhuszagh commented May 5, 2021

@MartinKavik This is low priority for the immediate short term, but after this week I'm planning on working on this. Controlling the number of digits should be very doable, especially since we control write formatting with ndigits (see the following code sample). With this, it would be trivial to pass a max number of written digits (see the code blocks below), Currently, ndigits is just the number of digits produced by the Grisu algorithm, however, this could easily be modified to allow more or less digits be added. If the desired_digits > ndigits, simply pad zeros to the end of the string. If desired_digits < ndigits then we could check if digits[desired_digits+1] >= b'5', and then if so, round up, and emit the digits. If not, just emit the digits.

This, of course, would prevent round-trips for truncated values, but if you're using shorter formatting control, I doubt this is of any relevance.

let offset = offset.as_usize();
// These are all safe, since digits.len() >= ndigits, and
// dest.len() >= ndigits+1, so the range must be valid.
copy_to_dst(dest, &digits[..offset]);
dest[offset] = decimal_point;
copy_to_dst(&mut dest[offset + 1..], &digits[offset..ndigits]);

Just an FYI that I will be slightly busy for the next week on other projects, but this should be prioritized after that. Thanks for the excellent suggestion.

I will likely implement it, along with a few other features (custom decimal points, and exponent characters) as part of the options API, and only available for WriteFloatOptions if the ryu or dtoa backend is not selected (since neither of these backends support write control).

@MartinKavik
Copy link

@Alexhuszagh Thanks a lot and don't hurry with the implementation :)
My idea is to use lexical and ufmt as a foundation for better std::fmt-like functions. Both libs are hidden under a non-default feature flag in my framework until they are ready to replace std:fmt in most cases.

if the ryu or dtoa backend is not selected (since neither of these backends support write control).

I was playing a bit with the crate pretty_dtoa. It uses the crate ryu_floating_decimal - a forked ryu that allows to write formatting functions. Just FYI.

@Alexhuszagh
Copy link
Owner

Alexhuszagh commented May 10, 2021

@MartinKavik quick question, can you think of any other format options that would be appealing other than:

  1. Max significant digits (round after, must be >= than min significant digits)
  2. Min significant digits (pad with 0s after, must be <= max significant digits)
  3. Decimal point (already supported)
  4. Exponent character (already supported)
  5. No exponent notation
  6. Trimming floats ("1.0" becomes "1", already supported).
  7. Forced scientific format.
  8. Forced mantissa sign.

I'm currently drafting initial versions of this functionality, so I'd be interested in feedback, particularly around your use-case.

@Atmelfan
Copy link
Author

Atmelfan commented May 10, 2021

I'm not @MartinKavik, but forced scientific format and sign character is common in test equipment.

@Alexhuszagh
Copy link
Owner

Excellent, thank you. Adding both to the list.

@MartinKavik
Copy link

@MartinKavik quick question, can you think of any other format options that would be appealing other than

  • It looks like the only std float-related formatting tool is precision.
  • For regular web apps, No exponent notation + Max/Min significant digits + Decimal point should be enough - e.g. you want to show products with their price in a table - $123.45 or 123,45 Kč (a Czech eshop).
  • I can imagine rounding to the nearest integer or up/down to the nearest integer in combination with Trimming floats could be useful, too.
  • @Alexhuszagh Maybe you can look at pretty_dtoa's list of all options to find out if there isn't something interesting for you because I have a pretty limited knowledge about floats to predict what other people need.

@Alexhuszagh
Copy link
Owner

Thanks, that's given me a few suggestions. There's also a few anti-features I won't be including, such as:

  • ignore_extremes: This should already be handled by the digit/mantissa-generating algorithm.
  • max_width: This is already covered by max significant digits, and could lead to erroneous results.
  • max_decimal_digits: Once again, handled by max significant digits. This should be different. "1.23456e5" should be any different than "123456.78901", but in this case they very much are.
  • min_decimal_digits: Same reason as above, should be handled by min_significant_digits.

But the round_mode, and the exponent break points are both excellent ideas.

@Alexhuszagh
Copy link
Owner

Alexhuszagh commented Aug 7, 2021

A tentative implementation has been done here, which supports all the features aforementioned, including a few others.

It's scheduled for the 0.8 release, which focuses on better format control, optimizing algorithms, faster compile times, and fallbacks for very compact implementations. A few modifications should significantly reduce the size of the resulting binaries here:

  1. Inferring exponents rather than explicitly storing them.
  2. Less inlining.

The static tables have therefore been reduces from 1.4KB to 700 bytes, since we no longer need to store the exponents (and the padding), reducing the size of each element from 16 bytes to 8 bytes (a u64).

@Alexhuszagh
Copy link
Owner

Alexhuszagh commented Sep 1, 2021

Ok I've done the actual size calculations prior to release, and the compact implementation seems to be ~9x smaller on x86_64 Linux using opt-level=s. Note that the Y-axis is a log scale.

The measurement may be inexact, but is compared to an empty file with various tricks to ensure code is not optimized out, and then the .text, .rodata, and .data segment sizes are read directly to measure the increase in binary size. I don't really know of a more exact way.

Anyway, I'm planning for a release early next week.

Write Float

@Alexhuszagh
Copy link
Owner

Implemented as of lexical v6.0.0 and lexical-core v0.8.0. Please use the compact and the Options API to allow format precision control and enable compact binary sizes.

@matteosantama
Copy link

min_decimal_digits: Same reason as above, should be handled by min_significant_digits

@Alexhuszagh Do you mind elaborating on this? Suppose I have a LONG list of floats of unknown magnitude and I would like them all formatted with exactly 3 decimal places. How can I use min_significant_digits/max_significant_digits to accomplish this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants