[FEATURE] CORE write() formatting control #34

Atmelfan · 2020-01-11T17:17:13Z

Problem

I'm trying to implement a protocol which does not accept the scientific format in all places. It would be useful to control if the decimal output is written in normal or scientific format.

The number of significant digits would also be nice to have some degree of control over. Rounding the number to the desired before doesn't help if the rounded value isn't representable (example 1.2f32 -> 1.2000000476837158).

Solution

A extra write function which can take formatting hints, possibly write_format(n, format, significant_digits, bytes) where:

n - Value to be written
format - An enum for desired format
significant_digits - A usize of maximum number of significant digits, 0 could mean "Don't care".
bytes - Output buffer

Prerequisites

If applicable to the feature request, here are a few things you should provide to help me understand the issue:

Rust version : rustc 1.39.0-nightly (4295eea90 2019-08-30)
lexical version : 0.6.2
lexical compilation features used: lexical-core: features=["radix"], default-features=false

Alternatives

String manipulation - Expensive and buggy, needs to check the result to figure out how to cut it up.

The text was updated successfully, but these errors were encountered:

Alexhuszagh · 2020-01-11T18:15:35Z

This would be tricky, since we currently defer float formatting to 1 of 3 different backends, user configurable.

Ryu (default)
Grisu3
Grisu (Internal)

Only the unoptimized Grisu algorithm I would have any control over, so it would need to be implemented there, which is considerably slower than Ryu (~2x) and relatively a bit slower than dtoa.

However, this would certainly be doable, at least in the internal Grisu algorithm, since the generation of digits and their emission to the output buffer are done in subsequent steps. It may be faster to contact the maintainer of ryu and dtoa to ask for a similar feature, since this may be a feature required upstream for optimal performance.

There are, however, a few assumptions I would likely need to modify in that case (these are all fairly trivial):
1). The maximum number of digits a float may write is current hardcoded at 64 (without radix), or 256 (with radix), which without scientific notation may increase dramatically. If we control the number of significant digits and the presence or absence of scientific notation, that may change drastically (up to 325 digits without the radix feature, and up to 1075 digits with the radix feature), and would change depending on the user-provided options.

I may consider forking ryu and seeing if implementing this is possible without negatively impacting performance, since this might be the ideal approach.

Atmelfan · 2020-01-11T18:34:24Z

Is it possible to estimate the size reasonably before conversion? If so the return type could be a result, returning an error if the buffer is insufficient and leave dealing with that for the user.

Alexhuszagh · 2020-01-11T19:20:01Z

@Atmelfan Not really because of radix changes, although with the parameters it would be easy to estimate the upper bound on the number of digits.

If we were serializing a binary number to hex, it would be easy to estimate the number of digits because all values in binary are representible in hex. However, quickly estimating the number of digits a binary number might require isn't so simple, and is complicated by more modern serializing algorithms (Dragon4, Ryu, Grisu3, etc.), which all aim to efficiently calculate the minimum number of digits required to accurately represent a number in decimal. I'm only going to mention Grisu here, because I haven't implemented Dragon4 or Ryu, and therefore don't know the internal logic of it (and might get something wrong).

For Grisu, it first normalizes the floating point number (in an extended-representation, for stability, where a 64-bit integer is used for the significant digits or mantissa, and 16-bits is used for the binary exponent), and then calculates the upper and lower boundary for the value. You can safely approximate this with the following Python code (requires NumPy), using a well-known boundary condition (0.1):

Please note the below code is a way of approximating the process logically, it doesn't actually reflect any real Grisu code, but aims to clearly explain what the code is doing and finding the boundary on values it can and cannot represent.

Example

import numpy as np

ONE = np.uint64(1)
SIGN_MASK = np.uint64(0x8000000000000000)
INFINITY_BITS = np.uint64(0x7FF0000000000000)
NEGATIVE_INFINITY_BITS = INFINITY_BITS | SIGN_MASK
ZERO = np.float64(0.0)
# Maximum number of digits in decimal that can
# contribute to a number.
# This is calculated via:
#   `−emin + p2 + ⌊(emin + 1) log(2, b) − log(1 − 2^(−p2), b)⌋`
#       Where:
#           b = radix
#           emin = -1022
#           p2 = 53
MAX_DIGITS = 768

def to_bits(double):
    '''Get float bits as an unsigned, 64-bit integer.'''
    return np.frombuffer(double.tobytes(), dtype=np.uint64)[0]

def from_bits(bits):
    '''Create double from bit representation as an integer.'''
    return np.frombuffer(bits.tobytes(), dtype=np.float64)[0]

def is_sign_negative(double):
    '''Determine if the sign bit is negative.'''
    return to_bits(double) & SIGN_MASK != 0

def is_sign_positive(double):
    '''Determine if the sign bit is positive.'''
    return not is_sign_negative(double)

def next_double(double):
    '''Get the next greater float.'''

    bits = to_bits(double)
    if is_sign_negative(double) and double == ZERO:
        # -0.0
        return ZERO
    elif bits == INFINITY_BITS:
        return from_bits(INFINITY_BITS)
    elif is_sign_negative(double):
        return from_bits(bits - ONE)
    else:
        return from_bits(bits + ONE)


def previous_double(double):
    '''Get the previous greater float.'''

    bits = to_bits(double)
    if is_sign_positive(double) and double == ZERO:
        # +0.0
        return -ZERO
    elif bits == NEGATIVE_INFINITY_BITS:
        return from_bits(NEGATIVE_INFINITY_BITS)
    elif is_sign_negative(double):
        return from_bits(bits + ONE)
    else:
        return from_bits(bits - ONE)

def format_double(double):
    '''Format double with all possibly contributing significant digits.'''
    return '{{:0.{}f}}'.format(MAX_DIGITS).format(double)

halfway = np.float64(0.1)
next = next_double(halfway)
previous = previous_double(halfway)
print('0.1={}'.format(format_double(halfway)))
print('next(0.1)={}'.format(format_double(next)))
print('previous(0.1)={}'.format(format_double(previous)))

Output

0.1=0.100000000000000005551115123125782702118158340454101562500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
next(0.1)=0.100000000000000019428902930940239457413554191589355468750000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
previous(0.1)=0.099999999999999991673327315311325946822762489318847656250000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

As you can see, the boundary shows that it cannot differentiate between 0.1 and the actual stored value of x, so the Grisu algorithm emits 0.1. In order to generate these digits for the upper and lower bound, it must effectively divide by powers of 10 (although computationally it uses tricks that are much faster).

In short, I don't believe there's any short circuit possible, at least not easily. However, it should be doable if the calculation of the number of significant digits is integrated into the actual formatting routine, but would have to be tightly coupled to it.

Does that make more sense @Atmelfan, I understand this is a lot more complicated than it may initially seem.

Alexhuszagh · 2020-01-11T21:49:13Z

That said, it's very easy to estimate the upper bound on the number of digits written before conversion:

1). Number of significant digits.
2). Max size of the exponent (either number of digits in the maximum or minimum exponent, for example, '-324' would be 5, or the number of digits required to write out that exponent, for example, '-324' would be 325)

This would then create a trivial max bound on the number of digits very easily.

MartinKavik · 2021-05-04T21:09:39Z

Hi @Alexhuszagh!
I think your library would be very useful for Wasm, especially for Wasm in browsers.

I'm writing a Rust front-end framework and I've tried many float-to-string libraries and only lexical produces a reasonable binary size.

Optimized Wasm file of an example app with one f64 -> String conversion:

std: 101 KB
lexical: 87 KB
other libs: cca 100 - 500 KB

However I can't use it now because of missing features in formatting / write API. The API doesn't have to be super-nice because I can write a higher-level wrapper for it.
I would like to write an alternative for cases like

format!("{:.2}", num)

I've found a newer API:

lexical::to_string_with_options(num, &options)

and it looks like you are just working on it. So.. is there chance that support for these types of formatting corresponds with your roadmap?

Thanks!

P.S.
The smallest binary is created with

lexical = { git = "...rust-lexical", rev = "...", features = ["std"], default_features = false }

and the biggest with Ryu (96 KB). So I think only the internal / Grisu backend is interesting for Wasm / me.

Alexhuszagh · 2021-05-05T23:08:25Z

@MartinKavik This is low priority for the immediate short term, but after this week I'm planning on working on this. Controlling the number of digits should be very doable, especially since we control write formatting with ndigits (see the following code sample). With this, it would be trivial to pass a max number of written digits (see the code blocks below), Currently, ndigits is just the number of digits produced by the Grisu algorithm, however, this could easily be modified to allow more or less digits be added. If the desired_digits > ndigits, simply pad zeros to the end of the string. If desired_digits < ndigits then we could check if digits[desired_digits+1] >= b'5', and then if so, round up, and emit the digits. If not, just emit the digits.

This, of course, would prevent round-trips for truncated values, but if you're using shorter formatting control, I doubt this is of any relevance.

rust-lexical/lexical-core/src/ftoa/grisu2.rs

Lines 641 to 646 in 4b136c1

    
           let offset = offset.as_usize(); 
        
           // These are all safe, since digits.len() >= ndigits, and 
        
           // dest.len() >= ndigits+1, so the range must be valid. 
        
           copy_to_dst(dest, &digits[..offset]); 
        
           dest[offset] = decimal_point; 
        
           copy_to_dst(&mut dest[offset + 1..], &digits[offset..ndigits]);

Just an FYI that I will be slightly busy for the next week on other projects, but this should be prioritized after that. Thanks for the excellent suggestion.

I will likely implement it, along with a few other features (custom decimal points, and exponent characters) as part of the options API, and only available for WriteFloatOptions if the ryu or dtoa backend is not selected (since neither of these backends support write control).

MartinKavik · 2021-05-07T09:54:48Z

@Alexhuszagh Thanks a lot and don't hurry with the implementation :)
My idea is to use lexical and ufmt as a foundation for better std::fmt-like functions. Both libs are hidden under a non-default feature flag in my framework until they are ready to replace std:fmt in most cases.

if the ryu or dtoa backend is not selected (since neither of these backends support write control).

I was playing a bit with the crate pretty_dtoa. It uses the crate ryu_floating_decimal - a forked ryu that allows to write formatting functions. Just FYI.

Alexhuszagh · 2021-05-10T14:22:00Z

@MartinKavik quick question, can you think of any other format options that would be appealing other than:

Max significant digits (round after, must be >= than min significant digits)
Min significant digits (pad with 0s after, must be <= max significant digits)
Decimal point (already supported)
Exponent character (already supported)
No exponent notation
Trimming floats ("1.0" becomes "1", already supported).
Forced scientific format.
Forced mantissa sign.

I'm currently drafting initial versions of this functionality, so I'd be interested in feedback, particularly around your use-case.

Atmelfan · 2021-05-10T15:21:04Z

I'm not @MartinKavik, but forced scientific format and sign character is common in test equipment.

Alexhuszagh · 2021-05-10T15:27:38Z

Excellent, thank you. Adding both to the list.

MartinKavik · 2021-05-11T10:10:21Z

@MartinKavik quick question, can you think of any other format options that would be appealing other than

It looks like the only std float-related formatting tool is precision.
For regular web apps, No exponent notation + Max/Min significant digits + Decimal point should be enough - e.g. you want to show products with their price in a table - $123.45 or 123,45 Kč (a Czech eshop).
I can imagine rounding to the nearest integer or up/down to the nearest integer in combination with Trimming floats could be useful, too.
@Alexhuszagh Maybe you can look at pretty_dtoa's list of all options to find out if there isn't something interesting for you because I have a pretty limited knowledge about floats to predict what other people need.

Alexhuszagh · 2021-05-11T17:18:38Z

Thanks, that's given me a few suggestions. There's also a few anti-features I won't be including, such as:

ignore_extremes: This should already be handled by the digit/mantissa-generating algorithm.
max_width: This is already covered by max significant digits, and could lead to erroneous results.
max_decimal_digits: Once again, handled by max significant digits. This should be different. "1.23456e5" should be any different than "123456.78901", but in this case they very much are.
min_decimal_digits: Same reason as above, should be handled by min_significant_digits.

But the round_mode, and the exponent break points are both excellent ideas.

Alexhuszagh · 2021-08-07T18:03:12Z

A tentative implementation has been done here, which supports all the features aforementioned, including a few others.

It's scheduled for the 0.8 release, which focuses on better format control, optimizing algorithms, faster compile times, and fallbacks for very compact implementations. A few modifications should significantly reduce the size of the resulting binaries here:

Inferring exponents rather than explicitly storing them.
Less inlining.

The static tables have therefore been reduces from 1.4KB to 700 bytes, since we no longer need to store the exponents (and the padding), reducing the size of each element from 16 bytes to 8 bytes (a u64).

Alexhuszagh · 2021-09-01T23:16:22Z

Ok I've done the actual size calculations prior to release, and the compact implementation seems to be ~9x smaller on x86_64 Linux using opt-level=s. Note that the Y-axis is a log scale.

The measurement may be inexact, but is compared to an empty file with various tricks to ensure code is not optimized out, and then the .text, .rodata, and .data segment sizes are read directly to measure the increase in binary size. I don't really know of a more exact way.

Anyway, I'm planning for a release early next week.

Alexhuszagh · 2021-09-03T05:17:31Z

Implemented as of lexical v6.0.0 and lexical-core v0.8.0. Please use the compact and the Options API to allow format precision control and enable compact binary sizes.

matteosantama · 2022-08-16T17:43:34Z

min_decimal_digits: Same reason as above, should be handled by min_significant_digits

@Alexhuszagh Do you mind elaborating on this? Suppose I have a LONG list of floats of unknown magnitude and I would like them all formatted with exactly 3 decimal places. How can I use min_significant_digits/max_significant_digits to accomplish this?

Atmelfan added the enhancement New feature or request label Jan 11, 2020

Atmelfan assigned Alexhuszagh Jan 11, 2020

Alexhuszagh closed this as completed Sep 3, 2021

matteosantama mentioned this issue Aug 19, 2022

feat[rust, python]: add float_precision parameter to DataFrame.write_csv pola-rs/polars#4504

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE] CORE write() formatting control #34

[FEATURE] CORE write() formatting control #34

Atmelfan commented Jan 11, 2020

Alexhuszagh commented Jan 11, 2020

Atmelfan commented Jan 11, 2020 •

edited

Loading

Alexhuszagh commented Jan 11, 2020 •

edited

Loading

Alexhuszagh commented Jan 11, 2020

MartinKavik commented May 4, 2021

Alexhuszagh commented May 5, 2021 •

edited

Loading

MartinKavik commented May 7, 2021

Alexhuszagh commented May 10, 2021 •

edited

Loading

Atmelfan commented May 10, 2021 •

edited

Loading

Alexhuszagh commented May 10, 2021

MartinKavik commented May 11, 2021

Alexhuszagh commented May 11, 2021

Alexhuszagh commented Aug 7, 2021 •

edited

Loading

Alexhuszagh commented Sep 1, 2021 •

edited

Loading

Alexhuszagh commented Sep 3, 2021

matteosantama commented Aug 16, 2022

[FEATURE] CORE write() formatting control #34

[FEATURE] CORE write() formatting control #34

Comments

Atmelfan commented Jan 11, 2020

Problem

Solution

Prerequisites

Alternatives

Alexhuszagh commented Jan 11, 2020

Atmelfan commented Jan 11, 2020 • edited Loading

Alexhuszagh commented Jan 11, 2020 • edited Loading

Alexhuszagh commented Jan 11, 2020

MartinKavik commented May 4, 2021

Alexhuszagh commented May 5, 2021 • edited Loading

MartinKavik commented May 7, 2021

Alexhuszagh commented May 10, 2021 • edited Loading

Atmelfan commented May 10, 2021 • edited Loading

Alexhuszagh commented May 10, 2021

MartinKavik commented May 11, 2021

Alexhuszagh commented May 11, 2021

Alexhuszagh commented Aug 7, 2021 • edited Loading

Alexhuszagh commented Sep 1, 2021 • edited Loading

Alexhuszagh commented Sep 3, 2021

matteosantama commented Aug 16, 2022

Atmelfan commented Jan 11, 2020 •

edited

Loading

Alexhuszagh commented Jan 11, 2020 •

edited

Loading

Alexhuszagh commented May 5, 2021 •

edited

Loading

Alexhuszagh commented May 10, 2021 •

edited

Loading

Atmelfan commented May 10, 2021 •

edited

Loading

Alexhuszagh commented Aug 7, 2021 •

edited

Loading

Alexhuszagh commented Sep 1, 2021 •

edited

Loading