Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revamped parsing/formatting #236

Closed
jhpratt opened this issue Mar 17, 2020 · 6 comments
Closed

Revamped parsing/formatting #236

jhpratt opened this issue Mar 17, 2020 · 6 comments
Assignees
Labels
A-formatting Area: formatting A-parsing Area: parsing C-enhancement Category: an enhancement with existing code C-tracking-issue Category: tracking issue for a feature/release E-medium Some experience needed.
Milestone

Comments

@jhpratt
Copy link
Member

jhpratt commented Mar 17, 2020

While not ideal, I think it would be best to revamp the parsing and formatting of the various structs in the time crate.

According to tokei, the src/format directory has ~1100 lines of code. There is also a small amount of code in the files for each struct. While some of this code will be able to be reused, most of it will likely be replaced.

Major changes that I think would be sensible to make are:

  • Eliminate single-letter specifiers

    It's just plain confusing. Some specifiers (like %Y) are easily remembered, but most are not. Can you tell me the difference between %w and %W without looking at the reference? I certainly can't.

  • More modifiers, even if specifier-specific

    Let's allow for tons of options! Allow colons to be present (or not) for a UTC offset! There are certainly other things that could be allowed in the future.

  • Ability to lazily format

    This is by far the easiest one, as it's just an API addition. It would probably be best to just return impl Display, so as to avoid any doc-hidden structs or other API guarantees.

The combination of the first two leads to an inherent problem: when does the parser (of formatting strings) know when the specifier is over? Luckily, there's a solution that is both simple and keeps the parser simple: use a bracketed/parenthesized grouping delimiter. Due to the necessarily longer names of specifiers, the modifiers can be separated (both visually and logically) by a single space.

Another change that could prove useful for performance is a public API for the various specifier-modifier combinations. A macro could then be provided that would parse the formatting string at compile-time, such that the formatting string parser could be dropped by rustc as dead code.


This is certainly quite a bit to put out all at once. If you have any thoughts (for or against), leave a comment!


Edit: Some notes for myself as to intent.

It would be nice to be able to assume some default value for a component, such that it need not necessarily be present in the string being parsed. It'll probably be necessary to expose the raw values that were parsed, which would also allow a third-party to use those values freely.

@jhpratt jhpratt added the C-feature-request Category: a new feature (not already implemented) label Mar 17, 2020
@jhpratt jhpratt self-assigned this Apr 20, 2020
@mehcode
Copy link

mehcode commented May 16, 2020

I'm a bit mixed on this. I'm going to write out what I think you're describing against some data I have. Please correct me if I'm wrong in understanding your idea here.

2020-05-12T23_40_45.010569Z
OffsetDateTime::parse(
    "{year}-{month}-{day}T{hour}_{minute}_{second}.{nanoseconds}")?;

Have you thought of "simply" using the ISO Unicode standard here for this?

https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table

OffsetDateTime::parse(
    "yyyy-MM-dd'T'HH_mm_ss.SSSSSSSSS")?;

Tracking back to my issue, ISO Unicode specifies that the S.. sequence will match "at most" the number of characters you provide.

Here is some more documentation on the ISO Unicode format: https://date-fns.org/v2.13.0/docs/parse#description

@jhpratt
Copy link
Member Author

jhpratt commented May 16, 2020

There's already support for RFC3339, but that's not the format you're seeking. The variable number of decimal points was special-cased for that.

Unicode isn't ISO, but I think it's far clearer to specify things using words rather than using a seemingly arbitrary letter in many cases.

@jhpratt jhpratt added A-formatting Area: formatting A-parsing Area: parsing C-cleanup Category: cleanup of existing code C-enhancement Category: an enhancement with existing code C-tracking-issue Category: tracking issue for a feature/release E-medium Some experience needed. and removed C-feature-request Category: a new feature (not already implemented) labels May 16, 2020
@jhpratt
Copy link
Member Author

jhpratt commented May 24, 2020

I've just added this issue to TWIR's call for participation. As such, here's an update on the basics:

Lazily formatting is done, and was trivial to complete. The rewrite won't impact the ability to do this.

Having specifier-specific modifiers is still planned, though a list of what's desired would be great. Right now, all I've got for certain is the various padding options (none, space, zero) for a number of specifiers as well as whether or not the UTC offset (currently %z) should contain a colon or not.

I began rewriting the parser for formatting strings yesterday. It uses a superior design, as it is a thin wrapper around a &str that implements Iterator, yielding an Item that is either a Literal that should be output as-is or a Specifier that should be processed. Eventually, Item will be able to be combined with various components (Date, Time, etc.), resulting in a struct that will implement Display. The design based around an iterator avoids allocating a Vec of formatting items, instead parsing the string as necessary. This should yield a noticeable performance improvement.

With regard to the format itself, I'm moving forward with bracketed specifiers. To output a literal [, it must be doubled (as is done with { in format!).

Ignoring the necessary modifiers for padding, the equivalent of %Y-%m-%dT%H:%M%S will be [year]-[month]-[day]T[hour]:[minute]:[second], or something very similar.

NB: I am considering changing the internal design to use &[u8] instead of &str. This would allow slicing and indexing to be simpler and avoid the compiler inserting possible panics. Indexing and slicing is done on a byte level. The compiler still inserts possible panics, though they'll never be reached (see rust-lang/rust#72558).

@jhpratt jhpratt added the C-seeking-input 📣 Category: community input is desired label May 25, 2020
@jhpratt
Copy link
Member Author

jhpratt commented Jun 9, 2020

While the compiler allows it, lazy formatting turns out to be unidiomatic, as it requires fallibility originating from the time crate.

Additionally, the return value of this function is fmt::Result which is a type alias of Result<(), std::fmt::Error>. Formatting implementations should ensure that they propagate errors from the Formatter (e.g., when calling write!). However, they should never return errors spuriously. That is, a formatting implementation must and may only return an error if the passed-in Formatter returns an error. This is because, contrary to what the function signature might suggest, string formatting is an infallible operation. This function only returns a result because writing to the underlying stream might fail and it must provide a way to propagate the fact that an error has occurred back up the stack.

As such, lazy formatting will be removed in 0.3. I'm investigating ways to handle formatting at all; right now I'm leaning towards an implementation of .format() that just returns a Result.

Note that it may still be possible to have infallible formatting (and as such would implement Display). This might be tricky, as it involves what is essentially typestate.

@jhpratt
Copy link
Member Author

jhpratt commented Nov 17, 2020

Revamped formatting has been fully implemented. There is a format method similar to the one that already exists and a format_into method that accepts a &dyn Write. Both can be used in #![no_std] environments; the former requires an allocator.

The format description can be constructed manually or it can be parsed from a textual representation of it; the latter requires an allocator.

@jhpratt jhpratt added this to the v0.3 milestone Jan 19, 2021
@jhpratt jhpratt removed C-cleanup Category: cleanup of existing code C-seeking-input 📣 Category: community input is desired labels Jan 23, 2021
@jhpratt
Copy link
Member Author

jhpratt commented Apr 22, 2021

The full syntax is documented here. Both formatting and parsing are fully implemented and tested.

@jhpratt jhpratt closed this as completed Apr 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-formatting Area: formatting A-parsing Area: parsing C-enhancement Category: an enhancement with existing code C-tracking-issue Category: tracking issue for a feature/release E-medium Some experience needed.
Projects
None yet
Development

No branches or pull requests

2 participants