Parse datetimes and timestamps with leading and/or trailing whitespace #5544

Open
wants to merge 2 commits into
base: main

Conversation

guilload
Member

@guilload guilload commented Nov 6, 2024

Description

Per title. Requested by Airmail. Elasticsearch (ES) supports this as well.
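
To illustrate the problem the title describes, here is a minimal sketch using only the Rust standard library. The helper `parse_unix_timestamp` is hypothetical and is not the function touched by this PR; it only shows that integer parsing rejects surrounding whitespace unless the input is trimmed first. `str::trim_ascii` requires Rust 1.80 or later.

```rust
use std::num::ParseIntError;

/// Hypothetical helper (not Quickwit's actual code): trims ASCII whitespace
/// before parsing a Unix timestamp expressed in seconds.
fn parse_unix_timestamp(raw: &str) -> Result<i64, ParseIntError> {
    raw.trim_ascii().parse::<i64>()
}

fn main() {
    // Without trimming, the standard integer parser rejects padded input.
    assert!(" 1730888400 ".parse::<i64>().is_err());

    // With trimming, leading and/or trailing whitespace is accepted.
    assert_eq!(parse_unix_timestamp(" 1730888400 "), Ok(1730888400));
    assert_eq!(parse_unix_timestamp("\t1730888400\n"), Ok(1730888400));
}
```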

How was this PR tested?

  • Updated unit tests
  • make test-all

Contributor

@trinity-1686a trinity-1686a left a comment

I wonder whether, for Strptime patterns, we should also trim the parser spec: if the pattern is " %Y " with surrounding spaces (this pattern only parses years, but that's not the point), it won't accept " 2024 " as valid, because it will expect the spaces we just removed from the input.

@guilload
Member Author

guilload commented Nov 6, 2024

That makes sense. Let me add a commit on top of this PR.

@@ -36,22 +36,24 @@ pub fn parse_date_time_str(
    date_time_str: &str,
    date_time_formats: &[DateTimeInputFormat],
) -> Result<TantivyDateTime, String> {
    let trimmed_date_time_str = date_time_str.trim();
Contributor

Suggested change
- let trimmed_date_time_str = date_time_str.trim();
+ let trimmed_date_time_str = date_time_str.trim_ascii();

Contributor

It is probably minor, but trimming Unicode whitespace is considerably more expensive than trimming ASCII whitespace.
(It requires decoding UTF-8 and checking whether each individual char is whitespace.)

I'd feel safer if we restricted ourselves to ASCII. It will just prevent us from trimming unusual whitespace like the Japanese full-width space "　" (U+3000).
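
For reference, a small sketch of the behavioral difference between the two standard library methods, assuming Rust 1.80 or later (where `str::trim_ascii` is stable). The sample datetime strings are made up for illustration.

```rust
fn main() {
    // Input padded with ASCII whitespace only (space, tab, CR, LF).
    let ascii_padded = " \t2024-11-06T12:00:00Z\r\n";
    // Input padded with U+3000 (ideographic space), which is not ASCII.
    let wide_padded = "\u{3000}2024-11-06T12:00:00Z\u{3000}";

    // Both methods strip ASCII whitespace.
    assert_eq!(ascii_padded.trim(), "2024-11-06T12:00:00Z");
    assert_eq!(ascii_padded.trim_ascii(), "2024-11-06T12:00:00Z");

    // Only `trim` strips Unicode whitespace; `trim_ascii` leaves it in place.
    assert_eq!(wide_padded.trim(), "2024-11-06T12:00:00Z");
    assert_eq!(wide_padded.trim_ascii(), wide_padded);
}
```

With `trim_ascii`, an input padded with U+3000 would reach the datetime parsers untrimmed, which is the trade-off described above.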

@fulmicoton
Contributor

@guilload can you merge this?

and update tests to show pattern is already trimmed by default
@trinity-1686a
Contributor

It seems strptime is already lenient about whitespace inside the pattern (but not in the parsed string, which we trim ourselves anyway). I updated the tests accordingly.

I think the Java-compatible parser used for range requests doesn't have access to that trim code, but it's not going to reject documents, so I don't think we necessarily care as much.


3 participants