Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Add error handling option to StrptimeOptions #20115

Closed
asfimport opened this issue Feb 11, 2022 · 16 comments
Closed

[C++] Add error handling option to StrptimeOptions #20115

asfimport opened this issue Feb 11, 2022 · 16 comments

Comments

@asfimport
Copy link
Collaborator

asfimport commented Feb 11, 2022

We want to have an option to either raise, ignore or return NA in case of format mismatch.

See pandas.to_datetime and lubridate parse_date_time for examples.

Reporter: Rok Mihevc / @rok
Assignee: Rok Mihevc / @rok
Watchers: Rok Mihevc / @rok

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-15665. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
An option to have it return Null (for the failing string(s)) instead of raising an error sounds good!

I am not sure about an "ignore" option. I know that pandas has it, but I find it a bit a strange option especially in context of an arrow kernel. In that case, it doesn't return a timestamp typed array, but the original string array (so which means that the kernel has no predictable output type).
Such logic to "try the conversion and if it errors keep the original array" could also be handled on a higher level than the C++ kernel?

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
Thanks for the feedback Joris! Yeah, ignore would not play well with Arrow and should be solved elsewhere. Removing from description.

@asfimport
Copy link
Collaborator Author

Dragoș Moldovan-Grünfeld / @dragosmg:
@rok  how far do you think we are from solving this ticket? It would be a massive help for us since it would enable the coalesce() route into solving many of the date-time parsing bindings tickets.

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
@dragosmg I am not sure what the issue I'm seeing is so don't want to estimate. But I will prioritise it over other tickets.

@asfimport
Copy link
Collaborator Author

Dragoș Moldovan-Grünfeld / @dragosmg:
Many thanks.

@asfimport
Copy link
Collaborator Author

Dragoș Moldovan-Grünfeld / @dragosmg:
@rok, @jorisvandenbossche & @jonkeane
I will ask the question here and move it if it is the wrong place / follow-up Jira required. In R strptime returns NA / NULL in the following circumstances:

  1. format doesn't match string - e.g. "1999-12-31" and "%Y-%d-%M"

  2. string doesn't make sense, given the format - e.g. string is "this is a string that doesn't make sense"

  3. the string can be parsed with the given format, but implicitly relies on rollover - string is "1999-02-30" and is parsed as the "1999-03-02" Date.

    not sure 1 and 2 are actually distinct, but the 3rd part is different from the current Arrow behaviour

    {code:r}
    library(arrow, warn.conflicts = FALSE)
    library(lubridate, warn.conflicts = FALSE)
    library(dplyr, warn.conflicts = FALSE)

    df <- tibble::tibble(string_date = "1999-02-30")

    df %>%
    mutate(date = strptime(string_date, format = "%Y-%m-%d"))
    #> # A tibble: 1 × 2
    #> string_date date
    #>
    #> 1 1999-02-30 NA

df %>%
arrow_table() %>%
mutate(date = strptime(string_date, format = "%Y-%m-%d")) %>%
collect()
#> # A tibble: 1 × 2
#> string_date date
#>
#> 1 1999-02-30 1999-03-02 00:00:00
{code}

How are things done in Python? Does the R behaviour align with your expectations / Is it breaking any ISO Standard?

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
I think 1. and 2. would both cause parsing error. 3. is getting into inference territory (see ARROW-15666).

Python stdlib strptime just throws errors AFAIK and pandas has it's own pd.to_datetime that has tons of options and you can play with this example here.

Strptime format is notoriously non-standardized so we probably just want to adopt c++ stdlib behaviour.

@asfimport
Copy link
Collaborator Author

Jonathan Keane / @jonkeane:
(3) sounds wrong, but like I said before: you ( @dragosmg) should look at what happens in python or if there is some standard where that is the indeed the right thing.

(1) + (2) both sound like they could be implemented as "if strptime fails to parse, (optionally) return null". No reason for us to go to far into why it didn't parse.

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
I think in Python we'll probably try to get as close as possible to to_datetime so you can just focus on that.

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
For case 3, in pandas we don't "rollover" for a day that is too large for the given month, but rather raise an error:

>>> pd.to_datetime("1999-02-30", format="%Y-%m-%d")
...
ValueError: time data 1999-02-30 doesn't match format specified

And Python's stdlib seems to do that:

>>> datetime.datetime.strptime("1999-02-30", "%Y-%m-%d")
...
ValueError: day is out of range for month

Arrow indeed does roll-over:

>>> import pyarrow.compute as pc
>>> print(pc.strptime("1999-02-30", format="%Y-%m-%d", unit="s"))
1999-03-02 00:00:00

Personally, I don't like that behaviour, but I suppose we get this from the system strptime? (so that might even depend on your OS?)
It might be interesting to check what date.h's version of strptime does.

@asfimport
Copy link
Collaborator Author

Dragoș Moldovan-Grünfeld / @dragosmg:
Thanks all. I was about to comment with a pandas example (so thanks Joris for providing one and Rok for pointing me to Trinket). In conclusion, it's fair to say the rolling over is unexpected for Python too.

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
For cases 1 and 2, I agree that both are cases that should error (which already is what Arrow does now)

Only, for case 1, you might have a typo in your example ("M" vs "m"), because we are parsing minutes (and the missing month gets filled with 1):

>>> print(datetime.datetime.strptime("1999-12-31", "%Y-%d-%M"))
1999-01-12 00:31:00

(pandas does the same, and Arrow as well)

If I change that to use "%Y-%d-%m" (lower case m), Python and pandas give an error for this:

>>> print(datetime.datetime.strptime("1999-12-31", "%Y-%d-%m"))
...
ValueError: unconverted data remains: 1

and so does Arrow ("Failed to parse string: '1999-12-31' ..")

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
It's maybe worth opening a separate JIRA for the undesired "roll-over" behaviour, in case we end up closing this issue for the general "return null if it errors" feature

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
I didn't realise 3. was actual arrow behaviour (I see it on mac). Very much agreed with removing it.

@asfimport
Copy link
Collaborator Author

Dragoș Moldovan-Grünfeld / @dragosmg:
I created ARROW-15948

@asfimport
Copy link
Collaborator Author

David Li / @lidavidm:
Issue resolved by pull request 12464
#12464

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants