[C++] Add error handling option to StrptimeOptions #20115

asfimport · 2022-02-11T15:06:05Z

We want to have an option to either raise, ignore or return NA in case of format mismatch.

See pandas.to_datetime and lubridate parse_date_time for examples.

Reporter: Rok Mihevc / @rok
Assignee: Rok Mihevc / @rok
Watchers: Rok Mihevc / @rok

Related issues:

[R] strptime should return NA (not error) with format mismatch (blocks)
[C++] Strptime issues umbrella (is a child of)
[C++] strptime rolls-over dates not in range for current month (relates to)
[R] [C++] Should strptime support a partial format? (is depended upon by)

PRs and other links:

_{Note: This issue was originally created as ARROW-15665. Please see the migration documentation for further details.}

asfimport · 2022-02-15T18:17:43Z

Joris Van den Bossche / @jorisvandenbossche:
An option to have it return Null (for the failing string(s)) instead of raising an error sounds good!

I am not sure about an "ignore" option. I know that pandas has it, but I find it a bit a strange option especially in context of an arrow kernel. In that case, it doesn't return a timestamp typed array, but the original string array (so which means that the kernel has no predictable output type).
Such logic to "try the conversion and if it errors keep the original array" could also be handled on a higher level than the C++ kernel?

asfimport · 2022-02-15T19:30:25Z

Rok Mihevc / @rok:
Thanks for the feedback Joris! Yeah, ignore would not play well with Arrow and should be solved elsewhere. Removing from description.

asfimport · 2022-03-10T16:08:51Z

Dragoș Moldovan-Grünfeld / @dragosmg:
@rok how far do you think we are from solving this ticket? It would be a massive help for us since it would enable the coalesce() route into solving many of the date-time parsing bindings tickets.

asfimport · 2022-03-10T16:50:09Z

Rok Mihevc / @rok:
@dragosmg I am not sure what the issue I'm seeing is so don't want to estimate. But I will prioritise it over other tickets.

asfimport · 2022-03-10T16:59:55Z

Dragoș Moldovan-Grünfeld / @dragosmg:
Many thanks.

asfimport · 2022-03-15T15:01:39Z

Dragoș Moldovan-Grünfeld / @dragosmg:
@rok, @jorisvandenbossche & @jonkeane
I will ask the question here and move it if it is the wrong place / follow-up Jira required. In R strptime returns NA / NULL in the following circumstances:

format doesn't match string - e.g. "1999-12-31" and "%Y-%d-%M"
string doesn't make sense, given the format - e.g. string is "this is a string that doesn't make sense"
the string can be parsed with the given format, but implicitly relies on rollover - string is "1999-02-30" and is parsed as the "1999-03-02" Date.

not sure 1 and 2 are actually distinct, but the 3rd part is different from the current Arrow behaviour

{code:r}
library(arrow, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df <- tibble::tibble(string_date = "1999-02-30")

df %>%
mutate(date = strptime(string_date, format = "%Y-%m-%d"))
#> # A tibble: 1 × 2
#> string_date date
#>
#> 1 1999-02-30 NA

df %>%
arrow_table() %>%
mutate(date = strptime(string_date, format = "%Y-%m-%d")) %>%
collect()
#> # A tibble: 1 × 2
#> string_date date
#>
#> 1 1999-02-30 1999-03-02 00:00:00
{code}

How are things done in Python? Does the R behaviour align with your expectations / Is it breaking any ISO Standard?

asfimport · 2022-03-15T19:07:06Z

Rok Mihevc / @rok:
I think 1. and 2. would both cause parsing error. 3. is getting into inference territory (see ARROW-15666).

Python stdlib strptime just throws errors AFAIK and pandas has it's own pd.to_datetime that has tons of options and you can play with this example here.

Strptime format is notoriously non-standardized so we probably just want to adopt c++ stdlib behaviour.

asfimport · 2022-03-15T19:16:58Z

Jonathan Keane / @jonkeane:
(3) sounds wrong, but like I said before: you ( @dragosmg) should look at what happens in python or if there is some standard where that is the indeed the right thing.

(1) + (2) both sound like they could be implemented as "if strptime fails to parse, (optionally) return null". No reason for us to go to far into why it didn't parse.

asfimport · 2022-03-15T19:31:49Z

Rok Mihevc / @rok:
I think in Python we'll probably try to get as close as possible to to_datetime so you can just focus on that.

asfimport · 2022-03-16T10:42:56Z

Joris Van den Bossche / @jorisvandenbossche:
For case 3, in pandas we don't "rollover" for a day that is too large for the given month, but rather raise an error:

>>> pd.to_datetime("1999-02-30", format="%Y-%m-%d")
...
ValueError: time data 1999-02-30 doesn't match format specified

And Python's stdlib seems to do that:

>>> datetime.datetime.strptime("1999-02-30", "%Y-%m-%d")
...
ValueError: day is out of range for month

Arrow indeed does roll-over:

>>> import pyarrow.compute as pc
>>> print(pc.strptime("1999-02-30", format="%Y-%m-%d", unit="s"))
1999-03-02 00:00:00

Personally, I don't like that behaviour, but I suppose we get this from the system strptime? (so that might even depend on your OS?)
It might be interesting to check what date.h's version of strptime does.

asfimport · 2022-03-16T10:46:05Z

Dragoș Moldovan-Grünfeld / @dragosmg:
Thanks all. I was about to comment with a pandas example (so thanks Joris for providing one and Rok for pointing me to Trinket). In conclusion, it's fair to say the rolling over is unexpected for Python too.

asfimport · 2022-03-16T10:49:42Z

Joris Van den Bossche / @jorisvandenbossche:
For cases 1 and 2, I agree that both are cases that should error (which already is what Arrow does now)

—

Only, for case 1, you might have a typo in your example ("M" vs "m"), because we are parsing minutes (and the missing month gets filled with 1):

>>> print(datetime.datetime.strptime("1999-12-31", "%Y-%d-%M"))
1999-01-12 00:31:00

(pandas does the same, and Arrow as well)

If I change that to use "%Y-%d-%m" (lower case m), Python and pandas give an error for this:

>>> print(datetime.datetime.strptime("1999-12-31", "%Y-%d-%m"))
...
ValueError: unconverted data remains: 1

and so does Arrow ("Failed to parse string: '1999-12-31' ..")

asfimport · 2022-03-16T10:51:47Z

Joris Van den Bossche / @jorisvandenbossche:
It's maybe worth opening a separate JIRA for the undesired "roll-over" behaviour, in case we end up closing this issue for the general "return null if it errors" feature

asfimport · 2022-03-16T11:06:48Z

Rok Mihevc / @rok:
I didn't realise 3. was actual arrow behaviour (I see it on mac). Very much agreed with removing it.

asfimport · 2022-03-16T11:15:50Z

Dragoș Moldovan-Grünfeld / @dragosmg:
I created ARROW-15948

asfimport · 2022-03-24T15:26:08Z

David Li / @lidavidm:
Issue resolved by pull request 12464
#12464

asfimport closed this as completed Mar 24, 2022

asfimport assigned rok Jan 10, 2023

This was referenced Jan 11, 2023

[R] strptime should return NA (not error) with format mismatch #31114

Closed

[R] [C++] Should strptime support a partial format? #31254

Open

[C++] Strptime issues umbrella #31324

Open

[C++] strptime rolls-over dates not in range for current month #31374

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++] Add error handling option to StrptimeOptions #20115

[C++] Add error handling option to StrptimeOptions #20115

asfimport commented Feb 11, 2022 •

edited

Loading

asfimport commented Feb 15, 2022

asfimport commented Feb 15, 2022

asfimport commented Mar 10, 2022

asfimport commented Mar 10, 2022

asfimport commented Mar 10, 2022

asfimport commented Mar 15, 2022

asfimport commented Mar 15, 2022

asfimport commented Mar 15, 2022

asfimport commented Mar 15, 2022

asfimport commented Mar 16, 2022

asfimport commented Mar 16, 2022

asfimport commented Mar 16, 2022

asfimport commented Mar 16, 2022

asfimport commented Mar 16, 2022

asfimport commented Mar 16, 2022

asfimport commented Mar 24, 2022

[C++] Add error handling option to StrptimeOptions #20115

[C++] Add error handling option to StrptimeOptions #20115

Comments

asfimport commented Feb 11, 2022 • edited Loading

Related issues:

PRs and other links:

asfimport commented Feb 15, 2022

asfimport commented Feb 15, 2022

asfimport commented Mar 10, 2022

asfimport commented Mar 10, 2022

asfimport commented Mar 10, 2022

asfimport commented Mar 15, 2022

asfimport commented Mar 15, 2022

asfimport commented Mar 15, 2022

asfimport commented Mar 15, 2022

asfimport commented Mar 16, 2022

asfimport commented Mar 16, 2022

asfimport commented Mar 16, 2022

asfimport commented Mar 16, 2022

asfimport commented Mar 16, 2022

asfimport commented Mar 16, 2022

asfimport commented Mar 24, 2022

asfimport commented Feb 11, 2022 •

edited

Loading