[C++] Add error handling option to StrptimeOptions #20115
Comments
Joris Van den Bossche / @jorisvandenbossche: I am not sure about an "ignore" option. I know that pandas has it, but I find it a somewhat strange option, especially in the context of an Arrow kernel. In that case, it doesn't return a timestamp-typed array but the original string array (which means that the kernel has no predictable output type).
Rok Mihevc / @rok: |
Dragoș Moldovan-Grünfeld / @dragosmg: |
Rok Mihevc / @rok: |
Dragoș Moldovan-Grünfeld / @dragosmg: |
Dragoș Moldovan-Grünfeld / @dragosmg:
df %>% How are things done in Python? Does the R behaviour align with your expectations / Is it breaking any ISO Standard? |
Rok Mihevc / @rok: Python stdlib strptime just throws errors AFAIK, and pandas has its own pd.to_datetime that has tons of options; you can play with this example here. The strptime format is notoriously non-standardized, so we probably just want to adopt the C++ stdlib behaviour.
Jonathan Keane / @jonkeane: (1) + (2) both sound like they could be implemented as "if strptime fails to parse, (optionally) return null". No reason for us to go too far into why it didn't parse.
Rok Mihevc / @rok: |
Joris Van den Bossche / @jorisvandenbossche:
>>> pd.to_datetime("1999-02-30", format="%Y-%m-%d")
...
ValueError: time data 1999-02-30 doesn't match format specified
And Python's stdlib seems to do that:
>>> datetime.datetime.strptime("1999-02-30", "%Y-%m-%d")
...
ValueError: day is out of range for month
Arrow indeed does roll-over:
>>> import pyarrow.compute as pc
>>> print(pc.strptime("1999-02-30", format="%Y-%m-%d", unit="s"))
1999-03-02 00:00:00
Personally, I don't like that behaviour, but I suppose we get this from the system
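To make the roll-over result above concrete, here is a minimal sketch of the calendar arithmetic using only the Python stdlib. The helper name `rolled_over` is hypothetical, and this only illustrates why day 30 of February 1999 lands on 2 March; it is not a claim about how Arrow or the underlying parser computes the value.

```python
from datetime import date, timedelta


def rolled_over(year: int, month: int, day: int) -> date:
    """Hypothetical helper: treat an out-of-range day as an offset
    from the first day of the month instead of rejecting it."""
    return date(year, month, 1) + timedelta(days=day - 1)


# February 1999 has 28 days, so "day 30" overflows by two days.
print(rolled_over(1999, 2, 30))  # 1999-03-02

# A strict constructor rejects the same input, as the stdlib strptime does.
try:
    date(1999, 2, 30)
except ValueError as exc:
    print("strict:", exc)
```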
Dragoș Moldovan-Grünfeld / @dragosmg: |
Joris Van den Bossche / @jorisvandenbossche: Only, for case 1, you might have a typo in your example ("M" vs "m"), because we are parsing minutes (and the missing month gets filled with 1):
>>> print(datetime.datetime.strptime("1999-12-31", "%Y-%d-%M"))
1999-01-12 00:31:00
(pandas does the same, and Arrow as well.) If I change that to use "%m":
>>> print(datetime.datetime.strptime("1999-12-31", "%Y-%d-%m"))
...
ValueError: unconverted data remains: 1
and so does Arrow ("Failed to parse string: '1999-12-31' ..")
Joris Van den Bossche / @jorisvandenbossche: |
Rok Mihevc / @rok: |
Dragoș Moldovan-Grünfeld / @dragosmg: |
We want to have an option to either raise, ignore or return NA in case of format mismatch.
See pandas.to_datetime and lubridate parse_date_time for examples.
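As a rough illustration of the three policies named above, the following stdlib-only sketch parses a list of strings under "raise", "ignore", or "null". The function name `parse_all` and the `errors` parameter are hypothetical and are not Arrow's API; the "ignore" branch is modelled on the behaviour described in the comments, where the original strings come back unchanged.

```python
from datetime import datetime
from typing import List, Optional, Sequence, Union


def parse_all(
    values: Sequence[str], fmt: str, errors: str = "raise"
) -> Union[List[Optional[datetime]], List[str]]:
    """Hypothetical helper showing three ways to handle a format mismatch."""
    if errors not in ("raise", "ignore", "null"):
        raise ValueError(f"unknown errors policy: {errors!r}")
    parsed: List[Optional[datetime]] = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, fmt))
        except ValueError:
            if errors == "raise":
                raise  # propagate the first failure
            if errors == "ignore":
                return list(values)  # give back the input strings untouched
            parsed.append(None)  # "null": mark only the failing element
    return parsed


data = ["1999-12-31", "not a date"]
print(parse_all(data, "%Y-%m-%d", errors="null"))
print(parse_all(data, "%Y-%m-%d", errors="ignore"))
```

Note that under "ignore" the element type of the result depends on the data (timestamps when everything parses, strings otherwise), which is the unpredictable-output-type concern raised in the comments; "null" always yields timestamps or missing values.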
Reporter: Rok Mihevc / @rok
Assignee: Rok Mihevc / @rok
Watchers: Rok Mihevc / @rok
Related issues:
PRs and other links:
Note: This issue was originally created as ARROW-15665. Please see the migration documentation for further details.