[C++] strptime rolls-over dates not in range for current month #31374

Open
asfimport opened this issue Mar 16, 2022 · 6 comments

asfimport commented Mar 16, 2022

I noticed some potentially unexpected behaviour when converting from string to date. Days that are out of bounds for the given month are rolled over into the following month.

I think the expected behaviour would be to either error (Python) or return NULL/NA (R), but not to roll dates over into the following month.

library(arrow, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df <- tibble::tibble(string_date = "1999-02-30")

# base R returns NA
df %>% 
  mutate(date = strptime(string_date, format = "%Y-%m-%d"))
#> # A tibble: 1 × 2
#>   string_date date  
#>   <chr>       <dttm>
#> 1 1999-02-30  NA

# arrow rolls over the 30th of February into the 2nd of March
df %>% 
  arrow_table() %>% 
  mutate(date = strptime(string_date, format = "%Y-%m-%d")) %>% 
  collect()
#> # A tibble: 1 × 2
#>   string_date date               
#>   <chr>       <dttm>             
#> 1 1999-02-30  1999-03-02 00:00:00

Thanks Alenka, Joris and Rok for helping me with the Python examples:
pandas:

>>> import pandas as pd
>>> pd.to_datetime("1999-02-30", format="%Y-%m-%d")
...
ValueError: time data 1999-02-30 doesn't match format specified

datetime:

>>> from datetime import datetime
>>> datetime.strptime("1999-02-30", "%Y-%m-%d")
...
ValueError: day is out of range for month

arrow:

>>> import pyarrow.compute as pc
>>> print(pc.strptime("1999-02-30", format="%Y-%m-%d", unit="s"))
1999-03-02 00:00:00
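
The rollover above is consistent with a parser that reads the year, month and day fields and then converts them to an epoch offset by plain day arithmetic, without checking that the day exists in that month. Whether this is what Arrow's C++ kernel actually does is an assumption, not something confirmed here. A minimal standard-library sketch of that effect, with `naive_epoch_seconds` and `rolled_over` as hypothetical helper names:

```python
import calendar
from datetime import datetime, timezone


def naive_epoch_seconds(year, month, day):
    """Convert calendar fields to epoch seconds without validating the day."""
    # calendar.timegm counts days from the first of the month, so an
    # out-of-range day spills over into the following month.
    return calendar.timegm((year, month, day, 0, 0, 0))


def rolled_over(year, month, day):
    """Return True if a round trip through epoch seconds changes the fields."""
    ts = naive_epoch_seconds(year, month, day)
    back = datetime.fromtimestamp(ts, tz=timezone.utc)
    return (back.year, back.month, back.day) != (year, month, day)


ts = naive_epoch_seconds(1999, 2, 30)
print(datetime.fromtimestamp(ts, tz=timezone.utc).date())  # 1999-03-02
print(rolled_over(1999, 2, 30))  # True
print(rolled_over(1999, 2, 28))  # False
```

The round-trip comparison in `rolled_over` is one way a caller could detect a rolled-over value after the fact, for example to turn it into an error or a null.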

Reporter: Dragoș Moldovan-Grünfeld / @dragosmg
Watchers: Rok Mihevc / @rok

Note: This issue was originally created as ARROW-15948. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
Copying my comment from the other issue: personally, I don't like that behaviour, but I suppose we get this from the system strptime? (so that might even depend on your OS?)
It might be interesting to check what date.h's version of strptime does.

> I think the expected behaviour would be to either error (Python) or return NULL/NA (R),

I would say it is both: by default error, and optionally return null (after ARROW-15665)
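
The "error by default, optionally return null" behaviour suggested above could look roughly like this in plain Python. This is only a sketch: `strict_strptime` and its `null_on_error` flag are hypothetical names for illustration, not Arrow's API. It relies on the standard library's `datetime.strptime`, which raises `ValueError` for a day that is out of range for the month.

```python
from datetime import datetime
from typing import Optional


def strict_strptime(value: str, fmt: str, null_on_error: bool = False) -> Optional[datetime]:
    """Parse `value` with `fmt`; raise on invalid input unless null_on_error is set."""
    try:
        # datetime.strptime rejects out-of-range days instead of rolling over.
        return datetime.strptime(value, fmt)
    except ValueError:
        if null_on_error:
            # Opt-in behaviour: invalid input becomes a missing value.
            return None
        raise


print(strict_strptime("1999-02-28", "%Y-%m-%d"))  # 1999-02-28 00:00:00
print(strict_strptime("1999-02-30", "%Y-%m-%d", null_on_error=True))  # None
```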

Rok Mihevc / @rok:
Two notes:

1. Perhaps system strptime implementations have an option to disable this behaviour?
2. Switching from the system strptime to date.h's might have performance implications, so we should probably benchmark as we change this.

Antoine Pitrou / @pitrou:
Does date.h have a strptime? We use a vendored implementation from musl on Windows.

Antoine Pitrou / @pitrou:
Ah, well, let's not use it, then :)

Rok Mihevc / @rok:
But maybe it's nice to dates? :D
