[C++][Python][R] Add format inference option to StrptimeOptions #31120

Open
asfimport opened this issue Feb 11, 2022 · 4 comments

Comments

@asfimport
Collaborator

asfimport commented Feb 11, 2022

We want StrptimeOptions to offer an option to infer the timestamp format.

See pandas.to_datetime and lubridate parse_date_time for examples.

Reporter: Rok Mihevc / @rok
Watchers: Rok Mihevc / @rok

Related issues:

Note: This issue was originally created as ARROW-15666. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
Is there functionality available for this that we could reuse (e.g. in date.h)? I am not sure we should start implementing custom logic for that ourselves.

@asfimport
Collaborator Author

Rok Mihevc / @rok:
I don't expect this would be in date.h scope. If I understand correctly, pandas and R/lubridate both infer a format on a subset of rows and use that format to parse the rest. Perhaps we can directly use that logic for now (I believe this was @dragosmg's idea too) and see if we actually need this in C++?
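
To make the "infer on a subset, then parse the rest" idea concrete, here is a minimal standard-library sketch. It is not how pandas or lubridate implement inference; the candidate list and the names `CANDIDATE_FORMATS`, `infer_format`, and `parse_all` are hypothetical and chosen only for illustration.

```python
from datetime import datetime
from typing import List, Optional, Sequence

# Hypothetical candidate list; real inference would need far more formats.
CANDIDATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y")


def _matches(value: str, fmt: str) -> bool:
    """Return True if `value` parses cleanly with `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_format(
    values: Sequence[Optional[str]],
    sample_size: int = 100,
    candidates: Sequence[str] = CANDIDATE_FORMATS,
) -> Optional[str]:
    """Return the first candidate that parses every non-null sampled value."""
    sample = [v for v in values[:sample_size] if v is not None]
    if not sample:
        return None
    for fmt in candidates:
        if all(_matches(v, fmt) for v in sample):
            return fmt
    return None


def parse_all(values: Sequence[Optional[str]], fmt: str) -> List[Optional[datetime]]:
    """Parse every value with the single inferred format, keeping nulls."""
    return [None if v is None else datetime.strptime(v, fmt) for v in values]
```

Note that an ambiguous sample such as `["01/02/2022"]` simply matches whichever candidate comes first, which is one example of the edge cases that make inference hard to scope.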

@asfimport
Collaborator Author

Matthew Roeschke / @mroeschke:
Speaking from experience on the pandas side, I agree with @jorisvandenbossche and would caution against "inference" logic. While convenient for users, the maintenance burden can be quite significant, since inference tends to have an indefinite scope, leading to more custom logic, edge cases, etc.

@asfimport
Collaborator Author

Rok Mihevc / @rok:
Thanks for the warning, Matthew, much appreciated!
Looking at the utility-to-complexity ratio this does seem like something we'd better avoid.

One idea would be to use the existing pandas logic (if pandas is available at runtime) to do the format inference, then pass the inferred format to C++ and do the rest of the operation there. The same would apply to lubridate in R.
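
A rough sketch of that handoff on the Python side might look like the following. It assumes pandas 2.0 or later, where `guess_datetime_format` is importable from `pandas.tseries.api`, and it uses the existing `pyarrow.compute.strptime` kernel for the actual parsing. The wrapper `parse_with_inferred_format` and its `guess`/`parse` parameters are hypothetical names for illustration, not an existing Arrow API; both libraries are imported lazily so the inference step stays optional.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def _pandas_guess(sample: str) -> Optional[str]:
    # Assumption: pandas >= 2.0, which exposes this helper publicly.
    from pandas.tseries.api import guess_datetime_format

    return guess_datetime_format(sample)


def _arrow_parse(values: Sequence[Optional[str]], fmt: str) -> Any:
    import pyarrow as pa
    import pyarrow.compute as pc

    # The parsing itself runs in Arrow's C++ strptime kernel.
    return pc.strptime(pa.array(values), format=fmt, unit="s")


def parse_with_inferred_format(
    values: Sequence[Optional[str]],
    guess: Callable[[str], Optional[str]] = _pandas_guess,
    parse: Callable[[Sequence[Optional[str]], str], Any] = _arrow_parse,
) -> Tuple[str, Any]:
    """Infer a format from the first non-null value, then parse everything with it."""
    first = next((v for v in values if v is not None), None)
    if first is None:
        raise ValueError("no non-null value to infer a format from")
    fmt = guess(first)
    if fmt is None:
        raise ValueError(f"could not infer a format from {first!r}")
    return fmt, parse(values, fmt)
```

Keeping the inference step behind a plain callable means Arrow itself would not carry any inference logic; it only receives an explicit format string, as it does today.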
