Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Timezone database configuration and access #28868

Closed
asfimport opened this issue Jun 24, 2021 · 17 comments
Closed

[C++] Timezone database configuration and access #28868

asfimport opened this issue Jun 24, 2021 · 17 comments
Assignees
Milestone

Comments

@asfimport
Copy link
Collaborator

asfimport commented Jun 24, 2021

Note: currently timezone database is not available on windows so timezone aware operations will fail.

We're using tz.h library which needs an updated timezone database to correctly handle timezoned timestamps. See installation instructions.

We have the following options for getting a timezone database:

  1. local (non-windows) OS timezone database - no work required.

  2. arrow bundled folder - we could bundle the database at build time for windows. Database would slowly go stale.

  3. download it from IANA Time Zone Database at runtime - tz.h gets the database at runtime, but curl (and 7-zip on windows) are required.

  4. local user-provided folder - user could provide a location at buildtime. Nice to have.

  5. allow runtime configuration - at runtime say: "the tzdata can be found at this location"

    For more context see: ARROW-12980 and PEP 615.

Reporter: Will Jones / @wjones127
Assignee: Will Jones / @wjones127
Watchers: Rok Mihevc / @rok

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-13168. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Dewey Dunnington / @paleolimbot:
Just a +1 for a "runtime configuration" option. In R we have the tzdb package. Currently it only provides the text format of the IANA database but we could use that approach if we need something different (maintained sepaerately to keep it up to date). I'm less familiar with Python but I imagine something similar exist there, too.

list.files(tzdb::tzdb_path("text"))
#>  [1] "africa"            "antarctica"        "asia"             
#>  [4] "australasia"       "backward"          "backzone"         
#>  [7] "calendars"         "checklinks.awk"    "checktab.awk"     
#> [10] "CONTRIBUTING"      "etcetera"          "europe"           
#> [13] "factory"           "iso3166.tab"       "leap-seconds.list"
#> [16] "leapseconds"       "leapseconds.awk"   "LICENSE"          
#> [19] "Makefile"          "NEWS"              "northamerica"     
#> [22] "README"            "southamerica"      "theory.html"      
#> [25] "version"           "windowsZones.xml"  "ziguard.awk"      
#> [28] "zishrink.awk"      "zone.tab"          "zone1970.tab"     
#> [31] "zoneinfo2tdf.pl"

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
Nice to see interest on this @paleolimbot! I will probably pick it up next.

I'm not sure we want to depend on R/Python libraries being present from C++, but I'll look into this option, thanks!

@asfimport
Copy link
Collaborator Author

Will Jones / @wjones127:
It looks like that tzdb is a wrapper around this C++ library: https://github.com/HowardHinnant/date

I also saw that same library recommended in this StackOverflow answer by a Microsoft employee about Windows TZ libraries.

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
Indeed it appears to be @wjones127. It also seems to be doing the bundling approach - including the timezone data with the install - https://github.com/r-lib/tzdb/tree/main/inst/tzdata

We also use date.h and tz.h (https://github.com/HowardHinnant/date#summary) in Arrow for almost all time related things.

@asfimport
Copy link
Collaborator Author

Dewey Dunnington / @paleolimbot:
For sure you don't want to depend on it being present from C+! From our end on the R bindings, though, it means we can support Windows through the tzdb R package (if C+ lets us point at a directory with a database in the right format).

@asfimport
Copy link
Collaborator Author

Rok Mihevc / @rok:
Ah, I see your point. Neat! Is tzdb package always present?

@asfimport
Copy link
Collaborator Author

Dewey Dunnington / @paleolimbot:
We can force it to be present at build-time, force it to be present at install time or force it to be present at runtime (probably what we'd do).

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
The same is the case for python: https://pypi.org/project/tzdata/

Ideally we would be able to specify the path at runtime, I think.

@asfimport
Copy link
Collaborator Author

Alessandro Molina / @amol-:
@pitrou I think that given that Python, R and C++ would have to be aligned in regard of how they can load(or embed) timezone database to work on Windows, it would be helpful to have your opinion. Personally I think that a reasonable solution would be to pass to C++ the "path" to the directory containing the database. So that both R and Py and pass their own dir and C++ can embedd files and provide them at install time. What do you think?

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
I think a runtime setting is the best solution here - if easily implemented. This is a pattern already in place for TLS certificates, and could be moved to a more general global-config routine:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/filesystem.h#L518-L536

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
It might need some tweaking of the vendored date.h to allow this (I seem to remember from when we started adding timezone related kernels that it doesn't support a runtime configuration out of the box)

@asfimport
Copy link
Collaborator Author

Will Jones / @wjones127:
I started to look at this then got busy with something else. My thought was we might be able to achieve setting this at runtime using set_install and reload_tzdb() (See these lines). Not 100% sure this will work; there seems to be some issue that this supports the text format and not some binary format (I have some more investigating to do).

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
date.h's set_install() seems to support the text form of the IANA database, which is also what R provides.

However, on Python, pytz provides the binary form of the IANA database, which date.h currently doesn't support on Windows.

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
I suppose we mainly care about tzdata and not pytz for Python (going forward, since that is what the standard library uses), however that seems to also use the binary form (https://pypi.org/project/tzdata/)

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Yeah, let's start with R-only if that's easier.

@asfimport
Copy link
Collaborator Author

Will Jones / @wjones127:
Yep, I am starting with R-only right now. Will have a PR up later today.

@asfimport
Copy link
Collaborator Author

Jonathan Keane / @jonkeane:
Issue resolved by pull request 12536
#12536

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants