-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Specific timezone support for to_timetamp*()
#686
Comments
@alamb @westonpace .... your thoughts on this matter would be really welcome. :) To be specific, if you do There is also the possibility to simplify things and eliminate the timezone field from Arrow, but that's a really big change. For example, Parquet does not store the timezone but simply has a isAdjustedToUTC field. If True, that means timestamps are UTC, and if not that means local timestamp. |
This does seem like an inconsistency with the just agreed upon convention. I would expect all of these methods to convert to
Correct, I think DataFusion is interpreting
I'm not sure if you are saying the are not comparable or they should not be comparable. They "should" be comparable if you are comparing two columns with a non-None timestamp (e.g.
As long as it is optional that is probably a good idea. However, that sounds like it is adding capability and wouldn't fix anything on its own.
I'm not sure exactly what you are saying here. |
I don't think |
FWIW this is what we do in IOx as well
I think a second optional argument for
I think a better solution would be for DataFusion to implement coercion rules between date/time/timestamp types. In your example of a timestamp_col with
by adding the following cast (please forgive me the pseudo-SQL)
|
It sounds like everyone agrees second, optional argument to As for casting/coercion, we have:
by adding the following cast (please forgive me the pseudo-SQL)
The above is saying that OTOH @westonpace wrote:
It seems to me that coercing Timestamp(_, None) to So follow up / action items / possibilities:
BTW I looked up the C/C++ API, and the Python API, and none of those appear to support timezones in the timestamp type, just a unit. It could be argued that 4) would bring the Rust API closer to the other languages APIs, and when arrow datatypes are translated from Flight or the binary representation, how is the timezone encoded, if at all? (if 4 is a worthy avenue to discuss, I can start a email chain) cheers and thanks |
While this might be theoretically the best option, I think this would be a fairly painful change (at least for us in IOx) as our timestamps are
Perhaps
I am actually a fan of this direction -- I am fairly sure the use of a Timezone that is something other than I think it would keep things simpler to convert timestamps on the edge (aka if it was "local time" convert to UTC prior to putting into DataFusion) and then have all the processing machinery deal with UTC timestamps only |
I'm not even a user (yet); I see two big great reasons not to add an argument: 1. Ambiguous calling pattern. When converting string to Arrow timestamp, there are two timezones to consider: the input-string timezone and the output-timestamp timezone. Which one would this argument mean? If I see the call, Someone else might expect If three smart people reading the same line of code would interpret it three different ways, that's a design problem. 2. PostgreSQL compatibility. Postgres has two timestamp types; Arrow timestamp+tz-metadata columns are neither. In Postgres, Postgres also has a Postgres has gone 30 years without Arrow's timestamp+tz-metadata types. In the words of Tom Lane, "if we simply took away to_timestamp(), most users would be better off". Be careful about adding a feature: removing one might help more users accomplish more tasks, more smoothly. |
@adamhooper thank you for the perspectives. I see the challenges with adding a second argument to Can you suggest a strategy for the usecase described in this ticket, namely "writing a query using SQL against data stored in parquet that has an explicit timezone?" In other words, how do you suggest supporting the following query? SELECT ...
FROM my_parquet_file_where_timestamp_col_has_a_specified_timezone
WHERE timestamp_col > to_timestamp("2021-06-21T12:00Z") Which currently fails because the arrow type of |
@alamb would anybody be horrified/appalled at the following:
To me, dropping timezone info is an easy (if rude) decision. DataFusion implements a subset of Postgres, and Postgres has no timezone metadata. There's an argument to be made for passing the timezone through unmodified; but that would be hard/convoluted, probably needing new syntax; and does anybody actually want that? IPC with datetime+timezone tuples is nonstandard and heavyweight; Spark doesn't even have a type for that. Forcing UTC won't stop anybody from accomplishing anything. So the only hairy decision is what to do with Arrow (Primer on Postgres, for the uninitiated: |
@AdamHopper personally I'd be OK with your recommendations, but some thoughts:
Also, I see your point about Thus, while
|
@velvia Great points My last two suggestions were exactly about interop -- and a transition period. Today, DataFusion users interpret timezone=null to mean I hadn't thought of timestamp resolution. Again, I'm out of my element (for now) :). The crux of my suggestion is to ignore the As for the But -- sidetracking here -- how important are these types? |
@alamb @westonpace and others: just to continue this conversation and try and bring it to a close.
Do the above sound about right? I would also like to observe current obstacles to 1) and 2) in the Arrow/DF codebase (there are more, just what I've found in the last ~2 weeks):
|
To add to the above list, to_timestamp(string) should really return a Timestamp Scalar with the proper timezone derived from parsing the string. Currently no TZ info is returned, but for example if the date time string has "Z" in it, then the resulting scalar timezone should be "UTC". The bigger question really is how much effort to pour into all this. Since the other Arrow implementations and the Arrow spec already specifies TimeZone, and due to many issues mentioned above, it seems to me that at least we should try to get notion of UTC vs local time zone, and be able to maintain that distinction, correctly. IE, Rust DataFusion should IMHO have first class support of both |
yes that sounds right I am very supportive of an effort to have first class (and correct) timestamp support in Rust both for Arrow and DataFusion. |
Let's make this concrete. Here is a proposal of what I think makes sense (and I think the consistent with @velvia's suggestion for a second argument). The second argument simply serves to define the output type; Note to really make this useful we would also have to provide some way within Arrow and DataFusion to cast back and forth between different timestamps, doing the offset conversions appropriately
I think @adamhooper 's point that "someone might get confused about what timezone the second argument is referring to" at #686 (comment) is legitmate, but one we can handle with appropriate documentation. |
@alamb Could you please write example results? Maybe centered around 1970-01-01 so the math is trivial?
I think allowing
|
Does this seem correct?
I think the one most interesting to me is row 4 so double check that one (also note the presence/absence of a trailing Z to indicate local time or not). By "output when printed" I mean what an implementation would get if it tried to force the value into an ISO-8601 string. Just a few extra thoughts... Does Will Rust ever generate an Output Arrow Type with a time zone string (e.g. MST)? My point with the last two questions is that offset conversions aren't too bad if you aren't handling tzdata strings. Once you do then you need a timezone database which is a bit of a pain but solvable. The C++ impl recently added compute kernels which rely on a timezone database and it's a non-trivial effort. @rok has been spearheading this effort so you might reach out if you go down this road. |
@westonpace Could you please explain what happens on row 4? The rest of it seems perfect to me. Thank you for answering :). |
@adamhooper here is what I had in mind in terms of values. Row 4 is also the most interesting, perhaps. I think this is consistent with what @westonpace has in his table
It handles a bunch of different formats (see source) but mostly trying to
The Rust arrow implementation already can encode the timezone string but the support in compute kernels and datafusion for a timezone other than However, in my mind the point of adding a second argument to |
Oh, I see -- this is existing behavior. But then, does existing behavior define a different result for row 1? https://github.com/apache/arrow-datafusion/blob/5900b4c6829b0bdbe69e1f95fb74e935bc8f33d4/datafusion/src/physical_plan/datetime_expressions.rs#L85 In my view, if the existing behavior has to stay, it has to stay; but if it doesn't have to stay, then as a user I'd feel much more comfortable if DataFusion didn't expose the server's timezone -- or at least made it a session variable, as Postgres does. Note that we're up to four timezones in every call to
My use case is to run custom SQL from the general public. I realize I can use environment variables to force DataFusion's "server timezone" to be UTC; but that doesn't negate the learning curve. Postgres has the same complexity. Its syntax:
... and I'll bet most Postgres misunderstand these concepts for years until they figure it out. |
on master, if you execute
I don't think it has to say I think the choice of where to translate a local time into one with a specific timezone is probably best left to a higher level than DataFusion. DataFusion isn't a database system -- it does not have the notion of a client or server. It is more like a tool kit with which to build such things.
I think we have three timezones that are important:
|
Aside: this case
Then I think what you have proposed matches (as best I can tell) the recently agreed upon spec so that is all good. Usability-wise, I don't know, but I'll admit I don't have much stake in a usability decision (I'm more worried about interoperability). So take this next paragraph with a grain of salt 😆 It feels to me like
This would change the meaning of the fourth row so it would have the value 0. |
Agreed with @adamhooper - as a user I would strongly dislike the row 4 behavior as it would make my logic a function of the environment I'd run it on instead of only a function of the input. Keep in mind that timezones can have nonexistent and ambiguous times that need to be handled when "localizing" (converting wall time) to UTC. In arrow there's an open PR for a |
Indeed, the more I look at this the more I wonder why the first row doesn't use the server time but the fourth row does use the server time. |
Yes I agree that is what it is doing.
Theoretically I agree that a two step process is clearest; The challenge is that the output type of Imagine a query that is like |
Isn't output of |
This sounds exactly like the problems Postgres faced. Postgres' solutions all have predictable types:
If this kind of grammar is too tricky to implement -- or, heck, if DataFusion wants to be more legible -- these solutions could be rewritten in function-style syntax:
Postgres's |
Yes, this is what I thought @velvia was more or less proposing, but instead of those names, use a second optional argument
|
Hi folks,
Excellent discussion; I need to go back and reread the detailed scenarios that Andrew and others laid out.
I did propose a second argument to the to_timestamp() functions, though I’m slightly leaning towards having separate functions, a separate one specifically for timezone conversion or designation - mainly for clarity.
To help us think through it though, let’s take the default case where `to_timestamp()` does NOT have a second argument, which most people (definitely those coming from other SQL dialects) would omit.
Arrow/DataFusion must know the output type currently, so we have to pick one. Say it is Timestamp(_, UTC). Would this be correct? (UTC seems to be the only
The cases are:
- string, with no clear timezone designation. In this case, local (server) timezone might be picked to interpret, and then the results could be converted to be UTC-based timestamp. There is ambiguity though, but one can argue if the original string was UTC it should have “Z” or appropriate offset anyways.
- string, with timezone designation (+/- offset, or Z). In this case we get the timezone from the string, we could convert output to be UTC based.
- int64/uint64. In this case, this is a numeric timestamp from epoch, but there is again ambiguity. Do we treat it as local timezone based or UTC based?
- Timestamp column. If the source is already a timestamp column, the user is looking to probably convert units (say mills to nanos) and preserve the timezone. This is a case where having a fixed output type does not really work well, but for this case we can actually specify that DataFusion produce an array with the same timezone output as the input, so it’s fine.
I apologize in advance if I’m rehashing discussions from earlier in the thread still need to catch up.
… On Jul 22, 2021, at 12:20 PM, Andrew Lamb ***@***.***> wrote:
TO_TIMESTAMP_UTC
TO_TIMESTAMP_LOCALTIME
Yes, this is what I thought @velvia <https://github.com/velvia> was more or less proposing, but instead of those names, use a second optional argument
to_timestamp(.., 'UTC') --> TO_TIMESTAMP_UTC(..)
to_timestamp(..) --> TO_TIMESTAMP_LOCALTIME(..)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#686 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAIDPW4W353SLEIUIWPPZMLTZBVOTANCNFSM473NSWHA>.
|
Timezone aware timestamp parsing was recently added to arrow-rs - apache/arrow-rs#3795 In particular it follow the approach laid out in the arrow standard In particular
This means the following parsing rules
Crucially this means "when parsing a timestamp without a timezone into a timestamp with a timezone, it assumes the given timestamp is already in the destination timezone". Or
It also means that, assuming using Tz and not Local or something else, timestamp parsing is agnostic to the local timezone of the machine, which I think is an extremely important property. |
With the addition of optional chrono format strings in #5398 I don't think having a second argument to the to_timestamp function which may or may not be a desired output TZ would be a good solution now. Technically possible but overloading a parameter to have multiple meanings is in general not a good solution. |
The timestamp types are still the same as reported in this ticket:
We could definitely make a new set of functions like But I agree with @Omega359 that another parameter is probably not a good idea at this time |
Spark has a couple of functions that I think might be useful in this context:
Postgresql provides the following:
|
Describe the bug
I'm not sure I'd call it a "bug" per se, but all of the
to_timestamp*()
functions outputTimestampTypes
in Arrow that have a Timezone field of None. The issues here are:Some("UTC")
the following fails due to incompatible types:WHERE timestamp_col > to_timestamp("2021-06-21T12:00Z")
(because timestamp_col has Some(UTC) but to_timestamp returns None)I'm aware there was a recent vote to treat the None timestamp like a local timestamp, but this isn't always the case either. For example, we have been using the output of None timestamps from DataFusion but we treat all our timestamps as UTC internally.
To Reproduce
Steps to reproduce the behavior:
I can create a PR to have a test to show the above comparison failing. Take some data with a timezone with UTC and do the above comparison.
Expected behavior
The best solution I can think of would be for
to_timestamp(...)
to support a second, optional argument where the timezone can be specified. This allows for casting/conversion and generation of timestamps with a specific timezone requirement. There are some technical challenges which can be overcome:There are other "hacks" which are possible:
Some("UTC")
is equivalent toNone
, so we could make Some("UTC") cast to None and vice versa, but this would lose precision for many people.The text was updated successfully, but these errors were encountered: