Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specific timezone support for to_timetamp*() #686

Open
Tracked by #8282 ...
velvia opened this issue Jul 5, 2021 · 33 comments
Open
Tracked by #8282 ...

Specific timezone support for to_timetamp*() #686

velvia opened this issue Jul 5, 2021 · 33 comments
Labels
bug Something isn't working

Comments

@velvia
Copy link
Contributor

velvia commented Jul 5, 2021

Describe the bug
I'm not sure I'd call it a "bug" per se, but all of the to_timestamp*() functions output TimestampTypes in Arrow that have a Timezone field of None. The issues here are:

  • There is no way to specify a specific timezone currently to convert to
  • Timestamp types with different timezones are not comparable / compatible types. For example, if I have data with a timezone of Some("UTC") the following fails due to incompatible types: WHERE timestamp_col > to_timestamp("2021-06-21T12:00Z") (because timestamp_col has Some(UTC) but to_timestamp returns None)

I'm aware there was a recent vote to treat the None timestamp like a local timestamp, but this isn't always the case either. For example, we have been using the output of None timestamps from DataFusion but we treat all our timestamps as UTC internally.

To Reproduce
Steps to reproduce the behavior:
I can create a PR to have a test to show the above comparison failing. Take some data with a timezone with UTC and do the above comparison.

Expected behavior
The best solution I can think of would be for to_timestamp(...) to support a second, optional argument where the timezone can be specified. This allows for casting/conversion and generation of timestamps with a specific timezone requirement. There are some technical challenges which can be overcome:

  • Current code in datetime_expressions.rs for example uses macros which assume a static, const type, which makes it difficult to input variable timezone args
  • The function signature checks would be really complicated

There are other "hacks" which are possible:

  • For our use case, Some("UTC") is equivalent to None, so we could make Some("UTC") cast to None and vice versa, but this would lose precision for many people.
@velvia velvia added the bug Something isn't working label Jul 5, 2021
@velvia
Copy link
Contributor Author

velvia commented Jul 5, 2021

@alamb @westonpace .... your thoughts on this matter would be really welcome. :)

To be specific, if you do to_timestamp(date_string) then it can figure out the timezone from the string, sometimes. However, current architecture of Arrow/DF requires a specific timezone output type.
If you did to_timestamp(int64) then you would not have any timezone info at all, and would require the input in the second argument no matter what for completeness.

There is also the possibility to simplify things and eliminate the timezone field from Arrow, but that's a really big change. For example, Parquet does not store the timezone but simply has a isAdjustedToUTC field. If True, that means timestamps are UTC, and if not that means local timestamp.
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp

@westonpace
Copy link
Member

I'm not sure I'd call it a "bug" per se, but all of the to_timestamp*() functions output TimestampTypes in Arrow that have a Timezone field of None. The issues here are:

This does seem like an inconsistency with the just agreed upon convention. I would expect all of these methods to convert to Timestamp(unit, "UTC").

I'm aware there was a recent vote to treat the None timestamp like a local timestamp, but this isn't always the case either. For example, we have been using the output of None timestamps from DataFusion but we treat all our timestamps as UTC internally.

Correct, I think DataFusion is interpreting Timestamp(unit, None) as "instant". This is still valid, but no longer "conventional" and could cause interoperability problems. That being said, "local" timestamps don't often get stored or travel between implementations. As long as DataFusion treats both Timestamp(unit, None) and Timestamp(unit, UTC) the same then I wouldn't say it is urgent or must-fix. It's probably a topic for discussion within DF.

Timestamp types with different timezones are not comparable / compatible types. For example, if I have data with a timezone of Some("UTC") the following fails due to incompatible types: WHERE timestamp_col > to_timestamp("2021-06-21T12:00Z") (because timestamp_col has Some(UTC) but to_timestamp returns None)

I'm not sure if you are saying the are not comparable or they should not be comparable. They "should" be comparable if you are comparing two columns with a non-None timestamp (e.g. Timestamp(units, "UTC") should be comparable with Timestamp(units, "+0700") but neither should compare to Timestamp(units, None). This is going by the newly discussed "conventional" logic where None is "local" and UTC is "instant. Since DF has it backwards right now you'd need to invert the logic.

The best solution I can think of would be for to_timestamp(...) to support a second, optional argument where the timezone can be specified.

As long as it is optional that is probably a good idea. However, that sounds like it is adding capability and wouldn't fix anything on its own.

There is also the possibility to simplify things and eliminate the timezone field from Arrow, but that's a really big change. For example, Parquet does not store the timezone but simply has a isAdjustedToUTC field. If True, that means timestamps are UTC, and if not that means local timestamp.

I'm not sure exactly what you are saying here. isAdjustedToUTC basically boils down to timezone != null (I might have that backwards). If you are arguing to avoid using timestamps other than UTC or null then I agree that is a viable option. Implementations can always store timestamp info in a separate column or always represent "zoned date times" as structs (given that the only reason to use "zoned date time" in the first place is typically for field extraction).

@westonpace
Copy link
Member

To be specific, if you do to_timestamp(date_string) then it can figure out the timezone from the string, sometimes. However, current architecture of Arrow/DF requires a specific timezone output type.
If you did to_timestamp(int64) then you would not have any timezone info at all, and would require the input in the second argument no matter what for completeness.

I don't think to_timestamp(date_string) should always put the time zone into the data type. It should either never do so and have a second, optional time zone argument (for consistency with the other to_timestamp methods) or it should expose a boolean parameter (keep_timezone or something like that). The time zone is needed in order to figure out the correct instant. However, it is often not important beyond that. What happens if you have strings from different time zones?

@alamb
Copy link
Contributor

alamb commented Jul 6, 2021

@velvia

we have been using the output of None timestamps from DataFusion but we treat all our timestamps as UTC internally.

FWIW this is what we do in IOx as well

The best solution I can think of would be for to_timestamp(...) to support a second, optional argument where the timezone can be specified.

I think a second optional argument for to_timestamp is a fine solution that would be backwards compatible and also allow for improvements going forward

For example, if I have data with a timezone of Some("UTC") the following fails due to incompatible types: WHERE timestamp_col > to_timestamp("2021-06-21T12:00Z") (because timestamp_col has Some(UTC) but to_timestamp returns None)

I think a better solution would be for DataFusion to implement coercion rules between date/time/timestamp types. In your example of a timestamp_col with Timestamp(Milliseconds, Some("UTC")), then I would like DataFusion to handle

WHERE timestamp_col > to_timestamp("2021-06-21T12:00Z")

by adding the following cast (please forgive me the pseudo-SQL)

WHERE timestamp_col > cast (to_timestamp("2021-06-21T12:00Z") as timestamp(UTC))

@velvia
Copy link
Contributor Author

velvia commented Jul 6, 2021

It sounds like everyone agrees second, optional argument to to_timestamp*() to force a specific timezone result works.

As for casting/coercion, we have:

@alamb

I think a better solution would be for DataFusion to implement coercion rules between date/time/timestamp types. In your example of a timestamp_col with Timestamp(Milliseconds, Some("UTC")), then I would like DataFusion to handle

WHERE timestamp_col > to_timestamp("2021-06-21T12:00Z")

by adding the following cast (please forgive me the pseudo-SQL)

WHERE timestamp_col > cast (to_timestamp("2021-06-21T12:00Z") as timestamp(UTC))

The above is saying that Timestamp(_, None) can be coerced into Timestamp(_, Some("UTC")), is that right? (AFAICT, we can't just say that only functions called in WHERE clauses that output None be coerced?) However, a None implies local timestamp, which is not the same as an instant with UTC (although semantically for UrbanLogiq that definitely works. :)

OTOH @westonpace wrote:

I'm not sure if you are saying the are not comparable or they should not be comparable. They "should" be comparable if you are comparing two columns with a non-None timestamp (e.g. Timestamp(units, "UTC") should be comparable with Timestamp(units, "+0700") but neither should compare to Timestamp(units, None). This is going by the newly discussed "conventional" logic where None is "local" and UTC is "instant. Since DF has it backwards right now you'd need to invert the logic.

It seems to me that coercing Timestamp(_, None) to Timestamp(_, "UTC") would not fit the recent convention, at least automatic coercion without context.

So follow up / action items / possibilities:

  1. Should the current output of to_timestamp(...) default to Timestamp(_, Some(UTC)) ? This sounds like a yes (and would solve my use case)
  2. to_timestamp(....) can be extended to add a second argument to accept different time zones, or None (not sure how to specify None if the default is UTC, per item 1)
  3. Coercion ... the details of this needs to be figured out, semantics
  4. Or, simplify everything and have the timestamp types not have timezone, but just if they are UTC or local (similar to Parquet).

BTW I looked up the C/C++ API, and the Python API, and none of those appear to support timezones in the timestamp type, just a unit. It could be argued that 4) would bring the Rust API closer to the other languages APIs, and when arrow datatypes are translated from Flight or the binary representation, how is the timezone encoded, if at all?

(if 4 is a worthy avenue to discuss, I can start a email chain)

cheers and thanks

@alamb
Copy link
Contributor

alamb commented Jul 7, 2021

Should the current output of to_timestamp(...) default to Timestamp(_, Some(UTC)) ? This sounds like a yes (and would solve my use case)

While this might be theoretically the best option, I think this would be a fairly painful change (at least for us in IOx) as our timestamps are Timestamp(Nanoseconds, None)

to_timestamp(....) can be extended to add a second argument to accept different time zones, or None (not sure how to specify None if the default is UTC, per item 1)

Perhaps to_timestamp('....', NULL) could work

Or, simplify everything and have the timestamp types not have timezone, but just if they are UTC or local (similar to Parquet).

I am actually a fan of this direction -- I am fairly sure the use of a Timezone that is something other than None will not work well in Arrow / DataFusion (Because I tried using Timestamp(.., Some("UTC") in IOx and basic query stuff didn't work

I think it would keep things simpler to convert timestamps on the edge (aka if it was "local time" convert to UTC prior to putting into DataFusion) and then have all the processing machinery deal with UTC timestamps only

@adamhooper
Copy link

The best solution I can think of would be for to_timestamp(...) to support a second, optional argument where the timezone can be specified.

I think a second optional argument for to_timestamp is a fine solution that would be backwards compatible and also allow for improvements going forward

I'm not even a user (yet); I see two big great reasons not to add an argument:

1. Ambiguous calling pattern. When converting string to Arrow timestamp, there are two timezones to consider: the input-string timezone and the output-timestamp timezone. Which one would this argument mean?

If I see the call, TO_TIMESTAMP('2021-01-01T00:00:00', 'America/Montreal'), then only result I can imagine as "useful" is 2021-01-01T05:00Z -- that is, "the input is from my clock." (@westonpace, I think we're aligned here.)

Someone else might expect 2020-12-31T19:00-0500, saying, "I have UTC input and I want Montreal wall-clock time." A third person might expect 2021-01-01Z with timezone-metadata America/Montreal.

If three smart people reading the same line of code would interpret it three different ways, that's a design problem.

2. PostgreSQL compatibility. Postgres has two timestamp types; Arrow timestamp+tz-metadata columns are neither.

In Postgres, TO_TIMESTAMP() creates a TIMESTAMP WITH TIME ZONE -- that is, a moment on our shared, global timeline, stored as 64-bit integer since the UNIX epoch. It has no timezone: each moment happens once everywhere in the world, regardless of timezone.

Postgres also has a TIMESTAMP WITHOUT TIME ZONE type -- that is, a calendar-date-plus-wall-clock type ... but Postgres' own TO_TIMESTAMP() doesn't output it. That's on purpose. I think most people using TIMESTAMP WITHOUT TIME ZONE are using it in error, misled by the name. It represents the very opposite of what it sounds like. In Postgres, it's handy for calendaring apps and not much else. (There are other uses in Pandas/R.)

Postgres has gone 30 years without Arrow's timestamp+tz-metadata types.

In the words of Tom Lane, "if we simply took away to_timestamp(), most users would be better off".

Be careful about adding a feature: removing one might help more users accomplish more tasks, more smoothly.

@alamb
Copy link
Contributor

alamb commented Jul 8, 2021

@adamhooper thank you for the perspectives. I see the challenges with adding a second argument to to_timestamp and would love to consider other alternatives.

Can you suggest a strategy for the usecase described in this ticket, namely "writing a query using SQL against data stored in parquet that has an explicit timezone?"

In other words, how do you suggest supporting the following query?

SELECT ...
FROM my_parquet_file_where_timestamp_col_has_a_specified_timezone
WHERE timestamp_col > to_timestamp("2021-06-21T12:00Z")

Which currently fails because the arrow type of to_timestamp has None as the timezone, but the type of of timestamp_col has a specific timezone?

@adamhooper
Copy link

@alamb would anybody be horrified/appalled at the following:

  • to_timestamp() always returns TIMESTAMP WITH TIMEZONE -- i.e., UTC.
  • my_parquet_file_with_isAdjustedToUTC_true are always TIMESTAMP WITH TIMEZONE -- i.e., UTC. (Parquet doesn't store timezones anyway.)
  • my_arrow_ipc_file_with_timezone_america_montreal produces TIMESTAMP WITH TIMEZONE -- i.e., UTC. (DataFusion simply drops the metadata, without warning.)
  • my_parquet_file_with_isAdjustedToUTC_false produces an error for now, saying DataFusion doesn't support TIMESTAMP WITHOUT TIME ZONE.
  • my_arrow_ipc_file_with_timezone_null gives a deprecation warning and TIMESTAMP WITH TIMEZONE; then, some version in the future, DataFusion re-implements it as TIMESTAMP WITHOUT TIMEZONE.

To me, dropping timezone info is an easy (if rude) decision. DataFusion implements a subset of Postgres, and Postgres has no timezone metadata. There's an argument to be made for passing the timezone through unmodified; but that would be hard/convoluted, probably needing new syntax; and does anybody actually want that? IPC with datetime+timezone tuples is nonstandard and heavyweight; Spark doesn't even have a type for that. Forcing UTC won't stop anybody from accomplishing anything.

So the only hairy decision is what to do with Arrow timezone=null. Up until today, DataFusion users have been using it as TIMESTAMP WITH TIMEZONE, right? If that's the case, then there's great news: nobody who's using DataFusion is using TIMESTAMP WITHOUT TIMEZONE :). Sounds like that type would be a new feature ... and I expect it would be a low priority, since nobody asked for it. (Again, Spark has no such type.)

(Primer on Postgres, for the uninitiated: TIMESTAMP WITH TIME ZONE means UTC; it doesn't store a timezone.)

@velvia
Copy link
Contributor Author

velvia commented Jul 8, 2021

@AdamHopper personally I'd be OK with your recommendations, but some thoughts:

  • There is some interop considerations between TIMESTAMP WITH TIMEZONE ie TimestampType(, UTC) vs TimestampType(, None). The reason is that so far DataFusion has been generating TimestampType(_, None). If this changes to UTC, which makes sense as far as conventions go, then we need some way for people with data with Timestamp(, None) to be able to convert it over to (, UTC).
  • I'm not sure it would be acceptable to throw an error at isAdjustedToUTC = false as I'm sure people have Parquet files out there with this, and then DataFusion/Arrow wouldn't be able to read it. But maybe it's OK if you give people a way to cast / convert the timestamps over.

Also, I see your point about to_timestamp() having a timezone argument being confusing. However there is a second use of to_timestamp() right now, which is for conversion purposes, from other types to Timestamp(resolution, tz). The reason it exists like this is because there is no standard SQL/PG way to work with multiple timestamp resolutions, which are required and supported by all Arrow implementations.

Thus, while TO_TIMESTAMP('2021-01-01T00:00:00', 'America/Montreal') might be confusing, I believe the following would be more clear:

  • to_timestamp_millis(my_timestamp_col, 'UTC') -> cast my_timestamp_col from int or another timestamp resolution to millisecond resolution, and make sure it has UTC timezone

@adamhooper
Copy link

@velvia Great points

My last two suggestions were exactly about interop -- and a transition period. Today, DataFusion users interpret timezone=null to mean TIMESTAMP WITH TIMEZONE; but eventually, DataFusion must consider timezone=null to mean TIMESTAMP WITHOUT TIMEZONE, right? Users will need to change what they're doing; the transition path will be hard and it depends on the DataFusion community.

I hadn't thought of timestamp resolution. Again, I'm out of my element (for now) :).

The crux of my suggestion is to ignore the timezone metadata field (treat it as a boolean, timezone=null or timezone=UTC). That translates Parquet <=> Arrow <=> PostgreSQL cleanly. Treating it like a boolean keeps the feature list small and saves people from confusion.

As for the TO_TIMESTAMP() parameter: Postgres has a different function might do exactly what you're suggesting. AT TIME ZONE toggles the "WITH TIMEZONE-ness" of a timestamp. It's also callable as a function, TIMEZONE(zone, timestamp).

But -- sidetracking here -- how important are these types? TIMESTAMP WITH TIME ZONE is clearly mission-critical. I think TIMESTAMP WITHOUT TIME ZONE has its place; but it's in a different ballpark, right. (Even Spark doesn't have TIMESTAMP WITHOUT TIME ZONE; and, well, does DataFusion?) Arrow's timezone metadata column is even more obscure: I've only heard of it in Pandas and R.

@velvia
Copy link
Contributor Author

velvia commented Jul 16, 2021

@alamb @westonpace and others: just to continue this conversation and try and bring it to a close.

  1. It sounds like the Arrow/Rust community would like to transition to supporting Timestamp(_, UTC) as the standard for UTC-based processing, rather than Timestamp(_, None).
  2. Since some people already use the existing Timestamp(_, None) to really mean Timestamp(_, UTC) (including us in some cases), and there is also data (in Parquet) that has at least UTC timezone info, we should provide interop to convert between the two, or at least cast things between.
  3. A timezone() function or with_timezone() or at_timezone() function would be clearer than adding a second argument to to_timestamp(...)

Do the above sound about right?

I would also like to observe current obstacles to 1) and 2) in the Arrow/DF codebase (there are more, just what I've found in the last ~2 weeks):

  • ScalarValue in DataFusion only supports Timestamp types with None timezone
  • Coercion of Timestamp(, UTC) to Timestamp(, None) (or the other way around) is potentially wrong and also has some schema type check issues

@velvia
Copy link
Contributor Author

velvia commented Jul 17, 2021

To add to the above list, to_timestamp(string) should really return a Timestamp Scalar with the proper timezone derived from parsing the string. Currently no TZ info is returned, but for example if the date time string has "Z" in it, then the resulting scalar timezone should be "UTC".

The bigger question really is how much effort to pour into all this. Since the other Arrow implementations and the Arrow spec already specifies TimeZone, and due to many issues mentioned above, it seems to me that at least we should try to get notion of UTC vs local time zone, and be able to maintain that distinction, correctly. IE, Rust DataFusion should IMHO have first class support of both Timestamp(_, None) as well as Timestamp(_, UTC), everywhere.

@alamb
Copy link
Contributor

alamb commented Jul 17, 2021

Do the above sound about right?

yes that sounds right

I am very supportive of an effort to have first class (and correct) timestamp support in Rust both for Arrow and DataFusion.

@alamb
Copy link
Contributor

alamb commented Jul 21, 2021

Let's make this concrete. Here is a proposal of what I think makes sense (and I think the consistent with @velvia's suggestion for a second argument).

The second argument simply serves to define the output type; Note to really make this useful we would also have to provide some way within Arrow and DataFusion to cast back and forth between different timestamps, doing the offset conversions appropriately

Example Output Arrow Type Value
to_timestamp('2021-07-20') Timestamp(Nanoseconds, None) nanos since unix epic until 2021-07-20 local time
to_timestamp('2021-07-20+00:00') Timestamp(Nanoseconds, None) nanos since unix epic until 2021-07-20 UTC
to_timestamp('2021-07-20+01:00') Timestamp(Nanoseconds, None) nanos since unix epic until 2021-07-20 UTC+1
to_timestamp('2021-07-20', 'UTC') Timestamp(Nanoseconds, Utc) nanos since unix epic until 2021-07-20 local time
to_timestamp('2021-07-20+00:00', 'UTC') Timestamp(Nanoseconds, Utc) nanos since unix epic until 2021-07-20 UTC,
to_timestamp('2021-07-20+01:00', 'UTC') Timestamp(Nanoseconds, Utc) nanos since unix epic until 2021-07-20 UTC NOTE only this timestamps value needs to be shifted to some other value to be relative to UTC

I think @adamhooper 's point that "someone might get confused about what timezone the second argument is referring to" at #686 (comment) is legitmate, but one we can handle with appropriate documentation.

@adamhooper
Copy link

@alamb Could you please write example results? Maybe centered around 1970-01-01 so the math is trivial?

TO_TIMESTAMP('1970-01-01+01:00') => will this produce -3_600_000_000_000? 0? Or +3_600_000_000_000? (I would expect -3_600_000_000_000 regardless of second param, because why would the +01:00 be in the input string otherwise....)

I think allowing 'UTC' as a second parameter seems not-terrible. (That's a compliment ;).) It's backwards-compatible and not too confusing. You could even plod towards setting it by default, across major versions, by adding a 'local' magic default value; deprecating NULL; switching the default to 'UTC'; deprecating the param entirely; then finally dropping the param. I think allowing other (non-'UTC') values would be confusing to a doc-defeating extent (and of questionable value, since Parquet doesn't support other timezones).

TO_TIMESTAMP('1970-01-01') => you wrote "local time" in two examples, and I think that's ambiguous because it defines the result in terms of the result. Do you mean the client's localtime? The server's localtime? The session localtime? Or "calendar-date-plus-wall-time" -- e.g., "nanos since unix epoch until 1970-01-01 UTC"? So TO_TIMESTAMP('1970-01-01') would always give 0 and TO_TIMESTAMP('1970-01-01', 'UTC') would always give 0?

TO_TIMESTAMP('1970-01-01+01:00', 'UTC') => -3_600_000_000_000?

@westonpace
Copy link
Member

westonpace commented Jul 21, 2021

Does this seem correct?

Example Output Arrow Type Value Output when printed
to_timestamp('2021-07-20') Timestamp(Nanoseconds, None) nanos since unix epic until 2021-07-20 local time 2021-07-20T00:00:00
to_timestamp('2021-07-20+00:00') Timestamp(Nanoseconds, None) nanos since unix epic until 2021-07-20 UTC 2021-07-20T00:00:00
to_timestamp('2021-07-20+01:00') Timestamp(Nanoseconds, None) nanos since unix epic until 2021-07-20 UTC+1 2021-07-19T23:00:00
to_timestamp('2021-07-20', 'UTC') Timestamp(Nanoseconds, Utc) nanos since unix epic until 2021-07-20 local time 2021-07-19T00:00:00Z
to_timestamp('2021-07-20+00:00', 'UTC') Timestamp(Nanoseconds, Utc) nanos since unix epic until 2021-07-20 UTC 2021-07-20T00:00:00Z
to_timestamp('2021-07-20+01:00', 'UTC') Timestamp(Nanoseconds, Utc) nanos since unix epic until 2021-07-20 UTC 2021-07-20T23:00:00Z

I think the one most interesting to me is row 4 so double check that one (also note the presence/absence of a trailing Z to indicate local time or not).

By "output when printed" I mean what an implementation would get if it tried to force the value into an ISO-8601 string.

Just a few extra thoughts...

Does to_timestamp only accept offsets in the strings (e.g. +01:00 and not tzdata strings or things like MST)? For reference, ISO-8601 only deals with offsets.

Will Rust ever generate an Output Arrow Type with a time zone string (e.g. MST)?

My point with the last two questions is that offset conversions aren't too bad if you aren't handling tzdata strings. Once you do then you need a timezone database which is a bit of a pain but solvable. The C++ impl recently added compute kernels which rely on a timezone database and it's a non-trivial effort. @rok has been spearheading this effort so you might reach out if you go down this road.

@adamhooper
Copy link

@westonpace Could you please explain what happens on row 4?

The rest of it seems perfect to me. Thank you for answering :).

@alamb
Copy link
Contributor

alamb commented Jul 21, 2021

@adamhooper here is what I had in mind in terms of values. Row 4 is also the most interesting, perhaps. I think this is consistent with what @westonpace has in his table

Example Output Arrow Type Value
to_timestamp('1970-01-01') Timestamp(Nanoseconds, None) 0
to_timestamp('1970-01-01+00:00') Timestamp(Nanoseconds, None) 0
to_timestamp('1970-01-01+01:00') Timestamp(Nanoseconds, None) -3600000000000 (1 hour of nanoseconds)
to_timestamp('1970-01-01', 'UTC') Timestamp(Nanoseconds, Utc) 5*3600000000000 (5 hour of nanoseconds) assuming code was run on a computer set to EDT, UTC-5:00
to_timestamp('1970-01-01+00:00', 'UTC') Timestamp(Nanoseconds, Utc) 0
to_timestamp('1970-01-01+01:00', 'UTC') Timestamp(Nanoseconds, Utc) -3600000000000 (1 hour of nanoseconds)

Does to_timestamp only accept offsets in the strings (e.g. +01:00 and not tzdata strings or things like MST)? For reference, ISO-8601 only deals with offsets.

It handles a bunch of different formats (see source) but mostly trying to

/// Examples of accepted inputs:
/// * `1997-01-31T09:26:56.123Z`        # RCF3339
/// * `1997-01-31T09:26:56.123-05:00`   # RCF3339
/// * `1997-01-31 09:26:56.123-05:00`   # close to RCF3339 but with a space rather than T
/// * `1997-01-31T09:26:56.123`         # close to RCF3339 but no timezone offset specified
/// * `1997-01-31 09:26:56.123`         # close to RCF3339 but uses a space and no timezone offset
/// * `1997-01-31 09:26:56`             # close to RCF3339, no fractional seconds

Will Rust ever generate an Output Arrow Type with a time zone string (e.g. MST)?

The Rust arrow implementation already can encode the timezone string but the support in compute kernels and datafusion for a timezone other than None is basically non existent

However, in my mind the point of adding a second argument to to_timestamp is to allow the construction of output arrow types with timezone strings. The tz database has a rust wrapper which we may have to pull in https://github.com/chronotope/chrono-tz

@adamhooper
Copy link

adamhooper commented Jul 21, 2021

to_timestamp('1970-01-01', 'UTC') Timestamp(Nanoseconds, Utc) 5*3600000000000 (5 hour of nanoseconds) assuming code was run on a computer set to EDT, UTC-5:00

Oh, I see -- this is existing behavior. But then, does existing behavior define a different result for row 1? https://github.com/apache/arrow-datafusion/blob/5900b4c6829b0bdbe69e1f95fb74e935bc8f33d4/datafusion/src/physical_plan/datetime_expressions.rs#L85

In my view, if the existing behavior has to stay, it has to stay; but if it doesn't have to stay, then as a user I'd feel much more comfortable if DataFusion didn't expose the server's timezone -- or at least made it a session variable, as Postgres does.

Note that we're up to four timezones in every call to TO_TIMESTAMP():

  • NULL: "localtime" the Arrow concept -- calendar-date-plus-wall-time
  • UTC
  • The input-string timezone offset
  • DataFusion server's "local timezone" (distinct from Arrow's "local time", and presumably unknown to my users)

My use case is to run custom SQL from the general public. I realize I can use environment variables to force DataFusion's "server timezone" to be UTC; but that doesn't negate the learning curve.

Postgres has the same complexity. Its syntax:

  • SELECT timestamp_with_time_zone AT TIME ZONE 'America/Eastern' => converts UTC to localtime
  • SELECT timestamp_with_time_zone::TIMESTAMP => converts UTC to session-timezone localtime
  • `SELECT timestamp_without_time_zone AT TIME ZONE 'America/Eastern' => converts localtime to UTC
  • SELECT timestamp_without_time_zone::TIMESTAMPTZ => converts localtime to UTC using session-timezone localtime

... and I'll bet most Postgres misunderstand these concepts for years until they figure it out.

@alamb
Copy link
Contributor

alamb commented Jul 21, 2021

Oh, I see -- this is existing behavior. But then, does existing behavior define a different result for row 1?

on master, if you execute to_timestamp('1970-01-01') on a server running DataFusion in a UTC-5 timezone, you get 5*3600000000000

In my view, if the existing behavior has to stay, it has to stay;

I don't think it has to say

I think the choice of where to translate a local time into one with a specific timezone is probably best left to a higher level than DataFusion. DataFusion isn't a database system -- it does not have the notion of a client or server. It is more like a tool kit with which to build such things.

Note that we're up to four timezones in every call to TO_TIMESTAMP():

I think we have three timezones that are important:

  1. The input string timezone offset (may not exist)
  2. The local timezone on which datafusion is run (used in the case the user does not supply a timezone)
  3. The target timezone of the arrow type

@westonpace
Copy link
Member

westonpace commented Jul 21, 2021

Aside: this case to_timestamp('2021-07-20', 'UTC') could be ambiguous if the server time zone happens to be in a DST shift.

I think this is consistent with what @westonpace has in his table

Then I think what you have proposed matches (as best I can tell) the recently agreed upon spec so that is all good. Usability-wise, I don't know, but I'll admit I don't have much stake in a usability decision (I'm more worried about interoperability). So take this next paragraph with a grain of salt 😆

It feels to me like to_timestamp is doing two steps. First, it is converting to a string to a Timestamp (respecting the offset in the string, if any). Second, it is "reinterpret" casting (schema only change) that timestamp to the desired local/instant semantics. I'd personally prefer something that breaks things into two steps if needed.

Example One Step Type Two Step Equivalent Two Step Type
to_timestamp('1970-01-01') Timestamp(Nanoseconds, None) reinterpret_zone(to_timestamp('1970-01-01'), None) Timestamp(Nanoseconds, None)
to_timestamp('1970-01-01+00:00') Timestamp(Nanoseconds, Utc) reinterpret_zone(to_timestamp('1970-01-01+00:00'), None) Timestamp(Nanoseconds, None)
to_timestamp('1970-01-01+01:00') Timestamp(Nanoseconds, +01:00) reinterpret_zone(to_timestamp('1970-01-01+01:00'), None) Timestamp(Nanoseconds, None)
to_timestamp('1970-01-01', 'UTC') Timestamp(Nanoseconds, None) reinterpret_zone(to_timestamp('1970-01-01'), 'UTC') Timestamp(Nanoseconds, Utc)
to_timestamp('1970-01-01+00:00', 'UTC') Timestamp(Nanoseconds, Utc) reinterpret_zone(to_timestamp('1970-01-01+00:00', 'UTC'), 'Utc') Timestamp(Nanoseconds, Utc)
to_timestamp('1970-01-01+01:00', 'UTC') Timestamp(Nanoseconds, +01:00) reinterpret_zone(to_timestamp('1970-01-01+01:00'), 'UTC') Timestamp(Nanoseconds, Utc)

This would change the meaning of the fourth row so it would have the value 0.

@rok
Copy link
Member

rok commented Jul 21, 2021

Agreed with @adamhooper - as a user I would strongly dislike the row 4 behavior as it would make my logic a function of the environment I'd run it on instead of only a function of the input.

Keep in mind that timezones can have nonexistent and ambiguous times that need to be handled when "localizing" (converting wall time) to UTC. In arrow there's an open PR for a tz_localize kernel.

@westonpace
Copy link
Member

Indeed, the more I look at this the more I wonder why the first row doesn't use the server time but the fourth row does use the server time.

@alamb
Copy link
Contributor

alamb commented Jul 22, 2021

It feels to me like to_timestamp is doing two steps. First, it is converting to a string to a Timestamp (respecting the offset in the string, if any). Second, it is "reinterpret" casting (schema only change) that timestamp to the desired local/instant semantics.

Yes I agree that is what it is doing.

I'd personally prefer something that breaks things into two steps if needed.

Theoretically I agree that a two step process is clearest; The challenge is that the output type of to_timestamp needs to be known in advance (aka planning time) prior to access to the values of the strings.

Imagine a query that is like to_timestamp(string_column) that needs to produce an array that has uniform type. What type is chosen? Assuming the second argument must be a string constant (aka would not allow to_timestamp(timestamp_column, timezone_column) the output type can be known with the proposal in #686 (comment))

@rok
Copy link
Member

rok commented Jul 22, 2021

Theoretically I agree that a two step process is clearest; The challenge is that the output type of to_timestamp needs to be known in advance (aka planning time) prior to access to the values of the strings.

Isn't output of to_timestamp just timestamp(unit, timezone_string)? (I'm not familiar with Rust implementation).

@adamhooper
Copy link

Theoretically I agree that a two step process is clearest; The challenge is that the output type of to_timestamp needs to be known in advance (aka planning time) prior to access to the values of the strings.

This sounds exactly like the problems Postgres faced. Postgres' solutions all have predictable types:

  • SELECT CAST('2021-07-22T9:40:00-04:00' AS TIMESTAMPTZ) => 2021-07-22 13:40 (zone=UTC)
  • SELECT CAST('2021-07-22T9:40:00-04:00' AS TIMESTAMP) => 2021-07-22 09:40 (zone=null)
  • SELECT '2021-07-22T13:40Z'::TIMESTAMPTZ AT TIME ZONE 'Europe/Paris' => 2021-07-22 15:40 (zone=null)
  • SELECT '2021-07-22T15:40'::TIMESTAMP AT TIME ZONE 'Europe/Paris' => 2021-07-22 13:40 (zone=UTC)

If this kind of grammar is too tricky to implement -- or, heck, if DataFusion wants to be more legible -- these solutions could be rewritten in function-style syntax:

  • TO_TIMESTAMP_UTC() => input is string, output is Timestamp(UTC). Absence of offsets in the input string means they're interpreted as UTC. (Postgres interprets the absence of offsets as session timezone, not UTC; I'm proposing DataFusion hard-code UTC until it gets session timezone ... which may be never ;).)
  • TO_TIMESTAMP_LOCALTIME() => input is string, output is Timestamp(null). Offsets in the input string are ignored.
  • TIMESTAMP_UTC_TO_LOCALTIME(timestamp_utc, zone) => input is Timestamp(UTC), output is Timestamp(null).
  • TIMESTAMP_LOCALTIME_TO_UTC(timestamp_localtime, zone) => input is Timestamp(null), output is Timestamp(UTC).

Postgres's TO_TIMESTAMP() is a quasi-alias for the first item in my lists. It outputs a fixed type.

@alamb
Copy link
Contributor

alamb commented Jul 22, 2021

TO_TIMESTAMP_UTC
TO_TIMESTAMP_LOCALTIME

Yes, this is what I thought @velvia was more or less proposing, but instead of those names, use a second optional argument

to_timestamp(.., 'UTC') --> TO_TIMESTAMP_UTC(..)
to_timestamp(..) --> TO_TIMESTAMP_LOCALTIME(..)

@velvia
Copy link
Contributor Author

velvia commented Jul 22, 2021 via email

@tustvold
Copy link
Contributor

tustvold commented Mar 12, 2023

Timezone aware timestamp parsing was recently added to arrow-rs - apache/arrow-rs#3795

In particular it follow the approach laid out in the arrow standard

In particular

/// If a Timestamp column has a non-empty timezone, changing the timezone
/// to a different non-empty value is a metadata-only operation:
/// the timestamp values need not change as their point of reference remains
/// the same (the Unix epoch).
///
/// However, if a Timestamp column has no timezone value, changing it to a
/// non-empty value requires to think about the desired semantics.
/// One possibility is to assume that the original timestamp values are
/// relative to the epoch of the timezone being set; timestamp values should
/// then adjusted to the Unix epoch (for example, changing the timezone from
/// empty to "Europe/Paris" would require converting the timestamp values
/// from "Europe/Paris" to "UTC", which seems counter-intuitive but is
/// nevertheless correct).

This means the following parsing rules

Example Output Arrow Type Value
to_timestamp('1970-01-01') Timestamp(Nanoseconds, None) 0
to_timestamp('1970-01-01+00:00') Timestamp(Nanoseconds, None) 0
to_timestamp('1970-01-01+01:00') Timestamp(Nanoseconds, None) -3600000000000 (1 hour of nanoseconds)
to_timestamp('1970-01-01', 'UTC') Timestamp(Nanoseconds, Utc) 0
to_timestamp('1970-01-01+00:00', 'UTC') Timestamp(Nanoseconds, Utc) 0
to_timestamp('1970-01-01+01:00', 'UTC') Timestamp(Nanoseconds, Utc) -3600000000000 (1 hour of nanoseconds)
to_timestamp('1970-01-01', '+01:00') Timestamp(Nanoseconds, +01:00) -3600000000000 (1 hour of nanoseconds)
to_timestamp('1970-01-01+00:00', '+01:00') Timestamp(Nanoseconds, +01:00) 0
to_timestamp('1970-01-01+01:00', '+01:00') Timestamp(Nanoseconds, +01:00) -3600000000000 (1 hour of nanoseconds)

Crucially this means "when parsing a timestamp without a timezone into a timestamp with a timezone, it assumes the given timestamp is already in the destination timezone". Or

to_timestamp('${timestamp}', '${tz}') == to_timestamp('${timestamp}${tz}', *) == to_timestamp('${timestamp}${tz}')

It also means that, assuming using Tz and not Local or something else, timestamp parsing is agnostic to the local timezone of the machine, which I think is an extremely important property.

@Omega359
Copy link
Contributor

With the addition of optional chrono format strings in #5398 I don't think having a second argument to the to_timestamp function which may or may not be a desired output TZ would be a good solution now. Technically possible but overloading a parameter to have multiple meanings is in general not a good solution.

@alamb
Copy link
Contributor

alamb commented Feb 12, 2024

The timestamp types are still the same as reported in this ticket:

❯ select arrow_typeof(to_timestamp(12345));
+------------------------------------------+
| arrow_typeof(to_timestamp(Int64(12345))) |
+------------------------------------------+
| Timestamp(Nanosecond, None)              |
+------------------------------------------+
1 row in set. Query took 0.002 seconds.

❯ select arrow_typeof(to_timestamp_micros(12345));
+-------------------------------------------------+
| arrow_typeof(to_timestamp_micros(Int64(12345))) |
+-------------------------------------------------+
| Timestamp(Microsecond, None)                    |
+-------------------------------------------------+
1 row in set. Query took 0.002 seconds.

We could definitely make a new set of functions like to_timestamp_utc or something that returned a Timestamp(.., "UTC").

But I agree with @Omega359 that another parameter is probably not a good idea at this time

@Omega359
Copy link
Contributor

Spark has a couple of functions that I think might be useful in this context:

  • convert_timezone([sourceTz, ]targetTz, sourceTs) - converts a timestamp without a tz from the given source tz to the target tz
  • from_utc_timestamp(timestamp, timezone) - takes a timestamp in UTC and converts it to the given tz

Postgresql provides the following:

  • timestamp without time zone AT TIME ZONE zone
  • timestamp with time zone AT TIME ZONE zone
  • timezone(zone, timestamp) / timestamp AT TIME ZONE zone

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants