[EPIC] Expand date-time schema declarations to describe timezone awareness #8280
Comments
@sherifnada, e.g. by:

```java
STRING_DATETIME(ImmutableMap.of("type", "string", "format", "date-time", "with-timezone", false)),
STRING_DATETIMEZ(ImmutableMap.of("type", "string", "format", "date-time", "with-timezone", true));
```
@tuliren I don't know where in the Java code it would live, but at the protocol level it would live next to the
Got it. Here is where the Java code is: The current setup (

Update: the above comment is wrong. The timestamp string always ends with a
@tuliren @sherifnada it looks to me that we have at least two options here to handle the DB timezone
In my opinion, we would avoid a lot of trouble and could skip adding the additional property to the JSON Schema if we work only with the UTC timezone. What do you think?
I think this only works if the source supports outputting timestamps in UTC. It is possible that the source has no idea what the timezone of a timestamp is. In that case, differentiating between timestamps with and without time zones is still important. My previous comment was wrong; I have added an update there.
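To make the ambiguity concrete, here is a small sketch using plain `java.time` (illustrative, not Airbyte code): a string with an offset resolves to a single instant, while the same wall-clock string without an offset maps to different instants depending on the zone it is interpreted in.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneId;

public class TimezoneAwareness {
    public static void main(String[] args) {
        // With an offset the string pins down exactly one instant.
        OffsetDateTime aware = OffsetDateTime.parse("2021-11-01T00:00:00+03:00");
        System.out.println(aware.toInstant()); // 2021-10-31T21:00:00Z

        // Without an offset it is only a "wall clock" reading; the instant it
        // denotes depends entirely on which zone we choose to interpret it in.
        LocalDateTime naive = LocalDateTime.parse("2021-11-01T00:00:00");
        Instant asUtc = naive.atZone(ZoneId.of("UTC")).toInstant();
        Instant asNewYork = naive.atZone(ZoneId.of("America/New_York")).toInstant();
        System.out.println(asUtc.equals(asNewYork)); // false: same string, different instants
    }
}
```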
Summary on this issue:

Hi @tuliren,
@tuliren @sherifnada I checked a few things with MySQL and made sure that it ignores the timezone for almost all time-related data types except timestamp, returning a constant time value irrespective of the DB timezone. Also, if we try to pass a timezone in the insert statement for a timestamp, it is ignored by the DB engine in favor of the DB timezone. This means that:
It looks like it is common practice for DBs to work with UTC-based time-related data types without storing any information about the time zone (except for timestamp, which always relies on the DB timezone). To avoid double conversion (DB -> source: convert to UTC from the DB timezone; source -> destination: convert from UTC to the destination timezone), we should declare in the source/destination connection that we work in UTC. In that case the source DB will convert timestamp into UTC out of the box, and at the same time the destination will consume the timestamp in UTC and convert it into the DB timezone under the hood.

With the described approach, we would only need to add an additional JDBC parameter for sources/destinations (e.g. serverTimezone=UTC) and use the proper enum from JsonSchemaPrimitives, without huge code changes, and we could avoid zoned time-related enums.
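As a rough sketch of the JDBC-parameter idea: `serverTimezone=UTC` is a real MySQL Connector/J property, and appending it to a connection URL could look like the helper below (the helper name and the simple `?`/`&` handling are illustrative assumptions, not Airbyte code).

```java
public class JdbcUrlExample {
    // Append a server-timezone override to a JDBC URL, assuming a
    // MySQL-style "?key=value&key=value" query string. Illustrative only.
    static String withUtcServerTimezone(String jdbcUrl) {
        String separator = jdbcUrl.contains("?") ? "&" : "?";
        return jdbcUrl + separator + "serverTimezone=UTC";
    }

    public static void main(String[] args) {
        System.out.println(withUtcServerTimezone("jdbc:mysql://localhost:3306/mydb"));
        // jdbc:mysql://localhost:3306/mydb?serverTimezone=UTC
    }
}
```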
Next steps:
Tell us about the problem you're trying to solve
Some of our most used connectors (postgres, mssql, mysql, etc.) make an explicit distinction between dates which declare timezones and those that do not. Broadly speaking there are two types of date types:

- `timestamp with timezone`: for example `2021-11-01T00:00:00+03:00`. Describes an absolute moment in time which can always be resolved to a single instant without any ambiguity.
- `timestamp without timezone`: e.g. `2021-11-01T00:00:00`. Describes a moment in time which can be interpreted in many ways depending on the local timezone. Such a date type inherently does not have any timezone attached to it, and is usually interpreted as "local time". This is useful in cases where the timezone is irrelevant, e.g. when scheduling a job to run at 5pm in the local process' timezone regardless of where that process runs.

JSON Schema does not make a distinction between these two types of dates. It uses the `format: date-time` directive to declare that a particular string conforms to the ISO 8601 standard. However, ISO 8601 does not make any promises about timezone awareness: both `2021-11-01T00:00:00+03:00` and `2021-11-01T00:00:00` are valid ISO 8601 strings.

This makes it impossible to write dates coming from OLTP databases using the appropriate date types in the destination without breaking data integrity. Let's say we're syncing data from postgres to postgres. Currently, the only thing the postgres source can say about the column type is `format: date-time`. If the destination creates a column of type `timestamp without timezone` but the source has timezone info, we drop the TZ info, which breaks data integrity. The same is true in the opposite situation. The only viable workaround is to write all such dates as strings in the destination. This is pretty inconvenient to query, as it pushes the concern of parsing dates onto the user.
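The data-loss case can be sketched in plain `java.time` (the helper name is hypothetical, simulating a destination that created a timezone-less column): the offset on an aware value is silently discarded, after which the original instant can no longer be recovered.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;

public class TzInfoLoss {
    // Simulates writing into a "timestamp without timezone" column:
    // any offset on the incoming value is dropped.
    static LocalDateTime storeAsTimestampWithoutTz(OffsetDateTime value) {
        return value.toLocalDateTime();
    }

    public static void main(String[] args) {
        OffsetDateTime source = OffsetDateTime.parse("2021-11-01T00:00:00+03:00");
        LocalDateTime stored = storeAsTimestampWithoutTz(source);
        // The "+03:00" is gone; the stored value no longer identifies
        // the original instant.
        System.out.println(stored); // 2021-11-01T00:00
    }
}
```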
Describe the solution you’d like
I want a few things:
Taken together, these steps would allow us to stop writing such date times as strings in the destination and write them with the appropriate type.
I can think of three ways, broadly speaking, in which we can achieve this:
- Add a new field `airbyte_type_format` which can add information on top of JSON Schema types. This field can be used flexibly to add information on top of the basic JSON Schema types (string/number/etc.) to give more information about them. For example, it can be used to say that a `string, format: date-time` is `airbyte_type_format: timestamp_with_timezone` or `without_timezone`. But it can also be used to say that, for example, a numeric type is a `short` or `bigInteger`, etc., in the future.
- Add a field `airbyte_timezone_aware: <boolean>` which would only be used for fields with `type: string, format: date-time` to describe this very specific case.

Describe the alternative you’ve considered or used
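As a sketch of what the two proposed annotations might look like as per-type schema fragments (using `java.util.Map` in the style of the enum snippet discussed earlier; the property names mirror the proposals and are not an existing Airbyte API):

```java
import java.util.Map;

public class ProposedSchemas {
    // Option 1: a generic "airbyte_type_format" annotation.
    static final Map<String, Object> TIMESTAMP_WITH_TZ = Map.of(
            "type", "string",
            "format", "date-time",
            "airbyte_type_format", "timestamp_with_timezone");

    // Option 2: a boolean flag scoped to date-time strings only.
    static final Map<String, Object> TIMESTAMP_TZ_AWARE = Map.of(
            "type", "string",
            "format", "date-time",
            "airbyte_timezone_aware", true);

    public static void main(String[] args) {
        System.out.println(TIMESTAMP_WITH_TZ.get("airbyte_type_format"));
        System.out.println(TIMESTAMP_TZ_AWARE.get("airbyte_timezone_aware"));
    }
}
```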
The only alternative which doesn't break data integrity from sources which make a distinction about timezone awareness is to write these values as strings in the destination. This is pretty bad UX because it offloads this concern to the user.
Additional context
This is a very high-value issue, as it would allow us to support writing date-time types from our most used source connectors, i.e. databases.