Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EPIC] Expand date-time schema declarations to describe timezone awareness #8280

Closed
2 of 7 tasks
sherifnada opened this issue Nov 29, 2021 · 11 comments
Closed
2 of 7 tasks
Labels
area/connectors Connector related issues area/protocol Epic normalization team/destinations Destinations team's backlog type/enhancement New feature or request

Comments

@sherifnada
Copy link
Contributor

sherifnada commented Nov 29, 2021

Tell us about the problem you're trying to solve

Some of our most used connectors (postgres, mssql, mysql, etc..) make an explicit distinction between dates which declare timezones and those that do not. Broadly speaking there are two types of date types:

  • timestamp with timezone: For example 2021-11-01T00:00:00Z+3 describes an absolute moment in time which can always be resolved to a single instant without any ambiguity
  • timestamp without timezone: e.g: 2021-11-01T00:00:00 describes a moment in time which can be interpreted in many ways depending on the local timezone. Such a date type inherently does not have any timezones attached to it, and is usually inferred as "local time". This is useful in cases where timezone is irrelevant e.g: when scheduling a job to run at 5pm in the local process' timezone regardless of where this process runs.

JSON schema does not make a distinction between these two types of dates. It uses the format: date-time directive to describe that a particular string conforms to the ISO8601 standard. However, ISO8601 does not make any promises about timezone awareness. For example, both the strings 2021-11-01T00:00:00Z+3 and 2021-11-01T00:00:00 are valid ISO8601 strings.

This makes it impossible to write any dates coming from OLTP databases using the appropriate date types in the destination without breaking data integrity. Let's say we're syncing data from postgres to postgres. Currently, the only thing the postgres source DB can say about the column type is format: date-time. If the destination creates a column type timestamp without timezone but the source has timezone info, we drop TZ info which breaks data integrity. Same thing in the opposite situation.

The only viable workaround is to write all such dates as strings in the destination. This is pretty inconvenient to query as it pushes the concern of parsing dates onto the user.

Describe the solution you’d like

I want a few things:

  • the airbyte protocol to make a distinction between datetime types based on their timezone awareness. Change JsonSchemaPrimitive to a class #9913
  • destinations/normalization to take this into account and create columns with the appropriate type
  • destination acceptance tests should verify that destinations can appropriately handle these types
  • source DBs to appropriately leverage these new types when declaring the types of their columns
  • source acceptance tests should verify that sources output the correct types
  • If we choose a solution which deviates from JSON Schema, specifically document the extra JSONSchema fields we support/add to the specification.

Taken together, these steps would allow us to stop writing such date times as strings in the destination and write them with the appropriate type.

I can think of three ways, broadly speaking, in which we can achieve this:

  1. Add a new annotation airbyte_type_format which can add information on top of JSON Schema types. This field can be flexibly used to add information on top of the basic json-schema types (string/number/etc..) to give more information about them. For example, it can be used to say that a string, format: date-time is airbyte_type_format: timestamp_with_timezone or without_timezone. But it can also be used to say that for example, a numeric type is a short or bigInteger etc.. in the future.
  2. Add a specific annotation for this specific case. E.g add an annotation airbyte_timezone_aware: <boolean> which would only be used for fields with type/format type: string, format:date-time to describe this very specific case.
  3. Use a wholly different type system than json-schema. I'm discarding this solution because it's too disruptive right now.

Describe the alternative you’ve considered or used

The only alternative which doesn't break data integrity from sources which make a distinction about timezone awareness is to write these values as strings in the destination. This is pretty bad UX because it offloads this concern to the user.

Additional context

This is a very high value issue as it would allow us to support writing date time types from our most used source connectors i.e: databases.

@sherifnada sherifnada added type/enhancement New feature or request area/connectors Connector related issues normalization area/protocol labels Nov 29, 2021
@sherifnada
Copy link
Contributor Author

cc @tuliren @keu

@alexandr-shegeda alexandr-shegeda self-assigned this Dec 9, 2021
@tuliren
Copy link
Contributor

tuliren commented Dec 12, 2021

@sherifnada, by annotation, you mean we add extra enum with new fields in the current jsonSchemaTypeMap in JsonSchemaPrimitive, right? Like currently a date has { type: string, format: date-time }. Proposal 2 means two new datetime strings:

STRING_DATETIME(ImmutableMap.of("type", "string", "format", "date-time", "with-timezone": false)),
STRING_DATETIMEZ(ImmutableMap.of("type", "string", "format", "date-time", "with-timezone": true));

@sherifnada
Copy link
Contributor Author

@tuliren IDK where in the java code it would live, but at the protocol level it would live next to the type parameter e.g

{
type: string
airbyte_type_format: timestamp_with_timezone
}

@tuliren
Copy link
Contributor

tuliren commented Dec 13, 2021

Got it.

Here is where the Java code is:
https://github.com/airbytehq/airbyte/blob/master/airbyte-protocol/models/src/main/java/io/airbyte/protocol/models/JsonSchemaPrimitive.java#L11

The current setup (jsonSchemaTypeMap) is consistent with what you have in mind. We can just create a new enum.

@tuliren
Copy link
Contributor

tuliren commented Dec 13, 2021

Actually I think currently we return a UTC timestamp. So the timestamp string we generate should always ends with a Z, standing for UTC. Is it not the case? If not, we can just do that, and we don't need a timezone aware string type.


Update: the above comment is wrong. The timestamp string always ends with a Z because we parse the timestamp in this way (via DataTypeUtils#toISO8601String). It does not mean that the timestamp returned from source is actually in UTC.

@alexandr-shegeda
Copy link
Contributor

@tuliren @sherifnada it looks to me that we have at least two options here to handle the DB timezone

In my opinion, we will avoid a lot of trouble and we can skip adding the additional property to JSON Schema in case we will work only with the UTC timezone, what do you think?

@tuliren
Copy link
Contributor

tuliren commented Dec 13, 2021

In my opinion, we will avoid a lot of trouble and we can skip adding the additional property to JSON Schema in case we will work only with the UTC timezone, what do you think?

I think this only works if the source supports outputting a timestamp in UTC. It is possible that the source has no idea about what the timezone is for a timestamp. In that case, differentiating between timestamp with or without time zone is still important.

My previous comment was wrong. I have added an update there.

@tuliren
Copy link
Contributor

tuliren commented Dec 20, 2021

Summary on this issue:

  • The current proposal is: when adding new time-related enums, add a timezone-neutral enum, and a timezone-aware one together.
  • The timezone-neutral value should not have any timezone information in it (e.g. no Z at the end).
  • In this way, no information is lost. This is the best we can do.
  • The destination needs to properly handle different time types. If no timezone information is available from the source side, and a destination cannot handle a timezone-neutral value, there is nothing we can do about it.

@anatolec
Copy link

Hi @tuliren,
That sounds like a good solution. When do you think it will be available ?
Also, what would happen with existing syncs ?
Thanks !

@alexandr-shegeda
Copy link
Contributor

@tuliren @sherifnada I checked a few things with MySQL DB and made sure that it doesn't care about timezone almost for all time-related data types except timestamp, and it will return a constant time value irrespective of the DB timezone.
For example, we have date-time with value "2011-10-04 12:58:36", so with session timezone +00:00 or -06:00 it will return the same value "2011-10-04 12:58:36".

Also, in case we will try to pass the timezone in the insert statement for timestamp it will be ignored by the DB engine in favor of the DB timezone. This means that:

  • every time we get a result-set with a timestamp - it's converted according to the DB timezone. For example, timestamp value in UTC is "2011-10-04 12:58:36", it will be converted to "2011-10-04 10:58:36" in case DB timezone is -02:00
  • every date that we get from the result-set must be in UTC

It looks like it is a common practice for DBs to work with UTC-based time-related data types, without having the information about the time zone (except for timestamp, which always relies on DB timezone).

In order to avoid double converting (DB -> source = convert to UTC from DB timezone, source -> destination = convert from UTC to destination timezone) we should declare in the connection for source/destination that we work in UTC, in this case, source DB will convert timestamp into UTC out of the box and at the same time destination will consume the timestamp in UTC and convert it into DB timezone under the hood.

With the described approach, we will need to add additional JDBC param for sources/destinations (e.g. serverTimezone=UTC) and use proper enum from the JsonSchemaPrimitives without huge code changes and we can avoid zoned time-related enums.

@sherifnada sherifnada moved this from Prioritized for scoping to Ready for implementation in GL Roadmap Jan 12, 2022
@alexandr-shegeda
Copy link
Contributor

@tuliren @edgao FYI:

Next steps:

  • Implement additional field airbyte_type_format to describe additional information whether date-related type timezone aware/unaware (will be implemented in the scope of this issue)
  • for TIMESTAMPTZ and TIMETZ implement different logic for JDBC-based DBs to pass additional fields to the destination
    cc @misteryeo @sherifnada

@alexandr-shegeda alexandr-shegeda moved this from Ready for implementation (prioritized) to Backlog (scoped) in GL Roadmap Mar 22, 2022
@alexandr-shegeda alexandr-shegeda moved this from Backlog (scoped) to Ready for implementation (prioritized) in GL Roadmap Apr 29, 2022
@alexandr-shegeda alexandr-shegeda changed the title Expand date-time schema declarations to describe timezone awareness [EPIC] Expand date-time schema declarations to describe timezone awareness May 17, 2022
@grishick grishick added the team/destinations Destinations team's backlog label Sep 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/protocol Epic normalization team/destinations Destinations team's backlog type/enhancement New feature or request
Projects
No open projects
Status: Ready for implementation (prioritized)
Development

No branches or pull requests

6 participants