
[SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source #40678

Closed

Conversation

tianhanhu (Contributor)

What changes were proposed in this pull request?

#36726 added TimestampNTZ type support in the JDBC data source, and #37013 applied a fix to pass more test cases with H2.

The problem is that Java `Timestamp` is a poorly defined class, and different JDBC drivers implement `getTimestamp` and `setTimestamp` with different expected behaviors in mind. The general conversion implementation works with some JDBC dialects and their drivers but not others. This issue was discovered when testing against a PostgreSQL database.

This PR adds a `dialect` parameter to `makeGetter` to apply dialect-specific conversions when reading a Java `Timestamp` into TimestampNTZType. `makeSetter` already has a `dialect` field, and we use that for converting back to a Java `Timestamp`.
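As a rough illustration (a minimal sketch, not the exact Spark code; the helper name `makeTimestampNTZGetter` is hypothetical), the idea is that the getter produced for TimestampNTZType delegates the conversion to the dialect instead of hard-coding one strategy:

```scala
import java.sql.ResultSet
import org.apache.spark.sql.jdbc.JdbcDialect

// Sketch: build a getter that lets the dialect decide how a driver-supplied
// java.sql.Timestamp maps to Spark's TimestampNTZType representation.
def makeTimestampNTZGetter(dialect: JdbcDialect): (ResultSet, Int) => Any =
  (rs: ResultSet, pos: Int) => {
    val t = rs.getTimestamp(pos + 1) // JDBC columns are 1-indexed
    if (t == null) null else dialect.convertJavaTimestampToTimestampNTZ(t)
  }
```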

Why are the changes needed?

Fixes TimestampNTZ support for PostgreSQL, and allows other JDBC dialects to provide dialect-specific implementations for converting between Java `Timestamp` and Spark TimestampNTZType.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit tests.
I added new test cases to `PostgresIntegrationSuite` to cover TimestampNTZ reads and writes.

github-actions bot added the SQL label Apr 5, 2023
DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime)
}

override def convertTimestampNTZToJavaTimestamp(ldt: LocalDateTime): Timestamp = {

nit: extra space should be removed.


Done

def resultSetToRows(
resultSet: ResultSet,
schema: StructType,
dialect: Option[JdbcDialect] = None): Iterator[Row] = {

Why is dialect optional? I think you can just pass dialect as JdbcDialect.


I am not sure how and where this function is used.
As it is a public function, I am thinking maybe we want to keep backward compatibility?


For backward compatibility, this solution would require a separate method overload; just having the default value would not work for Java callers.
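A hedged sketch of that overload approach (simplified; the real conversion bodies are elided):

```scala
import java.sql.ResultSet
import org.apache.spark.sql.Row
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.StructType

// Sketch: keep the old public signature for Java/binary compatibility and
// delegate to a new dialect-aware variant. Scala default arguments are not
// visible to Java callers, so an explicit overload is needed.
def resultSetToRows(resultSet: ResultSet, schema: StructType): Iterator[Row] =
  resultSetToRows(resultSet, schema, None)

def resultSetToRows(
    resultSet: ResultSet,
    schema: StructType,
    dialect: Option[JdbcDialect]): Iterator[Row] = {
  // actual row conversion elided
  Iterator.empty
}
```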

@sadikovi (Contributor) commented Apr 6, 2023

* Convert java.sql.Timestamp to a Long value (internal representation of a TimestampNTZType)
* holding the microseconds since the epoch of 1970-01-01 00:00:00Z for this timestamp.
*/
def convertJavaTimestampToTimestampNTZ(t: Timestamp): Long = {

please add @param, @return and @Since("3.5.0"). The dialect is a developer API and is user-facing.
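For illustration, a sketch of what the annotated method could look like (the body shown is the PostgresDialect override from this PR; the exact wording and placement in the merged code may differ):

```scala
import java.sql.Timestamp
import org.apache.spark.annotation.Since
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// Lives inside a JdbcDialect implementation (e.g. PostgresDialect).
/**
 * Convert a java.sql.Timestamp to a Long value (internal representation of
 * a TimestampNTZType) holding the microseconds since the epoch of
 * 1970-01-01 00:00:00Z for this timestamp.
 *
 * @param t the java.sql.Timestamp returned by the JDBC driver
 * @return microseconds since the epoch of 1970-01-01 00:00:00Z
 */
@Since("3.5.0")
def convertJavaTimestampToTimestampNTZ(t: Timestamp): Long = {
  DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime)
}
```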

@@ -98,6 +100,14 @@ private object PostgresDialect extends JdbcDialect with SQLConfHelper {
case _ => None
}

override def convertJavaTimestampToTimestampNTZ(t: Timestamp): Long = {
DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime)

can we add a few comments to explain why pgsql needs to override it?


+1

@tianhanhu-db commented Apr 7, 2023

I will give a concrete Postgres read example where the general implementation would fail.

Say there is a Timestamp of "2023-04-05 08:00:00" stored in a Postgres database and we want to read it as Spark TimestampNTZType from a time zone of America/Los_Angeles. The expected result would be "2023-04-05 08:00:00".

When we call PostgresDriver.getTimestamp, what happens under the hood is that Postgres uses the default JVM time zone and creates a Timestamp representing an instant with that wall-clock time in that zone. Thus, the Java Timestamp effectively represents "2023-04-05 08:00:00 America/Los_Angeles".

With our general conversion, we just store the underlying microseconds from the epoch to represent the TimestampNTZType. This is problematic because, when displaying a TimestampNTZType, we convert it to a LocalDateTime using UTC as the time zone. This gives us an erroneous result of "2023-04-05 15:00:00".

The Postgres-specific conversion first converts the Java Timestamp to a LocalDateTime before getting its underlying microseconds from the epoch. This effectively restores the Timestamp to represent "2023-04-05 08:00:00 UTC", so when converting back we get the correct result.

For writes it is a similar story. @cloud-fan @beliefer
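A minimal, self-contained sketch of the read path above (plain Java time APIs, not Spark code; pinning the default zone is only to make the example reproducible):

```scala
import java.sql.Timestamp
import java.time.ZoneId
import java.util.TimeZone

// Assume the JVM default zone is America/Los_Angeles, as in the example.
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))

// The Postgres driver hands back a Timestamp whose wall clock reads
// "2023-04-05 08:00:00" in the JVM default zone.
val fromDriver = Timestamp.valueOf("2023-04-05 08:00:00")

// General conversion: keep the epoch instant and render it in UTC.
// 08:00 in Los Angeles (UTC-7 on this date) is 15:00 UTC -- the wrong NTZ value.
println(fromDriver.toInstant.atZone(ZoneId.of("UTC")).toLocalDateTime)
// 2023-04-05T15:00

// Postgres-specific conversion: read the wall clock back in the same
// default zone, which restores the stored value.
println(fromDriver.toLocalDateTime)
// 2023-04-05T08:00
```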

@tianhanhu-db commented Apr 7, 2023

I tried the Postgres-specific solution for the existing H2 test and it is not working.

I checked the H2 driver as well, and I think what happens is that H2 creates the Timestamp using the milliseconds from the epoch and THEN converts the wall-clock time to represent the instant in the local time zone. This change in order makes the difference. If we take the previous case as an example, the resulting Timestamp would be "2023-04-05 01:00:00 America/Los_Angeles". This represents the same instant as "2023-04-05 08:00:00 UTC", which is why storing its microseconds from the epoch works. It also explains why converting to LocalDateTime (the Postgres-specific solution) would not work.
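A similar hedged sketch of the H2-style behavior described here (again plain Java time APIs, with the default zone pinned for reproducibility):

```scala
import java.sql.Timestamp
import java.time.{Instant, ZoneId}
import java.util.TimeZone

TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))

// H2-style: build the Timestamp from the epoch millis of the stored
// instant "2023-04-05 08:00:00 UTC".
val fromH2 = new Timestamp(Instant.parse("2023-04-05T08:00:00Z").toEpochMilli)

// Keeping the instant and rendering it in UTC recovers the stored value,
// so the general microseconds-from-epoch conversion works for H2.
println(fromH2.toInstant.atZone(ZoneId.of("UTC")).toLocalDateTime)
// 2023-04-05T08:00

// The Postgres-specific toLocalDateTime reads the wall clock in the default
// zone instead and yields 01:00, which is why that strategy fails for H2.
println(fromH2.toLocalDateTime)
// 2023-04-05T01:00
```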


To conclude, JDBC drivers have different expected behaviors in regard to implementing "getTimestamp" and "setTimestamp". A general conversion strategy would not work for all of them.


Does Postgres have TimestampNTZ?


almost every database has timestamp ntz.


In Postgres, `timestamp` is equivalent to `timestamp without time zone`.
It has `timestamptz` to represent `timestamp with time zone`.

@cloud-fan (Contributor)

also cc @yaooqinn

@@ -98,6 +100,14 @@ private object PostgresDialect extends JdbcDialect with SQLConfHelper {
case _ => None
}

override def convertJavaTimestampToTimestampNTZ(t: Timestamp): LocalDateTime = {
t.toLocalDateTime

After the change referred to in https://github.com/apache/spark/pull/40678/files#r1162437868, we can update this to `DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime)`.

dialect: JdbcDialect,
schema: StructType): Array[JDBCValueGetter] =
schema.fields.map(sf => makeGetter(sf.dataType, dialect, sf.metadata))
=======

please fix conflicts

@cloud-fan (Contributor)

thanks, merging to master!

cloud-fan closed this in 0c4ac71 May 5, 2023
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request May 10, 2023
Closes apache#40678 from tianhanhu/SPARK-43040_jdbc_timestamp_ntz.

Authored-by: tianhanhu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>