
[SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source #40678

Closed

Conversation

tianhanhu (Contributor)

What changes were proposed in this pull request?

#36726 added TimestampNTZ type support in the JDBC data source, and #37013 applied a fix to pass more test cases with H2.

The problem is that Java `Timestamp` is a poorly defined class, and different JDBC drivers implement `getTimestamp` and `setTimestamp` with different expected behaviors in mind. The general conversion implementation works with some JDBC dialects and their drivers but not others. This issue was discovered when testing against a PostgreSQL database.

This PR adds a `dialect` parameter to `makeGetter` to apply dialect-specific conversions when reading a Java `Timestamp` into TimestampNTZType. `makeSetter` already has a `dialect` field, and we use that for converting back to a Java `Timestamp`.
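As a rough illustration (a minimal sketch, not the exact Spark code; the helper name `makeTimestampNTZGetter` is hypothetical), the idea is that the getter produced for TimestampNTZType delegates the conversion to the dialect instead of hard-coding one strategy:

```scala
import java.sql.ResultSet
import org.apache.spark.sql.jdbc.JdbcDialect

// Sketch: build a getter that lets the dialect decide how a driver-supplied
// java.sql.Timestamp maps to Spark's TimestampNTZType representation.
def makeTimestampNTZGetter(dialect: JdbcDialect): (ResultSet, Int) => Any =
  (rs: ResultSet, pos: Int) => {
    val t = rs.getTimestamp(pos + 1) // JDBC columns are 1-indexed
    if (t == null) null else dialect.convertJavaTimestampToTimestampNTZ(t)
  }
```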

Why are the changes needed?

Fixes TimestampNTZ support for PostgreSQL, and allows other JDBC dialects to provide dialect-specific implementations for converting between Java `Timestamp` and Spark TimestampNTZType.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit tests.
I added new test cases to `PostgresIntegrationSuite` to cover TimestampNTZ reads and writes.

github-actions bot added the SQL label Apr 5, 2023
DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime)
}

override def convertTimestampNTZToJavaTimestamp(ldt: LocalDateTime): Timestamp = {

nit: extra space should be removed.


Done

def resultSetToRows(
resultSet: ResultSet,
schema: StructType,
dialect: Option[JdbcDialect] = None): Iterator[Row] = {

Why is dialect optional? I think you can just pass dialect as JdbcDialect.


I am not sure how and where this function is used.
As it is a public function, I am thinking maybe we want to keep backward compatibility?


For backward compatibility, this solution would require a separate method overload; just having the default value would not work for Java callers.
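A hedged sketch of that overload approach (simplified; the real conversion bodies are elided):

```scala
import java.sql.ResultSet
import org.apache.spark.sql.Row
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.StructType

// Sketch: keep the old public signature for Java/binary compatibility and
// delegate to a new dialect-aware variant. Scala default arguments are not
// visible to Java callers, so an explicit overload is needed.
def resultSetToRows(resultSet: ResultSet, schema: StructType): Iterator[Row] =
  resultSetToRows(resultSet, schema, None)

def resultSetToRows(
    resultSet: ResultSet,
    schema: StructType,
    dialect: Option[JdbcDialect]): Iterator[Row] = {
  // actual row conversion elided
  Iterator.empty
}
```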

@sadikovi (Contributor) commented Apr 6, 2023

* Convert java.sql.Timestamp to a Long value (internal representation of a TimestampNTZType)
* holding the microseconds since the epoch of 1970-01-01 00:00:00Z for this timestamp.
*/
def convertJavaTimestampToTimestampNTZ(t: Timestamp): Long = {

please add @param, @return and @Since("3.5.0"). The dialect is a developer API and is user-facing.
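For illustration, a sketch of what the annotated method could look like (the body shown is the PostgresDialect override from this PR; the exact wording and placement in the merged code may differ):

```scala
import java.sql.Timestamp
import org.apache.spark.annotation.Since
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// Lives inside a JdbcDialect implementation (e.g. PostgresDialect).
/**
 * Convert a java.sql.Timestamp to a Long value (internal representation of
 * a TimestampNTZType) holding the microseconds since the epoch of
 * 1970-01-01 00:00:00Z for this timestamp.
 *
 * @param t the java.sql.Timestamp returned by the JDBC driver
 * @return microseconds since the epoch of 1970-01-01 00:00:00Z
 */
@Since("3.5.0")
def convertJavaTimestampToTimestampNTZ(t: Timestamp): Long = {
  DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime)
}
```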

@@ -98,6 +100,14 @@ private object PostgresDialect extends JdbcDialect with SQLConfHelper {
case _ => None
}

override def convertJavaTimestampToTimestampNTZ(t: Timestamp): Long = {
DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime)

can we add a few comments to explain why pgsql needs to override it?


+1

@tianhanhu-db commented Apr 7, 2023

I will give a concrete Postgres read example where the general implementation would fail.

Say there is a Timestamp of "2023-04-05 08:00:00" stored in a Postgres database and we want to read it as Spark TimestampNTZType from a time zone of America/Los_Angeles. The expected result would be "2023-04-05 08:00:00".

When we call PostgresDriver.getTimestamp, what happens under the hood is that Postgres uses the default JVM time zone and creates a Timestamp representing an instant with that wall-clock time in that zone. Thus, the Java Timestamp effectively represents "2023-04-05 08:00:00 America/Los_Angeles".

With our general conversion, we just store the underlying microseconds from the epoch to represent the TimestampNTZType. This is problematic because, when displaying a TimestampNTZType, we convert it to a LocalDateTime using UTC as the time zone. This gives us an erroneous result of "2023-04-05 15:00:00".

The Postgres-specific conversion first converts the Java Timestamp to a LocalDateTime before getting its underlying microseconds from the epoch. This effectively restores the Timestamp to represent "2023-04-05 08:00:00 UTC", so when converting back we get the correct result.

For writes it is a similar story. @cloud-fan @beliefer
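A minimal, self-contained sketch of the read path above (plain Java time APIs, not Spark code; pinning the default zone is only to make the example reproducible):

```scala
import java.sql.Timestamp
import java.time.ZoneId
import java.util.TimeZone

// Assume the JVM default zone is America/Los_Angeles, as in the example.
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))

// The Postgres driver hands back a Timestamp whose wall clock reads
// "2023-04-05 08:00:00" in the JVM default zone.
val fromDriver = Timestamp.valueOf("2023-04-05 08:00:00")

// General conversion: keep the epoch instant and render it in UTC.
// 08:00 in Los Angeles (UTC-7 on this date) is 15:00 UTC -- the wrong NTZ value.
println(fromDriver.toInstant.atZone(ZoneId.of("UTC")).toLocalDateTime)
// 2023-04-05T15:00

// Postgres-specific conversion: read the wall clock back in the same
// default zone, which restores the stored value.
println(fromDriver.toLocalDateTime)
// 2023-04-05T08:00
```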

@tianhanhu-db commented Apr 7, 2023

I tried the Postgres-specific solution for the existing H2 test and it is not working.

I checked the H2 driver as well, and I think what happens is that H2 creates the Timestamp using the milliseconds from the epoch and THEN converts the wall-clock time to represent the instant in the local time zone. This change in order makes the difference. If we take the previous case as an example, the resulting Timestamp would be "2023-04-05 01:00:00 America/Los_Angeles". This represents the same instant as "2023-04-05 08:00:00 UTC", which is why storing its microseconds from the epoch works. It also explains why converting to LocalDateTime (the Postgres-specific solution) would not work.
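A similar hedged sketch of the H2-style behavior described here (again plain Java time APIs, with the default zone pinned for reproducibility):

```scala
import java.sql.Timestamp
import java.time.{Instant, ZoneId}
import java.util.TimeZone

TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))

// H2-style: build the Timestamp from the epoch millis of the stored
// instant "2023-04-05 08:00:00 UTC".
val fromH2 = new Timestamp(Instant.parse("2023-04-05T08:00:00Z").toEpochMilli)

// Keeping the instant and rendering it in UTC recovers the stored value,
// so the general microseconds-from-epoch conversion works for H2.
println(fromH2.toInstant.atZone(ZoneId.of("UTC")).toLocalDateTime)
// 2023-04-05T08:00

// The Postgres-specific toLocalDateTime reads the wall clock in the default
// zone instead and yields 01:00, which is why that strategy fails for H2.
println(fromH2.toLocalDateTime)
// 2023-04-05T01:00
```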


To conclude, JDBC drivers have different expected behaviors in regard to implementing "getTimestamp" and "setTimestamp". A general conversion strategy would not work for all of them.


Does Postgres have TimestampNTZ?


almost every database has timestamp ntz.


In Postgres, `timestamp` is equivalent to `timestamp without time zone`.
It has `timestamptz` to represent `timestamp with time zone`.

@cloud-fan (Contributor)

also cc @yaooqinn

@@ -98,6 +100,14 @@ private object PostgresDialect extends JdbcDialect with SQLConfHelper {
case _ => None
}

override def convertJavaTimestampToTimestampNTZ(t: Timestamp): LocalDateTime = {
t.toLocalDateTime

After the change referred to in https://github.com/apache/spark/pull/40678/files#r1162437868, we can update this to `DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime)`.

dialect: JdbcDialect,
schema: StructType): Array[JDBCValueGetter] =
schema.fields.map(sf => makeGetter(sf.dataType, dialect, sf.metadata))
=======

please fix conflicts

@cloud-fan (Contributor)

thanks, merging to master!

cloud-fan closed this in 0c4ac71 May 5, 2023
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request May 10, 2023
Closes apache#40678 from tianhanhu/SPARK-43040_jdbc_timestamp_ntz.

Authored-by: tianhanhu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>