
[SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source #36726

Closed
wants to merge 7 commits

Conversation

sadikovi
Contributor

What changes were proposed in this pull request?

This PR adds support for TimestampNTZ (TIMESTAMP WITHOUT TIME ZONE) in the JDBC data source. It also introduces a new configuration option, inferTimestampNTZType, which allows reading stored timestamps as timestamp without time zone. By default it is set to false, i.e. all timestamps are read as the legacy timestamp type.

Here is the state of timestamp without time zone support in the built-in dialects:

  • H2: timestamp without time zone, seems to map to timestamp type
  • Derby: only has timestamp type
  • MySQL: only has timestamp type
  • Postgres: has timestamp without time zone, which maps to timestamp
  • SQL Server: only datetime/datetime2, neither are time zone aware
  • Oracle: seems to only have timestamp and timestamp with time zone
  • Teradata: similar to Oracle but I could not verify
  • DB2: has TIMESTAMP WITHOUT TIME ZONE but I could not make this type work in my test, only TIMESTAMP seems to work

Why are the changes needed?

Adds support for the new TimestampNTZ type, see https://issues.apache.org/jira/browse/SPARK-35662.

Does this PR introduce any user-facing change?

The JDBC data source is now capable of writing and reading TimestampNTZ values. When reading timestamp values, the configuration option inferTimestampNTZType allows inferring those values as TIMESTAMP WITHOUT TIME ZONE. By default the option is set to false, so the behaviour is unchanged and all timestamps are read as TIMESTAMP WITH LOCAL TIME ZONE.

How was this patch tested?

I added a unit test to ensure the general functionality works. I also manually verified the write/read test for TimestampNTZ in the following databases (all I could get access to):

  • H2, jdbc:h2:mem:testdb0
  • Derby, jdbc:derby:<filepath>
  • MySQL, docker run --name mysql -e MYSQL_ROOT_PASSWORD=secret -e MYSQL_DATABASE=db -e MYSQL_USER=user -e MYSQL_PASSWORD=secret -p 3306:3306 -d mysql:5.7, jdbc:mysql://127.0.0.1:3306/db?user=user&password=secret
  • PostgreSQL, docker run -d --name postgres -e POSTGRES_PASSWORD=secret -e POSTGRES_USER=user -e POSTGRES_DB=db -p 5432:5432 postgres:12.11, jdbc:postgresql://127.0.0.1:5432/db?user=user&password=secret
  • SQL Server, docker run -e "ACCEPT_EULA=Y" -e SA_PASSWORD='yourStrong(!)Password' -p 1433:1433 -d mcr.microsoft.com/mssql/server:2019-CU15-ubuntu-20.04, jdbc:sqlserver://127.0.0.1:1433;user=sa;password=yourStrong(!)Password
  • DB2, docker run -itd --name mydb2 --privileged=true -p 50000:50000 -e LICENSE=accept -e DB2INST1_PASSWORD=secret -e DBNAME=db ibmcom/db2, jdbc:db2://127.0.0.1:50000/db:user=db2inst1;password=secret;

@sadikovi
Contributor Author

@gengliangwang @beliefer Can you review this PR? Thanks.

@AmplabJenkins

Can one of the admins verify this patch?

@@ -226,6 +226,9 @@ class JDBCOptions(
// The prefix that is added to the query sent to the JDBC database.
// This is required to support some complex queries with some JDBC databases.
val prepareQuery = parameters.get(JDBC_PREPARE_QUERY).map(_ + " ").getOrElse("")

// Infers timestamp values as TimestampNTZ type when reading data.
val inferTimestampNTZType = parameters.getOrElse(JDBC_INFER_TIMESTAMP_NTZ, "false").toBoolean
Member

Should we maybe check if spark.sql.timestampType is TIMESTAMP_NTZ when inferTimestampNTZType is not set? That's what CSV type inference and Python type inference do.
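The suggested fallback can be sketched in plain Python (a hypothetical model, not Spark's actual option-parsing code; the function name and dict-based options are invented for illustration):

```python
# Hypothetical model of the reviewer's suggestion: an explicit
# inferTimestampNTZType option always wins; when it is unset, fall back to
# the session default timestamp type (spark.sql.timestampType).
def infer_timestamp_ntz(options: dict, session_timestamp_type: str) -> bool:
    if "inferTimestampNTZType" in options:
        # explicit option set by the user
        return options["inferTimestampNTZType"].lower() == "true"
    # unset: follow the session default, as CSV/Python type inference do
    return session_timestamp_type == "TIMESTAMP_NTZ"

assert infer_timestamp_ntz({"inferTimestampNTZType": "false"}, "TIMESTAMP_NTZ") is False
assert infer_timestamp_ntz({}, "TIMESTAMP_NTZ") is True
assert infer_timestamp_ntz({}, "TIMESTAMP_LTZ") is False
```

The merged PR keeps the simpler behaviour (default false), so this is only a sketch of the alternative being discussed.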

Contributor Author

Yes, I thought about it, let's ask @gengliangwang.

row.setLong(pos, DateTimeUtils.fromJavaTimestamp(t))
} else {
row.update(pos, null)
}
Contributor

@LuciferYang LuciferYang May 31, 2022

This is the same as the TimestampType branch; should we merge them?

case TimestampType =>
(rs: ResultSet, row: InternalRow, pos: Int) =>
val t = rs.getTimestamp(pos + 1)
if (t != null) {
row.setLong(pos, DateTimeUtils.fromJavaTimestamp(t))
} else {
row.update(pos, null)
}

Contributor Author

Yes, we can merge. I will do that, thanks.

@sadikovi
Contributor Author

@beliefer Can you review this PR from JDBC perspective? I think you have contributed extensively to this part of the code. Also, cc @gengliangwang.

@apache apache deleted a comment from codecov-commenter Jun 1, 2022
}

test("SPARK-39339: TimestampNTZType support") {
val tableName = "timestamp_ntz_table"
Contributor

Could you add tests that write/read timestamp_ntz with different time zones? I suspect the result is incorrect.

Contributor Author

@sadikovi sadikovi Jun 2, 2022

The TimestampNTZ type is independent of the time zone, with the timestamp rebased to UTC. Sure, I can add a test to confirm.
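The zone independence claimed here can be illustrated with a small pure-Python model (not Spark internals; the epoch-based microsecond encoding is a simplification of how an NTZ wall clock can be stored without consulting any time zone):

```python
from datetime import datetime, timedelta

# Hypothetical model: timestamp_ntz stores a bare wall clock as microseconds
# from 1970-01-01 00:00:00 with no zone attached, so the round trip is an
# identity regardless of the session or JVM time zone.
EPOCH = datetime(1970, 1, 1)

def ntz_to_micros(dt: datetime) -> int:
    # exact integer arithmetic on the timedelta components (no float rounding)
    td = dt - EPOCH
    return (td.days * 86_400 + td.seconds) * 1_000_000 + td.microseconds

def micros_to_ntz(us: int) -> datetime:
    return EPOCH + timedelta(microseconds=us)

wall = datetime(2019, 1, 20, 12, 0, 0, 502000)
assert micros_to_ntz(ntz_to_micros(wall)) == wall  # identity, zone never consulted
```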

@@ -150,6 +150,9 @@ object JdbcUtils extends Logging with SQLConfHelper {
case StringType => Option(JdbcType("TEXT", java.sql.Types.CLOB))
case BinaryType => Option(JdbcType("BLOB", java.sql.Types.BLOB))
case TimestampType => Option(JdbcType("TIMESTAMP", java.sql.Types.TIMESTAMP))
// Most of the databases either don't support TIMESTAMP WITHOUT TIME ZONE or map it to
// TIMESTAMP type. This will be overwritten in dialects.
case TimestampNTZType => Option(JdbcType("TIMESTAMP", java.sql.Types.TIMESTAMP))
Contributor

We cannot do this. We should let the JDBC dialect decide how to do the mapping.

Contributor Author

@sadikovi sadikovi Jun 2, 2022

This is a common case of treating TIMESTAMP as timestamp without time zone. JDBC dialects can override this setting if need be. For example, SQL Server uses DATETIME instead. I have verified that most of the JDBC data sources work fine with TIMESTAMP.

I am going to update the comment to elaborate in more detail.
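The override strategy described here can be sketched as a plain lookup (a hypothetical illustration, not the Scala dialect API; the dialect names and default are assumptions, with only the SQL Server DATETIME case taken from this discussion):

```python
# Hypothetical sketch: TimestampNTZType maps to plain TIMESTAMP by default,
# and individual dialects override the DDL type where the database needs it.
DEFAULT_NTZ_DDL = "TIMESTAMP"

DIALECT_NTZ_OVERRIDES = {
    # SQL Server's TIMESTAMP is not a datetime type, so a dialect override
    # is required there (per the comment above)
    "sqlserver": "DATETIME",
}

def ntz_ddl_type(dialect: str) -> str:
    # fall back to the common default when no override is registered
    return DIALECT_NTZ_OVERRIDES.get(dialect, DEFAULT_NTZ_DDL)

assert ntz_ddl_type("postgresql") == "TIMESTAMP"
assert ntz_ddl_type("sqlserver") == "DATETIME"
```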

@sadikovi
Contributor Author

sadikovi commented Jun 3, 2022

@gengliangwang Can you review?

Thanks to the comments from the reviewers, I noticed that there could be inconsistencies when writing timestamp with local time zone and reading it back as timestamp_ntz. For example, when writing 2020-01-01 00:00:00 in a local time zone of UTC+1, the timestamp_ntz value will be read back as 2019-12-31 23:00:00. This is because we always store timestamps in UTC and we don't have information on what time zone was used when writing the timestamp.

Could you advise on how to proceed? I am not sure we can do much about it because we don't store time zone information.
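The inconsistency described above can be reproduced with plain Python datetimes (an illustration of the semantics only, not Spark code):

```python
from datetime import datetime, timezone, timedelta

# A TIMESTAMP WITH LOCAL TIME ZONE value is stored as a UTC instant, so
# reading it back as timestamp_ntz exposes the UTC wall clock, not the wall
# clock the writer saw.
writer_zone = timezone(timedelta(hours=1))     # writer session in UTC+1
written = datetime(2020, 1, 1, 0, 0, 0)        # wall clock the user wrote

# stored internally as the equivalent UTC instant
stored_utc = written.replace(tzinfo=writer_zone).astimezone(timezone.utc)

# the writer's zone was never stored, so NTZ reads the raw UTC wall clock
read_as_ntz = stored_utc.replace(tzinfo=None)
assert read_as_ntz == datetime(2019, 12, 31, 23, 0, 0)
```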

@gengliangwang
Member

@sadikovi yes will do. I just moved home. Sorry for the late reply.

.option("inferTimestampNTZType", "true")
.option("url", urlWithUserAndPass)
.option("dbtable", tableName)
.load()
Contributor

This test case always reads/writes with the same time zone. You can reference:

test("SPARK-37463: read/write Timestamp ntz to Orc with different time zone") {

Contributor Author

Yes, I will update, thanks 👍.

Contributor

The test case still reads and writes to JDBC with the same time zone.

Member

@gengliangwang gengliangwang left a comment

LGTM, thanks for the work!

@sadikovi
Contributor Author

@gengliangwang I made a few small changes. Can you review again? Thanks.

@gengliangwang
Member

Thanks, merging to master

@beliefer
Contributor

beliefer commented Jun 14, 2022

Thanks, merging to master

I updated this test case and it fails!

  test("SPARK-39339: TimestampNTZType with different local time zones") {
    val tableName = "timestamp_ntz_diff_tz_support_table"

    DateTimeTestUtils.outstandingZoneIds.foreach { zoneId =>
      DateTimeTestUtils.withDefaultTimeZone(zoneId) {
        Seq(
          "1972-07-04 03:30:00",
          "2019-01-20 12:00:00.502",
          "2019-01-20T00:00:00.123456",
          "1500-01-20T00:00:00.123456"
        ).foreach { case datetime =>
          val df = spark.sql(s"select timestamp_ntz '$datetime'")
          df.write.format("jdbc")
            .mode("overwrite")
            .option("url", urlWithUserAndPass)
            .option("dbtable", tableName)
            .save()

          DateTimeTestUtils.outstandingZoneIds.foreach { zoneId =>
            DateTimeTestUtils.withDefaultTimeZone(zoneId) {
              val res = spark.read.format("jdbc")
                .option("inferTimestampNTZType", "true")
                .option("url", urlWithUserAndPass)
                .option("dbtable", tableName)
                .load()

              checkAnswer(res, df)
            }
          }
        }
      }
    }
  }

@sadikovi
Contributor Author

I updated the test case as you suggested and it passes on my machine. Can you share the error message? It also passed the build.

@beliefer
Contributor

I think we can't support timestamp_ntz with this option.
We should let the JDBC dialect decide how to support timestamp_ntz.
If one table has ts1 as timestamp and ts2 as timestamp_ntz, what is the output when we specify the inferTimestampNTZType option?

@beliefer
Contributor

I updated the test case as you suggested and it passes on my machine. Can you share the error message? It also passed the build.

== Results ==
!== Correct Answer - 1 ==                                           == Spark Answer - 1 ==
 struct<TIMESTAMP_NTZ '1500-01-20 00:00:00.123456':timestamp_ntz>   struct<TIMESTAMP_NTZ '1500-01-20 00:00:00.123456':timestamp_ntz>
![1500-01-20T00:00:00.123456]                                       [1500-01-20T00:16:08.123456]

@sadikovi
Contributor Author

I think we can. JDBC dialects can configure how they map the TimestampNTZ type.
In the case you mentioned, both timestamps will be read as timestamp_ntz in MySQL and Postgres. In fact, the current timestamp type is stored as timestamp_ntz in those database systems.

Even with dialects managing timestamp_ntz writes and reads, the same problem would arise unless you store them as different types.

Also, the test passes in master:

[ivan.sadikov@C02DV1TGMD6R spark-oss (master)]$ git log -n1
commit 2349175e1b81b0a61e1ed90c2d051c01cf78de9b (HEAD -> master, upstream/master)
Author: Ivan Sadikov <[email protected]>
Date:   Mon Jun 13 21:22:15 2022 -0700

    [SPARK-39339][SQL] Support TimestampNTZ type in JDBC data source
    



[info] JDBCSuite:
05:53:04.233 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
05:53:07.936 ERROR org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Failed to load built-in provider.
[info] - SPARK-39339: Handle TimestampNTZType null values (1 second, 555 milliseconds)
[info] - SPARK-39339: TimestampNTZType with different local time zones (4 seconds, 48 milliseconds)
05:53:14.022 WARN org.apache.spark.sql.jdbc.JDBCSuite: 

===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.jdbc.JDBCSuite, threads: Timer-2 (daemon=true), rpc-boss-3-1 (daemon=true), shuffle-boss-6-1 (daemon=true) =====
[info] Run completed in 11 seconds, 724 milliseconds.
[info] Total number of tests run: 2
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 2, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 81 s (01:21), completed Jun 14, 2022 5:53:14 AM

@sadikovi
Contributor Author

@beliefer Maybe we can address your concerns in the follow-up work, what do you think? We can open a follow-up ticket and try to polish the implementation - it is not perfect by any means!

@beliefer
Contributor

@sadikovi You can run the test case I added above.

gengliangwang pushed a commit that referenced this pull request Jun 30, 2022
…ource is incorrect

### What changes were proposed in this pull request?
#36726 added support for the TimestampNTZ type in the JDBC data source, but the implementation is incorrect.
This PR just modifies a test case so that it fails.
The test case is shown below.
```
  test("SPARK-39339: TimestampNTZType with different local time zones") {
    val tableName = "timestamp_ntz_diff_tz_support_table"

    DateTimeTestUtils.outstandingZoneIds.foreach { zoneId =>
      DateTimeTestUtils.withDefaultTimeZone(zoneId) {
        Seq(
          "1972-07-04 03:30:00",
          "2019-01-20 12:00:00.502",
          "2019-01-20T00:00:00.123456",
          "1500-01-20T00:00:00.123456"
        ).foreach { case datetime =>
          val df = spark.sql(s"select timestamp_ntz '$datetime'")
          df.write.format("jdbc")
            .mode("overwrite")
            .option("url", urlWithUserAndPass)
            .option("dbtable", tableName)
            .save()

          DateTimeTestUtils.outstandingZoneIds.foreach { zoneId =>
            DateTimeTestUtils.withDefaultTimeZone(zoneId) {
              val res = spark.read.format("jdbc")
                .option("inferTimestampNTZType", "true")
                .option("url", urlWithUserAndPass)
                .option("dbtable", tableName)
                .load()

              checkAnswer(res, df)
            }
          }
        }
      }
    }
  }
```

The test failure output is shown below.
```
Results do not match for query:
Timezone: sun.util.calendar.ZoneInfo[id="Africa/Dakar",offset=0,dstSavings=0,useDaylight=false,transitions=3,lastRule=null]
Timezone Env:

== Parsed Logical Plan ==
Relation [TIMESTAMP_NTZ '1500-01-20 00:00:00.123456'#253] JDBCRelation(timestamp_ntz_diff_tz_support_table) [numPartitions=1]

== Analyzed Logical Plan ==
TIMESTAMP_NTZ '1500-01-20 00:00:00.123456': timestamp_ntz
Relation [TIMESTAMP_NTZ '1500-01-20 00:00:00.123456'#253] JDBCRelation(timestamp_ntz_diff_tz_support_table) [numPartitions=1]

== Optimized Logical Plan ==
Relation [TIMESTAMP_NTZ '1500-01-20 00:00:00.123456'#253] JDBCRelation(timestamp_ntz_diff_tz_support_table) [numPartitions=1]

== Physical Plan ==
*(1) Scan JDBCRelation(timestamp_ntz_diff_tz_support_table) [numPartitions=1] [TIMESTAMP_NTZ '1500-01-20 00:00:00.123456'#253] PushedFilters: [], ReadSchema: struct<TIMESTAMP_NTZ '1500-01-20 00:00:00.123456':timestamp_ntz>

== Results ==

== Results ==
!== Correct Answer - 1 ==                                           == Spark Answer - 1 ==
 struct<TIMESTAMP_NTZ '1500-01-20 00:00:00.123456':timestamp_ntz>   struct<TIMESTAMP_NTZ '1500-01-20 00:00:00.123456':timestamp_ntz>
![1500-01-20T00:00:00.123456]                                       [1500-01-20T00:16:08.123456]

ScalaTestFailureLocation: org.apache.spark.sql.QueryTest$ at (QueryTest.scala:243)
org.scalatest.exceptions.TestFailedException:
```

### Why are the changes needed?
Fix an implementation bug.
The bug is caused by the use of `toJavaTimestamp` and `fromJavaTimestamp`, which convert timestamps using the JVM system time zone.
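A minimal pure-Python model of this failure mode (the two functions are simplified stand-ins for `fromJavaTimestamp`/`toJavaTimestamp`, not their real implementations, and the +16:08 offset is chosen to mirror the local-mean-time style shift seen in the failing test):

```python
from datetime import datetime, timezone, timedelta

# Hypothetical model: a conversion that consults the JVM default time zone
# pushes the NTZ wall clock through the zone on write and again on read, so
# the value shifts whenever the two zones (or their historical offsets) differ.
def from_java_timestamp(wall: datetime, jvm_tz: timezone) -> datetime:
    # wall clock -> stored UTC wall clock, interpreting it in the JVM zone
    return wall.replace(tzinfo=jvm_tz).astimezone(timezone.utc).replace(tzinfo=None)

def to_java_timestamp(utc: datetime, jvm_tz: timezone) -> datetime:
    # stored UTC wall clock -> wall clock rendered in the JVM zone
    return utc.replace(tzinfo=timezone.utc).astimezone(jvm_tz).replace(tzinfo=None)

ntz = datetime(1500, 1, 20, 0, 0, 0, 123456)
write_zone = timezone.utc
# historical zones often had odd local-mean-time offsets; +16:08 is illustrative
read_zone = timezone(timedelta(minutes=16, seconds=8))

stored = from_java_timestamp(ntz, write_zone)
read_back = to_java_timestamp(stored, read_zone)
assert read_back == datetime(1500, 1, 20, 0, 16, 8, 123456)  # shifted wall clock
```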

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New test case.

Closes #37013 from beliefer/SPARK-39339_followup.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
cloud-fan pushed a commit that referenced this pull request May 5, 2023
### What changes were proposed in this pull request?

#36726 supports TimestampNTZ type in JDBC data source and #37013 applies a fix to pass more test cases with H2.

The problem is that Java Timestamp is a poorly defined class and different JDBC drivers implement "getTimestamp" and "setTimestamp" with different expected behaviors in mind. The general conversion implementation works with some JDBC dialects and their drivers but not others. This issue was discovered when testing with the PostgreSQL database.

This PR adds a `dialect` parameter to `makeGetter` for applying dialect specific conversions when reading a Java Timestamp into TimestampNTZType. `makeSetter` already has a `dialect` field and we will use that for converting back to Java Timestamp.
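The driver mismatch motivating the `dialect` parameter can be modeled in a few lines of Python (both drivers and their behaviors are invented for illustration; real drivers differ in their own ways):

```python
from datetime import datetime, timezone, timedelta

# Hypothetical model: two imaginary drivers disagree on what getTimestamp
# returns for a TIMESTAMP WITHOUT TIME ZONE column, so one generic conversion
# cannot be correct for both; the dialect must pick the right inverse.
jvm_zone = timezone(timedelta(hours=2))

def driver_a_get(stored_wall: datetime) -> datetime:
    return stored_wall  # hands back the column's wall clock unchanged

def driver_b_get(stored_wall: datetime) -> datetime:
    # treats the column as UTC and renders it in the JVM zone
    return stored_wall.replace(tzinfo=timezone.utc).astimezone(jvm_zone).replace(tzinfo=None)

def dialect_convert(driver: str, ts: datetime) -> datetime:
    if driver == "b":
        # dialect-specific conversion: undo driver B's JVM-zone shift
        return ts.replace(tzinfo=jvm_zone).astimezone(timezone.utc).replace(tzinfo=None)
    return ts  # the generic path is already correct for driver A

stored = datetime(2020, 1, 1, 0, 0, 0)
assert dialect_convert("a", driver_a_get(stored)) == stored
assert dialect_convert("b", driver_b_get(stored)) == stored
```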

### Why are the changes needed?

Fix TimestampNTZ support for PostgreSQL. This allows other JDBC dialects to provide a dialect-specific implementation for converting between Java Timestamp and Spark TimestampNTZType.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit test.
I added new test cases for `PostgresIntegrationSuite` to cover TimestampNTZ read and writes.

Closes #40678 from tianhanhu/SPARK-43040_jdbc_timestamp_ntz.

Authored-by: tianhanhu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request May 10, 2023