-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source #40678
[SPARK-43040][SQL] Improve TimestampNTZ type support in JDBC data source #40678
Conversation
DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime) | ||
} | ||
|
||
override def convertTimestampNTZToJavaTimestamp(ldt: LocalDateTime): Timestamp = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra space should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
def resultSetToRows( | ||
resultSet: ResultSet, | ||
schema: StructType, | ||
dialect: Option[JdbcDialect] = None): Iterator[Row] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is dialect optional? I think you can just pass dialect as JdbcDialect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how and where this function is used.
As it is a public function, I am thinking maybe we want to keep backward compatibilities?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For backward compatibility, this solution would require a separate constructor, just having the default value would not work in Java.
...er-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala
Show resolved
Hide resolved
...er-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala
Show resolved
Hide resolved
* Convert java.sql.Timestamp to a Long value (internal representation of a TimestampNTZType) | ||
* holding the microseconds since the epoch of 1970-01-01 00:00:00Z for this timestamp. | ||
*/ | ||
def convertJavaTimestampToTimestampNTZ(t: Timestamp): Long = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add @param
, @return
and @Since("3.5.0)
. dialect is a developer API and is user-facing.
@@ -98,6 +100,14 @@ private object PostgresDialect extends JdbcDialect with SQLConfHelper { | |||
case _ => None | |||
} | |||
|
|||
override def convertJavaTimestampToTimestampNTZ(t: Timestamp): Long = { | |||
DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a bit comments to explain why pgsql needs to override it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will give a concrete Postgres read example where the general implementation would fail.
Say there is a Timestamp of "2023-04-05 08:00:00" stored in Postgres database and we want to read it as Spark TimestampNTZType from a TimeZone of America/Los_Angeles. The expected results would be "2023-04-05 08:00:00".
When we do PostgresDriver.getTimestamp
, what happens under the hood is that Postgres would use the default JVM TimeZone and create a Timestamp representing an instant of the wall clock in that time zone. Thus, the Java Timestamp effectively represents "2023-04-05 08:00:00 America/Los_Angeles".
With our general conversion, we will just store the underlining microseconds from epoch to represent the TimestampNTZType. This is problematic as when displaying the TimestampNTZType, we convert to a LocalDateTime using UTC as the time zone. This will give as an erroneous result of "2023-04-05 15:00:00".
The Postgres specific conversion first convert the Java Timestamp to LocalDateTime before getting its underlining milliseconds from epoch. This basically restores the Timestamp to represent "2023-04-05 08:00:00 UTC". Thus when converting back we get the correct result.
For write it is the similar story. @cloud-fan @beliefer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried the Postgres specific solution for the existing H2 test and it is not working.
I checked H2 driver as well and I think what happens is that H2 is creating the timestamp using the the milliseconds from epoch and THEN converting the wall clock time to the represent the instant in local Timezone. This change in order makes the difference. If we take the previous case as an example, the resultant Timestamp would be "2023-04-05 01:00:00 America/Los_Angeles". This represent the same instant as "2023-04-05 08:00:00 UTC" which is why storing its microseconds from epoch works. It also explain why converting to LocalDateTime (Postgres specific solution) would not work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To conclude, JDBC drivers have different expected behaviors in regard to implementing "getTimestamp" and "setTimestamp". A general conversion strategy would not work for all of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Postgres have TimestampNTZ ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
almost every database has timestamp ntz.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Postgres, timestamp
is equivalent to timestamp without time zone.
It has timestamptz
to represent timestamp with time zone.
also cc @yaooqinn |
...er-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/PostgresIntegrationSuite.scala
Show resolved
Hide resolved
@@ -98,6 +100,14 @@ private object PostgresDialect extends JdbcDialect with SQLConfHelper { | |||
case _ => None | |||
} | |||
|
|||
override def convertJavaTimestampToTimestampNTZ(t: Timestamp): LocalDateTime = { | |||
t.toLocalDateTime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the change refer: https://github.com/apache/spark/pull/40678/files#r1162437868
We can update with DateTimeUtils.localDateTimeToMicros(t.toLocalDateTime)
here.
dialect: JdbcDialect, | ||
schema: StructType): Array[JDBCValueGetter] = | ||
schema.fields.map(sf => makeGetter(sf.dataType, dialect, sf.metadata)) | ||
======= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please fix conflicts
thanks, merging to master! |
### What changes were proposed in this pull request? apache#36726 supports TimestampNTZ type in JDBC data source and apache#37013 applies a fix to pass more test cases with H2. The problem is that Java Timestamp is a poorly defined class and different JDBC drivers implement "getTimestamp" and "setTimestamp" with different expected behaviors in mind. The general conversion implementation would work with some JDBC dialects and their drivers but not others. This issue is discovered when testing with PostgreSQL database. This PR adds a `dialect` parameter to `makeGetter` for applying dialect specific conversions when reading a Java Timestamp into TimestampNTZType. `makeSetter` already has a `dialect` field and we will use that for converting back to Java Timestamp. ### Why are the changes needed? Fix TimestampNTZ support for PostgreSQL. Allows other JDBC dialects to provide dialect specific implementation for converting between Java Timestamp and Spark TimestampNTZType. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit test. I added new test cases for `PostgresIntegrationSuite` to cover TimestampNTZ read and writes. Closes apache#40678 from tianhanhu/SPARK-43040_jdbc_timestamp_ntz. Authored-by: tianhanhu <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
#36726 supports TimestampNTZ type in JDBC data source and #37013 applies a fix to pass more test cases with H2.
The problem is that Java Timestamp is a poorly defined class and different JDBC drivers implement "getTimestamp" and "setTimestamp" with different expected behaviors in mind. The general conversion implementation would work with some JDBC dialects and their drivers but not others. This issue is discovered when testing with PostgreSQL database.
This PR adds a
dialect
parameter tomakeGetter
for applying dialect specific conversions when reading a Java Timestamp into TimestampNTZType.makeSetter
already has adialect
field and we will use that for converting back to Java Timestamp.Why are the changes needed?
Fix TimestampNTZ support for PostgreSQL. Allows other JDBC dialects to provide dialect specific implementation for
converting between Java Timestamp and Spark TimestampNTZType.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing unit test.
I added new test cases for
PostgresIntegrationSuite
to cover TimestampNTZ read and writes.