
[SPARK-39339][SQL][FOLLOWUP] Fix bug TimestampNTZ type in JDBC data source is incorrect #37013

Closed
wants to merge 6 commits into master from SPARK-39339_followup

Conversation

beliefer
Contributor

@beliefer beliefer commented Jun 28, 2022

What changes were proposed in this pull request?

#36726 supports TimestampNTZ type in JDBC data source.
But the implementation is incorrect.
This PR updates a test case that fails against the current implementation.
The test case is shown below.

  test("SPARK-39339: TimestampNTZType with different local time zones") {
    val tableName = "timestamp_ntz_diff_tz_support_table"

    DateTimeTestUtils.outstandingZoneIds.foreach { zoneId =>
      DateTimeTestUtils.withDefaultTimeZone(zoneId) {
        Seq(
          "1972-07-04 03:30:00",
          "2019-01-20 12:00:00.502",
          "2019-01-20T00:00:00.123456",
          "1500-01-20T00:00:00.123456"
        ).foreach { case datetime =>
          val df = spark.sql(s"select timestamp_ntz '$datetime'")
          df.write.format("jdbc")
            .mode("overwrite")
            .option("url", urlWithUserAndPass)
            .option("dbtable", tableName)
            .save()

          DateTimeTestUtils.outstandingZoneIds.foreach { zoneId =>
            DateTimeTestUtils.withDefaultTimeZone(zoneId) {
              val res = spark.read.format("jdbc")
                .option("inferTimestampNTZType", "true")
                .option("url", urlWithUserAndPass)
                .option("dbtable", tableName)
                .load()

              checkAnswer(res, df)
            }
          }
        }
      }
    }
  }

The test failure output is shown below.

Results do not match for query:
Timezone: sun.util.calendar.ZoneInfo[id="Africa/Dakar",offset=0,dstSavings=0,useDaylight=false,transitions=3,lastRule=null]
Timezone Env: 

== Parsed Logical Plan ==
Relation [TIMESTAMP_NTZ '1500-01-20 00:00:00.123456'#253] JDBCRelation(timestamp_ntz_diff_tz_support_table) [numPartitions=1]

== Analyzed Logical Plan ==
TIMESTAMP_NTZ '1500-01-20 00:00:00.123456': timestamp_ntz
Relation [TIMESTAMP_NTZ '1500-01-20 00:00:00.123456'#253] JDBCRelation(timestamp_ntz_diff_tz_support_table) [numPartitions=1]

== Optimized Logical Plan ==
Relation [TIMESTAMP_NTZ '1500-01-20 00:00:00.123456'#253] JDBCRelation(timestamp_ntz_diff_tz_support_table) [numPartitions=1]

== Physical Plan ==
*(1) Scan JDBCRelation(timestamp_ntz_diff_tz_support_table) [numPartitions=1] [TIMESTAMP_NTZ '1500-01-20 00:00:00.123456'#253] PushedFilters: [], ReadSchema: struct<TIMESTAMP_NTZ '1500-01-20 00:00:00.123456':timestamp_ntz>

== Results ==

== Results ==
!== Correct Answer - 1 ==                                           == Spark Answer - 1 ==
 struct<TIMESTAMP_NTZ '1500-01-20 00:00:00.123456':timestamp_ntz>   struct<TIMESTAMP_NTZ '1500-01-20 00:00:00.123456':timestamp_ntz>
![1500-01-20T00:00:00.123456]                                       [1500-01-20T00:16:08.123456]
ScalaTestFailureLocation: org.apache.spark.sql.QueryTest$ at (QueryTest.scala:243)
org.scalatest.exceptions.TestFailedException: 

Why are the changes needed?

Fix an implementation bug.
The bug comes from using `toJavaTimestamp` and `fromJavaTimestamp`:
both adjust the timestamp according to the JVM system time zone.
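The dependence on the JVM default time zone can be sketched outside Spark (illustrative only; `wallClock` is a hypothetical helper, not Spark code):

```scala
import java.sql.Timestamp
import java.util.TimeZone

// Hypothetical helper: render the same epoch instant as a wall-clock string
// via java.sql.Timestamp under a given JVM default time zone.
def wallClock(zone: String, epochMillis: Long): String = {
  TimeZone.setDefault(TimeZone.getTimeZone(zone))
  new Timestamp(epochMillis).toLocalDateTime.toString
}

// The same instant yields different wall-clock values under different
// default zones, which is exactly what a zone-less (NTZ) type must not do.
println(wallClock("UTC", 0L))           // 1970-01-01T00:00
println(wallClock("Asia/Shanghai", 0L)) // 1970-01-01T08:00
```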

Does this PR introduce any user-facing change?

No. TimestampNTZ support in JDBC is a new feature, so there is no user-facing change.

How was this patch tested?

New test case.

@github-actions github-actions bot added the SQL label Jun 28, 2022
@beliefer beliefer changed the title [SPARK-39339][SQL][FOLLOWUP] TimestampNTZ type in JDBC data source is incorrect [WIP][SPARK-39339][SQL][FOLLOWUP] TimestampNTZ type in JDBC data source is incorrect Jun 28, 2022
@beliefer beliefer changed the title [WIP][SPARK-39339][SQL][FOLLOWUP] TimestampNTZ type in JDBC data source is incorrect [SPARK-39339][SQL][FOLLOWUP] TimestampNTZ type in JDBC data source is incorrect Jun 28, 2022
@beliefer
Contributor Author

beliefer commented Jun 28, 2022

ping @gengliangwang cc @cloud-fan @sadikovi

@beliefer beliefer changed the title [SPARK-39339][SQL][FOLLOWUP] TimestampNTZ type in JDBC data source is incorrect [SPARK-39339][SQL][FOLLOWUP] Fix bug TimestampNTZ type in JDBC data source is incorrect Jun 28, 2022
@@ -599,10 +610,13 @@ object JdbcUtils extends Logging with SQLConfHelper {
stmt.setTimestamp(pos + 1, toJavaTimestamp(instantToMicros(row.getAs[Instant](pos))))
} else {
Contributor

since we are touching the code here, let's make it right. I think for ntz type, its value is always java.time.LocalDateTime, no matter datetimeJava8ApiEnabled is true or not, right? @MaxGekk @gengliangwang

Member

Yes, TimestampNTZ is independent of the configuration.

Contributor Author

Updated.

Contributor

@sadikovi sadikovi left a comment

Thanks for submitting the PR. Can you elaborate on why these changes are necessary and provide a clear explanation of the problem and solution in the PR description and/or comments? This will help us to review this PR.

The current explanation is basically non-existent and just states that it is a bug; however, previously all of the tests passed in CI and on my local machine. It is likely that the existing tests did not cover this corner case so it would be good to outline what that case was.

@@ -1939,17 +1939,22 @@ class JDBCSuite extends QueryTest
val df = spark.sql(s"select timestamp_ntz '$datetime'")
df.write.format("jdbc")
  .mode("overwrite")
  .option("inferTimestampNTZType", "true")
Contributor

inferTimestampNTZType is only applied for reads. Could you remove this option here?

val seconds = Math.floorDiv(micros, DateTimeConstants.MICROS_PER_SECOND)
val nanos = (micros - seconds * DateTimeConstants.MICROS_PER_SECOND) *
  DateTimeConstants.NANOS_PER_MICROS
val result = new java.sql.Timestamp(seconds * DateTimeConstants.MILLIS_PER_SECOND)
Contributor

Isn't it what toJavaTimestamp does already?

Contributor Author

@beliefer beliefer Jun 29, 2022

Because `toJavaTimestamp` depends on the JVM system time zone, we must keep the value free of any time zone adjustment.

Contributor

Are you saying that toJavaTimestamp rebases the original value, i.e. adds an offset, so java.sql.Timestamp ends up with that offset causing the date to be different?

Member

+1 with @sadikovi

Contributor Author

The offset is based on the JVM system time zone.

Member

@beliefer it's basically the same except for rebasing the old timestamps

  def toJavaTimestamp(micros: Long): Timestamp = {
    val rebasedMicros = rebaseGregorianToJulianMicros(micros)
    val seconds = Math.floorDiv(rebasedMicros, MICROS_PER_SECOND)
    val ts = new Timestamp(seconds * MILLIS_PER_SECOND)
    val nanos = (rebasedMicros - seconds * MICROS_PER_SECOND) * NANOS_PER_MICROS
    ts.setNanos(nanos.toInt)
    ts
  }

Contributor Author

Yes

@sadikovi
Contributor

Please update the PR description with the clear explanation of the bug and how the solution fixes the problem.

@gengliangwang
Member

gengliangwang commented Jun 30, 2022

This PR just modify a test case and it will be failed !
The test case output failure show below.

@beliefer could you provide the test case itself in the PR description?
It seems that only "1500-01-20T00:00:00.123456" failed? Why is that?

@sadikovi
Contributor

I think the problem the unit test is trying to address is what happens when writes are in one time zone but reads are in another. This explains why there are 2 loops: one is for writes and the other one is for reads.

I am happy with the fix, thanks for working on it but I am also curious why it fails on 1500-01-20T00:00:00.123456. Is it because of the calendar that Spark uses?

@cloud-fan
Contributor

I am also curious why it fails on 1500-01-20T00:00:00.123456. Is it because of the calendar that Spark uses?

I believe so, toJavaTimestamp takes care of the legacy calendar, which applies to old dates like the above one.
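The calendar difference can be seen with plain JDK classes (illustrative sketch, not Spark code): `java.sql.Timestamp` uses the JDK's hybrid Julian/Gregorian calendar, while `java.time` uses the proleptic Gregorian calendar, and for January 1500 the two disagree by 9 days.

```scala
import java.sql.Timestamp
import java.time.{LocalDateTime, ZoneOffset}
import java.util.TimeZone

TimeZone.setDefault(TimeZone.getTimeZone("UTC"))

// Timestamp.valueOf parses the string with the hybrid Julian/Gregorian
// calendar; java.time uses the proleptic Gregorian calendar.
val hybridMillis = Timestamp.valueOf("1500-01-20 00:00:00").getTime
val prolepticMillis =
  LocalDateTime.of(1500, 1, 20, 0, 0).toInstant(ZoneOffset.UTC).toEpochMilli

// For January 1500 the two calendars disagree by 9 days, so naive
// conversions through java.sql.Timestamp shift such old dates.
val diffDays = (hybridMillis - prolepticMillis) / 86400000L
println(diffDays) // 9
```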

(rs: ResultSet, row: InternalRow, pos: Int) =>
  val t = rs.getTimestamp(pos + 1)
  if (t != null) {
    val micros = DateTimeUtils.millisToMicros(t.getTime) +
Contributor

to make it more readable, can we add DateTimeUtils.fromJavaTimestampNoRebase?

Contributor Author

OK

(stmt: PreparedStatement, row: Row, pos: Int) =>
  val micros = localDateTimeToMicros(row.getAs[java.time.LocalDateTime](pos))
  val seconds = Math.floorDiv(micros, DateTimeConstants.MICROS_PER_SECOND)
  val nanos = (micros - seconds * DateTimeConstants.MICROS_PER_SECOND) *
Contributor

@cloud-fan cloud-fan Jun 30, 2022

ditto, add toJavaTimestampNoRebase

Contributor Author

OK

@beliefer beliefer force-pushed the SPARK-39339_followup branch from b80959b to ebe69c5 Compare June 30, 2022 06:18
* @param micros The number of microseconds since 1970-01-01T00:00:00.000000Z.
* @return A `java.sql.Timestamp` from number of micros since epoch.
*/
def toJavaTimestampNoRebase(micros: Long): Timestamp = {
Contributor

is it possible to share code with toJavaTimestamp? e.g.

def toJavaTimestamp ... {
  toJavaTimestampNoRebase(rebaseGregorianToJulianMicros(micros))
}

Contributor Author

Got it.
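A self-contained sketch of the resulting helper pair (constants inlined; the names come from the review, but the bodies are reconstructed here and are not necessarily Spark's exact code):

```scala
import java.sql.Timestamp

val MICROS_PER_SECOND = 1000000L
val MILLIS_PER_SECOND = 1000L
val NANOS_PER_MICROS = 1000L

// Epoch micros -> java.sql.Timestamp with no calendar rebase, so the
// wall-clock value of a TimestampNTZ survives unchanged.
def toJavaTimestampNoRebase(micros: Long): Timestamp = {
  val seconds = Math.floorDiv(micros, MICROS_PER_SECOND)
  val ts = new Timestamp(seconds * MILLIS_PER_SECOND)
  val nanos = (micros - seconds * MICROS_PER_SECOND) * NANOS_PER_MICROS
  ts.setNanos(nanos.toInt)
  ts
}

// The inverse: java.sql.Timestamp -> epoch micros, again with no rebase.
def fromJavaTimestampNoRebase(t: Timestamp): Long =
  Math.floorDiv(t.getTime, MILLIS_PER_SECOND) * MICROS_PER_SECOND +
    t.getNanos / NANOS_PER_MICROS

// Round-trips exactly, including pre-epoch (negative) micros.
println(fromJavaTimestampNoRebase(toJavaTimestampNoRebase(123456789L))) // 123456789
```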

@gengliangwang
Member

Thanks, merging to master

@beliefer
Contributor Author

beliefer commented Jul 1, 2022

@gengliangwang @cloud-fan @sadikovi Thank you !

cloud-fan pushed a commit that referenced this pull request May 5, 2023
### What changes were proposed in this pull request?

#36726 supports TimestampNTZ type in JDBC data source and #37013 applies a fix to pass more test cases with H2.

The problem is that Java Timestamp is a poorly defined class, and different JDBC drivers implement "getTimestamp" and "setTimestamp" with different expected behaviors in mind. The general conversion implementation works with some JDBC dialects and their drivers but not others. This issue was discovered when testing against a PostgreSQL database.

This PR adds a `dialect` parameter to `makeGetter` for applying dialect specific conversions when reading a Java Timestamp into TimestampNTZType. `makeSetter` already has a `dialect` field and we will use that for converting back to Java Timestamp.

### Why are the changes needed?

Fix TimestampNTZ support for PostgreSQL, and allow other JDBC dialects to provide dialect-specific implementations for converting between Java Timestamp and Spark TimestampNTZType.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.
I added new test cases to `PostgresIntegrationSuite` to cover TimestampNTZ reads and writes.

Closes #40678 from tianhanhu/SPARK-43040_jdbc_timestamp_ntz.

Authored-by: tianhanhu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
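A hypothetical sketch of the dialect-hook shape described above (the trait and method names here are illustrative; Spark's actual API may differ):

```scala
import java.sql.Timestamp
import java.time.LocalDateTime

// Illustrative only: each JDBC dialect can override how the driver's
// java.sql.Timestamp maps to a zone-less LocalDateTime and back.
trait DialectSketch {
  // Default: take the driver's wall-clock reading as-is.
  def javaTimestampToNTZ(t: Timestamp): LocalDateTime = t.toLocalDateTime
  def ntzToJavaTimestamp(ldt: LocalDateTime): Timestamp = Timestamp.valueOf(ldt)
}

object DefaultDialect extends DialectSketch

// The default round-trips a wall-clock value unchanged; a dialect like
// PostgreSQL could override either method to match its driver's behavior.
val ldt = LocalDateTime.of(2019, 1, 20, 12, 0, 0, 502000000)
println(DefaultDialect.javaTimestampToNTZ(DefaultDialect.ntzToJavaTimestamp(ldt)) == ldt) // true
```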
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request May 10, 2023