
[BUG] test_csv_infer_schema_timestamp_ntz fails #9325

Closed
jlowe opened this issue Sep 28, 2023 · 10 comments
Labels: bug (Something isn't working), Spark 3.5+ (Spark 3.5+ issues)

Comments

jlowe (Member) commented Sep 28, 2023

The following tests failed in a recent integration test run for 23.10:

[2023-09-28T15:26:39.164Z] FAILED ../../src/main/python/csv_test.py::test_csv_infer_schema_timestamp_ntz_v1[TIMESTAMP_LTZ--yyyy-MM-dd][INJECT_OOM, ALLOW_NON_GPU(FileSourceScanExec,ProjectExec,CollectLimitExec,DeserializeToObjectExec)]
[2023-09-28T15:26:39.165Z] FAILED ../../src/main/python/csv_test.py::test_csv_infer_schema_timestamp_ntz_v1[TIMESTAMP_LTZ--yyyy-MM][ALLOW_NON_GPU(FileSourceScanExec,ProjectExec,CollectLimitExec,DeserializeToObjectExec)]
[2023-09-28T15:26:39.165Z] FAILED ../../src/main/python/csv_test.py::test_csv_infer_schema_timestamp_ntz_v1[TIMESTAMP_LTZ-'T'HH:mm:ss-yyyy-MM-dd][INJECT_OOM, ALLOW_NON_GPU(FileSourceScanExec,ProjectExec,CollectLimitExec,DeserializeToObjectExec)]
[2023-09-28T15:26:39.165Z] FAILED ../../src/main/python/csv_test.py::test_csv_infer_schema_timestamp_ntz_v1[TIMESTAMP_LTZ-'T'HH:mm-yyyy-MM-dd][ALLOW_NON_GPU(FileSourceScanExec,ProjectExec,CollectLimitExec,DeserializeToObjectExec)]
[2023-09-28T15:26:39.165Z] FAILED ../../src/main/python/csv_test.py::test_csv_infer_schema_timestamp_ntz_v2[TIMESTAMP_LTZ--yyyy-MM-dd][INJECT_OOM, ALLOW_NON_GPU(BatchScanExec,FileSourceScanExec,ProjectExec,CollectLimitExec,DeserializeToObjectExec)]
[2023-09-28T15:26:39.165Z] FAILED ../../src/main/python/csv_test.py::test_csv_infer_schema_timestamp_ntz_v2[TIMESTAMP_LTZ--yyyy-MM][INJECT_OOM, ALLOW_NON_GPU(BatchScanExec,FileSourceScanExec,ProjectExec,CollectLimitExec,DeserializeToObjectExec)]
[2023-09-28T15:26:39.165Z] FAILED ../../src/main/python/csv_test.py::test_csv_infer_schema_timestamp_ntz_v2[TIMESTAMP_LTZ-'T'HH:mm:ss-yyyy-MM-dd][ALLOW_NON_GPU(BatchScanExec,FileSourceScanExec,ProjectExec,CollectLimitExec,DeserializeToObjectExec)]
[2023-09-28T15:26:39.165Z] FAILED ../../src/main/python/csv_test.py::test_csv_infer_schema_timestamp_ntz_v2[TIMESTAMP_LTZ-'T'HH:mm-yyyy-MM-dd]

Some of the tests failed with:

[2023-09-28T15:26:39.156Z] E                   Caused by: org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
[2023-09-28T15:26:39.156Z] E                   Fail to parse '2884-06-24T02:45:51.138' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.

Others failed with:

[2023-09-28T15:26:39.148Z] E                   Caused by: java.time.format.DateTimeParseException: Text '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 16
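
For reference, the "unparsed text found at index 16" message is what java.time produces when a pattern only consumes part of the input: "yyyy-MM-dd'T'HH:mm" (the combined pattern suggested by the 'T'HH:mm test parameters) matches only the first 16 characters of '2884-06-24T02:45:51.138'. A minimal standalone Scala sketch, assuming Spark's Iso8601TimestampFormatter ultimately delegates to java.time here:

import java.time.format.DateTimeFormatter

// The pattern consumes only "2884-06-24T02:45" (16 characters), so the
// remainder ":51.138" is reported as unparsed text at index 16.
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm")
fmt.parse("2884-06-24T02:45:51.138") // throws java.time.format.DateTimeParseException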
jlowe (Member, Author) commented Sep 28, 2023

Tests were introduced in #9159. cc: @andygrove

andygrove (Contributor) commented:

There is a regression or change in behavior in Spark 3.5.0. Here is a repro case.

spark.conf.set("spark.sql.timestampType", "TIMESTAMP_LTZ")
spark.conf.set("spark.sql.sources.useV1SourceList", "")
spark.conf.set("spark.sql.legacy.timeParserPolicy", "EXCEPTION")
val timestampFormat = "yyyy-MM-dd'T'HH:mm:ss"
val ts = Seq("2884-06-24T02:45:51.138", "2884-06-24T02:45:51.138", "2884-06-24T02:45:51.138").toDF("ts").repartition(1)
ts.write.mode("Overwrite").option("timestampFormat", timestampFormat).csv("/tmp/ts.csv")
val df = spark.read.option("timestampFormat", timestampFormat).option("inferSchema", true).csv("/tmp/ts.csv")
df.show(truncate=false)

Both 3.4.0 and 3.5.0 produce the same CSV file:

2884-06-24T02:45:51.138
2884-06-24T02:45:51.138
2884-06-24T02:45:51.138

The read works fine with 3.4.0 and produces this result:

+-----------------------+
|_c0                    |
+-----------------------+
|2884-06-24T02:45:51.138|
|2884-06-24T02:45:51.138|
|2884-06-24T02:45:51.138|
+-----------------------+

The read in 3.5.0 fails with:

E                   Fail to parse '2884-06-24T02:45:51.138' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
E                       at org.apache.spark.sql.errors.ExecutionErrors.failToParseDateTimeInNewParserError(ExecutionErrors.scala:54)
E                       at org.apache.spark.sql.errors.ExecutionErrors.failToParseDateTimeInNewParserError$(ExecutionErrors.scala:48)
E                       at org.apache.spark.sql.errors.ExecutionErrors$.failToParseDateTimeInNewParserError(ExecutionErrors.scala:218)
E                       at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:142)
E                       at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:135)
E                       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
E                       at org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.parse(TimestampFormatter.scala:194)
E                       at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:237)
E                       at org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:291)
E                       at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:235)
E                       at org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:346)
E                       at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:307)
E                       at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:452)
E                       at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60)
E                       at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:456)
E                       at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
E                       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
E                       at org.apache.spark.sql.execution.datasources.v2.PartitionReaderFromIterator.next(PartitionReaderFromIterator.scala:26)

This only fails when inferSchema is enabled.
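
A possible mitigation while this stands, sketched here and not verified against 3.5.0, is to supply an explicit schema so the inference path is never taken; using StringType reproduces the 3.4.0 result shown above:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Reuses timestampFormat and /tmp/ts.csv from the repro above; forcing the
// column to StringType skips the timestamp inference that triggers the failure.
val stringSchema = StructType(Seq(StructField("_c0", StringType, nullable = true)))
val df2 = spark.read
  .option("timestampFormat", timestampFormat)
  .schema(stringSchema) // instead of .option("inferSchema", true)
  .csv("/tmp/ts.csv")
df2.show(truncate = false)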

andygrove (Contributor) commented:

Another observation: this works fine with TIMESTAMP_NTZ but not with TIMESTAMP_LTZ.
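
If that holds, pinning the session timestamp type is another possible mitigation for affected jobs (illustrative only, not verified):

// With TIMESTAMP_NTZ the same repro reportedly succeeds on 3.5.0.
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")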

andygrove (Contributor) commented:

I filed a bug against Spark:

https://issues.apache.org/jira/browse/SPARK-45424

andygrove (Contributor) commented:

There is also a bug in the plugin:

#9390

andygrove (Contributor) commented:

The Spark bug is fixed in 3.5.1

apache/spark#43245

razajafri (Collaborator) commented:

> The Spark bug is fixed in 3.5.1
> apache/spark#43245

I can confirm that this test passes on Spark-3.5.1

razajafri (Collaborator) commented:

I can confirm that this is not reproducible on Spark 3.5.0 either.

marreddy commented Apr 11, 2024

[jira] [Updated] (SPARK-44025) CSV Table Read Error with CharType(length) column

https://www.mail-archive.com/[email protected]/msg349889.html

I'm getting a similar issue after upgrading from Spark 3.3.0 to 3.5.0, and it does not work in 3.5.1 either. Please help me fix the issue.

Driver stacktrace:
INFO | ResultStage 1 (show at TestTimeZone.java:60) failed in 0.116 s due to Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (192.168.3.18 executor driver): java.lang.IllegalArgumentException: requirement failed: requiredSchema (struct<operator:string>) should be the subset of dataSchema (struct<operator:string>).
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.<init>(UnivocityParser.scala:56)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:127)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:217)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:279)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)


My DB tables (note: I'm getting the issue only for char column types):

CREATE TABLE test_table (
operator char(10) DEFAULT NULL
) ENGINE=InnoDB

CREATE TABLE test_table_varchar (
operator varchar(10) DEFAULT NULL
) ENGINE=InnoDB


My Test Code

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class TestTimeZone {

    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .master("local[*]")
                .appName("TestTimeZone")
                .getOrCreate();

        Map<String, String> options = new HashMap<String, String>();
        options.put("zeroDateTimeBehavior", "ROUND");
        options.put("driver", "com.mysql.cj.jdbc.Driver");
        options.put("url", "jdbc:mysql://localhost:3306/test?user=root&password=root&tinyInt1isBit=false");
        options.put("dbtable", "(SELECT operator FROM test_table LIMIT 1) as t1");

        StructType schema = new StructType(new StructField[]{
            new StructField("operator", DataTypes.StringType, true, Metadata.empty())
        });

        Dataset<Row> dbDS = spark.read().format("jdbc").options(options).load();
        dbDS.printSchema();
        dbDS.show();

        Dataset<Row> tmpDS = spark.read()
                .format("csv")
                .option("header", "true")
                .option("escape", "\"")
                .option("wholeFile", true)
                .option("multiline", true)
                .option("ignoreLeadingWhiteSpace", true)
                .option("ignoreTrailingWhiteSpace", true)
                .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
                .option("dateFormat", "yyyy-MM-dd")
                .schema(dbDS.schema())   // I'm reading the schema from the DB and applying it here.
                //.schema(schema)
                .load("D:\\opt\\data\\spark3.csv");

        tmpDS.printSchema();
        tmpDS.show();
    }
}


Data In CSV file

"operator"
"abc"
"def"

The same code works with Spark 3.3.0 and produces:

+--------+
|operator|
+--------+
| abc|
| def|
+--------+
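
Since the failure appears only for char columns, the CHAR-length metadata carried on the JDBC-derived schema is the likely trigger for the requiredSchema check. A possible mitigation, sketched in Scala against the same names as the Java snippet above and not verified, is to rebuild that schema with plain StringType fields before handing it to the CSV reader:

import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}

// Strip any CHAR/VARCHAR metadata from the JDBC-derived schema so the CSV
// reader sees plain string columns (hypothetical workaround, not verified).
val plainSchema = StructType(dbDS.schema.fields.map(f =>
  StructField(f.name, DataTypes.StringType, f.nullable, Metadata.empty)))

val tmpDS = spark.read
  .format("csv")
  .option("header", "true")
  .schema(plainSchema)
  .load("D:\\opt\\data\\spark3.csv")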

revans2 (Collaborator) commented Apr 11, 2024

@marreddy I don't see anything in the error that is related to the RAPIDS plugin for Apache Spark. You appear to be running on Windows, which we don't support at all, and the stack trace does not include anything related to this project.

You might want to ask on the Apache Spark users mailing list or one of the other resources for asking questions: https://spark.apache.org/community.html
