
[BUG] test_csv_infer_schema_timestamp_ntz fails #9325

Closed
jlowe opened this issue Sep 28, 2023 · 10 comments
Labels: bug (Something isn't working), Spark 3.5+ (Spark 3.5+ issues)

Comments

jlowe (Member) commented Sep 28, 2023

The following tests failed in a recent integration test run for 23.10:

[2023-09-28T15:26:39.164Z] FAILED ../../src/main/python/csv_test.py::test_csv_infer_schema_timestamp_ntz_v1[TIMESTAMP_LTZ--yyyy-MM-dd][INJECT_OOM, ALLOW_NON_GPU(FileSourceScanExec,ProjectExec,CollectLimitExec,DeserializeToObjectExec)]
[2023-09-28T15:26:39.165Z] FAILED ../../src/main/python/csv_test.py::test_csv_infer_schema_timestamp_ntz_v1[TIMESTAMP_LTZ--yyyy-MM][ALLOW_NON_GPU(FileSourceScanExec,ProjectExec,CollectLimitExec,DeserializeToObjectExec)]
[2023-09-28T15:26:39.165Z] FAILED ../../src/main/python/csv_test.py::test_csv_infer_schema_timestamp_ntz_v1[TIMESTAMP_LTZ-'T'HH:mm:ss-yyyy-MM-dd][INJECT_OOM, ALLOW_NON_GPU(FileSourceScanExec,ProjectExec,CollectLimitExec,DeserializeToObjectExec)]
[2023-09-28T15:26:39.165Z] FAILED ../../src/main/python/csv_test.py::test_csv_infer_schema_timestamp_ntz_v1[TIMESTAMP_LTZ-'T'HH:mm-yyyy-MM-dd][ALLOW_NON_GPU(FileSourceScanExec,ProjectExec,CollectLimitExec,DeserializeToObjectExec)]
[2023-09-28T15:26:39.165Z] FAILED ../../src/main/python/csv_test.py::test_csv_infer_schema_timestamp_ntz_v2[TIMESTAMP_LTZ--yyyy-MM-dd][INJECT_OOM, ALLOW_NON_GPU(BatchScanExec,FileSourceScanExec,ProjectExec,CollectLimitExec,DeserializeToObjectExec)]
[2023-09-28T15:26:39.165Z] FAILED ../../src/main/python/csv_test.py::test_csv_infer_schema_timestamp_ntz_v2[TIMESTAMP_LTZ--yyyy-MM][INJECT_OOM, ALLOW_NON_GPU(BatchScanExec,FileSourceScanExec,ProjectExec,CollectLimitExec,DeserializeToObjectExec)]
[2023-09-28T15:26:39.165Z] FAILED ../../src/main/python/csv_test.py::test_csv_infer_schema_timestamp_ntz_v2[TIMESTAMP_LTZ-'T'HH:mm:ss-yyyy-MM-dd][ALLOW_NON_GPU(BatchScanExec,FileSourceScanExec,ProjectExec,CollectLimitExec,DeserializeToObjectExec)]
[2023-09-28T15:26:39.165Z] FAILED ../../src/main/python/csv_test.py::test_csv_infer_schema_timestamp_ntz_v2[TIMESTAMP_LTZ-'T'HH:mm-yyyy-MM-dd]

Some of the tests failed with:

[2023-09-28T15:26:39.156Z] E                   Caused by: org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
[2023-09-28T15:26:39.156Z] E                   Fail to parse '2884-06-24T02:45:51.138' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.

Others failed with:

[2023-09-28T15:26:39.148Z] E                   Caused by: java.time.format.DateTimeParseException: Text '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 16
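
For reference, the "unparsed text found at index 16" message is what java.time produces when a pattern only consumes part of the input: "yyyy-MM-dd'T'HH:mm" (the combined pattern suggested by the 'T'HH:mm test parameters) matches only the first 16 characters of '2884-06-24T02:45:51.138'. A minimal standalone Scala sketch, assuming Spark's Iso8601TimestampFormatter ultimately delegates to java.time here:

import java.time.format.DateTimeFormatter

// The pattern consumes only "2884-06-24T02:45" (16 characters), so the
// remainder ":51.138" is reported as unparsed text at index 16.
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm")
fmt.parse("2884-06-24T02:45:51.138") // throws java.time.format.DateTimeParseException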
jlowe (Member, Author) commented Sep 28, 2023

Tests were introduced in #9159. cc: @andygrove

andygrove (Contributor) commented:

There is a regression or change in behavior in Spark 3.5.0. Here is a repro case.

spark.conf.set("spark.sql.timestampType", "TIMESTAMP_LTZ")
spark.conf.set("spark.sql.sources.useV1SourceList", "")
spark.conf.set("spark.sql.legacy.timeParserPolicy", "EXCEPTION")
val timestampFormat = "yyyy-MM-dd'T'HH:mm:ss"
val ts = Seq("2884-06-24T02:45:51.138", "2884-06-24T02:45:51.138", "2884-06-24T02:45:51.138").toDF("ts").repartition(1)
ts.write.mode("Overwrite").option("timestampFormat", timestampFormat).csv("/tmp/ts.csv")
val df = spark.read.option("timestampFormat", timestampFormat).option("inferSchema", true).csv("/tmp/ts.csv")
df.show(truncate=false)

Both 3.4.0 and 3.5.0 produce the same CSV file:

2884-06-24T02:45:51.138
2884-06-24T02:45:51.138
2884-06-24T02:45:51.138

The read works fine with 3.4.0 and produces this result:

+-----------------------+
|_c0                    |
+-----------------------+
|2884-06-24T02:45:51.138|
|2884-06-24T02:45:51.138|
|2884-06-24T02:45:51.138|
+-----------------------+

The read in 3.5.0 fails with:

E                   Fail to parse '2884-06-24T02:45:51.138' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
E                       at org.apache.spark.sql.errors.ExecutionErrors.failToParseDateTimeInNewParserError(ExecutionErrors.scala:54)
E                       at org.apache.spark.sql.errors.ExecutionErrors.failToParseDateTimeInNewParserError$(ExecutionErrors.scala:48)
E                       at org.apache.spark.sql.errors.ExecutionErrors$.failToParseDateTimeInNewParserError(ExecutionErrors.scala:218)
E                       at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:142)
E                       at org.apache.spark.sql.catalyst.util.DateTimeFormatterHelper$$anonfun$checkParsedDiff$1.applyOrElse(DateTimeFormatterHelper.scala:135)
E                       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
E                       at org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.parse(TimestampFormatter.scala:194)
E                       at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$21(UnivocityParser.scala:237)
E                       at org.apache.spark.sql.catalyst.csv.UnivocityParser.nullSafeDatum(UnivocityParser.scala:291)
E                       at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$makeConverter$20(UnivocityParser.scala:235)
E                       at org.apache.spark.sql.catalyst.csv.UnivocityParser.org$apache$spark$sql$catalyst$csv$UnivocityParser$$convert(UnivocityParser.scala:346)
E                       at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$parse$2(UnivocityParser.scala:307)
E                       at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$1(UnivocityParser.scala:452)
E                       at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:60)
E                       at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:456)
E                       at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
E                       at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
E                       at org.apache.spark.sql.execution.datasources.v2.PartitionReaderFromIterator.next(PartitionReaderFromIterator.scala:26)

This only fails when inferSchema is enabled.
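
A possible mitigation while this stands, sketched here and not verified against 3.5.0, is to supply an explicit schema so the inference path is never taken; using StringType reproduces the 3.4.0 result shown above:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Reuses timestampFormat and /tmp/ts.csv from the repro above; forcing the
// column to StringType skips the timestamp inference that triggers the failure.
val stringSchema = StructType(Seq(StructField("_c0", StringType, nullable = true)))
val df2 = spark.read
  .option("timestampFormat", timestampFormat)
  .schema(stringSchema) // instead of .option("inferSchema", true)
  .csv("/tmp/ts.csv")
df2.show(truncate = false)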

andygrove (Contributor) commented:

Another observation: this works fine with TIMESTAMP_NTZ but not with TIMESTAMP_LTZ.
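
If that holds, pinning the session timestamp type is another possible mitigation for affected jobs (illustrative only, not verified):

// With TIMESTAMP_NTZ the same repro reportedly succeeds on 3.5.0.
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")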

andygrove (Contributor) commented:

I filed a bug against Spark:

https://issues.apache.org/jira/browse/SPARK-45424

andygrove (Contributor) commented:

There is also a bug in the plugin:

#9390

andygrove (Contributor) commented:

The Spark bug is fixed in 3.5.1

apache/spark#43245

razajafri (Collaborator) commented:

> The Spark bug is fixed in 3.5.1
> apache/spark#43245

I can confirm that this test passes on Spark-3.5.1

razajafri (Collaborator) commented:

I can confirm that this is not reproducible on Spark 3.5.0 either.

marreddy commented Apr 11, 2024

[jira] [Updated] (SPARK-44025) CSV Table Read Error with CharType(length) column

https://www.mail-archive.com/[email protected]/msg349889.html

I'm getting a similar issue after upgrading from Spark 3.3.0 to 3.5.0, and it does not work in 3.5.1 either. Please help me fix the issue.

Driver stacktrace:
INFO | ResultStage 1 (show at TestTimeZone.java:60) failed in 0.116 s due to Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (192.168.3.18 executor driver): java.lang.IllegalArgumentException: requirement failed: requiredSchema (struct<operator:string>) should be the subset of dataSchema (struct<operator:string>).
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.<init>(UnivocityParser.scala:56)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:127)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:217)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:279)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:141)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)


My DB tables (note: I'm getting the issue only for char column types):

CREATE TABLE test_table (
operator char(10) DEFAULT NULL
) ENGINE=InnoDB

CREATE TABLE test_table_varchar (
operator varchar(10) DEFAULT NULL
) ENGINE=InnoDB


My Test Code

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class TestTimeZone {

    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .master("local[*]")
                .appName("TestTimeZone")
                .getOrCreate();

        Map<String, String> options = new HashMap<String, String>();
        options.put("zeroDateTimeBehavior", "ROUND");
        options.put("driver", "com.mysql.cj.jdbc.Driver");
        options.put("url", "jdbc:mysql://localhost:3306/test?user=root&password=root&tinyInt1isBit=false");
        options.put("dbtable", "(SELECT operator FROM test_table LIMIT 1) as t1");

        StructType schema = new StructType(new StructField[]{
            new StructField("operator", DataTypes.StringType, true, Metadata.empty())
        });

        Dataset<Row> dbDS = spark.read().format("jdbc").options(options).load();
        dbDS.printSchema();
        dbDS.show();

        Dataset<Row> tmpDS = spark.read()
                .format("csv")
                .option("header", "true")
                .option("escape", "\"")
                .option("wholeFile", true)
                .option("multiline", true)
                .option("ignoreLeadingWhiteSpace", true)
                .option("ignoreTrailingWhiteSpace", true)
                .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
                .option("dateFormat", "yyyy-MM-dd")
                .schema(dbDS.schema())   // I'm reading the schema from the DB and applying it here.
                //.schema(schema)
                .load("D:\\opt\\data\\spark3.csv");

        tmpDS.printSchema();
        tmpDS.show();
    }
}


Data In CSV file

"operator"
"abc"
"def"

The same code works with Spark 3.3.0 and produces:

+--------+
|operator|
+--------+
| abc|
| def|
+--------+
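
Since the failure appears only for char columns, the CHAR-length metadata carried on the JDBC-derived schema is the likely trigger for the requiredSchema check. A possible mitigation, sketched in Scala against the same names as the Java snippet above and not verified, is to rebuild that schema with plain StringType fields before handing it to the CSV reader:

import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}

// Strip any CHAR/VARCHAR metadata from the JDBC-derived schema so the CSV
// reader sees plain string columns (hypothetical workaround, not verified).
val plainSchema = StructType(dbDS.schema.fields.map(f =>
  StructField(f.name, DataTypes.StringType, f.nullable, Metadata.empty)))

val tmpDS = spark.read
  .format("csv")
  .option("header", "true")
  .schema(plainSchema)
  .load("D:\\opt\\data\\spark3.csv")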

revans2 (Collaborator) commented Apr 11, 2024

@marreddy I don't see anything in the error that is related to the RAPIDS plugin for Apache Spark. You appear to be running on Windows, which we don't support at all, and the stack trace does not include anything related to this project.

You might want to ask on the Apache Spark users mailing list or one of the other resources for asking questions: https://spark.apache.org/community.html
