[BUG] Reading invalid DATE strings yields exceptions instead of nulls #7089

Closed
mythrocks opened this issue Nov 16, 2022 · 0 comments · Fixed by #7221
Labels
bug Something isn't working


mythrocks commented Nov 16, 2022

Scope

This behaviour affects reads of both CSV and Hive delimited text input.

Description

When an invalid DATE string (e.g. "2020-50-16", or even "abcde") is read through the Spark RAPIDS plugin, one sees the following exception:

Caused by: java.time.DateTimeException: One or more values is not a valid date
  at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$castStringToDate$3(GpuTextBasedPartitionReader.scala:261)
  at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$castStringToDate$3$adapted(GpuTextBasedPartitionReader.scala:259)
  at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
  at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
  at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.withResource(GpuTextBasedPartitionReader.scala:52)
  at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.$anonfun$castStringToDate$2(GpuTextBasedPartitionReader.scala:259)
  at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
  at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
  at com.nvidia.spark.rapids.GpuTextBasedPartitionReader.withResource(GpuTextBasedPartitionReader.scala:52)

Repro

This may be reproduced by reading the following input either as CSV or as Hive delimited text:

2020-09-16
2020-10-16
2021-09-16
 2021-09-16
2021-09-16
2020-50-16
asdf
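
For convenience, here is a minimal Python sketch that writes the input above to a file, byte for byte, including the leading space on the fourth line. The file name `dates.csv` and the helper `write_repro_file` are hypothetical names chosen for this illustration; they do not come from the plugin or its tests.

```python
import os
import tempfile

# The seven input lines from the repro, copied verbatim.
# The fourth entry deliberately keeps its leading space.
REPRO_LINES = [
    "2020-09-16",
    "2020-10-16",
    "2021-09-16",
    " 2021-09-16",
    "2021-09-16",
    "2020-50-16",
    "asdf",
]


def write_repro_file(directory):
    """Write the repro input to <directory>/dates.csv and return the path.

    Hypothetical helper for illustration only.
    """
    path = os.path.join(directory, "dates.csv")
    # newline="" stops Python from translating "\n" on any platform.
    with open(path, "w", newline="") as f:
        f.write("\n".join(REPRO_LINES) + "\n")
    return path


if __name__ == "__main__":
    print(write_repro_file(tempfile.mkdtemp()))
```

With a Spark session available, reading the resulting path with something like `spark.read.schema("d DATE").csv(path)` (the column name `d` is an assumption) should exercise the CSV case described above.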

Expected behaviour

Invalid strings should produce null when interpreted as DATE, as they do with Apache Spark.
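
The null-instead-of-exception contract can be sketched in plain Python. This only illustrates the expected semantics and is not the plugin's implementation; `parse_date_or_none` is a hypothetical name, and the `yyyy-MM-dd` format is an assumption based on the sample input. How this sketch treats surrounding whitespace is a property of Python's `strptime`, not a statement about Spark or Hive.

```python
from datetime import date, datetime
from typing import Optional


def parse_date_or_none(text: str) -> Optional[date]:
    """Return the parsed date, or None when the string is not a valid date.

    Hypothetical helper: a parse failure is mapped to None (standing in
    for a SQL null) instead of being propagated as an exception.
    """
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        # Covers both non-date text ("asdf") and out-of-range
        # fields such as month 50.
        return None


if __name__ == "__main__":
    for s in ["2020-09-16", "2020-50-16", "asdf"]:
        print(repr(s), "->", parse_date_or_none(s))
```

The point of the design is that a bad value affects only its own row: the read continues, and the offending cell becomes null.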

Caveats

This behaviour does not happen with date values serialized through the Hive LazySimpleSerDe, which are read correctly through GpuHiveTableScanExec.

mythrocks added the "bug" (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels, then removed the "? - Needs Triage" label, Nov 16, 2022
mythrocks added a commit to mythrocks/spark-rapids that referenced this issue Dec 1, 2022
Fixes NVIDIA#7089. There were two problems:
  1. Strings between field delimiters should not be trimmed before casting to dates.
  2. Invalid date strings should not be causing exceptions. They should return null
     values, as is the convention in Hive's `LazySimpleSerDe`.

Signed-off-by: MithunR <[email protected]>
mythrocks added a commit that referenced this issue Dec 7, 2022
* Hive Text parsing of invalid date strings should not cause exceptions.

Fixes #7089. There were two problems:
  1. Strings between field delimiters should not be trimmed before casting to dates.
  2. Invalid date strings should not be causing exceptions. They should return null
     values, as is the convention in Hive's `LazySimpleSerDe`.

Signed-off-by: MithunR <[email protected]>

* Fixed verify errors.

* Fixed merge duplication.

* Review fixes:

1. Fixed indentation.
2. Hardcode for supported date format.
3. Added tests for timestamp strings read as dates.
4. Fixed behaviour for #3 above.

Signed-off-by: MithunR <[email protected]>
Co-authored-by: Robert (Bobby) Evans <[email protected]>