
[VL] date_format returns wrong results #5524

Open
clee704 opened this issue Apr 25, 2024 · 1 comment
Labels
bug Something isn't working triage

Comments

@clee704
Contributor

clee704 commented Apr 25, 2024

Backend

VL (Velox)

Bug description

Velox evaluates date_format(timestamp'12345-01-01 01:01:01', 'yyyy-MM') to '12345-01', whereas vanilla Spark evaluates the same expression to '+12345-01'. This is a problem because unix_timestamp in vanilla Spark only accepts the signed form '+12345-01'. So if date_format runs in Velox and its result is later passed to unix_timestamp in vanilla Spark, parsing fails.

// Somehow CREATE TABLE doesn't work with five-digit year timestamps
spark.sql("select timestamp'12345-01-01 01:01:01' c").write.mode("overwrite").save("x")
spark.read.load("x").createOrReplaceTempView("t")

// date_format is run in Velox
spark.sql("select date_format(c, 'yyyy-MM') from t").explain()
// == Physical Plan ==
// VeloxColumnarToRowExec
// +- ^(14) ProjectExecTransformer [date_format(c#83, yyyy-MM, Some(Etc/UTC)) AS date_format(c, yyyy-MM)#85]
//    +- ^(14) NativeFileScan parquet [c#83] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/ssd/chungmin/repos/spark34/x], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c:timestamp>

// Use collect() instead of show(): in Spark 3.5, show() inserts a ToPrettyString
// expression, which makes the function run in vanilla Spark instead of Velox.
spark.sql("select date_format(c, 'yyyy-MM') from t").collect()
// Array([12345-01])

spark.sql("create table t2 as select date_format(c, 'yyyy-MM') c from t")
spark.sql("set spark.gluten.enabled = false")
spark.sql("select unix_timestamp(c, 'yyyy-MM') from t2").collect()
// 24/04/25 02:01:01 ERROR TaskResources: Task 8 failed by error:
// org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
// Fail to parse '12345-01' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
// ...

Spark uses java.time.format.DateTimeFormatter for date_format.

import java.time.{LocalDate, ZoneId}
import java.time.format.DateTimeFormatter

DateTimeFormatter.ofPattern("yyyy").withZone(ZoneId.of("Z")).format(LocalDate.of(12345, 1, 1))
// "+12345"

OpenJDK 1.8.0_402, 11.0.22, and 21.0.2 all behave the same. The behavior is not documented on the class itself, but the Javadoc for some of its predefined formatter constants notes that years outside the range 0000-9999 will have a prefixed positive or negative symbol.
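The sign-prefix behavior can be checked directly against the JDK, outside of Spark. A minimal sketch (the class name `YearSignDemo` is ours; the pattern matches the one used in the repro):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class YearSignDemo {
    public static void main(String[] args) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM");

        // A four-digit year fits the pad width and is printed unsigned.
        System.out.println(fmt.format(LocalDate.of(2024, 1, 1)));  // 2024-01

        // A five-digit year exceeds the pad width, so DateTimeFormatter
        // prepends an explicit '+' (SignStyle.EXCEEDS_PAD semantics).
        System.out.println(fmt.format(LocalDate.of(12345, 1, 1))); // +12345-01
    }
}
```

Velox's formatter omits the '+' in the second case, which is the discrepancy reported here.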

Five-digit years should be extremely rare in real-world applications, but the discrepancy breaks Delta unit tests.

The issue reproduces with Spark 3.4.2 and 3.5.1; older versions were not checked.

Spark version

None

Spark configurations

spark.plugins=org.apache.gluten.GlutenPlugin
spark.gluten.enabled=true
spark.gluten.sql.columnar.backend.lib=velox
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=28g

System information

Velox System Info v0.0.2
Commit: 45dc46a
CMake Version: 3.28.3
System: Linux-6.5.0-1018-azure
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 11.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 11.4.0
CMake Prefix Path: /usr/local;/usr;/;/ssd/linuxbrew/.linuxbrew/Cellar/cmake/3.28.3;/usr/local;/usr/X11R6;/usr/pkg;/opt

Relevant logs

No response

@PHILO-HE
Contributor

Will investigate this issue. Thanks!
