
[VL] date_format returns wrong results #5524

Open
clee704 opened this issue Apr 25, 2024 · 1 comment
Labels
bug Something isn't working triage

Comments

@clee704
Contributor

clee704 commented Apr 25, 2024

Backend

VL (Velox)

Bug description

Velox evaluates date_format(timestamp'12345-01-01 01:01:01', 'yyyy-MM') to '12345-01', whereas vanilla Spark evaluates the same expression to '+12345-01'. This is a problem because unix_timestamp in vanilla Spark only accepts the signed form '+12345-01'. So if date_format runs in Velox and its result is later passed to unix_timestamp in vanilla Spark, parsing fails.

// Somehow CREATE TABLE doesn't work with five-digit year timestamps
spark.sql("select timestamp'12345-01-01 01:01:01' c").write.mode("overwrite").save("x")
spark.read.load("x").createOrReplaceTempView("t")

// date_format is run in Velox
spark.sql("select date_format(c, 'yyyy-MM') from t").explain()
// == Physical Plan ==
// VeloxColumnarToRowExec
// +- ^(14) ProjectExecTransformer [date_format(c#83, yyyy-MM, Some(Etc/UTC)) AS date_format(c, yyyy-MM)#85]
//    +- ^(14) NativeFileScan parquet [c#83] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/ssd/chungmin/repos/spark34/x], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c:timestamp>

// Use collect() instead of show(): in Spark 3.5, show() inserts a ToPrettyString
// expression, which makes the function run in vanilla Spark instead of Velox.
spark.sql("select date_format(c, 'yyyy-MM') from t").collect()
// Array([12345-01])

spark.sql("create table t2 as select date_format(c, 'yyyy-MM') c from t")
spark.sql("set spark.gluten.enabled = false")
spark.sql("select unix_timestamp(c, 'yyyy-MM') from t2").collect()
// 24/04/25 02:01:01 ERROR TaskResources: Task 8 failed by error:
// org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
// Fail to parse '12345-01' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
// ...

Spark uses java.time.format.DateTimeFormatter for date_format.

import java.time.{LocalDate, ZoneId}
import java.time.format.DateTimeFormatter

DateTimeFormatter.ofPattern("yyyy").withZone(ZoneId.of("Z")).format(LocalDate.of(12345, 1, 1))
// "+12345"

OpenJDK 1.8.0_402, 11.0.22, and 21.0.2 all behave the same. The behavior is not documented on the class itself, but the Javadoc for some of its predefined formatter constants notes that years outside the range 0000-9999 will have a prefixed positive or negative symbol.
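The sign-prefix behavior can be checked directly against the JDK, outside of Spark. A minimal sketch (the class name `YearSignDemo` is ours; the pattern matches the one used in the repro):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class YearSignDemo {
    public static void main(String[] args) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM");

        // A four-digit year fits the pad width and is printed unsigned.
        System.out.println(fmt.format(LocalDate.of(2024, 1, 1)));  // 2024-01

        // A five-digit year exceeds the pad width, so DateTimeFormatter
        // prepends an explicit '+' (SignStyle.EXCEEDS_PAD semantics).
        System.out.println(fmt.format(LocalDate.of(12345, 1, 1))); // +12345-01
    }
}
```

Velox's formatter omits the '+' in the second case, which is the discrepancy reported here.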

Five-digit years should be extremely rare in real-world applications, but the discrepancy breaks Delta unit tests.

The issue reproduces with Spark 3.4.2 and 3.5.1; older versions were not checked.

Spark version

None

Spark configurations

spark.plugins=org.apache.gluten.GlutenPlugin
spark.gluten.enabled=true
spark.gluten.sql.columnar.backend.lib=velox
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=28g

System information

Velox System Info v0.0.2
Commit: 45dc46a
CMake Version: 3.28.3
System: Linux-6.5.0-1018-azure
Arch: x86_64
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 11.4.0
C Compiler: /usr/bin/cc
C Compiler Version: 11.4.0
CMake Prefix Path: /usr/local;/usr;/;/ssd/linuxbrew/.linuxbrew/Cellar/cmake/3.28.3;/usr/local;/usr/X11R6;/usr/pkg;/opt

Relevant logs

No response

@PHILO-HE
Contributor

Will investigate this issue. Thanks!
