
[VL] Results mismatch when scan low version orc file #6673

Closed
NEUpanning opened this issue Aug 1, 2024 · 10 comments
Labels
bug Something isn't working triage

Comments

@NEUpanning (Contributor)

Backend

VL (Velox)

Bug description

SparkSQL:

SELECT if(user_type <> -1, user_id, null) as a
from table
where partition_date='2024-07-01' order by a desc limit 10;

Gluten Result:

gluten
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL

Vanilla Result

vanilla
dp_765265243
dp_71942892
dp_71942892
dp_71942892
dp_71942892
dp_71942892
dp_3779112707
dp_3778736486
dp_3778655687
dp_3778588244

Physical Plan:

== Physical Plan ==
VeloxColumnarToRowExec
+- TakeOrderedAndProjectExecTransformer (limit=10, orderBy=[a#0 DESC NULLS LAST], output=[a#0])
   +- ^(1) ProjectExecTransformer [if (NOT (user_type#6L = -1)) user_id#1 else null AS a#0]
      +- ^(1) NativeFileScan orc table[user_id#1,user_type#6L,partition_date#18] Batched: true, DataFilters: [], Format: ORC, Location: InMemoryFileIndex(1 paths)[viewfs://******, PartitionFilters: [isnotnull(partition_date#18), (partition_date#18 = 2024-07-01)], PushedFilters: [], ReadSchema: struct<user_id:string,user_type:bigint>

Unfortunately, I can't reproduce it with a new Hive table. I created a new table containing rows from the original table and submitted the same SQL to Spark; even the physical plan was the same as before, but Gluten's result matched vanilla Spark's.

Spark version

None

Spark configurations

No response

System information

v1.2.0 rc1

Relevant logs

No response

@NEUpanning NEUpanning added bug Something isn't working triage labels Aug 1, 2024
@NEUpanning (Contributor, Author)

cc @kecookier

@Yohahaha (Contributor) commented Aug 1, 2024

Unfortunately, I can't reproduce it with a new Hive table. I created a new table containing rows from the original table and submitted the same SQL to Spark; even the physical plan was the same as before, but Gluten's result matched vanilla Spark's.

Could you log the RowVector in TableScan#getOutput to check whether this issue is caused by the ORC scan?

@Z1Wu (Contributor) commented Aug 1, 2024

Unfortunately, I can't reproduce it with a new Hive table.

Could you check the ORC file schema of the old Hive table? It may be a problem with the ORC file itself (hive --orcfiledump <orc_file_path>).
Related issue about the ORC scan: #5638.

@NEUpanning (Contributor, Author)

@Yohahaha This table contains 10365356 rows. It's tricky to log RowVector.

@NEUpanning (Contributor, Author)

@Z1Wu It looks like the table schemas are the same (checked via DESCRIBE FORMATTED <table_name>).
old table:

col_name	data_type	comment
user_id	string	
user_type	bigint	
partition_date	string	

# Detailed Table Information
Table	old_table
Type	MANAGED
Provider	hive
Serde Library	org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat	org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Storage Properties	[serialization.format=1]
Partition Provider	Catalog

new table:

col_name	data_type	comment
user_type	bigint
user_id	string

# Detailed Table Information
Table	panning_if_expression
Owner	hive
Type	MANAGED
Provider	hive
Serde Library	org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat	org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Storage Properties	[serialization.format=1]
Partition Provider	Catalog

@Yohahaha (Contributor) commented Aug 1, 2024

@Yohahaha This table contains 10365356 rows. It's tricky to log RowVector.

You can set spark.gluten.sql.debug=true, then find the JSON plan and JSON scan split of one task and use GenericBenchmark.cc to run that single Velox task.

@Z1Wu (Contributor) commented Aug 1, 2024

@Z1Wu It looks like the table schemas are the same (DESCRIBE FORMATTED <table_name>). Old table:

A Hive ORC table has a table schema, and its ORC data files should contain a schema as well, but ORC data files written by some old engines (like Hive 1.x) contain an incomplete schema that lacks column names.

For a Hive ORC table created by:

CREATE TABLE `test_orc_table_hive_gluten`(
  `id` int,
  `name` string)
PARTITIONED BY (
  `dt` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

You can inspect the ORC data file schema with this command:

# hive --orcfiledump <your orc data file>
hive --orcfiledump hdfs:///data/warehouse2/test_orc_table_hive_gluten/dt=20240728/000000_0

The malformed ORC schema output looks like the dump below. An ORC file with a schema like Type: struct<_col0:int,_col1:string> can't be read correctly by Gluten; the result will always be NULL. The expected ORC file schema is Type: struct<id:int,name:string>.

File Version: 0.12 with HIVE_8732
24/08/01 15:28:01 INFO orc.ReaderImpl: Reading ORC rows from hdfs://data/warehouse2/test_orc_table_hive_gluten/dt=20240728/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 2
Compression: SNAPPY
Compression size: 262144
Type: struct<_col0:int,_col1:string>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 2 hasNull: false
    Column 1: count: 2 hasNull: false min: 1 max: 2 sum: 3
    Column 2: count: 2 hasNull: false min: a max: b sum: 2
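The NULL results follow from name-based column resolution: the reader looks up each requested column by name in the file schema, and a name that is absent resolves to NULL. A minimal sketch of that behavior (illustrative only, not Velox's actual implementation; the function and schema lists are hypothetical):

```python
# Sketch of name-based column resolution: requested columns that are
# missing from the file schema come back as NULL (None in Python).
def read_columns(file_schema, read_schema, rows):
    """Resolve each requested column by name; unmatched names yield None."""
    out = []
    for row in rows:
        out.append({
            col: row[file_schema.index(col)] if col in file_schema else None
            for col in read_schema
        })
    return out

# File written by an old Hive engine: real column names were lost.
file_schema = ["_col0", "_col1"]        # struct<_col0:bigint,_col1:string>
read_schema = ["user_type", "user_id"]  # names the table schema expects
rows = [(1, "dp_765265243"), (-1, "dp_71942892")]

# No requested name matches the file schema, so every value is None,
# which is why the query above returns all NULLs under Gluten.
print(read_columns(file_schema, read_schema, rows))
```

With a correctly written file (file_schema = ["user_type", "user_id"]) the same lookup succeeds and the real values come through, matching the behavior seen with the new table.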

@NEUpanning (Contributor, Author)

@Z1Wu Thanks for the clarification. It looks like the old table's files indeed lack column names.
The old table's ORC file schema is

Type: struct<_col0:bigint,_col1:string>

The new table's ORC file schema is

Type: struct<user_type:bigint,user_id:string>

@Z1Wu (Contributor) commented Aug 2, 2024

SELECT if(user_type <> -1, user_id, null) as a
from table
where partition_date='2024-07-01' order by a desc limit 10;


If the old table's ORC files lack column names, it may be the same problem as #5638.

You can set spark.gluten.sql.columnar.backend.velox.orc.scan.enabled=true to fall back to the vanilla ORC scan operator as a workaround, but it may cause performance degradation.
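For reference, the workaround can also be placed in spark-defaults.conf; the key is quoted verbatim from this comment, so verify it against your Gluten version before relying on it:

```
# Workaround from this thread: fall back to the vanilla ORC scan operator.
# Key quoted from the comment above; check your Gluten version's docs.
spark.gluten.sql.columnar.backend.velox.orc.scan.enabled  true
```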

@NEUpanning (Contributor, Author)

You can set spark.gluten.sql.columnar.backend.velox.orc.scan.enabled=true to fall back to the vanilla ORC scan operator as a workaround

@Z1Wu @Yohahaha Thanks for your help. I will try falling back the ORC scan to determine whether this issue is caused by the native ORC scan.

@Yohahaha Yohahaha changed the title [VL] Results are mismatch with vanilla Spark when using if expression [VL] Results mismatch when scan low version orc file Aug 2, 2024