
[VL] Results mismatch when scan low version orc file #6673

Closed
NEUpanning opened this issue Aug 1, 2024 · 10 comments
Labels
bug Something isn't working triage

Comments

@NEUpanning (Contributor)

Backend

VL (Velox)

Bug description

SparkSQL:

SELECT if(user_type <> -1, user_id, null) as a
from table
where partition_date='2024-07-01' order by a desc limit 10;

Gluten Result:

gluten
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL

Vanilla Result

vanilla
dp_765265243
dp_71942892
dp_71942892
dp_71942892
dp_71942892
dp_71942892
dp_3779112707
dp_3778736486
dp_3778655687
dp_3778588244

Physical Plan:

== Physical Plan ==
VeloxColumnarToRowExec
+- TakeOrderedAndProjectExecTransformer (limit=10, orderBy=[a#0 DESC NULLS LAST], output=[a#0])
   +- ^(1) ProjectExecTransformer [if (NOT (user_type#6L = -1)) user_id#1 else null AS a#0]
      +- ^(1) NativeFileScan orc table[user_id#1,user_type#6L,partition_date#18] Batched: true, DataFilters: [], Format: ORC, Location: InMemoryFileIndex(1 paths)[viewfs://******, PartitionFilters: [isnotnull(partition_date#18), (partition_date#18 = 2024-07-01)], PushedFilters: [], ReadSchema: struct<user_id:string,user_type:bigint>

Unfortunately, I can't reproduce it with a new Hive table. I created a new table containing rows from the original table and submitted the same SQL to Spark; even the physical plan was the same as before, but Gluten's result matched vanilla Spark's.

Spark version

None

Spark configurations

No response

System information

v1.2.0 rc1

Relevant logs

No response

@NEUpanning NEUpanning added bug Something isn't working triage labels Aug 1, 2024
@NEUpanning (Contributor, Author)

cc @kecookier

@Yohahaha (Contributor) commented Aug 1, 2024

Unfortunately, I can't reproduce it with a new Hive table. I created a new table containing rows from the original table and submitted the same SQL to Spark; even the physical plan was the same as before, but Gluten's result matched vanilla Spark's.

Could you log the RowVector in TableScan#getOutput to check whether this issue is caused by the ORC scan?

@Z1Wu (Contributor) commented Aug 1, 2024

Unfortunately, I can't reproduce it with a new Hive table.

Could you check the ORC file schema of the old Hive table? It may be a problem with the ORC file itself (hive --orcfiledump <orc_file_path>).
Related issue about the ORC scan: #5638.

@NEUpanning (Contributor, Author)

@Yohahaha This table contains 10365356 rows. It's tricky to log RowVector.

@NEUpanning (Contributor, Author)

@Z1Wu It looks like the table schemas are the same (checked via DESCRIBE FORMATTED <table_name>).
old table:

col_name	data_type	comment
user_id	string	
user_type	bigint	
partition_date	string	

# Detailed Table Information
Table	old_table
Type	MANAGED
Provider	hive
Serde Library	org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat	org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Storage Properties	[serialization.format=1]
Partition Provider	Catalog

new table:

col_name	data_type	comment
user_type	bigint
user_id	string

# Detailed Table Information
Table	panning_if_expression
Owner	hive
Type	MANAGED
Provider	hive
Serde Library	org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat	org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Storage Properties	[serialization.format=1]
Partition Provider	Catalog

@Yohahaha (Contributor) commented Aug 1, 2024

@Yohahaha This table contains 10365356 rows. It's tricky to log RowVector.

You can set spark.gluten.sql.debug=true, then find the JSON plan and JSON scan split of one task and use GenericBenchmark.cc to run that single Velox task.

@Z1Wu (Contributor) commented Aug 1, 2024

@Z1Wu It looks like the table schemas are the same (DESCRIBE FORMATTED <table_name>). Old table:

A Hive ORC table has a table schema, and its ORC data files should contain a schema as well, but ORC data files written by some old engines (like Hive 1.x) contain an incomplete schema that lacks column names.

For a Hive ORC table created by:

CREATE TABLE `test_orc_table_hive_gluten`(
  `id` int,
  `name` string)
PARTITIONED BY (
  `dt` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

You can inspect the ORC data file schema with this command:

# hive --orcfiledump <your orc data file>
hive --orcfiledump hdfs:///data/warehouse2/test_orc_table_hive_gluten/dt=20240728/000000_0

The malformed ORC schema output looks like the dump below. An ORC file with a schema like Type: struct<_col0:int,_col1:string> can't be read correctly by Gluten; the result will always be NULL. The expected ORC file schema is Type: struct<id:int,name:string>.

File Version: 0.12 with HIVE_8732
24/08/01 15:28:01 INFO orc.ReaderImpl: Reading ORC rows from hdfs://data/warehouse2/test_orc_table_hive_gluten/dt=20240728/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 2
Compression: SNAPPY
Compression size: 262144
Type: struct<_col0:int,_col1:string>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 2 hasNull: false
    Column 1: count: 2 hasNull: false min: 1 max: 2 sum: 3
    Column 2: count: 2 hasNull: false min: a max: b sum: 2
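The NULL results follow from name-based column resolution: the reader looks up each requested column by name in the file schema, and a name that is absent resolves to NULL. A minimal sketch of that behavior (illustrative only, not Velox's actual implementation; the function and schema lists are hypothetical):

```python
# Sketch of name-based column resolution: requested columns that are
# missing from the file schema come back as NULL (None in Python).
def read_columns(file_schema, read_schema, rows):
    """Resolve each requested column by name; unmatched names yield None."""
    out = []
    for row in rows:
        out.append({
            col: row[file_schema.index(col)] if col in file_schema else None
            for col in read_schema
        })
    return out

# File written by an old Hive engine: real column names were lost.
file_schema = ["_col0", "_col1"]        # struct<_col0:bigint,_col1:string>
read_schema = ["user_type", "user_id"]  # names the table schema expects
rows = [(1, "dp_765265243"), (-1, "dp_71942892")]

# No requested name matches the file schema, so every value is None,
# which is why the query above returns all NULLs under Gluten.
print(read_columns(file_schema, read_schema, rows))
```

With a correctly written file (file_schema = ["user_type", "user_id"]) the same lookup succeeds and the real values come through, matching the behavior seen with the new table.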

@NEUpanning (Contributor, Author)

@Z1Wu Thanks for the clarification. It looks like the old table's files indeed lack column names.
The old table's ORC file schema is

Type: struct<_col0:bigint,_col1:string>

The new table's ORC file schema is

Type: struct<user_type:bigint,user_id:string>

@Z1Wu (Contributor) commented Aug 2, 2024

SELECT if(user_type <> -1, user_id, null) as a
from table
where partition_date='2024-07-01' order by a desc limit 10;


If the old table's ORC files lack column names, it may be the same problem as #5638.

You can set spark.gluten.sql.columnar.backend.velox.orc.scan.enabled=true to fall back to the vanilla ORC scan operator as a workaround, but it may cause performance degradation.
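For reference, the workaround can also be placed in spark-defaults.conf; the key is quoted verbatim from this comment, so verify it against your Gluten version before relying on it:

```
# Workaround from this thread: fall back to the vanilla ORC scan operator.
# Key quoted from the comment above; check your Gluten version's docs.
spark.gluten.sql.columnar.backend.velox.orc.scan.enabled  true
```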

@NEUpanning (Contributor, Author)

You can set spark.gluten.sql.columnar.backend.velox.orc.scan.enabled=true to fall back to the vanilla ORC scan operator as a workaround

@Z1Wu @Yohahaha Thanks for your help. I will try falling back the ORC scan to determine whether this issue is caused by the native ORC scan.

@Yohahaha Yohahaha changed the title [VL] Results are mismatch with vanilla Spark when using if expression [VL] Results mismatch when scan low version orc file Aug 2, 2024