[SPARK-20515][SQL] Reading a Hive ORC table with varchar/char columns in Spark SQL should not fail #17791
What changes were proposed in this pull request?
Reading from a Hive ORC table containing char/varchar columns fails in Spark SQL. This happens because Spark SQL internally replaces char/varchar columns with the String data type, so when reading a table created in Hive with varchar/char columns, it ends up using the wrong reader and throws a ClassCastException.
This patch allows Spark SQL to interpret varchar/char columns correctly and store them as varchar/char types instead of internally converting them to string columns.
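As a sanity check, here is a minimal sketch in Scala (spark-shell style, Hive support enabled) of how the change surfaces; the table name comes from Step 2 of the repro below, and everything else is illustrative rather than part of the patch itself:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()

// Before the patch, char/varchar columns surface as plain string here;
// with the patch they should keep their declared varchar(10)/char(10) types.
spark.sql("DESCRIBE spark_orc_test").show(truncate = false)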
How was this patch tested?
-> Added unit tests
-> Manually tested on an AWS EMR cluster
Step 1:
Created a table in Hive with varchar/char columns, and inserted some data:
CREATE EXTERNAL TABLE IF NOT EXISTS hive_orc_test (
a VARCHAR(10),
b CHAR(10),
c BIGINT)
STORED AS ORC
LOCATION 's3://xxxx';
INSERT INTO TABLE hive_orc_test VALUES ('abc', 'A', 101), ('abc1', 'B', 102), ('abc3', 'C', 103);
Step 2:
Created an external table in Spark SQL over the same source location, and ran a SELECT query on it.
CREATE EXTERNAL TABLE IF NOT EXISTS spark_orc_test (
a VARCHAR(10),
b CHAR(10),
c BIGINT)
STORED AS ORC
LOCATION 's3://xxxx';
SELECT * FROM spark_orc_test;
Result:
17/02/24 23:22:57 INFO DAGScheduler: Job 1 finished: processCmd at CliDriver.java:376, took 2.673360 s
abc A 101
abc1 B 102
abc3 C 103
Time taken: 4.327 seconds, Fetched 3 row(s)
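For completeness, the same read can be driven from the Scala API; the following is a hedged, illustrative sketch of a unit-test style check over this repro (not the PR's actual test code), reusing the Hive-enabled session from the earlier snippet:

// Before the fix, this scan is where the ClassCastException surfaced;
// after the fix it should return the three inserted rows.
val rows = spark.sql("SELECT a, b, c FROM spark_orc_test ORDER BY c").collect()
assert(rows.length == 3)
// c is declared BIGINT, so it comes back as a Scala Long.
assert(rows.head.getLong(2) == 101L)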