
[SPARK-20515][SQL] Fix reading of HIVE ORC table with varchar/char columns in Spark SQL should not fail #17791

Closed

Conversation

umehrot2

What changes were proposed in this pull request?

Reading from a Hive ORC table containing char/varchar columns fails in Spark SQL. The failure occurs because Spark SQL internally replaces char/varchar columns with the String data type, so when reading a table that was created in Hive with varchar/char columns, it ends up using the wrong reader and throws a ClassCastException.
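The mismatch can be mimicked in plain Python (a hypothetical analogy, not Spark's internals): if the reader is chosen from the catalog's rewritten type rather than the type the file actually contains, it receives a value class it cannot handle.

```python
# Hypothetical analogy in plain Python (NOT Spark's actual code): the reader
# is selected based on the catalog's rewritten type ("string"), so it chokes
# on the value class the ORC file actually produces.

class StringReader:
    """Reader picked because the catalog reports the column as 'string'."""
    def read(self, raw):
        # Expects a plain str; a Hive varchar value arrives wrapped.
        if not isinstance(raw, str):
            raise TypeError(f"cannot cast {type(raw).__name__} to str")
        return raw

class HiveVarchar:
    """Stand-in for the wrapped varchar value Hive's ORC reader yields."""
    def __init__(self, value, length):
        self.value, self.length = value, length

reader = StringReader()  # chosen from the (wrong) catalog type
try:
    reader.read(HiveVarchar("abc", 10))
except TypeError as e:
    print("read failed:", e)  # analogous to the ClassCastException
```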

This patch makes Spark SQL interpret varchar/char columns correctly, storing them as varchar/char types instead of internally converting them to string columns.
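As a sketch of the idea (plain Python with hypothetical names, not the patch's actual Scala code), the fix amounts to preserving the declared varchar/char type and its length when translating a Hive schema, instead of collapsing it to a bare string type:

```python
# Illustrative sketch (NOT the patch's actual code): parse a Hive column type
# string while preserving varchar/char and the declared length, rather than
# rewriting both to a bare "string" type, the information loss that led Spark
# to pick the wrong ORC reader.
import re

def parse_hive_type(type_string):
    """Return (base_type, length) for a Hive column type string."""
    m = re.fullmatch(r"(varchar|char)\((\d+)\)", type_string.strip().lower())
    if m:
        # Keep the declared type and its length instead of collapsing to "string".
        return (m.group(1), int(m.group(2)))
    return (type_string.strip().lower(), None)

print(parse_hive_type("VARCHAR(10)"))  # ('varchar', 10)
print(parse_hive_type("CHAR(10)"))     # ('char', 10)
print(parse_hive_type("string"))       # ('string', None)
```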

How was this patch tested?

- Added unit tests
- Manually tested on an AWS EMR cluster

Step 1:
Created a table in Hive (with varchar/char columns) and inserted some data:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS hive_orc_test (
  a VARCHAR(10),
  b CHAR(10),
  c BIGINT)
STORED AS ORC
LOCATION 's3://xxxx';

INSERT INTO TABLE hive_orc_test VALUES ('abc', 'A', 101), ('abc1', 'B', 102), ('abc3', 'C', 103);
```

Step 2:
Created an external table in Spark SQL over the same source location, and ran a SELECT query on it:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS spark_orc_test (
  a VARCHAR(10),
  b CHAR(10),
  c BIGINT)
STORED AS ORC
LOCATION 's3://xxxx';

SELECT * FROM spark_orc_test;
```

Result:

```
17/02/24 23:22:57 INFO DAGScheduler: Job 1 finished: processCmd at CliDriver.java:376, took 2.673360 s
abc	A	101
abc1	B	102
abc3	C	103
Time taken: 4.327 seconds, Fetched 3 row(s)
```

@umehrot2 umehrot2 changed the title Fix reading of HIVE ORC table with varchar/char columns in Spark SQL should not fail [SPARK-20515][SQL] Fix reading of HIVE ORC table with varchar/char columns in Spark SQL should not fail Apr 27, 2017
@AmplabJenkins

Can one of the admins verify this patch?

@mridulm
Contributor

mridulm commented Apr 27, 2017

+CC @dongjoon-hyun - since you were looking at ORC.

@hvanhovell
Contributor

hvanhovell commented Apr 27, 2017

This is very similar to #16804; however, that approach, like this one, is slightly broken (it does not support nested char/varchar columns). Can you just backport #17030, which is an improved version?

@dongjoon-hyun
Member

Thank you for pinging me, @mridulm . :)

@gatorsmile
Member

BTW, please add [BACKPORT-2.0] in your PR title.

@HyukjinKwon
Member

ping @umehrot2
