[SPARK-20515][SQL] Reading a Hive ORC table with varchar/char columns in Spark SQL should not fail #17791
What changes were proposed in this pull request?
Reading from a Hive ORC table containing char/varchar columns fails in Spark SQL. This happens because Spark SQL internally replaces char/varchar columns with the String data type, so when reading a table created in Hive with varchar/char columns, it ends up using the wrong reader and throws a ClassCastException.
This patch allows Spark SQL to interpret varchar/char columns correctly and store them as varchar/char types instead of internally converting them to string columns.
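As a sanity check, here is a minimal sketch in Scala (spark-shell style, Hive support enabled) of how the change surfaces; the table name comes from Step 2 of the repro below, and everything else is illustrative rather than part of the patch itself:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()

// Before the patch, char/varchar columns surface as plain string here;
// with the patch they should keep their declared varchar(10)/char(10) types.
spark.sql("DESCRIBE spark_orc_test").show(truncate = false)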
How was this patch tested?
-> Added unit tests
-> Manually tested on an AWS EMR cluster
Step 1:
Created a table in Hive with varchar/char columns, and inserted some data:
CREATE EXTERNAL TABLE IF NOT EXISTS hive_orc_test (
a VARCHAR(10),
b CHAR(10),
c BIGINT)
STORED AS ORC
LOCATION 's3://xxxx';
INSERT INTO TABLE hive_orc_test VALUES ('abc', 'A', 101), ('abc1', 'B', 102), ('abc3', 'C', 103);
Step 2:
Created an external table in Spark SQL over the same source location, and ran a SELECT query on it.
CREATE EXTERNAL TABLE IF NOT EXISTS spark_orc_test (
a VARCHAR(10),
b CHAR(10),
c BIGINT)
STORED AS ORC
LOCATION 's3://xxxx';
SELECT * FROM spark_orc_test;
Result:
17/02/24 23:22:57 INFO DAGScheduler: Job 1 finished: processCmd at CliDriver.java:376, took 2.673360 s
abc A 101
abc1 B 102
abc3 C 103
Time taken: 4.327 seconds, Fetched 3 row(s)
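For completeness, the same read can be driven from the Scala API; the following is a hedged, illustrative sketch of a unit-test style check over this repro (not the PR's actual test code), reusing the Hive-enabled session from the earlier snippet:

// Before the fix, this scan is where the ClassCastException surfaced;
// after the fix it should return the three inserted rows.
val rows = spark.sql("SELECT a, b, c FROM spark_orc_test ORDER BY c").collect()
assert(rows.length == 3)
// c is declared BIGINT, so it comes back as a Scala Long.
assert(rows.head.getLong(2) == 101L)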