
[SPARK-19459][SQL] Add Hive datatype (char/varchar) to StructField metadata #16804

Closed · wants to merge 10 commits

Conversation

hvanhovell (Contributor)

What changes were proposed in this pull request?

Reading from an existing ORC table that contains char or varchar columns can fail with a ClassCastException if the table metadata was created by Spark. This happens because Spark internally replaces char and varchar columns with a string column.

This PR fixes the issue by adding the Hive type to the StructField's metadata under the HIVE_TYPE_STRING key. This is picked up by the HiveClient and the ORC reader; see #16060 for more details on how the metadata is used.
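
As a rough sketch (illustrative, not code from this PR; the field name and type string are made up), the mechanism looks like this:

import org.apache.spark.sql.types._

// Attach the raw Hive type to a StructField's metadata under the
// HIVE_TYPE_STRING key, the way this PR does for char/varchar columns.
val metadata = new MetadataBuilder()
  .putString(HIVE_TYPE_STRING, "varchar(10)")
  .build()
val field = StructField("name", StringType, nullable = true, metadata)

// A reader (e.g. the HiveClient or the ORC reader) can then recover the
// original Hive type instead of seeing only a plain string column:
val hiveType = field.metadata.getString(HIVE_TYPE_STRING)  // "varchar(10)"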

How was this patch tested?

Added a regression test to OrcSourceSuite.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72371 has finished for PR 16804 at commit c6a5bf6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


test("read varchar column from orc tables created by hive") {
try {
// This is an ORC file with a single VARCHAR(10) column that's created using Hive 1.2.1
Member

Hi, @hvanhovell.
Nit: it's three columns.

Structure for orc/orc_text_types.orc
File Version: 0.12 with HIVE_8732
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:char(10),_col2:varchar(10)>

@SparkQA

SparkQA commented Feb 5, 2017

Test build #72412 has finished for PR 16804 at commit 277ed15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dataType match {
  case p: PrimitiveDataTypeContext =>
    val dt = p.identifier.getText.toLowerCase
    (dt, p.INTEGER_VALUE().asScala.toList) match {
Contributor

nit:

p.identifier.getText.toLowerCase match {
  case "varchar" | "char" => builder.putString(HIVE_TYPE_STRING, dataType.getText.toLowerCase)
  case _ => // other datatypes need no extra metadata (avoids a MatchError)
}

/**
 * Metadata key used to store the Hive type name. This is relevant for datatypes that do not
 * have a direct Spark SQL counterpart, such as CHAR and VARCHAR.
 */
val HIVE_TYPE_STRING = "HIVE_TYPE_STRING"
Contributor

shall we remove HiveUtils.HIVE_TYPE_STRING?

Contributor Author

Yeah we should.

@@ -162,6 +162,28 @@ abstract class OrcSuite extends QueryTest with TestHiveSingleton with BeforeAndA
hiveClient.runSqlHive("DROP TABLE IF EXISTS orc_varchar")
}
}

test("read varchar column from orc tables created by hive") {
try {
@cloud-fan (Contributor) commented Feb 6, 2017

how about

val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
try {
  hiveClient.runSqlHive("CREATE TABLE hive_orc(a VARCHAR(10)) STORED AS orc LOCATION xxx")
  hiveClient.runSqlHive("INSERT INTO TABLE hive_orc SELECT 'a' FROM (SELECT 1) t")
  sql("CREATE EXTERNAL TABLE spark_orc ...")
  checkAnswer...
} finally {
  sql("DROP TABLE IF EXISTS ...")
  ...
}

Then we don't need to create the ORC file manually.
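
For reference, a self-contained version of this suggestion might look like the sketch below; the table names, single-column schema, and expected row are illustrative, not taken from this PR:

test("read varchar column from orc tables created by hive") {
  val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
  val location = Utils.createTempDir().toURI
  try {
    // Let Hive create and populate the ORC data.
    hiveClient.runSqlHive(
      s"CREATE TABLE hive_orc(a VARCHAR(10)) STORED AS orc LOCATION '$location'")
    hiveClient.runSqlHive("INSERT INTO TABLE hive_orc SELECT 'a' FROM (SELECT 1) t")
    // Point a Spark-created external table at the same data and read it back.
    sql(s"CREATE EXTERNAL TABLE spark_orc(a VARCHAR(10)) STORED AS ORC LOCATION '$location'")
    checkAnswer(spark.table("spark_orc"), Row("a"))
  } finally {
    hiveClient.runSqlHive("DROP TABLE IF EXISTS hive_orc")
    sql("DROP TABLE IF EXISTS spark_orc")
  }
}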

@SparkQA

SparkQA commented Feb 7, 2017

Test build #72518 has finished for PR 16804 at commit 64c37e0.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@@ -32,7 +32,7 @@ import org.apache.spark.sql.catalyst.catalog._
import org.apache.spark.sql.catalyst.expressions.{AttributeMap, AttributeReference, Expression}
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics}
import org.apache.spark.sql.execution.FileRelation
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types._
Member

Unnecessary change?

Contributor Author

That just makes it easier to use HIVE_TYPE_STRING.

@@ -51,6 +51,9 @@ private[hive] case class HiveSimpleUDF(
@transient
lazy val function = funcWrapper.createFunction[UDF]()

{
function
}
Member

What is the reason for this?

Contributor Author

That is my bad.

package object types
package object types {
/**
* Metadata key used to store the the raw hive type string in the metadata of StructField. This
Member

Nit: the the -> the

Contributor Author

will do

@SparkQA

SparkQA commented Feb 7, 2017

Test build #72522 has finished for PR 16804 at commit f42348a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

s"ALTER TABLE hive_orc SET LOCATION '$location'")
hiveClient.runSqlHive(
"INSERT INTO TABLE hive_orc SELECT 'a', 'b', 'c' FROM (SELECT 1) t")

Member

How about adding one more check?

checkAnswer(spark.table("hive_orc"), Row("a", "b         ", "c"))

(The CHAR(10) value is right-padded with spaces to width 10, per Hive CHAR semantics.)

Then we can remove the test case SPARK-18220: read Hive orc table with varchar column.

Contributor Author

yeah that makes sense

Contributor Author

Done.

@SparkQA

SparkQA commented Feb 8, 2017

Test build #72587 has finished for PR 16804 at commit e7ca0ea.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor Author

retest this please

@gatorsmile
Member

LGTM pending test

@SparkQA

SparkQA commented Feb 9, 2017

Test build #72604 has finished for PR 16804 at commit e7ca0ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
val location = Utils.createTempDir().toURI
Contributor

shall we remove this temp dir in the finally block?
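
One way to do that, as a sketch (`location` is the URI created above; `Utils.deleteRecursively` is Spark's existing helper):

} finally {
  hiveClient.runSqlHive("DROP TABLE IF EXISTS hive_orc")
  sql("DROP TABLE IF EXISTS spark_orc")
  // Also remove the temp dir backing the tables.
  Utils.deleteRecursively(new java.io.File(location))
}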

@SparkQA

SparkQA commented Feb 9, 2017

Test build #72648 has finished for PR 16804 at commit 21be4ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, merging to master!

@asfgit asfgit closed this in de8a03e Feb 10, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
[SPARK-19459][SQL] Add Hive datatype (char/varchar) to StructField metadata

## What changes were proposed in this pull request?
Reading from an existing ORC table that contains `char` or `varchar` columns can fail with a `ClassCastException` if the table metadata was created by Spark. This happens because Spark internally replaces `char` and `varchar` columns with a `string` column.

This PR fixes the issue by adding the Hive type to the `StructField`'s metadata under the `HIVE_TYPE_STRING` key. This is picked up by the `HiveClient` and the ORC reader; see apache#16060 for more details on how the metadata is used.

## How was this patch tested?
Added a regression test to `OrcSourceSuite`.

Author: Herman van Hovell <[email protected]>

Closes apache#16804 from hvanhovell/SPARK-19459.
ghost pushed a commit to dbtsai/spark that referenced this pull request Feb 23, 2017
## What changes were proposed in this pull request?
This PR is a small follow-up on apache#16804. This PR also adds support for nested char/varchar fields in orc.

## How was this patch tested?
I have added a regression test to the OrcSourceSuite.

Author: Herman van Hovell <[email protected]>

Closes apache#17030 from hvanhovell/SPARK-19459-follow-up.
hvanhovell added two commits to hvanhovell/spark that referenced this pull request Feb 23, 2017, with the same commit message as above (noting merge conflicts in AstBuilder.scala and OrcSourceSuite.scala).
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017, with the same commit message.
@weiatwork
Contributor

This doesn't solve the problem when reading a CHAR/VARCHAR column in Hive from a table created using Spark, does it? Hive will fail when trying to convert the String to its CHAR/VARCHAR type.
