
[SPARK-19459][SQL] Add Hive datatype (char/varchar) to StructField metadata #16804

Closed · wants to merge 10 commits

Conversation

hvanhovell (Contributor)

What changes were proposed in this pull request?

Reading from an existing ORC table that contains char or varchar columns can fail with a ClassCastException if the table metadata was created by Spark. This happens because Spark internally replaces char and varchar columns with a string column.

This PR fixes the issue by adding the Hive type to the StructField's metadata under the HIVE_TYPE_STRING key. This is picked up by the HiveClient and the ORC reader; see #16060 for more details on how the metadata is used.
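
As a rough sketch (illustrative, not code from this PR; the field name and type string are made up), the mechanism looks like this:

import org.apache.spark.sql.types._

// Attach the raw Hive type to a StructField's metadata under the
// HIVE_TYPE_STRING key, the way this PR does for char/varchar columns.
val metadata = new MetadataBuilder()
  .putString(HIVE_TYPE_STRING, "varchar(10)")
  .build()
val field = StructField("name", StringType, nullable = true, metadata)

// A reader (e.g. the HiveClient or the ORC reader) can then recover the
// original Hive type instead of seeing only a plain string column:
val hiveType = field.metadata.getString(HIVE_TYPE_STRING)  // "varchar(10)"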

How was this patch tested?

Added a regression test to OrcSourceSuite.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72371 has finished for PR 16804 at commit c6a5bf6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


test("read varchar column from orc tables created by hive") {
try {
// This is an ORC file with a single VARCHAR(10) column that's created using Hive 1.2.1
Member

Hi, @hvanhovell.
Nit: it's three columns.

Structure for orc/orc_text_types.orc
File Version: 0.12 with HIVE_8732
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:char(10),_col2:varchar(10)>

@SparkQA

SparkQA commented Feb 5, 2017

Test build #72412 has finished for PR 16804 at commit 277ed15.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dataType match {
  case p: PrimitiveDataTypeContext =>
    val dt = p.identifier.getText.toLowerCase
    (dt, p.INTEGER_VALUE().asScala.toList) match {
Contributor

nit:

p.identifier.getText.toLowerCase match {
  case "varchar" | "char" => builder.putString(HIVE_TYPE_STRING, dataType.getText.toLowerCase)
  case _ => // other datatypes need no extra metadata (avoids a MatchError)
}

/**
 * Metadata key used to store the Hive type name. This is relevant for datatypes that do not
 * have a direct Spark SQL counterpart, such as CHAR and VARCHAR.
 */
val HIVE_TYPE_STRING = "HIVE_TYPE_STRING"
Contributor

shall we remove HiveUtils.HIVE_TYPE_STRING?

Contributor Author

Yeah we should.

@@ -162,6 +162,28 @@ abstract class OrcSuite extends QueryTest with TestHiveSingleton with BeforeAndA
hiveClient.runSqlHive("DROP TABLE IF EXISTS orc_varchar")
}
}

test("read varchar column from orc tables created by hive") {
try {
@cloud-fan (Contributor) commented Feb 6, 2017

how about

val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
try {
  hiveClient.runSqlHive("CREATE TABLE hive_orc(a VARCHAR(10)) STORED AS orc LOCATION xxx")
  hiveClient.runSqlHive("INSERT INTO TABLE hive_orc SELECT 'a' FROM (SELECT 1) t")
  sql("CREATE EXTERNAL TABLE spark_orc ...")
  checkAnswer...
} finally {
  sql("DROP TABLE IF EXISTS ...")
  ...
}

Then we don't need to create the ORC file manually.
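
For reference, a self-contained version of this suggestion might look like the sketch below; the table names, single-column schema, and expected row are illustrative, not taken from this PR:

test("read varchar column from orc tables created by hive") {
  val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
  val location = Utils.createTempDir().toURI
  try {
    // Let Hive create and populate the ORC data.
    hiveClient.runSqlHive(
      s"CREATE TABLE hive_orc(a VARCHAR(10)) STORED AS orc LOCATION '$location'")
    hiveClient.runSqlHive("INSERT INTO TABLE hive_orc SELECT 'a' FROM (SELECT 1) t")
    // Point a Spark-created external table at the same data and read it back.
    sql(s"CREATE EXTERNAL TABLE spark_orc(a VARCHAR(10)) STORED AS ORC LOCATION '$location'")
    checkAnswer(spark.table("spark_orc"), Row("a"))
  } finally {
    hiveClient.runSqlHive("DROP TABLE IF EXISTS hive_orc")
    sql("DROP TABLE IF EXISTS spark_orc")
  }
}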

@SparkQA

SparkQA commented Feb 7, 2017

Test build #72518 has finished for PR 16804 at commit 64c37e0.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@@ -32,7 +32,7 @@ import org.apache.spark.sql.catalyst.catalog._
import org.apache.spark.sql.catalyst.expressions.{AttributeMap, AttributeReference, Expression}
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics}
import org.apache.spark.sql.execution.FileRelation
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types._
Member

Unnecessary change?

Contributor Author

That just makes it easier to use HIVE_TYPE_STRING.

@@ -51,6 +51,9 @@ private[hive] case class HiveSimpleUDF(
@transient
lazy val function = funcWrapper.createFunction[UDF]()

{
function
}
Member

What is the reason for this?

Contributor Author

That is my bad.

package object types
package object types {
/**
* Metadata key used to store the the raw hive type string in the metadata of StructField. This
Member

Nit: the the -> the

Contributor Author

will do

@SparkQA

SparkQA commented Feb 7, 2017

Test build #72522 has finished for PR 16804 at commit f42348a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

s"ALTER TABLE hive_orc SET LOCATION '$location'")
hiveClient.runSqlHive(
"INSERT INTO TABLE hive_orc SELECT 'a', 'b', 'c' FROM (SELECT 1) t")

Member

How about adding one more check?

checkAnswer(spark.table("hive_orc"), Row("a", "b         ", "c"))

(The CHAR(10) value is right-padded with spaces to width 10, per Hive CHAR semantics.)

Then we can remove the test case SPARK-18220: read Hive orc table with varchar column.

Contributor Author

yeah that makes sense

Contributor Author

Done.

@SparkQA

SparkQA commented Feb 8, 2017

Test build #72587 has finished for PR 16804 at commit e7ca0ea.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor Author

retest this please

@gatorsmile
Member

LGTM pending test

@SparkQA

SparkQA commented Feb 9, 2017

Test build #72604 has finished for PR 16804 at commit e7ca0ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
val location = Utils.createTempDir().toURI
Contributor

shall we remove this temp dir in the finally block?
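
One way to do that, as a sketch (`location` is the URI created above; `Utils.deleteRecursively` is Spark's existing helper):

} finally {
  hiveClient.runSqlHive("DROP TABLE IF EXISTS hive_orc")
  sql("DROP TABLE IF EXISTS spark_orc")
  // Also remove the temp dir backing the tables.
  Utils.deleteRecursively(new java.io.File(location))
}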

@SparkQA

SparkQA commented Feb 9, 2017

Test build #72648 has finished for PR 16804 at commit 21be4ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, merging to master!

@asfgit asfgit closed this in de8a03e Feb 10, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
[SPARK-19459][SQL] Add Hive datatype (char/varchar) to StructField metadata

## What changes were proposed in this pull request?
Reading from an existing ORC table that contains `char` or `varchar` columns can fail with a `ClassCastException` if the table metadata was created by Spark. This happens because Spark internally replaces `char` and `varchar` columns with a `string` column.

This PR fixes the issue by adding the Hive type to the `StructField`'s metadata under the `HIVE_TYPE_STRING` key. This is picked up by the `HiveClient` and the ORC reader; see apache#16060 for more details on how the metadata is used.

## How was this patch tested?
Added a regression test to `OrcSourceSuite`.

Author: Herman van Hovell <[email protected]>

Closes apache#16804 from hvanhovell/SPARK-19459.
ghost pushed a commit to dbtsai/spark that referenced this pull request Feb 23, 2017
## What changes were proposed in this pull request?
This PR is a small follow-up on apache#16804. This PR also adds support for nested char/varchar fields in orc.

## How was this patch tested?
I have added a regression test to the OrcSourceSuite.

Author: Herman van Hovell <[email protected]>

Closes apache#17030 from hvanhovell/SPARK-19459-follow-up.
hvanhovell added two commits to hvanhovell/spark that referenced this pull request Feb 23, 2017, with the same commit message as above (noting merge conflicts in AstBuilder.scala and OrcSourceSuite.scala).
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017, with the same commit message.
@weiatwork
Contributor

This doesn't solve the problem when reading a CHAR/VARCHAR column in Hive from a table created using Spark, does it? Hive will fail when trying to convert the String to its CHAR/VARCHAR type.
