[SPARK-19459][SQL] Add Hive datatype (char/varchar) to StructField metadata #16804
Conversation
… issues with char/varchar columns in ORC.
Test build #72371 has finished for PR 16804 at commit
test("read varchar column from orc tables created by hive") { | ||
try { | ||
// This is an ORC file with a single VARCHAR(10) column that's created using Hive 1.2.1 |
Hi, @hvanhovell.
Nit: it's three columns.
Structure for orc/orc_text_types.orc
File Version: 0.12 with HIVE_8732
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:char(10),_col2:varchar(10)>
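(A structure dump like the one above can typically be produced with Hive's ORC file dump utility, e.g. `hive --orcfiledump /path/to/orc_text_types.orc`; the path here is illustrative.)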
Test build #72412 has finished for PR 16804 at commit
dataType match {
  case p: PrimitiveDataTypeContext =>
    val dt = p.identifier.getText.toLowerCase
    (dt, p.INTEGER_VALUE().asScala.toList) match {
nit:

p.identifier.getText.toLowerCase match {
  case "varchar" | "char" => builder.putString(HIVE_TYPE_STRING, dataType.getText.toLowerCase)
}
 * Metadata key used to store the Hive type name. This is relevant for datatypes that do not
 * have a direct Spark SQL counterpart, such as CHAR and VARCHAR.
 */
val HIVE_TYPE_STRING = "HIVE_TYPE_STRING"
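As a minimal sketch of what this key looks like in practice (illustrative only; the field name and Hive type below are assumptions, not code from this PR):

```scala
import org.apache.spark.sql.types._

// A Hive VARCHAR(10) column is stored as a plain string field whose
// metadata records the original Hive type under HIVE_TYPE_STRING.
val metadata = new MetadataBuilder()
  .putString(HIVE_TYPE_STRING, "varchar(10)")
  .build()
val field = StructField("name", StringType, nullable = true, metadata)
```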
Shall we remove HiveUtils.HIVE_TYPE_STRING?
Yeah we should.
@@ -162,6 +162,28 @@ abstract class OrcSuite extends QueryTest with TestHiveSingleton with BeforeAndA
      hiveClient.runSqlHive("DROP TABLE IF EXISTS orc_varchar")
    }
  }

  test("read varchar column from orc tables created by hive") {
    try {
how about

val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
try {
  hiveClient.runSqlHive("CREATE TABLE hive_orc(a VARCHAR(10)) STORED AS orc LOCATION xxx")
  hiveClient.runSqlHive("INSERT INTO TABLE hive_orc SELECT 'a' FROM (SELECT 1) t")
  sql("CREATE EXTERNAL TABLE spark_orc ...")
  checkAnswer...
} finally {
  sql("DROP TABLE IF EXISTS ...")
  ...
}

then we don't need to create the orc file manually.
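A fleshed-out version of that sketch might look like the following; the table names, expected row, and cleanup are illustrative assumptions rather than the code that was eventually merged:

```scala
val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
val location = Utils.createTempDir()
try {
  // Let Hive write the ORC data so the test needs no checked-in binary file.
  hiveClient.runSqlHive(
    s"CREATE EXTERNAL TABLE hive_orc(a VARCHAR(10)) STORED AS ORC LOCATION '${location.toURI}'")
  hiveClient.runSqlHive("INSERT INTO TABLE hive_orc SELECT 'a' FROM (SELECT 1) t")
  // Read the same files back through a Spark-created table definition.
  sql(s"CREATE EXTERNAL TABLE spark_orc(a VARCHAR(10)) STORED AS ORC LOCATION '${location.toURI}'")
  checkAnswer(spark.table("spark_orc"), Row("a"))
} finally {
  hiveClient.runSqlHive("DROP TABLE IF EXISTS hive_orc")
  sql("DROP TABLE IF EXISTS spark_orc")
  Utils.deleteRecursively(location)
}
```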
# Conflicts:
#	sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala
Test build #72518 has finished for PR 16804 at commit
@@ -32,7 +32,7 @@ import org.apache.spark.sql.catalyst.catalog._
 import org.apache.spark.sql.catalyst.expressions.{AttributeMap, AttributeReference, Expression}
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics}
 import org.apache.spark.sql.execution.FileRelation
-import org.apache.spark.sql.types.StructField
+import org.apache.spark.sql.types._
Unnecessary change?
That just makes it easier to use HIVE_TYPE_STRING.
@@ -51,6 +51,9 @@ private[hive] case class HiveSimpleUDF(
   @transient
   lazy val function = funcWrapper.createFunction[UDF]()

   {
     function
   }
What is the reason for this?
That is my bad.
-package object types
+package object types {
   /**
    * Metadata key used to store the the raw hive type string in the metadata of StructField. This
Nit: `the the` -> `the`.
will do
Test build #72522 has finished for PR 16804 at commit
s"ALTER TABLE hive_orc SET LOCATION '$location'") | ||
hiveClient.runSqlHive( | ||
"INSERT INTO TABLE hive_orc SELECT 'a', 'b', 'c' FROM (SELECT 1) t") | ||
|
How about adding one more check?

checkAnswer(spark.table("hive_orc"), Row("a", "b ", "c"))

Then we can remove the test case "SPARK-18220: read Hive orc table with varchar column".
yeah that makes sense
Done.
# Conflicts:
#	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
Test build #72587 has finished for PR 16804 at commit
retest this please
LGTM pending test
Test build #72604 has finished for PR 16804 at commit
val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
val location = Utils.createTempDir().toURI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we remove this temp dir in the finally block?
Test build #72648 has finished for PR 16804 at commit
LGTM, merging to master!
…tadata

## What changes were proposed in this pull request?
Reading from an existing ORC table which contains `char` or `varchar` columns can fail with a `ClassCastException` if the table metadata has been created using Spark. This is caused by the fact that spark internally replaces `char` and `varchar` columns with a `string` column. This PR fixes this by adding the hive type to the `StructField`'s metadata under the `HIVE_TYPE_STRING` key. This is picked up by the `HiveClient` and the ORC reader, see apache#16060 for more details on how the metadata is used.

## How was this patch tested?
Added a regression test to `OrcSourceSuite`.

Author: Herman van Hovell <[email protected]>

Closes apache#16804 from hvanhovell/SPARK-19459.
## What changes were proposed in this pull request?
This PR is a small follow-up on apache#16804. This PR also adds support for nested char/varchar fields in orc.

## How was this patch tested?
I have added a regression test to the OrcSourceSuite.

Author: Herman van Hovell <[email protected]>

Closes apache#17030 from hvanhovell/SPARK-19459-follow-up.
## What changes were proposed in this pull request?
This PR is a small follow-up on apache#16804. This PR also adds support for nested char/varchar fields in orc.

## How was this patch tested?
I have added a regression test to the OrcSourceSuite.

Author: Herman van Hovell <[email protected]>

Closes apache#17030 from hvanhovell/SPARK-19459-follow-up.

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
#	sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala
This doesn't solve the problem when reading a CHAR/VARCHAR column in Hive from a table created using Spark, does it? Hive will fail when trying to convert the String to its CHAR/VARCHAR type.
What changes were proposed in this pull request?

Reading from an existing ORC table which contains `char` or `varchar` columns can fail with a `ClassCastException` if the table metadata has been created using Spark. This is caused by the fact that Spark internally replaces `char` and `varchar` columns with a `string` column.

This PR fixes this by adding the Hive type to the `StructField`'s metadata under the `HIVE_TYPE_STRING` key. This is picked up by the `HiveClient` and the ORC reader, see #16060 for more details on how the metadata is used.

How was this patch tested?

Added a regression test to `OrcSourceSuite`.
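As a rough illustration of how consumers can use this metadata (a minimal sketch; `hiveTypeOf` is a hypothetical helper, not part of this PR):

```scala
import org.apache.spark.sql.types._

// Recover the original Hive type for a field: prefer the recorded
// HIVE_TYPE_STRING entry, fall back to Spark's own type string otherwise.
def hiveTypeOf(field: StructField): String = {
  if (field.metadata.contains(HIVE_TYPE_STRING)) {
    field.metadata.getString(HIVE_TYPE_STRING)
  } else {
    field.dataType.catalogString
  }
}
```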