Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23939][SQL] Add transform_keys function #22013

Closed
wants to merge 15 commits into from

Conversation

codeatri
Copy link
Contributor

@codeatri codeatri commented Aug 6, 2018

What changes were proposed in this pull request?

This pr adds transform_keys function which applies the function to each entry of the map and transforms the keys.

> SELECT transform_keys(map(array(1, 2, 3), array(1, 2, 3)), (k,v) -> k + 1);
       map(2->1, 3->2, 4->3)

> SELECT transform_keys(map(array(1, 2, 3), array(1, 2, 3)), (k,v) -> k + v);
       map(2->1, 4->2, 6->3)

How was this patch tested?

Added tests.

@hvanhovell
Copy link
Contributor

ok to test

@SparkQA
Copy link

SparkQA commented Aug 7, 2018

Test build #94315 has finished for PR 22013 at commit 0a19cc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class TransformKeys(

while (i < arr.numElements) {
keyVar.value.set(arr.keyArray().get(i, keyVar.dataType))
valueVar.value.set(arr.valueArray().get(i, valueVar.dataType))
resultKeys.update(i, f.eval(input))
Copy link
Contributor

@hvanhovell hvanhovell Aug 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes that the transformation will return a unique key right? If it doesn't you'll break the map semantics. For example: transform_keys(some_map, (k, v) -> 0)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fun of duplicated keys either, but other functions transforming maps have the same problem. See the discussions here and here.

Example:

scala> spark.range(1).selectExpr("map(0,1,0,2)").show()
+----------------+
| map(0, 1, 0, 2)|
+----------------+
|[0 -> 1, 0 -> 2]|
+----------------+


override def dataType: DataType = {
val valueType = input.dataType.asInstanceOf[MapType].valueType
MapType(function.dataType, valueType, input.nullable)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here input.nullable is wrong. This should indicate whether the value contains null, not whether the returned object can be null or not.

usage = "_FUNC_(expr, func) - Transforms elements in a map using the function.",
examples = """
Examples:
> SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3), (k,v) -> k + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missing space -> k, v

@@ -365,3 +365,69 @@ case class ArrayAggregate(

override def prettyName: String = "aggregate"
}

/**
* Transform Keys in a map using the transform_keys function.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a better comment?

while (i < arr.numElements) {
keyVar.value.set(arr.keyArray().get(i, keyVar.dataType))
valueVar.value.set(arr.valueArray().get(i, valueVar.dataType))
resultKeys.update(i, f.eval(input))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something, but couldn't f.eval(input) be evaluated to null? Keys are not allowed to benull. Other functions have usually a null check and throw RuntimeException for such cases.

dfExample5.cache()
dfExample6.cache()
// Test with cached relation, the Project will be evaluated with codegen
testMapOfPrimitiveTypesCombination()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have do that if the expression implements CodegenFallback?


test("TransformKeys") {
val ai0 = Literal.create(
Map(1 -> 1, 2 -> 2, 3 -> 3),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's maybe irrelevant but WDYT about adding test cases with null values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this!
Included test cases, both here and in DataFrameFunctionsSuite.

@SparkQA
Copy link

SparkQA commented Aug 8, 2018

Test build #94404 has finished for PR 22013 at commit 5806ac4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 8, 2018

Test build #94405 has finished for PR 22013 at commit 150a6a5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 8, 2018

Test build #94451 has finished for PR 22013 at commit 9f6a8ab.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 9, 2018

Test build #94455 has finished for PR 22013 at commit 6526630.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 9, 2018

Test build #94457 has finished for PR 22013 at commit f7fd231.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@codeatri
Copy link
Contributor Author

codeatri commented Aug 9, 2018

@hvanhovell @mn-mikke @mgaido91 Thanks for the review! I have addressed all your comments and added appropriate test cases for the same.

usage = "_FUNC_(expr, func) - Transforms elements in a map using the function.",
examples = """
Examples:
> SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3), (k, v) -> k + 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we need one more right parenthesis after the second array(1, 2, 3)?

Examples:
> SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3), (k, v) -> k + 1);
map(array(2, 3, 4), array(1, 2, 3))
> SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3), (k, v) -> k + v);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

def transformKeys(expr: Expression, f: (Expression, Expression) => Expression): Expression = {
val valueType = expr.dataType.asInstanceOf[MapType].valueType
val keyType = expr.dataType.asInstanceOf[MapType].keyType
TransformKeys(expr, createLambda(keyType, false, valueType, true, f))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use valueContainsNull instead of true?

test("TransformKeys") {
val ai0 = Literal.create(
Map(1 -> 1, 2 -> 2, 3 -> 3),
MapType(IntegerType, IntegerType))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add valueContainsNull explicitly?

MapType(IntegerType, IntegerType))
val ai2 = Literal.create(
Map(1 -> 1, 2 -> null, 3 -> 3),
MapType(IntegerType, IntegerType))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add tests for Literal.create(null, MapType(IntegerType, IntegerType))?

@@ -2117,6 +2117,198 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext {
assert(ex4.getMessage.contains("data type mismatch: argument 3 requires int type"))
}

test("transform keys function - test various primitive data types combinations") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need so many cases here. We only need to verify the api works end to end.
Evaluation checks of the function should be in HigherOrderFunctionsSuite.

@SparkQA
Copy link

SparkQA commented Aug 9, 2018

Test build #94518 has finished for PR 22013 at commit 1cbaf0c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@codeatri
Copy link
Contributor Author

codeatri commented Aug 9, 2018

Thanks for the review @ueshin! I have addressed all your comments.

@SparkQA
Copy link

SparkQA commented Aug 14, 2018

Test build #94758 has finished for PR 22013 at commit bb52630.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 14, 2018

Test build #94765 has finished for PR 22013 at commit 621213d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 15, 2018

Test build #94775 has finished for PR 22013 at commit 5db526b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

keyVar.value.set(map.keyArray().get(i, keyVar.dataType))
valueVar.value.set(map.valueArray().get(i, valueVar.dataType))
val result = f.eval(inputRow)
if (result == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra space between == and null.

MapType(function.dataType, map.valueType, map.valueContainsNull)
}

@transient val MapType(keyType, valueType, valueContainsNull) = argument.dataType
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lazy val?
Could you add a test when argument is not a map in invalid cases of DataFrameFunctionsSuite?


override def dataType: DataType = {
val map = argument.dataType.asInstanceOf[MapType]
MapType(function.dataType, map.valueType, map.valueContainsNull)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use valueType and valueContainsNull from the following val?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this?

}

override def prettyName: String = "transform_keys"
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indent


def testInvalidLambdaFunctions(): Unit = {
val ex1 = intercept[AnalysisException] {
dfExample1.selectExpr("transform_keys(i, k -> k )")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra space after k -> k.

dfExample2.selectExpr("transform_keys(j, (k, v, x) -> k + 1)")
}
assert(ex2.getMessage.contains(
"The number of lambda function arguments '3' does not match"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indent

).toDF("i")

val dfExample2 = Seq(
Map[Int, Double](1 -> 1.0E0, 2 -> 1.4E0, 3 -> 1.7E0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need E0?


create or replace temporary view nested as values
(1, map(1,1,2,2,3,3)),
(2, map(4,4,5,5,6,6))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

  (1, map(1, 1, 2, 2, 3, 3)),
  (2, map(4, 4, 5, 5, 6, 6))

MapType(StringType, StringType, valueContainsNull = true))
val as2 = Literal.create(null,
MapType(StringType, StringType, valueContainsNull = false))
val asn = Literal.create(Map.empty[StringType, StringType],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as3?

testInvalidLambdaFunctions()
dfExample1.cache()
dfExample2.cache()
testInvalidLambdaFunctions()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need dfExample3.cache() as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ueshin I would like to ask you a generic question regarding higher-order functions. Is it necessary to perform checks with codegen paths if all the newly added functions extends from CodegenFallback? Eventually, is there a plan to add coden for these functions in future?

@ueshin
Copy link
Member

ueshin commented Aug 15, 2018

Btw, we need one more right parenthesis after the second array(1, 2, 3) and a space at (k,v) in the description?

val LambdaFunction(
_, (keyVar: NamedLambdaVariable) :: (valueVar: NamedLambdaVariable) :: Nil, _) = function
(keyVar, valueVar)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: how about:

@transient lazy val LambdaFunction(_,
  (keyVar: NamedLambdaVariable) :: (valueVar: NamedLambdaVariable) :: Nil, _) = function

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant we don't need to surround by:

@transient lazy val (keyVar, valueVar) = {
  ...
  (keyVar, valueVar)
}

just

@transient lazy val LambdaFunction(_,
  (keyVar: NamedLambdaVariable) :: (valueVar: NamedLambdaVariable) :: Nil, _) = function

should work.

@SparkQA
Copy link

SparkQA commented Aug 15, 2018

Test build #94788 has finished for PR 22013 at commit e5d9b05.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@transient lazy val MapType(keyType, valueType, valueContainsNull) = argument.dataType

override def dataType: DataType = {
MapType(function.dataType, valueType, valueContainsNull)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: just in one line?


override def nullSafeEval(inputRow: InternalRow, argumentValue: Any): Any = {
val map = argumentValue.asInstanceOf[MapData]
val f = functionForEval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we use functionForEval directly?

@SparkQA
Copy link

SparkQA commented Aug 15, 2018

Test build #94811 has finished for PR 22013 at commit fb885f4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 15, 2018

Test build #94817 has finished for PR 22013 at commit 58b60b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 15, 2018

Test build #94819 has finished for PR 22013 at commit 2f4943f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member

ueshin commented Aug 16, 2018

LGTM.
@mn-mikke @mgaido91 Do you have any other comments on this?

@ueshin
Copy link
Member

ueshin commented Aug 16, 2018

I'd merge this now.
@mn-mikke @mgaido91 If you have any other comments, let's have a follow-up pr.

@ueshin
Copy link
Member

ueshin commented Aug 16, 2018

Thanks! merging to master.

@asfgit asfgit closed this in 5b4a38d Aug 16, 2018
Copy link
Contributor

@mgaido91 mgaido91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one very minor comment, thanks

function: Expression)
extends MapBasedSimpleHigherOrderFunction with CodegenFallback {

override def nullable: Boolean = argument.nullable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be moved to SimpleHigherOrderFunction

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense.
Let's have wrap-up prs for higher-order functions after the remaining 2 prs are merged.

asfgit pushed a commit that referenced this pull request Oct 25, 2018
## What changes were proposed in this pull request?

- Revert [SPARK-23935][SQL] Adding map_entries function: #21236
- Revert [SPARK-23937][SQL] Add map_filter SQL function: #21986
- Revert [SPARK-23940][SQL] Add transform_values SQL function: #22045
- Revert [SPARK-23939][SQL] Add transform_keys function: #22013
- Revert [SPARK-23938][SQL] Add map_zip_with function: #22017
- Revert the changes of map_entries in [SPARK-24331][SPARKR][SQL] Adding arrays_overlap, array_repeat, map_entries to SparkR: #21434

## How was this patch tested?
The existing tests.

Closes #22827 from gatorsmile/revertMap2.4.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants