[SPARK-23937][SQL] Add map_filter SQL function #21986

mgaido91 · 2018-08-03T15:21:12Z

What changes were proposed in this pull request?

This PR adds the higher-order function map_filter, which filters the entries of a map and returns a new map containing only the entries that satisfy the filter function.
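To make the intended semantics concrete, here is a minimal plain-Scala sketch. It is only a model of the behaviour described above, not the Catalyst implementation in this PR: `MapFilterSketch` and `mapFilter` are hypothetical names, and the choice to return null for a null input map is an assumption.

```scala
object MapFilterSketch {
  // Plain-Scala model of map_filter: keep only the entries for which the
  // predicate on (key, value) holds. Hypothetical helper, not Spark code.
  // Assumption: a null input map yields a null result.
  def mapFilter[K, V](input: Map[K, V])(pred: (K, V) => Boolean): Map[K, V] =
    if (input == null) null
    else input.filter { case (k, v) => pred(k, v) }

  def main(args: Array[String]): Unit = {
    val m = Map(1 -> 0, 2 -> 2, 3 -> -1)
    // Keeps the entries whose key is greater than the value.
    println(mapFilter(m)((k, v) => k > v))
  }
}
```

In SQL the same filter would be written with a lambda, along the lines of `map_filter(map(1, 0, 2, 2, 3, -1), (k, v) -> k > v)`.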

How was this patch tested?

Added unit tests (UTs).

mgaido91 · 2018-08-03T15:21:50Z

cc @ueshin

SparkQA · 2018-08-03T18:58:07Z

Test build #94141 has finished for PR 21986 at commit 3f88e2a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait UnaryHigherOrderFunction extends HigherOrderFunction with ExpectsInputTypes
trait ArrayBasedUnaryHigherOrderFunction extends UnaryHigherOrderFunction
trait MapBasedUnaryHigherOrderFunction extends UnaryHigherOrderFunction
case class MapFilter(

mgaido91 · 2018-08-03T20:00:48Z

retest this please

SparkQA · 2018-08-03T23:21:27Z

Test build #94170 has finished for PR 21986 at commit 3f88e2a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait UnaryHigherOrderFunction extends HigherOrderFunction with ExpectsInputTypes
trait ArrayBasedUnaryHigherOrderFunction extends UnaryHigherOrderFunction
trait MapBasedUnaryHigherOrderFunction extends UnaryHigherOrderFunction
case class MapFilter(

HyukjinKwon · 2018-08-04T06:03:02Z

retest this please

SparkQA · 2018-08-04T07:05:01Z

Test build #94199 has finished for PR 21986 at commit 3f88e2a.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait UnaryHigherOrderFunction extends HigherOrderFunction with ExpectsInputTypes
trait ArrayBasedUnaryHigherOrderFunction extends UnaryHigherOrderFunction
trait MapBasedUnaryHigherOrderFunction extends UnaryHigherOrderFunction
case class MapFilter(

ueshin · 2018-08-04T07:52:18Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+/**
+ * Trait for functions having as input one argument and one function.
+ */
+trait UnaryHigherOrderFunction extends HigherOrderFunction with ExpectsInputTypes {


I like this trait but I'm not sure whether we can say "Unary"HigherOrderFunction for this.

Btw, how about defining nullSafeEval for input in this trait like UnaryExpression? (nullInputSafeEval?)

I called it Unary because it takes one input and one function. Honestly, I can't think of a better name that isn't very verbose. If you have a better suggestion, I am happy to follow it. I will add the nullSafeEval, thanks!

cc @hvanhovell for the naming?

We use the term Unary a lot, and this is different from the other uses. The name should convey a HigherOrderFunction that only uses a single (lambda) function, right? The only thing I can come up with is SingleHigherOrderFunction. Simple would probably also be fine.
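The nullSafeEval suggestion in this thread can be illustrated with a small self-contained sketch. This is not Spark's trait: `NullSafeInput` and `Describe` are hypothetical names, and the types are simplified stand-ins for Catalyst expressions. The idea is only that the shared trait handles the null input once, so each concrete function implements just the non-null case.

```scala
// Hypothetical, simplified stand-in for the trait discussed above; not Spark code.
trait NullSafeInput[I, O >: Null] {
  // Subclasses implement only the case where the input is non-null.
  protected def nullSafeEval(input: I): O

  // A null input short-circuits to a null result.
  final def eval(input: I): O =
    if (input == null) null else nullSafeEval(input)
}

// Example subclass: describes a map without having to check for null itself.
object Describe extends NullSafeInput[Map[String, Int], String] {
  protected def nullSafeEval(input: Map[String, Int]): String =
    s"${input.size} entries"
}
```

With this shape, `Describe.eval(null)` returns null without reaching `nullSafeEval`, while a non-null map is passed through to it.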

ueshin · 2018-08-04T08:01:39Z

...yst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala

+    checkEvaluation(mapFilter(mii1, kGreaterThanV), Map())
+    checkEvaluation(mapFilter(miin, kGreaterThanV), null)
+
+    val valueNull: (Expression, Expression) => Expression = (_, v) => v.isNull


nit: valueIsNull?

ueshin · 2018-08-04T08:25:09Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+      null
+    } else {
+      val retKeys = new mutable.ListBuffer[Any]
+      val retValues = new mutable.ListBuffer[Any]


I'm just curious whether ListBuffer is better than ArrayBuffer. If so, should we rewrite ArrayFilter to use it as well?

I think it is better, since here we are always appending (and then creating an array from the result). Appending a value is always O(1) for ListBuffer, while for ArrayBuffer it is O(1) only if the underlying array still has spare capacity, and O(n) otherwise (since it has to allocate a new array and copy the old one). As the initial capacity of the underlying array in ArrayBuffer is 16, for outputs with more than 16 elements ListBuffer saves at least one copy.

But I just checked that in ArrayFilter you initialized it with the number of incoming elements. So I think there is no difference in terms of performance there: since the initial capacity is an upper bound on the number of output elements, we are sure no copy is performed.
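The two options compared in this thread can be sketched side by side. This is an illustrative sketch over plain `Array[Int]`, not the code in ArrayFilter or MapFilter; `BufferSketch`, `filterPresized` and `filterList` are hypothetical names.

```scala
import scala.collection.mutable

object BufferSketch {
  // ArrayBuffer presized to the input length: a filter can never emit more
  // elements than it receives, so appends never outgrow the initial capacity.
  def filterPresized(xs: Array[Int])(pred: Int => Boolean): Array[Int] = {
    val buf = new mutable.ArrayBuffer[Int](xs.length)
    xs.foreach(x => if (pred(x)) buf += x)
    buf.toArray
  }

  // ListBuffer: every append is O(1) and there is no capacity to manage,
  // at the cost of one list cell per retained element.
  def filterList(xs: Array[Int])(pred: Int => Boolean): Array[Int] = {
    val buf = new mutable.ListBuffer[Int]
    xs.foreach(x => if (pred(x)) buf += x)
    buf.toArray
  }
}
```

Both variants produce the same result; they differ only in how the intermediate buffer grows.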

SparkQA · 2018-08-06T10:28:42Z

Test build #94273 has finished for PR 21986 at commit 37e221c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-08-06T10:38:59Z

retest this please

SparkQA · 2018-08-06T13:35:27Z

Test build #94272 has finished for PR 21986 at commit 9bbaa3b.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

mn-mikke · 2018-08-06T14:21:24Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+
+  override def bind(f: (Expression, Seq[(DataType, Boolean)]) => LambdaFunction): MapFilter = {
+    function match {
+      case LambdaFunction(_, _, _) =>


Is this pattern matching necessary? If so, shouldn't ArrayFilter use it as well?

right, I am removing it, thanks

ueshin · 2018-08-06T14:59:48Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+    case _ =>
+      val MapType(kType, vType, vContainsNull) = MapType.defaultConcreteType
+      (kType, vType, vContainsNull)
+  }


How about extracting this to an object MapBasedUnaryHigherOrderFunction, like the array-based one? We'll need this in other map-based ones.

Sorry, I meant something like:

    object MapBasedUnaryHigherOrderFunction {
      def keyValueArgumentType(dt: DataType): (DataType, DataType, Boolean) = {
        dt match {
          case MapType(kType, vType, vContainsNull) => (kType, vType, vContainsNull)
          case _ =>
            val MapType(kType, vType, vContainsNull) = MapType.defaultConcreteType
            (kType, vType, vContainsNull)
        }
      }
    }

    ...

    case class MapFilter(
        ...
      ) {
      ...
      @transient val (keyType, valueType, valueContainsNull) =
        MapBasedUnaryHigherOrderFunction.keyValueArgumentType(input.dataType)
      ...
    }

Hmm, is there something wrong with introducing an object to hold utility methods?

How about:

- rename ArrayBasedHigherOrderFunction object to HigherOrderFunction
- rename elementArgumentType method to arrayElementArgumentType
- move keyValueArgumentType to HigherOrderFunction object and rename to mapKeyValueArgumentType

Oh, sorry, I hadn't read your comment carefully; now I see what you meant. Yes, I agree with unifying them in a helper object. I am updating accordingly. Thanks.

xuanyuanking · 2018-08-06T15:02:56Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

+    function: Expression)
+  extends MapBasedUnaryHigherOrderFunction with CodegenFallback {
+
+  @transient val (keyType, valueType, valueContainsNull) = input.dataType match {


Maybe this should be a function in object MapBasedUnaryHigherOrderFunction, so we can use it in other map-based higher-order functions, just like ArrayBasedHigherOrderFunction.elementArgumentType.

SparkQA · 2018-08-06T17:59:03Z

Test build #94280 has finished for PR 21986 at commit 37e221c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-06T18:01:30Z

Test build #94291 has finished for PR 21986 at commit 9c25ae6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-08-06T19:05:12Z

Test build #94289 has finished for PR 21986 at commit b58a1de.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2018-08-07T11:33:30Z

LGTM.

SparkQA · 2018-08-07T12:30:33Z

Test build #94367 has finished for PR 21986 at commit af79644.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-08-07T12:32:20Z

retest this please

SparkQA · 2018-08-07T13:47:37Z

Test build #94363 has finished for PR 21986 at commit 1823fb2.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait SimpleHigherOrderFunction extends HigherOrderFunction with ExpectsInputTypes
trait ArrayBasedSimpleHigherOrderFunction extends SimpleHigherOrderFunction
trait MapBasedSimpleHigherOrderFunction extends SimpleHigherOrderFunction

SparkQA · 2018-08-07T16:32:15Z

Test build #94371 has finished for PR 21986 at commit af79644.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2018-08-07T17:11:41Z

Thanks! merging to master.

## What changes were proposed in this pull request?

- Revert [SPARK-23935][SQL] Adding map_entries function: #21236
- Revert [SPARK-23937][SQL] Add map_filter SQL function: #21986
- Revert [SPARK-23940][SQL] Add transform_values SQL function: #22045
- Revert [SPARK-23939][SQL] Add transform_keys function: #22013
- Revert [SPARK-23938][SQL] Add map_zip_with function: #22017
- Revert the changes of map_entries in [SPARK-24331][SPARKR][SQL] Adding arrays_overlap, array_repeat, map_entries to SparkR: #21434

## How was this patch tested?

The existing tests.

Closes #22827 from gatorsmile/revertMap2.4.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

[SPARK-23937][SQL] Add map_filter SQL function

3f88e2a

ueshin reviewed Aug 4, 2018

mgaido91 added 2 commits August 6, 2018 10:39

address comments

9bbaa3b

Merge branch 'master' of github.com:apache/spark into SPARK-23937

37e221c

mn-mikke reviewed Aug 6, 2018

address comment

b58a1de

ueshin reviewed Aug 6, 2018

xuanyuanking reviewed Aug 6, 2018

address comment

9c25ae6

address comments

1823fb2

ueshin mentioned this pull request Aug 7, 2018

[SPARK-23938][SQL] Add map_zip_with function #22017

Closed

mgaido91 added 2 commits August 7, 2018 12:44

address comment

16d8b64

rename to HigherOrderFunction

af79644

asfgit closed this in cb6cb31 Aug 7, 2018

gatorsmile mentioned this pull request Oct 25, 2018

[SPARK-25832][SQL][BRANCH-2.4] Revert newly added map related functions #22827

Closed
