Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23937][SQL] Add map_filter SQL function #21986

Closed
wants to merge 8 commits into from

Conversation

mgaido91
Copy link
Contributor

@mgaido91 mgaido91 commented Aug 3, 2018

What changes were proposed in this pull request?

The PR adds the high order function map_filter, which filters the entries of a map and returns a new map which contains only the entries which satisfied the filter function.

How was this patch tested?

added UTs

@mgaido91
Copy link
Contributor Author

mgaido91 commented Aug 3, 2018

cc @ueshin

@SparkQA
Copy link

SparkQA commented Aug 3, 2018

Test build #94141 has finished for PR 21986 at commit 3f88e2a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait UnaryHigherOrderFunction extends HigherOrderFunction with ExpectsInputTypes
  • trait ArrayBasedUnaryHigherOrderFunction extends UnaryHigherOrderFunction
  • trait MapBasedUnaryHigherOrderFunction extends UnaryHigherOrderFunction
  • case class MapFilter(

@mgaido91
Copy link
Contributor Author

mgaido91 commented Aug 3, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Aug 3, 2018

Test build #94170 has finished for PR 21986 at commit 3f88e2a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait UnaryHigherOrderFunction extends HigherOrderFunction with ExpectsInputTypes
  • trait ArrayBasedUnaryHigherOrderFunction extends UnaryHigherOrderFunction
  • trait MapBasedUnaryHigherOrderFunction extends UnaryHigherOrderFunction
  • case class MapFilter(

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Aug 4, 2018

Test build #94199 has finished for PR 21986 at commit 3f88e2a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait UnaryHigherOrderFunction extends HigherOrderFunction with ExpectsInputTypes
  • trait ArrayBasedUnaryHigherOrderFunction extends UnaryHigherOrderFunction
  • trait MapBasedUnaryHigherOrderFunction extends UnaryHigherOrderFunction
  • case class MapFilter(

/**
* Trait for functions having as input one argument and one function.
*/
trait UnaryHigherOrderFunction extends HigherOrderFunction with ExpectsInputTypes {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this trait but I'm not sure whether we can say "Unary"HigherOrderFunction for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, how about defining nullSafeEval for input in this trait like UnaryExpression? (nullInputSafeEval?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I called it Unary as it gets one input and one function. Honestly I can't think of a better name without becoming very verbose. if you have a better suggestion I am happy to follow it. I will add the nullSafeEval, thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @hvanhovell for the naming?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use the term Unary a lot and this is different from the other uses. The name should convey a HigherOrderFunction that only uses a single (lambda) function right? The only thing I can come up with is SingleHigherOrderFunction. Simple would probably also be fine.

checkEvaluation(mapFilter(mii1, kGreaterThanV), Map())
checkEvaluation(mapFilter(miin, kGreaterThanV), null)

val valueNull: (Expression, Expression) => Expression = (_, v) => v.isNull
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: valueIsNull?

null
} else {
val retKeys = new mutable.ListBuffer[Any]
val retValues = new mutable.ListBuffer[Any]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just curious that ListBuffer is better than ArrayBuffer? If so, should we rewrite in ArrayFilter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better as here we are always appending (and then creating an array from it). Appending a value is always O(1) for ListBuffer, while in ArrayBuffer it is: O(1) if the length of the underlying allocated array is bigger than the number of elements in the list plus one, O(n) otherwise (since it has to create a new array and copy the old one). As the initial value for the length of the underlying array in ArrayBuffer is 16, this means that for output values with more than 16 elements ListBuffer saves at least one copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I just checked that in ArrayFilter you initialized it with the number of incoming elements. So i think there is no difference in terms of performance, as using an upper value for the number of output elements we are sure no copy is performed.

@SparkQA
Copy link

SparkQA commented Aug 6, 2018

Test build #94273 has finished for PR 21986 at commit 37e221c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Aug 6, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Aug 6, 2018

Test build #94272 has finished for PR 21986 at commit 9bbaa3b.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.


override def bind(f: (Expression, Seq[(DataType, Boolean)]) => LambdaFunction): MapFilter = {
function match {
case LambdaFunction(_, _, _) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this pattern matching necessary? If so, shouldn't ArrayFilter use it as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I am removing it, thanks

case _ =>
val MapType(kType, vType, vContainsNull) = MapType.defaultConcreteType
(kType, vType, vContainsNull)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about extracting this to object MapBasedUnaryHigherOrderFunction like array based one? We'll need this in other map based ones.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant something like:

object MapBasedUnaryHigherOrderFunction {

  def keyValueArgumentType(dt: DataType): (DataType, DataType, Boolean) = {
    dt match {
      case MapType(kType, vType, vContainsNull) => (kType, vType, vContainsNull)
      case _ =>
        val MapType(kType, vType, vContainsNull) = MapType.defaultConcreteType
        (kType, vType, vContainsNull)
    }
  }
}

...

case class MapFilter( ... ) {
  ...
  @transient val (keyType, valueType, valueContainsNull) =
    MapBasedUnaryHigherOrderFunction.keyValueArgumentType(input.dataType)
  ...
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, something wrong with introducing object to have util methods?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

  1. rename ArrayBasedHigherOrderFunction object to HigherOrderFunction
  2. rename elementArgumentType method to arrayElementArgumentType
  3. move keyValueArgumentType to HigherOrderFunction object and rename to mapKeyValueArgumentType

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, sorry I haven read carefully your comment, now I see what you meant. Yes, I agree unifying them in a Helper object. I am updating accordingly. Thanks.

function: Expression)
extends MapBasedUnaryHigherOrderFunction with CodegenFallback {

@transient val (keyType, valueType, valueContainsNull) = input.dataType match {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be a function in object MapBasedUnaryHigherOrderFunction, we can use it in other map based higher order function just like using ArrayBasedHigherOrderFunction.elementArgumentType.

@SparkQA
Copy link

SparkQA commented Aug 6, 2018

Test build #94280 has finished for PR 21986 at commit 37e221c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 6, 2018

Test build #94291 has finished for PR 21986 at commit 9c25ae6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 6, 2018

Test build #94289 has finished for PR 21986 at commit b58a1de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member

ueshin commented Aug 7, 2018

LGTM.

@SparkQA
Copy link

SparkQA commented Aug 7, 2018

Test build #94367 has finished for PR 21986 at commit af79644.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Aug 7, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Aug 7, 2018

Test build #94363 has finished for PR 21986 at commit 1823fb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait SimpleHigherOrderFunction extends HigherOrderFunction with ExpectsInputTypes
  • trait ArrayBasedSimpleHigherOrderFunction extends SimpleHigherOrderFunction
  • trait MapBasedSimpleHigherOrderFunction extends SimpleHigherOrderFunction

@SparkQA
Copy link

SparkQA commented Aug 7, 2018

Test build #94371 has finished for PR 21986 at commit af79644.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member

ueshin commented Aug 7, 2018

Thanks! merging to master.

@asfgit asfgit closed this in cb6cb31 Aug 7, 2018
asfgit pushed a commit that referenced this pull request Oct 25, 2018
## What changes were proposed in this pull request?

- Revert [SPARK-23935][SQL] Adding map_entries function: #21236
- Revert [SPARK-23937][SQL] Add map_filter SQL function: #21986
- Revert [SPARK-23940][SQL] Add transform_values SQL function: #22045
- Revert [SPARK-23939][SQL] Add transform_keys function: #22013
- Revert [SPARK-23938][SQL] Add map_zip_with function: #22017
- Revert the changes of map_entries in [SPARK-24331][SPARKR][SQL] Adding arrays_overlap, array_repeat, map_entries to SparkR: #21434

## How was this patch tested?
The existing tests.

Closes #22827 from gatorsmile/revertMap2.4.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants