[SPARK-23937][SQL] Add map_filter SQL function #21986
cc @ueshin
Test build #94141 has finished for PR 21986 at commit
retest this please
Test build #94170 has finished for PR 21986 at commit
retest this please
Test build #94199 has finished for PR 21986 at commit
/**
 * Trait for functions having as input one argument and one function.
 */
trait UnaryHigherOrderFunction extends HigherOrderFunction with ExpectsInputTypes {
I like this trait, but I'm not sure whether we can say "Unary"`HigherOrderFunction` for this.
Btw, how about defining `nullSafeEval` for `input` in this trait like `UnaryExpression`? (`nullInputSafeEval`?)
I called it `Unary` as it gets one input and one function. Honestly, I can't think of a better name without becoming very verbose. If you have a better suggestion, I am happy to follow it. I will add the `nullSafeEval`, thanks!
cc @hvanhovell for the naming?
We use the term `Unary` a lot, and this is different from the other uses. The name should convey a `HigherOrderFunction` that only uses a single (lambda) function, right? The only thing I can come up with is `SingleHigherOrderFunction`. `Simple` would probably also be fine.
checkEvaluation(mapFilter(mii1, kGreaterThanV), Map())
checkEvaluation(mapFilter(miin, kGreaterThanV), null)

val valueNull: (Expression, Expression) => Expression = (_, v) => v.isNull
nit: `valueIsNull`?
  null
} else {
  val retKeys = new mutable.ListBuffer[Any]
  val retValues = new mutable.ListBuffer[Any]
I'm just curious whether `ListBuffer` is better than `ArrayBuffer`? If so, should we rewrite it in `ArrayFilter` as well?
I think it is better, since here we are always appending (and then creating an array from the result). Appending a value is always O(1) for `ListBuffer`, while for `ArrayBuffer` it is O(1) only if the underlying allocated array has room for one more element, and O(n) otherwise (since it has to create a new array and copy the old one). As the initial length of the underlying array in `ArrayBuffer` is 16, this means that for outputs with more than 16 elements `ListBuffer` saves at least one copy.
But I just checked that in `ArrayFilter` you initialized it with the number of incoming elements. So I think there is no difference in terms of performance: when the initial size is an upper bound on the number of output elements, we are sure no copy is performed.
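The pre-sizing argument can be sketched as follows. This is an illustration only, not the code from `ArrayFilter` or `MapFilter`: the object name `PresizedFilterSketch` and the method `filterEntries` are hypothetical, and the sketch works on plain Scala collections instead of Spark's `ArrayData`.

```scala
import scala.collection.mutable

object PresizedFilterSketch {
  // Filters (key, value) pairs, collecting the survivors into pre-sized buffers.
  def filterEntries[K, V](keys: IndexedSeq[K], values: IndexedSeq[V])(
      pred: (K, V) => Boolean): (Seq[K], Seq[V]) = {
    val n = keys.length
    // The output can never be longer than the input, so n is a safe upper
    // bound on capacity and the backing arrays never need to grow.
    val retKeys = new mutable.ArrayBuffer[K](n)
    val retValues = new mutable.ArrayBuffer[V](n)
    var i = 0
    while (i < n) {
      if (pred(keys(i), values(i))) {
        retKeys += keys(i)
        retValues += values(i)
      }
      i += 1
    }
    (retKeys.toSeq, retValues.toSeq)
  }
}
```

The trade-off is memory: the buffers are allocated for the worst case even when the predicate rejects most entries.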
Test build #94273 has finished for PR 21986 at commit
retest this please
Test build #94272 has finished for PR 21986 at commit
override def bind(f: (Expression, Seq[(DataType, Boolean)]) => LambdaFunction): MapFilter = {
  function match {
    case LambdaFunction(_, _, _) =>
Is this pattern matching necessary? If so, shouldn't `ArrayFilter` use it as well?
right, I am removing it, thanks
  case _ =>
    val MapType(kType, vType, vContainsNull) = MapType.defaultConcreteType
    (kType, vType, vContainsNull)
}
How about extracting this to `object MapBasedUnaryHigherOrderFunction`, like the array-based one? We'll need this in other map-based ones.
Sorry, I meant something like:
object MapBasedUnaryHigherOrderFunction {
  def keyValueArgumentType(dt: DataType): (DataType, DataType, Boolean) = {
    dt match {
      case MapType(kType, vType, vContainsNull) => (kType, vType, vContainsNull)
      case _ =>
        val MapType(kType, vType, vContainsNull) = MapType.defaultConcreteType
        (kType, vType, vContainsNull)
    }
  }
}
...
case class MapFilter( ... ) {
  ...
  @transient val (keyType, valueType, valueContainsNull) =
    MapBasedUnaryHigherOrderFunction.keyValueArgumentType(input.dataType)
  ...
}
Hmm, is there something wrong with introducing an object to hold util methods?
How about:
- rename the `ArrayBasedHigherOrderFunction` object to `HigherOrderFunction`
- rename the `elementArgumentType` method to `arrayElementArgumentType`
- move `keyValueArgumentType` to the `HigherOrderFunction` object and rename it to `mapKeyValueArgumentType`
Oh, sorry, I hadn't read your comment carefully; now I see what you meant. Yes, I agree with unifying them in a helper object. I am updating accordingly. Thanks.
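The unified helper object proposed in this thread could look roughly like the sketch below. It is self-contained only because it uses toy stand-ins for Spark's `DataType` hierarchy; the object name `HigherOrderFunctionSketch` and the fallback values `defaultArray` and `defaultMap` are assumptions for illustration (the snippets above use `MapType.defaultConcreteType` for the map fallback).

```scala
object HigherOrderFunctionSketch {
  // Toy stand-ins for Spark's DataType hierarchy, just enough for the sketch.
  sealed trait DataType
  case object NullType extends DataType
  case object IntegerType extends DataType
  case object StringType extends DataType
  final case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType
  final case class MapType(keyType: DataType, valueType: DataType, valueContainsNull: Boolean)
      extends DataType

  // Assumed fallbacks, used when the input is not of the expected type.
  private val defaultArray = ArrayType(NullType, containsNull = true)
  private val defaultMap = MapType(NullType, NullType, valueContainsNull = true)

  // Argument type of the lambda for array-based functions: (element type, nullable).
  def arrayElementArgumentType(dt: DataType): (DataType, Boolean) = dt match {
    case ArrayType(elementType, containsNull) => (elementType, containsNull)
    case _ => (defaultArray.elementType, defaultArray.containsNull)
  }

  // Argument types of the lambda for map-based functions: (key, value, value nullable).
  def mapKeyValueArgumentType(dt: DataType): (DataType, DataType, Boolean) = dt match {
    case MapType(kType, vType, vContainsNull) => (kType, vType, vContainsNull)
    case _ => (defaultMap.keyType, defaultMap.valueType, defaultMap.valueContainsNull)
  }
}
```

Keeping both methods in one object means each new array-based or map-based function can reuse the same type-extraction logic instead of repeating the pattern match.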
    function: Expression)
  extends MapBasedUnaryHigherOrderFunction with CodegenFallback {

  @transient val (keyType, valueType, valueContainsNull) = input.dataType match {
Maybe this should be a function in `object MapBasedUnaryHigherOrderFunction`, so that we can use it in other map-based higher-order functions, just like `ArrayBasedHigherOrderFunction.elementArgumentType`.
Test build #94280 has finished for PR 21986 at commit
Test build #94291 has finished for PR 21986 at commit
Test build #94289 has finished for PR 21986 at commit
LGTM.
Test build #94367 has finished for PR 21986 at commit
retest this please
Test build #94363 has finished for PR 21986 at commit
Test build #94371 has finished for PR 21986 at commit
Thanks! merging to master.
## What changes were proposed in this pull request?
- Revert [SPARK-23935][SQL] Adding map_entries function: #21236
- Revert [SPARK-23937][SQL] Add map_filter SQL function: #21986
- Revert [SPARK-23940][SQL] Add transform_values SQL function: #22045
- Revert [SPARK-23939][SQL] Add transform_keys function: #22013
- Revert [SPARK-23938][SQL] Add map_zip_with function: #22017
- Revert the changes of map_entries in [SPARK-24331][SPARKR][SQL] Adding arrays_overlap, array_repeat, map_entries to SparkR: #21434
## How was this patch tested?
The existing tests.
Closes #22827 from gatorsmile/revertMap2.4.
Authored-by: gatorsmile <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
The PR adds the higher-order function `map_filter`, which filters the entries of a map and returns a new map containing only the entries that satisfy the filter function.
How was this patch tested?
Added UTs.
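As a rough illustration of the semantics described above, here is a sketch on plain Scala maps rather than on Spark expressions. The object name `MapFilterSemanticsSketch` is hypothetical; the null-in, null-out behaviour mirrors the `checkEvaluation(mapFilter(miin, kGreaterThanV), null)` test quoted in the review.

```scala
object MapFilterSemanticsSketch {
  // Returns null for a null map; otherwise keeps only the entries
  // for which the predicate on (key, value) holds.
  def mapFilter[K, V](m: Map[K, V])(pred: (K, V) => Boolean): Map[K, V] =
    if (m == null) null else m.filter { case (k, v) => pred(k, v) }
}
```

For example, `MapFilterSemanticsSketch.mapFilter(Map(1 -> 0, 2 -> 3, 3 -> 1))(_ > _)` keeps the entries whose key is greater than the value, which is the same kind of predicate as `kGreaterThanV` in the tests.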