-
Notifications
You must be signed in to change notification settings - Fork 358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Series.all and Series.any #359
Conversation
# Here we check if the count of `True`s is more than one in order to mimic `any`. | ||
return sdf.select( | ||
(F.count(F.when(col.cast('boolean'), 1).otherwise(None)) >= 1) | ||
).collect()[0][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hvanhovell do you remember this discussion about rewriting for any
and every
at Apache Spark side? (IIRC you suggested max/min approach but the hole was None
handling). Can you quickly check if this rewriting makes sense to you?
Here I need to mimic both here to mimic Pandas' with rewriting since any
and every
only exist in the Spark's master.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't entirely remember that comment. I did found an e-mail thread about this, and there I state the opposite: ANY(col)
can be safely rewritten into MAX(col)
you may need to add a coalesce if we need to return false
for an empty dataset.
Codecov Report
@@ Coverage Diff @@
## master #359 +/- ##
==========================================
- Coverage 94.4% 94.36% -0.04%
==========================================
Files 36 36
Lines 3646 3656 +10
==========================================
+ Hits 3442 3450 +8
- Misses 204 206 +2
Continue to review full report at Codecov.
|
All tests passed |
# Here we check if the count of `True`s is more than one in order to mimic `any`. | ||
return sdf.select( | ||
(F.count(F.when(col.cast('boolean'), 1).otherwise(None)) >= 1) | ||
).collect()[0][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, count(col)
ignores null
Let me get this in. The output is as expected anyway. |
# Note that we're ignoring `None`s here for now. | ||
# Here we check if the count of `True`s is more than one in order to mimic `any`. | ||
return sdf.select( | ||
(F.count(F.when(col.cast('boolean'), 1).otherwise(None)) >= 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two nits: otherwise(None)
should be implied, and if you do a is not empty check why not check > 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, currently None
is being simply ignored (that's what Pandas does by default - skipna
argument). Yes .. > 0
looks better. Will fix it soon. thanks :D.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also let me give a shot with min/max ones too
No description provided.