Fix (DataFrame|Series).isin to pass numpy array #2103
Codecov Report
@@ Coverage Diff @@
## master #2103 +/- ##
==========================================
- Coverage 95.21% 89.78% -5.44%
==========================================
Files 60 60
Lines 13460 13347 -113
==========================================
- Hits 12816 11983 -833
- Misses 644 1364 +720
@@ -1848,10 +1848,16 @@ def test_isin(self):
kdf = ks.from_pandas(pdf)

self.assert_eq(kdf.isin([4, "six"]), pdf.isin([4, "six"]))
# Seems like pandas has a bug when passing `np.array` as parameter
pandas should be koalas :)
Ohh... thanks, good catch!
This happens because Koalas internally uses PySpark's `isin` directly, but it seems PySpark's `isin` doesn't distinguish between the types of 4 and "4":
>>> sdf
DataFrame[__index_level_0__: double, a: bigint, b: bigint, c: string, __natural_order__: bigint]
>>> sdf.select(sdf.a).show()
+---+
| a|
+---+
| 4|
| 2|
| 3|
| 4|
| 8|
| 6|
+---+
>>> sdf.select(sdf.a.isin(4)).show()
+----------+
|(a IN (4))|
+----------+
| true|
| false|
| false|
| true|
| false|
| false|
+----------+
>>> sdf.select(sdf.a.isin('4')).show()
+----------+
|(a IN (4))|
+----------+
| true|
| false|
| false|
| true|
| false|
| false|
+----------+
I think the last call should return false for all rows, as shown below, since the type of column `a` is bigint.
# Expected result for `sdf.select(sdf.a.isin('4')).show()`, which is not what PySpark actually returns.
+----------+
|(a IN (4))|
+----------+
|     false|
|     false|
|     false|
|     false|
|     false|
|     false|
+----------+
@ueshin, @HyukjinKwon, is this the expected behavior of PySpark, or should we fix it?
It must be caused by the type-coercion rules. We can leave it as-is. cc @HyukjinKwon
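To make the difference concrete, here is a toy, pure-Python model of the two semantics discussed in this thread. It does not call PySpark: `strict_isin` and `coercing_isin` are hypothetical names, and comparing string forms is only an assumed stand-in for whatever cast the type-coercion rules actually apply.

```python
def strict_isin(column, values):
    """Type-aware membership: 4 and "4" are treated as different values."""
    return [any(type(x) is type(v) and x == v for v in values) for x in column]


def coercing_isin(column, values):
    """Membership after casting both sides to a common type (here: str).

    Assumption: the string cast is used only to illustrate coercion.
    """
    wanted = {str(v) for v in values}
    return [str(x) in wanted for x in column]


a = [4, 2, 3, 4, 8, 6]  # same values as column `a` in the example above

print(strict_isin(a, ["4"]))    # every row is False
print(coercing_isin(a, ["4"]))  # True only on the rows where a == 4
```

The coercing variant reproduces the true/false pattern shown for `sdf.a.isin('4')`, while the strict variant gives the all-false result that was originally expected.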
Thanks, @ueshin :)
I'm pretty confident about this fix. Please feel free to leave a comment if you have any!
`(Series|DataFrame).isin` doesn't work properly when a numpy array is passed as a parameter. This should resolve #2098.
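A minimal sketch of one way to accept array-like input, assuming the array exposes `.tolist()` as `numpy.ndarray` does. `normalize_isin_values` is a hypothetical helper for illustration, not necessarily the change made in this PR.

```python
def normalize_isin_values(values):
    """Return a plain Python list suitable for passing to an `isin` call.

    Hypothetical helper: array-likes that expose `.tolist()` (for example
    numpy.ndarray) are converted to a list of Python objects; any other
    iterable is copied into a list.
    """
    if hasattr(values, "tolist"):
        return values.tolist()
    return list(values)


print(normalize_isin_values((4, "six")))  # [4, 'six']
```

Note that `np.array([4, "six"])` stores both elements as strings, so its `.tolist()` yields `['4', 'six']`. That is likely why the 4 versus "4" question comes up in the review of the test.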