Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix (DataFrame|Series).isin to pass numpy array #2103

Merged
merged 1 commit into from
Mar 24, 2021

Conversation

itholic
Copy link
Contributor

@itholic itholic commented Mar 15, 2021

(Series|DataFrame).isin don't work properly when passing numpy array as a parameter.

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4]})
>>> numpy_arr = np.array([3, 4, 5])
>>> df['a'].isin(numpy_arr)
0    False
1    False
2     True
Name: a, dtype: bool

>>> kdf = ks.from_pandas(df)
>>> kdf[kdf['a'].isin(numpy_arr)]
Traceback (most recent call last):
...
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'

This should resolve #2098

@itholic itholic changed the title Fix isin to pass numpy array Fix (DataFrame|Series).isin to pass numpy array Mar 15, 2021
@codecov-io
Copy link

Codecov Report

Merging #2103 (716e996) into master (2fe8796) will decrease coverage by 5.43%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2103      +/-   ##
==========================================
- Coverage   95.21%   89.78%   -5.44%     
==========================================
  Files          60       60              
  Lines       13460    13347     -113     
==========================================
- Hits        12816    11983     -833     
- Misses        644     1364     +720     
Impacted Files Coverage Δ
databricks/koalas/base.py 93.39% <100.00%> (-3.51%) ⬇️
databricks/koalas/frame.py 93.42% <100.00%> (-3.11%) ⬇️
databricks/koalas/plot/plotly.py 15.78% <0.00%> (-81.06%) ⬇️
...bricks/koalas/tests/plot/test_frame_plot_plotly.py 23.33% <0.00%> (-76.67%) ⬇️
...ricks/koalas/tests/plot/test_series_plot_plotly.py 26.92% <0.00%> (-71.26%) ⬇️
databricks/koalas/usage_logging/__init__.py 28.20% <0.00%> (-64.36%) ⬇️
databricks/koalas/usage_logging/usage_logger.py 47.82% <0.00%> (-52.18%) ⬇️
databricks/koalas/typedef/typehints.py 66.84% <0.00%> (-27.72%) ⬇️
databricks/koalas/__init__.py 77.63% <0.00%> (-14.48%) ⬇️
databricks/conftest.py 87.27% <0.00%> (-12.73%) ⬇️
... and 29 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2fe8796...716e996. Read the comment docs.

@@ -1848,10 +1848,16 @@ def test_isin(self):
kdf = ks.from_pandas(pdf)

self.assert_eq(kdf.isin([4, "six"]), pdf.isin([4, "six"]))
# Seems like pandas has a bug when passing `np.array` as parameter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pandas should be koalas :)

Copy link
Contributor Author

@itholic itholic Mar 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh... thanks, good catch!

This is because Koalas internally uses PySpark's isin directly,

but seems like PySpark's isin doesn't distinguish the type between 4 and "4"...

>>> sdf
DataFrame[__index_level_0__: double, a: bigint, b: bigint, c: string, __natural_order__: bigint]

>>> sdf.select(sdf.a).show()
+---+
|  a|
+---+
|  4|
|  2|
|  3|
|  4|
|  8|
|  6|
+---+

>>> sdf.select(sdf.a.isin(4)).show()
+----------+
|(a IN (4))|
+----------+
|      true|
|     false|
|     false|
|      true|
|     false|
|     false|
+----------+

>>> sdf.select(sdf.a.isin('4')).show()
+----------+
|(a IN (4))|
+----------+
|      true|
|     false|
|     false|
|      true|
|     false|
|     false|
+----------+

I think the last line should return false for all values as below, since the type of column a is bigint.

# expected result for `sdf.select(sdf.a.isin('4')).show()`, but actually not.
+-----------+
|(a IN (4))|
+-----------+
|      false|
|      false|
|      false|
|      false|
|      false|
|      false|
+-----------+

@ueshin , @HyukjinKwon , Is this expected behavior of PySpark, or we should fix this ??

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It must be caused by type-coercion rule. We can leave it as-is. cc @HyukjinKwon

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @ueshin :)

@xinrong-meng xinrong-meng self-requested a review March 16, 2021 21:13
@itholic
Copy link
Contributor Author

itholic commented Mar 24, 2021

I'm pretty sure for this fix. Please feel free to leave comment if any!

@itholic itholic merged commit 95022f3 into databricks:master Mar 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Series.isin() throws exception when fed numpy array
5 participants