Implement Series.where #922

Merged
merged 8 commits into from
Oct 28, 2019

Contributor

@itholic itholic commented Oct 13, 2019

This PR implements where for Series, following pandas Series.where (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.where.html):

>>> s1 = ks.Series([0, 1, 2, 3, 4])
>>> s2 = ks.Series([100, 200, 300, 400, 500])
>>> s1.where(s1 > 0)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
Name: 0, dtype: float64


>>> s1.where(s1 > 1, 10)
0    10
1    10
2     2
3     3
4     4
Name: 0, dtype: int64

>>> s1.where(s1 > 1, s1 + 50)
0    50
1    51
2     2
3     3
4     4
Name: 0, dtype: int64


>>> s1.where(s1 > 1, s2)
0    100
1    200
2      2
3      3
4      4
Name: 0, dtype: int64
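
The examples above all follow one rule: keep the value where the condition is True, otherwise take it from other (NaN by default). A minimal sketch of that rule in plain pandas and NumPy, where where_model is a hypothetical helper written for illustration and not the Koalas implementation:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([0, 1, 2, 3, 4])
s2 = pd.Series([100, 200, 300, 400, 500])


def where_model(ser, cond, other=np.nan):
    """Hypothetical reference model: keep ser where cond is True, else use other."""
    return pd.Series(np.where(cond, ser, other), index=ser.index)


# Compare the model with pandas for a series-valued and a scalar `other`.
print(where_model(s1, s1 > 1, s2).tolist())
print(s1.where(s1 > 1, s2).tolist())
print(where_model(s1, s1 > 1, 10).tolist())
```

With the default other, the replaced positions become NaN, which is why the first example in this description returns float64 instead of int64.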

@codecov-io

codecov-io commented Oct 13, 2019

Codecov Report

Merging #922 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #922      +/-   ##
==========================================
+ Coverage   94.52%   94.53%   +<.01%     
==========================================
  Files          34       34              
  Lines        6465     6476      +11     
==========================================
+ Hits         6111     6122      +11     
  Misses        354      354
Impacted Files                        Coverage Δ
databricks/koalas/missing/series.py   100% <ø> (ø) ⬆️
databricks/koalas/series.py           96.15% <100%> (+0.05%) ⬆️
databricks/koalas/internal.py         96.38% <0%> (ø) ⬆️
databricks/koalas/namespace.py        86.83% <0%> (ø) ⬆️
databricks/koalas/frame.py            96.02% <0%> (ø) ⬆️
databricks/koalas/indexes.py          96.44% <0%> (+0.02%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Collaborator

@ueshin ueshin left a comment

Shall we add more tests in test_series to check various patterns? e.g.,

>>> s1 = pd.Series([0, 1, 2, 3, 4])
>>> s2 = pd.Series([100, 200, 300, 400, 500])

>>> s1.where(s2 > 100)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

and negative cases?
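
A few such patterns, sketched in plain pandas as the reference behavior a Koalas test could compare against. This is only an illustration; the variable names are made up here, and treating a wrong-length list condition as the "negative case" is an assumption about what was meant:

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3, 4])
s2 = pd.Series([100, 200, 300, 400, 500])

# Condition built from a different series than the one being filtered.
from_other = s1.where(s2 > 100)

# Condition that is never satisfied: every value is replaced by `other`.
none_match = s1.where(s1 > 100, -1)

# Condition that is always satisfied: the values come back unchanged.
all_match = s1.where(s1 >= 0, -1)

# Negative case: pandas rejects a list-like condition of the wrong length.
try:
    s1.where([True, False])
    raised = False
except ValueError:
    raised = True

print(from_other.tolist(), none_match.tolist(), all_match.tolist(), raised)
```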

@@ -3409,6 +3409,80 @@ def replace(self, to_replace=None, value=None, regex=False) -> 'Series':

        return self._with_new_scol(current)

    def where(self, cond, other=np.nan):
Member

@itholic seems like pandas shares the same implementation internally. After this PR is merged, can you move this into _Frame class and implement DataFrame.where as well?
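
For reference, a short pandas sketch of what DataFrame.where does: it applies the same keep-or-replace rule to every column. This only illustrates the pandas behavior a follow-up would need to match, not how Koalas would implement it:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2], "b": [3, 4, 5]})

# Keep entries greater than 1, replace the rest with -1.
masked = df.where(df > 1, -1)

# The same result, obtained column by column with Series.where.
per_column = pd.DataFrame({c: df[c].where(df[c] > 1, -1) for c in df.columns})

print(masked)
```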

Contributor Author

Okay, I'll work on it right after this PR is merged.

@HyukjinKwon
Member

Seems fine to me otherwise.

@softagram-bot

Softagram Impact Report for pull/922 (head commit: b620849)


@HyukjinKwon HyukjinKwon merged commit 709b928 into databricks:master Oct 28, 2019
@itholic itholic deleted the s_where branch November 6, 2019 05:32
# | 4| 4| true| 500|
# +-----------------+---+----------------+-----------------+
data_col_name = self._internal.column_name_for(self._internal.column_index[0])
index_column = self._internal.index_columns[0]
Member

@itholic, I think this doesn't support multi-level index cases. Can you fix this please?

Member

There can be multiple index_columns, so we cannot use only the first one.
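
To illustrate the concern in plain pandas (not the Koalas internals): when two series are matched on only the first index level, rows sharing that level are cross-matched, while matching on every level keeps one row per index entry. The level names k1 and k2 and the column names data and other are made up for this sketch:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2)], names=["k1", "k2"]
)
s1 = pd.Series([0, 1, 2, 3], index=idx)
s2 = pd.Series([100, 200, 300, 400], index=idx)

left = s1.rename("data").reset_index()    # columns: k1, k2, data
right = s2.rename("other").reset_index()  # columns: k1, k2, other

# Matching on the first level only duplicates rows ("a" pairs with both "a" rows).
first_only = left.merge(right, on="k1")

# Matching on all levels keeps exactly one row per index entry.
all_levels = left.merge(right, on=["k1", "k2"])

# Reference result a multi-level test case could compare against.
result = s1.where(s1 > 1, s2)

print(len(first_only), len(all_levels), result.tolist())
```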

        set_option("compute.ops_on_diff_frames", True)

    @classmethod
    def tearDownClass(cls):
Member

@itholic, please disable this. compute.ops_on_diff_frames is disabled by default because it is expensive. We should move these test cases into OpsOnDiffFramesEnabledTest.
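
The pattern being asked for, sketched so that it runs without Spark: the option is enabled for one test class only and restored afterwards. The option store and the set_option, reset_option and get_option functions below are hypothetical stand-ins defined in the snippet itself, not the Koalas functions, and the test body is a placeholder:

```python
import unittest

# Hypothetical stand-in for an option registry; not the Koalas implementation.
_DEFAULTS = {"compute.ops_on_diff_frames": False}
_options = dict(_DEFAULTS)


def set_option(key, value):
    _options[key] = value


def reset_option(key):
    _options[key] = _DEFAULTS[key]


def get_option(key):
    return _options[key]


class OpsOnDiffFramesEnabledTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        # Enable the expensive option for this class only.
        set_option("compute.ops_on_diff_frames", True)

    @classmethod
    def tearDownClass(cls):
        # Restore the default so other test classes are unaffected.
        reset_option("compute.ops_on_diff_frames")
        super().tearDownClass()

    def test_option_is_enabled(self):
        # Placeholder: tests that combine different frames would go here.
        self.assertTrue(get_option("compute.ops_on_diff_frames"))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(OpsOnDiffFramesEnabledTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the option change inside setUpClass and tearDownClass of a dedicated class means the rest of the suite keeps running with the default setting.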

@@ -742,6 +753,23 @@ def test_duplicates(self):
        self.assert_eq(pser.drop_duplicates().sort_values(),
                       kser.drop_duplicates().sort_values())

    def test_where(self):
        pser1 = pd.Series([0, 1, 2, 3, 4], name=0)
Member

Can you add a test for when compute.ops_on_diff_frames is off? I think we can still use a scalar value, such as an int, for other.
