Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Series.pop #866

Merged
merged 33 commits into from
Nov 22, 2019
Merged

Implement Series.pop #866

merged 33 commits into from
Nov 22, 2019

Conversation

itholic
Copy link
Contributor

@itholic itholic commented Oct 3, 2019

Resolves #789

>>> s = ks.Series(data=np.arange(3), index=['A', 'B', 'C'])
>>> s
A    0
B    1
C    2
Name: 0, dtype: int64

>>> s.pop('A')
A    0
Name: 0, dtype: int64

>>> s
B    1
C    2
Name: 0, dtype: int64

Also support for MultiIndex

>>> midx = pd.MultiIndex([['lama', 'cow', 'falcon'],
...                       ['speed', 'weight', 'length']],
...                      [[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                       [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> s = ks.Series([45, 200, 1.2, 30, 250, 1.5, 320, 1, 0.3],
...               index=midx)
>>> s
lama    speed      45.0
        weight    200.0
        length      1.2
cow     speed      30.0
        weight    250.0
        length      1.5
falcon  speed     320.0
        weight      1.0
        length      0.3
Name: 0, dtype: float64

>>> s.pop('lama')
lama  speed      45.0
      weight    200.0
      length      1.2
Name: 0, dtype: float64

>>> s
cow     speed      30.0
        weight    250.0
        length      1.5
falcon  speed     320.0
        weight      1.0
        length      0.3
Name: 0, dtype: float64

And beyond pandas, we can support for MultiIndex with specifying level.
(But i'm not sure this is necessary.)

>>> midx = pd.MultiIndex([['lama', 'cow', 'falcon'],
...                       ['speed', 'weight', 'length']],
...                      [[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                       [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> s = ks.Series([45, 200, 1.2, 30, 250, 1.5, 320, 1, 0.3],
...               index=midx)
>>> s
lama    speed      45.0
        weight    200.0
        length      1.2
cow     speed      30.0
        weight    250.0
        length      1.5
falcon  speed     320.0
        weight      1.0
        length      0.3
Name: 0, dtype: float64

>>> s.pop('speed', level=1)
lama    speed     45.0
cow     speed     30.0
falcon  speed    320.0
Name: 0, dtype: float64

>>> s
lama    weight    200.0
        length      1.2
cow     weight    250.0
        length      1.5
falcon  weight      1.0
        length      0.3
Name: 0, dtype: float64

In pandas, above example raises TypeError like below:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: pop() got an unexpected keyword argument 'level'

@codecov-io
Copy link

codecov-io commented Oct 3, 2019

Codecov Report

Merging #866 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #866      +/-   ##
==========================================
+ Coverage   95.19%   95.21%   +0.01%     
==========================================
  Files          34       34              
  Lines        6810     6835      +25     
==========================================
+ Hits         6483     6508      +25     
  Misses        327      327
Impacted Files Coverage Δ
databricks/koalas/missing/series.py 100% <ø> (ø) ⬆️
databricks/koalas/series.py 96.53% <100%> (+0.11%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7070bc6...513bd8a. Read the comment docs.

@ueshin
Copy link
Collaborator

ueshin commented Oct 3, 2019

The returned value seems different from pandas'.

>>> s
lama    speed      45.0
        weight    200.0
        length      1.2
cow     speed      30.0
        weight    250.0
        length      1.5
falcon  speed     320.0
        weight      1.0
        length      0.3
dtype: float64
>>> s.pop('lama')
speed      45.0
weight    200.0
length      1.2
dtype: float64
>>> s
cow     speed      30.0
        weight    250.0
        length      1.5
falcon  speed     320.0
        weight      1.0
        length      0.3
dtype: float64

@ueshin
Copy link
Collaborator

ueshin commented Oct 3, 2019

Btw, shall we remove level for now to be consistent with DataFrame's API?

@itholic
Copy link
Contributor Author

itholic commented Oct 4, 2019

@ueshin Thanks for reviewing!! fixed them. :)

@itholic
Copy link
Contributor Author

itholic commented Oct 4, 2019

@ueshin Ah I got it. Then i think maybe our most important purpose is mimic pandas as same as possible, not more, not less, right ? 😃

@ueshin
Copy link
Collaborator

ueshin commented Oct 4, 2019

@itholic Yup, basically that's right. We rarely expand if there is a special reason, though. haha. it depends. 😸

@itholic
Copy link
Contributor Author

itholic commented Oct 4, 2019

@ueshin Got it. Thanks for replying :)

@ueshin
Copy link
Collaborator

ueshin commented Oct 4, 2019

Shall we support this?

>>> s
a  lama    speed      45.0
           weight    200.0
           length      1.2
   cow     speed      30.0
           weight    250.0
           length      1.5
b  falcon  speed     320.0
           weight      1.0
           length      0.3
dtype: float64
>>> s.pop(('a', 'lama'))
speed      45.0
weight    200.0
length      1.2
dtype: float64
>>> s
a  cow     speed      30.0
           weight    250.0
           length      1.5
b  falcon  speed     320.0
           weight      1.0
           length      0.3
dtype: float64

@itholic
Copy link
Contributor Author

itholic commented Oct 5, 2019

@ueshin sure, i'm going to add. Thanks for comment! :)

@itholic
Copy link
Contributor Author

itholic commented Oct 5, 2019

@ueshin , added it 😄

@ueshin
Copy link
Collaborator

ueshin commented Oct 7, 2019

So far, I found the following issues:

For the given s,

>>> s
lama    speed      45.0
        weight    200.0
        length      1.2
cow     speed      30.0
        weight    250.0
        length      1.5
falcon  speed     320.0
        weight      1.0
        length      0.3
Name: 0, dtype: float64
  1. (s + 1).pop('lama') raises AnalysisException:
>>> (s + 1).pop('lama')
Traceback (most recent call last):

...

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/series.py", line 3286, in pop
    index_map=self._internal.index_map[len(item):])
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/internal.py", line 693, in copy
    column_index=column_index, column_index_names=column_index_names)
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/internal.py", line 421, in __init__
    self._data_columns = sdf.select(scol).columns
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 1320, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Resolved attribute(s) 0#33 missing from __index_level_1__#32,(0 + 1)#51 in operator !Project [(0#33 + cast(1 as double)) AS (0 + 1)#54].;;\n!Project [(0#33 + cast(1 as double)) AS (0 + 1)#54]\n+- Project [__index_level_1__#32, (0 + 1)#51]\n   +- Filter (__index_level_0__#31 = lama)\n      +- Project [__index_level_1__#32, (0#33 + cast(1 as double)) AS (0 + 1)#51, __index_level_0__#31]\n         +- LogicalRDD [__index_level_0__#31, __index_level_1__#32, 0#33], false\n'
  1. s.pop(('lama', 'speed')) raises AssertionError:
>>> s.pop(('lama', 'speed'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/series.py", line 3470, in __repr__
    pser = self.head(max_display_count + 1)._to_internal_pandas()
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/series.py", line 1634, in head
    return _col(self.to_dataframe().head(n))
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/series.py", line 1072, in to_frame
    index_map=renamed._internal.index_map,
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/internal.py", line 574, in index_map
    assert len(self._index_map) > 0
AssertionError
  1. For the missing key, pandas raises KeyError:
>>> s.to_pandas().pop('horse')
Traceback (most recent call last):
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 1319, in get_value
    return libindex.get_value_at(s, k)
  File "pandas/_libs/index.pyx", line 43, in pandas._libs.index.get_value_at
  File "pandas/_libs/index.pyx", line 48, in pandas._libs.index.get_value_at
  File "pandas/_libs/util.pxd", line 113, in pandas._libs.util.get_value_at
  File "pandas/_libs/util.pxd", line 98, in pandas._libs.util.validate_indexer
TypeError: 'str' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/generic.py", line 809, in pop
    result = self[item]
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/series.py", line 868, in __getitem__
    result = self.index.get_value(self, key)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 1327, in get_value
    raise e1
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 1311, in get_value
    return self._engine.get_value(s, k)
  File "pandas/_libs/index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 667, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc
KeyError: 'horse'

whereas Koalas returns empty Series:

>>> s.pop('horse')
Series([], Name: 0, dtype: float64)

As for 3), the current behavior is reasonable, otherwise we need to count the row number. cc @HyukjinKwon

@itholic
Copy link
Contributor Author

itholic commented Oct 7, 2019

@ueshin Thanks for reviewing!! i'm going to take a look at them.

@itholic
Copy link
Contributor Author

itholic commented Oct 7, 2019

@ueshin Fixed (1), (2) and not yet for (3).
(but fixed (3) & ready for commit in my local code)

databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
databricks/koalas/series.py Outdated Show resolved Hide resolved
@itholic
Copy link
Contributor Author

itholic commented Oct 14, 2019

hmm.. i don't understand why travis failed only in python3.6

@HyukjinKwon
Copy link
Member

@itholic can you rebase this please?

@itholic
Copy link
Contributor Author

itholic commented Nov 6, 2019

@HyukjinKwon Sure, i just rebased them :)

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good otherwise

@softagram-bot
Copy link

Softagram Impact Report for pull/866 (head commit: 513bd8a)

⭐ Change Overview

Showing the changed files, dependency changes and the impact - click for full size
(Open in Softagram Desktop for full details)

📄 Full report

Impact Report explained. Give feedback on this report to [email protected]

@HyukjinKwon HyukjinKwon merged commit dad31db into databricks:master Nov 22, 2019
@itholic itholic deleted the s_pop branch December 2, 2019 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Series.pop
5 participants