Implemented GroupBy.median() #1957

itholic · 2020-12-09T05:56:50Z

This PR proposes GroupBy.median().

Note: the result can differ slightly from pandas. Koalas returns an approximate median based on approximate percentile computation, because computing an exact median across a large dataset is extremely expensive.

>>> kdf = ks.DataFrame({'a': [1., 1., 1., 1., 2., 2., 2., 3., 3., 3.],
...                     'b': [2., 3., 1., 4., 6., 9., 8., 10., 7., 5.],
...                     'c': [3., 5., 2., 5., 1., 2., 6., 4., 3., 6.]},
...                    columns=['a', 'b', 'c'],
...                    index=[7, 2, 4, 1, 3, 4, 9, 10, 5, 6])
>>> kdf
      a     b    c
7   1.0   2.0  3.0
2   1.0   3.0  5.0
4   1.0   1.0  2.0
1   1.0   4.0  5.0
3   2.0   6.0  1.0
4   2.0   9.0  2.0
9   2.0   8.0  6.0
10  3.0  10.0  4.0
5   3.0   7.0  3.0
6   3.0   5.0  6.0

>>> kdf.groupby('a').median().sort_index()  # doctest: +NORMALIZE_WHITESPACE
       b    c
a
1.0  2.0  3.0
2.0  8.0  2.0
3.0  7.0  4.0

>>> kdf.groupby('a')['b'].median().sort_index()
a
1.0    2.0
2.0    8.0
3.0    7.0
Name: b, dtype: float64
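
To make the note above concrete, here is a minimal pure-Python sketch of why group 1.0 of column 'b' shows 2.0 in the output above, whereas an exact median over the same four values would be 2.5. It assumes that a percentile-based median picks an actual element of the data instead of interpolating; `statistics.median_low` is used only as a stand-in for that behaviour and is not what Koalas calls internally.

```python
from collections import defaultdict
from statistics import median, median_low

# Same data as the example above: grouping key 'a' and value column 'b'.
a = [1., 1., 1., 1., 2., 2., 2., 3., 3., 3.]
b = [2., 3., 1., 4., 6., 9., 8., 10., 7., 5.]

# Collect the 'b' values belonging to each key of 'a'.
groups = defaultdict(list)
for key, value in zip(a, b):
    groups[key].append(value)

# Exact median: interpolates between the two middle values of an even-sized group.
exact = {key: median(values) for key, values in sorted(groups.items())}

# Stand-in for a percentile-based median (assumption): returns an element of the data.
element_based = {key: median_low(values) for key, values in sorted(groups.items())}

print(exact)          # {1.0: 2.5, 2.0: 8.0, 3.0: 7.0}
print(element_based)  # {1.0: 2.0, 2.0: 8.0, 3.0: 7.0}
```

Only the even-sized group differs; the odd-sized groups have a single middle element, so both approaches agree there.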

ref #1929

itholic · 2020-12-09T05:59:09Z

FYI: we're doing the same thing in median for DataFrame and Series.

koalas/databricks/koalas/generic.py

Lines 1898 to 1904 in 9e8d99b

    def median(self, axis=None, numeric_only=True, accuracy=10000) -> Union[Scalar, "Series"]:
        """
        Return the median of the values for the requested axis.

        .. note:: Unlike pandas', the median in Koalas is an approximated median based upon
            approximate percentile computation because computing median across a large dataset
            is extremely expensive.

codecov-io · 2020-12-09T06:25:06Z

Codecov Report

Merging #1957 (e121472) into master (a68717d) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1957   +/-   ##
=======================================
  Coverage   94.59%   94.60%           
=======================================
  Files          49       49           
  Lines       10882    10890    +8     
=======================================
+ Hits        10294    10302    +8     
  Misses        588      588

Impacted Files                          Coverage Δ
databricks/koalas/missing/groupby.py    100.00% <ø> (ø)
databricks/koalas/groupby.py            91.58% <100.00%> (+0.07%) ⬆️
databricks/koalas/series.py             96.90% <0.00%> (+0.01%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

databricks/koalas/tests/test_groupby.py

xinrong-meng · 2020-12-10T18:09:01Z

Otherwise, LGTM!

ueshin

Could you fix the conflicts as well?
Otherwise, LGTM.

databricks/koalas/groupby.py

ueshin · 2020-12-11T00:37:05Z

databricks/koalas/groupby.py

@@ -2343,6 +2344,73 @@ def get_group(self, name) -> Union[DataFrame, Series]:
        return DataFrame(internal)

+    def median(self, numeric_only=True, accuracy=10000) -> Union[DataFrame, Series]:

Are numeric_only and accuracy unknown types?

ueshin · 2020-12-11

We should add more type hints for function arguments in a separate PR.

itholic · 2020-12-11

Sure, I'll add them in a separate PR.

Thanks for the review!

itholic · 2020-12-11

Oh, anyway, numeric_only only supports True, since it exists only for pandas compatibility.

Should we still add the type hint as bool, though?

ueshin · 2020-12-11

Yep, it must be bool, and accuracy must be int?
There are similar places in this file and in other files; we can do them in a batch.

itholic · 2020-12-11

Sounds good. I will take a look and add type hints throughout the whole file next week!

ueshin · 2020-12-11T01:20:44Z

Thanks! I'd merge this now.
Let's add more type hints for function arguments later.
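
For reference, the annotations agreed on in the thread above (numeric_only as bool, accuracy as int) could look roughly like the sketch below. `GroupBySketch` is a hypothetical stand-in class used only to show the signature; the return annotation and the body are placeholders, not the Koalas implementation.

```python
from typing import get_type_hints


class GroupBySketch:
    """Hypothetical stand-in for the GroupBy class; not the Koalas implementation."""

    def median(self, numeric_only: bool = True, accuracy: int = 10000) -> None:
        # Body omitted on purpose: only the annotated signature is illustrated.
        raise NotImplementedError


# Inspect the annotations to confirm they are what the reviewers asked for.
hints = get_type_hints(GroupBySketch.median)
print(hints["numeric_only"], hints["accuracy"])  # <class 'bool'> <class 'int'>
```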

Implemented GroupBy.median()

95caa56

xinrong-meng reviewed Dec 9, 2020

ueshin approved these changes Dec 10, 2020

Resolved comments

e121472

ueshin reviewed Dec 11, 2020

ueshin merged commit 78b1004 into databricks:master Dec 11, 2020
