Add percentiles to describe #378

patryk-oleniuk · 2019-05-23T18:39:08Z

This PR adds an optional percentiles parameter to the existing DataFrame.describe method.

garawalid

Thanks @patryk-oleniuk. I left a comment.

garawalid · 2019-05-23T19:20:58Z

databricks/koalas/frame.py

@@ -2626,7 +2657,11 @@ def describe(self) -> 'DataFrame':
        if len(exprs) == 0:
            raise ValueError("Cannot describe a DataFrame without columns")

-        sdf = self._sdf.select(*exprs).describe()
+        formatted_perc =  ["{:.0%}".format(p) for p in percentiles]


I'd check that the percentiles are in the range [0, 1] and raise an exception otherwise.

I am checking that in the new version
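
A minimal sketch of the range check discussed above. The helper name `validate_percentiles` and the error message are hypothetical, and the inclusive bounds are an assumption taken from the comment (the check that appears later in this thread excludes both endpoints).

```python
from typing import Sequence


def validate_percentiles(percentiles: Sequence[float]) -> None:
    """Raise ValueError if any percentile falls outside [0, 1].

    Hypothetical helper; the inclusive bounds are an assumption.
    """
    for p in percentiles:
        if p < 0.0 or p > 1.0:
            raise ValueError("Percentiles should all be in the interval [0, 1]")


validate_percentiles([0.25, 0.5, 0.75])  # passes silently

try:
    validate_percentiles([0.5, 1.5])
except ValueError as exc:
    print(exc)
```

Validating before any Spark job is launched means a bad argument fails fast instead of surfacing as an error from the backend.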

patryk-oleniuk · 2019-05-23T20:00:44Z

The unit tests are failing. I'm setting up my environment to reproduce these errors locally.

garawalid

Fix for doctest

garawalid · 2019-05-23T20:12:35Z

databricks/koalas/frame.py

+        >>> df = ks.DataFrame({'numeric1': [1, 2, 3],
+        ...                    'numeric2': [4.0, 5.0, 6.0]
+        ...                   },
+        ...                   columns=['numeric1', 'numeric2', 'object'])


It should be columns=['numeric1', 'numeric2']

codecov-io · 2019-05-23T21:38:47Z

Codecov Report

Merging #378 into master will increase coverage by 0.12%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #378      +/-   ##
==========================================
+ Coverage   94.61%   94.73%   +0.12%     
==========================================
  Files          41       42       +1     
  Lines        4383     4561     +178     
==========================================
+ Hits         4147     4321     +174     
- Misses        236      240       +4

Impacted Files	Coverage Δ
databricks/koalas/series.py	`93.59% <100%> (+0.62%)`	⬆️
databricks/koalas/frame.py	`95.43% <100%> (+0.08%)`	⬆️
databricks/koalas/missing/frame.py	`100% <0%> (ø)`	⬆️
databricks/koalas/tests/test_series.py	`100% <0%> (ø)`	⬆️
databricks/koalas/tests/test_dataframe.py	`100% <0%> (ø)`	⬆️
databricks/koalas/missing/series.py	`100% <0%> (ø)`	⬆️
databricks/koalas/mlflow.py	`95.12% <0%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ac175a3...93166db.
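
The coverage figures in the report can be reproduced from the Hits and Lines rows. The sketch below assumes the percentages are truncated to two decimals rather than rounded; that is an inference from these particular numbers, not documented Codecov behavior, and `coverage_bp` is a hypothetical helper name.

```python
def coverage_bp(hits: int, lines: int) -> int:
    """Coverage in hundredths of a percent, truncated (assumed, not rounded)."""
    return hits * 10000 // lines


before = coverage_bp(4147, 4383)  # master
after = coverage_bp(4321, 4561)   # with #378 merged
print(before / 100, after / 100, (after - before) / 100)  # 94.61 94.73 0.12
```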

patryk-oleniuk · 2019-05-23T22:00:09Z

@garawalid let me know what you think about this version.
I'm not sure why codecov is failing. Should I include more unit tests?
This is my first time using that tool. Could you help me understand it?

garawalid · 2019-05-23T22:44:05Z

@patryk-oleniuk that's good. You can ignore codecov for now.

You forgot to update describe() in series.py.

I suggest that you change the value of stddev to std after computing stats. Then you can add test_describe() in test_series.py.

test_describe() should compare the result of pandas.Series.describe() with that of the Koalas Series.describe(). The same goes for test_dataframe.py.

Feel free to ask me any questions.

ueshin · 2019-05-23T23:44:30Z

databricks/koalas/frame.py

@@ -2541,7 +2541,7 @@ def astype(self, dtype) -> 'DataFrame':
        return DataFrame(sdf, self._metadata.copy())

    # TODO: percentiles, include, and exclude should be implemented.
-    def describe(self) -> 'DataFrame':
+    def describe(self, percentiles=[0.25, 0.5, 0.75]) -> 'DataFrame':


Please use None as the default value and set it to [0.25, 0.5, 0.75] later if it's None. Otherwise the problem mentioned in #21 (comment) might happen.

Yea, using a mutable value as a default is actually quite dangerous (see https://docs.quantifiedcode.com/python-anti-patterns/correctness/mutable_default_value_as_argument.html)

I think if the default value is immutable, we're safe tho.
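
A small illustration of the pitfall being discussed, as a sketch. The function names `append_bad` and `append_good` are hypothetical and unrelated to the Koalas code; they only show why a list default is shared across calls and how the `None` idiom avoids it.

```python
from typing import List, Optional


def append_bad(item: float, bucket: List[float] = []) -> List[float]:
    # The default list is created once, at definition time, and shared.
    bucket.append(item)
    return bucket


def append_good(item: float, bucket: Optional[List[float]] = None) -> List[float]:
    # A fresh list is created on every call that omits `bucket`.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_bad(0.25)
second = append_bad(0.5)
print(first is second)    # True: both names refer to the same shared default
print(second)             # [0.25, 0.5]

print(append_good(0.25))  # [0.25]
print(append_good(0.5))   # [0.5]
```

An immutable default such as the tuple `(0.25, 0.5, 0.75)` cannot be mutated in place, so it does not leak state between calls either.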

ueshin · 2019-05-23T23:45:45Z

databricks/koalas/frame.py

+        if any((p <= 0.0) | (p >= 1.0) for p in percentiles):
+            raise ValueError("Percentiles not in range (0.0, 1.0)")
+
+        formatted_perc = ["{:.0%}".format(p) for p in percentiles]


It seems that pandas adds 50% if the percentiles don't include that value, i.e., the example df.describe(percentiles=[0.15, 0.85]) above includes 50% in the result.

>>> df = pd.DataFrame({'numeric1': [1, 2, 3], 'numeric2': [4.0, 5.0, 6.0]})
>>> df.describe(percentiles=[0.15, 0.85])
       numeric1  numeric2
count       3.0       3.0
mean        2.0       5.0
std         1.0       1.0
min         1.0       4.0
15%         1.3       4.3
50%         2.0       5.0
85%         2.7       5.7
max         3.0       6.0

Could you follow the behavior?

Done in the new version + I'm sorting the percentiles (following the pandas behavior)
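
The behavior described here (default to the quartiles, always include the median, sort, then format as labels) could be sketched as follows. `normalize_percentiles` is a hypothetical helper written for illustration, not the code merged in this PR.

```python
from typing import List, Optional


def normalize_percentiles(percentiles: Optional[List[float]] = None) -> List[str]:
    """Hypothetical helper: apply the default, add the median, sort, format."""
    if percentiles is None:
        percentiles = [0.25, 0.5, 0.75]
    values = list(percentiles)  # copy so the caller's list is not mutated
    if 0.5 not in values:
        values.append(0.5)
    values.sort()
    return ["{:.0%}".format(p) for p in values]


print(normalize_percentiles([0.85, 0.15]))  # ['15%', '50%', '85%']
```

Note that the `{:.0%}` format rounds to whole percentages, so two inputs such as 0.251 and 0.254 would produce the same label.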

HyukjinKwon · 2019-05-24T02:37:05Z

databricks/koalas/frame.py

@@ -2626,7 +2657,14 @@ def describe(self) -> 'DataFrame':
        if len(exprs) == 0:
            raise ValueError("Cannot describe a DataFrame without columns")

-        sdf = self._sdf.select(*exprs).describe()
+        if any((p <= 0.0) | (p >= 1.0) for p in percentiles):


Can we use logical or instead of |?
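
A short sketch of the difference between the two operators on plain Python booleans. The `noisy` helper is hypothetical and only records which operands were evaluated.

```python
def noisy(value: bool, log: list) -> bool:
    # Record that this operand was evaluated, then return it unchanged.
    log.append(value)
    return value


calls_or = []
result_or = noisy(True, calls_or) or noisy(False, calls_or)
print(result_or, calls_or)      # True [True]: right operand is skipped

calls_pipe = []
result_pipe = noisy(True, calls_pipe) | noisy(False, calls_pipe)
print(result_pipe, calls_pipe)  # True [True, False]: both operands evaluated

p = 1.5
print(p <= 0.0 or p >= 1.0)     # True
```

`|` also binds more tightly than the comparison operators, which is why the original expression needs parentheses around each comparison; with `or` they are unnecessary.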

patryk-oleniuk · 2019-05-24T20:18:19Z

I suggest that you change the value of stddev to std after computing stats.

@garawalid I'm having problems renaming stddev to std: the Metadata class does not have name_map or any similar parameter in its constructor. Any idea how to do that cleanly?

garawalid · 2019-05-25T11:22:02Z

@patryk-oleniuk you can ignore it for now.

garawalid · 2019-05-26T12:14:22Z

databricks/koalas/frame.py

+        stats = ["count", "mean", "stddev", "min", *formatted_perc, "max"]
+
+        sdf = self._sdf.select(*exprs).summary(stats)
+
        return DataFrame(sdf, index=Metadata(data_columns=data_columns,


You can use replace to change stddev to std.

return DataFrame(sdf.replace("stddev","std"), index=Metadata(data_columns=data_columns, index_map=[('summary', None)])).astype('float64')

Let's add the subset='summary' argument so that only that column is affected, in case we do want to replace the name.

@patryk-oleniuk Seems like you forgot to add subset='summary' to the replace?

ueshin

I left some comments.
Otherwise LGTM.

ueshin · 2019-05-27T04:40:07Z

databricks/koalas/frame.py

@@ -2541,7 +2541,7 @@ def astype(self, dtype) -> 'DataFrame':
        return DataFrame(sdf, self._metadata.copy())

    # TODO: percentiles, include, and exclude should be implemented.
-    def describe(self) -> 'DataFrame':
+    def describe(self, percentiles=None) -> 'DataFrame':


Could you add a type hint? percentiles: Optional[List[float]] = None

ueshin · 2019-05-27T04:40:18Z

databricks/koalas/series.py

@@ -1117,8 +1117,8 @@ def apply(self, func, args=(), **kwds):
        wrapped = ks.pandas_wraps(return_col=return_sig)(apply_each)
        return wrapped(self, *args, **kwds)

-    def describe(self) -> 'Series':
-        return _col(self.to_dataframe().describe())
+    def describe(self, percentiles=None) -> 'Series':


ueshin · 2019-05-27T04:43:11Z

databricks/koalas/frame.py

@@ -2554,9 +2554,8 @@ def describe(self, percentiles=None) -> 'DataFrame':

        Parameters
        ----------
-        percentiles : list of ``float`` in range (0.0, 1.0), default [0.25, 0.5, 0.75]
+        percentiles : list of ``float`` in range [0.0, 1.0], default None


I think we can say the default is [0.25, 0.5, 0.75] here.

softagram-bot · 2019-05-28T17:57:52Z

Softagram Impact Report for pull/378 (head commit: `93166db`)

ueshin · 2019-05-29T01:55:59Z

databricks/koalas/frame.py

+
+        return DataFrame(sdf.replace("stddev", "std", subset='summary'),
+                         index=Metadata(data_columns=data_columns,
+                         index_map=[('summary', None)])).astype('float64')


I'm wondering why lint-python doesn't raise an error here. Shouldn't index_map have the same indent as data_columns?

ueshin · 2019-05-29T04:24:20Z

I'd merge this now. I'll fix the indent later.
@patryk-oleniuk Thanks for working on this!

patryk-oleniuk added 3 commits May 23, 2019 10:50

Used SparkDF.summary to calculate percentiles in DataFrame.describe

addb532

Updated Dataframe.describe description and examples

46a3595

Corrected Dataframe.describe docs

7bc3a12

patryk-oleniuk changed the title ~~Add percentiles to describe #376~~ Add percentiles to describe May 23, 2019

patryk-oleniuk mentioned this pull request May 23, 2019

Add percentiles to describe #376

Closed

garawalid reviewed May 23, 2019

Corrected docstrings

da3058b

garawalid reviewed May 23, 2019

patryk-oleniuk added 2 commits May 23, 2019 14:20

Added percentile value check

851e46e

Corrected whitespaces

8caa2ef

ueshin reviewed May 23, 2019

HyukjinKwon reviewed May 24, 2019

patryk-oleniuk added 3 commits May 24, 2019 10:23

Using logical or instead of |

38f1f8c

Add 50% in perc list and sort

657e600

Added percentiles for Series.describe

5d48cbe

Corrected docstrings

5630bc3

garawalid reviewed May 26, 2019

ueshin reviewed May 27, 2019

patryk-oleniuk-epfl and others added 6 commits May 27, 2019 21:06

Replaced stddev with std, added type hints

87bf3dc

Corrected indentations

ea9905e

Fixed List error

881343d

Corrected doctests

09e5767

Corrected doctests

44769d9

Corrected spacing error in doctest

53b7001

patryk-oleniuk added 3 commits May 28, 2019 10:35

Limiting replace to summary column

9cd640f

Corrected line length

1c75af7

Corrected trailing whitespace

93166db

ueshin reviewed May 29, 2019

ueshin merged commit 1c5f1b8 into databricks:master May 29, 2019
