Add percentiles to describe #378

Merged
merged 19 commits into from
May 29, 2019
Changes from 3 commits
41 changes: 38 additions & 3 deletions databricks/koalas/frame.py
@@ -2541,7 +2541,7 @@ def astype(self, dtype) -> 'DataFrame':
return DataFrame(sdf, self._metadata.copy())

# TODO: percentiles, include, and exclude should be implemented.
def describe(self) -> 'DataFrame':
def describe(self, percentiles=[0.25, 0.5, 0.75]) -> 'DataFrame':
Collaborator

Please use None as the default value and set it to [0.25, 0.5, 0.75] in the function body if it is None. Otherwise the problem mentioned in #21 (comment) might happen.

Member

Yeah, using a mutable value as a default is actually quite dangerous (see https://docs.quantifiedcode.com/python-anti-patterns/correctness/mutable_default_value_as_argument.html).
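
To make the hazard concrete, here is a minimal sketch. The names `describe_shared` and `describe_safe` are hypothetical stand-ins, not Koalas code. They only illustrate that a list default is built once at definition time and shared across calls, and that the None sentinel requested above avoids the sharing.

```python
def describe_shared(percentiles=[0.25, 0.5, 0.75]):
    # The default list is created once, when the function is defined.
    # Mutating it here leaks into every later call that relies on the default.
    percentiles.append(1.0)
    return percentiles


def describe_safe(percentiles=None):
    # A fresh list is created on each call when the caller passes nothing.
    if percentiles is None:
        percentiles = [0.25, 0.5, 0.75]
    percentiles.append(1.0)
    return percentiles


first = describe_shared()
second = describe_shared()
# first and second are the same list object, which has now grown twice.

safe_first = describe_safe()
safe_second = describe_safe()
# safe_first and safe_second are independent lists, each grown once.
```

The current describe body may never mutate its argument, but the sentinel keeps the signature safe if that changes later.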

Member

I think we're safe if the default value is immutable, though.
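
As a sketch of that alternative, assuming a tuple were used as the default: the function name `percentiles_with_max` below is hypothetical and only shows that a tuple default cannot drift between calls, because any modification has to happen on a local copy.

```python
def percentiles_with_max(percentiles=(0.25, 0.5, 0.75)):
    # A tuple cannot be mutated in place, so the default stays as written.
    # Copy it into a list when the body needs to modify the values.
    local = list(percentiles)
    local.append(1.0)
    return local


a = percentiles_with_max()
b = percentiles_with_max()
# Both calls start from the same untouched default.
```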

Contributor Author

Done

"""
Generate descriptive statistics that summarize the central tendency,
dispersion and shape of a dataset's distribution, excluding
@@ -2552,6 +2552,12 @@ def describe(self) -> 'DataFrame':
will vary depending on what is provided. Refer to the notes
below for more detail.

Parameters
----------
percentiles : list of ``float`` in range (0.0, 1.0), default [0.25, 0.5, 0.75]
A list of percentiles to be computed.
Use an empty list if no percentiles should be computed.

Returns
-------
Series or DataFrame
@@ -2568,7 +2574,7 @@ def describe(self) -> 'DataFrame':
Notes
-----
For numeric data, the result's index will include ``count``,
``mean``, ``stddev``, ``min``, ``max``.
``mean``, ``stddev``, ``min``, ``25%``, ``50%``, ``75%``, ``max``.

Currently only numeric data is supported.

@@ -2582,6 +2588,9 @@ def describe(self) -> 'DataFrame':
mean 2.0
stddev 1.0
min 1.0
25% 1.0
50% 2.0
75% 3.0
max 3.0
Name: 0, dtype: float64

@@ -2598,6 +2607,25 @@ def describe(self) -> 'DataFrame':
mean 2.0 5.0
stddev 1.0 1.0
min 1.0 4.0
25% 1.0 4.0
50% 2.0 5.0
75% 3.0 6.0
max 3.0 6.0

Describing a ``DataFrame`` and selecting custom percentiles.

>>> df = ks.DataFrame({'numeric1': [1, 2, 3],
... 'numeric2': [4.0, 5.0, 6.0]
... },
... columns=['numeric1', 'numeric2', 'object'])
Contributor @garawalid, May 23, 2019

It should be columns=['numeric1', 'numeric2']

Contributor Author

done

>>> df.describe(percentiles = [0.15, 0.85])
numeric1 numeric2
count 3.0 3.0
mean 2.0 5.0
stddev 1.0 1.0
min 1.0 4.0
15% 1.0 4.0
85% 3.0 6.0
max 3.0 6.0

Describing a column from a ``DataFrame`` by accessing it as
@@ -2608,6 +2636,9 @@ def describe(self) -> 'DataFrame':
mean 2.0
stddev 1.0
min 1.0
25% 1.0
50% 2.0
75% 3.0
max 3.0
Name: numeric1, dtype: float64
"""
@@ -2626,7 +2657,11 @@ def describe(self) -> 'DataFrame':
if len(exprs) == 0:
raise ValueError("Cannot describe a DataFrame without columns")

sdf = self._sdf.select(*exprs).describe()
formatted_perc = ["{:.0%}".format(p) for p in percentiles]
Contributor

I'd check that the percentiles are in the range [0, 1] and raise an exception otherwise.

Contributor Author

I am checking that in the new version
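
One way such a check could look, as a rough sketch only: the helper name `format_percentiles` and the error message are assumptions for illustration, not what the later commits in this PR actually contain. It validates each value against [0, 1] and then applies the same "{:.0%}" formatting used in the diff.

```python
def format_percentiles(percentiles):
    """Validate percentiles and render them as labels such as '25%'."""
    for p in percentiles:
        # The chained comparison is also False for NaN, so NaN is rejected too.
        if not 0.0 <= p <= 1.0:
            raise ValueError(
                "percentiles should all be in the interval [0, 1], got {}".format(p))
    # '{:.0%}' multiplies by 100 and rounds to zero decimal places.
    return ["{:.0%}".format(p) for p in percentiles]


labels = format_percentiles([0.15, 0.85])
stats = ["count", "mean", "stddev", "min", *labels, "max"]
```

Note that the zero-decimal format is lossy: a value such as 0.999 is labelled "100%", so finer-grained percentiles would need a more precise format string.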

stats = ["count", "mean", "stddev", "min", *formatted_perc, "max"]

sdf = self._sdf.select(*exprs).summary(stats)

return DataFrame(sdf, index=Metadata(data_columns=data_columns,
Contributor

You can use replace to change stddev to std.

return DataFrame(sdf.replace("stddev","std"), index=Metadata(data_columns=data_columns,
                                             index_map=[('summary', None)])).astype('float64')

Collaborator

If we do replace the name, let's add the subset='summary' argument so the replacement only touches that column, just in case.

Collaborator

@patryk-oleniuk Seems like you forgot to add subset='summary' to the replace?

Contributor Author

Done
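
To illustrate why restricting the replacement to one column matters, here is a plain-Python model. The function `replace_value` and the sample rows are hypothetical and do not call PySpark; they only mimic the idea that, without a column restriction, a value replacement applies to every column holding a matching value.

```python
def replace_value(rows, to_replace, value, subset=None):
    # Hypothetical pure-Python model of a column-restricted replace.
    # When subset is None, every column is in scope.
    out = []
    for row in rows:
        new_row = {}
        for col, cell in row.items():
            in_scope = subset is None or col in subset
            new_row[col] = value if in_scope and cell == to_replace else cell
        out.append(new_row)
    return out


rows = [
    {"summary": "mean", "label": "stddev"},
    {"summary": "stddev", "label": "other"},
]
unrestricted = replace_value(rows, "stddev", "std")
restricted = replace_value(rows, "stddev", "std", subset=["summary"])
# unrestricted also rewrites the 'label' cell; restricted leaves it alone.
```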

index_map=[('summary', None)])).astype('float64')