Add percentiles to describe #378
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2541,7 +2541,7 @@ def astype(self, dtype) -> 'DataFrame': | |
return DataFrame(sdf, self._metadata.copy()) | ||
|
||
# TODO: percentiles, include, and exclude should be implemented. | ||
def describe(self) -> 'DataFrame': | ||
def describe(self, percentiles=[0.25, 0.5, 0.75]) -> 'DataFrame': | ||
""" | ||
Generate descriptive statistics that summarize the central tendency, | ||
dispersion and shape of a dataset's distribution, excluding | ||
|
@@ -2552,6 +2552,12 @@ def describe(self) -> 'DataFrame': | |
will vary depending on what is provided. Refer to the notes | ||
below for more detail. | ||
|
||
Parameters | ||
---------- | ||
percentiles : list of ``float`` in range (0.0, 1.0), default [0.25, 0.5, 0.75] | ||
A list of percentiles to be computed. | ||
Use an empty list if no percentiles should be computed. | ||
|
||
Returns | ||
------- | ||
Series or DataFrame | ||
|
@@ -2568,7 +2574,7 @@ def describe(self) -> 'DataFrame': | |
Notes | ||
----- | ||
For numeric data, the result's index will include ``count``, | ||
``mean``, ``stddev``, ``min``, ``max``. | ||
``mean``, ``stddev``, ``min``, ``25%``, ``50%``, ``75%``, ``max``. | ||
|
||
Currently only numeric data is supported. | ||
|
||
|
@@ -2582,6 +2588,9 @@ def describe(self) -> 'DataFrame': | |
mean 2.0 | ||
stddev 1.0 | ||
min 1.0 | ||
25% 1.0 | ||
50% 2.0 | ||
75% 3.0 | ||
max 3.0 | ||
Name: 0, dtype: float64 | ||
|
||
|
@@ -2598,6 +2607,25 @@ def describe(self) -> 'DataFrame': | |
mean 2.0 5.0 | ||
stddev 1.0 1.0 | ||
min 1.0 4.0 | ||
25% 1.0 4.0 | ||
50% 2.0 5.0 | ||
75% 3.0 6.0 | ||
max 3.0 6.0 | ||
|
||
Describing a ``DataFrame`` and selecting custom percentiles. | ||
|
||
>>> df = ks.DataFrame({'numeric1': [1, 2, 3], | ||
... 'numeric2': [4.0, 5.0, 6.0] | ||
... }, | ||
... columns=['numeric1', 'numeric2', 'object']) | ||
Review comment: It should be [...]
Reply: done
||
>>> df.describe(percentiles = [0.15, 0.85]) | ||
numeric1 numeric2 | ||
count 3.0 3.0 | ||
mean 2.0 5.0 | ||
stddev 1.0 1.0 | ||
min 1.0 4.0 | ||
15% 1.0 4.0 | ||
85% 3.0 6.0 | ||
max 3.0 6.0 | ||
|
||
Describing a column from a ``DataFrame`` by accessing it as | ||
|
@@ -2608,6 +2636,9 @@ def describe(self) -> 'DataFrame': | |
mean 2.0 | ||
stddev 1.0 | ||
min 1.0 | ||
25% 1.0 | ||
50% 2.0 | ||
75% 3.0 | ||
max 3.0 | ||
Name: numeric1, dtype: float64 | ||
""" | ||
|
@@ -2626,7 +2657,11 @@ def describe(self) -> 'DataFrame': | |
if len(exprs) == 0: | ||
raise ValueError("Cannot describe a DataFrame without columns") | ||
|
||
sdf = self._sdf.select(*exprs).describe() | ||
formatted_perc = ["{:.0%}".format(p) for p in percentiles] | ||
Review comment: I'd check that the percentiles are in the range [0, 1]; otherwise, I'd raise an exception.
Reply: I am checking that in the new version.
||
stats = ["count", "mean", "stddev", "min", *formatted_perc, "max"] | ||
|
||
sdf = self._sdf.select(*exprs).summary(stats) | ||
|
||
return DataFrame(sdf, index=Metadata(data_columns=data_columns, | ||
Review comment: You can use `return DataFrame(sdf.replace("stddev","std"), index=Metadata(data_columns=data_columns, index_map=[('summary', None)])).astype('float64')`
Review comment: Let's add [...]
Review comment: @patryk-oleniuk Seems like you forgot to add [...]
Reply: Done
||
index_map=[('summary', None)])).astype('float64') | ||
|
||
|
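The range check requested in the review and the label formatting used in the diff can be sketched together as follows. This is a minimal illustration under stated assumptions, not the code merged in the PR: `format_percentiles` and `build_stats` are hypothetical helper names, and the inclusive [0, 1] bounds follow the reviewer's wording rather than the docstring's "(0.0, 1.0)".

```python
from typing import List, Sequence


def format_percentiles(percentiles: Sequence[float]) -> List[str]:
    """Hypothetical helper: validate percentiles and turn them into labels."""
    for p in percentiles:
        # Reject anything outside the closed interval [0, 1] (assumed bounds).
        if not 0.0 <= p <= 1.0:
            raise ValueError(
                "percentiles should all be in the interval [0, 1], got {}".format(p)
            )
    # Same format spec as in the diff: 0.25 becomes '25%'.
    return ["{:.0%}".format(p) for p in percentiles]


def build_stats(percentiles: Sequence[float]) -> List[str]:
    """Assemble the statistic names in the order used by the diff."""
    return ["count", "mean", "stddev", "min", *format_percentiles(percentiles), "max"]


print(build_stats([0.15, 0.85]))
```

Note that the `.0%` format rounds to whole percents, so two nearby inputs such as 0.251 and 0.254 would both be labelled `25%`; a real implementation may want to detect or avoid such collisions.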
Review comment: Please use `None` for the default value and set it to `[0.25, 0.5, 0.75]` if it's `None` later. A problem mentioned here #21 (comment) might happen.
Review comment: Yea, using a mutable value as a default is actually quite dangerous (see https://docs.quantifiedcode.com/python-anti-patterns/correctness/mutable_default_value_as_argument.html)
Review comment: I think if the default value is immutable, we're safe tho.
Reply: Done
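The hazard discussed in this thread, and the `None` default the reviewers ask for, can be illustrated with a small sketch. Both function names are hypothetical and exist only for this example; neither is part of the PR.

```python
from typing import List, Optional


def describe_mutable(percentiles=[0.25, 0.5, 0.75]):
    """Anti-pattern: the default list is created once, when the function is defined."""
    # Any in-place change to the default persists into later calls.
    percentiles.append(1.0)
    return percentiles


def resolve_percentiles(percentiles: Optional[List[float]] = None) -> List[float]:
    """Suggested pattern: use None as a sentinel and build a fresh list per call."""
    # Compare against None explicitly so that an empty list, which the
    # docstring uses to mean "no percentiles", is passed through unchanged.
    if percentiles is None:
        percentiles = [0.25, 0.5, 0.75]
    return percentiles
```

Calling `describe_mutable()` twice returns a list that keeps growing, because both calls share one default object, whereas `resolve_percentiles()` returns a new list each time. An immutable default such as a tuple would also avoid the shared-state problem, as the third comment points out.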