Add percentiles to describe #378
Thanks @patryk-oleniuk. I left a comment.
databricks/koalas/frame.py
Outdated
@@ -2626,7 +2657,11 @@ def describe(self) -> 'DataFrame':
if len(exprs) == 0:
    raise ValueError("Cannot describe a DataFrame without columns")

sdf = self._sdf.select(*exprs).describe()
formatted_perc = ["{:.0%}".format(p) for p in percentiles]
I'd check that the percentiles are in the range [0, 1]; otherwise, I'd raise an exception.
I am checking that in the new version
The unit tests are failing; I'm setting up my environment to reproduce these errors locally.
Fix for doctest
databricks/koalas/frame.py
Outdated
>>> df = ks.DataFrame({'numeric1': [1, 2, 3],
...                    'numeric2': [4.0, 5.0, 6.0]
...                   },
...                   columns=['numeric1', 'numeric2', 'object'])
It should be columns=['numeric1', 'numeric2']
done
Codecov Report
@@ Coverage Diff @@
## master #378 +/- ##
==========================================
+ Coverage 94.61% 94.73% +0.12%
==========================================
Files 41 42 +1
Lines 4383 4561 +178
==========================================
+ Hits 4147 4321 +174
- Misses 236 240 +4
Continue to review full report at Codecov.
@garawalid let me know what you think about this version.
@patryk-oleniuk that's good. You can ignore You forget to update I suggest that you change the value of
Feel free to ask me any questions.
databricks/koalas/frame.py
Outdated
@@ -2541,7 +2541,7 @@ def astype(self, dtype) -> 'DataFrame':
    return DataFrame(sdf, self._metadata.copy())

# TODO: percentiles, include, and exclude should be implemented.
def describe(self) -> 'DataFrame':
def describe(self, percentiles=[0.25, 0.5, 0.75]) -> 'DataFrame':
Please use None for the default value and set it to [0.25, 0.5, 0.75] later if it's None. Otherwise, the problem mentioned in #21 (comment) might happen.
Yea, using a mutable value as a default is actually quite dangerous (see https://docs.quantifiedcode.com/python-anti-patterns/correctness/mutable_default_value_as_argument.html)
I think if the default value is immutable, we're safe tho.
Done
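For readers following this thread, here is a minimal sketch of why a mutable default is risky and how the None sentinel avoids it. The function names describe_shared and describe_safe are hypothetical and are not part of Koalas; the appended 0.5 only stands in for any in-place change to the argument.

```python
from typing import List, Optional


def describe_shared(percentiles: List[float] = []) -> List[float]:
    # Anti-pattern: the default list is created once, at definition time,
    # so every call without an argument mutates the same object.
    percentiles.append(0.5)
    return percentiles


def describe_safe(percentiles: Optional[List[float]] = None) -> List[float]:
    # None sentinel: a fresh default list is built inside each call.
    if percentiles is None:
        percentiles = [0.25, 0.5, 0.75]
    return sorted(percentiles)
```

Calling describe_shared() repeatedly keeps growing the shared list, while describe_safe() builds a new list every time. A default that is never mutated, or an immutable one such as a tuple, would also be safe, which is the point made in the comment above.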
databricks/koalas/frame.py
Outdated
if any((p <= 0.0) | (p >= 1.0) for p in percentiles):
    raise ValueError("Percentiles not in range (0.0, 1.0)")

formatted_perc = ["{:.0%}".format(p) for p in percentiles]
Seems like pandas adds 50% if percentiles doesn't include it; e.g., the example df.describe(percentiles=[0.15, 0.85]) above includes 50% in the result.
>>> df = pd.DataFrame({'numeric1': [1, 2, 3], 'numeric2': [4.0, 5.0, 6.0]})
>>> df.describe(percentiles=[0.15, 0.85])
       numeric1  numeric2
count       3.0       3.0
mean        2.0       5.0
std         1.0       1.0
min         1.0       4.0
15%         1.3       4.3
50%         2.0       5.0
85%         2.7       5.7
max         3.0       6.0
Could you follow the behavior?
Done in the new version + I'm sorting the percentiles (following the pandas behavior)
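The steps discussed so far (default to the quartiles, validate the range, add the median, sort, and format as percent labels) can be sketched in plain Python as below. This is an illustration only, not the code merged in the PR: normalize_percentiles is a hypothetical helper, and the inclusive range [0, 1] is an assumption taken from the first review comment and the later docstring, whereas the snippet under review at this point used the exclusive range.

```python
from typing import List, Optional


def normalize_percentiles(percentiles: Optional[List[float]] = None) -> List[str]:
    """Hypothetical helper: validate, add the median, sort, and format."""
    if percentiles is None:
        percentiles = [0.25, 0.5, 0.75]
    # Assumption: bounds are inclusive, i.e. 0.0 and 1.0 are accepted.
    if any(p < 0.0 or p > 1.0 for p in percentiles):
        raise ValueError("Percentiles should all be in the interval [0, 1]")
    # Copy so the caller's list is never modified.
    values = list(percentiles)
    # Like pandas, always include the median.
    if 0.5 not in values:
        values.append(0.5)
    # Sort, then format each fraction as a whole-number percent label.
    return ["{:.0%}".format(p) for p in sorted(values)]
```

With this sketch, normalize_percentiles([0.85, 0.15]) yields ['15%', '50%', '85%'], matching the row labels in the pandas output quoted above.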
databricks/koalas/frame.py
Outdated
@@ -2626,7 +2657,14 @@ def describe(self) -> 'DataFrame':
if len(exprs) == 0:
    raise ValueError("Cannot describe a DataFrame without columns")

sdf = self._sdf.select(*exprs).describe()
if any((p <= 0.0) | (p >= 1.0) for p in percentiles):
Can we use logical or instead of |?
Done
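As a small illustration of the request above: on plain Python floats, | only works here because each comparison is parenthesized, since | binds tighter than comparison operators. The logical or needs no parentheses and short-circuits. The function names below are hypothetical.

```python
def out_of_range_bitwise(p: float) -> bool:
    # Works only because both comparisons are parenthesized into bools;
    # both sides are always evaluated.
    return (p <= 0.0) | (p >= 1.0)


def out_of_range_logical(p: float) -> bool:
    # Idiomatic for scalars: no parentheses needed, short-circuits.
    return p <= 0.0 or p >= 1.0


def unparenthesized_bitwise_fails(p: float) -> bool:
    # Without parentheses this parses as p <= (0.0 | p) >= 1.0,
    # and float | float raises TypeError.
    try:
        _ = p <= 0.0 | p >= 1.0
    except TypeError:
        return True
    return False
```

Both range checks agree on every float input; the difference is readability and the precedence trap shown in the third function.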
@garawalid I'm having problems renaming
@patryk-oleniuk you can ignore it for instance.
databricks/koalas/frame.py
Outdated
stats = ["count", "mean", "stddev", "min", *formatted_perc, "max"]

sdf = self._sdf.select(*exprs).summary(stats)

return DataFrame(sdf, index=Metadata(data_columns=data_columns,
You can use replace to change stddev to std.
return DataFrame(sdf.replace("stddev","std"), index=Metadata(data_columns=data_columns,
index_map=[('summary', None)])).astype('float64')
Let's add the subset='summary' argument so the replacement only touches that column, just in case, if we want to replace the name.
@patryk-oleniuk Seems like you forgot to add subset='summary' to the replace?
Done
I left some comments.
Otherwise LGTM.
databricks/koalas/frame.py
Outdated
@@ -2541,7 +2541,7 @@ def astype(self, dtype) -> 'DataFrame':
    return DataFrame(sdf, self._metadata.copy())

# TODO: percentiles, include, and exclude should be implemented.
def describe(self) -> 'DataFrame':
def describe(self, percentiles=None) -> 'DataFrame':
Could you add a type hint? percentiles: Optional[List[float]] = None
Done
databricks/koalas/series.py
Outdated
@@ -1117,8 +1117,8 @@ def apply(self, func, args=(), **kwds):
    wrapped = ks.pandas_wraps(return_col=return_sig)(apply_each)
    return wrapped(self, *args, **kwds)

def describe(self) -> 'Series':
    return _col(self.to_dataframe().describe())
def describe(self, percentiles=None) -> 'Series':
ditto.
Done
databricks/koalas/frame.py
Outdated
@@ -2554,9 +2554,8 @@ def describe(self, percentiles=None) -> 'DataFrame':

Parameters
----------
percentiles : list of ``float`` in range (0.0, 1.0), default [0.25, 0.5, 0.75]
percentiles : list of ``float`` in range [0.0, 1.0], default None
I think we can say the default is [0.25, 0.5, 0.75] here.
Done
return DataFrame(sdf.replace("stddev", "std", subset='summary'),
index=Metadata(data_columns=data_columns,
index_map=[('summary', None)])).astype('float64')
I'm wondering why lint-python doesn't raise an error, but index_map should be the same indent as data_columns?
I'd merge this now. I'll fix the indent later.
This PR adds the optional percentiles parameter to the existing DataFrame.describe method.