Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add percentiles to describe #378

Merged
merged 19 commits into from
May 29, 2019

Conversation

patryk-oleniuk
Copy link
Contributor

This PR adds to the existing Dataframe.describe method the percentiles optional parameter.

@patryk-oleniuk patryk-oleniuk changed the title Add percentiles to describe #376 Add percentiles to describe May 23, 2019
Copy link
Contributor

@garawalid garawalid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @patryk-oleniuk. I left a comment.

@@ -2626,7 +2657,11 @@ def describe(self) -> 'DataFrame':
if len(exprs) == 0:
raise ValueError("Cannot describe a DataFrame without columns")

sdf = self._sdf.select(*exprs).describe()
formatted_perc = ["{:.0%}".format(p) for p in percentiles]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd check if percentiles are in the range [0 1] otherwise, I raise an exception

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am checking that in the new version

@patryk-oleniuk
Copy link
Contributor Author

The unit tests are failing, I'm setting up my environment to properly reproduce these errors locally.

Copy link
Contributor

@garawalid garawalid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix for doctest

>>> df = ks.DataFrame({'numeric1': [1, 2, 3],
... 'numeric2': [4.0, 5.0, 6.0]
... },
... columns=['numeric1', 'numeric2', 'object'])
Copy link
Contributor

@garawalid garawalid May 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be columns=['numeric1', 'numeric2']

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@codecov-io
Copy link

codecov-io commented May 23, 2019

Codecov Report

Merging #378 into master will increase coverage by 0.12%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #378      +/-   ##
==========================================
+ Coverage   94.61%   94.73%   +0.12%     
==========================================
  Files          41       42       +1     
  Lines        4383     4561     +178     
==========================================
+ Hits         4147     4321     +174     
- Misses        236      240       +4
Impacted Files Coverage Δ
databricks/koalas/series.py 93.59% <100%> (+0.62%) ⬆️
databricks/koalas/frame.py 95.43% <100%> (+0.08%) ⬆️
databricks/koalas/missing/frame.py 100% <0%> (ø) ⬆️
databricks/koalas/tests/test_series.py 100% <0%> (ø) ⬆️
databricks/koalas/tests/test_dataframe.py 100% <0%> (ø) ⬆️
databricks/koalas/missing/series.py 100% <0%> (ø) ⬆️
databricks/koalas/mlflow.py 95.12% <0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ac175a3...93166db. Read the comment docs.

@patryk-oleniuk
Copy link
Contributor Author

@garawalid let me know what do you think about this version.
Not sure why codecov is failing, should I include more unit tests?
This is my first time using that tool, could you help me understand?

@garawalid
Copy link
Contributor

@patryk-oleniuk that's good. You can ignore codecov for instance.

You forget to update describe() in series.py

I suggest that you change the value of stddev to std after computing stats. Then you can add test_describe() in test_series.py.

test_describe() should compare the result of pandas.Series.describe() and Koalas.DataFrame.describe(). The same goes for test_dataframe.py

Feel free to ask me any questions.

@@ -2541,7 +2541,7 @@ def astype(self, dtype) -> 'DataFrame':
return DataFrame(sdf, self._metadata.copy())

# TODO: percentiles, include, and exclude should be implemented.
def describe(self) -> 'DataFrame':
def describe(self, percentiles=[0.25, 0.5, 0.75]) -> 'DataFrame':
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use None for the default value and set it to [0.25, 0.5, 0.75] if it's None later. A problem mentioned here #21 (comment) might happen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, using a mutable value as a default is actually quite dangerous (see https://docs.quantifiedcode.com/python-anti-patterns/correctness/mutable_default_value_as_argument.html)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if the default value is immutable, we're safe tho.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if any((p <= 0.0) | (p >= 1.0) for p in percentiles):
raise ValueError("Percentiles not in range (0.0, 1.0)")

formatted_perc = ["{:.0%}".format(p) for p in percentiles]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like pandas adds 50% if the percentiles doesn't include the value, i.e., the example df.describe(percentiles = [0.15, 0.85]) above includes 50% in the result.

>>> df = pd.DataFrame({'numeric1': [1, 2, 3], 'numeric2': [4.0, 5.0, 6.0]})
>>> df.describe(percentiles=[0.15, 0.85])
       numeric1  numeric2
count       3.0       3.0
mean        2.0       5.0
std         1.0       1.0
min         1.0       4.0
15%         1.3       4.3
50%         2.0       5.0
85%         2.7       5.7
max         3.0       6.0

Could you follow the behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in the new version + I'm sorting the percentiles (following the pandas behavior)

@@ -2626,7 +2657,14 @@ def describe(self) -> 'DataFrame':
if len(exprs) == 0:
raise ValueError("Cannot describe a DataFrame without columns")

sdf = self._sdf.select(*exprs).describe()
if any((p <= 0.0) | (p >= 1.0) for p in percentiles):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use logical or instead of |?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@patryk-oleniuk
Copy link
Contributor Author

patryk-oleniuk commented May 24, 2019

I suggest that you change the value of stddev to std after computing stats.

@garawalid I'm having problems renaming stddev to std, the Metadata class does not have name_map or any similar parameter in constructor. Any idea on how to do that cleanly?

@garawalid
Copy link
Contributor

@patryk-oleniuk you can ignore it for instance.

stats = ["count", "mean", "stddev", "min", *formatted_perc, "max"]

sdf = self._sdf.select(*exprs).summary(stats)

return DataFrame(sdf, index=Metadata(data_columns=data_columns,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use replace to change stddev to std.

return DataFrame(sdf.replace("stddev","std"), index=Metadata(data_columns=data_columns,
                                             index_map=[('summary', None)])).astype('float64')

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add subset='summary' argument to ensure the column, just in case, if we want to replace the name.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@patryk-oleniuk Seems like you forgot to add subset='summary' to the replace?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Collaborator

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments.
Otherwise LGTM.

@@ -2541,7 +2541,7 @@ def astype(self, dtype) -> 'DataFrame':
return DataFrame(sdf, self._metadata.copy())

# TODO: percentiles, include, and exclude should be implemented.
def describe(self) -> 'DataFrame':
def describe(self, percentiles=None) -> 'DataFrame':
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a type hint? percentiles: Optional[List[float]] = None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -1117,8 +1117,8 @@ def apply(self, func, args=(), **kwds):
wrapped = ks.pandas_wraps(return_col=return_sig)(apply_each)
return wrapped(self, *args, **kwds)

def describe(self) -> 'Series':
return _col(self.to_dataframe().describe())
def describe(self, percentiles=None) -> 'Series':
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -2554,9 +2554,8 @@ def describe(self, percentiles=None) -> 'DataFrame':

Parameters
----------
percentiles : list of ``float`` in range (0.0, 1.0), default [0.25, 0.5, 0.75]
percentiles : list of ``float`` in range [0.0, 1.0], default None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can say the default is [0.25, 0.5, 0.75] here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@softagram-bot
Copy link

Softagram Impact Report for pull/378 (head commit: 93166db)

⭐ Change Overview

Showing the changed files, dependency changes and the impact - click for full size
(Open in Softagram Desktop for full details)

📄 Full report

Give feedback on this report to [email protected]


return DataFrame(sdf.replace("stddev", "std", subset='summary'),
index=Metadata(data_columns=data_columns,
index_map=[('summary', None)])).astype('float64')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering why lint-python doesn't raise an error, but index_map should be the same indent as data_columns?

@ueshin
Copy link
Collaborator

ueshin commented May 29, 2019

I'd merge this now. I'll fix the indent later.
@patryk-oleniuk Thanks for working on this!

@ueshin ueshin merged commit 1c5f1b8 into databricks:master May 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants