Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[waiting for author] Add stubs for the statistics module #546

Closed
wants to merge 2 commits into from

Conversation

Michael0x2a
Copy link
Contributor

This pull request adds stubs for the statistics module, including the geometric_mean and harmonic_mean functions that will be added to the upcoming Python 3.6 release.

In order to reduce the number of interconnected pull requests I'm making, I've deliberately omitted using the fractions module in any way so that this pull request can be evaluated and merged independently of my other pending pull requests.

Once either #544 or #94 is merged, I'll submit a new pull request to add back support for fractions.Fraction to the statistics module. (The former requires #545 to first be accepted, and the latter appear to be indefinitely stuck until issue python/mypy#1264 is resolved).

Since typeshed currently does not have working stubs for the fractions
module, this commit removes all references to Fraction within the
statistics module so that this pull request can be evaluated and
accepted independently of the other pull requests I've made recently to
improve various numbers related modules.
_TNum = TypeVar('_TNum', int, float, Decimal)
_T = TypeVar('_T')

def mean(data: Union[Iterator[_TNum], Sequence[_TNum]]) -> _TNum: ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that this makes mypy think that the return type for mean([1, 2, 4]) is an int. It actually returns a float. Similarly, it allows mixing ints with Decimal (but not float with Decimal). I think the only way to express this reasonably may be a complex set of overloads...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, hmm -- when I did mean([1, 2, 3]), I got back an int, but after some poking, that turned out to be only because 1 + 2 + 3 is actually evenly divisible by 3.

Perhaps I could get rid of int within the TypeVar and rely on mypy to understand that ints should be auto-promoted to float?

Regarding mixing ints with Decimals/mixing floats with Decimals: the documentation for statistics stated that mixing together different types is "undefined and implementation-dependent", and even recommends running map(float, input_data) first if you have mixed numeric types.

So, if mixing together things like Decimal and int is undefined, then I think it's ok and actually for the best if typeshed forbids that altogether?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the worry is mostly about float vs. Decimal, where the user has to
decide. int can be mixed with each of the others. But probably we can start
by making the type variable allow float and Decimal only.

On Wed, Sep 14, 2016 at 8:36 AM, Michael Lee [email protected]
wrote:

In stdlib/3.4/statistics.pyi
#546 (comment):

+from typing import Iterable, Iterator, Sequence, TypeVar, Union, overload
+from decimal import Decimal
+
+# Note: according to the docs, statistics only explicitly supports
+# int, float, Decimal, and Fraction. Other types within the numeric
+# tower are not supported. It also states mixing together different
+# types results in undefined behavior, so the type signatures below
+# deliberately enforce that numeric types may not be mixed.
+#
+# TODO: Once either #545 or
+# #94 is accepted, add support
+# for fractions.
+_TNum = TypeVar('_TNum', int, float, Decimal)
+_T = TypeVar('_T')
+
+def mean(data: Union[Iterator[_TNum], Sequence[_TNum]]) -> _TNum: ...

Oh, hmm -- when I did mean([1, 2, 3]), I got back an int, but after some
poking, that turned out to be only because 1 + 2 + 3 is actually evenly
divisible by 3.

Perhaps I could get rid of int within the TypeVar and rely on mypy to
understand that ints should be auto-promoted to float?

Regarding mixing ints with Decimals/mixing floats with Decimals: the documentation
for statistics https://docs.python.org/3/library/statistics.html stated
that mixing together different types is "undefined and
implementation-dependent", and even recommends running map(float,
input_data) first if you have mixed numeric types.

So, if mixing together things like Decimal and int is undefined, then I
think it's ok and actually for the best if typeshed forbids that altogether?


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/python/typeshed/pull/546/files/98161406c69d8f31f4b848dffb889c04a5e49098#r78774231,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ACwrMiPTW4ogrthvoNIOaKJ3XE9k9XwIks5qqBSTgaJpZM4J8bk1
.

--Guido van Rossum (python.org/~guido)

Copy link
Member

@stevendaprano stevendaprano Oct 12, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention is that the statistics functions should accept any of the following:

  • just ints, just floats, just Decimal, or just Fraction
  • mixed int + float, mixed int + Decimal, or mixed int + Fraction
  • but not mixed float + Decimal

I'm on the fence about mixing float + Fraction -- that likely works due to the automatic coercion of Fraction to float, but I'm not confident enough to rely on that at the moment. Hence that's currently officially unsupported, even if it happens to work.

If you think the docs need improving to make that more clear, then let me know.

As far as mean() is concerned, mean(ints) may return either an int or a float depending on whether the sum divides evenly by the number of items. E.g mean([1, 2]) is 1.5 but mean([1, 3]) is 2 (an int). Does that complicate the type-checking too much?

@gvanrossum
Copy link
Member

I don't know this module very well, I've asked its author to help with the review.

@gvanrossum gvanrossum changed the title Add stubs for the statistics module [waiting for reviewer] Add stubs for the statistics module Sep 23, 2016
def pstdev(data: Union[Iterator[_TNum], Sequence[_TNum]]) -> Union[float, Decimal]: ...
def stdev(data: Union[Iterator[_TNum], Sequence[_TNum]]) -> Union[float, Decimal]: ...
def pvariance(data: Union[Iterator[_TNum], Sequence[_TNum]]) -> _TNum: ...
def variance(data: Union[Iterator[_TNum], Sequence[_TNum]]) -> _TNum: ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The signatures for the four functions pvariance, variance, pstdev and stdev are incomplete. They take an optional argument representing the mean of the data. I think something like this should be right:

def pstdev(data: Union[Iterator[_TNum], Sequence[_TNum]],
           mu: Optional[_TNum]) -> Union[float, Decimal]: ...
def stdev(data: Union[Iterator[_TNum], Sequence[_TNum]],
          xbar: Optional[_TNum]) -> Union[float, Decimal]: ...
def pvariance(data: Union[Iterator[_TNum], Sequence[_TNum]]
          mu: Optional[_TNum]) -> _TNum: ...
def variance(data: Union[Iterator[_TNum], Sequence[_TNum]]
          xbar: Optional[_TNum]) -> _TNum: ...

The difference in parameter name is intentional: mu is the mean of the entire population, and xbar is the mean of the sample.

# float as an implementation detail. In the interests of not breaking code
# that relies on this implementation detail, the return type is set to the
# Union of float and the contained numeric type.
def median(data: Iterable[_TNum]) -> _TNum: ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that the return type of median is right. median() may return the same type as the data _TNum, but if it does an averaging step, it may return a float. For instance:

>>> statistics.median([2, 4])
3.0
>>> statistics.median([2, 4, 6])
4

I'm not sure how to write that as a type hint. Would that be Union[_TNum, float]?

median_low and median_high look correct, as they always return one of the data points and don't interpolate or average between them.

@gvanrossum gvanrossum changed the title [waiting for reviewer] Add stubs for the statistics module [waiting for author] Add stubs for the statistics module Oct 13, 2016
@gvanrossum
Copy link
Member

@Michael0x2a Do you still care about this?

@stevendaprano
Copy link
Member

stevendaprano commented Oct 18, 2016 via email

@gvanrossum
Copy link
Member

Now that this PR is about 8 weeks without a response I'm going to close it, sorry. @Michael0x2a if you want to work on this again please just open a new PR! Or if @stevendaprano wants to work on it feel free to copy form Michael's fork and we'll take it from there.

@gvanrossum gvanrossum closed this Dec 13, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants