[waiting for author] Add stubs for the statistics module #546
@@ -0,0 +1,44 @@
# Stubs for statistics
# See: http://docs.python.org/3/library/statistics.html

import sys
from typing import Iterable, Iterator, Sequence, TypeVar, Union, overload
from decimal import Decimal

# Note: according to the docs, statistics only explicitly supports
# int, float, Decimal, and Fraction. Other types within the numeric
# tower are not supported. It also states mixing together different
# types results in undefined behavior, so the type signatures below
# deliberately enforce that numeric types may not be mixed.
#
# TODO: Once either https://github.com/python/typeshed/pull/545 or
# https://github.com/python/typeshed/pull/94 is accepted, add support
# for fractions.
_TNum = TypeVar('_TNum', int, float, Decimal)
_T = TypeVar('_T')

def mean(data: Union[Iterator[_TNum], Sequence[_TNum]]) -> _TNum: ...
if sys.version_info >= (3, 6):
    def geometric_mean(data: Union[Iterator[_TNum], Sequence[_TNum]]) -> _TNum: ...
    def harmonic_mean(data: Union[Iterator[_TNum], Sequence[_TNum]]) -> _TNum: ...

# Note: in CPython, the output of median_grouped may sometimes be coerced to
# float as an implementation detail. In the interests of not breaking code
# that relies on this implementation detail, the return type is set to the
# Union of float and the contained numeric type.
def median(data: Iterable[_TNum]) -> _TNum: ...

I don't think that the return type of median is right. median() may return the same type as the data (_TNum), but if it does an averaging step, it may return a float. For instance:
I'm not sure how to write that as a type hint. Would that be ...?
median_low and median_high look correct, as they always return one of the data points and don't interpolate or average between them.
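
The averaging step described in the comment above can be observed at runtime. The following is a minimal sketch that uses only the standard library's statistics.median, median_low and median_high; the sample lists are made up for illustration and are not taken from the review.

```python
import statistics
from decimal import Decimal

odd = [1, 3, 5]        # odd length: the middle data point is returned as-is
even = [1, 2, 3, 4]    # even length: the two middle points are averaged

middle = statistics.median(odd)          # an int taken from the data
averaged = statistics.median(even)       # (2 + 3) / 2, which is a float
low = statistics.median_low(even)        # always one of the data points
high = statistics.median_high(even)      # always one of the data points

# With Decimal input the averaging step stays within Decimal.
decimal_avg = statistics.median([Decimal("1"), Decimal("2")])

for name, value in [("median(odd)", middle), ("median(even)", averaged),
                    ("median_low(even)", low), ("median_high(even)", high),
                    ("median(decimals)", decimal_avg)]:
    print(name, type(value).__name__, value)
```

This suggests that an int input is the case where the result type can differ from the element type, which is what makes a single _TNum return annotation hard to get right.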
def median_low(data: Iterable[_TNum]) -> _TNum: ...
def median_high(data: Iterable[_TNum]) -> _TNum: ...
# TODO: interval should also accept a Fraction.
def median_grouped(data: Iterable[_TNum],
                   interval: Union[int, float] = 1
                   ) -> Union[_TNum, float]: ...

def mode(data: Iterable[_T]) -> _T: ...

def pstdev(data: Union[Iterator[_TNum], Sequence[_TNum]]) -> Union[float, Decimal]: ...
def stdev(data: Union[Iterator[_TNum], Sequence[_TNum]]) -> Union[float, Decimal]: ...
def pvariance(data: Union[Iterator[_TNum], Sequence[_TNum]]) -> _TNum: ...
def variance(data: Union[Iterator[_TNum], Sequence[_TNum]]) -> _TNum: ...

The signatures for the four functions pvariance, variance, pstdev and stdev are incomplete. They take an optional argument representing the mean of the data. I think something like this should be right:
The difference in parameter name is intentional: mu is the mean of the entire population, and xbar is the mean of the sample.
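
The reviewer's suggested signatures did not survive in this copy of the page. As a hedged sketch of what the comment describes, the stub-style definitions below add the optional mu and xbar parameters. The parameter names come from the comment; the Optional[_TNum] annotation, the use of Iterable for data, and the return types (copied from the diff) are assumptions, not the reviewer's actual proposal. The final loop compares the sketch's parameter names with the runtime functions.

```python
import inspect
import statistics
from decimal import Decimal
from typing import Iterable, Optional, TypeVar, Union

_TNum = TypeVar('_TNum', int, float, Decimal)  # same TypeVar as in the diff

# Hypothetical stub signatures: the annotation chosen for mu/xbar is an assumption.
def pvariance(data: Iterable[_TNum], mu: Optional[_TNum] = None) -> _TNum: ...
def variance(data: Iterable[_TNum], xbar: Optional[_TNum] = None) -> _TNum: ...
def pstdev(data: Iterable[_TNum], mu: Optional[_TNum] = None) -> Union[float, Decimal]: ...
def stdev(data: Iterable[_TNum], xbar: Optional[_TNum] = None) -> Union[float, Decimal]: ...

# Check that each sketch uses the same parameter names as the real function.
names_match = {}
for sketch in (pvariance, variance, pstdev, stdev):
    real = getattr(statistics, sketch.__name__)
    names_match[sketch.__name__] = (
        list(inspect.signature(sketch).parameters)
        == list(inspect.signature(real).parameters)
    )
print(names_match)
```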

class StatisticsError(ValueError): pass

I worry that this makes mypy think that the return type for mean([1, 2, 4]) is an int. It actually returns a float. Similarly, it allows mixing ints with Decimal (but not float with Decimal). I think the only way to express this reasonably may be a complex set of overloads...

Oh, hmm -- when I did mean([1, 2, 3]), I got back an int, but after some poking, that turned out to be only because 1 + 2 + 3 is actually evenly divisible by 3.
Perhaps I could get rid of int within the TypeVar and rely on mypy to understand that ints should be auto-promoted to float?
Regarding mixing ints with Decimals/mixing floats with Decimals: the documentation for statistics stated that mixing together different types is "undefined and implementation-dependent", and even recommends running map(float, input_data) first if you have mixed numeric types.
So, if mixing together things like Decimal and int is undefined, then I think it's ok and actually for the best if typeshed forbids that altogether?
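
The map(float, input_data) normalization mentioned in the comment above can be sketched as follows. The mixed list is a made-up example; only statistics.mean and the built-in map and float are used.

```python
import statistics
from decimal import Decimal

# Made-up input that mixes int, float and Decimal.
mixed = [1, 2.5, Decimal("4")]

# Convert everything to float first, as the statistics docs recommend
# for mixed numeric types, so the result type is well defined.
normalized = list(map(float, mixed))
result = statistics.mean(normalized)

print(normalized, type(result).__name__, result)
```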

I think the worry is mostly about float vs. Decimal, where the user has to decide. int can be mixed with each of the others. But probably we can start by making the type variable allow float and Decimal only.
On Wed, Sep 14, 2016 at 8:36 AM, Michael Lee wrote:
--Guido van Rossum (python.org/~guido)

The intention is that the statistics functions should accept any of the following:
I'm on the fence about mixing float + Fraction -- that likely works due to the automatic coercion of Fraction to float, but I'm not confident enough to rely on that at the moment. Hence that's currently officially unsupported, even if it happens to work.
If you think the docs need improving to make that more clear, then let me know.
As far as mean() is concerned, mean(ints) may return either an int or a float depending on whether the sum divides evenly by the number of items. E.g., mean([1, 2]) is 1.5 but mean([1, 3]) is 2 (an int). Does that complicate the type-checking too much?
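
The int-or-float behaviour described above can be checked directly. This short sketch reuses the two inputs from the comment and only calls statistics.mean.

```python
import statistics

uneven = statistics.mean([1, 2])   # 3 / 2 does not divide evenly
even = statistics.mean([1, 3])     # 4 / 2 divides evenly

print(type(uneven).__name__, uneven)
print(type(even).__name__, even)
```

The result type therefore depends on the values in the data, not only on their type, which a static signature cannot express exactly.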