-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement logcdf method for discrete distributions #4387
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4387 +/- ##
==========================================
+ Coverage 88.04% 88.09% +0.05%
==========================================
Files 88 88
Lines 14482 14538 +56
==========================================
+ Hits 12750 12807 +57
+ Misses 1732 1731 -1
|
9aab229
to
0583d8e
Compare
…Geometric` to avoid errors when evaluating negative logcdfs.
…domains, such as the DiscreteUniform.
dd54634
to
670a652
Compare
I am sorry for all the changes after opening the PR. I think it is now ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, thanks @ricardoV94 !
This is a pretty big PR, so a second review by another core dev would be beneficial.
I agree that the logcdf
methods that do not work with multiple values should raise a ValueError
to explain the limitation to the user. Related to this, I added some comments to add the mutiple values' types (numpy array or theano tensor) to the docstrings
at the specified value. | ||
Parameters | ||
---------- | ||
value: numeric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably add the types for multiple values then. Something like the following?
value: numeric | |
value: numeric or np.ndarray or theano.tensor |
|
||
Parameters | ||
---------- | ||
value: numeric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value: numeric | |
value: numeric or np.ndarray or theano.tensor |
at the specified value. | ||
Parameters | ||
---------- | ||
value: numeric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value: numeric | |
value: numeric or np.ndarray or theano.tensor |
at the specified value. | ||
Parameters | ||
---------- | ||
value: numeric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value: numeric | |
value: numeric or np.ndarray or theano.tensor |
at the specified value. | ||
Parameters | ||
---------- | ||
value: numeric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value: numeric | |
value: numeric or np.ndarray or theano.tensor |
at the specified value. | ||
Parameters | ||
---------- | ||
value: numeric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value: numeric | |
value: numeric or np.ndarray or theano.tensor |
at the specified value. | ||
Parameters | ||
---------- | ||
value: numeric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value: numeric | |
value: numeric or np.ndarray or theano.tensor |
Thanks for the review @AlexAndorra. I agree about the ValueError, although I have to check how to check that with Theano variables. I was hoping someone would suggest a fix for those that use About the type hints in the docstring. What if we do a separate PR just for that, so that all logcdfs (discrete and continuous) are homogeneous? Right now, none of the other docstrings have it and they should as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of the doc-strings miss new-lines before Parameters and Returns, otherwise this looks great to me!
Thanks for reviewing it. I will fix those. Unrelated to that, does anyone know a good way to check that |
You can try whether |
Fix docstring formatting.
New commit fixes missing newlines in docstrings and raises informative I would address @AlexAndorra suggestion of including more informative docstring type hints, as well as expanding |
…` making use of `tt.log1p` and `logaddexp`. More informative comment on workaround for `Poisson.logcdf`.
This is a great contribution, thanks @ricardoV94! |
This PR adds
logcdf
methods to all univariate discrete distributions. Formulas were based in either scipy or wikipedia entries. For some distributions, I could not find specialized formulas and had to rely on summing up all the logps intt.arange(0, value+1)
.Closes #4331
Unittests:
check_logcdf
in all distributions with scipy counterpart.check_selfconsistency_discrete_logcdf
, which tests that the logcdf is equivalent to adding all the logps starting fromzerodomain.lower
. This provides coverage to the distributions that are not in scipy (not covered by 1.). I added it to all distributions, but maybe this is too redundant for those that are covered by 1.Some remaining issues / questions:
bound
calls in CDF methdos be replaced bytt.switch
? Mostlogcdf
methods for continous distributions seem to usett.switch
, except for theGamma
andInverseGamma
distributions. I usedbound
whenever there were at least two conditions that needed checking leading no-np.inf
, otherwise went withtt.switch
. I am also not completely sure if this interferes with the recent Add flag to disable bounds check for speed-up #4377 PR.logcdf
methods do not work with multiple values (or even with an array containing a single value such asnp.array([1])
). In these cases, I excluded information in the docstrings about the possibility of computing thelogcdf
for multiple values, but this seems like a suboptimal solution. We should either fix it or raise a ValueError to explain the limitation to the user, as this seems to go against the expected behavior in most distributions. This occurs for two reasons:logsumexp(tt.arange(0, value+1), keepdims=False)
for theBetaBinomial
andHyperGeometric
. Any alternatives?incomplete_beta
function in theBinomial
andNegativeBinomial
(andZeroInflated
counterparts). This is also an issue in the current implementation of theBeta
andStudentT
distributions (see Beta logcdf method fails with array of values #4342). In addition, the unit tests for these functions seem to be very slow compared to those of other distributions. Is theincomplete_beta
particularly slow?logcdf
method of thePoisson
distribution (and itsZeroInflated
counterpart) fails with a C-assertion (exiting the python process altogether) when asked to evaluate multiple invalid values. This is also a problem in theInverseGamma
distribution (see InverseGamma logcdf method fails with invalid parameters when array is used #4340). It will be solved once this Theano issue is fixed (see Return NaN in C implementations of SciPy Ops aesara-devs/aesara#224). Update: I have found a temporary hack to "hide" this problem (which can be removed once those other issues are solved).check_logcdf
tests only for values in the output domain, ignoring the edge values, but we might want to test also for values at the edges as well as below or beyond the domain (which can be evaluated to either -np.inf or 0). For examplepm.Bernoulli(p=.1).logcdf([-1, 2]) -> [-np.inf, 0]
. Should we use more comprehensive extra-domain checks? Update: This is now implemented in Increase unittestcheck_logcdf
coverage and fix issues with some distribution methods #4393.Other changes not directly related to this PR:
pymc3_matches_scipy
test to theBetaBinomial
. Was there a reason why this was missing?logp
andrandom
methods in theDiscreteWeibull
distribution to be in line with the rest of the library.DiscreteWeibull
andZeroInflatedPoisson
distributions.