Add HyperGeometric Distribution to pymc3.distributions.discrete #3504 #4108

Harivallabha · 2020-09-17T17:57:20Z

Support HyperGeometric Distribution as part of distributions/discrete
Added tests
Added docstrings

@twiecki @ricardoV94. Hope this helps. Please feel free to modify/add on top of this, if this is useful. If not, please ignore 😬

Harivallabha · 2020-09-17T18:34:06Z

I can see that scipy.stats.hypergeom and the above pymc3.distributions.discrete.HyperGeometric give the same result for typical parameter values.

However, when I test with something like M = 3, n = 3, N = 4, and k = 1:

    scipy.stats.hypergeom.logpmf(1, 3, 3, 4)

gives nan, whereas

    with pm.Model() as model:
        z = pm.HyperGeometric('z', 3, 3, 4)
    model.logp({'z': 1})

gives -inf.
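The two calls above name their parameters differently, which makes the comparison easy to misread. A minimal sketch of the mapping follows, assuming the PR's constructor takes its positional arguments in the order (N, k, n), as the assignments in the diff suggest. `to_scipy_args` and `is_valid` are hypothetical helpers for illustration, not part of either library.

```python
def to_scipy_args(N, k, n):
    """Map (population N, successes k, draws n) to SciPy's (M, n, N) names."""
    return {"M": N, "n": k, "N": n}


def is_valid(N, k, n):
    """True when the success count and the sample size both fit in the population."""
    return 0 <= k <= N and 0 <= n <= N


# The parameters from the comment above (assumed positional order: N, k, n).
params = dict(N=3, k=3, n=4)
scipy_params = to_scipy_args(**params)
valid = is_valid(**params)  # False: 4 draws from a population of 3
```

With these parameters the sample size exceeds the population, so no value has positive probability, which is consistent with one library reporting nan and the other -inf.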

fonnesbeck · 2020-09-17T19:38:37Z

Thanks for the PR! Can you please run the Black formatter on this?

Harivallabha · 2020-09-17T19:48:09Z

@fonnesbeck Sure! I should tell you that running the Black formatter on discrete.py, test_distributions.py, and test_distributions_random.py formats some code that I haven't touched as well 😅

fonnesbeck · 2020-09-17T19:49:37Z

No problem. It likely slipped through on a previous PR.

Harivallabha · 2020-09-17T19:50:07Z

Okay, done! The Black formatter has reached in and changed things in a lot of places (e.g., it changes all ' ' to " "). So let me know if I should revert this, because this is probably not going to be easy for you to review 😅

fonnesbeck · 2020-09-17T20:05:23Z

pymc3/distributions/discrete.py

+        k = self.k
+        n = self.n
+        return bound(binomln(k, value) + binomln(N - k, n - value) - binomln(N, n),
+                     0 <= k, k <= N, 0 <= n, 0 <= N, n - N + k <= value, 0 <= value,


You are already testing that 0 <= k and k <= N, so you should not have to include 0 <= N

Ah, point! Removing the unnecessary condition.
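To see why the extra condition is redundant, here is a small pure-Python sketch. `bound_py` is a hypothetical stand-in that mimics the behaviour assumed of `bound` here (return the log-probability when every condition holds, and -inf otherwise); it is not the PyMC3 function itself.

```python
import math
from itertools import product


def bound_py(logp, *conditions):
    """Keep logp only when every condition is true; otherwise return -inf."""
    return logp if all(conditions) else -math.inf


def with_extra(N, k):
    # Conditions as first written, including the explicit 0 <= N.
    return bound_py(0.0, 0 <= k, k <= N, 0 <= N)


def without_extra(N, k):
    # 0 <= k and k <= N already imply 0 <= N.
    return bound_py(0.0, 0 <= k, k <= N)


# Compare both versions on a small grid that includes negative values.
redundant = all(
    with_extra(N, k) == without_extra(N, k)
    for N, k in product(range(-3, 4), repeat=2)
)
```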

fonnesbeck · 2020-09-17T20:09:27Z

As for the difference between your implementation and SciPy's, theirs uses a different formula for the logprob:

    def _logpmf(self, k, M, n, N):
        tot, good = M, n
        bad = tot - good
        result = (betaln(good+1, 1) + betaln(bad+1, 1) + betaln(tot-N+1, N+1) -
                  betaln(k+1, good-k+1) - betaln(N-k+1, bad-N+k+1) -
                  betaln(tot+1, 1))
        return result

So, you can either change yours to match this, or address the test another way.
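For valid parameters the two formulas agree, because log C(a, b) = betaln(a + 1, 1) - betaln(b + 1, a - b + 1). The sketch below checks this numerically using only the standard library; the local `betaln` and `log_binom` are illustrative re-implementations via `math.lgamma`, not the SciPy or PyMC3 functions, and the parameter values are arbitrary.

```python
import math


def betaln(a, b):
    """Log of the Beta function for positive arguments."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_binom(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def logpmf_binomial_form(k, M, n, N):
    # SciPy naming: M = population size, n = successes, N = draws.
    return log_binom(n, k) + log_binom(M - n, N - k) - log_binom(M, N)


def logpmf_betaln_form(k, M, n, N):
    # Same expression as the SciPy snippet quoted above.
    tot, good = M, n
    bad = tot - good
    return (betaln(good + 1, 1) + betaln(bad + 1, 1) + betaln(tot - N + 1, N + 1)
            - betaln(k + 1, good - k + 1) - betaln(N - k + 1, bad - N + k + 1)
            - betaln(tot + 1, 1))


M, n, N = 20, 7, 12
# For these parameters the support is k = 0..7.
max_diff = max(
    abs(logpmf_binomial_form(k, M, n, N) - logpmf_betaln_form(k, M, n, N))
    for k in range(0, 8)
)
```

The difference reported in the thread therefore comes from how values outside the valid range are handled, not from the formulas disagreeing inside it.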

MarcoGorelli · 2020-09-17T20:13:20Z

> Okay, done! The Black formatter has reached in and changed stuff in a lottt of places (Exa: change all ' ' to " "). So lemme know if I should revert this because this is probably not gonna be easy for you to review 😅

FWIW, I think this is a fair point...would PyMC3 take a large PR that does nothing except for applying black everywhere and also adds it to CI / pre-commit? Because otherwise reviews do become hard. Conversely, once black has been applied everywhere, diffs do tend to become smaller.

EDIT: taken forward in #4109

Harivallabha · 2020-09-17T20:36:09Z

@fonnesbeck Thanks! Modified the implementation to match the scipy one.

fonnesbeck · 2020-09-17T20:45:25Z

Looks like you need a rebase to address the conflicts.

tirthasheshpatel

Thanks, @Harivallabha, for implementing this distribution. I have been working on adding NegativeHypergeometric and MultivariateHypergeometric distributions to pymc3, so I thought I would help with this too. Hope these comments are helpful :)

Please tell me if I have missed anything.

tirthasheshpatel · 2020-10-21T17:33:19Z

pymc3/distributions/discrete.py

+        plt.show()
+
+    ========  =============================
+    Support   :math:`x \in \mathbb{N}_{>0}`


Are you sure this is right? On the Wikipedia page, support is given as x in [max(0, n - N + k), min(k, n)].
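A quick exact check of the support quoted from Wikipedia, using only the standard library. The helper names and the parameter values are illustrative and not taken from the PR.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(x, N, k, n):
    """Exact pmf: C(k, x) * C(N - k, n - x) / C(N, n)."""
    return Fraction(comb(k, x) * comb(N - k, n - x), comb(N, n))


def support(N, k, n):
    """Integers in [max(0, n - N + k), min(k, n)]."""
    return range(max(0, n - N + k), min(k, n) + 1)


N, k, n = 10, 4, 3
# Summing the pmf over the claimed support should give exactly 1.
total = sum(hypergeom_pmf(x, N, k, n) for x in support(N, k, n))
# x = 0 carries positive mass here, so the support cannot be the positive integers.
p_zero = hypergeom_pmf(0, N, k, n)
```

Because the pmf sums to one over this interval and x = 0 can have positive probability, the interval form of the support is the one the docstring should state.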

tirthasheshpatel · 2020-10-21T17:34:54Z

pymc3/distributions/discrete.py

+
+class HyperGeometric(Discrete):
+    R"""
+    Hypergeometric log-likelihood.


Suggested change

-    Hypergeometric log-likelihood.
+    Discrete hypergeometric distribution.

Isn't this better?

tirthasheshpatel · 2020-10-21T17:41:53Z

pymc3/distributions/discrete.py

+    The probability of x successes in a sequence of n Bernoulli
+    trials (That is, sample size = n) - where the population
+    size is N, containing a total of k successful individuals.
+    The process is carried out without replacement.


Suggested change

-    The probability of x successes in a sequence of n Bernoulli
-    trials (That is, sample size = n) - where the population
-    size is N, containing a total of k successful individuals.
-    The process is carried out without replacement.
+    The probability of :math:`x` successes in a sequence of :math:`n` bernoulli
+    trials taken without replacement from a population of :math:`N` objects,
+    containing :math:`k` good (or successful or Type I) objects.

Nitpick. Not a blocking comment.

tirthasheshpatel · 2020-10-21T17:44:32Z

pymc3/distributions/discrete.py

+    Mean      :math:`\dfrac{n.k}{N}`
+    Variance  :math:`\dfrac{(N-n).n.k.(N-k)}{(N-1).N^2}`


Suggested change

-    Mean      :math:`\dfrac{n.k}{N}`
-    Variance  :math:`\dfrac{(N-n).n.k.(N-k)}{(N-1).N^2}`
+    Mean      :math:`\dfrac{nk}{N}`
+    Variance  :math:`\dfrac{(N-n)nk(N-k)}{(N-1)N^2}`
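The closed forms in the docstring can be checked against moments computed directly from the pmf. This is a small exact-arithmetic sketch with arbitrary example parameters; `moments` is a hypothetical helper, not code from the PR.

```python
from fractions import Fraction
from math import comb


def moments(N, k, n):
    """Mean and variance obtained by enumerating the pmf over its support."""
    xs = range(max(0, n - N + k), min(k, n) + 1)
    pmf = {x: Fraction(comb(k, x) * comb(N - k, n - x), comb(N, n)) for x in xs}
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, var


N, k, n = 10, 4, 3
mean, var = moments(N, k, n)

# Closed forms from the docstring table.
closed_mean = Fraction(n * k, N)
closed_var = Fraction((N - n) * n * k * (N - k), (N - 1) * N**2)
```

For N = 10, k = 4, n = 3 both routes give a mean of 6/5 and a variance of 14/25.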

tirthasheshpatel · 2020-10-21T17:46:29Z

pymc3/distributions/discrete.py

+        self.N = N = tt.as_tensor_variable(intX(N))
+        self.k = k = tt.as_tensor_variable(intX(k))
+        self.n = n = tt.as_tensor_variable(intX(n))


Suggested change

-        self.N = N = tt.as_tensor_variable(intX(N))
-        self.k = k = tt.as_tensor_variable(intX(k))
-        self.n = n = tt.as_tensor_variable(intX(n))
+        self.N = intX(N)
+        self.k = intX(k)
+        self.n = intX(n)

Nitpick. Not a blocking comment.

tirthasheshpatel · 2020-10-21T17:59:52Z

pymc3/distributions/discrete.py

+    The pmf of this distribution is
+    .. math:: f(x \mid N, n, k) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}}
+    .. plot::


Suggested change

-    The pmf of this distribution is
-    .. math:: f(x \mid N, n, k) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}}
-    .. plot::
+    The pmf of this distribution is
+    .. math:: f(x \mid N, n, k) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}}
+    .. plot::

Nitpick. Not a blocking comment

tirthasheshpatel · 2020-10-21T18:01:10Z

pymc3/distributions/discrete.py

+            specified).
+        Returns
+        -------
+        array


Suggested change

-            specified).
-        Returns
-        -------
-        array
+            specified).
+        Returns
+        -------
+        array

Nitpick. Not a blocking comment

tirthasheshpatel · 2020-10-21T18:01:49Z

pymc3/distributions/discrete.py

+        Calculate log-probability of HyperGeometric distribution at specified value.
+        Parameters
+        ----------


Suggested change

-        Calculate log-probability of HyperGeometric distribution at specified value.
-        Parameters
-        ----------
+        Calculate log-probability of HyperGeometric distribution at specified value.
+        Parameters
+        ----------

Nitpick. Not a blocking comment

tirthasheshpatel · 2020-10-21T18:02:13Z

pymc3/distributions/discrete.py

+            values are desired the values must be provided in a numpy array or theano tensor
+        Returns
+        -------


Suggested change

-            values are desired the values must be provided in a numpy array or theano tensor
-        Returns
-        -------
+            values are desired the values must be provided in a numpy array or theano tensor
+        Returns
+        -------

Nitpick. Not a blocking comment

tirthasheshpatel · 2020-10-21T18:15:01Z

pymc3/distributions/discrete.py

+            - betaln(n - value + 1, bad - n + value + 1)
+            - betaln(tot + 1, 1)
+        )
+        return result


You should mask the invalid entries with -inf before returning the result. I have used the inequality of support from the Wikipedia page. If that comment is wrong, please change the lower and upper bound conditions according to the formula of support you use.

Suggested change

-        return result
+        # value in [max(0, n - N + k), min(k, n)]
+        lower = tt.switch(tt.gt(n - N + k, 0), n - N + k, 0)
+        upper = tt.switch(tt.lt(k, n), k, n)
+        nonint_value = (value != intX(tt.floor(value)))
+        return bound(result, lower <= value, value <= upper, nonint_value)

There is also an additional condition checking whether all the values are integers. SciPy checks this, but I haven't seen pymc3 use it anywhere. Ideally it should be there, but it has several numerical problems (for example, 1e-17 is considered a non-integer value). Feel free to leave it out if you think the condition is better avoided.
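The masking idea can be sketched for scalar inputs in plain Python. This is an illustration under stated assumptions rather than the PR's Theano code: `betaln` is re-implemented locally with `math.lgamma`, and the function assumes that a bound-style helper keeps the result only where every condition is true. Under that assumption the integer check has to be expressed as "value is an integer"; a flag that is true for non-integer values would mask the valid entries instead.

```python
import math


def betaln(a, b):
    """Log of the Beta function for positive arguments."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def hypergeom_logp(value, N, k, n):
    """Log-pmf with -inf outside the support [max(0, n - N + k), min(k, n)]."""
    lower = max(0, n - N + k)
    upper = min(k, n)
    is_int = value == math.floor(value)
    if not (lower <= value <= upper and is_int):
        return -math.inf  # masked: outside the support or not an integer
    tot, good = N, k
    bad = tot - good
    return (betaln(good + 1, 1) + betaln(bad + 1, 1) + betaln(tot - n + 1, n + 1)
            - betaln(value + 1, good - value + 1)
            - betaln(n - value + 1, bad - n + value + 1)
            - betaln(tot + 1, 1))


inside = hypergeom_logp(0, 10, 4, 3)      # finite, equals log(1/6)
too_large = hypergeom_logp(4, 10, 4, 3)   # -inf: above min(k, n)
fractional = hypergeom_logp(2.5, 10, 4, 3)  # -inf: not an integer
empty = hypergeom_logp(1, 3, 3, 4)        # -inf: the support is empty
```

Returning early for masked values also avoids evaluating the log-Beta terms at arguments where they are undefined.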

Spaak · 2020-11-23T16:22:17Z

@Harivallabha thanks for the PR, we're trying to get this in release 3.10. For that, could you please:

- Go over @tirthasheshpatel's comments and mark them resolved where applicable
- Rebase onto master (hopefully not too painful...) so the tests can run and we can merge
- Simplify (or probably simply remove) the _repr_latex_ implementations, which is probably where many of the conflicts come from (see "refactor _repr_latex functionality" #4065 and "adding meaningful str representations to PyMC3 objects" #4076 for a change to how LaTeX and str representations are implemented)

Spaak · 2020-11-24T07:35:18Z

Thanks, let's continue in #4108 then.

…#4249)

* Add HyperGeometric distribution to discrete.py; Add tests
* Add HyperGeo to distirbutions/__init__.py
* Fix minor linting issue
* Add ref_rand helper function. Clip lower in logp
* Fix bug. Now pymc3_matches_scipy runs without error but pymc3_random_discrete diverges from expected value
* passes match with scipy test in test_distributions.py but fails in test_distributions_random.py
* Modify HyperGeom.random; Random test still failing. match_with_scipy test passing
* rm stray print
* Fix failing random test by specifying domain
* Update pymc3/distributions/discrete.py
  Remove stray newline
  Co-authored-by: Tirth Patel <[email protected]>
* Add note in RELEASE-NOTES.md

Co-authored-by: Tirth Patel <[email protected]>

Harivallabha added 5 commits September 17, 2020 22:59

Add hypergeometric dist. to discrete

0e865dc

Add tests for the hypergeometric distribution

7d12a7b

Clean up linting

a2bffe7

Fix linting - 2

fec3029

Add line in RELEASE-NOTES.md

a23960f

Run black formatter on discrete.py, test_distributions.py and test_di…

bd6e7fb

…stributions_random.py

fonnesbeck reviewed Sep 17, 2020

Harivallabha added 2 commits September 18, 2020 01:43

Remove unnecessary bound condition in logprob of hypergeom

d797ef7

Change logp of hypergeometric to mirror the scipy implementation

c708e2f

MarcoGorelli mentioned this pull request Sep 18, 2020

Apply black formatter #4109

Closed

tirthasheshpatel reviewed Oct 21, 2020

michaelosthege added the enhancements label Nov 14, 2020

michaelosthege added this to the 3.10 milestone Nov 14, 2020

Harivallabha mentioned this pull request Nov 23, 2020

Add HyperGeometric Distribution to pymc3.distributions.discrete #4108 #4249

Merged

Spaak closed this Nov 24, 2020

This was referenced Dec 21, 2020

HyperGeometric distribution gives wrong results #4366

Closed

Add bound to HyperGeometric logp (resolves #4366) #4367

Merged
