-
Notifications
You must be signed in to change notification settings - Fork 226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adding dirichlet distribution #318
base: develop
Are you sure you want to change the base?
adding dirichlet distribution #318
Conversation
Ok, got it. It is implemented in |
Looks a really promising start. It would be really nice if the Real defaults could be vector and all other parameters double. For testing, \boost\libs\math\test\test_beta_dist.cpp is a slightly useful template. However it is probably over-complicated, at least to get started on testing. I found it easier to start with testing just double values and then refactor later with other types and higher precision. This will at least get some sanity checks. (Although the usefulness of high values for this distribution is unclear, our policy is to aim for tests at least the precision of 128-bit (~36 decimal digits)). There is also the important question of test values. Python could be used to generate some test values for a start. https://cran.r-project.org/web/packages/DirichletReg/DirichletReg.pdf might also be slightly useful but is ominously marked as 'orphaned' - but is quite new. Our favorite Wolfram doesn't seem to be able to help. The beta distribution tests should be applicable when they should be identical, so this might be the first thing to try? We also like to have some example(s). Being able to show how to combine with Boost.Random to replicate the Wikipedia/Python example(s) would be good. There are lots of tests of the that check for 'dodgy' input like negative, NaN or inf. These check throwing (if the policy is default). (and don't forget any examples should use thow'n'catch blocks so that the error messages are displayed). You can probably leave me to tidy these up later. |
Looks like a great start. I'll go through your other items on your checklist later. |
Hi @pabristow @NAThompson Can we use something like |
I have a vague recollection that we had a complaint similar to this many years ago. In the end we decided to keep the apparently redundant check because there were circumstances when it would allow a 'bad' value to sneak in. Checking seems quite cheap, as is construction generally? |
…ontainer type for vector type.
Right. If some 'invalid' value comes in for reasons unknown to the user, it would be a disaster. |
@mrityunjay-tripathi Could you write a google benchmark and show how fast it is with and without the argument validation? This is a really hard question to answer without hard data. An example is here. |
@jzmaddock , @pabristow : For scalar distribution, using free functions to compute kurtosis, range, so on, is just fine. But for vector types, following this pattern is getting a bit awkward. Should each of these functions go from free functions acting on the class to member functions? |
@mrityunjay-tripathi : We have a file "boost/libs/math/test/math_unit_test.hpp", which you can use to write unit tests. An example is "test/whittaker_shannon_test.cpp". |
This is clear now. I will soon add the tests. |
I am not familiar with 'google benchmark', but ready to delve into it. As soon as I add the tests, I will try to write the benchmark as well. |
@jzmaddock devised this scheme - with good reasons - so he is best qualified to comment on this change to multi-thingys. |
From: Mrityunjay Tripathi <[email protected]>
Sent: 29 February 2020 16:23
To: boostorg/math <[email protected]>
Cc: Paul A. Bristow <[email protected]>; Mention <[email protected]>
Subject: Re: [boostorg/math] adding dirichlet distribution (#318)
@mrityunjay-tripathi<https://github.com/mrityunjay-tripathi> : We have a file "boost/libs/math/test/math_unit_test.hpp", which you can use to write unit tests. An example is "test/whittaker_shannon_test.cpp".
This is clear now. I will soon add the tests.
For distributions a distribution test is a much better starting point to see how the Boost.Test is used..
For example :\boost\libs\math\test\test_normal.cpp
I am looking for some ‘known-good’ values from SciPy, for example, but hopefully you have some already..
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#318?email_source=notifications&email_token=AAIG4AMMMBSY55NF56XX2WDRFE27RA5CNFSM4K5A75T2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENL6E4Q#issuecomment-592962162>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAIG4AKXZZJNYHXHRGK4KALRFE27RANCNFSM4K5A75TQ>.
|
For better or worse I'd prefer to keep to the current pattern if possible. Implementation and use wise, I think it makes very little practical difference. I confess to a certain amount of hesitation in adding something that's not entirely mainstream as our first multivariate distribution. I had always assume that the multivariate normal would come first, but no matter I guess. With regard to the CDF, I note that multivariate normal has 2 different possible definitions, and I assume that is the case here also. I think we should be careful about treading on shaky ground here. Likewise, I assume there is no real notion of a quantile, not least because there is unlikely to be a unique vector of x-values corresponding to any given probability. |
You will definitely enjoy adding google benchmark to your toolkit. It's amazing. |
I think the problem is that the multivariate normal factors into products so that there is little demand for such a thing. Writing down a useful definition of the cdf that translates into an ergonomic API is going to be a problem though . . |
@jzmaddock I thought of using some random values so that they satisfy the condition. But it seems there is no practical use for this. I was putting it for the sense of completeness of the implementation. |
@NAThompson: I've used this deduction to calculate the CDF. Talking about efficiency, this distribution obviously is computationally expensive. |
Also:
|
I was trying this sample program from scipy but the values I am getting is not in sync with that of scipy.
The output:
|
I also wonder if we can dispense with passing a zero of the right type (required in the Dark Days of C++98) (Since you have already upped the C++ standard version to 14, if not 17!) Incidentally, is using c++14/17 really helpful - I'd much prefer to limit it to c++11 if possible? @jzmaddock Boost.Math policy decision? Your important views? |
It depends.... I think we need to mark C++03 as deprecated in the next release - that way we can start removing it from old code next year sometime (!). For new code, I'm reasonably relaxed about the compiler requirements, although it would be nice to not require stuff gratuitously. I guess recent GCC's are C++14 by default, as is MSVC so that seems like a reasonable target unless there's something in C++17 that essential? |
It's probably overkill unless you have extended precision values which you're certain are correct. |
Boost has to cope with systems without ‘throwability’ (eg embedded systems, real-time, IOT…).
Perhaps a vector containing a NaN? But still thinking about this.
From: Nick <[email protected]>
Sent: 6 March 2020 10:32
To: boostorg/math <[email protected]>
Cc: Paul A. Bristow <[email protected]>; Mention <[email protected]>
Subject: Re: [boostorg/math] adding dirichlet distribution (#318)
@NAThompson commented on this pull request.
________________________________
In include/boost/math/distributions/dirichlet.hpp<#318 (comment)>:
@@ -50,6 +50,12 @@ inline bool check_alpha(const char *function,
const Policy &pol)
{
using RealType = typename RandomAccessContainer::value_type;
+ using std::invalid_argument;
+ if (alpha.size() < 1)
+ {
+ throw invalid_argument("Size of 'concentration parameters' must be greater than 0.");
+ return false;
Do we need both a throw and a return false? I'd just do throw std::invalid_argument.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#318?email_source=notifications&email_token=AAIG4ALL6B4RANNUS2FYCOTRGDGLDA5CNFSM4K5A75T2YY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCYITBMA#pullrequestreview-370225328>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAIG4AOCKIZ66FUWZALNSDTRGDGLDANCNFSM4K5A75TQ>.
|
Sadly Dirichlet distribution in Wolfram says "Contribute this entry." :-( |
Correction of my bad - max_digits10 should be 21 for 80-bit long double, not 17 as I said before (that's only MSVC who can't be bothered to implement 80-bit long double). But I have yet to discover who to make SciPy output all those digits. Still digits10 = 18 would be better than only 6 decimal digit output! |
@pabristow: I got it why this is the case. Actually the check for sum of x just returns false if the condition is not satisfied but doesn't throw the error.
Never thought about that! |
Why do these tests pass even if there are several errors? Am I missing something? |
I don't think that the jamfile.v2 includes dirichlet tests or examples yet . I will do this and check it out locally using the b2 build system with a variety of gcc clang and msvc versions and c++std variants too. Job for next week. (I think that it is premature to be using the CI suite yet. I'm only testing locally. When pushing, include the string [CI SKIP] in the commit message will avoid triggering the long-suffering and very tired and overworked CI systems.) |
I'm also trying to use beta distribution (which works with Boost.Multiprecision) for some really high precision tests. says and this seems to work for PDF, but so far I get a different value for CDF. I've been playing with
I'm not sure if this is feasible and I suspect my ignorance and inexpertise. Views from mathy experts? I also would hope that one can do the equivalent of this for didchlet
How to do this for a multi distribution (in hand-waving mode ;-) )? |
The deduction for CDF is definitely wrong and now it really seems complicated to deduce CDF of Dirichlet distribution though I will try to work on it. Scipy also doesn't have CDF implementation, now I see why. |
I got this answer on researchgate about CDF of Dirichlet distribution. Since the application for which I am using Dirichlet distribution is concerned about generating random distribution mostly. Can we omit CDF for now and wait if something (approximation or efficient deductions) comes up later. |
@mrityunjay-tripathi : Why don't you try Cross-Validated and see if you get any better answers? |
For the record here is the helpful if negative reply. "3rd Mar, 2020 Sounds tricky, as my rusty maths quickly concluded. |
typename RandomAccessContainer::value_type entropy, // entropy | ||
typename RandomAccessContainer::value_type pdf, // pdf | ||
typename RandomAccessContainer::value_type cdf, // cdf | ||
typename RandomAccessContainer::value_type tol) // Test tolerance. | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed mode
, skewness
, kurtosis
, cdf
because I don't have predefined values for them. I thought of adding tests first for those that we have predefined values.
alpha[0] = static_cast<RealType>(5.310948003052013); | ||
alpha[1] = static_cast<RealType>(8.003963132298916); | ||
x[0] = static_cast<RealType>(0.35042614416132284); | ||
x[1] = static_cast<RealType>(0.64957385583867716); | ||
mean[0] = static_cast<RealType>(0.398872207937724); | ||
mean[1] = static_cast<RealType>(0.601127792062276); | ||
var[0] = static_cast<RealType>(0.016749888798155716); | ||
var[1] = static_cast<RealType>(0.016749888798155716); | ||
pdf = static_cast<RealType>(2.870121181949622); | ||
entropy = static_cast<RealType>(-0.6347509574442718); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't get higher precision than this (numpy.float64
), because gamma function in SciPy doesn't support numpy.float128
/numpy.longdouble
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can’t you test kurtosis/variance with a scaling relation?
Var(kX) = k^2Var(X)?
I could generate some cpp_bin_float_50 values for this, but that is somewhat circular in that it assumes that your code is correct ;-) Cunning-er tests like Nick's are a Good Idea. (Past experience is that it is all to easy to code these equations wrong :-( so cross checks are Really Useful) I've asked again about the Dirichlet CDF |
We also like to have some worked examples of using functions and distributions, (often lifted from somewhere else so we can compare results). Have you any examples yet? Perhaps using with Boost.Random? I am starting on documentation. |
Sorry, I couldn't get done much today because of all-day celebration of 'Holi'.
@pabristow: The
I got the equation for CDF but it's toooo big. I am trying to implement it as soon as possible.
Yes, I have but I am yet to get familiar with Boost.Random
That's nice. |
It seems unlikely that we will get the same results because the equation for cdf of dirichlet is much complex and has to be dealt using approximation (as far as I could get it). May be we could hard-code it for k <= 2 and use the other equation for k > 2? |
looks intimidating to be an exercise for the student ;-) |
Sorry for the delay in the implementation of the CDF function, I've almost completed it but need to change the API accordingly. I got busy with the project proposal for GSoC. The approximation for Lauricella function mentioned in this paper is elaborated here. |
Ah - that sounds extremely interesting and promising. Approximations like these are really useful. |
I've removed std::invalid_argument almost everywhere, I will see if it is still used somewhere. Also, my Institute is closed and I am on my way to home due to this whole Corona thing. I too can't access them right now, though I downloaded some of the papers (if it's that what you want) and can send you if you say. |
I've searched better and found the article, but not even partially digested it. I'm still struggling to re-understand how the error_handling works - don't worry about it for now. Getting the cdf to work would be excellent. I've download lauricella.zip from http://faculty.smu.edu/rbutler contains a pdf for Laplace Approximation of Lauricella Functions FA, FD, and Φ2 lauricella_NM2.pdf contains a section on 3.1.1 explaining how the there is an error in the Exton paper (equation 10) We may be able to get help from Andrew Wood at Nottingham, UK. This is above my 'pay grade' ;-) http://www.smu.edu/statistics/faculty/butler.html is is 404, but the download above has some .m programs that might be useful? HTH Good luck avoiding the virus - at least you are not in the Death Age Zone! |
For test values, Selected Tables in Mathematical Statistics, Volume IV (1977) has 200 pages of numerical tables dedicated to the Dirichlet distribution's CDF, inverse, mean, and variance. The book also includes a lengthy exposition on how to compute them. |
Thanks for this. Covid-19 and other things have pushed the Dirichlet onto the back burner, but this looks a very useful reference. Library loans are not functionally at present, but I hope to return to this in due course. |
I am adding Dirichlet distribution in
include/boost/math/distributions/
. The Dirichlet distribution is a multivariate distribution function, so vector type inputs are needed, but the tests intest/compile_test/test_compile_result.hpp
are forint
,double
,float
,long double
,unsigned
. How can I add tests for Dirichlet distribution?TODO:
VectorType
and it's implementation.cdf
function. (I could not find it, so tried to derive it.)skewness
functionkurtosis
function.Also, none of the distribution functions generate random samples of the corresponding distribution. That means no sampling function implemented. In PyTorch, we can do something like,
Can we add such feature or its implemented as such?