Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove size from distribution API #5754

Closed
ricardoV94 opened this issue May 8, 2022 · 48 comments · Fixed by #5788
Closed

Remove size from distribution API #5754

ricardoV94 opened this issue May 8, 2022 · 48 comments · Fixed by #5788
Assignees
Milestone

Comments

@ricardoV94
Copy link
Member

This was brought up in the last meeting, the question is whether there's any benefit in introducing size as a new parameter to our distributions. Size is simply shape[:-ndim_supp] so it's pretty easy for us to convert a user provided shape to size which Aesara RandomVariables use and will keep using internally.

We should decide this before we release V4. For some context on the distinction see this WIP PR #5746

For a more involved example where we are converting from shape/dims/observed to size, take a look at these PRs #5743 and #5690

@twiecki
Copy link
Member

twiecki commented May 8, 2022

Big agreement, users know shape already so why add a subtly new thing.

@canyon289
Copy link
Member

Agree hiding size would help make things much more sane, especially because as examples proliferate if we have size and shape some people will use size, others will use shape, and with a split API itll cause recurring confusion. Very similar to how matplotlib as both ax.plot and plt.plot and every beginner gets confused every time

@fonnesbeck
Copy link
Member

This is a smart decision. Size can be left available for advanced users as an attribute, but no need to have it in the public API.

@ricardoV94
Copy link
Member Author

Anyone wants to open a PR to this effect?

@canyon289
Copy link
Member

canyon289 commented May 8, 2022

This is a smart decision. Size can be left available for advanced users as an attribute, but no need to have it in the public API.

Great idea, share with power users that its possible in a subtle way without confusing everyone else

@danhphan
Copy link
Member

danhphan commented May 9, 2022

Hi, I am happy to work on this task.

@canyon289
Copy link
Member

Hi, I am happy to work on this task.

Thanks! Its yours for the taking

@lucianopaz
Copy link
Contributor

I wouldn't like us to hide size. It's new, but it's also better. We should transition away from shape altogether. For univariate distributions, shape and size are equivalent, but for multivariate distributions they aren't, and the shape is far worse.

shape forces users to always insert the implied support shape of multivariate distributions (which was a bad UX in pymc3 and now that we can, we should stir away from it). Besides that, the shape value of the multivariate distribution's support is completely ignored in favor of the implied support shape:

>>> pm.draw(pm.MvNormal.dist(mu=np.zeros(5), cov=np.eye(5), shape=(3,))).shape
(5,)
>>> pm.draw(pm.MvNormal.dist(mu=np.zeros(1), cov=np.eye(1), shape=(3,))).shape
(1,)
>>> pm.draw(pm.Multinomial.dist(n=5, p=[0.1, 0.3, 0.6], shape=(5, 100))).shape
(5, 3)

Furthermore, if you want to mimic the behavior of size using an Ellipsis, you can't if there are implied batch dimensions:

>>> pm.draw(pm.MvNormal.dist(mu=np.zeros((5, 1)), cov=[[1.]], shape=(3, 2, ...))).shape
(3, 2, 5, 1)

size is better because it lets the users be explicit on what they can change of a random variable: they can create batches, but they can't alter the support which are implied.

The only weird scenario is with time series distributions where the support can't be inferred from the distribution parameters, and needs to be specified using shape.

@twiecki
Copy link
Member

twiecki commented May 9, 2022

@lucianopaz You raise good points. If size is indeed better/more consistent than shape, now would be the time to transition users over.

@ricardoV94
Copy link
Member Author

Also worth considering that we can add SpecifyShape to assert the support dimension size:

import aesara.tensor as at
import pymc as pm

x = pm.MvNormal.dist(mu=np.zeros(3), cov=np.eye(3), shape=(5,))
x = at.specify_shape(x, (5,))
pm.draw(x)  # AssertionError: SpecifyShape: dim 0 of input has shape 3, expected 5.

These asserts would most likely be constant folded during graph compilation so they won't pose any burden. The downside of this is that the outputs would no longer be pure RandomVariables, which complicates matters in factory distributions, like Mixture, RandomWalk or LKJCholeskyCov, which currently require pure RandomVariables as inputs.

Hopefully we will one day overcome this limitation (not only related to SpecifyShape, but also other cases like indexing, transposing, or doing transformations, the sort of thing that Aeppl focuses on), but it's an immediate downside if we were to add SpecifyShapes now.

@fonnesbeck
Copy link
Member

There are arguments to be made for either continuing with shape for now or converting to size. What I would not want to see is continuing with both. With a major release, we should either continue with shape if it is not causing a plethora of problems, or rip the BandAid off and start using size. It would be nice to avoid muddying the waters with shape/size/dim uncertainty.

@twiecki
Copy link
Member

twiecki commented May 9, 2022

I feel we can convert to size without too much friction:

  • Recommended is dims anyway
  • It behaves the same for univariate, which is what most users use
  • We have shape for backwards compatibility

Questions are if we should then actually add deprecation warnings for shape. We could probably add pretty smart ones that show exactly how the code needs to be changed.

@canyon289
Copy link
Member

canyon289 commented May 9, 2022

To be specific my vote is to remove one if possible. I trust the team to know which one i better to keep, and which one is better to "hide"

@ricardoV94 ricardoV94 changed the title Remove size from user-facing API? Remove one of size or shape from user-facing API? May 9, 2022
@michaelosthege
Copy link
Member

michaelosthege commented May 9, 2022

I vote to recommend one or the other with a smart DeprecationWarning but keep supporting both until 4.1.0.

So smooth backwards compatibile update, followed by clear migration path.


Which one is better? That's hard. (And the reason why IMO we should keep both until 4.1.0.)

@lucianopaz brought up good arguments in favor of size, but it's unclear how exactly the API for multivariate RVs should look like.

Note that dims currently behaves exactly like shape, so if we dislike shape enough to remove it, it would be very incoherent to keep dims as it is right now!!! (@twiecki @fonnesbeck)

We'd have to rethink dims too. For example passing dimnames of support dimensions, or timeseries length/steps separately.


Should this block 4.0.0?

If I didn't have to write my thesis and could take a 3 week vacation to refactor this stuff (for the fourth time) together with two of you I'd say yes.
But I hardly have time to review @ricardoV94's PRs, so there's that.
I only know from (painful) experience that this stuff is is the kind of refactoring that cooks your brain even without drinking coffee.

The implementation we have right now is at least coherent.
Yes, the support dim elements of shape/dims passed to multivariate RVs* are ignored, but that's in the docstring, or easy to print a warning about**.
*Except timeseries that infer steps/length from shape.
**And we could automatically evaluate & assert them.

Also, the size=(3, 2,...) example @lucianopaz gave is exactly as advertised/intended: The Ellipsis short-hand refers to all dimensions the RV had, if you'd leave the shape/dims kwarg away. Which in your example is (5, 1). It prepends dimensions you wouldn't get from the other parameters (mu, cov) without having to pass their shape or dimnames.

@twiecki
Copy link
Member

twiecki commented May 9, 2022

Right, the discussions we had back then @michaelosthege are coming back to me. One big motivator for keeping with shape/dims is that it's explicit, while size is implicit. I don't think it's a big deal with shape but more so with dims. E.g.

pm.MvNormal.dist(mu=np.zeros(5), cov=np.eye(5), dims=("predictors", "schools"))

vs (if we made dims match size):

pm.MvNormal.dist(mu=np.zeros(5), cov=np.eye(5), dims="predictors")

(example says there are 5 correlated schools and 3 predictors/regressors).

Although reading it now it doesn't seem so bad.

But then there seems to be an inconsistency with size for implied dimensions in univariate distributions:

pm.draw(pm.Normal.dist(mu=[0, 1, 2], size=(5, 3))).shape
>>> (5, 3)
pm.draw(pm.Normal.dist(mu=[0, 1, 2], shape=(5, 3))).shape
>>> (5, 3)

So implied dimensions are only ignored when support_dim > 0?

@danhphan
Copy link
Member

Hi,

My opinion is that we should keep shape, and remove size from parameters when constructing distributions.

Why? because I am an amateur, and my knowledge about shape and size is come from numpy (I think most users are also learn the concept of shape and size from numpy, before they use pymc. The use of size as parameters (function arguments) in numpy lead to many confusions between shape and size, so we should not repeat this mistake.

Besides, I am also not smart, and not good at thinking in a high-dimentional space or symbolic way, so I prefer explicit more than implicit (seem that most users are similar).

When I start to use pymc, I also get confused between shape and size in pymc, which lead me to read these useful blog posts and insightful discussions:

So these are ideas that I think will help normal users:

  • Keep shape and remove size from parameters when constructing distributions to reduce confusion
  • Explicitly specify different types of shape such as event shape, batch shape and sample shape.

Can we use dims to explicitly specify different types of shape? Also, is the ndim_support = (event shape + batch shape )?

Also, here we only mention about pm.MvNormal cases, what about other kind of distributions like timeseries pymc.GaussianRandomWalk and pymc.MvGaussianRandomWalk, where the steps parameter play in the output shape? Or what if the case of multiple variable outputs? Or what if we want to also use mini-batch in the observed data? How it can play well with observed distributions?

Maybe a good way is to explicitly list all the distributions and their possible usecases (how they are constructed or used). From that we can have a bigger picture to decide.

PS. I wrote that pm.draw() function before my acknowledgement of different types of shapes, so this function may be not suitable for our final goal to address the problem of shape vs. size in this discussion.

@Sayam753
Copy link
Member

I second the thought of keeping/maintaining shape and removing size from public facing API as I like being more explicit working with shape and dims.

Coming back to the example that @twiecki shared:
If we always specify the dimensions of complete distribution (i.e. batch_shape + event_shape) in multivariate distributions, I think this will be helpful for PyMC users to deal with shapes (not assume something is working implicitly under-the-hood)

pm.MvNormal.dist(mu=np.zeros(5), cov=np.eye(5), dims=("predictors", "schools"))

instead of

pm.MvNormal.dist(mu=np.zeros(5), cov=np.eye(5), dims="predictors")

I would like to extend the question from @danhphan

Can we use dims to explicitly specify different types of shape? If not, can we support working with dims to explicitly specify different types of shape?

And to answer:

Also, is the ndim_support = (event shape + batch shape )? ndim_supp is event_shape only as per my understanding.

@ricardoV94
Copy link
Member Author

ndim_supp is event dimensionality not shape. Tells you whether it's a scalar or vector but not how large

@lucianopaz
Copy link
Contributor

I agree with @michaelosthege: keep both, deprecate shape and have a migration guide. The fact that dims behaves similarly to shape isn’t too bad, because dims can be used to label dimensions and set coordinate labels for post processing, not just to resize a random variable.
As a final note, numpy also uses size in the same way that pymc does, so it’s a similar API.

@michaelosthege
Copy link
Member

Do we agree to keep both for 4.0.0 ?

Because then we could move this issue to the 4.1.0 milestone.


I feel like for experienced users size is the better choice, because it circumvents the problem of having ineffective entries in the tuple*.
*Only relevant for multivariate RVs though.

But I fully agree with @danhphan in that the shape/size distinction is confusing for beginners.
After all, both ndarray and TensorVariable have .shape attributes, but no .size!

However, IMO, adding batch_shape, event_shape, ... parameters could make it even more confusing.


Starting with a 0-centered prior, and having observed >2 years of discussion on this topic, our posterior is still not clearly on one side.

Maybe there's just a really tiny effect size?

What's our uncertainty about...?

  1. The regret of keeping both.
  2. The regret of recommending shape (keeping both)
  3. The regret of recommending size (keeping both)
  4. The regret of deprecating shape.
  5. The regret of deprecating size.

I'd say we have a pretty good estimate of the regret for 1-3:

  • no code changes needed, or just adding a GoodBayesianWarning in some dists
  • no friction for users
  • no risk of confusion by having to do a 180° turn
  • some cost of writing good doc explainer / recording a 10 minute video explainer about it

But for 4-5 there's known regret + high uncertainty

  • really hard code changes needed
  • user friction from breaking changes
  • high uncertainty around the regret of not having the other option

Let's minimize our expected regret.

@twiecki
Copy link
Member

twiecki commented May 10, 2022

Great points @michaelosthege. If indeed shape is easier for the user, which we now have some data on and I believe is right, we should err on the side of that. I don't think we can/should truly deprecate size because that's what aesara uses.

So I think we should teach shape as we have been, keep size internal or for expert users, and not let this block us further.

@ricardoV94
Copy link
Member Author

I don't think that keeping one for "naive users" one for "experts" is any good. We still need to keep both in mind in the codebase, and most examples of code that users will see outside of perhaps the official documentation will be written by "experts" anyway. We will still have to think about what combinations of things we want to maintain, e.g., do we allow for ellipsis with size, or keep only with shape?

@danhphan
Copy link
Member

Hi,

Instead of discussing on what users may prefer between shape and size, why dont we just ask them directly?

We have a strong community on discourse, so we can create a survey of 5 to 10 questions, and make a pin post on PyMC discourse, and ask for the community's opinions on this.

Some questions like (we can discuss more on the details of what we will ask later if you think it is a suitable approach):

  • What important features you want to have in PyMC v4?
  • When constructing distributions, do you prefer shape or size
  • Where do you think PyMC can be impoved, that can make your research or work easier?
  • ....

At the end of the day, the users are the ones who use these functions/APIs. So they are like the customers of PyMC. Maybe customers are not always right, but the survey will give us a sense of what is the opinions of the end users. Besides, this post will attract the attention of end users, and communicate what we gonna change in PyMC v4.

This will also give us more insights to make a more data-driven decision :)

@twiecki
Copy link
Member

twiecki commented May 11, 2022

Had a great chat with @ricardoV94 and here's where I'm at.

Pros of shape:

  • Users seem to prefer shape (more explicit etc)
  • Past users already know shape, so we don't have to explain a new thing (we have a new version so I weigh this point lightly)
  • Shape is compatible with dims

Pros of size:

  • size is more consistent, but the difference is small (only for multivariate distributions)
  • numpy and aesara use size

In the past, shape has rarely been a point of confusion (only internally it's been difficult). And we're proposing dims as the main-way to begin with. If we were to only use size, we would not change dims because the information about the labels along the event-dimension is still important to have in the idata object.

For me the pros of shape outweigh the pros of size. I also like what @michaelosthege said about the effect size probably being small. In that case I would err to the default of staying with what we have.

So my (updated) proposal is to remove the size kwarg.

@twiecki twiecki changed the title Remove one of size or shape from user-facing API? Remove size from user-facing API May 13, 2022
@danhphan
Copy link
Member

Hi, should we proceed this?

I add a couple of distributions that has size from @michaelosthege's list here:

A list of signatures that mention size:

  • pm.Distribution.dist has is, but that's not user-facing API.
  • pm.Flat.dist()
  • pm.HalfFlat.dist()
  • pm.Bound.dist()
  • pm.GaussianRandomWalk.dist()

@ricardoV94
Copy link
Member Author

ricardoV94 commented May 19, 2022

The biggest change is that size should no longer be accepted in the internal dist calls, up until the point where rv_op(*args, **kwargs) are finally called to initialize a RandomVariable:

size: Optional[Size] = None,

size = convert_size(size)

create_size, ndim_expected, ndim_batch, ndim_supp = find_size(
shape=shape, size=size, ndim_supp=cls.rv_op.ndim_supp
)

Then all the tests that make use of size and are not creating RandomVariables directly will fail and have to be changed to use shape instead.

Finally the existing document on creating distributions will have to be updated to explain that PyMC distributions only accept shape, but the underlying RandomVariables or other methods that are dispatched on created RVs such as moment only accept size: https://github.com/pymc-devs/pymc/blob/main/docs/source/contributing/developer_guide_implementing_distribution.md

@michaelosthege
Copy link
Member

michaelosthege commented May 19, 2022

The biggest change is that size should no longer be accepted in the internal dist calls

Removing it from the signature doesn't mean that it's no longer accepted - it can still be in the kwargs and that's fine.

The internal code, like the size = convert_size(size) line just needs to be changed to extract it from kwargs.

Then all the tests that make use of size and are not creating RandomVariables directly will fail and have to be changed to use shape instead.

They wouldn't fail, because they'd just be passing another kwarg, right?


IMO we shouldn't make our life harder than need be.
Removing size from the signature and beginner tutorials is one thing, but refactoring the entire test suite from size back to shape is a massive undertaking and would considerably delay v4.0.0..

Remember that we currently have an implementation that's working fine on the technical level!

@ricardoV94
Copy link
Member Author

ricardoV94 commented May 19, 2022

Sure it's harder, but otherwise we will end up maintaining the two ways indefinitely.

It will never be as easy as today, but also this issue does not need to be a blocker for V4. The decision was the important thing to get done quickly.

@ricardoV94
Copy link
Member Author

ricardoV94 commented May 19, 2022

We can put a FutureWarning just for 4.0

@ricardoV94 ricardoV94 changed the title Remove size from user-facing API Remove size from distribution API May 19, 2022
@ricardoV94
Copy link
Member Author

ricardoV94 commented May 19, 2022

@pymc-devs/dev-core Rereading the discussion here it is not clear whether the consensus was towards:

  1. Hide size from docstrings / function signature / official docs but keep allowing it for "super-users".
  2. Do not allow size at all when creating a RandomVariable via a PyMC distribution.

I don't think anything is gained from 1). It will put a higher burden on reviewing official docs and eventually discourse questions related to size/shape and code that works with one and not the other will start to emerge. In the codebase we will have to keep expecting both sources of information at all stages prior to rv_op.

Again, I don't think this issue needs to be a blocker, but a clear decision between 1) and 2) should probably be done now.

@twiecki
Copy link
Member

twiecki commented May 19, 2022

I vote for 2.

@michaelosthege
Copy link
Member

I voted for 1.


Whoopsie. We didn't have consensus after all.

Shall we have a separate videocall to lay out & discuss the pros/cons and options?

I have the feeling that we're all open to changing our opinion (and many of us did that multiple times in this thread) --- we just want to make the right decision.

@twiecki
Copy link
Member

twiecki commented May 19, 2022

@michaelosthege What are the pros of having an undocumented feature that we need to continue to test and support?

@michaelosthege
Copy link
Member

@michaelosthege What are the pros of having an undocumented feature that we need to continue to test and support?

In Distribution docstring:

**kwargs
Keyword arguments that will be forwarded to ``.dist()``.
Most prominently: ``shape`` and ``size``

In the Distribution.dist docstring we actually forgot to write about the **kwargs:

**kwargs,
) -> RandomVariable:
"""Creates a RandomVariable corresponding to the `cls` distribution.
Parameters
----------
dist_params : array-like
The inputs to the `RandomVariable` `Op`.
shape : int, tuple, Variable, optional
A tuple of sizes for each dimension of the new RV.
An Ellipsis (...) may be inserted in the last position to short-hand refer to
all the dimensions that the RV would get if no shape/size/dims were passed at all.
size : int, tuple, Variable, optional
For creating the RV like in Aesara/NumPy.

So we "support" forwarding any kwargs to Aesara and their behavior is, of course, documented by Aesara.
So I understand "2. Do not allow size at all when creating a RandomVariable via a PyMC distribution." as actively discriminating against one of the Aesara kwargs (size).

@ricardoV94
Copy link
Member Author

ricardoV94 commented May 19, 2022

What's the problem with explicitly not allowing size to pass through? Since we have an alternative parameter that is slightly different than the one used by Aesara it seems fine to intercept size and raise an informative Error for the user telling they must use shape instead.

@michaelosthege
Copy link
Member

michaelosthege commented May 19, 2022

I am not that opposed to size.
For example, I can picture a scenario where we recommend/warn to use shape for univariate dists, but warn to use size for ndim_supp > 0 distributions where it actually makes more sense.
That would be a compromise that maintains backwards compat while nudging users to the "safer" multivariate API that should lead to fewer shape problems.

That's one reason why I think we shouldn't at high development cost break something that works fine.

The recommendation/warning mentioned above can always become a raise in v4.1.0 if we eventually converge on a common understanding of what should be the best practice.

@twiecki
Copy link
Member

twiecki commented May 19, 2022

@michaelosthege If you really want to go back to that discussion (I was hoping we were past that), can you refute the logic in #5754 (comment)?

@michaelosthege
Copy link
Member

@michaelosthege If you really want to go back to that discussion (I was hoping we were past that), can you refute the logic in #5754 (comment)?

From those points I just don't conclude that we should forbid the use of size on the technical level. Warning against its use, okay. Excluding it from tutorials, yes.
But in my calculation the cost of refactoring half the codebase to technically prevent passing size to Distribution and Distribution.dist is just too high.

That's why in my comment right below yours I wrote

So "remove the size kwarg" is largely about excluding it from the beginner tutorials.

I should have been more explicit, because apparently we had a different understanding of what it means to "remove the size kwarg".

I was and am in favor of removing size from signatures, but I'm opposed to banning it from kwargs.

@ricardoV94
Copy link
Member Author

ricardoV94 commented May 19, 2022

The codebase can be changed. This issue is about what is best in the long term, mostly regarding user experience.

I don't think having two formats is good for users. One should be superior by whatever tiny margin and that's the one we should offer users. There's plenty as is to confuse users about PyMC API. Soft-rules like hiding things and "making sure" not to mention size in official docs will only increase the burden on devs, and still it will get out there and confuse users.

For example, just the other day someone was asking on the Discourse if they should be using NormalMixture vs Mixture. The question is obvious if you know the internals: it doesn't matter they are the same. But for this user it was enough doubt to motivate them to open a Discourse question and perhaps stall other work they were doing. I can easily see this happening with shape vs size. Options can hurt.

@twiecki
Copy link
Member

twiecki commented May 19, 2022

I caught up with @michaelosthege offline. We both agree that size should not be part of the user API in 4.0. We could, for example, have a warning when someone tries to use size (i.e. it's only internal).

The question of whether we want to completely remove size from the code-base then becomes a technical question.

@ricardoV94: How much work would it be to remove size and would the resulting code-base be much simpler than now?

@canyon289
Copy link
Member

canyon289 commented May 19, 2022

Actually can we pause here? This issue is titled "remove size from distribution API". It seems we have reached a conclusion which is yes.

If there needs to be another discussion on whether size needs to be removed on from the codebase may we open another issue for that scope just so we can separate out differing decisions (and means we reach to get there?

If my first sentence is wrong then by all means keep discussing Size and API in this issue.

@michaelosthege
Copy link
Member

We have not reached consensus on whether "removing" means to actively prevent passing it at all, or just removing one line from the signature.

The difference is massive!

@canyon289
Copy link
Member

canyon289 commented May 20, 2022

We have not reached consensus on whether "removing" means to actively prevent passing it at all, or just removing one line from the signature.

100% agree the scope is massive discussion which is why I'm glad this is being split up between user API and internal API/functionality. Given the PR here https://github.com/pymc-devs/pymc/pull/5788/files it seems that we've codified the user API portion which is great! Now we can debate internally but users won't be so affected, which also helps get to the 4.0.0 release.

Ill now get out of the way now so the very informed folks can discuss. Thanks @michaelosthege and @twiecki for discussing the nuances and the experts here being so engaged.

@twiecki
Copy link
Member

twiecki commented May 20, 2022

Talked to @ricardoV94 and I think we are all aligned. The plan is to do two things:

  1. open an issue to raise a warning if someone were to pass size and include it in 4.0
  2. open an issue that proposes removal of size in the code base. We don't have to decide that here and it would happen at 4.1 at the soonest.

@twiecki
Copy link
Member

twiecki commented May 20, 2022

Closing this to move discussion over to the other ones.

#5790

@twiecki twiecki closed this as completed May 20, 2022
twiecki pushed a commit that referenced this issue May 23, 2022
* Remove size from .dist() signature

Closes #5754

* Align (Half)Flat signatures with superclass

* Don't mention size in the docstring

Co-authored-by: Ricardo Vieira <[email protected]>

* Revert changes in `rng_fn` signatures

Co-authored-by: Ricardo Vieira <[email protected]>

Co-authored-by: Ricardo Vieira <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants