Implement flexible shape/dims/size API #4625

michaelosthege · 2021-04-08T13:49:29Z

Terminology

implied dimensions are those that a RV takes without specifying any of shape,dims,size,observed.
The size of an RV are all dimensions in addition to (preceeding) the implied dimensions.

Key Takeaways

shape can be numeric/symbolic and supports Ellipsis in the last position. Without Ellipsis a SpecifyShape Op is added automatically.
dims can be None and are defined at first mention. Ellipsis is also supported. In combination with pm.Data the dimensions specified through dims are re-sizeable.
size does care about implied dimensions in any way and just adds new dimensions to the RV. Internally size is inferred from shape and dims.
No more resizing in Model.make_obs_var because it leaves code in Distribution.__new__ not knowing the output shape.

ToDo

Distribution.dist() and Distribution.__new__ are similar in use (dims and observed only through __new__)
Inputs are checked and informative errors are raised where possible
Release notes were updated
Some tests were edited to use size/shape appropriately -- some tests illegally used size to specify size and implied dimensions.

ToDo (optional)

Ideally, it should be possible to pass dims in combination with size. We should just require a single source of truth (dim names must be undefined for dimensions that size integers apply to).

pymc3/distributions/distribution.py

brandonwillard

This introduces a considerable amount of complexity into Distribution.__new__; the shape logic really needs its own function/method, at the very least.

Also, this PR unnecessarily reinstates the use of shape internally (e.g. numerous change from size to shape in the tests), which is the opposite of what we set out to do.

pymc3/distributions/distribution.py

pymc3/tests/test_distributions_random.py

michaelosthege · 2021-04-10T15:02:10Z

Caution: The API from aesara.tensor.random.basic Ops such as normal resembles the NumPy API:

x = aesara.shared(np.array([10,20,30,40]))
n = atr.normal(loc=x, scale=0.1, size=(2,4))     # size=(2,) does not work!

In PyMC3 this will equivalent to "inflexible" use of shape:

with pm.Model():
    x = aesara.shared(np.array([10,20,30,40]))
    pm.Normal("n1", mu=x, sd=0.1, shape=(2,4))
    pm.Normal("n2", mu=x, sd=0.1, size=(2,))

michaelosthege · 2021-04-11T00:23:47Z

With the latest push the Distribution() and Distribution.dist() APIs are now aligned and validation/warning/casting logic moved to a helper function.

dims are not available through the .dist() API until dimension name broadcasting is implemented at the Aesara Variable level.

Passing a non-Ellipsis shape like shape=(1, 2, 3, 4) is the backwards-compatible use of the API that, at the same time results in the automatic addition of specify_shape.
The SpecifyShape Op couldn't deal with scalar shapes (), so the corresponding tests will fail until the fixes from pymc-devs/pytensor#367 are available.

A BestPracticeWarning is emitted in the following scenarios:

pm.Normal(..., shape=3) → "The shape parameter should be a list or tuple."
pm.Normal(..., dims="country") → "The dims parameter should be a list or tuple."

No BestPracticeWarnings are emitted in the following scenarios:

pm.Normal(..., mu=[1,2], shape=(3, 2)) → Automatic specify_shape will alert the user when the implied shape changes.
pm.Normal(..., mu=[1,2], shape=(3, ...)) → Notation indicates that there are more than 1 dims (debatable if we should nudge towards size).

Before we invest too much time into shape/size arguments, I would like to point out that dims are 10x more useful, because they make working with the resulting InferenceData much easier.

OriolAbril · 2021-04-11T00:41:10Z

Before we invest too much time into shape/size arguments, I would like to point out that dims are 10x more useful, because they make working with the resulting InferenceData much easier.

💯

pymc3/distributions/distribution.py

brandonwillard · 2021-04-11T01:06:07Z

dims are not available through the .dist() API until dimension name broadcasting is implemented at the Aesara Variable level.

To be clear, this is not a change that we should make in Aesara. Dimension names are completely unrelated to the concerns of Aesara and we absolutely must keep the objects in that library as simple and efficient as possible.

That does not mean that we cannot consider creating a custom TensorVariable or TensorType class for this purpose in PyMC3; however, I would still hesitate to do that unless it's absolutely necessary.

For instance, why can't we just use Variable.tag to hold these dimension labels and reason with/about them in PyMC3?

brandonwillard · 2021-04-11T01:10:44Z

A BestPracticeWarning is emitted in the following scenarios:

pm.Normal(..., shape=3) → "The shape parameter should be a list or tuple."

pm.Normal(..., dims="country") → "The dims parameter should be a list or tuple."

These are superfluous and oddly strict. NumPy doesn't even complain about this; why should PyMC?

twiecki · 2021-04-11T06:20:52Z

A BestPracticeWarning is emitted in the following scenarios:

pm.Normal(..., shape=3) → "The shape parameter should be a list or tuple."

pm.Normal(..., dims="country") → "The dims parameter should be a list or tuple."

These are superfluous and oddly strict. NumPy doesn't even complain about this; why should PyMC?

I agree, I don't think a warning is required here.

michaelosthege · 2021-04-11T17:03:14Z

I agree, I don't think a warning is required here.

Okay, this leaves us with no BestPracticeWarning at all.

dims=("town", ...) now works too. The ...-implied dims are set to he default pattern of f"{name}_dim_{d}" where d counting includes the size dimensions, like it would if there weren't any dim names at all.
For those implied dim names, None is set in the Model.RV_dims[rv_name] tuple of coordinate values. @OriolAbril is that something we can work with downstream, or would you prefer a different appraoch?

pymc3/distributions/distribution.py

RELEASE-NOTES.md

OriolAbril

These improvements on dims are amazing and I personally think most user code will soon use mostly dims instead of either of shape or size.

RELEASE-NOTES.md

pymc3/distributions/distribution.py

pymc3/model.py

Also refactor into properties to add docstrings and type annotations. And no longer allow InferenceData conversion without a Model on stack. Co-authored-by: Oriol Abril Pla <[email protected]>

brandonwillard

There are a lot of unnecessary shape keyword changes in unrelated tests that need to be reverted. As well, there seem to be some similarly unrelated non-shape test changes that need to be reverted.

Otherwise, the logic in Distribution.[__new__|dist] is still more complicated than it needs to be and should be better organized and distributed between the Model and Distribution classes.

Under these changes, Distribution is taking on steps that are—and were—performed in Model, and this is likely the cause of the current failures. Without an unusually compelling reason for this previously Model-specific logic to be relocated to Distribution, these changes are not sound.

For instance, the steps in Distribution that are performed based on the value of observed are actually Model-exclusive, because observations are a Model-only concept and feature. By moving—or adding—parts of that logic into Distribution, we fragment our logic and violate the abstractions that underlie these classes. The end result is code that's harder to follow—because the underlying abstractions are much less relevant to the actual implementation—and exceedingly more difficult to maintain and extend, due to a less manageable back-and-forth between the two classes.

In the end, these changes should only require a single call in Distribution.__new__ to a function that—at maximum—takes the size, shape, dims, rv_op, and distribution parameters as arguments, determines the appropriate size to be passed to Distribution.dist, and that's all.

This function could be a method provided by Model, which would make it reasonable to also perform any relevant Model updates that involve dims within it.

The only step that seems like it could/should be in Distribution is the application of specify_shape when shape is used, but I'm not seeing that in this diff.

RELEASE-NOTES.md

pymc3/distributions/distribution.py

brandonwillard · 2021-04-19T23:16:10Z

pymc3/distributions/distribution.py

+        if dims is not None:
+            if Ellipsis in dims:
+                # Auto-complete the dims tuple to the full length
+                dims = (*dims[:-1], *[None] * rv_out.ndim)


Is the RandomVariable created first, and then resized, only because rv_out.ndim might be needed in this step?

Regardless, rv_out.ndim is obtainable without first creating the underlying RandomVariable.

brandonwillard · 2021-04-19T23:30:26Z

pymc3/distributions/distribution.py

+            n_resize = len(dims) - n_implied

-        if "shape" in kwargs:
-            raise DeprecationWarning("The `shape` keyword is deprecated; use `size`.")
+            # All resize dims must be known already (numerically or symbolically).
+            unknown_resize_dims = set(dims[:n_resize]) - set(model.dim_lengths)
+            if unknown_resize_dims:
+                raise KeyError(
+                    f"Dimensions {unknown_resize_dims} are unknown to the model and cannot be used to specify a `size`."
+                )

-        transform = kwargs.pop("transform", UNSET)
+            # The numeric/symbolic resize tuple can be created using model.RV_dim_lengths
+            resize = tuple(model.dim_lengths[dname] for dname in dims[:n_resize])
+        elif observed is not None:
+            if not hasattr(observed, "shape"):
+                observed = pandas_to_array(observed)
+            n_resize = observed.ndim - n_implied
+            resize = tuple(observed.shape[d] for d in range(n_resize))
+
+        if resize:
+            # A batch size was specified through `dims`, or implied by `observed`.
+            rv_out = change_rv_size(rv_var=rv_out, new_size=resize, expand=True)
+
+        if dims is not None:
+            # Now that we have a handle on the output RV, we can register named implied dimensions that
+            # were not yet known to the model, such that they can be used for size further downstream.
+            for di, dname in enumerate(dims[n_resize:]):
+                if not dname in model.dim_lengths:
+                    model.add_coord(dname, values=None, length=rv_out.shape[n_resize + di])


Again, this logic needs to be factored out into a simple and clear function that—for instance—computes an appropriate value of size (to be passed to Distribution.dist) using values from dims or shape.

This is especially important given that this code also updates the model's coordinates, making it easily more relevant as a method in Model.

In other words, it's best to keep the Model-specific logic in Model, and not spread out across Distribution and Model.

Also, the steps related to observed are apparently duplicating some of the steps taken later in Model.make_obs_var. This might be yet another good reason for moving this logic into Model.

brandonwillard · 2021-04-19T23:43:06Z

pymc3/distributions/distribution.py

+        # Create the RV without specifying size or testval.
+        # The size will be expanded later (if necessary) and only then the testval fits.
+        rv_native = cls.rv_op(*dist_params, size=None, **kwargs)


This demonstrates a way to obtain the symbolic shape for a RandomVariable without actually creating one.

pymc3/tests/test_shape_handling.py

brandonwillard · 2021-04-20T00:17:52Z

pymc3/tests/test_step.py

@@ -982,7 +982,7 @@ def test_linalg(self, caplog):
            a = Normal("a", size=2, testval=floatX(np.zeros(2)))
            a = at.switch(a > 0, np.inf, a)
            b = at.slinalg.solve(floatX(np.eye(2)), a)
-            Normal("c", mu=b, size=2, testval=floatX(np.r_[0.0, 0.0]))
+            Normal("c", mu=b, shape=(2,), testval=floatX(np.r_[0.0, 0.0]))


Suggested change

Normal("c", mu=b, shape=(2,), testval=floatX(np.r_[0.0, 0.0]))

Normal("c", mu=b, size=2, testval=floatX(np.r_[0.0, 0.0]))

Same as with other examples: The way size was used before the testval shape does not match. Like with other examples, this was likely an unintended change of behavior from the original test when it was switched to size.

brandonwillard · 2021-04-20T00:19:10Z

pymc3/tests/test_transforms.py

+    def build_model(self, distfam, params, *, size=None, shape=None, transform=None, testval=None):
        if testval is not None:
            testval = pm.floatX(testval)
        with pm.Model() as m:
-            distfam("x", size=size, transform=transform, testval=testval, **params)
+            distfam("x", size=size, shape=shape, transform=transform, testval=testval, **params)


All of the changes in this module need to be reverted; they have nothing to do with the shape or dims feature, nor should they.

All of the changes in this module are because many of these tests are actually parametrized by shape.
Consider this test case from before the change:

@pytest.mark.parametrize( "a,b,size", [ (1.0, 1.0, (2,)), # scalar is implied; 2 replicate dimensions (np.ones(3), np.ones(3), (4, 3)), # (3,) is implied; (4, 3) replicate dimensions ], ) def test_beta_ordered(self, a, b, size): testval = np.sort(np.abs(np.random.rand(*size))) # but the testval is at most 2-dimensional model = self.build_model( pm.Beta, {"alpha": a, "beta": b}, size=size, # in the second parametrize case the RV is expanded to (4, 3, 3) testval=testval, # but the testval is only (4, 3) transform=tr.Chain([tr.logodds, tr.ordered]), ) self.check_vectortransform_elementwise_logp(model, vect_opt=0)

Comparing with master we can see that the test was up to 2-dimensional to begin with. The change to size had changed the behaviour of the test - I'm just reverting it to what was originally intended & tested.
https://github.com/pymc-devs/pymc3/blob/2dee9dafb3541043ae23b66fe9262e70fbd2769a/pymc3/tests/test_transforms.py#L405-L421

Yes, the test could be refactored to use size, by either replicating the shape←→size logic, or parametrizing the test with both shape and size manually. That would be a lot of unnecessary work and make the tests harder to work with!
Since this PR is about restoring shape as a backwards-compatible and fully supported way to parametrize RVs I do not see the point in making our lives harder by notoriously avoiding it in tests like this one.

I checked the testvals for this test in v3 and v4 and they're both the same. The symbolic types for the resulting variables are also the same between v3 and v4, so I don't understand what's changed here—aside from the switch to size.

More importantly, this is a scalar variable, so size and shape should be the same. If they're not, then that's an altogether different issue that needs to be addressed separately.

Regardless, the change here is still an unnecessary reversion that forces these tests to use an irrelevant and newly added feature.

Always treats `size` as being in addition to dimensions implied by RV parameters. All resizing beyond parameter-implied dimensionality is done from: - `shape` or `size` in `Distribution.dist()` - `dims` or `observed` in `Distribution.__new__` and only in those two places. Closes pymc-devs#4552.

michaelosthege · 2021-04-20T21:18:34Z

The test about #4652 XPASSed: https://github.com/pymc-devs/pymc3/pull/4625/checks?check_run_id=2394141812#step:7:3572

It shouldn't have done that - looks like https://github.com/pymc-devs/pymc3/blob/v4/pyproject.toml#L2 isn't working.

With respect to the other changes...

Reasons for why some tests were changed are thoroughly commented above (see threads).

Regarding other options to obtain implied dimensionality information, I'd like to quote @brandonwillard:

it's generally cheap to make these graphs, so you can always construct one just for information
also, graph construction isn't a "run-time" thing, so it doesn't really affect performance

So that's how I implemented it. There might be other ways, but given that it's cheap, that it works and that it restores the important backwards compatibility while bringing more flexibility I don't think we should delay this PR any further.

I think the shape/dims/size/size_from_observed logic from the two Distribution APIs should not be split across distribution.py/model.py. Not only do they reference the same helper functions, but the Distribution class is also the point of entry. After moving the determination of size from observed also into Distribution.__new__ the model.py is always given RVs with the correct dimensionality already and doesn't even import change_rv_size.
Factoring out the logic within Distribution.__new__ can possibly be addressed in a follow-up PR that makes dims work alongside shape/size (see PR description).

So from my point of view this PR is all good cc @brandonwillard @OriolAbril @twiecki

OriolAbril

I mostly agree with @michaelosthege last comment, but my opinion on model/distribution topics probably should not matter much as I only have a loose understanding their internals.

I do want to add an extra point on that however, as a relative sideliner. I feel that without a clear/detailed roadmap (that I know of) and with things like pymc-devs/pytensor#352 on the air, I am not sure it's possible at this point to know what is the correct level of abstraction for each thing. There seem to be arguments to move dims in either direction, towards dist and a better integration with Aesara or towards Model.

And last but definitely not least, I want to say that the more I see about the capabilities that v4 is improving/adding/extending the more amazed I am with it. Thanks to everyone that is pushing hard on this and leading the effort.

twiecki · 2021-04-20T21:45:49Z

I agree, I think this looks great. Agree with @OriolAbril that at this point it's not clear what the right level of abstraction is and I don't see a clear winner. We can always revisit when we find out later that we can improve things.

michaelosthege added enhancements shape problem v4 labels Apr 8, 2021

michaelosthege added this to the vNext (4.0.0) milestone Apr 8, 2021

michaelosthege force-pushed the v4-4552 branch 4 times, most recently from fa43a1a to bd348fb Compare April 8, 2021 20:30

OriolAbril reviewed Apr 8, 2021

View reviewed changes

pymc3/distributions/distribution.py Outdated Show resolved Hide resolved

pymc3/distributions/distribution.py Outdated Show resolved Hide resolved

brandonwillard assigned michaelosthege Apr 8, 2021

brandonwillard suggested changes Apr 9, 2021

View reviewed changes

pymc3/distributions/distribution.py Outdated Show resolved Hide resolved

pymc3/distributions/distribution.py Outdated Show resolved Hide resolved

pymc3/distributions/distribution.py Outdated Show resolved Hide resolved

pymc3/tests/test_distributions_random.py Show resolved Hide resolved

michaelosthege force-pushed the v4-4552 branch 3 times, most recently from f6c4602 to d3fde48 Compare April 11, 2021 00:07

OriolAbril requested changes Apr 11, 2021

View reviewed changes

pymc3/distributions/distribution.py Outdated Show resolved Hide resolved

michaelosthege force-pushed the v4-4552 branch 2 times, most recently from da76513 to 4c60766 Compare April 11, 2021 16:56

OriolAbril reviewed Apr 11, 2021

View reviewed changes

pymc3/distributions/distribution.py Outdated Show resolved Hide resolved

pymc3/distributions/distribution.py Outdated Show resolved Hide resolved

matteo-pallini mentioned this pull request Apr 15, 2021

V4 update test framework for distributions random method 2nd attempt #4608

Merged

3 tasks

brandonwillard suggested changes Apr 15, 2021

View reviewed changes

RELEASE-NOTES.md Outdated Show resolved Hide resolved

michaelosthege force-pushed the v4-4552 branch 2 times, most recently from 66d5fa7 to 5da6006 Compare April 18, 2021 14:55

OriolAbril previously approved these changes Apr 18, 2021

View reviewed changes

michaelosthege force-pushed the v4-4552 branch 2 times, most recently from e478bce to 9afc5ef Compare April 18, 2021 20:06

Allow unnamed (None) dims and undefined (None) coord values

fd3f730

Also refactor into properties to add docstrings and type annotations. And no longer allow InferenceData conversion without a Model on stack. Co-authored-by: Oriol Abril Pla <[email protected]>

michaelosthege force-pushed the v4-4552 branch from 9afc5ef to fd3f730 Compare April 18, 2021 20:18

michaelosthege modified the milestones: vNext (4.0.0), Merge v4 into master Apr 18, 2021

michaelosthege force-pushed the v4-4552 branch 4 times, most recently from 5b3acb5 to f4d8246 Compare April 19, 2021 23:07

brandonwillard suggested changes Apr 20, 2021

View reviewed changes

michaelosthege force-pushed the v4-4552 branch from f4d8246 to 7837696 Compare April 20, 2021 14:26

michaelosthege force-pushed the v4-4552 branch from 7837696 to 3ce51ab Compare April 20, 2021 14:33

michaelosthege mentioned this pull request Apr 20, 2021

Column vector breaks observed #4652

Closed

Automatically add SpecifyShape Op when full-length shape is given

5007d97

michaelosthege force-pushed the v4-4552 branch from 48722d1 to 5007d97 Compare April 20, 2021 19:40

OriolAbril approved these changes Apr 20, 2021

View reviewed changes

twiecki approved these changes Apr 20, 2021

View reviewed changes

twiecki marked this pull request as ready for review April 20, 2021 21:44

twiecki merged commit c99f15c into pymc-devs:v4 Apr 20, 2021

This was referenced Apr 24, 2021

Restore pre-4625 way how the size kwarg works #4662

Closed

Revert size kwarg behaviour and make it work with Ellipsis #4667

Closed

Add a test for observed data with a broadcast dimension #4654

Closed

brandonwillard mentioned this pull request May 13, 2021

Revert breaking size changes #4693

Merged

OriolAbril mentioned this pull request May 17, 2021

Implement backwards-compatible shape and dims with Ellipsis support #4696

Merged

michaelosthege deleted the v4-4552 branch August 7, 2021 11:21

falkmielke mentioned this pull request Sep 23, 2021

Refactored LKJCorr and _LKJCholeskyCov #4784

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement flexible shape/dims/size API #4625

Implement flexible shape/dims/size API #4625

michaelosthege commented Apr 8, 2021 •

edited

Loading

brandonwillard left a comment

michaelosthege commented Apr 10, 2021

michaelosthege commented Apr 11, 2021 •

edited

Loading

OriolAbril commented Apr 11, 2021

brandonwillard commented Apr 11, 2021 •

edited

Loading

brandonwillard commented Apr 11, 2021

twiecki commented Apr 11, 2021

michaelosthege commented Apr 11, 2021

OriolAbril left a comment

brandonwillard left a comment

brandonwillard Apr 19, 2021

brandonwillard Apr 19, 2021 •

edited

Loading

brandonwillard Apr 19, 2021

brandonwillard Apr 20, 2021

michaelosthege Apr 20, 2021 •

edited

Loading

brandonwillard Apr 20, 2021

michaelosthege Apr 20, 2021

brandonwillard Apr 20, 2021

michaelosthege commented Apr 20, 2021

OriolAbril left a comment

twiecki commented Apr 20, 2021

	Normal("c", mu=b, shape=(2,), testval=floatX(np.r_[0.0, 0.0]))
	Normal("c", mu=b, size=2, testval=floatX(np.r_[0.0, 0.0]))

Implement flexible shape/dims/size API #4625

Implement flexible shape/dims/size API #4625

Conversation

michaelosthege commented Apr 8, 2021 • edited Loading

Terminology

Key Takeaways

ToDo

ToDo (optional)

brandonwillard left a comment

Choose a reason for hiding this comment

michaelosthege commented Apr 10, 2021

michaelosthege commented Apr 11, 2021 • edited Loading

OriolAbril commented Apr 11, 2021

brandonwillard commented Apr 11, 2021 • edited Loading

brandonwillard commented Apr 11, 2021

twiecki commented Apr 11, 2021

michaelosthege commented Apr 11, 2021

OriolAbril left a comment

Choose a reason for hiding this comment

brandonwillard left a comment

Choose a reason for hiding this comment

brandonwillard Apr 19, 2021

Choose a reason for hiding this comment

brandonwillard Apr 19, 2021 • edited Loading

Choose a reason for hiding this comment

brandonwillard Apr 19, 2021

Choose a reason for hiding this comment

brandonwillard Apr 20, 2021

Choose a reason for hiding this comment

michaelosthege Apr 20, 2021 • edited Loading

Choose a reason for hiding this comment

brandonwillard Apr 20, 2021

Choose a reason for hiding this comment

michaelosthege Apr 20, 2021

Choose a reason for hiding this comment

brandonwillard Apr 20, 2021

Choose a reason for hiding this comment

michaelosthege commented Apr 20, 2021

OriolAbril left a comment

Choose a reason for hiding this comment

twiecki commented Apr 20, 2021

michaelosthege commented Apr 8, 2021 •

edited

Loading

michaelosthege commented Apr 11, 2021 •

edited

Loading

brandonwillard commented Apr 11, 2021 •

edited

Loading

brandonwillard Apr 19, 2021 •

edited

Loading

michaelosthege Apr 20, 2021 •

edited

Loading