Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coords for pairs of variables. #1627

Closed
pscicluna opened this issue Mar 22, 2021 · 6 comments
Closed

coords for pairs of variables. #1627

pscicluna opened this issue Mar 22, 2021 · 6 comments

Comments

@pscicluna
Copy link

pscicluna commented Mar 22, 2021

Short Description

HI! I'm trying to understand how I can make the coords keyword of many arviz plotting routines to behave in a certain way. I hope explain the problem well enough, I'll gladly try to clarify if not.

To outline my problem, I'm inferring the contents of a correlation matrix with PyMC3. Now, because the matrix is symmetrical, I only want to use the upper (or lower, if it makes a difference) off-diagonal elements, that is to use elements [0,1], [1,2], [0,2] if it's a 3-dimensional problem.

However, plot_pair seems to interpret coords as a "slice" along a dimension, such that if I set coords to [0,1] on dimension one of the correlation matrix and [1,2] on dimension two, I will also get the plots for [1,1]. Is there any way that I can leave this one out automatically, or do I have to manually iterate over a set of axes to get this?

Code Example or link

Here's a short example, you can find a more complete version of the code for context in this repo.

import numpy as np
import pymc3 as pm
import arviz as az


mu = np.array([10., 30.])
sigma = np.array([20., 40.])
rho = -0.7
cov = np.zeros((len(mu), len(mu)))
cov[0,:] = [sigma[0]**2, rho*sigma[0]*sigma[1]]
cov[1,:] = [rho*sigma[1]*sigma[0], sigma[1]**2]

data = np.random.multivariate_normal(mu, cov, size=100)
ndim = data.shape[1]

with pm.Model() as model:
            #we put weakly informative priors on the means and standard deviations of the multivariate normal distribution
            mu = pm.Normal("mu", mu=mu_prior[0], sigma=mu_prior[1], shape=ndim)
            sigma = pm.HalfCauchy.dist(sigma_prior)
            #and a prior on the covariance matrix which weakly penalises strong correlations
            chol, corr, stds = pm.LKJCholeskyCov("chol", n=ndim, eta=2.0, sd_dist=sigma, compute_corr=True)
            #the prior gives us the Cholesky Decomposition of the covariance matrix, so for completeness we can calculate that determinisitically
            cov = pm.Deterministic("cov", chol.dot(chol.T))

            #and now we can put our observed values into a multivariate normal to complete the model
            vals = pm.MvNormal('vals', mu=mu, chol=chol, observed=data)

trace = pm.sample(
                steps, tune=tune, target_accept=0.9, compute_convergence_checks=False,return_inferencedata=True
            )

coords = {"chol_corr_dim_0":[0], "chol_corr_dim_1":[1]} #For 2-D data I can do this
                        #But it's not clear how I can just get the upper triangle in 3 or more dimensions

plot_vars = ['mu', 'chol_corr']
az.plot_pair(trace,
                     var_names = plot_vars,
                     coords = coords,
                     kind="kde",
                     marginals=True,
                     point_estimate="mean",
                     show=True,
            )


Thanks in advance for any input you can give me!

@OriolAbril
Copy link
Member

Pointwise or non-slice selection with xarray is something that always feels very complicated to me, with very obscure docs. You should be able to perform pointwise selection using DataArrays whose dimension is not present in the xarray object you want to index. Here is an example.

Our dataset has 4 dimensions, and we want to perform pointwise selection with values over 2 of them (I think it should work with as many dims as desired though).

import xarray as xr
import arviz as az

ds = az.dict_to_dataset(
    {"s": np.random.normal(size=(4, 100, 3, 2))}, 
    dims={"s": ["time", "sensor"]}, 
    coords={"sensor": ["A", "B"]}
)

Output:

<xarray.Dataset>
Dimensions:  (chain: 4, draw: 100, pointwise_sel: 3)
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
    time     (pointwise_sel) int64 0 2 2
    sensor   (pointwise_sel) <U1 'A' 'A' 'B'
Dimensions without coordinates: pointwise_sel
Data variables:
    s        (chain, draw, pointwise_sel) float64 2.155 -0.3476 ... 0.305

We can now select a subset with:

coords = {
     'sensor': xr.DataArray(['A', 'A', 'B'], dims=['pointwise_sel']),
     'time': xr.DataArray([0, 2, 2], dims=['pointwise_sel'])
 }
ds.sel(coords)

whose output is:

<xarray.Dataset>
Dimensions:  (chain: 4, draw: 100, pointwise_sel: 3)
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
    time     (pointwise_sel) int64 0 2 2
    sensor   (pointwise_sel) <U1 'A' 'A' 'B'
Dimensions without coordinates: pointwise_sel
Data variables:
    s        (chain, draw, pointwise_sel) float64 2.155 -0.3476 ... 0.305

ArviZ uses .sel under the hood, so using this code below will work

az.plot_posterior(ds, coords=coords)

image

There is one important caveat though, which is that now the original coordinate values that we used for indexing have been converted to coordinated without dimension, and they are not shown in the plot. As a comparison:

az.plot_posterior(ds)

image

With the latest (yet to be released at the time of writing) updates to ArviZ, you could define custom labellers to take care of that. See the label guide for info on labellers and installation guide for info on installing the development version.

And thanks for reporting! I think I'll use this as the example for the last section of the label guide 😄

The emerging idea/pattern could be:

subset_ds = ds.sel(coords)
# some extra work to get the "proper" labels
labeller = # define labeller with the processed ds
az.plot_xyz(subset_ds, labeller=labeller)

@pscicluna
Copy link
Author

Thanks, that's awesome! I'll give that a try

Any idea if it's possible to hack some sort of custom labelling in v0.11.0? I'm stuck with that to make arviz play nice with PyMC3...

@OriolAbril
Copy link
Member

I don't think it's possible :/.

What may be easier is to get latest ArviZ to play nice with old PyMC3 if you are interested in submitting a PR for that. I think that the only breaking incompatibility is that we removed geweke from arviz/stats/diagnostics.py, it had been basically broken for a while, it only worked on 1d arrays which little to no arviz data are, nobody complained about this which for us meant that nobody was using it and we broke no tests removing it (we only test on latest release and dev pymc3 versions).

If you are willing to take this on and locally test on your older pymc3 versions, submitting a PR to add geweke back with no code in the function, only raising a not implemented error should fix the compatibility issue and allow to work with latest arviz as long as geweke is not used.

@pscicluna
Copy link
Author

Ah, okay, thanks for explaining that. If I get the chance, I will attempt the PR, but seeing as this is quite a new package I might as well also force an update to the latest versions of pymc and arviz once the custom labellers are available and I have figured out how to use them :)

Thanks for all the help!

@OriolAbril OriolAbril changed the title plot_pair: coords for pairs of variables. coords for pairs of variables. Mar 26, 2021
@OriolAbril
Copy link
Member

I have added a PR to extend the label guide and show how to keep the right labels #1635

@OriolAbril
Copy link
Member

I'm trying to do some clean up, so I'll close this as it seems resolved, feel free to reopen if it is not the case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants