This repository has been archived by the owner on Sep 10, 2023. It is now read-only.

Varying Slopes #1

Closed
joshualeond opened this issue Oct 1, 2019 · 28 comments

Comments

@joshualeond

Hi,

Thanks for putting this work out in the open. I just recently finished McElreath's excellent book and was attempting to translate some of the models over to Julia. I am having a ton of trouble figuring out how to specify a multilevel model in Turing with the correct Multivariate Normal prior for the varying intercepts and slopes like in the cafes model m13.1 in the book.

Do you happen to have any examples of a varying intercept/slope model that uses the MvNormal prior?

Thanks again!

@goedman
Member

goedman commented Oct 2, 2019

Hi Josh, unfortunately I can't help you with this. I believe last year when Richard worked on the Turing versions of the models the LKJ Correlation distribution was not available. I'm not sure if that has changed since then.

My plan is to revisit TuringModels.jl and update the models to AdvancedHMC after I have completed updating StatisticalRethinking.jl to the 2nd edition of the book.

The SR update will definitely take a chunk of time (I expect it to run until early next year), which delays the revision of TuringModels.jl.

For this specific model/example, you could check with the Turing team, e.g. on Slack, as they are always very helpful.

@joshualeond
Author

Thanks for the response. Yes, I noticed there was no LKJ distribution available in Turing and ended up asking about it in the Slack channel. They recommended opening an issue on the Turing repo, which I plan on doing soon. I had assumed it might be possible to get a similar solution by using a different prior on the correlations. I've seen the Inverse Wishart being used here, and I believe the Beta and Truncated Uniform being used here. I wasn't sure how to apply it to the examples in Rethinking, though; a rough sketch of what I was imagining is below. Good luck with the 2nd edition updates!
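
Here's that sketch, completely untested on my end: an Inverse Wishart prior placed directly on the covariance matrix instead of decomposing it into scales and a correlation matrix (the model name and the weak identity scale matrix are just placeholders):

using Turing, Distributions, LinearAlgebra

# Rough, untested sketch: put an InverseWishart prior on the covariance
# matrix itself, sidestepping the need for an LKJ correlation prior.
@model mvn_invwishart(x) = begin
    N, D = size(x)
    mu ~ MvNormal(zeros(D), 10.0)
    # degrees of freedom must exceed D - 1; identity scale keeps the prior weak
    Sigma ~ InverseWishart(D + 1, Matrix{Float64}(I, D, D))
    for i in 1:N
        x[i, :] ~ MvNormal(mu, Sigma)
    end
end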

@goedman
Member

goedman commented Oct 2, 2019

Hi Chris (@itsdfish), have you come across this and maybe used a different distribution for it in Turing?

@itsdfish

itsdfish commented Oct 2, 2019

I have not used a multilevel model in Turing that specifies the relationship between coefficients. However, I believe Tamas has a version of the LKJ distribution in his AltDistributions.jl package. In principle, it should interface with Turing without any modification.

@joshualeond
Author

Thanks Chris, I tested it out with a simple example and Turing didn't like that the LKJL type wasn't a subtype of the Distribution type:

 Error: Right-hand side of a ~ must be subtype of Distribution or a vector of Distributions on line 96.

I'll spend some more time with it though. I plan on opening an issue on the Turing repo and I'll definitely reference the AltDistributions package.

@itsdfish

itsdfish commented Oct 2, 2019

Ah I see. It looks like the problem is that LKJL is not a subtype of ContinuousMultivariateDistribution. It might be worth filing an issue with Tamas to see if he can modify the package. I'm not sure if you also need a corresponding rand function for LKJL.

If you don't need a rand function, a quick fix might be something like this:

using ArgCheck, Distributions 
import Distributions: logpdf
struct LKJL{T <: Real} <: ContinuousMultivariateDistribution
    η::T
    function LKJL(η::T) where T <: Real
        @argcheck η > 0
        new{T}(η)
    end
end

function logpdf(d::LKJL, L::Union{AbstractTriangular, Diagonal})
    @unpack η = d
    z = diag(L)
    n = size(L, 1)
    sum(log.(z) .* ((n:-1:1) .+ 2*(η-1))) + log(2) * n
end

@joshualeond
Author

joshualeond commented Oct 2, 2019

I had to make a couple of modifications to your code, just bringing in a few more packages:

using ArgCheck, Parameters, Distributions, LinearAlgebra
import Distributions.logpdf
import LinearAlgebra: AbstractTriangular

struct LKJL{T <: Real} <: ContinuousMultivariateDistribution
    η::T
    function LKJL(η::T) where T <: Real
        @argcheck η > 0
        new{T}(η)
    end
end

function logpdf(d::LKJL, L::Union{AbstractTriangular, Diagonal})
    @unpack η = d
    z = diag(L)
    n = size(L, 1)
    sum(log.(z) .* ((n:-1:1) .+ 2*(η-1))) + log(2) * n
end

But still ended up getting hit with another error when I went to sample:

ERROR: MethodError: no method matching length(::LKJL{Float64})

So maybe another method needs to be added for this to work.

@itsdfish

itsdfish commented Oct 2, 2019

Would you be able to post a MWE? If so, I'll take a look.

@joshualeond
Author

Sure, I've just modified this example I saw in another Turing GitHub issue. I'll post what I have so far using your suggestions:

using Turing, Random

Random.seed!(1);

# define the LKJ distribution
using ArgCheck, Parameters, Distributions, LinearAlgebra
import Distributions.logpdf
import LinearAlgebra: AbstractTriangular

struct LKJL{T <: Real} <: ContinuousMultivariateDistribution
    η::T
    function LKJL(η::T) where T <: Real
        @argcheck η > 0
        new{T}(η)
    end
end

function logpdf(d::LKJL, L::Union{AbstractTriangular, Diagonal})
    @unpack η = d
    z = diag(L)
    n = size(L, 1)
    sum(log.(z) .* ((n:-1:1) .+ 2*(η-1))) + log(2) * n
end


# Specify Turing model.
@model correlation(x, N) = begin
    # Create mu variables.
    mu = TArray{Any}(2)
    mu[1] ~ TruncatedNormal(0, 100, -10, 10)
    mu[2] ~ TruncatedNormal(0, 100, -10, 10)

    # Create sigma variables.
    sigma = TArray{Any}(2)
    sigma[1] ~ TruncatedNormal(0, 100, 0, 10)
    sigma[2] ~ TruncatedNormal(0, 100, 0, 10)

    # Create rho.
    rho ~ LKJL(2.0)

    # Generate covariance matrix.
    p = sigma[1]*sigma[2]*rho
    cov = [sigma[1]^2 p;
            p sigma[2]^2]

    # Iterate through each datapoint, and observe its likelihood.
    for i in 1:N
        x[i,:] ~ MvNormal([mu[1], mu[2]], cov)
    end
end

# Helper function to generate a covariance matrix.
function tcovar(sigma, rho)
    p = sigma[1]*sigma[2]*rho
    return [sigma[1]^2 p;
            p sigma[2]^2]
end

# Number of datapoints to generate.
N = 1000

# Target covariance matrix.
target_covariance = tcovar([1, 10], 0.5)

# Generate random data.
target_dist = MvNormal([0.0, 0.0], target_covariance)
df = rand(target_dist, N)'

# Sample using HMC.
chain = sample(correlation(df, N), HMC(0.01, 5), 1000)

Note that I'm using the master branch of Turing, so the sample function arguments are slightly different from the current release: the n_samples argument has been moved from the sampler to the sample function. Also, I'm using Julia 1.3.0-rc2.0.

@itsdfish

itsdfish commented Oct 2, 2019

Hmmm. I'm not very familiar with LKJ distributions, but it seems like there might be a type conflict: rho appears to be a scalar, but LKJL is expecting an AbstractTriangular or Diagonal, both of which are matrix types (I think). I'm not sure how you would set something like that up in Turing. This might be more complex than I thought. It might be worth reporting the example above along with the issue you opened. Sorry I couldn't be of more help!
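
That said, for the two-dimensional case a possible workaround (just an untested sketch on my end) would be to skip the matrix-valued prior entirely and put a flat prior directly on a scalar correlation:

using Turing, Distributions

# Untested sketch for D = 2: a Uniform(-1, 1) prior on a scalar correlation
# avoids matrix-valued random variables altogether.
@model correlation_scalar(x, N) = begin
    mu = TArray{Any}(2)
    mu[1] ~ TruncatedNormal(0, 100, -10, 10)
    mu[2] ~ TruncatedNormal(0, 100, -10, 10)

    sigma = TArray{Any}(2)
    sigma[1] ~ TruncatedNormal(0, 100, 0, 10)
    sigma[2] ~ TruncatedNormal(0, 100, 0, 10)

    rho ~ Uniform(-1, 1)

    p = sigma[1] * sigma[2] * rho
    cov = [sigma[1]^2 p; p sigma[2]^2]

    for i in 1:N
        x[i, :] ~ MvNormal([mu[1], mu[2]], cov)
    end
end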

@joshualeond
Author

No problem, I appreciate the help though. Thanks!

@trappmartin
Collaborator

I'll have a look at it. @goedman thanks for pointing me to this.

@joshualeond
Author

Hey @trappmartin, I just noticed that the LKJ distribution is now in Distributions.jl: JuliaStats/Distributions.jl#1066. I tried adjusting my example above to use the new LKJ distribution but am running into Cholesky factorization/dimension issues. However, that's likely user error.

@trappmartin
Collaborator

Cool, could you post an example here or open an issue on Turing.jl, please? I haven't had time to look into it. Too many things going on atm.

@joshualeond
Author

Sure thing, here's my adjusted version with the new LKJ distribution from Distributions.jl:

using Turing, Distributions, LinearAlgebra

@model correlation(x, N) = begin
    # Create mu variables.
    mu = TArray{Any}(2)
    mu[1] ~ truncated(Normal(0, 100), -10, 10)
    mu[2] ~ truncated(Normal(0, 100), -10, 10)

    # Create sigma variables.
    sigma = TArray{Any}(2)
    sigma[1] ~ truncated(Normal(0, 100), 0, 10)
    sigma[2] ~ truncated(Normal(0, 100), 0, 10)

    # prior on the correlation matrix
    rho ~ LKJ(2, 2)

    # Generate covariance matrix.
    cov = [sigma[1] 0; 0 sigma[2]] * rho * [sigma[1] 0; 0 sigma[2]]

    # Iterate through each datapoint, and observe its likelihood.
    for i in 1:N
        x[i,:] ~ MvNormal([mu[1], mu[2]], cov)
    end
end

# Helper function to generate a covariance matrix.
function tcovar(sigma, rho)
    p = sigma[1]*sigma[2]*rho
    return [sigma[1]^2 p;
            p sigma[2]^2]
end

# Number of datapoints to generate.
N = 1000

# Target covariance matrix.
target_covariance = tcovar([1, 10], 0.5)

# Generate random data.
target_dist = MvNormal([0.0, 0.0], target_covariance)
df = rand(target_dist, N)'

# Sample
chain = sample(correlation(df, N), NUTS(), 1000)

With this example I'm currently seeing the following error message:

ERROR: MethodError: no method matching bijector(::LKJ{Float64,Int64})
Closest candidates are:
  bijector(::KSOneSided) at /Users/joshualeond/.julia/packages/Bijectors/bHaf6/src/transformed_distribution.jl:58
  bijector(::Product{Discrete,T,V} where V<:AbstractArray{T,1} where T<:Distribution{Univariate,Discrete}) at /Users/joshualeond/.julia/packages/Bijectors/bHaf6/src/transformed_distribution.jl:39
  bijector(::Product{Continuous,T,Tdists} where Tdists<:(FillArrays.Fill{T,1,Axes} where Axes) where T<:Distribution{Univariate,Continuous}) at /Users/joshualeond/.julia/packages/Bijectors/bHaf6/src/compat/distributionsad.jl:16

Does the LKJ distribution also have to be added to Bijectors.jl in order to be used by Turing? Also, this isn't the only error message I've seen. Sometimes I see the following message as well:

ERROR: PosDefException: matrix is not Hermitian; Cholesky factorization failed.

Any help would be appreciated.

@joshualeond
Author

Just wanted to follow up here. I've opened an issue on Bijectors.jl:
TuringLang/Bijectors.jl#108.

As soon as the LKJ is usable in Turing I'll take a stab at a Turing version of this Rethinking model.

@goedman
Member

goedman commented May 28, 2020

That would be great, particularly now that we're seriously looking into a Turing version of StatisticalRethinking.

@trappmartin
Collaborator

This:

using Turing
using LinearAlgebra

@model function correlation(y)
    N,D = size(y)
    
    mu ~ filldist(truncated(Normal(0, 100), -10, 10), D)
    sigma ~ filldist(truncated(Normal(0, 100), 0, 10), D)

    # prior on the correlation matrix
    rho ~ LKJ(D,D)
    L = Diagonal(sigma) * rho
    
    # Iterate through each datapoint, and observe its likelihood.
    for i in 1:N
        y[i,:] ~ MvNormal(mu , L*L')
    end
    return L*L'
end

cov = [ 4.94072998 -4.93536067; -4.93536067 5.99552455 ]
x = rand(MvNormal([0.0, 1.0], cov), 10)

model = correlation(x')
sample(model, HMC(0.01,5), 1000)

seems to work out of the box.

@trappmartin
Collaborator

trappmartin commented May 28, 2020

I guess it will work a bit better (higher ESS) if you use the bijection described by Mohamed in TuringLang/Bijectors.jl#108.

https://gist.github.com/trappmartin/a8ceb7e7b1ce20ac737ab599da432469

@joshualeond
Author

Thanks for the example @trappmartin. I was able to run through your example but noticed the following error if I switched the sampler to NUTS():

ERROR: PosDefException: matrix is not Hermitian; Cholesky factorization failed.

@trappmartin
Collaborator

Did you observe the same when using the bijection in the gist?

@joshualeond
Author

Yes, unfortunately it's the same with the bijection and NUTS. The HMC estimation might be having issues as well: low ESS and inconsistent estimates. If I increase the size of the randomly generated data (to 100) and the chain length (to 2000), I end up seeing some pretty odd diagnostic info:

Summary Statistics
  parameters     mean     std  naive_se    mcse     ess   r_hat
  ──────────  ───────  ──────  ────────  ──────  ──────  ──────
       mu[1]  -5.4722  0.0000    0.0000  0.0000  8.0321  0.9995
       mu[2]  -4.8599  0.0000    0.0000  0.0000  8.0321  0.9995
   rho[1, 1]   1.0000  0.0000    0.0000  0.0000     NaN     NaN
   rho[1, 2]   0.6642  0.0000    0.0000  0.0000     NaN     NaN
   rho[2, 1]   0.6642  0.0000    0.0000  0.0000     NaN     NaN
   rho[2, 2]   1.0000  0.0000    0.0000  0.0000     NaN     NaN
    sigma[1]   1.1418  0.0000    0.0000  0.0000     NaN     NaN
    sigma[2]   4.3882  0.0000    0.0000  0.0000  8.0321  0.9995

@joshualeond
Author

I found these docs for the PyMC3 implementation:
https://pymc3.readthedocs.io/en/latest/notebooks/LKJ.html

The LKJ distribution provides a prior on the correlation matrix, C = Corr(x_i, x_j), which, combined with priors on the standard deviations of each component, induces a prior on the covariance matrix, Σ. Since inverting Σ is numerically unstable and inefficient, it is computationally advantageous to use the Cholesky decomposition of Σ, Σ = L Lᵀ, where L is a lower-triangular matrix. This decomposition allows computation of the term (x − μ)ᵀ Σ⁻¹ (x − μ) using back-substitution, which is more numerically stable and efficient than direct matrix inversion.

Does L in the example above need to be lower-triangular?

@trappmartin
Collaborator

trappmartin commented May 28, 2020

The Cholesky decomposition of a covariance matrix can be written as the product of a lower triangular matrix with its transpose, or of an upper triangular matrix with its transpose.

In short, it can be rewritten using an upper triangular matrix.

I think my code example might actually be wrong. I should have double checked the LKJ distribution before.
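
(To illustrate the decomposition point above, here's a quick REPL check; the covariance values are just the ones from my earlier example:)

using LinearAlgebra

# Both the lower and the upper Cholesky factor reproduce the covariance:
# Σ = L*L' = U'*U.
Σ = [4.94072998 -4.93536067; -4.93536067 5.99552455]
C = cholesky(Σ)
C.L * C.L' ≈ Σ   # true
C.U' * C.U ≈ Σ   # true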

@joshualeond
Author

Thanks for all the help in investigating this, btw. I have to admit that I'm not too well versed in matrix algebra and definitely need to educate myself some more on that topic. Seeing how you construct the covariance matrix as L*L', it appears we're doing something similar to this Stan example using multi_normal_cholesky:

data {
  int<lower=1> N; // number of observations
  int<lower=1> J; // dimension of observations
  vector[J] y[N]; // observations
  vector[J] Zero; // a vector of Zeros (fixed means of observations)
}
parameters {
  cholesky_factor_corr[J] Lcorr;  
  vector<lower=0>[J] sigma; 
}
model {
  y ~ multi_normal_cholesky(Zero, diag_pre_multiply(sigma, Lcorr));
  sigma ~ cauchy(0, 5);
  Lcorr ~ lkj_corr_cholesky(1);
}
generated quantities {
  matrix[J,J] Omega;
  matrix[J,J] Sigma;
  Omega <- multiply_lower_tri_self_transpose(Lcorr);
  Sigma <- quad_form_diag(Omega, sigma); 
}

Here your Diagonal(sigma) * rho corresponds to Stan's diag_pre_multiply(sigma, Lcorr). If that's the case, then note that the type Stan uses for Lcorr is cholesky_factor_corr. Would that mean we should also take the Cholesky factor of our correlation matrix? Something like this for your example:

using Turing
using Bijectors
using LinearAlgebra

Bijectors.bijector(d::LKJ) = Bijectors.PDBijector()

@model function correlation(y)
    N,D = size(y)
    
    mu ~ filldist(truncated(Normal(0, 100), -10, 10), D)
    sigma ~ filldist(truncated(Normal(0, 100), 0, 10), D)

    # prior on the correlation matrix
    rho ~ LKJ(D,D)
    L = Diagonal(sigma) * cholesky(rho).L # <- only change is here
    
    # Iterate through each datapoint, and observe its likelihood.
    for i in 1:N
        y[i,:] ~ MvNormal(mu , L*L')
    end
    return L*L'
end

cov = [ 4.94072998 -4.93536067; -4.93536067 5.99552455 ]
x = rand(MvNormal([0.0, 1.0], cov), 100)

model = correlation(x')
sample(model, HMC(0.01,5), 1000)

This samples, but unfortunately the results are not great, just like before:

Summary Statistics
  parameters     mean     std  naive_se    mcse      ess   r_hat
  ──────────  ───────  ──────  ────────  ──────  ───────  ──────
       mu[1]   0.9929  2.5374    0.0802  0.8218   6.0357  1.1313
       mu[2]   0.6373  0.5914    0.0187  0.0773  29.5624  1.0307
   rho[1, 1]   0.2090  0.3689    0.0117  0.1164   6.1382  1.0986
   rho[1, 2]  -0.7764  0.2752    0.0087  0.0670   5.2186  1.1796
   rho[2, 1]  -0.7764  0.2752    0.0087  0.0670   5.2186  1.1796
   rho[2, 2]  10.2008  5.1925    0.1642  1.6592   5.3453  1.1012
    sigma[1]   7.8477  0.9223    0.0292  0.2864   6.1592  1.0324
    sigma[2]   0.9934  0.6178    0.0195  0.1948   5.1487  1.1389

@joshualeond
Author

Another odd thing I'm seeing is that the HMC and NUTS samplers don't seem to respect the prior and give inconsistent results. I'm even seeing explosive values with NUTS, as large as 1e155. Using your gist example and sampling from the prior gives:

sample(model, Prior(), 2000)
Summary Statistics
  parameters     mean     std  naive_se    mcse        ess   r_hat
  ──────────  ───────  ──────  ────────  ──────  ─────────  ──────
       mu[1]   0.1073  5.6915    0.1273  0.1153  1871.4727  0.9995
       mu[2]  -0.0617  5.6869    0.1272  0.1258  2127.8451  1.0001
   rho[1, 1]   1.0000  0.0000    0.0000  0.0000        NaN     NaN
   rho[1, 2]  -0.0009  0.4407    0.0099  0.0084  1699.2161  1.0008
   rho[2, 1]  -0.0009  0.4407    0.0099  0.0084  1699.2161  1.0008
   rho[2, 2]   1.0000  0.0000    0.0000  0.0000        NaN     NaN
    sigma[1]   5.0054  2.9461    0.0659  0.0426  2095.5229  0.9997
    sigma[2]   4.9586  2.8459    0.0636  0.0601  1796.0804  0.9999

Quantiles
  parameters     2.5%    25.0%    50.0%   75.0%   97.5%
  ──────────  ───────  ───────  ───────  ──────  ──────
       mu[1]  -9.3986  -4.7123   0.0259  5.1058  9.4620
       mu[2]  -9.6075  -4.8458  -0.0527  4.7862  9.3912
   rho[1, 1]   1.0000   1.0000   1.0000  1.0000  1.0000
   rho[1, 2]  -0.8358  -0.3345   0.0020  0.3452  0.7942
   rho[2, 1]  -0.8358  -0.3345   0.0020  0.3452  0.7942
   rho[2, 2]   1.0000   1.0000   1.0000  1.0000  1.0000
    sigma[1]   0.2664   2.4259   4.9984  7.6162  9.7842
    sigma[2]   0.2585   2.4988   4.9064  7.3393  9.7127

The diagonal of the correlation matrix rho is exactly 1, which is correct. However, after sampling with HMC we're seeing different values on the diagonal, as shown in my previous comment above. It appears that sampling with MH does a better job:

sample(model, MH(), 2000)
Summary Statistics
  parameters     mean     std  naive_se    mcse      ess   r_hat
  ──────────  ───────  ──────  ────────  ──────  ───────  ──────
       mu[1]   0.2502  1.7969    0.0402  0.3593  11.8399  1.0244
       mu[2]   1.5230  1.8676    0.0418  0.3667   8.0367  1.2582
   rho[1, 1]   1.0000  0.0000    0.0000  0.0000      NaN     NaN
   rho[1, 2]  -0.3615  0.1765    0.0039  0.0354  10.9368  1.0598
   rho[2, 1]  -0.3615  0.1765    0.0039  0.0354  10.9368  1.0598
   rho[2, 2]   1.0000  0.0000    0.0000  0.0000      NaN     NaN
    sigma[1]   3.4656  0.8232    0.0184  0.1620   8.0321  1.6431
    sigma[2]   3.6382  1.2235    0.0274  0.2637   8.0321  1.9880

Quantiles
  parameters     2.5%    25.0%    50.0%    75.0%   97.5%
  ──────────  ───────  ───────  ───────  ───────  ──────
       mu[1]  -3.4473   0.4832   0.4832   1.4426  2.8594
       mu[2]   0.6261   0.6261   0.8010   0.8010  5.4178
   rho[1, 1]   1.0000   1.0000   1.0000   1.0000  1.0000
   rho[1, 2]  -0.5688  -0.4236  -0.3132  -0.3132  0.4354
   rho[2, 1]  -0.5688  -0.4236  -0.3132  -0.3132  0.4354
   rho[2, 2]   1.0000   1.0000   1.0000   1.0000  1.0000
    sigma[1]   2.8745   2.8745   3.5219   3.5219  4.9908
    sigma[2]   2.6342   2.6342   3.7979   3.7979  5.4432

At least it's getting the diagonal right; however, the ESS is still very low. I had some help on the issue I posted on Bijectors that you may find interesting. The comments on that issue seem to point to numerical issues regarding positive definiteness. Wrapping the matrix calculation in Symmetric helps the sampling but doesn't quite recover the parameters used to simulate the data (a sketch of what I mean is below).
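
For reference, this is roughly the change I mean, applied to your model from above (just a sketch, and so far it doesn't fix the estimates):

using Turing
using LinearAlgebra

# Sketch only: same model as above, but the covariance is wrapped in
# Symmetric so MvNormal's internal Cholesky factorization doesn't fail on
# the tiny asymmetries that autodiff can introduce.
@model function correlation_sym(y)
    N, D = size(y)

    mu ~ filldist(truncated(Normal(0, 100), -10, 10), D)
    sigma ~ filldist(truncated(Normal(0, 100), 0, 10), D)

    # prior on the correlation matrix
    rho ~ LKJ(D, D)
    Sigma = Symmetric(Diagonal(sigma) * rho * Diagonal(sigma))

    for i in 1:N
        y[i, :] ~ MvNormal(mu, Sigma)
    end
    return Sigma
end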

I'd like to consolidate this issue in one place rather than have it spread across repos. However, at this point I'm not sure what the culprit is: Turing, Bijectors, AdvancedHMC, DynamicPPL, ForwardDiff, etc. What would you recommend?

@trappmartin
Collaborator

I think it’s mostly an issue related to the constraints of LKJ. And an issue on Turing or Bijectors is the best place for it.

@joshualeond
Author

Decided to close and continue discussion on Bijectors:
TuringLang/Bijectors.jl#108

Once the LKJ issues are resolved I plan on coming back and working on the rethinking model examples of Chapter 13.
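
For anyone landing here later, this is the rough shape I have in mind for the café model — completely untested until the LKJ/Bijectors issues are sorted out, with the data columns (cafe index, afternoon indicator, wait time) and priors taken loosely from the book:

using Turing, Distributions, LinearAlgebra

# Untested sketch of a varying intercepts/slopes model along the lines of
# Rethinking's café example. `cafe` is an integer group index, `afternoon`
# is a 0/1 indicator, and `wait` is the outcome.
@model function varying_slopes(cafe, afternoon, wait)
    n_cafes = maximum(cafe)

    # population-level intercept, slope, and scales
    a ~ Normal(5, 2)
    b ~ Normal(-1, 0.5)
    sigma_cafe ~ filldist(Exponential(1), 2)
    sigma ~ Exponential(1)

    # correlation between intercepts and slopes
    Rho ~ LKJ(2, 2)
    Sigma = Symmetric(Diagonal(sigma_cafe) * Rho * Diagonal(sigma_cafe))

    # varying intercepts and slopes, one column per café
    ab ~ filldist(MvNormal([a, b], Sigma), n_cafes)

    for i in eachindex(wait)
        mu = ab[1, cafe[i]] + ab[2, cafe[i]] * afternoon[i]
        wait[i] ~ Normal(mu, sigma)
    end
end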
