Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add numerically safe ordered probit distribution. #4232

Merged
merged 9 commits into from
Dec 16, 2020

Conversation

tohtsky
Copy link
Contributor

@tohtsky tohtsky commented Nov 19, 2020

This feature adds a ordered probit distribution,
which uses probit link function (i.e., std_cdf) instead of logistic one (sigmoid) used by OrdredLogistic.

Since std_cdf converges to 1 or 0 much faster than sigmoid, additional care must be taken
in order to deal with outlier values without encountering numerical instability,

For example, if we simply use std_cdf instead of sigmoid in the implementation of OrderedLogistic,
we will be getting bad initial energy error in the following example.

class OrderedProbitNaive(pm.Categorical):
    def __init__(self, eta, cutpoints, *args, **kwargs):
        self.eta = tt.as_tensor_variable(pm.floatX(eta))
        self.cutpoints = tt.as_tensor_variable(cutpoints)
        pa = pm.distributions.dist_math.std_cdf(self.cutpoints - tt.shape_padright(self.eta))
        p_cum = tt.concatenate(
            [
                tt.zeros_like(tt.shape_padright(pa[..., 0])),
                pa,
                tt.ones_like(tt.shape_padright(pa[..., 0])),
            ],
            axis=-1,
        )
        p = p_cum[..., 1:] - p_cum[..., :-1]
        super().__init__(p=p, *args, **kwargs)

n1_c = 300; n2_c = 300; n3_c = 300
cluster1 = np.random.randn(n1_c) + -1
cluster2 = np.random.randn(n2_c) + 0
cluster3 = np.random.randn(n3_c) + 2

x = np.concatenate((cluster1, cluster2, cluster3))
y = np.concatenate((1*np.ones(n1_c),
                    2*np.ones(n2_c),
                    3*np.ones(n3_c))) - 1

# Add an outlier point
x[0] = -10
y[0] = 2

# Ordered logistic regression
with pm.Model() as model:
    cutpoints = pm.Normal("cutpoints", mu=[-1,1], sigma=10, shape=2,
                          transform=pm.distributions.transforms.ordered)
    y_ = OrderedProbitNaive("y", cutpoints=cutpoints, eta=x, observed=y)
    tr = pm.sample(1000, chains=1)

The implemented version of OrderedProbit avoids this issue by carefully using scaled error function erfcx .

@Spaak
Copy link
Member

Spaak commented Dec 2, 2020

Hi @tohtsky thanks for the PR! This looks very useful indeed. As you can see some code checks are failing. Specifically the pre-commit one fails for some style reason (see the details there). If you fix that issue and push, the automatic tests will run again.

We're currently in the process of lining up PRs for upcoming release 3.10, this one might still make it I think.

@codecov
Copy link

codecov bot commented Dec 3, 2020

Codecov Report

Merging #4232 (7b27830) into master (dbcc49e) will decrease coverage by 0.01%.
The diff coverage is 78.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4232      +/-   ##
==========================================
- Coverage   87.87%   87.85%   -0.02%     
==========================================
  Files          88       88              
  Lines       14473    14501      +28     
==========================================
+ Hits        12718    12740      +22     
- Misses       1755     1761       +6     
Impacted Files Coverage Δ
pymc3/distributions/__init__.py 100.00% <ø> (ø)
pymc3/distributions/discrete.py 95.22% <75.00%> (-1.57%) ⬇️
pymc3/distributions/dist_math.py 91.97% <100.00%> (+0.11%) ⬆️

@tohtsky
Copy link
Contributor Author

tohtsky commented Dec 3, 2020

Hi @Spaak , thank you for the comment and sorry I'm not carefully reading the contribution guide. I have fixed the styling issues and made the test more accurate (the first test failed with float32 due to a numerical problem on the scipy side).

I would be very happy to receive further comments and suggestions!

Copy link
Member

@Spaak Spaak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Apologies for the slow follow-up, we got a bit busy with the 3.10 release and I thought it better to leave this one until after. Thanks @tohtsky!

Copy link
Member

@Spaak Spaak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, before merging, it would be great if you could add a line to the release notes, in the top, the "On deck" section.

@tohtsky
Copy link
Contributor Author

tohtsky commented Dec 16, 2020

Thanks for the review, @Spaak !
Added a New feature subsection and a line to the RELEASE_NOTE.md.

@Spaak Spaak merged commit 9dbf9ea into pymc-devs:master Dec 16, 2020
@Spaak
Copy link
Member

Spaak commented Dec 16, 2020

Thanks again @tohtsky! 🎉

@tohtsky tohtsky deleted the feature-oprobit branch December 16, 2020 08:30
Spaak added a commit that referenced this pull request Dec 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants