Create a benchmarks module #470

Closed
wants to merge 37 commits

Conversation

Member

@neerajprad neerajprad commented Nov 28, 2019

This creates a benchmarks module in the main repo. Currently this has changes from #469, which will be merged in once that PR lands. This branch should only contain changes to the benchmarks module.

We should ensure the following:

  • Use float64 for CPU if possible and float32 for CUDA, especially when comparing with Stan.
  • Report both time per leapfrog and time per effective sample.
  • Exclude compilation + warmup from the times (a rough timing sketch illustrating this is given below).
  • Add requirements.txt with pinned dependencies so that the benchmarks are reproducible.
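
As a rough illustration of the timing points above, here is a minimal sketch (not the benchmark code in this PR; the model, sample counts, and the MCMC.warmup/run split are placeholder assumptions based on recent NumPyro) that keeps compilation and warmup out of the measured time and reports both ms/leapfrog and time per effective sample:

```python
import time

import numpy as np
from jax import random

import numpyro
import numpyro.distributions as dist
from numpyro.diagnostics import effective_sample_size
from numpyro.infer import MCMC, NUTS

numpyro.enable_x64()  # float64 on CPU, per the first checklist item


def model(data):
    # placeholder model: infer a Gaussian mean
    mu = numpyro.sample("mu", dist.Normal(0.0, 10.0))
    numpyro.sample("obs", dist.Normal(mu, 1.0), obs=data)


data = np.random.randn(1000)
mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000, progress_bar=False)

# JIT compilation and warmup adaptation happen here and are excluded from timing;
# run() then continues from the adapted state.
mcmc.warmup(random.PRNGKey(0), data)

t0 = time.time()
mcmc.run(random.PRNGKey(1), data, extra_fields=("num_steps",))
elapsed = time.time() - t0

num_leapfrog = int(np.sum(np.asarray(mcmc.get_extra_fields()["num_steps"])))
ess = effective_sample_size(np.asarray(mcmc.get_samples()["mu"])[None, :])

print(f"{1e3 * elapsed / num_leapfrog:.3f} ms per leapfrog step")
print(f"{elapsed / float(ess):.3f} s per effective sample")
```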

@neerajprad neerajprad mentioned this pull request Nov 28, 2019
@neerajprad
Member Author

@fehiepsi - we have made a number of changes to the interface, and jax's caching has changed too since the last version, so if you notice any benchmarks that look off, let me know.

@fehiepsi
Member

Yup, so far there is a bit of a regression in the hmm benchmark: 0.09 -> 0.12 ms/leapfrog because we use progress_bar=True here. But that is not important IMO. The benefit of enabling progress_bar=True is that we don't need to set num_samples=100,000 (to reduce the contribution of compilation time).

FYI, Pyro is 1.5x faster than before (with PyTorch 1.3.1), but it is still very slow... In 32-bit, NumPyro has a slightly smaller n_eff than Stan (566 vs 603) and a much higher number of divergences (282 vs 42). But in 64-bit, NumPyro's n_eff is higher than Stan's (752 vs 603) with no divergences, though it is 1.5x slower than 32-bit (0.17 vs 0.12 ms/leapfrog).

@neerajprad
Member Author

Yup, so far there is a bit of a regression in the hmm benchmark: 0.09 -> 0.12 ms/leapfrog because we use progress_bar=True here.

I think it's fine to use progress_bar=False for benchmarks. One issue I noticed is that progress_bar=False currently does not give identical results (it is likely wrong). I'll try to debug and push a fix for that.


device=cpu
N=100
benchmark_dir=$( cd $(dirname "$0") ; pwd -P )
Member Author


@fehiepsi - here's a driver script to run the benchmarks. Could you run this with the same system config as the remaining ones? Let me fix the issue with the progress bar first.

@martinjankowiak
Collaborator

i'm afraid i haven't much intuition about hyperparameters....

btw in pyro why don't you merge half-steps when you run multiple verlet steps?

(i.e. combine line 46 with line 54)

@fehiepsi
Member

fehiepsi commented Dec 4, 2019

merge half-steps when you run multiple verlet steps

I recall that we merged them previously (for HMC) but separated them out during refactoring. Updating those r sites given a known z_grad is cheap IMO. But we can reinstate it if it is necessary for you.
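
For readers following along, here is a small illustrative sketch (plain NumPy, not Pyro's or NumPyro's actual integrator) of what merging the half-steps means: the trailing half momentum update of one velocity Verlet step and the leading half update of the next are fused into a single full update, saving one gradient evaluation per interior step while producing the same trajectory (up to floating point):

```python
import numpy as np


def grad_potential(z):
    # placeholder: gradient of U(z) = z^2 / 2
    return z


def leapfrog_unmerged(z, r, step_size, num_steps):
    # each step does: half momentum, full position, half momentum
    for _ in range(num_steps):
        r = r - 0.5 * step_size * grad_potential(z)
        z = z + step_size * r
        r = r - 0.5 * step_size * grad_potential(z)
    return z, r


def leapfrog_merged(z, r, step_size, num_steps):
    # same trajectory, but interior half-steps are fused into full steps
    r = r - 0.5 * step_size * grad_potential(z)
    for i in range(num_steps):
        z = z + step_size * r
        if i < num_steps - 1:
            r = r - step_size * grad_potential(z)
    r = r - 0.5 * step_size * grad_potential(z)
    return z, r


z0, r0 = np.array(1.0), np.array(0.5)
print(leapfrog_unmerged(z0, r0, 0.1, 5))
print(leapfrog_merged(z0, r0, 0.1, 5))  # matches, with one fewer gradient call per interior step
```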

@martinjankowiak
Collaborator

oh i'm not making a suggestion. i just would have imagined that merging could give a noticeable perf bump but maybe not

@junpenglao

Instead of copying the old experimental Edward NUTS implementation into your repository, why not import the TFP NUTS implementation (https://github.com/tensorflow/probability/blob/master/tensorflow_probability/python/mcmc/nuts.py)? It is also an iterative implementation, and you can compile it to XLA+GPU.

@neerajprad
Member Author

@junpenglao - Thanks for the pointer. We will definitely add the tfp NUTS implementation to the benchmarks. This branch is basically resurrecting some of our old code from a few months back (I think the tfp one was experimental back then, but I could be wrong).

@fehiepsi
Member

@junpenglao Could you give me an example of how to use TFP NUTS so I can give it a try (I am just curious, so please take your time)? As @neerajprad pointed out, this is code from a few months ago and we keep it here for reproducibility. It is not easy to keep up with newly released packages for benchmarking. FYI, this benchmarks branch is not meant to showcase the state of the art.

@junpenglao

Thank you both! I think the best place to start is the test case for tfp nuts and my introductory colab.

I am more than happy to review and help in subsequent PRs if you are interested in replacing the Edward_nuts with the TFP_nuts.

@fehiepsi
Member

fehiepsi commented Dec 10, 2019 via email

@fehiepsi
Member

fehiepsi commented Dec 19, 2019

@neerajprad @junpenglao It seems that tfp does a slightly better job (1.05x faster; originally reported as 1.3x, see the edit below) in this GPU example (see the gist) but is slower on CPU (even slower than ed2 - I'm not sure why - probably I missed some configs to make it work, or TF XLA is just at an experimental stage).

It would be nice to see if we can improve the speed on GPU. I don't have an explanation for this difference. Probably the small operators add up? @junpenglao, given your expertise with tfp, do you have any insight into this difference? I think tfp also uses the iterative algorithm, runs in XLA, and most of the computation time (I may be wrong about this assumption) should lie in the leapfrog step... so the performance on this example should be similar.

@neerajprad I don't want to chase the GPU "speed" gap unless it provides clear benefits. It is good to keep the benchmarks as-is because we only want to benchmark against the recursive algorithms. However, we should create an issue to track down this problem (I suspect lax.cond plays a role here) in the future.

Edit: Sorry, I compared tfp with the old numpyro benchmark result. The updated difference (tfp is 1.05x faster) might come from how the bernoulli logprob is computed or some other small tensor ops - probably not worth investigating. :)

@neerajprad
Member Author

@fehiepsi - For large datasets, I suppose that most of the compute will be dominated by tensor ops on the backward pass. I don't think this is worth investigating unless we have a few more examples. In any case, I think the unrolled NUTS implementation using XLA should be pretty much the same in terms of run time, even on CPU. It would be surprising if that's not the case.

@junpenglao

Thanks a lot for the early feedback!
I wouldn't expect a large difference either. One of the main differences (please correct me if I am wrong on the numpyro end) is that the TFP version batches at each leapfrog step, while the numpyro version batches at the end using vectorized_map. Since I need to implement things in a way that makes sure batching works, there is quite a lot of shape handling that might under-perform or out-perform autobatching in terms of memory (my largest concern), and I may be using TF operations that are slow.

Otherwise, once we compile to XLA, everything should give similar speed, as TF and JAX are basically different interfaces to XLA. However, we might see some differences when we compare large batch sizes (e.g., num_chain=1, 10, 100, 500).

@fehiepsi the slowness on CPU is pretty strange; maybe it has something to do with compiling to XLA - I usually run once and then run again to do the timing.

@fehiepsi
Member

@junpenglao Yes, numpyro chain_method='vectorized' batches at the end, so it wastes computation if some chains finish their trajectories earlier than others :( (but for drawing a lot of chains on GPU, this might not be a big problem). For the CPU benchmark, I followed your suggestion - you can take a look at this gist (edward2 gives 60 ms/leapfrog but tfp+XLA gives 80 ms/leapfrog - but as I said, this probably comes from the experimental stage of tf.xla.experimental.compile; your colab with tf_nightly and tf.function(..., autograph=False, experimental_compile=True) might also do better). Anyway, thanks for sharing your suggestions! Using tfp seems not as complicated as I had thought. :D
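
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of the tf.function(..., autograph=False, experimental_compile=True) route mentioned above; the standard-normal target, step size, and chain count are placeholders, and the API names are those of the TF/TFP releases around this time, so they may have changed since:

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# placeholder target: a standard normal in 2 dimensions
target = tfd.MultivariateNormalDiag(loc=tf.zeros(2))

kernel = tfp.mcmc.NoUTurnSampler(
    target_log_prob_fn=target.log_prob,
    step_size=0.3)


@tf.function(autograph=False, experimental_compile=True)
def run_chain(init_state):
    # trace_fn=None returns only the chain states; burn-in steps are discarded
    return tfp.mcmc.sample_chain(
        num_results=1000,
        num_burnin_steps=500,
        current_state=init_state,
        kernel=kernel,
        trace_fn=None)


samples = run_chain(tf.zeros([4, 2]))  # 4 vectorized chains of dimension 2
print(samples.shape)  # (1000, 4, 2)
```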

@junpenglao

TF got a lot simpler with TF2😉
Also, the edward nuts implemented its own leapfrog with a for loop that built a larger graph, if I remember correctly; that might be one reason as well.

from numpyro.examples.datasets import COVTYPE, load_dataset

# pyro
import torch
Member


Just a note in case we face this issue in the future: with jaxlib 0.1.37, import torch might cause a GPU memory error when running numpyro mcmc on this example (I don't have an explanation for this and also can't come up with an isolated reproduction to raise upstream).

@neerajprad
Member Author

Closing this since all tasks are resolved, and all changes will remain in the benchmarks branch. We will periodically update this branch and add git tags to pin its state at specific points in time - the latest is benchmarks-20191222.

@junpenglao - Thanks for sharing your thoughts on the possible differences in the tfp implementation that could result in a different perf profile; it makes sense. Please feel free to use models from this branch for profiling if you find them useful, and please share any useful findings with us!

I wouldn't expect a large difference either. One of the main differences (please correct me if I am wrong on the numpyro end) is that the TFP version batches at each leapfrog step, while the numpyro version batches at the end using vectorized_map.

In NumPyro, parallel chains work either by using device parallelism (pmap) or by vectorization. In the latter case, as @fehiepsi mentioned, we need to wait for all the K chains to meet the terminating condition before we can collect the K samples and then repeat. Is this the same in tfp, or is Alexey's auto-batching solution incorporated in the tfp implementation? 🙂
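
To make the two modes concrete, here is a small usage sketch (toy model as a placeholder): chain_method='parallel' maps chains over devices with pmap, while chain_method='vectorized' batches all chains through a single kernel, so each NUTS iteration runs until the longest trajectory among the chains terminates.

```python
from jax import random

import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS


def model():
    # placeholder model
    numpyro.sample("x", dist.Normal(0.0, 1.0))


kernel = NUTS(model)

# Device parallelism: one chain per device via pmap.
mcmc_pmap = MCMC(kernel, num_warmup=500, num_samples=1000,
                 num_chains=4, chain_method="parallel")

# Vectorization: all chains are batched through a single kernel, so each
# NUTS iteration waits for the slowest trajectory among the chains.
mcmc_vec = MCMC(kernel, num_warmup=500, num_samples=1000,
                num_chains=4, chain_method="vectorized")

mcmc_vec.run(random.PRNGKey(0))
print(mcmc_vec.get_samples(group_by_chain=True)["x"].shape)  # (4, 1000)
```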

@neerajprad neerajprad closed this Dec 24, 2019
@junpenglao

Thanks @neerajprad!

Both the current TFP NUTS and Alexey's auto-batched NUTS are doing the same thing as numpyro then: wait for all chains to meet termination (u-turn or divergence), finalize one sample, then start the next sample (resample momentum, etc.).
Potentially we can add a flag to have chains not wait for each other - terminate when any chain terminates. This would increase the speed but reduce the effective sample size, as tree building would terminate too early for most chains. Not sure how it would affect num_effective_samples per second, but it is certainly an interesting idea to explore.
We don't have a pmap version (which IIUC would mean the chains won't wait for each other), but maybe it is possible to use the TF pmap to do that.

@neerajprad
Member Author

Thanks for explaining, @junpenglao.

Potentially we can add a flag to have chains not wait for each other - terminate when any chain terminates. This would increase the speed but reduce the effective sample size, as tree building would terminate too early for most chains. Not sure how it would affect num_effective_samples per second, but it is certainly an interesting idea to explore.

That's something that we can also easily explore with our current setup - since we are wasting less computation, it is possible that the higher number of drawn samples results in a higher effective sample size, despite early termination. It is worth investigating!
