Add benchmarking infrastructure to run verification #142
Conversation
Looks like a great start, thanks! I only had minor/nonblocking feedback. Feel free to fix what you agree with and then merge.
The fact that we track the signal that killed the process is just over-engineering; we should only record its exit status. I think the racy behavior is now fixed (see the most recent commits). I don't think this is particularly important (if something is close to the timeout it can count as timing out even if it wasn't killed), but I fixed it since you pointed it out, and this fix wasn't too hard. The whole thing could be somewhat simplified by just reporting the exit status of our …
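For reference, a minimal sketch of the exit-status-versus-signal distinction on Unix, using only the standard library; the `describe` helper is hypothetical and not the code being discussed here:

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;

/// Summarize how a child process ended: a normal exit code, or (Unix only)
/// the signal that killed it. A benchmark runner that kills a child at the
/// time limit could simply treat "killed by a signal" as a timeout and
/// otherwise record just the exit code.
fn describe(status: ExitStatus) -> String {
    match status.code() {
        Some(code) => format!("exited with status {}", code),
        // `code()` is None when the process was terminated by a signal.
        None => match status.signal() {
            Some(sig) => format!("killed by signal {}", sig),
            None => String::from("ended without an exit code"),
        },
    }
}
```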
I like it!
This is really great!! I haven't had a careful look at the code, but let me know if you want me to review specific files more carefully. I have a question about concurrency/parallelism. What does benchmarking measure if we use parallelism? It would be nice to measure both "total CPU time" and "wall clock time", to be able to tell how much we actually exploit parallelism. (I remember we had some discussions about this.)
That's a good reminder about tracking concurrency, Oded. We already had the data: just compare the "user time" (which counts each core separately) with the wall-clock time to get the average core utilization. I tweaked the output to show real time, a utilization ratio (which tells us on average how many cores we're using), and the percentage of user time spent in subprocesses (this is a ratio of user times, both of which take concurrency into account). The results have one effect I don't understand: a couple of benchmarks time out with a utilization of 0 or 0.1, indicating they're spending all of their time outside of userspace (like blocking on I/O). That doesn't really make sense to me. It's possible …
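As a hedged illustration of the arithmetic described above (made-up function names, not code from this PR):

```rust
use std::time::Duration;

/// Average core utilization: user CPU time (summed over all cores) divided
/// by wall-clock time. A value near the core count means good parallelism;
/// a value near 0 means the process was mostly not running in userspace
/// (e.g. blocked on I/O or sleeping).
fn utilization(user: Duration, real: Duration) -> f64 {
    if real.is_zero() {
        return 0.0;
    }
    user.as_secs_f64() / real.as_secs_f64()
}

/// Share of user time spent in subprocesses. Both terms are user times and
/// already account for concurrency, so a plain ratio is meaningful.
fn subprocess_share(self_user: Duration, children_user: Duration) -> f64 {
    let total = self_user + children_user;
    if total.is_zero() {
        return 0.0;
    }
    children_user.as_secs_f64() / total.as_secs_f64()
}
```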
This is great! I have no technical comments, but a few ideas regarding functionality, mainly things I have in mind for running qalpha benchmarks. I'm not sure whether you'd want to implement any of this in this PR or to leave it as future TODOs, so feel free to implement whatever you'd like. This is what I would add:

- Saving outputs: Currently the stdout of each run is discarded, but it might be interesting to look at, and can be used to extract relevant statistics. Even for `verify`, the output associated with a `fail` might be of interest (a rough sketch of this appears after this comment). For this we could have:
  - Some dedicated folder (like we have `.flyvy-log` for `.smt2` files) where the output of benchmark runs is saved. We can also have a flag that determines whether the output should be saved or discarded.
  - Once the above is done, we might want some control over the level of logging in the output, e.g. using a `--log level` flag which would set `RUST_LOG=level` (or not set any logging if not provided).
  - Finally, there should be a way to access this output from the benchmark code, and to define per-benchmark statistics to present in the printed table. Maybe this should be done by providing a path to the saved file. For example, when I write the benchmarking for `qalpha` I might want to print the size of the fixpoint found, which I would extract from the output and then add to the table. Note that we probably want the saved/processed output to contain both `stdout` and `stderr`, because I think the logging is sent to `stderr`.
- More flexibility in configuring benchmarks: Currently this uses all examples in the `examples` folder and runs the benchmark with the same arguments for all of them, but we might want to determine a subset of these files to run a certain benchmark on, with file-specific arguments, or even multiple argument configurations per file. I think this can be done in several ways, for example using some config file that lists the benchmark configurations to run on each file, or by specifying inside the file which benchmark configurations should be run on it, similar to the way snapshot tests work. (Not all benchmarks must use these, just those where it's necessary.)
I'm approving, but let me know if you implement any of these changes.
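A minimal sketch of the "saving outputs" suggestion, using only the standard library; the `run_and_save` helper, the directory layout, and the file naming are hypothetical, not part of this PR:

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::process::Command;

/// Run a benchmark command, forward an optional log level as RUST_LOG, and
/// save its combined stdout/stderr to a file under `log_dir`. Returns the
/// path to the saved output so per-benchmark statistics can later be
/// extracted from it.
fn run_and_save(
    mut cmd: Command,
    log_dir: &Path,
    name: &str,
    log_level: Option<&str>,
) -> std::io::Result<PathBuf> {
    if let Some(level) = log_level {
        // Mirrors the suggested `--log level` flag.
        cmd.env("RUST_LOG", level);
    }
    // `output()` captures both stdout and stderr.
    let output = cmd.output()?;
    fs::create_dir_all(log_dir)?;
    let path = log_dir.join(format!("{}.out", name));
    // Keep stdout and stderr together, since logging typically goes to stderr.
    let mut contents = output.stdout;
    contents.extend_from_slice(&output.stderr);
    fs::write(&path, contents)?;
    Ok(path)
}
```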
The user-visible part of this is a library that allows running `temporal-verifier` with a timeout and gathering some runtime resource usage statistics as well as exit status, and a function to produce a nice table of all the results. I've used this in the simplest possible way to generate the results of running `temporal-verifier verify` over all the benchmarks, with the results shown below.

This is the core infrastructure needed for #118. The remaining work (hopefully small) is to add configurations to do something similar for running qalpha and the bounded model checkers. These will be a bit more work since those have more parameters to configure.

The infrastructure to monitor a benchmark is highly Unix-specific, so it works on macOS and Linux and will not work on Windows (but it should work on WSL).
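For illustration, a rough sketch of one way such Unix-specific monitoring can be done, assuming the `libc` crate; the `run_with_timeout` helper is hypothetical and not the code in this PR:

```rust
use std::process::Command;
use std::time::{Duration, Instant};

/// Run `cmd` with a wall-clock time limit, then report whether it timed
/// out, its real (wall-clock) time, and the user CPU time of waited-for
/// children (via getrusage, which makes this Unix-only).
fn run_with_timeout(
    mut cmd: Command,
    limit: Duration,
) -> std::io::Result<(bool, Duration, Duration)> {
    let start = Instant::now();
    let mut child = cmd.spawn()?;
    // Poll until the child exits or the time limit is reached.
    let timed_out = loop {
        match child.try_wait()? {
            Some(_status) => break false,
            None if start.elapsed() >= limit => {
                child.kill()?;
                child.wait()?;
                break true;
            }
            None => std::thread::sleep(Duration::from_millis(50)),
        }
    };
    let real = start.elapsed();
    // User CPU time accumulated by children this process has waited for.
    let mut ru: libc::rusage = unsafe { std::mem::zeroed() };
    unsafe { libc::getrusage(libc::RUSAGE_CHILDREN, &mut ru) };
    let user = Duration::new(ru.ru_utime.tv_sec as u64, ru.ru_utime.tv_usec as u32 * 1_000);
    Ok((timed_out, real, user))
}
```

A real runner would also record the child's exit status and distinguish a kill at the time limit from a normal exit.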
To produce the first table, run `cargo run -r --bin benchmark -- verify --time-limit 30s`. To produce the second table (with the cvc5 solver), run `cargo run --bin benchmark -- verify --time-limit 30s --solver cvc5`.