Add benchmarking infrastructure to run verification #142

Merged: 3 commits merged into main from tchajed/benchmark-timing on Aug 24, 2023

Conversation

@tchajed (Collaborator) commented Jul 28, 2023

The user-visible part of this is a library that allows running temporal-verifier with a timeout and gathering some runtime resource usage statistics as well as exit status, and a function to produce a nice table of all the results. I've used this in the simplest possible way to generate the results of running temporal-verifier verify over all the benchmarks, with the results shown below.

This is the core infrastructure needed for #118. The remaining work (hopefully small) is to add configurations to do something similar for running qalpha and the bounded model checkers. These will be a bit more work since those have more parameters to configure.

The infrastructure to monitor a benchmark is highly Unix-specific, so it works on macOS and Linux but will not work on Windows (though it should work on WSL).
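For a rough idea of what that Unix-specific monitoring involves, here is a minimal sketch (not the code in this PR): spawn the child, poll it against the time limit, and read getrusage(RUSAGE_CHILDREN) once it has been waited for. It assumes the libc crate; the Measurement struct and run_with_timeout function are made-up names for illustration.

```rust
use std::process::Command;
use std::time::{Duration, Instant};

struct Measurement {
    wall_time: Duration,
    user_time: Duration,
    max_rss: i64, // kilobytes on Linux, bytes on macOS
    timed_out: bool,
    success: bool,
}

fn run_with_timeout(mut cmd: Command, limit: Duration) -> std::io::Result<Measurement> {
    let start = Instant::now();
    let mut child = cmd.spawn()?;
    // Poll until the child exits or the time limit is exceeded.
    let (status, timed_out) = loop {
        if let Some(status) = child.try_wait()? {
            break (Some(status), false);
        }
        if start.elapsed() >= limit {
            child.kill()?;
            child.wait()?;
            break (None, true);
        }
        std::thread::sleep(Duration::from_millis(50));
    };
    // Resource usage accumulated by all waited-for children of this process;
    // a real harness running many benchmarks would need to track deltas.
    let mut usage: libc::rusage = unsafe { std::mem::zeroed() };
    unsafe { libc::getrusage(libc::RUSAGE_CHILDREN, &mut usage) };
    Ok(Measurement {
        wall_time: start.elapsed(),
        user_time: Duration::new(
            usage.ru_utime.tv_sec as u64,
            (usage.ru_utime.tv_usec as u32) * 1_000,
        ),
        max_rss: usage.ru_maxrss as i64,
        timed_out,
        success: status.map(|s| s.success()).unwrap_or(false),
    })
}
```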

To produce this table, run: cargo run -r --bin benchmark -- verify --time-limit 30s

╭─────────┬──────────────────────────────────────────┬─────────┬────────┬──────────┬──────────┬────────┬─────────────╮
│ command │ file                                     │ outcome │ time s │ cpu util │ solver % │    mem │ params      │
├─────────┼──────────────────────────────────────────┼─────────┼────────┼──────────┼──────────┼────────┼─────────────┤
│ verify  │ consensus.fly                            │ fail    │    0.1 │     2.5× │      99% │   24MB │ --solver=z3 │
│ verify  │ consensus_epr.fly                        │         │    0.1 │     2.4× │      99% │   24MB │ --solver=z3 │
│ verify  │ consensus_forall.fly                     │         │    0.1 │     2.6× │      99% │   23MB │ --solver=z3 │
│ verify  │ fast_paxos_epr.fly                       │         │   24.1 │     1.4× │     100% │  172MB │ --solver=z3 │
│ verify  │ fol/block_cache_system_frozen_async.fly  │ timeout │   30.0 │     1.3× │     100% │ 1817MB │ --solver=z3 │
│ verify  │ fol/bosco_3t_safety.fly                  │ timeout │   30.0 │     0.7× │     100% │  313MB │ --solver=z3 │
│ verify  │ fol/cache.fly                            │         │    0.4 │     6.5× │     100% │   25MB │ --solver=z3 │
│ verify  │ fol/client_server_ae.fly                 │ fail    │    0.0 │     1.3× │      98% │   23MB │ --solver=z3 │
│ verify  │ fol/client_server_db_ae.fly              │         │    0.0 │     2.0× │      99% │   21MB │ --solver=z3 │
│ verify  │ fol/consensus_epr.fly                    │         │    0.1 │     2.6× │      99% │   24MB │ --solver=z3 │
│ verify  │ fol/consensus_forall.fly                 │         │    0.1 │     2.8× │      99% │   23MB │ --solver=z3 │
│ verify  │ fol/consensus_wo_decide.fly              │         │    0.1 │     2.3× │      99% │   23MB │ --solver=z3 │
│ verify  │ fol/fast_paxos_epr.fly                   │         │   24.2 │     1.5× │     100% │  175MB │ --solver=z3 │
│ verify  │ fol/fast_paxos_forall.fly                │ timeout │   30.0 │     0.9× │     100% │  195MB │ --solver=z3 │
│ verify  │ fol/fast_paxos_forall_choosable.fly      │ timeout │   30.0 │     0.0× │     100% │   35MB │ --solver=z3 │
│ verify  │ fol/firewall.fly                         │         │    0.0 │     1.0× │      97% │   21MB │ --solver=z3 │
│ verify  │ fol/flexible_paxos_epr.fly               │         │    0.2 │     2.5× │     100% │   25MB │ --solver=z3 │
│ verify  │ fol/flexible_paxos_forall.fly            │         │    0.4 │     1.9× │     100% │   32MB │ --solver=z3 │
│ verify  │ fol/flexible_paxos_forall_choosable.fly  │         │    2.9 │     1.1× │     100% │   56MB │ --solver=z3 │
│ verify  │ fol/hybrid_reliable_broadcast_cisa.fly   │         │    0.4 │     4.2× │     100% │   31MB │ --solver=z3 │
│ verify  │ fol/learning_switch.fly                  │         │    0.1 │     2.1× │      99% │   21MB │ --solver=z3 │
│ verify  │ fol/lockserv.fly                         │         │    0.1 │     3.1× │      99% │   22MB │ --solver=z3 │
│ verify  │ fol/multi_paxos_epr.fly                  │ timeout │   30.0 │     0.0× │     100% │   26MB │ --solver=z3 │
│ verify  │ fol/paxos_epr.fly                        │         │    1.0 │     1.4× │     100% │   39MB │ --solver=z3 │
│ verify  │ fol/paxos_forall.fly                     │         │    0.2 │     3.3× │     100% │   28MB │ --solver=z3 │
│ verify  │ fol/paxos_forall_choosable.fly           │ timeout │   30.0 │     0.0× │     100% │   28MB │ --solver=z3 │
│ verify  │ fol/raft.fly                             │ fail    │    0.0 │     0.4× │      88% │    1MB │ --solver=z3 │
│ verify  │ fol/ring_id.fly                          │         │    0.1 │     1.9× │      99% │   23MB │ --solver=z3 │
│ verify  │ fol/ring_id_not_dead.fly                 │         │    0.5 │     1.5× │     100% │   33MB │ --solver=z3 │
│ verify  │ fol/sharded_kv.fly                       │         │    0.1 │     1.9× │      99% │   23MB │ --solver=z3 │
│ verify  │ fol/sharded_kv_no_lost_keys.fly          │         │    0.1 │     1.8× │      99% │   24MB │ --solver=z3 │
│ verify  │ fol/stoppable_paxos_epr.fly              │ timeout │   30.0 │     0.2× │     100% │   94MB │ --solver=z3 │
│ verify  │ fol/stoppable_paxos_forall.fly           │ timeout │   30.0 │     0.3× │     100% │   90MB │ --solver=z3 │
│ verify  │ fol/stoppable_paxos_forall_choosable.fly │ timeout │   30.0 │     0.1× │     100% │   46MB │ --solver=z3 │
│ verify  │ fol/ticket.fly                           │         │    0.1 │     4.4× │     100% │   23MB │ --solver=z3 │
│ verify  │ fol/toy_consensus_epr.fly                │         │    0.1 │     1.9× │      99% │   23MB │ --solver=z3 │
│ verify  │ fol/toy_consensus_forall.fly             │         │    0.0 │     1.8× │      98% │   22MB │ --solver=z3 │
│ verify  │ fol/vertical_paxos_epr.fly               │         │    3.1 │     1.2× │     100% │   68MB │ --solver=z3 │
│ verify  │ fol/vertical_paxos_forall.fly            │         │    0.3 │     3.7× │     100% │   27MB │ --solver=z3 │
│ verify  │ fol/vertical_paxos_forall_choosable.fly  │         │    1.2 │     1.7× │     100% │   49MB │ --solver=z3 │
│ verify  │ lockserver.fly                           │         │    0.0 │     3.2× │      99% │   22MB │ --solver=z3 │
│ verify  │ paxos_epr.fly                            │         │    0.9 │     1.2× │     100% │   37MB │ --solver=z3 │
│ verify  │ pingpong.fly                             │         │    0.0 │     0.5× │      92% │   19MB │ --solver=z3 │
╰─────────┴──────────────────────────────────────────┴─────────┴────────┴──────────┴──────────┴────────┴─────────────╯
total:    43
ok:       31
timeout:  9
fail:     3

To produce this table, run: cargo run --bin benchmark -- verify --time-limit 30s --solver cvc5

╭─────────┬──────────────────────────────────────────┬─────────┬────────┬──────────┬──────────┬──────┬───────────────╮
│ command │ file                                     │ outcome │ time s │ cpu util │ solver % │  mem │ params        │
├─────────┼──────────────────────────────────────────┼─────────┼────────┼──────────┼──────────┼──────┼───────────────┤
│ verify  │ consensus.fly                            │ fail    │    0.2 │     2.2× │     100% │ 20MB │ --solver=cvc5 │
│ verify  │ consensus_epr.fly                        │         │    1.1 │     1.5× │     100% │ 33MB │ --solver=cvc5 │
│ verify  │ consensus_forall.fly                     │         │    0.3 │     1.9× │     100% │ 25MB │ --solver=cvc5 │
│ verify  │ fast_paxos_epr.fly                       │ timeout │   30.0 │     0.1× │     100% │ 39MB │ --solver=cvc5 │
│ verify  │ fol/block_cache_system_frozen_async.fly  │         │    7.3 │     6.7× │     100% │ 59MB │ --solver=cvc5 │
│ verify  │ fol/bosco_3t_safety.fly                  │ timeout │   30.0 │     0.5× │     100% │ 78MB │ --solver=cvc5 │
│ verify  │ fol/cache.fly                            │         │    1.6 │     8.2× │     100% │ 36MB │ --solver=cvc5 │
│ verify  │ fol/client_server_ae.fly                 │ fail    │    0.0 │     1.6× │      98% │ 13MB │ --solver=cvc5 │
│ verify  │ fol/client_server_db_ae.fly              │         │    0.1 │     2.8× │     100% │ 17MB │ --solver=cvc5 │
│ verify  │ fol/consensus_epr.fly                    │         │    1.0 │     1.6× │     100% │ 32MB │ --solver=cvc5 │
│ verify  │ fol/consensus_forall.fly                 │         │    0.3 │     2.1× │     100% │ 25MB │ --solver=cvc5 │
│ verify  │ fol/consensus_wo_decide.fly              │         │    0.2 │     1.8× │     100% │ 25MB │ --solver=cvc5 │
│ verify  │ fol/fast_paxos_epr.fly                   │ timeout │   30.0 │     0.1× │     100% │ 43MB │ --solver=cvc5 │
│ verify  │ fol/fast_paxos_forall.fly                │ timeout │   30.0 │     0.1× │     100% │ 45MB │ --solver=cvc5 │
│ verify  │ fol/fast_paxos_forall_choosable.fly      │ timeout │   30.0 │     0.5× │     100% │ 63MB │ --solver=cvc5 │
│ verify  │ fol/firewall.fly                         │         │    0.2 │     1.9× │     100% │ 23MB │ --solver=cvc5 │
│ verify  │ fol/flexible_paxos_epr.fly               │         │   22.8 │     2.2× │     100% │ 90MB │ --solver=cvc5 │
│ verify  │ fol/flexible_paxos_forall.fly            │         │    2.4 │     2.4× │     100% │ 42MB │ --solver=cvc5 │
│ verify  │ fol/flexible_paxos_forall_choosable.fly  │         │    3.8 │     2.6× │     100% │ 56MB │ --solver=cvc5 │
│ verify  │ fol/hybrid_reliable_broadcast_cisa.fly   │         │   22.8 │     3.3× │     100% │ 74MB │ --solver=cvc5 │
│ verify  │ fol/learning_switch.fly                  │         │    1.0 │     1.6× │     100% │ 29MB │ --solver=cvc5 │
│ verify  │ fol/lockserv.fly                         │         │    0.1 │     4.5× │     100% │ 15MB │ --solver=cvc5 │
│ verify  │ fol/multi_paxos_epr.fly                  │         │   18.1 │     2.9× │     100% │ 91MB │ --solver=cvc5 │
│ verify  │ fol/paxos_epr.fly                        │ timeout │   30.0 │     0.2× │     100% │ 61MB │ --solver=cvc5 │
│ verify  │ fol/paxos_forall.fly                     │         │    2.4 │     3.8× │     100% │ 44MB │ --solver=cvc5 │
│ verify  │ fol/paxos_forall_choosable.fly           │         │    3.6 │     2.8× │     100% │ 56MB │ --solver=cvc5 │
│ verify  │ fol/raft.fly                             │ fail    │    0.0 │     0.4× │      89% │  1MB │ --solver=cvc5 │
│ verify  │ fol/ring_id.fly                          │         │    0.4 │     1.5× │     100% │ 30MB │ --solver=cvc5 │
│ verify  │ fol/ring_id_not_dead.fly                 │         │    4.4 │     2.5× │     100% │ 45MB │ --solver=cvc5 │
│ verify  │ fol/sharded_kv.fly                       │         │    0.2 │     2.8× │     100% │ 18MB │ --solver=cvc5 │
│ verify  │ fol/sharded_kv_no_lost_keys.fly          │         │    1.0 │     1.4× │     100% │ 35MB │ --solver=cvc5 │
│ verify  │ fol/stoppable_paxos_epr.fly              │ timeout │   30.0 │     0.1× │     100% │ 36MB │ --solver=cvc5 │
│ verify  │ fol/stoppable_paxos_forall.fly           │ timeout │   30.0 │     2.1× │     100% │ 81MB │ --solver=cvc5 │
│ verify  │ fol/stoppable_paxos_forall_choosable.fly │ timeout │   30.0 │     0.6× │     100% │ 85MB │ --solver=cvc5 │
│ verify  │ fol/ticket.fly                           │         │    0.3 │     3.9× │     100% │ 27MB │ --solver=cvc5 │
│ verify  │ fol/toy_consensus_epr.fly                │         │    0.4 │     2.1× │     100% │ 34MB │ --solver=cvc5 │
│ verify  │ fol/toy_consensus_forall.fly             │         │    0.1 │     2.6× │      99% │ 19MB │ --solver=cvc5 │
│ verify  │ fol/vertical_paxos_epr.fly               │ timeout │   30.0 │     0.1× │     100% │ 50MB │ --solver=cvc5 │
│ verify  │ fol/vertical_paxos_forall.fly            │ timeout │   30.0 │     0.3× │     100% │ 56MB │ --solver=cvc5 │
│ verify  │ fol/vertical_paxos_forall_choosable.fly  │ timeout │   30.0 │     1.0× │     100% │ 72MB │ --solver=cvc5 │
│ verify  │ lockserver.fly                           │         │    0.1 │     4.7× │     100% │ 15MB │ --solver=cvc5 │
│ verify  │ paxos_epr.fly                            │ timeout │   30.0 │     0.2× │     100% │ 59MB │ --solver=cvc5 │
│ verify  │ pingpong.fly                             │         │    0.0 │     0.6× │      89% │  7MB │ --solver=cvc5 │
╰─────────┴──────────────────────────────────────────┴─────────┴────────┴──────────┴──────────┴──────┴───────────────╯
total:    43
ok:       27
timeout:  13
fail:     3

@vmwclabot commented:

@tchajed, you must sign every commit in this pull request acknowledging our Developer Certificate of Origin before your changes are merged. This can be done by adding Signed-off-by: John Doe <[email protected]> to the last line of each Git commit message. The e-mail address used to sign must match the e-mail address of the Git author. Click here to view the Developer Certificate of Origin agreement.

@wilcoxjay (Collaborator) left a comment:

Looks like a great start, thanks! I only had minor/nonblocking feedback. Feel free to fix what you agree with and then merge.

@tchajed (Collaborator, Author) commented Aug 1, 2023

The fact that we track the signal that killed the process is just over-engineering; we should only record its exit status.

I think the racy behavior is now fixed (see the most recent commits). I don't think this is particularly important (if something is close to the timeout it can count as timing out even if it wasn't killed), but I fixed it since you pointed it out, and the fix wasn't too hard.

The whole thing could be somewhat simplified by just reporting the exit status of our time binary. But when we report results it'll be a lot more obscure, since a timeout will just show up as a magic 124 exit status (a convention I copied from timeout(1)).
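For illustration, a caller of the time binary could classify its exit status like this; the 124-means-timeout convention is the one borrowed from timeout(1), but the Outcome enum and classify function here are hypothetical, not the PR's actual types:

```rust
use std::process::ExitStatus;

#[derive(Debug)]
enum Outcome {
    Success,
    Timeout,
    Fail(i32),
    Killed, // terminated by a signal, so there is no exit code
}

fn classify(status: ExitStatus) -> Outcome {
    match status.code() {
        Some(0) => Outcome::Success,
        Some(124) => Outcome::Timeout, // convention borrowed from timeout(1)
        Some(code) => Outcome::Fail(code),
        None => Outcome::Killed,
    }
}
```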

@wilcoxjay (Collaborator) commented:

I like it!

@odedp (Contributor) commented Aug 1, 2023

This is really great!! I haven't had a careful look at the code, but let me know if you want me to review specific files more carefully.

I have a question about concurrency/parallelism. What does benchmarking measure if we use parallelism? It would be nice to measure both "total CPU time" and "wall clock time", to be able to tell how much we actually exploit parallelism. (I remember we had some discussions about this.)

@tchajed (Collaborator, Author) commented Aug 1, 2023

That's a good reminder about tracking concurrency, Oded. We already had the data: just compare the "user time" (which counts each core separately) with the wall-clock time to get the average core utilization. I tweaked the output to show real time, a utilization ratio that tells us on average how many cores we're using, and the percentage of user time spent in subprocesses (this is a ratio of user times, both of which already account for concurrency).

The results have one effect I don't understand: a couple of benchmarks time out with a utilization of 0 or 0.1, indicating they're spending almost all of their time outside of userspace (like blocking on I/O). That doesn't really make sense to me. It's possible getrusage isn't correctly reporting subprocess time for these benchmarks, but I didn't really look into it. The first thing I'd do is get the numbers on Linux and make sure it's not a macOS issue.
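For reference, the arithmetic behind the two new columns is just ratios of the measured times; the names below are made up for the example, not the PR's actual code:

```rust
use std::time::Duration;

struct RawTimes {
    wall: Duration,
    self_user: Duration,     // user CPU time of temporal-verifier itself
    children_user: Duration, // user CPU time of its subprocesses (the solvers)
}

// "cpu util": user time compared to wall-clock time; > 1.0 means more than
// one core was busy on average.
fn cpu_utilization(t: &RawTimes) -> f64 {
    (t.self_user + t.children_user).as_secs_f64() / t.wall.as_secs_f64()
}

// "solver %": fraction of user time spent in subprocesses. Both numerator
// and denominator are user times, so both already account for concurrency.
fn solver_fraction(t: &RawTimes) -> f64 {
    let total = (t.self_user + t.children_user).as_secs_f64();
    t.children_user.as_secs_f64() / total
}
```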

@edenfrenkel (Collaborator) left a comment:

This is great! I have no technical comments, but a few ideas regarding functionality, mainly things I have in mind for running qalpha benchmarks. I'm not sure whether you'd want to implement any of this in this PR or leave it as future TODOs, so feel free to implement whatever you'd like. This is what I would add:

  • Saving outputs: Currently the stdout of each run is discarded, but it might be interesting to look at, and it can be used to extract relevant statistics. Even for verify, the output associated with fail might be of interest. For this we could have

    • Some dedicated folder (like we have .flyvy-log for .smt2 files) where the output of benchmark runs is saved. We could also have a flag that determines whether the output should be saved or discarded.
    • Once the above is done, we might want some control over the level of logging in the output, e.g. a --log <level> flag which would set RUST_LOG=<level> (or not set any logging if not provided).
    • Finally, there should be a way to access this output from the benchmark code, and to define per-benchmark statistics to present in the printed table. Maybe this should be done by providing a path to the saved file. For example, when I write the benchmarking for qalpha I might want to print the size of the fixpoint found, which I would extract from the output and then add to the table.

    Note that we probably want the saved/processed output to contain both stdout and stderr, because I think the logging is sent to stderr. (A rough sketch of capturing both streams appears after this list.)

  • More flexibility in configuring benchmarks: Currently this uses all examples in the examples folder and runs the benchmark with the same arguments for all of them, but we might want to determine a subset of these files to run a certain benchmark on, with file-specific arguments, or even multiple argument configurations per file. I think this can be done in several ways, for example using some config file that lists the benchmark configurations to run on each file, or by specifying inside the file which benchmark configurations should be run on it, similar to the way snapshot tests work. (Not all benchmarks must use these, just those where it's necessary.)
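As a rough sketch of the output-saving idea from the first bullet (not something implemented in this PR), one could redirect both streams of a run to a file in a dedicated directory and set RUST_LOG when requested; the .flyvy-benchmarks directory, the run_and_save_output name, and the exact invocation of the verifier are all invented for the example:

```rust
use std::fs::{self, File};
use std::path::Path;
use std::process::Command;

fn run_and_save_output(file: &str, log_level: Option<&str>) -> std::io::Result<()> {
    // Hypothetical directory for saved benchmark output.
    let out_dir = Path::new(".flyvy-benchmarks");
    fs::create_dir_all(out_dir)?;
    let log_path = out_dir.join(format!("{}.log", file.replace('/', "__")));
    let log = File::create(&log_path)?;

    // Illustrative invocation; the real benchmark runner drives the verifier differently.
    let mut cmd = Command::new("cargo");
    cmd.args(["run", "-r", "--bin", "temporal-verifier", "--", "verify", file]);
    if let Some(level) = log_level {
        cmd.env("RUST_LOG", level);
    }
    // Send both streams to the same file, since logging goes to stderr.
    cmd.stdout(log.try_clone()?).stderr(log);
    cmd.status()?;
    Ok(())
}
```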

I'm approving, but let me know if you implement any of these changes.


tchajed dismissed wilcoxjay's stale review on August 24, 2023 at 18:40: "addressed comments"

tchajed enabled auto-merge (squash) on August 24, 2023 at 18:41
tchajed merged commit 1541b89 into main on August 24, 2023 (2 checks passed)
tchajed deleted the tchajed/benchmark-timing branch on August 24, 2023 at 18:47