Add benchmarking infrastructure to run verification #142
Conversation
Looks like a great start, thanks! I only had minor/nonblocking feedback. Feel free to fix what you agree with and then merge.
The fact that we track the signal that killed the process is just over-engineering; we should only record its exit status. I think the racy behavior is now fixed (see the most recent commits). I don't think this is particularly important (if something is close to the timeout it can count as timing out even if it wasn't killed), but I fixed it since you pointed it out, and this fix wasn't too hard. The whole thing could be somewhat simplified by just reporting the exit status of our …
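For reference, a minimal sketch of the exit-status-versus-signal distinction on Unix, using only the standard library; the `describe` helper is hypothetical and not the code being discussed here:

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;

/// Summarize how a child process ended: a normal exit code, or (Unix only)
/// the signal that killed it. A benchmark runner that kills a child at the
/// time limit could simply treat "killed by a signal" as a timeout and
/// otherwise record just the exit code.
fn describe(status: ExitStatus) -> String {
    match status.code() {
        Some(code) => format!("exited with status {}", code),
        // `code()` is None when the process was terminated by a signal.
        None => match status.signal() {
            Some(sig) => format!("killed by signal {}", sig),
            None => String::from("ended without an exit code"),
        },
    }
}
```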
I like it!
This is really great!! I haven't had a careful look at the code, but let me know if you want me to review specific files more carefully. I have a question about concurrency/parallelism. What does benchmarking measure if we use parallelism? It would be nice to measure both "total CPU time" and "wall clock time", to be able to tell how much we actually exploit parallelism. (I remember we had some discussions about this.)
That's a good reminder about tracking concurrency, Oded. We already had the data: just compare the "user time" (which counts each core separately) with the wall-clock time to get the average core utilization. I tweaked the output to show real time, a utilization ratio (which tells us on average how many cores we're using), and the percentage of user time spent in subprocesses (this is a ratio of user times, both of which take concurrency into account). The results have one effect I don't understand: a couple of benchmarks time out with a utilization of 0 or 0.1, indicating they're spending all of their time outside of userspace (like blocking on I/O). That doesn't really make sense to me. It's possible …
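As a hedged illustration of the arithmetic described above (made-up function names, not code from this PR):

```rust
use std::time::Duration;

/// Average core utilization: user CPU time (summed over all cores) divided
/// by wall-clock time. A value near the core count means good parallelism;
/// a value near 0 means the process was mostly not running in userspace
/// (e.g. blocked on I/O or sleeping).
fn utilization(user: Duration, real: Duration) -> f64 {
    if real.is_zero() {
        return 0.0;
    }
    user.as_secs_f64() / real.as_secs_f64()
}

/// Share of user time spent in subprocesses. Both terms are user times and
/// already account for concurrency, so a plain ratio is meaningful.
fn subprocess_share(self_user: Duration, children_user: Duration) -> f64 {
    let total = self_user + children_user;
    if total.is_zero() {
        return 0.0;
    }
    children_user.as_secs_f64() / total.as_secs_f64()
}
```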
This is great! I have no technical comments, but a few ideas regarding functionality, mainly things I have in mind for running qalpha benchmarks. I'm not sure whether you'd want to implement any of this in this PR or to leave it as future TODOs, so feel free to implement whatever you'd like. This is what I would add:

- Saving outputs: Currently the stdout of each run is discarded, but it might be interesting to look at, and can be used to extract relevant statistics. Even for `verify`, the output associated with a `fail` might be of interest (a rough sketch of this appears after this comment). For this we could have:
  - Some dedicated folder (like we have `.flyvy-log` for `.smt2` files) where the output of benchmark runs is saved. We can also have a flag that determines whether the output should be saved or discarded.
  - Once the above is done, we might want some control over the level of logging in the output, e.g. using a `--log level` flag which would set `RUST_LOG=level` (or not set any logging if not provided).
  - Finally, there should be a way to access this output from the benchmark code, and to define per-benchmark statistics to present in the printed table. Maybe this should be done by providing a path to the saved file. For example, when I write the benchmarking for `qalpha` I might want to print the size of the fixpoint found, which I would extract from the output and then add to the table. Note that we probably want the saved/processed output to contain both `stdout` and `stderr`, because I think the logging is sent to `stderr`.
- More flexibility in configuring benchmarks: Currently this uses all examples in the `examples` folder and runs the benchmark with the same arguments for all of them, but we might want to determine a subset of these files to run a certain benchmark on, with file-specific arguments, or even multiple argument configurations per file. I think this can be done in several ways, for example using some config file that lists the benchmark configurations to run on each file, or by specifying inside the file which benchmark configurations should be run on it, similar to the way snapshot tests work. (Not all benchmarks must use these, just those where it's necessary.)
I'm approving, but let me know if you implement any of these changes.
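A minimal sketch of the "saving outputs" suggestion, using only the standard library; the `run_and_save` helper, the directory layout, and the file naming are hypothetical, not part of this PR:

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::process::Command;

/// Run a benchmark command, forward an optional log level as RUST_LOG, and
/// save its combined stdout/stderr to a file under `log_dir`. Returns the
/// path to the saved output so per-benchmark statistics can later be
/// extracted from it.
fn run_and_save(
    mut cmd: Command,
    log_dir: &Path,
    name: &str,
    log_level: Option<&str>,
) -> std::io::Result<PathBuf> {
    if let Some(level) = log_level {
        // Mirrors the suggested `--log level` flag.
        cmd.env("RUST_LOG", level);
    }
    // `output()` captures both stdout and stderr.
    let output = cmd.output()?;
    fs::create_dir_all(log_dir)?;
    let path = log_dir.join(format!("{}.out", name));
    // Keep stdout and stderr together, since logging typically goes to stderr.
    let mut contents = output.stdout;
    contents.extend_from_slice(&output.stderr);
    fs::write(&path, contents)?;
    Ok(path)
}
```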
The user-visible part of this is a library that allows running `temporal-verifier` with a timeout and gathering some runtime resource usage statistics as well as exit status, and a function to produce a nice table of all the results. I've used this in the simplest possible way to generate the results of running `temporal-verifier verify` over all the benchmarks, with the results shown below.

This is the core infrastructure needed for #118. The remaining work (hopefully small) is to add configurations to do something similar for running qalpha and the bounded model checkers. These will be a bit more work since those have more parameters to configure.

The infrastructure to monitor a benchmark is highly Unix-specific, so it works on macOS and Linux and will not work on Windows (but it should work on WSL).
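For illustration, a rough sketch of one way such Unix-specific monitoring can be done, assuming the `libc` crate; the `run_with_timeout` helper is hypothetical and not the code in this PR:

```rust
use std::process::Command;
use std::time::{Duration, Instant};

/// Run `cmd` with a wall-clock time limit, then report whether it timed
/// out, its real (wall-clock) time, and the user CPU time of waited-for
/// children (via getrusage, which makes this Unix-only).
fn run_with_timeout(
    mut cmd: Command,
    limit: Duration,
) -> std::io::Result<(bool, Duration, Duration)> {
    let start = Instant::now();
    let mut child = cmd.spawn()?;
    // Poll until the child exits or the time limit is reached.
    let timed_out = loop {
        match child.try_wait()? {
            Some(_status) => break false,
            None if start.elapsed() >= limit => {
                child.kill()?;
                child.wait()?;
                break true;
            }
            None => std::thread::sleep(Duration::from_millis(50)),
        }
    };
    let real = start.elapsed();
    // User CPU time accumulated by children this process has waited for.
    let mut ru: libc::rusage = unsafe { std::mem::zeroed() };
    unsafe { libc::getrusage(libc::RUSAGE_CHILDREN, &mut ru) };
    let user = Duration::new(ru.ru_utime.tv_sec as u64, ru.ru_utime.tv_usec as u32 * 1_000);
    Ok((timed_out, real, user))
}
```

A real runner would also record the child's exit status and distinguish a kill at the time limit from a normal exit.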
To produce the first table, run `cargo run -r --bin benchmark -- verify --time-limit 30s`. To produce the second table (with the cvc5 solver), run `cargo run --bin benchmark -- verify --time-limit 30s --solver cvc5`.