Speculative logging support in Aptos VM #6708

gelash · 2023-02-20T20:28:50Z

The speculative state helper is now a separate (and well-tested) crate. This PR integrates it into aptos-vm: the speculative log state is initialized for every block and flushed at the end of block execution.

Speculative logs are cleared when a parallel execution aborts, and all logs are cleared when a module r/w conflict triggers the fallback to sequential execution.
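For readers unfamiliar with the idea, the lifecycle described above (initialize per block, buffer while executing, clear on abort or fallback, flush at the end) might look roughly like the sketch below. The type `SpeculativeLogBuffer` and all of its methods are hypothetical names chosen for illustration; this is not the API of the actual crate.

```rust
use std::sync::Mutex;

/// Hypothetical per-block buffer: one slot of pending log lines per transaction index.
pub struct SpeculativeLogBuffer {
    slots: Vec<Mutex<Vec<String>>>,
}

impl SpeculativeLogBuffer {
    /// Created once per block, sized to the number of transactions in it.
    pub fn new(num_txns: usize) -> Self {
        Self {
            slots: (0..num_txns).map(|_| Mutex::new(Vec::new())).collect(),
        }
    }

    /// Buffer a message instead of logging it immediately.
    pub fn record(&self, txn_idx: usize, message: impl Into<String>) {
        self.slots[txn_idx].lock().unwrap().push(message.into());
    }

    /// Drop what an aborted speculative execution of `txn_idx` recorded.
    pub fn clear_txn(&self, txn_idx: usize) {
        self.slots[txn_idx].lock().unwrap().clear();
    }

    /// Drop everything, e.g. before falling back to sequential execution.
    pub fn clear_all(&self) {
        for slot in &self.slots {
            slot.lock().unwrap().clear();
        }
    }

    /// At the end of the block, hand the surviving messages to `sink` in transaction order.
    pub fn flush(self, mut sink: impl FnMut(usize, String)) {
        for (idx, slot) in self.slots.into_iter().enumerate() {
            for message in slot.into_inner().unwrap() {
                sink(idx, message);
            }
        }
    }
}
```

In this sketch a re-executed transaction would call `clear_txn` before recording again, so only messages from its final execution reach `sink`.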

The current drawback of this approach, as brought up by @wrwg, is that if there is a crash, entries still sitting in the speculative buffer may never be logged. I will create a GitHub issue to track and fix this (e.g. by committing at a rolling granularity, or by flushing from crash/panic handlers). If all works as intended, the current PR should still be an incremental improvement over what we have now. Once landed, it will also allow @danielxiangzl to proceed with the early-termination PR for parallel execution, and then with supporting per-block limits.
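One possible direction for the crash case, sketched under assumptions and not part of this PR: a guard whose `Drop` implementation flushes whatever is buffered, so the flush also runs while a panic unwinds the stack. `FlushOnDrop` and `run_block` are hypothetical names. Note that this only helps with unwinding panics; it does nothing under `panic = "abort"` or when the process is killed.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical guard: flushes buffered messages when dropped, including
/// when the drop happens because a panic is unwinding the stack.
pub struct FlushOnDrop {
    pending: Vec<String>,
    sink: Arc<Mutex<Vec<String>>>,
}

impl FlushOnDrop {
    pub fn new(sink: Arc<Mutex<Vec<String>>>) -> Self {
        Self { pending: Vec::new(), sink }
    }

    /// Buffer a message; nothing reaches the sink until the guard is dropped.
    pub fn record(&mut self, message: impl Into<String>) {
        self.pending.push(message.into());
    }
}

impl Drop for FlushOnDrop {
    fn drop(&mut self) {
        // Tolerate a poisoned lock: we may be running during a panic.
        let mut sink = self.sink.lock().unwrap_or_else(|e| e.into_inner());
        sink.append(&mut self.pending);
    }
}

/// Runs `body` with a guard and reports whether it panicked.
pub fn run_block(
    sink: Arc<Mutex<Vec<String>>>,
    body: impl FnOnce(&mut FlushOnDrop),
) -> bool {
    let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(move || {
        let mut guard = FlushOnDrop::new(sink);
        body(&mut guard);
        // `guard` is dropped here on success, or during unwinding on panic.
    }));
    result.is_err()
}
```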

Another limitation is that this currently covers aptos-vm only; logs coming from move-vm are unaffected.

zekun000

looks great, it'd be good to try it out on some workload that previously output speculative logs

zekun000 · 2023-02-23T23:27:10Z

aptos-move/aptos-vm-logging/src/lib.rs

+                // TODO: Consider using SpeculativeCounter to increase CRITICAL_ERRORS
+                // on the critical path instead of async dispatching.
+                error!(self.context, "{}", self.message);
+                CRITICAL_ERRORS.inc();


hmm, is every single error message in the VM considered a critical error?

It was like this, so I kept the behavior, but yeah, weird. Why not just alert on "error"?
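To make the snippet under discussion concrete, here is a minimal sketch of a buffered log entry that writes the error and bumps the counter only when it is dispatched, that is, only for entries that survive speculation. `DeferredLog`, `Level`, and `CRITICAL_ERRORS_SKETCH` are hypothetical; the atomic is merely a stand-in for the real `CRITICAL_ERRORS` metric, whose type is not shown in this thread.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Stand-in for the real `CRITICAL_ERRORS` metric (assumed here to be a plain atomic).
pub static CRITICAL_ERRORS_SKETCH: AtomicU64 = AtomicU64::new(0);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Warn,
    Error,
}

/// Hypothetical buffered log entry.
pub struct DeferredLog {
    pub level: Level,
    pub message: String,
}

impl DeferredLog {
    /// Called only at flush time, so aborted executions never reach this point.
    pub fn dispatch(self, out: &mut Vec<String>) {
        out.push(format!("{:?}: {}", self.level, self.message));
        // Mirrors the behavior questioned above: every error counts as critical.
        if self.level == Level::Error {
            CRITICAL_ERRORS_SKETCH.fetch_add(1, Ordering::Relaxed);
        }
    }
}
```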

aptos-move/aptos-vm-logging/src/lib.rs

aptos-move/aptos-vm-logging/src/log_schema.rs

gelash · 2023-02-24T16:43:35Z

> looks great, it'd be good to try it out on some workload that previously output speculative logs

@danielxiangzl - could you try your new PR with Dario's txn load generator? It has module publishing and reading conflicts, so without this PR it should print the storage errors, and with it, it shouldn't. That would only check the clear on fallback, but it's better than nothing. @zekun000, any ideas on how to check the actual logs from speculative txn executions that are later aborted?

@igor-aptos may know how to run (executor?) benchmark with Dario's transactions? (publishing etc)

davidiw

this is really cool, now we'll stop getting errant messages and alerts firing due to behavior that won't actually be committed / violate the blockstm semantics?

gelash · 2023-02-27T13:43:47Z

> this is really cool, now we'll stop getting errant messages and alerts firing due to behavior that won't actually be committed / violate the blockstm semantics?

thx, yep! some outstanding issues to follow-up on but should improve the state of things atm.

gelash · 2023-03-08T14:03:58Z

I hacked speculative logging into one of the proptests and verified that it works, i.e. only one log appears per transaction when speculative logging is enabled.
The test finishing would stop the flush from actually logging, so I had to add a wait. This should not be an issue during normal operation, but it might be a consideration when we support flushes in crash scenarios (#6794); I added a comment there as well.
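The wait mentioned above can be illustrated with a small sketch. It assumes the flush hands its messages to a background thread, which is an assumption about the mechanism rather than something stated in this thread; `flush_async` is a hypothetical function, and the channel receiver stands in for the logger.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread::{self, JoinHandle};

/// Hypothetical asynchronous flush: the messages are moved to a background
/// thread, which forwards them to the returned receiver (standing in for the logger).
pub fn flush_async(messages: Vec<String>) -> (JoinHandle<()>, Receiver<String>) {
    let (tx, rx) = channel();
    let handle = thread::spawn(move || {
        for message in messages {
            // Ignore send errors: the receiver may already be gone.
            let _ = tx.send(message);
        }
    });
    (handle, rx)
}
```

A test that returns right after calling `flush_async` may exit before the background thread has emitted anything; joining the returned handle first makes the outcome deterministic.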

github-actions · 2023-03-08T14:49:33Z

✅ Forge suite `land_blocking` success on `2d5beda29dac32783a33c993775a91b13b3d09a5`

performance benchmark with full nodes : 6080 TPS, 6540 ms latency, 8700 ms p99 latency,no expired txns
Test Ok

github-actions · 2023-03-08T14:50:10Z

✅ Forge suite `compat` success on `testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b` ==> `2d5beda29dac32783a33c993775a91b13b3d09a5`

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 2d5beda29dac32783a33c993775a91b13b3d09a5 (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7985 TPS, 4788 ms latency, 7900 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 2d5beda29dac32783a33c993775a91b13b3d09a5
compatibility::simple-validator-upgrade::single-validator-upgrade : 5096 TPS, 7750 ms latency, 10800 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 2d5beda29dac32783a33c993775a91b13b3d09a5
compatibility::simple-validator-upgrade::half-validator-upgrade : 4760 TPS, 8070 ms latency, 10400 ms p99 latency,no expired txns
4. upgrading second batch to new version: 2d5beda29dac32783a33c993775a91b13b3d09a5
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6956 TPS, 5454 ms latency, 8800 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 2d5beda29dac32783a33c993775a91b13b3d09a5 passed
Test Ok

github-actions · 2023-03-08T14:50:13Z

✅ Forge suite `framework_upgrade` success on `cb4ba0a57c998c60cbab65af31a64875d2588ca5` ==> `2d5beda29dac32783a33c993775a91b13b3d09a5`

Compatibility test results for cb4ba0a57c998c60cbab65af31a64875d2588ca5 ==> 2d5beda29dac32783a33c993775a91b13b3d09a5 (PR)
Upgrade the nodes to version: 2d5beda29dac32783a33c993775a91b13b3d09a5
framework_upgrade::framework-upgrade::full-framework-upgrade : 6984 TPS, 5480 ms latency, 7600 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for cb4ba0a57c998c60cbab65af31a64875d2588ca5 ==> 2d5beda29dac32783a33c993775a91b13b3d09a5 passed
Test Ok

gelash requested review from ibalajiarun, JoshLind, runtian-zhou and vgao1996 February 20, 2023 20:28

gelash requested review from zekun000, sasha8, danielxiangzl, davidiw and wrwg as code owners February 20, 2023 20:28

gelash added the `CICD:run-e2e-tests` label (when this label is present, GitHub Actions will run all land-blocking e2e tests from the PR) Feb 20, 2023