Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speculative logging support in Aptos VM #6708

Merged
merged 1 commit into from
Mar 8, 2023
Merged

Speculative logging support in Aptos VM #6708

merged 1 commit into from
Mar 8, 2023

Conversation

gelash
Copy link
Contributor

@gelash gelash commented Feb 20, 2023

Speculative state helper is a separate (and well-tested) crate now, this integrates into aptos-vm, initializes per every block and flushes at the end of block execution.

speculative logs are cleared on parallel execution abort, and also all logs are cleared during module r/w fallback to sequential execution.

the drawback of this approach currently, as brought up by @wrwg is that if there are crashes, we may not log some entries stuck in the speculative buffer. I will create a github issue for tracking and fixing this (i.e. by rolling commit granularity, or something flushing on crash/panic handlers, etc), but the current PR if all works as intended, should still provide an incremental improvement to what we have now. Once landed, it will also allow @danielxiangzl to proceed w. early termination PR for parallel execution -> and then supporting per block limits.

Another limitation is that this is currently for aptos-vm only, logs coming from move-vm are unaffected.

@gelash gelash added the CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR label Feb 20, 2023
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

Copy link
Contributor

@zekun000 zekun000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great, it'd be good to try it out on some workload that previously output speculative logs

// TODO: Consider using SpeculativeCounter to increase CRITICAL_ERRORS
// on the critical path instead of async dispatching.
error!(self.context, "{}", self.message);
CRITICAL_ERRORS.inc();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, is every single error message in vm considered critical error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was like this so I kept the behavior, but yeah, weird. Why not just alert on "error"

aptos-move/aptos-vm-logging/src/lib.rs Outdated Show resolved Hide resolved
aptos-move/aptos-vm-logging/src/lib.rs Outdated Show resolved Hide resolved
aptos-move/aptos-vm-logging/src/log_schema.rs Outdated Show resolved Hide resolved
@gelash
Copy link
Contributor Author

gelash commented Feb 24, 2023

looks great, it'd be good to try it out on some workload that previously output speculative logs

@danielxiangzl - could you try your new PR with Dario's txn load generator - it has module publishing and reading conflicts. So without this PR it should print the storage errors, and with this it shouldn't. It would check the clear for fallback, but it's better than nothing. @zekun000 any ideas on how to check actual logs from speculative, later aborted txn executions?

@igor-aptos may know how to run (executor?) benchmark with Dario's transactions? (publishing etc)

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment has been minimized.

Copy link
Contributor

@davidiw davidiw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really cool, now we'll stop getting errant messages and alerts firing due to behavior that won't actually be committed / violate the blockstm semantics?

@gelash
Copy link
Contributor Author

gelash commented Feb 27, 2023

this is really cool, now we'll stop getting errant messages and alerts firing due to behavior that won't actually be committed / violate the blockstm semantics?

thx, yep! some outstanding issues to follow-up on but should improve the state of things atm.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@gelash gelash force-pushed the gelash/vmlogging branch from 571f8b2 to 2d5beda Compare March 8, 2023 14:01
@gelash
Copy link
Contributor Author

gelash commented Mar 8, 2023

I checked hacked speculative logging into one of the proptests and verified it works - i.e. it ensures only one log appears per transaction when speculative logging is enabled.
The test finishing would stop flush from actually logging, so I had to add a wait. Should not be an issue during normal operation, but might be a consideration for when we support flushes in crash scenarios (#6794) - added a comment there as well.

@gelash gelash enabled auto-merge (squash) March 8, 2023 14:04
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 8, 2023

✅ Forge suite land_blocking success on 2d5beda29dac32783a33c993775a91b13b3d09a5

performance benchmark with full nodes : 6080 TPS, 6540 ms latency, 8700 ms p99 latency,no expired txns
Test Ok

@github-actions
Copy link
Contributor

github-actions bot commented Mar 8, 2023

✅ Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 2d5beda29dac32783a33c993775a91b13b3d09a5

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 2d5beda29dac32783a33c993775a91b13b3d09a5 (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7985 TPS, 4788 ms latency, 7900 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 2d5beda29dac32783a33c993775a91b13b3d09a5
compatibility::simple-validator-upgrade::single-validator-upgrade : 5096 TPS, 7750 ms latency, 10800 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 2d5beda29dac32783a33c993775a91b13b3d09a5
compatibility::simple-validator-upgrade::half-validator-upgrade : 4760 TPS, 8070 ms latency, 10400 ms p99 latency,no expired txns
4. upgrading second batch to new version: 2d5beda29dac32783a33c993775a91b13b3d09a5
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6956 TPS, 5454 ms latency, 8800 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 2d5beda29dac32783a33c993775a91b13b3d09a5 passed
Test Ok

@github-actions
Copy link
Contributor

github-actions bot commented Mar 8, 2023

✅ Forge suite framework_upgrade success on cb4ba0a57c998c60cbab65af31a64875d2588ca5 ==> 2d5beda29dac32783a33c993775a91b13b3d09a5

Compatibility test results for cb4ba0a57c998c60cbab65af31a64875d2588ca5 ==> 2d5beda29dac32783a33c993775a91b13b3d09a5 (PR)
Upgrade the nodes to version: 2d5beda29dac32783a33c993775a91b13b3d09a5
framework_upgrade::framework-upgrade::full-framework-upgrade : 6984 TPS, 5480 ms latency, 7600 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for cb4ba0a57c998c60cbab65af31a64875d2588ca5 ==> 2d5beda29dac32783a33c993775a91b13b3d09a5 passed
Test Ok

@gelash gelash merged commit e1a0237 into main Mar 8, 2023
@gelash gelash deleted the gelash/vmlogging branch March 8, 2023 14:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CICD:run-e2e-tests when this label is present github actions will run all land-blocking e2e tests from the PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants