
e2e counters split out by top usecases #14031

Merged: 2 commits merged into main from igor/use_case_visibility on Jul 23, 2024

Conversation

igor-aptos (Contributor)

Description

Track the frequency of use cases in recent blocks (or, during state sync, recent batches), and then report latency counters separately for the top use cases, as well as for framework and others (i.e. scripts).

We can then use this for monitoring, gas estimation, and more.
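
For readers skimming the diff, here is a minimal sketch of the shape of this mechanism: classify each transaction into a use case key, keep per-block counts over a sliding window, and report the top few separately. Type and field names below are illustrative, not the PR's exact definitions.

```rust
use std::collections::{HashMap, VecDeque};

// Illustrative key, mirroring the variants referenced in the review below:
// framework ("Platform"), a user contract keyed by address, or everything else.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum UseCaseKey {
    Platform,
    ContractAddress([u8; 32]),
    Others,
}

// Sliding window of per-block use case counts; the top-N set is recomputed
// over the whole window so the reported labels stay stable block to block.
struct UseCaseHistory {
    num_blocks_to_track: usize,
    // How many of the most frequent use cases get their own label.
    num_top_to_track: usize,
    recent: VecDeque<HashMap<UseCaseKey, usize>>,
}

impl UseCaseHistory {
    fn new(num_blocks_to_track: usize, num_top_to_track: usize) -> Self {
        Self {
            num_blocks_to_track,
            num_top_to_track,
            recent: VecDeque::with_capacity(num_blocks_to_track),
        }
    }

    // Record the counts observed in the newest block and evict the oldest one.
    fn add_block(&mut self, counts: HashMap<UseCaseKey, usize>) {
        self.recent.push_back(counts);
        while self.recent.len() > self.num_blocks_to_track {
            self.recent.pop_front();
        }
    }
}
```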

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Other (specify)

How Has This Been Tested?

Key Areas to Review

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation


trunk-io bot commented Jul 17, 2024

⏱️ 6h 52m total CI duration on this PR
Job | Cumulative Duration | Recent Runs
test-fuzzers 3h 31m 🟩🟥
forge-compat-test / forge 35m 🟩
execution-performance / single-node-performance 21m 🟩
forge-framework-upgrade-test / forge 17m 🟩
forge-e2e-test / forge 17m 🟩
rust-move-tests 15m 🟩
rust-cargo-deny 10m 🟩🟩🟩🟩 (+2 more)
rust-move-tests 10m 🟩
general-lints 10m 🟩🟩🟩🟩 (+2 more)
rust-move-tests 10m 🟩
rust-move-tests 10m 🟩
rust-move-tests 10m 🟩
check-dynamic-deps 7m 🟩🟩🟩🟩🟩 (+2 more)
rust-doc-tests 6m 🟩
test-target-determinator 5m 🟩
execution-performance / test-target-determinator 4m 🟩
check 4m 🟩
semgrep/ci 3m 🟩🟩🟩🟩🟩 (+2 more)
adhoc-forge-test / forge 3m 🟥🟥
file_change_determinator 1m 🟩🟩🟩🟩🟩 (+2 more)
file_change_determinator 1m 🟩🟩🟩🟩🟩 (+2 more)
rust-move-tests 58s
permission-check 23s 🟩🟩🟩🟩🟩 (+2 more)
permission-check 22s 🟩🟩🟩🟩🟩 (+2 more)
permission-check 21s 🟩🟩🟩🟩🟩 (+2 more)
permission-check 21s 🟩🟩🟩🟩🟩 (+2 more)
file_change_determinator 16s 🟩
determine-forge-run-metadata 5s 🟩🟩
Backport PR 3s 🟥
permission-check 2s 🟩
determine-docker-build-metadata 2s 🟩
permission-check 2s 🟩
rust-move-tests 1s

🚨 3 jobs on the last run were significantly faster/slower than expected

Job | Duration | vs 7d avg | Delta
forge-compat-test / forge | 35m | 14m | +144%
forge-framework-upgrade-test / forge | 17m | 11m | +58%
test-fuzzers | 56m | 38m | +48%


@igor-aptos igor-aptos force-pushed the igor/use_case_visibility branch from bfae762 to 91533a3 Compare July 17, 2024 22:45
@msmouse msmouse left a comment (Contributor)

beautiful

if module_id.address().is_special() {
    Platform
} else {
    // n.b. Generics ignored.
Contributor

hmm.. fix my bug by deleting this comment, thank you.. :trollface:

Comment on lines 39 to 55
let mut total = HashMap::new();
for group in &self.recent {
    for (use_case, count) in group {
        if use_case != &UseCaseKey::Platform && use_case != &UseCaseKey::Others {
            *total.entry(use_case.clone()).or_insert(0) += count;
        }
    }
}
Contributor

Tracking a global `total: HashMap<>` would avoid traversing the whole queue every time (at the cost of some space)?

Contributor Author

Oh, you mean a total on top of the individual ones for the window, and then instead of removing one, subtracting it?

We are doing this once per block, so not critical I think.

It's potentially a little bit more efficient (like 40x - the size of the window), but if anything goes wrong with doing -= count, we might have wrong values forever.

still probably worth doing :)

Contributor

> it's potentially a lit bit more efficient (like 40x

like a cryptographer talking 😀
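
For concreteness, here is a sketch of the incremental variant being suggested: maintain a running total next to the window, add counts when a block enters and subtract them when one is evicted. Illustrative code, generic over the key type, not the PR's implementation.

```rust
use std::collections::{HashMap, VecDeque};
use std::hash::Hash;

// Keep a running `total` next to the window of per-block counts, adding when
// a block enters and subtracting when one is evicted, instead of re-summing
// the whole window on every query.
struct WindowedTotals<K: Eq + Hash + Clone> {
    window: usize,
    recent: VecDeque<HashMap<K, usize>>,
    total: HashMap<K, usize>,
}

impl<K: Eq + Hash + Clone> WindowedTotals<K> {
    fn new(window: usize) -> Self {
        Self {
            window,
            recent: VecDeque::with_capacity(window),
            total: HashMap::new(),
        }
    }

    fn add_block(&mut self, counts: HashMap<K, usize>) {
        for (key, count) in &counts {
            *self.total.entry(key.clone()).or_insert(0) += *count;
        }
        self.recent.push_back(counts);
        if self.recent.len() > self.window {
            for (key, count) in self.recent.pop_front().unwrap() {
                // The trade-off raised above: if this subtraction ever drifts,
                // the totals stay wrong for the rest of the process lifetime.
                let remaining = self
                    .total
                    .get_mut(&key)
                    .map(|t| {
                        *t = t.saturating_sub(count);
                        *t
                    })
                    .unwrap_or(0);
                if remaining == 0 {
                    self.total.remove(&key);
                }
            }
        }
    }
}
```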

Comment on lines 52 to 87
let mut sorted = total.into_iter().collect::<Vec<_>>();
sorted.sort_by_key(|(_, count)| *count);

for _ in 0..self.num_top_to_track {
    if let Some((use_case, _)) = sorted.pop() {
        result.insert(use_case);
    }
}

result
Contributor

probably doesn't matter, but perfect use case for a BinaryHeap<(usize, UseCaseKey)> 😂

Contributor Author

You mean to use a heap instead of total, or just instead of sorted?

If instead of total, I'm not sure how to update individual values per use case.

If instead of sorting - does Rust have a way to convert a vector to a heap in O(n)? If inserting from an iterator, the complexity will be the same.

Contributor

I mean sorting.
true it's the same O, but on average it must be faster than completely sorting them, right?
https://doc.rust-lang.org/std/collections/struct.BinaryHeap.html#time-complexity-3

but again, this probably matters too little even for me

Contributor Author (@igor-aptos, Jul 17, 2024)

heap can be created in O(n), worst case, but I don't see rust providing such API? that would be nice to use :)

Contributor

I think even pushing one by one is probably meaningfully faster -- as it grows most likely a new item doesn't need to climb many levels of the heap -- that's why they estimate ~O(1) for pushing as well

Contributor Author

That's the average case. Reverse-sorted input would be O(n log n). But yeah, I can switch to a heap; unfortunate I cannot get O(n).
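
As a side note, a small sketch of the heap-based top-k selection discussed here. Rust's std does expose an O(n) heapify through `BinaryHeap`'s `From<Vec<T>>` impl, so the build step does not have to go through repeated pushes; the function below is illustrative, not the PR's code.

```rust
use std::collections::BinaryHeap;

// Pick the k keys with the highest counts: heapify in O(n), then pop k times
// (O(k log n)) instead of fully sorting in O(n log n).
// `K: Ord` is only needed to order ties within equal counts.
fn top_k_by_count<K: Ord>(counts: Vec<(K, usize)>, k: usize) -> Vec<K> {
    // BinaryHeap::from(Vec<T>) heapifies in place in O(n).
    let mut heap: BinaryHeap<(usize, K)> = BinaryHeap::from(
        counts
            .into_iter()
            .map(|(key, count)| (count, key))
            .collect::<Vec<_>>(),
    );
    let mut top = Vec::with_capacity(k);
    while top.len() < k {
        match heap.pop() {
            Some((_count, key)) => top.push(key),
            None => break,
        }
    }
    top
}
```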

@igor-aptos igor-aptos force-pushed the igor/use_case_visibility branch 2 times, most recently from 254da3c to 54ee359 Compare July 18, 2024 22:02

counters::TXN_E2E_USE_CASE_COMMIT_LATENCY
    .with_label_values(&[
        &use_case_label,
Contributor

This is going to blow up the number of counters no?

Contributor Author

argh.

@sitalkedia - at any given block, we are exporting at most usecase_stats_num_top_to_track = 5 distinct values (the top is computed over 40 blocks, to be more stable).

So at any individual point in time it should be fine.

Over time, these might change. From the prometheus side, I don't think that is a problem - each forge test gets a new set of counters.

But the histogram will keep exporting unchanged values too (and using memory for them). Do you know of a way to only keep track of new values and export only those? Otherwise, I'll need to change this to contract_top1 / contract_top2 / ...

Contributor

Yeah, my concern is the number of dimensions being tracked over time. I'm not a prometheus expert, so I can't say for sure if it's a problem, but we need to understand the implications of this before landing this change.

Contributor Author

As is, it's also a problem that metrics that were created but stopped being updated continue being exported - the count is cumulative from the start of the process.

I'll change it to contract_top1 / ..., and log the contract_top1 <-> address mapping, as I don't have a better suggestion.

Contributor

I doubt our current prometheus db can support this churn. We are talking about 5 labels * 4 blocks per second = 20 unique labels per second per peer. And if we end up getting unique labels for minutes, that will definitely overwhelm the db.

Forge is short-lived: when the node dies, its labels go away with it. For mainnet, the validator will keep exporting the metric as long as it's alive, even if you record it only once.

cc: @geekflyer

Contributor (@geekflyer, Jul 19, 2024)

btw, have we thought about tracking this via logs or traces? If, for example, we had this behind a feature flag, we could have one of the internal validators emit a log line for each observed TX. In terms of volume I think this should be fine as long as we don't emit this for every single node. The main thing is that with logs/traces we don't have any limits on cardinality or churn.

Contributor

On 2., here's a related issue: tikv/rust-prometheus#336 (this is the lib we're using at the moment afaik).

Contributor (@geekflyer, Jul 19, 2024)

btw, just to add: I think our TSDB can possibly support this much cardinality/churn (since this is just a single metric), but the issue is the unboundedness of cardinality due to non-expiring metrics in the process registry, which amplifies the problem by a lot as time goes by.

Contributor

I think we've been doing similar things for the module+function already?

Contributor Author

Here we kind of have to have it for all of them, because only the PFN a transaction is submitted to can compute its latency correctly.

But yeah, we have 2 fullnodes logging detailed counters - but those are keyed on module name, so republishing a contract under a different address shows up in the same counter. Not sure if the cardinalities there are causing any issues, or whether we need to revisit those counters - https://aptoslabs.grafana.net/goto/fxPdCAXSR?orgId=1

For here, I've basically changed the cardinality to a fixed set by using entry_user_top_1 ... entry_user_top_5, which should be enough.
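
To illustrate the fixed-label approach, here is a rough sketch using the `prometheus` and `once_cell` crates. The metric name, label values, and helper function are assumptions for illustration, not the PR's exact counter definitions.

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram_vec, HistogramVec};

// Histogram with a small, fixed label set, so cardinality stays bounded no
// matter how many distinct contracts appear over the lifetime of the process.
static TXN_E2E_USE_CASE_COMMIT_LATENCY: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "txn_e2e_use_case_commit_latency",
        "End-to-end commit latency, split by use case bucket",
        &["use_case"]
    )
    .unwrap()
});

// Map a transaction's rank among the tracked top use cases to a fixed label:
// "entry_user_top_1" ... "entry_user_top_5", or "others" for everything else.
fn use_case_label(rank_among_top: Option<usize>) -> String {
    match rank_among_top {
        Some(rank) => format!("entry_user_top_{}", rank + 1),
        None => "others".to_string(),
    }
}

fn observe_commit_latency(rank_among_top: Option<usize>, latency_secs: f64) {
    let label = use_case_label(rank_among_top);
    TXN_E2E_USE_CASE_COMMIT_LATENCY
        .with_label_values(&[label.as_str()])
        .observe(latency_secs);
}
```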

@igor-aptos igor-aptos force-pushed the igor/use_case_visibility branch from 54ee359 to e8fcf23 Compare July 22, 2024 21:07
@igor-aptos igor-aptos requested a review from sitalkedia July 22, 2024 21:07
@igor-aptos (Contributor Author)

@sitalkedia updated

@sitalkedia sitalkedia left a comment (Contributor)

Minor nit around possibly frequent logging.

for i in 0..self.num_top_to_track {
    if let Some(UseCaseByCount { use_case, count }) = max_heap.pop() {
        info!(
            "Use cases: top {} is {:?} with {} occurences",
Contributor

Is this going to log too frequently and flood the logs?

Contributor Author

5 times per block, so 20 per second.

Is that too frequent? I can just merge it into a single line per block, that's probably fine?

Contributor

That's probably fine, but a single line per block is maybe better?

Contributor Author

updated
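
For reference, the merged per-block line could look roughly like the sketch below (using the standard `log` crate's `info!`; the PR uses its own logging macros, and the exact message format may differ).

```rust
use log::info;

// One log line per block summarizing the tracked top use cases, instead of
// one line per use case (which would be ~5 per block, ~20 per second).
fn log_top_use_cases(top: &[(String, usize)]) {
    let summary = top
        .iter()
        .enumerate()
        .map(|(i, (use_case, count))| {
            format!("top {}: {} ({} occurrences)", i + 1, use_case, count)
        })
        .collect::<Vec<_>>()
        .join(", ");
    info!("Use cases in block: {}", summary);
}
```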

@igor-aptos igor-aptos force-pushed the igor/use_case_visibility branch from e8fcf23 to 799c3f1 Compare July 23, 2024 00:19
@igor-aptos igor-aptos enabled auto-merge (squash) July 23, 2024 18:22

Contributor

✅ Forge suite realistic_env_max_load success on 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b

two traffics test: inner traffic : committed: 9184.83212735385 txn/s, submitted: 9447.574514668268 txn/s, failed submission: 0.4734097068728253 txn/s, expired: 262.74238731441807 txn/s, latency: 2698.9860689066677 ms, (p50: 2500 ms, p90: 3300 ms, p99: 4500 ms), latency samples: 3492260
two traffics test : committed: 99.99309776640919 txn/s, latency: 2008.7725 ms, (p50: 2000 ms, p90: 2200 ms, p99: 3600 ms), latency samples: 2000
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.242, avg: 0.214", "QsPosToProposal: max: 1.319, avg: 0.483", "ConsensusProposalToOrdered: max: 0.321, avg: 0.294", "ConsensusOrderedToCommit: max: 0.398, avg: 0.381", "ConsensusProposalToCommit: max: 0.687, avg: 0.675"]
Max round gap was 1 [limit 4] at version 1878985. Max no progress secs was 5.716865 [limit 15] at version 1878985.
Test Ok

Contributor

✅ Forge suite framework_upgrade success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b (PR)
Upgrade the nodes to version: 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1088.833485545872 txn/s, submitted: 1091.094821964451 txn/s, failed submission: 2.2613364185791736 txn/s, expired: 2.2613364185791736 txn/s, latency: 2794.5085254413293 ms, (p50: 2100 ms, p90: 4800 ms, p99: 11700 ms), latency samples: 96300
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1069.9185066790722 txn/s, submitted: 1072.9942477660345 txn/s, failed submission: 3.075741086962425 txn/s, expired: 3.075741086962425 txn/s, latency: 2793.4022997946613 ms, (p50: 1800 ms, p90: 5100 ms, p99: 12300 ms), latency samples: 97400
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b passed
Upgrade the remaining nodes to version: 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1027.1186052779076 txn/s, submitted: 1029.420013759373 txn/s, failed submission: 2.301408481465175 txn/s, expired: 2.301408481465175 txn/s, latency: 3430.7060049294196 ms, (p50: 2300 ms, p90: 7100 ms, p99: 13900 ms), latency samples: 89260
Test Ok

Contributor

✅ Forge suite compat success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b (PR)
1. Check liveness of validators at old version: 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5
compatibility::simple-validator-upgrade::liveness-check : committed: 7824.11180576748 txn/s, latency: 3917.3402840662407 ms, (p50: 3000 ms, p90: 6300 ms, p99: 19800 ms), latency samples: 322460
2. Upgrading first Validator to new version: 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 5970.024054329734 txn/s, latency: 4489.207680305131 ms, (p50: 4200 ms, p90: 7000 ms, p99: 7200 ms), latency samples: 115360
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6358.553182920637 txn/s, latency: 4857.860828435713 ms, (p50: 4600 ms, p90: 5500 ms, p99: 8600 ms), latency samples: 248420
3. Upgrading rest of first batch to new version: 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 6446.631209023102 txn/s, latency: 4026.001443990665 ms, (p50: 4400 ms, p90: 5300 ms, p99: 5400 ms), latency samples: 137120
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6668.795325900628 txn/s, latency: 4584.8529891074895 ms, (p50: 4400 ms, p90: 7000 ms, p99: 7500 ms), latency samples: 236860
4. upgrading second batch to new version: 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 10489.085220205263 txn/s, latency: 2664.0522729647823 ms, (p50: 2700 ms, p90: 3700 ms, p99: 4200 ms), latency samples: 191380
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 10493.59393535521 txn/s, latency: 3416.588983394087 ms, (p50: 3400 ms, p90: 4300 ms, p99: 5300 ms), latency samples: 345660
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b passed
Test Ok

@igor-aptos igor-aptos merged commit 46be663 into main Jul 23, 2024
90 checks passed
@igor-aptos igor-aptos deleted the igor/use_case_visibility branch July 23, 2024 19:12