
e2e counters split out by top usecases #14031

Merged: 2 commits merged into main from igor/use_case_visibility on Jul 23, 2024

Conversation

igor-aptos (Contributor)

Description

Track the frequency of use cases in recent blocks (or, during state sync, recent batches), and then report latency counters separately for the top use cases, as well as for framework and others (i.e. scripts).

We can then use this for monitoring, gas estimation, and more.
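
For readers skimming the diff, here is a minimal sketch of the shape of this mechanism: classify each transaction into a use case key, keep per-block counts over a sliding window, and report the top few separately. Type and field names below are illustrative, not the PR's exact definitions.

```rust
use std::collections::{HashMap, VecDeque};

// Illustrative key, mirroring the variants referenced in the review below:
// framework ("Platform"), a user contract keyed by address, or everything else.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum UseCaseKey {
    Platform,
    ContractAddress([u8; 32]),
    Others,
}

// Sliding window of per-block use case counts; the top-N set is recomputed
// over the whole window so the reported labels stay stable block to block.
struct UseCaseHistory {
    num_blocks_to_track: usize,
    // How many of the most frequent use cases get their own label.
    num_top_to_track: usize,
    recent: VecDeque<HashMap<UseCaseKey, usize>>,
}

impl UseCaseHistory {
    fn new(num_blocks_to_track: usize, num_top_to_track: usize) -> Self {
        Self {
            num_blocks_to_track,
            num_top_to_track,
            recent: VecDeque::with_capacity(num_blocks_to_track),
        }
    }

    // Record the counts observed in the newest block and evict the oldest one.
    fn add_block(&mut self, counts: HashMap<UseCaseKey, usize>) {
        self.recent.push_back(counts);
        while self.recent.len() > self.num_blocks_to_track {
            self.recent.pop_front();
        }
    }
}
```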

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Other (specify)

How Has This Been Tested?

Key Areas to Review

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation


trunk-io bot commented Jul 17, 2024

⏱️ 6h 52m total CI duration on this PR
Job | Cumulative Duration | Recent Runs
test-fuzzers 3h 31m 🟩🟥
forge-compat-test / forge 35m 🟩
execution-performance / single-node-performance 21m 🟩
forge-framework-upgrade-test / forge 17m 🟩
forge-e2e-test / forge 17m 🟩
rust-move-tests 15m 🟩
rust-cargo-deny 10m 🟩🟩🟩🟩 (+2 more)
rust-move-tests 10m 🟩
general-lints 10m 🟩🟩🟩🟩 (+2 more)
rust-move-tests 10m 🟩
rust-move-tests 10m 🟩
rust-move-tests 10m 🟩
check-dynamic-deps 7m 🟩🟩🟩🟩🟩 (+2 more)
rust-doc-tests 6m 🟩
test-target-determinator 5m 🟩
execution-performance / test-target-determinator 4m 🟩
check 4m 🟩
semgrep/ci 3m 🟩🟩🟩🟩🟩 (+2 more)
adhoc-forge-test / forge 3m 🟥🟥
file_change_determinator 1m 🟩🟩🟩🟩🟩 (+2 more)
file_change_determinator 1m 🟩🟩🟩🟩🟩 (+2 more)
rust-move-tests 58s
permission-check 23s 🟩🟩🟩🟩🟩 (+2 more)
permission-check 22s 🟩🟩🟩🟩🟩 (+2 more)
permission-check 21s 🟩🟩🟩🟩🟩 (+2 more)
permission-check 21s 🟩🟩🟩🟩🟩 (+2 more)
file_change_determinator 16s 🟩
determine-forge-run-metadata 5s 🟩🟩
Backport PR 3s 🟥
permission-check 2s 🟩
determine-docker-build-metadata 2s 🟩
permission-check 2s 🟩
rust-move-tests 1s

🚨 3 jobs on the last run were significantly faster/slower than expected

Job | Duration | vs 7d avg | Delta
forge-compat-test / forge | 35m | 14m | +144%
forge-framework-upgrade-test / forge | 17m | 11m | +58%
test-fuzzers | 56m | 38m | +48%


@igor-aptos igor-aptos force-pushed the igor/use_case_visibility branch from bfae762 to 91533a3 Compare July 17, 2024 22:45
@msmouse msmouse left a comment (Contributor)

beautiful

if module_id.address().is_special() {
    Platform
} else {
    // n.b. Generics ignored.
Contributor

hmm.. fix my bug by deleting this comment, thank you.. :trollface:

Comment on lines 39 to 55
let mut total = HashMap::new();
for group in &self.recent {
    for (use_case, count) in group {
        if use_case != &UseCaseKey::Platform && use_case != &UseCaseKey::Others {
            *total.entry(use_case.clone()).or_insert(0) += count;
        }
    }
}
Contributor

Tracking a global `total: HashMap<>` would avoid traversing the whole queue every time (at the cost of some space)?

Contributor Author

Oh, you mean a total on top of the individual ones for the window, and then instead of removing one, subtracting it?

We are doing this once per block, so not critical I think.

It's potentially a little bit more efficient (like 40x - the size of the window), but if anything goes wrong with doing -= count, we might have wrong values forever.

still probably worth doing :)

Contributor

> it's potentially a lit bit more efficient (like 40x

like a cryptographer talking 😀
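
For concreteness, here is a sketch of the incremental variant being suggested: maintain a running total next to the window, add counts when a block enters and subtract them when one is evicted. Illustrative code, generic over the key type, not the PR's implementation.

```rust
use std::collections::{HashMap, VecDeque};
use std::hash::Hash;

// Keep a running `total` next to the window of per-block counts, adding when
// a block enters and subtracting when one is evicted, instead of re-summing
// the whole window on every query.
struct WindowedTotals<K: Eq + Hash + Clone> {
    window: usize,
    recent: VecDeque<HashMap<K, usize>>,
    total: HashMap<K, usize>,
}

impl<K: Eq + Hash + Clone> WindowedTotals<K> {
    fn new(window: usize) -> Self {
        Self {
            window,
            recent: VecDeque::with_capacity(window),
            total: HashMap::new(),
        }
    }

    fn add_block(&mut self, counts: HashMap<K, usize>) {
        for (key, count) in &counts {
            *self.total.entry(key.clone()).or_insert(0) += *count;
        }
        self.recent.push_back(counts);
        if self.recent.len() > self.window {
            for (key, count) in self.recent.pop_front().unwrap() {
                // The trade-off raised above: if this subtraction ever drifts,
                // the totals stay wrong for the rest of the process lifetime.
                let remaining = self
                    .total
                    .get_mut(&key)
                    .map(|t| {
                        *t = t.saturating_sub(count);
                        *t
                    })
                    .unwrap_or(0);
                if remaining == 0 {
                    self.total.remove(&key);
                }
            }
        }
    }
}
```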

Comment on lines 52 to 87
let mut sorted = total.into_iter().collect::<Vec<_>>();
sorted.sort_by_key(|(_, count)| *count);

for _ in 0..self.num_top_to_track {
    if let Some((use_case, _)) = sorted.pop() {
        result.insert(use_case);
    }
}

result
Contributor

probably doesn't matter, but perfect use case for a BinaryHeap<(usize, UseCaseKey)> 😂

Contributor Author

You mean to use a heap instead of total, or just instead of sorted?

If instead of total, I'm not sure how to update individual values per use case.

If instead of sorting - does Rust have a way to convert a vector to a heap in O(n)? If inserting from an iterator, the complexity will be the same.

Contributor

I mean sorting.
true it's the same O, but on average it must be faster than completely sorting them, right?
https://doc.rust-lang.org/std/collections/struct.BinaryHeap.html#time-complexity-3

but again, this probably matters too little even for me

Contributor Author (@igor-aptos, Jul 17, 2024)

heap can be created in O(n), worst case, but I don't see rust providing such API? that would be nice to use :)

Contributor

I think even pushing one by one is probably meaningfully faster -- as it grows most likely a new item doesn't need to climb many levels of the heap -- that's why they estimate ~O(1) for pushing as well

Contributor Author

That's the average case. Reverse-sorted input would be O(n log n). But yeah, I can switch to a heap; unfortunate I cannot get O(n).
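
As a side note, a small sketch of the heap-based top-k selection discussed here. Rust's std does expose an O(n) heapify through `BinaryHeap`'s `From<Vec<T>>` impl, so the build step does not have to go through repeated pushes; the function below is illustrative, not the PR's code.

```rust
use std::collections::BinaryHeap;

// Pick the k keys with the highest counts: heapify in O(n), then pop k times
// (O(k log n)) instead of fully sorting in O(n log n).
// `K: Ord` is only needed to order ties within equal counts.
fn top_k_by_count<K: Ord>(counts: Vec<(K, usize)>, k: usize) -> Vec<K> {
    // BinaryHeap::from(Vec<T>) heapifies in place in O(n).
    let mut heap: BinaryHeap<(usize, K)> = BinaryHeap::from(
        counts
            .into_iter()
            .map(|(key, count)| (count, key))
            .collect::<Vec<_>>(),
    );
    let mut top = Vec::with_capacity(k);
    while top.len() < k {
        match heap.pop() {
            Some((_count, key)) => top.push(key),
            None => break,
        }
    }
    top
}
```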

@igor-aptos igor-aptos force-pushed the igor/use_case_visibility branch 2 times, most recently from 254da3c to 54ee359 Compare July 18, 2024 22:02

counters::TXN_E2E_USE_CASE_COMMIT_LATENCY
    .with_label_values(&[
        &use_case_label,
Contributor

This is going to blow up the number of counters no?

Contributor Author

argh.

@sitalkedia - at any given block, we are exporting at most usecase_stats_num_top_to_track = 5 distinct values (the top is computed over 40 blocks, to be more stable).

So at any individual point in time it should be fine.

Over time, these might change. From the prometheus side, I don't think that is a problem - each forge test gets a new set of counters.

But the histogram will keep exporting unchanged values too (and using memory for them). Do you know of a way to only keep track of new values and export only those? Otherwise, I'll need to change this to contract_top1 / contract_top2 / ...

Contributor

Yeah, my concern is the number of dimensions being tracked over time. I'm not a prometheus expert, so I can't say for sure if it's a problem, but we need to understand the implications of this before landing this change.

Contributor Author

As is, it's also a problem that metrics that were created but stopped being updated continue being exported - the count is cumulative from the start of the process.

I'll change it to contract_top1 / ..., and log the contract_top1 <-> address mapping, as I don't have a better suggestion.

Contributor

I doubt our current prometheus db can support this churn. We are talking about 5 labels * 4 blocks per second = 20 unique labels per second per peer. And if we end up getting unique labels for minutes, that will definitely overwhelm the db.

Forge is short-lived: when the node dies, its labels go away with it. For mainnet, the validator will keep exporting the metric as long as it's alive, even if you record it only once.

cc: @geekflyer

Contributor (@geekflyer, Jul 19, 2024)

btw, have we thought about tracking this via logs or traces? If, for example, we had this behind a feature flag, we could have one of the internal validators emit a log line for each observed TX. In terms of volume I think this should be fine as long as we don't emit this for every single node. The main thing is that with logs/traces we don't have any limits on cardinality or churn.

Contributor

On 2., here's a related issue: tikv/rust-prometheus#336 (this is the lib we're using at the moment afaik).

Contributor (@geekflyer, Jul 19, 2024)

btw, just to add: I think our TSDB can possibly support this much cardinality/churn (since this is just a single metric), but the issue is the unboundedness of cardinality due to non-expiring metrics in the process registry, which amplifies the problem by a lot as time goes by.

Contributor

I think we've been doing similar things for the module+function already?

Contributor Author

Here we kind of have to have it for all of them, because only the PFN a transaction is submitted to can compute its latency correctly.

But yeah, we have 2 fullnodes logging detailed counters - but those are keyed on module name, so republishing a contract under a different address shows up in the same counter. Not sure if the cardinalities there are causing any issues, or whether we need to revisit those counters - https://aptoslabs.grafana.net/goto/fxPdCAXSR?orgId=1

For here, I've basically changed the cardinality to a fixed set by using entry_user_top_1 ... entry_user_top_5, which should be enough.
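
To illustrate the fixed-label approach, here is a rough sketch using the `prometheus` and `once_cell` crates. The metric name, label values, and helper function are assumptions for illustration, not the PR's exact counter definitions.

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram_vec, HistogramVec};

// Histogram with a small, fixed label set, so cardinality stays bounded no
// matter how many distinct contracts appear over the lifetime of the process.
static TXN_E2E_USE_CASE_COMMIT_LATENCY: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "txn_e2e_use_case_commit_latency",
        "End-to-end commit latency, split by use case bucket",
        &["use_case"]
    )
    .unwrap()
});

// Map a transaction's rank among the tracked top use cases to a fixed label:
// "entry_user_top_1" ... "entry_user_top_5", or "others" for everything else.
fn use_case_label(rank_among_top: Option<usize>) -> String {
    match rank_among_top {
        Some(rank) => format!("entry_user_top_{}", rank + 1),
        None => "others".to_string(),
    }
}

fn observe_commit_latency(rank_among_top: Option<usize>, latency_secs: f64) {
    let label = use_case_label(rank_among_top);
    TXN_E2E_USE_CASE_COMMIT_LATENCY
        .with_label_values(&[label.as_str()])
        .observe(latency_secs);
}
```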

@igor-aptos igor-aptos force-pushed the igor/use_case_visibility branch from 54ee359 to e8fcf23 Compare July 22, 2024 21:07
@igor-aptos igor-aptos requested a review from sitalkedia July 22, 2024 21:07
@igor-aptos (Contributor Author)

@sitalkedia updated

@sitalkedia sitalkedia left a comment (Contributor)

Minor nit around possibly frequent logging.

for i in 0..self.num_top_to_track {
    if let Some(UseCaseByCount { use_case, count }) = max_heap.pop() {
        info!(
            "Use cases: top {} is {:?} with {} occurences",
Contributor

Is this going to log too frequently and flood the logs?

Contributor Author

5 times per block, so 20 per second.

Is that too frequent? I can just merge it into a single line per block, that's probably fine?

Contributor

That's probably fine, but a single line per block is maybe better?

Contributor Author

updated
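
For reference, the merged per-block line could look roughly like the sketch below (using the standard `log` crate's `info!`; the PR uses its own logging macros, and the exact message format may differ).

```rust
use log::info;

// One log line per block summarizing the tracked top use cases, instead of
// one line per use case (which would be ~5 per block, ~20 per second).
fn log_top_use_cases(top: &[(String, usize)]) {
    let summary = top
        .iter()
        .enumerate()
        .map(|(i, (use_case, count))| {
            format!("top {}: {} ({} occurrences)", i + 1, use_case, count)
        })
        .collect::<Vec<_>>()
        .join(", ");
    info!("Use cases in block: {}", summary);
}
```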

@igor-aptos igor-aptos force-pushed the igor/use_case_visibility branch from e8fcf23 to 799c3f1 Compare July 23, 2024 00:19
@igor-aptos igor-aptos enabled auto-merge (squash) July 23, 2024 18:22

Contributor

✅ Forge suite realistic_env_max_load success on 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b

two traffics test: inner traffic : committed: 9184.83212735385 txn/s, submitted: 9447.574514668268 txn/s, failed submission: 0.4734097068728253 txn/s, expired: 262.74238731441807 txn/s, latency: 2698.9860689066677 ms, (p50: 2500 ms, p90: 3300 ms, p99: 4500 ms), latency samples: 3492260
two traffics test : committed: 99.99309776640919 txn/s, latency: 2008.7725 ms, (p50: 2000 ms, p90: 2200 ms, p99: 3600 ms), latency samples: 2000
Latency breakdown for phase 0: ["QsBatchToPos: max: 0.242, avg: 0.214", "QsPosToProposal: max: 1.319, avg: 0.483", "ConsensusProposalToOrdered: max: 0.321, avg: 0.294", "ConsensusOrderedToCommit: max: 0.398, avg: 0.381", "ConsensusProposalToCommit: max: 0.687, avg: 0.675"]
Max round gap was 1 [limit 4] at version 1878985. Max no progress secs was 5.716865 [limit 15] at version 1878985.
Test Ok

Contributor

✅ Forge suite framework_upgrade success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b (PR)
Upgrade the nodes to version: 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1088.833485545872 txn/s, submitted: 1091.094821964451 txn/s, failed submission: 2.2613364185791736 txn/s, expired: 2.2613364185791736 txn/s, latency: 2794.5085254413293 ms, (p50: 2100 ms, p90: 4800 ms, p99: 11700 ms), latency samples: 96300
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1069.9185066790722 txn/s, submitted: 1072.9942477660345 txn/s, failed submission: 3.075741086962425 txn/s, expired: 3.075741086962425 txn/s, latency: 2793.4022997946613 ms, (p50: 1800 ms, p90: 5100 ms, p99: 12300 ms), latency samples: 97400
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b passed
Upgrade the remaining nodes to version: 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1027.1186052779076 txn/s, submitted: 1029.420013759373 txn/s, failed submission: 2.301408481465175 txn/s, expired: 2.301408481465175 txn/s, latency: 3430.7060049294196 ms, (p50: 2300 ms, p90: 7100 ms, p99: 13900 ms), latency samples: 89260
Test Ok

Contributor

✅ Forge suite compat success on 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b

Compatibility test results for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b (PR)
1. Check liveness of validators at old version: 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5
compatibility::simple-validator-upgrade::liveness-check : committed: 7824.11180576748 txn/s, latency: 3917.3402840662407 ms, (p50: 3000 ms, p90: 6300 ms, p99: 19800 ms), latency samples: 322460
2. Upgrading first Validator to new version: 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 5970.024054329734 txn/s, latency: 4489.207680305131 ms, (p50: 4200 ms, p90: 7000 ms, p99: 7200 ms), latency samples: 115360
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 6358.553182920637 txn/s, latency: 4857.860828435713 ms, (p50: 4600 ms, p90: 5500 ms, p99: 8600 ms), latency samples: 248420
3. Upgrading rest of first batch to new version: 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 6446.631209023102 txn/s, latency: 4026.001443990665 ms, (p50: 4400 ms, p90: 5300 ms, p99: 5400 ms), latency samples: 137120
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6668.795325900628 txn/s, latency: 4584.8529891074895 ms, (p50: 4400 ms, p90: 7000 ms, p99: 7500 ms), latency samples: 236860
4. upgrading second batch to new version: 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 10489.085220205263 txn/s, latency: 2664.0522729647823 ms, (p50: 2700 ms, p90: 3700 ms, p99: 4200 ms), latency samples: 191380
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 10493.59393535521 txn/s, latency: 3416.588983394087 ms, (p50: 3400 ms, p90: 4300 ms, p99: 5300 ms), latency samples: 345660
5. check swarm health
Compatibility test for 1c2ee7082d6eff8c811ee25d6f5a7d00860a75d5 ==> 799c3f1c2f7c5292dd7f954ab3b1aa2d063df55b passed
Test Ok

@igor-aptos igor-aptos merged commit 46be663 into main Jul 23, 2024
90 checks passed
@igor-aptos igor-aptos deleted the igor/use_case_visibility branch July 23, 2024 19:12