e2e counters split out by top usecases #14031
Conversation
Force-pushed from bfae762 to 91533a3 (Compare)
beautiful
types/src/transaction/use_case.rs
Outdated
if module_id.address().is_special() {
    Platform
} else {
    // n.b. Generics ignored.
hmm.. fix my bug by deleting this comment, thank you..
let mut total = HashMap::new();
for group in &self.recent {
    for (use_case, count) in group {
        if use_case != &UseCaseKey::Platform && use_case != &UseCaseKey::Others {
            *total.entry(use_case.clone()).or_insert(0) += count;
        }
    }
}
Would tracking a global total HashMap<> avoid traversing the whole queue every time (at the cost of some space)?
oh, you mean a total on top of the individual ones for the window, and then instead of removing one - subtracting it?
we are doing this once per block, so not critical I think
it's potentially a little bit more efficient (like 40x - the size of the window), but if anything goes wrong with doing -= count, we might have wrong values forever.
still probably worth doing :)
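A minimal sketch of the running-total idea being discussed, assuming a window of per-block counts; the type, field names, and the use of String keys are illustrative and not the PR's actual code:

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical tracker: keep a running `total` alongside the window, adding a
// block's counts when it enters the window and subtracting them when the oldest
// block is evicted, so computing the top use cases never re-walks the queue.
struct UseCaseTracker {
    window_size: usize,
    recent: VecDeque<HashMap<String, u64>>, // per-block use-case counts
    total: HashMap<String, u64>,            // running sum over the window
}

impl UseCaseTracker {
    fn add_block(&mut self, block_counts: HashMap<String, u64>) {
        for (use_case, count) in &block_counts {
            *self.total.entry(use_case.clone()).or_insert(0) += count;
        }
        self.recent.push_back(block_counts);

        if self.recent.len() > self.window_size {
            if let Some(evicted) = self.recent.pop_front() {
                for (use_case, count) in evicted {
                    if let Some(t) = self.total.get_mut(&use_case) {
                        // The `-= count` step mentioned above: a bug here would
                        // skew the totals permanently, which is the risk called
                        // out in this thread.
                        *t = t.saturating_sub(count);
                    }
                }
            }
        }
    }
}
```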
it's potentially a little bit more efficient (like 40x
like a cryptographer talking 😀
let mut sorted = total.into_iter().collect::<Vec<_>>();
sorted.sort_by_key(|(_, count)| *count);

for _ in 0..self.num_top_to_track {
    if let Some((use_case, _)) = sorted.pop() {
        result.insert(use_case);
    }
}

result
probably doesn't matter, but perfect use case for a BinaryHeap<(usize, UseCaseKey)>
😂
you mean to use a heap instead of total, or just instead of sorted?
if instead of total, I'm not sure how to update the count values for a use case.
if instead of sorting - does Rust have a way to convert a vector to a heap in O(n)? If inserting from an iterator, the complexity will be the same.
I mean sorting.
true, it's the same big O, but on average it must be faster than completely sorting them, right?
https://doc.rust-lang.org/std/collections/struct.BinaryHeap.html#time-complexity-3
but again, this probably matters too little even for me
a heap can be created in O(n) worst case, but I don't see Rust providing such an API? that would be nice to use :)
I think even pushing one by one is probably meaningfully faster -- as it grows most likely a new item doesn't need to climb many levels of the heap -- that's why they estimate ~O(1) for pushing as well
that's the average case; reverse-sorted input would be O(n log n). but yeah, I can switch to a heap, unfortunate that I cannot get O(n)
actually they do have that optimization..
https://github.com/rust-lang/rust/blob/fcc325f1bc477975e2ce5ba534fe4c77ff8a8536/library/alloc/src/collections/binary_heap/mod.rs#L1809
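For reference, a hedged sketch of how that could look here: `BinaryHeap`'s `From<Vec<T>>` impl heapifies in place in O(len), and popping the top n entries is then O(n log len). The function and the use of String in place of UseCaseKey are illustrative, not the PR's actual code:

```rust
use std::collections::{BinaryHeap, HashMap};

// Hypothetical helper: build a max-heap of (count, use_case) in O(len) via
// From<Vec<_>>, then pop the n largest counts.
fn top_use_cases(total: HashMap<String, u64>, n: usize) -> Vec<String> {
    let items: Vec<(u64, String)> = total
        .into_iter()
        .map(|(use_case, count)| (count, use_case))
        .collect();
    let mut heap = BinaryHeap::from(items); // O(len) heapify

    let mut result = Vec::with_capacity(n);
    for _ in 0..n {
        if let Some((_, use_case)) = heap.pop() {
            result.push(use_case);
        }
    }
    result
}
```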
Force-pushed from 254da3c to 54ee359 (Compare)
mempool/src/core_mempool/mempool.rs
Outdated
counters::TXN_E2E_USE_CASE_COMMIT_LATENCY
    .with_label_values(&[
        &use_case_label,
This is going to blow up the number of counters, no?
argh.
@sitalkedia - at any given block, we are exporting at most usecase_stats_num_top_to_track = 5 distinct values (the top is counted over 40 blocks, to be more stable),
so at any individual point in time it should be fine.
over time, these might change. From the prometheus side, I don't think that is a problem - each forge test is a new set of counters.
But the histogram will keep exporting unchanged values too (and using memory for them). Do you know of a way to only keep track of new values and export only those? Otherwise, I'll need to change this to contract_top1 / contract_top2 / ...
Yeah, my concern is the number of dimensions being tracked over time. Not a prometheus expert, so I can't say for sure if it's a problem, but we need to understand the implications of this before landing this change.
as is, it's also a problem that metrics that were created but stopped being updated continue being exported, since the count is the count from the start of the process.
I'll change to contract_top1 / ..., and log the contract_top1 <-> address map, as I don't have a better suggestion
I doubt our current prometheus db can support this churn. We are talking about 5 * 4 blocks = 20 unique labels per second per peer. And if we end up getting unique labels for minutes, that will definitely overwhelm the db.
Forge is short-lived; when the node dies, its labels go away with it. For mainnet, the validator will keep exporting the metric as long as it's alive, even if you record it only once.
cc: @geekflyer
btw, have we thought about tracking this via logs or traces? If, for example, we have this behind a feature flag, we could have one of the internal validators emit a log line for each observed TX. In terms of volume I think this should be fine as long as we don't emit this for every single node. The main thing is that with logs/traces we don't have any limits on cardinality or churn.
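A rough sketch of that alternative, assuming a hypothetical config flag and variable names (none of these exist in the PR); only nodes with the flag enabled would emit the per-transaction line:

```rust
// Hypothetical: gate per-transaction logging behind an assumed config flag so
// only a few designated internal nodes produce this volume.
if node_config.enable_per_txn_use_case_logging {
    info!(
        "observed txn: use_case={:?}, commit_latency_ms={}",
        use_case_key,
        commit_latency.as_millis(),
    );
}
```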
On 2. here's a related issue tikv/rust-prometheus#336 (this is the lib we're using at the moment afaik).
btw, just to add: I think our TSDB can possibly support this much cardinality/churn (since this is just a single metric), but the issue is the unboundedness of cardinality due to non-expiring metrics in the process registry, which amplifies the problem by a lot as time goes by.
I think we've been doing similar things for the module+function already?
here we kind of have to have it for all, because only the PFN that a transaction is submitted to can compute its latency correctly.
But yeah, we have 2 fullnodes logging detailed counters - but those are keyed on module name, so republishing a contract under a different address shows up in the same counter. Not sure if cardinalities there are causing any issues, or if we need to revisit those counters - https://aptoslabs.grafana.net/goto/fxPdCAXSR?orgId=1
For here, I've basically changed the cardinality to a fixed set by using entry_user_top_1 ... entry_user_top_5, which should be enough. A minimal sketch of the idea follows below.
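A hedged sketch of the fixed-cardinality labeling described above; the function and label strings are illustrative, not necessarily the PR's exact names:

```rust
// Hypothetical helper: map a ranked top use case to one of a fixed set of label
// values so the metric's cardinality stays bounded; the rank -> contract address
// mapping is logged separately so dashboards remain interpretable.
fn use_case_label(rank_in_top: Option<usize>, is_platform: bool) -> String {
    if is_platform {
        "platform".to_string()
    } else if let Some(rank) = rank_in_top {
        format!("entry_user_top_{}", rank + 1) // entry_user_top_1 ..= entry_user_top_5
    } else {
        "others".to_string()
    }
}
```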
Force-pushed from 54ee359 to e8fcf23 (Compare)
@sitalkedia updated
minor nit around possible frequent logging.
for i in 0..self.num_top_to_track {
    if let Some(UseCaseByCount { use_case, count }) = max_heap.pop() {
        info!(
            "Use cases: top {} is {:?} with {} occurences",
Is this going to be logging too frequently to flood the logs?
5 times per block, so 20 per second.
is that too frequent? I can just merge it into a single line per block, that's probably fine?
That's probably fine, but a single line per block is better maybe?
updated
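A hedged sketch of what the single-line-per-block logging could look like, reusing the names from the snippet above (`max_heap`, `UseCaseByCount`, `num_top_to_track`); it is a fragment for illustration, not necessarily the PR's final code:

```rust
// Gather all top use cases first, then emit one log line per block instead of
// one line per tracked use case.
let mut top_entries = Vec::with_capacity(self.num_top_to_track);
for _ in 0..self.num_top_to_track {
    if let Some(UseCaseByCount { use_case, count }) = max_heap.pop() {
        top_entries.push(format!("{:?}: {}", use_case, count));
    }
}
info!("Use cases: top {} are [{}]", top_entries.len(), top_entries.join(", "));
```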
Force-pushed from e8fcf23 to 799c3f1 (Compare)
✅ Forge suite results (details truncated in this view)
Description
Track the frequency of use cases in recent blocks (or, during state sync, recent batches), and then report latency counters separately for the top use cases, as well as for framework and others (i.e. scripts).
We can then use this for monitoring, gas estimation, and more.