-
Notifications
You must be signed in to change notification settings - Fork 68
Conversation
1c702a7
to
b46e29c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @huitseeker !! Can't wait to see this in prod. Left a few comments, let me know your thoughts
primary_endpoint_metrics: Some(primary_endpoint_metrics), | ||
network_metrics: Some(network_metrics), | ||
} | ||
} | ||
|
||
#[derive(Clone)] | ||
pub struct PrimaryChannelMetrics { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing wrong here, but I would probably look into the approach of introducing a single gauge metric (ex the channel_capacity
) that would pass the channel name as a label (ex channel
). Cardinality shouldn't be a problem as we don't have that many different values.
That would give us a couple of benefits:
- easily plot all the channel capacities by just doing in grafana something like
sum by(channel) channel_capacity
whereas now we have to explicitly go and all of those one by one. - having less dedicated metrics to maintain in code
again nothing wrong with this approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the many metrics to maintain in code: they actually add friction to the use of channels, which we have too many of already.
Moreover, note two important things:
- we are not metering capacity, but occupancy,
- at the moment, our metrics all have the same maximum capacity. I expect that like in any Tokio application, we will shortly make these capacities diverge. As a consequence, the numbers between two metrics would no longer be comparable.
node/src/lib.rs
Outdated
let new_certificates_counter = IntGauge::new( | ||
"tx_new_certificates", | ||
"occupancy of the channel from the `primary::Core` to the `Consensus`", | ||
) | ||
.unwrap(); | ||
let (tx_new_certificates, rx_new_certificates) = | ||
types::metered_channel::channel(Self::CHANNEL_CAPACITY, &new_certificates_counter); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the trouble that this is giving us. Taking a step back I'll probably raise the question "who owns that channel"? . Is it really the primary component it self? It sounds like it's kind of between the interface of the primary
& consensus
.
To be honest I would probably avoid doing the hack here (kudos though for it!) and introduce this channel metric under the ConsensusMetrics
.
Since you made the (right) decision to:
- Pass the guage directly to the metered channels constructor
- Use independent guages for each channel
then that allow us to still define independent metrics for channels in other places. I know that you probably want to have everything under the same struct (PrimaryChannelMetrics) but the cost/complexity of the hack tells me we shouldn't probably go that far.
After all as I said it seems to me that the primary
component doesn't really own that channel - it just publishes to it - so we might want to register a metric from somewhere else.
That being said, if you decide to go forward with this suggestion, then two options:
- Just register directly
new_certificates_counter
to theregistry
that you already have available here - Define the metric under the
ConsensusMetrics
. All we need to do then is just extract the initialisation part out of the spawn_consensus and do it as part of thespawn_primary
method. After all consensus will never run on its own without primary. It will be even more convenient as we already do that when we use external consensus with our DAG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the issue: even if I introduce it as consensus metrics, we cannot make this part of the initialization of consensus, because we would need to:
- create the channel in the node, with its associated metric
- pass the sending end to the Consensus, and probably re-register the associated metric there,
- and then pass the receiving end of the channel to the primary.
This is the exact symmetric of what we're doing now:
- create the channel in the node initialization, with its associated metric
- pass the sending end to the consensus,
- and then pass the receiving end of the channel to the primary, re-register the associated metric there,
Now I'm not denying that we will absolutely, 💯 need ConsensusMetrics
. It's clear we do. In particular we want to meter the channel between Consensus
and Executor
. No doubt about that. But the goal of this PR is to show a direction, and checkpoint at a reasonable point of progress.
I'm happy to let you have a go at it in a follow-up PR, if you have an idea of how to clean this up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much @huitseeker for doing this, very valuable! 💯
d51db59
to
87f5f6a
Compare
…er} with an IntGauge
… bootstrap Adjusts the consensus registry on the back end of that.
* upgrade chrono to 0.4.20 * fix: fix underflow in debug message * fix: rename worker_network_concourrent_tasks to worker_network_available_tasks * feat: add a metered_channel that wraps a tokio::mpsc::{Sender, Receiver} with an IntGauge * feat: use metered_channel everywhere in the Primary * fix: move the tx_committed_certificates to be initialized in the node bootstrap Adjusts the consensus registry on the back end of that. * fix: add tests for reserve, try_reserve, empty & closed channel behavior
* upgrade chrono to 0.4.20 * fix: fix underflow in debug message * fix: rename worker_network_concourrent_tasks to worker_network_available_tasks * feat: add a metered_channel that wraps a tokio::mpsc::{Sender, Receiver} with an IntGauge * feat: use metered_channel everywhere in the Primary * fix: move the tx_committed_certificates to be initialized in the node bootstrap Adjusts the consensus registry on the back end of that. * fix: add tests for reserve, try_reserve, empty & closed channel behavior
* upgrade chrono to 0.4.20 * fix: fix underflow in debug message * fix: rename worker_network_concourrent_tasks to worker_network_available_tasks * feat: add a metered_channel that wraps a tokio::mpsc::{Sender, Receiver} with an IntGauge * feat: use metered_channel everywhere in the Primary * fix: move the tx_committed_certificates to be initialized in the node bootstrap Adjusts the consensus registry on the back end of that. * fix: add tests for reserve, try_reserve, empty & closed channel behavior
This is an extension of MystenLabs#682 to the worker
This is an extension of MystenLabs#682 to the worker
This is an extension of MystenLabs#682 to the worker
This is an extension of #682 to the worker
This is an extension of MystenLabs#682 to the worker
This is an extension of #682 to the worker
* upgrade chrono to 0.4.20 * fix: fix underflow in debug message * fix: rename worker_network_concourrent_tasks to worker_network_available_tasks * feat: add a metered_channel that wraps a tokio::mpsc::{Sender, Receiver} with an IntGauge * feat: use metered_channel everywhere in the Primary * fix: move the tx_committed_certificates to be initialized in the node bootstrap Adjusts the consensus registry on the back end of that. * fix: add tests for reserve, try_reserve, empty & closed channel behavior
This is an extension of #682 to the worker
* upgrade chrono to 0.4.20 * fix: fix underflow in debug message * fix: rename worker_network_concourrent_tasks to worker_network_available_tasks * feat: add a metered_channel that wraps a tokio::mpsc::{Sender, Receiver} with an IntGauge * feat: use metered_channel everywhere in the Primary * fix: move the tx_committed_certificates to be initialized in the node bootstrap Adjusts the consensus registry on the back end of that. * fix: add tests for reserve, try_reserve, empty & closed channel behavior
This is an extension of MystenLabs/narwhal#682 to the worker
This is a continuation of #637. Previously, we were logging the capacity of most (but not all) channels in the primary on a periodic basis.
This PR introduces a new primitive,
types::metered_channel
, which wraps atokio::mpsc::{Sender, Receiver}
and aprometheus::IntGauge
in a way distantly inspired bydiem::common::channel
.Then it uses it in all the top-level channels involved in the primary, and exposes those metrics through the registry of the primary.
Closes #655