Skip to content
This repository has been archived by the owner on Oct 17, 2022. It is now read-only.

[metrics] Makes all channels metered #682

Merged
merged 7 commits into from
Aug 5, 2022

Conversation

huitseeker
Copy link
Contributor

This is a continuation of #637. Previously, we were logging the capacity of most (but not all) channels in the primary on a periodic basis.

This PR introduces a new primitive, types::metered_channel, which wraps a tokio::mpsc::{Sender, Receiver} and a prometheus::IntGauge in a way distantly inspired by diem::common::channel.

Then it uses it in all the top-level channels involved in the primary, and exposes those metrics through the registry of the primary.

Closes #655

@huitseeker huitseeker requested a review from bmwill as a code owner August 4, 2022 04:23
@huitseeker huitseeker force-pushed the fix_metrics branch 3 times, most recently from 1c702a7 to b46e29c Compare August 4, 2022 05:01
Copy link
Contributor

@akichidis akichidis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @huitseeker !! Can't wait to see this in prod. Left a few comments, let me know your thoughts

primary_endpoint_metrics: Some(primary_endpoint_metrics),
network_metrics: Some(network_metrics),
}
}

#[derive(Clone)]
pub struct PrimaryChannelMetrics {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing wrong here, but I would probably look into the approach of introducing a single gauge metric (ex the channel_capacity) that would pass the channel name as a label (ex channel). Cardinality shouldn't be a problem as we don't have that many different values.

That would give us a couple of benefits:

  • easily plot all the channel capacities by just doing in grafana something like sum by(channel) channel_capacity whereas now we have to explicitly go and all of those one by one.
  • having less dedicated metrics to maintain in code

again nothing wrong with this approach

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the many metrics to maintain in code: they actually add friction to the use of channels, which we have too many of already.
Moreover, note two important things:

  1. we are not metering capacity, but occupancy,
  2. at the moment, our metrics all have the same maximum capacity. I expect that like in any Tokio application, we will shortly make these capacities diverge. As a consequence, the numbers between two metrics would no longer be comparable.

primary/src/metrics.rs Outdated Show resolved Hide resolved
node/src/lib.rs Outdated
Comment on lines 137 to 143
let new_certificates_counter = IntGauge::new(
"tx_new_certificates",
"occupancy of the channel from the `primary::Core` to the `Consensus`",
)
.unwrap();
let (tx_new_certificates, rx_new_certificates) =
types::metered_channel::channel(Self::CHANNEL_CAPACITY, &new_certificates_counter);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the trouble that this is giving us. Taking a step back I'll probably raise the question "who owns that channel"? . Is it really the primary component it self? It sounds like it's kind of between the interface of the primary & consensus.

To be honest I would probably avoid doing the hack here (kudos though for it!) and introduce this channel metric under the ConsensusMetrics.

Since you made the (right) decision to:

  1. Pass the guage directly to the metered channels constructor
  2. Use independent guages for each channel

then that allow us to still define independent metrics for channels in other places. I know that you probably want to have everything under the same struct (PrimaryChannelMetrics) but the cost/complexity of the hack tells me we shouldn't probably go that far.

After all as I said it seems to me that the primary component doesn't really own that channel - it just publishes to it - so we might want to register a metric from somewhere else.

That being said, if you decide to go forward with this suggestion, then two options:

  1. Just register directly new_certificates_counter to the registry that you already have available here
  2. Define the metric under the ConsensusMetrics. All we need to do then is just extract the initialisation part out of the spawn_consensus and do it as part of the spawn_primary method. After all consensus will never run on its own without primary. It will be even more convenient as we already do that when we use external consensus with our DAG

Copy link
Contributor Author

@huitseeker huitseeker Aug 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the issue: even if I introduce it as consensus metrics, we cannot make this part of the initialization of consensus, because we would need to:

  • create the channel in the node, with its associated metric
  • pass the sending end to the Consensus, and probably re-register the associated metric there,
  • and then pass the receiving end of the channel to the primary.

This is the exact symmetric of what we're doing now:

  • create the channel in the node initialization, with its associated metric
  • pass the sending end to the consensus,
  • and then pass the receiving end of the channel to the primary, re-register the associated metric there,

Now I'm not denying that we will absolutely, 💯 need ConsensusMetrics. It's clear we do. In particular we want to meter the channel between Consensus and Executor. No doubt about that. But the goal of this PR is to show a direction, and checkpoint at a reasonable point of progress.

I'm happy to let you have a go at it in a follow-up PR, if you have an idea of how to clean this up.

primary/src/primary.rs Outdated Show resolved Hide resolved
test_utils/src/lib.rs Show resolved Hide resolved
types/src/metered_channel.rs Show resolved Hide resolved
types/src/metered_channel.rs Show resolved Hide resolved
types/src/metered_channel.rs Show resolved Hide resolved
types/src/metered_channel.rs Outdated Show resolved Hide resolved
types/src/metered_channel.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@akichidis akichidis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much @huitseeker for doing this, very valuable! 💯

@huitseeker huitseeker force-pushed the fix_metrics branch 2 times, most recently from d51db59 to 87f5f6a Compare August 5, 2022 17:05
@huitseeker huitseeker enabled auto-merge (rebase) August 5, 2022 17:11
@huitseeker huitseeker disabled auto-merge August 5, 2022 17:17
@huitseeker huitseeker enabled auto-merge (squash) August 5, 2022 17:17
@huitseeker huitseeker disabled auto-merge August 5, 2022 17:18
@huitseeker huitseeker enabled auto-merge (squash) August 5, 2022 17:18
@huitseeker huitseeker merged commit 0a92aa9 into MystenLabs:main Aug 5, 2022
huitseeker added a commit that referenced this pull request Aug 5, 2022
* upgrade chrono to 0.4.20

* fix: fix underflow in debug message

* fix: rename worker_network_concourrent_tasks to worker_network_available_tasks

* feat: add a metered_channel that wraps a tokio::mpsc::{Sender, Receiver} with an IntGauge

* feat: use metered_channel everywhere in the Primary

* fix: move the tx_committed_certificates to be initialized in the node bootstrap

Adjusts the consensus registry on the back end of that.

* fix: add tests for reserve, try_reserve, empty & closed channel behavior
huitseeker added a commit that referenced this pull request Aug 5, 2022
* upgrade chrono to 0.4.20

* fix: fix underflow in debug message

* fix: rename worker_network_concourrent_tasks to worker_network_available_tasks

* feat: add a metered_channel that wraps a tokio::mpsc::{Sender, Receiver} with an IntGauge

* feat: use metered_channel everywhere in the Primary

* fix: move the tx_committed_certificates to be initialized in the node bootstrap

Adjusts the consensus registry on the back end of that.

* fix: add tests for reserve, try_reserve, empty & closed channel behavior
huitseeker added a commit that referenced this pull request Aug 5, 2022
* upgrade chrono to 0.4.20

* fix: fix underflow in debug message

* fix: rename worker_network_concourrent_tasks to worker_network_available_tasks

* feat: add a metered_channel that wraps a tokio::mpsc::{Sender, Receiver} with an IntGauge

* feat: use metered_channel everywhere in the Primary

* fix: move the tx_committed_certificates to be initialized in the node bootstrap

Adjusts the consensus registry on the back end of that.

* fix: add tests for reserve, try_reserve, empty & closed channel behavior
huitseeker added a commit to huitseeker/narwhal that referenced this pull request Aug 7, 2022
This is an extension of MystenLabs#682 to the worker
huitseeker added a commit to huitseeker/narwhal that referenced this pull request Aug 8, 2022
This is an extension of MystenLabs#682 to the worker
huitseeker added a commit to huitseeker/narwhal that referenced this pull request Aug 8, 2022
This is an extension of MystenLabs#682 to the worker
huitseeker added a commit that referenced this pull request Aug 8, 2022
This is an extension of #682 to the worker
huitseeker added a commit to huitseeker/narwhal that referenced this pull request Aug 8, 2022
huitseeker added a commit that referenced this pull request Aug 8, 2022
This is an extension of #682 to the worker
huitseeker added a commit that referenced this pull request Aug 12, 2022
* upgrade chrono to 0.4.20

* fix: fix underflow in debug message

* fix: rename worker_network_concourrent_tasks to worker_network_available_tasks

* feat: add a metered_channel that wraps a tokio::mpsc::{Sender, Receiver} with an IntGauge

* feat: use metered_channel everywhere in the Primary

* fix: move the tx_committed_certificates to be initialized in the node bootstrap

Adjusts the consensus registry on the back end of that.

* fix: add tests for reserve, try_reserve, empty & closed channel behavior
huitseeker added a commit that referenced this pull request Aug 12, 2022
This is an extension of #682 to the worker
mwtian pushed a commit to mwtian/sui that referenced this pull request Sep 30, 2022
* upgrade chrono to 0.4.20

* fix: fix underflow in debug message

* fix: rename worker_network_concourrent_tasks to worker_network_available_tasks

* feat: add a metered_channel that wraps a tokio::mpsc::{Sender, Receiver} with an IntGauge

* feat: use metered_channel everywhere in the Primary

* fix: move the tx_committed_certificates to be initialized in the node bootstrap

Adjusts the consensus registry on the back end of that.

* fix: add tests for reserve, try_reserve, empty & closed channel behavior
mwtian pushed a commit to mwtian/sui that referenced this pull request Sep 30, 2022
@huitseeker huitseeker deleted the fix_metrics branch January 21, 2023 13:35
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Track channel utilization using metrics
4 participants