Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Telemetry per node #7463

Merged
merged 277 commits into from
Jan 20, 2021
Merged

Telemetry per node #7463

merged 277 commits into from
Jan 20, 2021

Conversation

cecton
Copy link
Contributor

@cecton cecton commented Oct 29, 2020

polkadot companion: paritytech/polkadot#1948

@github-actions github-actions bot added the A3-in_progress Pull request is in progress. No review needed at this stage. label Oct 29, 2020
Copy link
Member

@bkchr bkchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far.

And it is really good that we got rid off slog :)

client/service/src/config.rs Outdated Show resolved Hide resolved
client/telemetry/src/layer.rs Outdated Show resolved Hide resolved
client/telemetry/src/layer.rs Outdated Show resolved Hide resolved
client/telemetry/src/layer.rs Outdated Show resolved Hide resolved
client/telemetry/src/lib.rs Outdated Show resolved Hide resolved
@cecton cecton force-pushed the cecton-telemetry-without-global branch from b801f26 to ab089f9 Compare October 30, 2020 17:04
Comment on lines 129 to 132
// NOTE: this is a hack to inject the span id into the json
let mut message = format!(r#"{{"id":{},"#, (*self.0).id);
message.push_str(&value[1..]);
(*self.0).json = Some(message)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkchr 😱 😱 😱

I think I should open an issue on tracing to see if they plan to do something to allow passing anything through the key-values of the log. If we had that, I could serialize the json later when I actually have that id.

@cecton cecton mentioned this pull request Nov 5, 2020
5 tasks
@cecton
Copy link
Contributor Author

cecton commented Nov 5, 2020

Linking this here: old-storyai/tracing-wasm#13

TL;DR

Telemetry in the browser needs some update. It works already! But the logs look completely different because it's not using the same library than before.

client/service/src/logging.rs Outdated Show resolved Hide resolved
client/service/src/builder.rs Outdated Show resolved Hide resolved
client/telemetry/src/lib.rs Outdated Show resolved Hide resolved
client/telemetry/src/lib.rs Outdated Show resolved Hide resolved
client/telemetry/src/worker.rs Outdated Show resolved Hide resolved
client/telemetry/src/worker/node_pool.rs Outdated Show resolved Hide resolved
@cecton cecton marked this pull request as ready for review January 15, 2021 15:22
Copy link
Member

@bkchr bkchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for all the hard work. 👍

Some last nitpicks are left and than we are good to merge this IMHO.

for addr in addresses {
if let Some(node) = node_pool.get_mut(&addr) {
node.telemetry_connection_notifier
.push(connection_notifier.clone());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.push(connection_notifier.clone());
.push(connection_notifier);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't! It's in a loop

Comment on lines 177 to 185
let input = if let Some(x) = input {
x
} else {
log::error!(
target: "telemetry",
"The telemetry seems to be polled multiple times simultaneously"
"Unexpected end of stream. This is a bug.",
);
// Returning `Pending` here means that we may never get polled again, but this is
// ok because we're in a situation where something else is actually currently doing
// the polling.
return Poll::Pending;
return;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be an expect, as we hold sender/receiver and this can not happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f462418

Comment on lines 255 to 263
let (id, verbosity, message) = if let Some(x) = input {
x
} else {
log::error!(
target: "telemetry",
"Unexpected end of stream. This is a bug.",
);
return;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f462418

addr,
verbosity,
);
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return;
continue;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😱 Fixed in f6ec36d

Node::new(transport.clone(), addr.clone(), Vec::new(), Vec::new())
});

node.connection_messages.extend(connection_message.clone());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
node.connection_messages.extend(connection_message.clone());
node.connection_messages.extend(connection_message);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't, it's in a loop

let mut node_pool: HashMap<Multiaddr, _> = HashMap::new();

// initialize the telemetry nodes and register new notifiers
async fn process_register(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect 👍

Now just move this and the other function below out of the run function :P

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 24596be

(...like that??)

@@ -16,339 +16,486 @@
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <https://www.gnu.org/licenses/>.

//! Telemetry utilities.
//! To start using this module, please initialize the global logger from `sc-tracing`. This will
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really like to have these docs more a high level description on how the system works.

This means:

  • How does it work with the span
  • How does the communication works with the telemetry worker
  • etc
  • I just imagine some sort of "diagram" explaining it :D

However, I will not make this a hard requirement to merge this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this look better? acf1ec8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More doc bc36136

@cecton cecton marked this pull request as draft January 20, 2021 09:01
@cecton cecton marked this pull request as ready for review January 20, 2021 09:38
@cecton cecton added the B0-silent Changes should not be mentioned in any release notes label Jan 20, 2021
@cecton
Copy link
Contributor Author

cecton commented Jan 20, 2021

bot merge

@ghost
Copy link

ghost commented Jan 20, 2021

Checks failed; merge aborted.

@cecton cecton merged commit 8f43280 into master Jan 20, 2021
@cecton cecton deleted the cecton-telemetry-without-global branch January 20, 2021 11:28
Copy link
Contributor

@dvdplm dvdplm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was half-way through when I noticed the PR was merged. Feel free to open a followup with the suggested changes (and let me know if I should do the second half).

disable_log_color,
})?;
let mut logger = GlobalLoggerBuilder::new(self.log_filters()?);
logger.with_log_reloading(!self.is_log_filter_reloading_disabled()?);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.with_log_reloading(!self.is_log_filter_reloading_disabled()?);
let reloadable = !self.is_log_filter_reloading_disabled().expect("Hardcoded Ok; qed");
logger.with_log_reloading(reloadable);

…but I think the booleans should be flipped so we'd have logger.with_log_reloading(self.log_reloading).

The origin of these funny booleans isn't this PR, but maybe we can avoid extending the awkwardness and have to do one bool flip here, and then another one in GlobalLoggerBuilder::with_log_reloading(). I'd rename GlobalLoggerBuilder.disable_log_reloading to just log_reloading: bool and work backwards from there.


/// Get a new [`TelemetryHandle`].
///
/// This is used when you want to register a new telemetry for a Substrate node.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new telemetry… what? Telemetry endpoint? Telemetry worker? Telemetry channel? Please elaborate (same on handle()).

Comment on lines +386 to +390
let telemetry_span = if config.telemetry_endpoints.is_some() {
Some(TelemetrySpan::new())
} else {
None
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let telemetry_span = if config.telemetry_endpoints.is_some() {
Some(TelemetrySpan::new())
} else {
None
};
let telemetry_span = config.telemetry_endpoints.as_ref().map(|_| TelemetrySpan::new() );

(same for full client)

Comment on lines +216 to +220
match self.telemetry_endpoints.as_ref() {
// Don't initialise telemetry if `telemetry_endpoints` == Some([])
Some(endpoints) if !endpoints.is_empty() => Some(endpoints),
_ => None,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
match self.telemetry_endpoints.as_ref() {
// Don't initialise telemetry if `telemetry_endpoints` == Some([])
Some(endpoints) if !endpoints.is_empty() => Some(endpoints),
_ => None,
}
// Don't initialise telemetry if `telemetry_endpoints` == Some([])
self.telemetry_endpoints
.as_ref()
.filter(|endpoints| endpoints.is_empty())```

@@ -228,14 +233,17 @@ pub struct TaskManager {
/// terminates and gracefully shutdown. Also ends the parent `future()` if a child's essential
/// task fails.
children: Vec<TaskManager>,
/// A telemetry handle used to enter the telemetry span when a task is spawned.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing; it's a TelemetrySpan, not a TelemetryHandle. Can you reword this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no, that's because I changed the implementation at some point. Before that the handle also handled the span


> **Note**: Cloning the [`Telemetry`] and polling from multiple clones has an unspecified behaviour.
Substrate's nodes initialize/register to the [`TelemetryWorker`] using a [`TelemetryHandle`].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Substrate's nodes initialize/register to the [`TelemetryWorker`] using a [`TelemetryHandle`].
Substrate's nodes initialize/register with the [`TelemetryWorker`] using a [`TelemetryHandle`].

> **Note**: Cloning the [`Telemetry`] and polling from multiple clones has an unspecified behaviour.
Substrate's nodes initialize/register to the [`TelemetryWorker`] using a [`TelemetryHandle`].
This handle can be cloned and passed around. It uses an asynchronous channel to communicate with
the running [`TelemetryWorker`] dedicated to registration. Registering a telemetry can happen at
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
the running [`TelemetryWorker`] dedicated to registration. Registering a telemetry can happen at
the running [`TelemetryWorker`] dedicated to registration. Registering a telemetry source can happen at

?

/// The URL string can be either a URL or a multiaddress.
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct TelemetryEndpoints(
#[serde(deserialize_with = "url_or_multiaddr_deser")] pub(crate) Vec<(Multiaddr, u8)>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[serde(deserialize_with = "url_or_multiaddr_deser")] pub(crate) Vec<(Multiaddr, u8)>,
#[serde(deserialize_with = "url_or_multiaddr_deser")]
pub(crate) Vec<(Multiaddr, u8)>,

impl TelemetryLayer {
/// Create a new [`TelemetryLayer`] and [`TelemetryWorker`].
///
/// If not provided, the `buffer_size` will be 16 by default.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// If not provided, the `buffer_size` will be 16 by default.
/// The `buffer_size` defaults to 16.

worker: Option<worker::TelemetryWorker>,
/// Receives log entries for them to be dispatched to the worker.
receiver: mpsc::Receiver<async_record::AsyncRecord>,
/// It should be ran as a background task using the [`TelemetryWorker::run`] method. This method
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// It should be ran as a background task using the [`TelemetryWorker::run`] method. This method
/// It should run as a background task using the [`TelemetryWorker::run`] method. This method

@cecton
Copy link
Contributor Author

cecton commented Jan 20, 2021

@dvdplm np I will make another PR

ghost pushed a commit to paritytech/polkadot that referenced this pull request Jan 20, 2021
* Adapt service creation to new substrate API

* Fix test

* WIP

* Revert "WIP"

This reverts commit 816a363.

* WIP

* Adapt to current code

* Fix tests

* Yet another fix

* CLEANUP

* WIP

* WIP

* WIP

* Adapt code to changes on substrate

* Adapt code

* Introduce kick.

* Fixes

* WIP

* WIP

* WIP

* WIP

* Bump

* Update sp-io

* WIP

Co-authored-by: Gav Wood <[email protected]>
Co-authored-by: Parity Benchmarking Bot <[email protected]>
Copy link
Contributor

@dvdplm dvdplm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments.

One slight concern I have that I'm not sure is valid is: there are a few places where you note that it's important to not call things out of order or at most once. It's a bit of a code smell and I wonder if we can do better than that, OnceCell-something-something. More something to ponder than a request for change.

client/tracing/src/logging/mod.rs Show resolved Hide resolved
client/tracing/src/logging/mod.rs Show resolved Hide resolved
client/tracing/src/logging/mod.rs Show resolved Hide resolved
client/tracing/src/logging/mod.rs Show resolved Hide resolved
client/tracing/src/logging/mod.rs Show resolved Hide resolved
client/tracing/src/logging/directives.rs Show resolved Hide resolved
ordian added a commit to paritytech/polkadot that referenced this pull request Jan 23, 2021
* master:
  Companion for sub/7040 (#1719)
  Companion PR for paritytech/substrate#7463 (#1948)
  Bump serde from 1.0.119 to 1.0.120 (#2292)
@dvdplm dvdplm mentioned this pull request Jan 25, 2021
@cecton
Copy link
Contributor Author

cecton commented Jan 27, 2021

One slight concern I have that I'm not sure is valid is: there are a few places where you note that it's important to not call things out of order or at most once. It's a bit of a code smell and I wonder if we can do better than that, OnceCell-something-something. More something to ponder than a request for change.

Sorry I forgot to answer you.

Now that the telemetry can be started in the middle of anything I think the only thing that remains is if you start the telemetry twice for the same substrate node. It's not really breaking, it will just extend the existing connections if necessary but I think it will add another connection message so every time the connection re-establishes it will send 2 connection messages instead of one. I could fix that to make it override the previous connection message.

ggwpez pushed a commit to ggwpez/runtimes that referenced this pull request Mar 10, 2023
* Adapt service creation to new substrate API

* Fix test

* WIP

* Revert "WIP"

This reverts commit 816a3633e91abc943b12b2bfa77ce98b959e78b2.

* WIP

* Adapt to current code

* Fix tests

* Yet another fix

* CLEANUP

* WIP

* WIP

* WIP

* Adapt code to changes on substrate

* Adapt code

* Introduce kick.

* Fixes

* WIP

* WIP

* WIP

* WIP

* Bump

* Update sp-io

* WIP

Co-authored-by: Gav Wood <[email protected]>
Co-authored-by: Parity Benchmarking Bot <[email protected]>
ggwpez pushed a commit to ggwpez/runtimes that referenced this pull request Jul 13, 2023
* Adapt service creation to new substrate API

* Fix test

* WIP

* Revert "WIP"

This reverts commit 816a3633e91abc943b12b2bfa77ce98b959e78b2.

* WIP

* Adapt to current code

* Fix tests

* Yet another fix

* CLEANUP

* WIP

* WIP

* WIP

* Adapt code to changes on substrate

* Adapt code

* Introduce kick.

* Fixes

* WIP

* WIP

* WIP

* WIP

* Bump

* Update sp-io

* WIP

Co-authored-by: Gav Wood <[email protected]>
Co-authored-by: Parity Benchmarking Bot <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C3-medium PR touches the given topic and has a medium impact on builders.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants