Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

feat: measured oneshots #3902

Merged
15 commits merged into from
Sep 29, 2021
Merged

feat: measured oneshots #3902

15 commits merged into from
Sep 29, 2021

Conversation

drahnr
Copy link
Contributor

@drahnr drahnr commented Sep 22, 2021

Introduces oneshots that measure the delay between creation, first poll and completion or error (distinguishing between hard timeout and cancellation).

Also prints a warning based on a set soft timeout.

Closes #3648

@drahnr drahnr self-assigned this Sep 22, 2021
@drahnr drahnr added A3-in_progress Pull request is in progress. No review needed at this stage. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. labels Sep 22, 2021
@drahnr drahnr added this to the v0.9.11 milestone Sep 22, 2021
@drahnr drahnr force-pushed the bernhard-measured-oneshots branch 4 times, most recently from 5f2ad61 to 8a0d1c6 Compare September 22, 2021 12:41
@drahnr
Copy link
Contributor Author

drahnr commented Sep 22, 2021

The API usage is as follows:

When creating a pair, it has to be given a name as well as two timeouts.

// 1. create
let (tx, rx) = metered::oneshot::channel("foo", Duration::from_secs(500), Duration::from_secs(3));

// 2. store the meter
let meter = rx.meter();

// 3.
let val = rx.await;

reasoning for this rather providing the metrics as part of the error or as tuple, is compat. This is a rather minimal change and we could add metrics as needed, yet just change all use-statements at once and adjust to the additional params without much chore.

@drahnr drahnr added A0-please_review Pull request needs code review. and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels Sep 22, 2021
node/metered-channel/src/oneshot.rs Outdated Show resolved Hide resolved
node/metered-channel/src/oneshot.rs Outdated Show resolved Hide resolved
node/metered-channel/src/oneshot.rs Outdated Show resolved Hide resolved
node/metered-channel/src/oneshot.rs Outdated Show resolved Hide resolved
@ordian
Copy link
Member

ordian commented Sep 22, 2021

The impl looks correct, but a bit too complicated to me. Can we just have a thin abstraction on top of rx.await that will do the following?

let now = Instant::now();
let result = rx.timeout(HARD_TIMEOUT).await?;
match result {
    None => {
        error!("Subsystem {name} is unresponsive", name=name);
    }
    Some(msg) if now.elapsed() > SOFT_TIMEOUT {
        warn!("subsystem {name} is slow", name=name);
    }
<snip>    
}

@drahnr
Copy link
Contributor Author

drahnr commented Sep 22, 2021

The impl looks correct, but a bit too complicated to me. Can we just have a thin abstraction on top of rx.await that will do the following?

let now = Instant::now();
let result = rx.timeout(HARD_TIMEOUT).await?;
match result {
    None => {
        error!("Subsystem {name} is unresponsive", name=name);
    }
    Some(msg) if now.elapsed() > SOFT_TIMEOUT {
        warn!("subsystem {name} is slow", name=name);
    }
<snip>    
}

Essentially that is what is implemented explicitly within poll, since I'd rather not have the above code spread over all cases of oneshot metering ( + the above is not measuring anything yet).

The complexity arises from Option<impl Future>, since we need to measure from the time we poll for the first time rather than creation. Without that, it'd become much simpler, but that'd be slightly less helpful I think.

Another, alternate simplification would be to spawn Delay with a very large timeout. And then once polled for the first time, call reset. But that seems mildly hacky.

I very much appreciate the feedback!

Note that I did not consider the complexity of the above code that much, I was more concerned how invasive introducing it throughout the code base would be.

@ordian
Copy link
Member

ordian commented Sep 22, 2021

The complexity arises from Option, since we need to measure from the time we poll for the first time rather than creation. Without that, it'd become much simpler, but that'd be slightly less helpful I think.

What I'm suggesting is implementing a function (or a struct wrapper for more type-safety) with the code above (modulo bikeshed on naming)

async fn timed_await(rx: oneshot::Receiver<T>) -> Result<T> {...}

and replacing rx.await? with timed_await(rx).await?. The time we poll the first time is the same as the time we call this function, so I don't see a difference here.

But maybe I'm missing something?

@drahnr
Copy link
Contributor Author

drahnr commented Sep 23, 2021

The complexity arises from Option, since we need to measure from the time we poll for the first time rather than creation. Without that, it'd become much simpler, but that'd be slightly less helpful I think.

What I'm suggesting is implementing a function (or a struct wrapper for more type-safety) with the code above (modulo bikeshed on naming)

async fn timed_await(rx: oneshot::Receiver<T>) -> Result<T> {...}

and replacing rx.await? with timed_await(rx).await?. The time we poll the first time is the same as the time we call this function, so I don't see a difference here.

But maybe I'm missing something?

What you describe would be the functional approach and at the first glance you indeed obtain similar measurements while avoiding some complexity.

With the struct- wrappers you will always end up with the complexity exhibited in this PR - one has to somehow mingle the futures and provide a Future trait impl for the wrapper type.

The complexity on the other handside allows to better measure the actual time between the first poll and the send rather than the time of the receiver end blocking. In the common case these will be almost identical, but I'd rather not simplify here, especially when one looks at grafana dashboards occluding the odd deltas.

This also allows to simply add a bunch of extra measurements as needed (what's the delay between receiver creation and first poll, the delta between the true send and the true conclusion, so machine overload conditions will not show as wrongly blocking durations, but these are just a few details that will eventually become important in our grafana views).

I'd prefer to merge this with the tad of added complexity rather than measuring almost the right thing :) - in fact this impl allows to measure both as needed.

node/metered-channel/src/oneshot.rs Outdated Show resolved Hide resolved
node/metered-channel/src/oneshot.rs Show resolved Hide resolved
node/metered-channel/src/oneshot.rs Outdated Show resolved Hide resolved
@drahnr drahnr added A3-in_progress Pull request is in progress. No review needed at this stage. and removed A0-please_review Pull request needs code review. labels Sep 29, 2021
@drahnr drahnr added A0-please_review Pull request needs code review. and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels Sep 29, 2021
)
}

#[allow(missing_docs)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this allow(missing_docs)? the measurements function is also without docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This in particular has missing docs, because iff the error is not self describing, there is something wrong already. The error annotation should be enough. This is a pattern we have in more places throughout the codebase, limited to Error types.

Copy link
Contributor

@rphmeier rphmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code here looks OK, but I wonder about expected usage - if it's meant to be plugged into prometheus, then it feels fairly unwieldy to do so.

@drahnr
Copy link
Contributor Author

drahnr commented Sep 29, 2021

The code here looks OK, but I wonder about expected usage - if it's meant to be plugged into prometheus, then it feels fairly unwieldy to do so.

How so?

Currently we have a struct Metrics per subsystem, with collector fn-implementations specific to the subsystem i.e. fn on_foo(u64) and fn on_bar(Hash).

Now the returned Result type of the receiver ( or more specific: Measurements) contains 3 values, which would need to be forwarded to a prometheus type like (in the most common case) prometheus::Histogram with i.e. fn observe(&self, v: f64).

So yes, the resulting evaluation is not perfect but it's also encapsulated and one does not have to differentiate between Ok and Err at the cost of a local let result = rx.await binding (or not, since Error and OutputWithMesasurements both impl Measurable).

There are trade-offs with this impl, but in a discussion /w @eskimor this seemed the most practical one, happy to be enlightened though

test_launch(
"starve_till_soft_timeout_then_food",
|tx| async move {
Delay::new(Duration::from_secs(2)).await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though it's just a test, I think it would be better to make this a constant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are all used exactly once, and are really only relevant per test case (and each of them is super short), so I don't think that's necessary (though I generally agree with you).

tracing::debug!(
target: LOG_TARGET,
unconnected_authorities = %PrettyAuthorities(unconnected_authorities.iter()),
unconnected_authorities = %pretty,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't that pretty enough? 😛

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't handle it, too pretty 🎭

@drahnr
Copy link
Contributor Author

drahnr commented Sep 29, 2021

bot merge

@ghost
Copy link

ghost commented Sep 29, 2021

Waiting for commit status.

@ghost ghost merged commit 71a721e into master Sep 29, 2021
@ghost ghost deleted the bernhard-measured-oneshots branch September 29, 2021 16:06
ordian added a commit that referenced this pull request Sep 29, 2021
* master:
  feat: measured oneshots (#3902)
  remove `AllSubsystems` and `AllSubsystemsGen` types (#3874)
  Companion for Substrate#9867 (#3938)
  Substrate Companion for #9552 (#3834)
  CI: run disputes tests (#3962)
  Bump parity-scale-codec from 2.3.0 to 2.3.1 (#3959)
  approval-voting: populate session cache in advance (#3954)
  Bump libc from 0.2.102 to 0.2.103 (#3950)
  fix master (#3955)
  Docker files chore (#3880)
  Bump nix from 0.19.1 to 0.20.0 (#3587)
  remove connected disconnected state, 3rd attempt (#3898)
  fix flaky chain-selection tests (#3948)
  Add benchmarking for parachain runtime initializer pallet (#3913)
  minor chore changes (#3944)
  disputes: reject single-sided disputes (#3903)
ordian added a commit that referenced this pull request Sep 30, 2021
* master: (52 commits)
  Companion for substrate PR#9890 (#3961)
  Bump version, tx_version and spec_version in prep for v0.9.11 (#3970)
  Fix master compilation (#3977)
  Make most XCM APIs accept an Into<MultiLocation> where MultiLocation is accepted (#3627)
  fix disputes tests (#3974)
  Drop availability only for candidates that lose disputes (#3973)
  revert +1 change to be on the safer side (#3972)
  paras_inherent: reject only candidates with concluded disputes (#3969)
  feat: measured oneshots (#3902)
  remove `AllSubsystems` and `AllSubsystemsGen` types (#3874)
  Companion for Substrate#9867 (#3938)
  Substrate Companion for #9552 (#3834)
  CI: run disputes tests (#3962)
  Bump parity-scale-codec from 2.3.0 to 2.3.1 (#3959)
  approval-voting: populate session cache in advance (#3954)
  Bump libc from 0.2.102 to 0.2.103 (#3950)
  fix master (#3955)
  Docker files chore (#3880)
  Bump nix from 0.19.1 to 0.20.0 (#3587)
  remove connected disconnected state, 3rd attempt (#3898)
  ...
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lack of oneshot response channel timing metrics
5 participants