Document reload layer performance impact #1632

Open

bkchr opened this issue Oct 11, 2021 · 6 comments
@bkchr (Contributor) commented Oct 11, 2021

Hey, we have observed that using the reload layer has quite a significant performance impact. Without `parking_lot` (we realized later that this feature is disabled by default) it takes about 6x more time when `RUST_LOG=bullshit=trace` is set, i.e. every event has to be checked against the "bullshit" target even though nothing matches it. With `parking_lot` enabled the slowdown is still about 3x. We think it would be nice to document this.

@hawkw (Member) commented Oct 11, 2021

There is a note in the module-level documentation for reload stating that it introduces some performance overhead, although it doesn't go into great detail: https://docs.rs/tracing-subscriber/0.2.25/tracing_subscriber/reload/index.html. Perhaps it should be expanded. Is there anything you'd like to see mentioned more specifically in the docs?

I'm somewhat surprised that the reload layer makes filtering as much as 3x slower with parking_lot enabled, though. That may be worth investigating...
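For reference, the pattern under discussion is essentially the example from the `reload` module docs; a minimal sketch (assuming the tracing-subscriber 0.3 API) is below. The wrapped filter sits behind the reload layer's `RwLock`: filtering takes read locks, and swapping the filter through the handle takes the write lock.

```rust
// Minimal sketch of a reloadable filter, closely following the `reload`
// module docs (assumes tracing-subscriber 0.3).
use tracing_subscriber::{filter, fmt, prelude::*, reload};

fn main() {
    let (filter, reload_handle) = reload::Layer::new(filter::LevelFilter::INFO);
    tracing_subscriber::registry()
        .with(filter)
        .with(fmt::Layer::default())
        .init();

    tracing::info!("this event is enabled");
    tracing::debug!("this one is filtered out");

    // Swap the filter at runtime through the handle; this takes the write
    // lock, while every filtering check takes a read lock.
    reload_handle
        .modify(|filter| *filter = filter::LevelFilter::DEBUG)
        .expect("failed to reload the filter");

    tracing::debug!("now debug events are enabled too");
}
```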

@bkchr (Contributor, Author) commented Oct 11, 2021

I'm somewhat surprised that the reload layer makes filtering as much as 3x slower with parking_lot enabled, though. That may be worth investigating...

There is probably a lot of output being printed from different threads. I "measured" this overhead with wasmtime while it compiled some Wasm code, and wasmtime uses a lot of threads for that. However, most of the calls only take a read lock as far as I can see, so I also don't really understand why it is that slow.

@hawkw (Member) commented Oct 11, 2021

Hmm, there is currently a microbenchmark for comparing the performance of a reload layer with that of the same subscriber without reloading enabled. However, I noticed that this benchmark currently only tests the single-threaded case.

I wrote a version where events are recorded by multiple threads concurrently, but I still don't see a significant performance difference between the benchmark with reloading and the version without:

reload/complex-collector
                        time:   [1.2370 us 1.2400 us 1.2431 us]

reload/complex-collector-reloadable
                        time:   [1.2078 us 1.2117 us 1.2152 us]

reload_concurrent/complex-collector
                        time:   [8.4684 us 8.5037 us 8.5417 us]

reload_concurrent/complex-collector-reloadable
                        time:   [8.6130 us 8.6478 us 8.6845 us]

I would certainly expect the RwLock to introduce some amount of overhead, but I wouldn't expect it to be nearly so severe when the lock is never contended. I'm assuming you're not reloading the filter very frequently?

If your application spawns an extremely large number of threads, the RwLock may have a more noticeable impact than when the layer is shared between a smaller number of threads. We could potentially use a sharded-lock approach to reduce the impact on the read path.
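As a rough illustration of the kind of multi-threaded measurement being discussed, something like the sketch below reproduces the "trace directive that matches nothing" scenario from the original report. This is an editor's sketch, not the benchmark from the repository; the thread and iteration counts are arbitrary.

```rust
// Sketch of a multi-threaded measurement of filtering overhead behind a
// reload layer (not the repository's actual benchmark).
use std::time::Instant;
use tracing_subscriber::{filter::EnvFilter, fmt, prelude::*, reload};

fn main() {
    // A trace-level directive for a target nothing uses, so events cannot be
    // rejected by the cheap max-level check and must go through full
    // filtering. (Requires tracing-subscriber's `env-filter` feature.)
    let (filter, _handle) = reload::Layer::new(EnvFilter::new("does_not_exist=trace"));
    tracing_subscriber::registry()
        .with(filter)
        .with(fmt::layer())
        .init();

    let start = Instant::now();
    let threads: Vec<_> = (0..8)
        .map(|_| {
            std::thread::spawn(|| {
                for _ in 0..100_000 {
                    tracing::trace!(target: "stuff", "hi");
                    tracing::debug!(target: "enabled", "hi");
                }
            })
        })
        .collect();
    for thread in threads {
        thread.join().unwrap();
    }
    println!("elapsed: {:?}", start.elapsed());
}
```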

@hawkw (Member) commented Oct 11, 2021

Per your comment here:

However, most of the calls are using a read lock as far as I have seen. So, I also don't really understand why it is that slow.

I noticed you said "most of the calls are using a read lock". Does that mean that during normal operation of the program, the layer is being reloaded regularly?

The intended use case for the reload layer is one where the filter configuration is reloaded very infrequently, typically in response to a user-initiated configuration change (e.g. changing settings in a configuration UI, sending the program a SIGHUP to reload a config file, etc.). A general rule of thumb: if a filter is being reloaded by some automated process, you're probably reloading it too frequently to get good performance out of `reload`. If you're using it to implement dynamic filtering or sampling, where a different filter configuration is used for, say, 1 out of every 100 requests, you should probably use a custom filter implementation instead.
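To make that concrete, here is a hedged sketch of the intended pattern (an editor's illustration, assuming the `EnvFilter`-based reload setup; reading directives from stdin stands in for whatever user-initiated mechanism an application actually uses, such as a SIGHUP handler or a settings UI).

```rust
// Sketch: the filter is only swapped on an explicit user request, so the
// write lock is taken rarely and the hot path only ever takes read locks.
use std::io::BufRead;
use tracing_subscriber::{filter::EnvFilter, fmt, prelude::*, reload};

fn main() {
    let (filter, handle) = reload::Layer::new(EnvFilter::new("info"));
    tracing_subscriber::registry()
        .with(filter)
        .with(fmt::layer())
        .init();

    // e.g. the user types `my_crate=trace` to turn on more verbose logging.
    let stdin = std::io::stdin();
    for line in stdin.lock().lines().map_while(Result::ok) {
        match EnvFilter::try_new(line.trim()) {
            Ok(new_filter) => {
                if let Err(error) = handle.reload(new_filter) {
                    eprintln!("failed to reload the filter: {error}");
                }
            }
            Err(error) => eprintln!("invalid filter directives: {error}"),
        }
    }
}
```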

@bkchr (Contributor, Author) commented Oct 11, 2021

I noticed you said "most of the calls are using a read lock". Does that mean that during normal operation of the program, the layer is being reloaded regularly?

No, no. The layer is not reloaded regularly; it only happens when the user provides new logging targets, which is almost never. So that should be fine.

@bkchr (Contributor, Author) commented Oct 11, 2021

Could you try to change this

fn events() {
    tracing::info!(target: "this", "hi");
    tracing::debug!(target: "enabled", "hi");
    tracing::warn!(target: "hepp", "hi");
    tracing::trace!(target: "stuff", "hi");
}
to

for _ in 0..1000 {
    tracing::info!(target: "this", "hi");
    tracing::debug!(target: "enabled", "hi");
    tracing::warn!(target: "hepp", "hi");
    tracing::trace!(target: "stuff", "hi");
}

And change this

.add_directive(LevelFilter::INFO.into());

to a directive like `something=trace`? The point is to have the maximum level be trace, but still require every event to be filtered by its target prefix.
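For clarity, the suggested change to the filter construction would look something like the sketch below (an editor's illustration; `something` is just a placeholder target and the benchmark's actual setup may differ).

```rust
use tracing_subscriber::filter::EnvFilter;

// Before: only an INFO-level directive, so the max level stays at INFO and
// disabled trace!/debug! events are rejected by the cheap max-level check.
// After: a target-specific trace directive raises the max level to TRACE, so
// every trace!/debug! event has to be matched against the directives.
fn build_filter() -> EnvFilter {
    EnvFilter::default()
        .add_directive("something=trace".parse().expect("valid directive"))
}
```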

hawkw pushed a commit that referenced this issue Oct 18, 2021
…1636)

## Motivation

We use `tracing` as our logger in
[`substrate`](https://github.com/paritytech/substrate). We've noticed
that as soon as *any* `trace` log is enabled (even one that doesn't
exist) the whole logging machinery starts to take a lot of time, even
if nothing at all is actually printed!

In one of our quick-and-dirty reproduction benchmarks (JIT-ing a WASM
program) we saw the total real runtime rise from around ~1.3s to ~7s
just by adding a `trace` log filter which doesn't match anything.
(Depending on the hardware and on how many threads are logging
simultaneously this figure varies pretty widely, but it's always a very
significant slowdown.)

After looking into this problem I've found that the culprits of the
slowdown were `trace!` and `debug!` logs sprinkled quite liberally across
some of the hotter codepaths. When there are no `trace`-level filters
defined on the logger it can reject those inert `trace!` and `debug!`
logs purely based on the current maximum logging level (which is
cheap!), but as soon as you define *any* trace filter the current
maximum logging level changes, and then every `trace!` and `debug!` log
has to go through the whole filtering machinery before it can be
rejected. While this is cheap if you only do it once, it becomes very
expensive when you do it a lot, especially when you're running multiple
threads and enable log reloading. (This is related to
#1632.)

## Solution

I've added an opt-in per-thread LRU cache which remembers whether the
logger is actually interested in a given `target` + `level` pair for
every log emitted through the `log` crate.

I've also added a benchmark very roughly replicating the situation from
our code; here's the performance *without* the cache: (`cargo bench`)

```
[838.67 ns 846.51 ns 854.04 ns]
```

And here's the performance *with* the cache: (`cargo bench --features
interest-cache`)

```
[25.322 ns 25.556 ns 25.820 ns]
```

As you can see, the per-call cost was cut down to roughly 3% of what it was without the cache.
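Conceptually, the interest cache described above works something like the sketch below. This is an editor's illustration only, not the implementation added in the PR: the real cache is a proper per-thread LRU that avoids allocating on lookup, while this version keeps only the core idea.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// One cache per thread, so no cross-thread synchronization is needed.
thread_local! {
    static INTEREST_CACHE: RefCell<HashMap<(String, usize), bool>> =
        RefCell::new(HashMap::new());
}

const MAX_ENTRIES: usize = 1024;

/// Returns whether a `target` + `level` pair (level encoded as a number,
/// e.g. `log::Level as usize`) is enabled, consulting the expensive filter
/// (`slow_check`) only on a cache miss.
fn is_interested(target: &str, level: usize, slow_check: impl FnOnce() -> bool) -> bool {
    INTEREST_CACHE.with(|cache| {
        let mut cache = cache.borrow_mut();
        let key = (target.to_owned(), level);
        if let Some(&interested) = cache.get(&key) {
            return interested;
        }
        // Cache miss: run the full filter evaluation and remember the result.
        let interested = slow_check();
        if cache.len() >= MAX_ENTRIES {
            // Crude eviction for the sketch; the real cache evicts per entry.
            cache.clear();
        }
        cache.insert(key, interested);
        interested
    })
}
```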
kaffarell pushed a commit to kaffarell/tracing that referenced this issue May 22, 2024