
High CPU usage when misusing spans #458

Closed
rklaehn opened this issue Dec 5, 2019 · 19 comments
Labels: crate/subscriber (related to the `tracing-subscriber` crate), kind/perf (performance improvements)



rklaehn commented Dec 5, 2019

Bug Report

Version

│       ├── tracing v0.1.10
│       │   ├── tracing-attributes v0.1.5
│       │   └── tracing-core v0.1.7
│       ├── tracing-futures v0.1.1
│       │   └── tracing v0.1.10 (*)
│       └── tracing-subscriber v0.1.6
│           ├── tracing-core v0.1.7 (*)
│           └── tracing-log v0.1.1
│               └── tracing-core v0.1.7 (*)
│   ├── tracing v0.1.10 (*)
│   ├── tracing-futures v0.1.1 (*)
│   │   ├── tracing v0.1.10 (*)
│   │   ├── tracing-subscriber v0.1.6 (*)
│   ├── tracing v0.1.10 (*)
│   ├── tracing-futures v0.1.1 (*)
│   ├── tracing v0.1.10 (*)
│   └── tracing-subscriber v0.1.6 (*)
├── tracing v0.1.10 (*)
├── tracing-futures v0.1.1 (*)
├── tracing-subscriber v0.1.6 (*)
    ├── tracing v0.1.10 (*)
    ├── tracing-futures v0.1.1 (*)

Platform

Linux linux-box 4.15.0-70-generic #79-Ubuntu SMP Tue Nov 12 10:36:11 UTC 2019 x86_64 

Description

I'm not sure whether this is a bug, since the API was not being used correctly, but since Rust's motto is fearless concurrency, I thought I'd report it anyway. Maybe this can be better documented, or caught.

We have a fairly large project that uses tracing internally. I hit some very confusing behaviour where the process would sometimes consume 100% CPU on all cores. Using a flamegraph, I narrowed this down to tracing, and specifically to tracing spans.

The issue turned out to be that somebody had created a tracing span on a different thread from the one where the span was entered. I still don't know exactly what happened, but tracing did not like it at all!

There was no crash and the program still executed correctly, but all 12 virtual cores were loaded for about a minute.

Here is the code that caused so much trouble:

    {
        let current = Span::current();
        let span = debug_span!(parent: current, "spawn_mut");
        let (snd, rcv) = oneshot::channel();
        self.sender
            .send(Box::new(move |res| {
                let _enter = span.enter();

Here is the fix:

    {
        let current = Span::current();
        let (snd, rcv) = oneshot::channel();
        self.sender
            .send(Box::new(move |res| {
                let span = debug_span!(parent: current, "spawn_mut");
                let _enter = span.enter();
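
For context, here is a minimal, self-contained sketch of the fixed pattern. The channel and worker setup from our real code are elided; the plain thread::spawn and the fmt subscriber setup below are just stand-ins:

    use std::thread;
    use tracing::{debug, debug_span, Span};

    fn main() {
        // Install the fmt subscriber (any subscriber works here).
        tracing_subscriber::fmt::init();

        // Capture the current span on the spawning thread...
        let current = Span::current();
        let worker = thread::spawn(move || {
            // ...but create and enter the child span on the thread that
            // actually runs the work, as in the fix above.
            let span = debug_span!(parent: current, "spawn_mut");
            let _enter = span.enter();
            debug!("doing work on the worker thread");
        });
        worker.join().unwrap();
    }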

I attached a flamegraph of the behaviour I saw.

flamegraph_tracing_100.zip

Maybe something can be done in code or in documentation to prevent people from running into the same issue.

Anyway, feel free to close this if you think it is not worth addressing. And thanks for the awesome library!

hawkw added the crate/subscriber and kind/perf labels on Dec 5, 2019

hawkw commented Dec 5, 2019

Looking at your flamegraph, this appears to be a bug in tracing-subscriber::fmt. I'm guessing something in the Store::new_span method can cause a hot loop when a span is entered while it is still being inserted on another thread (though I may be wrong):

    pub(crate) fn new_span<F>(&self, attrs: &Attributes<'_>, fmt_fields: &F) -> Id
    where
        F: for<'writer> FormatFields<'writer>,
    {
        let mut span = Some(Data::new(attrs, self));
        // The slab's free list is a modification of Treiber's lock-free stack,
        // using slab indices instead of pointers, and with a provision for
        // growing the slab when needed.
        //
        // In order to insert a new span into the slab, we "pop" the next free
        // index from the stack.
        loop {
            // Acquire a snapshot of the head of the free list.
            let head = self.next.load(Ordering::Relaxed);
            {
                // Try to insert the span without modifying the overall
                // structure of the stack.
                let this = try_lock!(self.inner.read(), else return Id::from_u64(0xDEADFACE));
                // Can we insert without reallocating?
                if head < this.slab.len() {
                    // If someone else is writing to the head slot, we need to
                    // acquire a new snapshot!
                    if let Ok(mut slot) = this.slab[head].try_write() {
                        // Is the slot we locked actually empty? If not, fall
                        // through and try to grow the slab.
                        if let Some(next) = slot.next() {
                            // Is our snapshot still valid?
                            if self.next.compare_and_swap(head, next, Ordering::Release) == head {
                                // We can finally fill the slot!
                                slot.fill(span.take().unwrap(), attrs, fmt_fields);
                                return idx_to_id(head);
                            }
                        }
                    }
                    // Our snapshot got stale, try again!
                    atomic::spin_loop_hint();
                    continue;
                }
            }
            // We need to grow the slab, and must acquire a write lock.
            if let Ok(mut this) = self.inner.try_write() {
                let len = this.slab.len();
                // Insert the span into a new slot.
                let slot = Slot::new(span.take().unwrap(), attrs, fmt_fields);
                this.slab.push(RwLock::new(slot));
                // TODO: can we grow the slab in chunks to avoid having to
                // realloc as often?
                // Update the head pointer and return.
                self.next.store(len + 1, Ordering::Release);
                return idx_to_id(len);
            }
            atomic::spin_loop_hint();
        }
    }
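
To make the failure mode concrete, here is a minimal sketch (not tracing's actual code) of the same busy-wait shape: a retry loop that can only make progress after another thread acts. If that other thread is delayed, the loop pins a core at 100% instead of parking, which is what your flamegraph shows:

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;
    use std::thread;
    use std::time::Duration;

    fn main() {
        let ready = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&ready);

        // Analogous to new_span retrying while another thread still holds
        // the head slot's write lock: no progress is possible, so we spin.
        let spinner = thread::spawn(move || {
            while !flag.load(Ordering::Acquire) {
                std::hint::spin_loop(); // burns a full core while waiting
            }
        });

        // The "slow" other thread: until it flips the flag, the spinner
        // shows up as 100% CPU in a profile.
        thread::sleep(Duration::from_millis(100));
        ready.store(true, Ordering::Release);
        spinner.join().unwrap();
    }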

It's worth noting that all of this code has been rewritten for the upcoming tracing-subscriber v0.2. The new implementation should handle spans moving between threads much better. I'd love to know whether you can still reproduce this with a Git dependency on the current tracing-subscriber from master.


rklaehn commented Dec 5, 2019

I can try this right away. What would the git dependency in Cargo.toml look like? Sorry for the basic question; I've never done this before. I checked https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html but could not find how to specify a git dependency on a crate inside a larger repository.


rklaehn commented Dec 5, 2019

Never mind, this is the right dependency spec, right?

    tracing-subscriber = { version = "0.2.0-alpha.1", git = "https://github.com/tokio-rs/tracing", branch = "master", package = "tracing-subscriber" }


rklaehn commented Dec 5, 2019

OK, I ran the experiment. This seems to be the issue: performance with the 0.2 alpha is much better than with 0.1. See attached videos.

So when is 0.2 coming out? I'm afraid we will have to disable all tracing until then, as you can see from the video...

This has been driving me crazy; apparently I was more affected by it than my colleagues because I have the most virtual cores...

tracing.zip


hawkw commented Dec 5, 2019

Never mind, this is the right dependency spec, right?

    tracing-subscriber = { version = "0.2.0-alpha.1", git = "https://github.com/tokio-rs/tracing", branch = "master", package = "tracing-subscriber" }

Yeah, that should work. Just FYI, you don't need to specify everything you've listed here; the following should be equivalent:

    tracing-subscriber = { git = "https://github.com/tokio-rs/tracing" }

:)

Note that if you were actually going to use a git dependency in production, you would probably want to pin to a fixed git revision, like so:

    tracing-subscriber = { git = "https://github.com/tokio-rs/tracing", rev = "3f14e9d7127c4f2ca70ab1c5c2ad7c9439f5c40b" }


rklaehn commented Dec 5, 2019

FYI, it seems this is only related to spans. I have another branch where I disabled all the spans but left in all the debug! and trace! calls, and it performs well and does not have the problem!


hawkw commented Dec 5, 2019

OK, I ran the experiment. This seems to be the issue: performance with the 0.2 alpha is much better than with 0.1. See attached videos.

Awesome, that's great to hear! Glad the 0.2 release fixed it!

So when is 0.2 coming out? I'm afraid we will have to disable all tracing until then, as you can see from the video...

There's an 0.2.0-alpha.1 version of tracing-subscriber already published; however, some important bug fixes have been merged since then. I can publish an 0.2.0-alpha.2 for you shortly if you're interested.

I'd like to get a release version of 0.2 out soon, but there are a couple of other potentially breaking changes I'd like to spend some more time looking into.


hawkw commented Dec 5, 2019

FYI, it seems this is only related to spans. I have another branch where I disabled all the spans but left in all the debug! and trace! calls, and it performs well and does not have the problem!

Yeah, that's not surprising — the issue is specifically in the code for storing in-progress spans in tracing_subscriber::fmt. Events (like debug!, info!, etc) and other subscriber implementations shouldn't be affected at all.
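
A rough sketch of the distinction (the names a_span and do_work here are just illustrative):

    use tracing::{debug, debug_span};

    fn do_work() {
        // Events are formatted and written immediately; they never touch
        // the subscriber's span storage, so debug!/trace! were unaffected.
        debug!("just an event");

        // Creating and entering a span goes through the span storage
        // (Store::new_span above), which is where the hot loop lived.
        let span = debug_span!("a_span");
        let _enter = span.enter();
        debug!("an event inside the span");
    }

    fn main() {
        tracing_subscriber::fmt::init();
        do_work();
    }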


rklaehn commented Dec 5, 2019

I think we would use 0.2.0-alpha.2 for now if you published it. Would I have to update the other tracing dependencies as well, or does tracing-subscriber 0.2.0 work with tracing 0.1? (It seems to in my test, but just asking...)


hawkw commented Dec 5, 2019

Would I have to update the other tracing dependencies as well, or does tracing-subscriber 0.2.0 work with tracing 0.1? (It seems to in my test, but just asking...)

It should be totally compatible with everything else. The only breaking changes in 0.2 are to some of the APIs for customizing the behaviour of the fmt subscriber — for example, if you are setting a custom formatter for events, the traits for that have changed slightly.
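
For instance, a plain setup like this sketch (the constructor is the one used in the crate's examples) should keep compiling unchanged; only code that implements the custom formatting traits needs updating:

    use tracing_subscriber::FmtSubscriber;

    fn main() {
        // Building and installing the default fmt subscriber is unaffected
        // by the 0.2 breaking changes; only custom formatter traits moved.
        let subscriber = FmtSubscriber::new();
        tracing::subscriber::set_global_default(subscriber)
            .expect("failed to install subscriber");

        tracing::info!("fmt subscriber installed");
    }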


rklaehn commented Dec 5, 2019

OK, then it would be very cool if you could publish 0.2.0-alpha.2. We would use that for now and update the rest later. Thanks a lot for the quick help!


rklaehn commented Dec 6, 2019

So for now I tried an explicit reference to the most recent tracing-subscriber on master, like this:

    tracing-subscriber = { git = "https://github.com/tokio-rs/tracing", rev = "43685e1c493f8e5979fbe75f797e23952592e843" }

This works fine on my machine and fixes the perf issues, but our CI fails for one of the ARM builds:

error[E0080]: could not evaluate constant
  --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sharded-slab-0.0.6/src/page/slot.rs:54:24
   |
54 |     const LEN: usize = (cfg::WIDTH - C::RESERVED_BITS) - Self::SHIFT;
   |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ attempt to subtract with overflow

error[E0080]: could not evaluate constant
   --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sharded-slab-0.0.6/src/lib.rs:691:26
    |
691 |         let shift = 1 << (Self::LEN - 1);
    |                          ^^^^^^^^^^^^^^^ referenced constant has errors

error[E0080]: could not evaluate constant
  --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sharded-slab-0.0.6/src/cfg.rs:40:30
   |
40 |     const USED_BITS: usize = Generation::<Self>::LEN + Generation::<Self>::SHIFT;
   |                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ referenced constant has errors

error[E0080]: could not evaluate constant
   --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sharded-slab-0.0.6/src/cfg.rs:142:34
    |
142 |             .field("used_bits", &C::USED_BITS)
    |                                  ^^^^^^^^^^^^ referenced constant has errors

error[E0080]: could not evaluate constant
   --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sharded-slab-0.0.6/src/lib.rs:704:25
    |
704 |     const MASK: usize = Self::BITS << Self::SHIFT;
    |                         ^^^^^^^^^^^^^^^^^^^^^^^^^ referenced constant has errors

error[E0080]: could not evaluate constant
   --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sharded-slab-0.0.6/src/page/slot.rs:560:24
    |
560 |     const LEN: usize = Generation::<C>::LEN;
    |                        ^^^^^^^^^^^^^^^^^^^^ referenced constant has errors

error[E0080]: could not evaluate constant
   --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sharded-slab-0.0.6/src/page/slot.rs:495:37
    |
495 |     const LEN: usize = cfg::WIDTH - (Lifecycle::<C>::LEN + Generation::<C>::LEN);
    |                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ referenced constant has errors

error[E0080]: could not evaluate constant
   --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sharded-slab-0.0.6/src/lib.rs:699:26
    |
699 |     const SHIFT: usize = Self::Prev::SHIFT + Self::Prev::LEN;
    |                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ referenced constant has errors

error: aborting due to 8 previous errors

Apparently this is due to the new dependency on sharded-slab:

    sharded-slab = { version = "0.0.6", optional = true }

Any way to get around this? It is marked as an optional dep.


hawkw commented Dec 6, 2019

error[E0080]: could not evaluate constant
 ...

Apparently this is due to the new dependency on sharded-slab

Yeah, this is a known issue (hawkw/sharded-slab#9): the bit-width constants underflow at compile time on 32-bit targets like your ARM build. A fix is already up; I can merge it as soon as it passes CI, then publish a new sharded-slab release and bump the dep in tracing-subscriber.

Any way to get around this? It is marked as an optional dep.

It's required by the fmt feature, which gives you the FmtSubscriber you're using. It's an optional dep because that feature can be disabled.
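
If you really needed to build without it, the only lever would be turning off tracing-subscriber's default features, which drops fmt (and with it sharded-slab) but also loses you the FmtSubscriber; a hypothetical sketch:

    # Hypothetical: no default features means no `fmt` feature, so no
    # sharded-slab, but also no FmtSubscriber.
    tracing-subscriber = { version = "0.2.0-alpha.1", default-features = false }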

BTW, the FmtSubscriber performance improvements in 0.2 are due to the switch from a custom span storage type to using sharded-slab, so that dependency is actually why this issue is fixed in 0.2.


rklaehn commented Dec 6, 2019

As soon as there is a new version I will try it out (on Monday) and see whether it causes any issues on our various build targets. If not, I would love to use it.

BTW, the FmtSubscriber performance improvements in 0.2 are due to the switch from a custom span storage type to using sharded-slab, so that dependency is actually why this issue is fixed in 0.2.

Yes, I saw that after investigating the feature dependencies more closely...


hawkw commented Dec 8, 2019

@rklaehn I published a new alpha: https://crates.io/crates/tracing-subscriber/0.2.0-alpha.2

Sorry it took a couple of days; I had to un-break some unrelated build failures first.


rklaehn commented Dec 13, 2019

Thanks a lot ❤️. I will check whether this works on all our targets.

Can you give a rough estimate of when tracing-subscriber 0.2.0 will be out? "When it's done" is also a perfectly valid answer.


hawkw commented Dec 13, 2019

Can you give a rough estimate of when tracing-subscriber 0.2.0 will be out? "When it's done" is also a perfectly valid answer.

"When it's done, but hopefully soon"? There are a couple of APIs I want to try to clean up before releasing 0.2, but I haven't had the time to work on that lately. If it ends up taking too long, I can probably push some of that to an eventual 0.3 — there's already a lot of new stuff in 0.2.


rklaehn commented Dec 19, 2019

OK, I have now switched to alpha.2; it passes all our tests and no longer has any perf impact. We are going to use it in production.

Is there a regression test to make sure the perf does not degrade, or is it such an obvious change that that is not necessary?

In any case, thanks for the quick help. Feel free to close this issue. For us it is solved with 0.2 alpha 2.


hawkw commented Dec 19, 2019

Is there a regression test to make sure the perf does not degrade, or is it such an obvious change that that is not necessary?

There are some benchmarks, but I'll have to double check that this specific case is covered.
