
High CPU usage when misusing spans #458

Closed
rklaehn opened this issue Dec 5, 2019 · 19 comments
Labels: crate/subscriber (related to the `tracing-subscriber` crate), kind/perf (performance improvements)



rklaehn commented Dec 5, 2019

Bug Report

Version

│       ├── tracing v0.1.10
│       │   ├── tracing-attributes v0.1.5
│       │   └── tracing-core v0.1.7
│       ├── tracing-futures v0.1.1
│       │   └── tracing v0.1.10 (*)
│       └── tracing-subscriber v0.1.6
│           ├── tracing-core v0.1.7 (*)
│           └── tracing-log v0.1.1
│               └── tracing-core v0.1.7 (*)
│   ├── tracing v0.1.10 (*)
│   ├── tracing-futures v0.1.1 (*)
│   │   ├── tracing v0.1.10 (*)
│   │   ├── tracing-subscriber v0.1.6 (*)
│   ├── tracing v0.1.10 (*)
│   ├── tracing-futures v0.1.1 (*)
│   ├── tracing v0.1.10 (*)
│   └── tracing-subscriber v0.1.6 (*)
├── tracing v0.1.10 (*)
├── tracing-futures v0.1.1 (*)
├── tracing-subscriber v0.1.6 (*)
    ├── tracing v0.1.10 (*)
    ├── tracing-futures v0.1.1 (*)

Platform

Linux linux-box 4.15.0-70-generic #79-Ubuntu SMP Tue Nov 12 10:36:11 UTC 2019 x86_64 

Description

I'm not sure whether this is a bug, since the API was not being used correctly, but since Rust's motto is fearless concurrency, I thought I'd report it anyway. Maybe this can be better documented, or caught.

We have a fairly large project that uses tracing internally. I hit some very confusing behaviour where the process would sometimes consume 100% CPU on all cores. Using a flamegraph, I narrowed this down to tracing, and specifically to tracing spans.

The issue turned out to be that somebody had created a tracing span on a different thread from the one where the span was entered. I still don't know exactly what happened, but tracing did not like it at all!

There was no crash and the program still executed correctly, but all 12 virtual cores were loaded for about a minute.

Here is the code that caused so much trouble:

    {
        let current = Span::current();
        let span = debug_span!(parent: current, "spawn_mut");
        let (snd, rcv) = oneshot::channel();
        self.sender
            .send(Box::new(move |res| {
                let _enter = span.enter();

Here is the fix:

    {
        let current = Span::current();
        let (snd, rcv) = oneshot::channel();
        self.sender
            .send(Box::new(move |res| {
                let span = debug_span!(parent: current, "spawn_mut");
                let _enter = span.enter();
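
For context, here is a minimal, self-contained sketch of the fixed pattern. The channel and worker setup from our real code are elided; the plain thread::spawn and the fmt subscriber setup below are just stand-ins:

    use std::thread;
    use tracing::{debug, debug_span, Span};

    fn main() {
        // Install the fmt subscriber (any subscriber works here).
        tracing_subscriber::fmt::init();

        // Capture the current span on the spawning thread...
        let current = Span::current();
        let worker = thread::spawn(move || {
            // ...but create and enter the child span on the thread that
            // actually runs the work, as in the fix above.
            let span = debug_span!(parent: current, "spawn_mut");
            let _enter = span.enter();
            debug!("doing work on the worker thread");
        });
        worker.join().unwrap();
    }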

I attached a flamegraph of the behaviour I saw.

flamegraph_tracing_100.zip

Maybe something can be done in code or in documentation to prevent people from running into the same issue.

Anyway, feel free to close this if you think it is not worth addressing. And thanks for the awesome library!

hawkw added the crate/subscriber and kind/perf labels on Dec 5, 2019

hawkw commented Dec 5, 2019

Looking at your flamegraph, this appears to be a bug in tracing-subscriber::fmt. I'm guessing something in the Store::new_span method can cause a hot loop when a span is entered while it is still being inserted on another thread (though I may be wrong):

    pub(crate) fn new_span<F>(&self, attrs: &Attributes<'_>, fmt_fields: &F) -> Id
    where
        F: for<'writer> FormatFields<'writer>,
    {
        let mut span = Some(Data::new(attrs, self));
        // The slab's free list is a modification of Treiber's lock-free stack,
        // using slab indices instead of pointers, and with a provision for
        // growing the slab when needed.
        //
        // In order to insert a new span into the slab, we "pop" the next free
        // index from the stack.
        loop {
            // Acquire a snapshot of the head of the free list.
            let head = self.next.load(Ordering::Relaxed);
            {
                // Try to insert the span without modifying the overall
                // structure of the stack.
                let this = try_lock!(self.inner.read(), else return Id::from_u64(0xDEADFACE));
                // Can we insert without reallocating?
                if head < this.slab.len() {
                    // If someone else is writing to the head slot, we need to
                    // acquire a new snapshot!
                    if let Ok(mut slot) = this.slab[head].try_write() {
                        // Is the slot we locked actually empty? If not, fall
                        // through and try to grow the slab.
                        if let Some(next) = slot.next() {
                            // Is our snapshot still valid?
                            if self.next.compare_and_swap(head, next, Ordering::Release) == head {
                                // We can finally fill the slot!
                                slot.fill(span.take().unwrap(), attrs, fmt_fields);
                                return idx_to_id(head);
                            }
                        }
                    }
                    // Our snapshot got stale, try again!
                    atomic::spin_loop_hint();
                    continue;
                }
            }
            // We need to grow the slab, and must acquire a write lock.
            if let Ok(mut this) = self.inner.try_write() {
                let len = this.slab.len();
                // Insert the span into a new slot.
                let slot = Slot::new(span.take().unwrap(), attrs, fmt_fields);
                this.slab.push(RwLock::new(slot));
                // TODO: can we grow the slab in chunks to avoid having to
                // realloc as often?
                // Update the head pointer and return.
                self.next.store(len + 1, Ordering::Release);
                return idx_to_id(len);
            }
            atomic::spin_loop_hint();
        }
    }
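
To make the failure mode concrete, here is a minimal sketch (not tracing's actual code) of the same busy-wait shape: a retry loop that can only make progress after another thread acts. If that other thread is delayed, the loop pins a core at 100% instead of parking, which is what your flamegraph shows:

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;
    use std::thread;
    use std::time::Duration;

    fn main() {
        let ready = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&ready);

        // Analogous to new_span retrying while another thread still holds
        // the head slot's write lock: no progress is possible, so we spin.
        let spinner = thread::spawn(move || {
            while !flag.load(Ordering::Acquire) {
                std::hint::spin_loop(); // burns a full core while waiting
            }
        });

        // The "slow" other thread: until it flips the flag, the spinner
        // shows up as 100% CPU in a profile.
        thread::sleep(Duration::from_millis(100));
        ready.store(true, Ordering::Release);
        spinner.join().unwrap();
    }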

It's worth noting that all of this code has been rewritten for the upcoming tracing-subscriber v0.2. The new implementation should handle spans moving between threads much better. I'd love to know whether you can still reproduce this with a Git dependency on the current tracing-subscriber from master.


rklaehn commented Dec 5, 2019

I can try this right away. What would the git dependency in Cargo.toml look like? Sorry for the basic question; I've never done this before. I checked https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html but could not find how to specify a git dependency on a crate inside a larger repository.


rklaehn commented Dec 5, 2019

Never mind, this is the right dependency spec, right?

    tracing-subscriber = { version = "0.2.0-alpha.1", git = "https://github.com/tokio-rs/tracing", branch = "master", package = "tracing-subscriber" }


rklaehn commented Dec 5, 2019

OK, I ran the experiment. This seems to be the issue: performance with the 0.2 alpha is much better than with 0.1. See attached videos.

So when is 0.2 coming out? I'm afraid we will have to disable all tracing until then, as you can see from the video...

This has been driving me crazy; apparently I was more affected by it than my colleagues because I have the most virtual cores...

tracing.zip


hawkw commented Dec 5, 2019

Never mind, this is the right dependency spec, right?

    tracing-subscriber = { version = "0.2.0-alpha.1", git = "https://github.com/tokio-rs/tracing", branch = "master", package = "tracing-subscriber" }

Yeah, that should work. Just FYI, you don't need to specify everything you've listed here; the following should be equivalent:

    tracing-subscriber = { git = "https://github.com/tokio-rs/tracing" }

:)

Note that if you were actually going to use a git dependency in production, you would probably want to pin to a fixed git revision, like so:

    tracing-subscriber = { git = "https://github.com/tokio-rs/tracing", rev = "3f14e9d7127c4f2ca70ab1c5c2ad7c9439f5c40b" }


rklaehn commented Dec 5, 2019

FYI, it seems this is only related to spans. I have another branch where I disabled all the spans but left in all the debug! and trace! calls, and it performs well and does not have the problem!


hawkw commented Dec 5, 2019

OK, I ran the experiment. This seems to be the issue: performance with the 0.2 alpha is much better than with 0.1. See attached videos.

Awesome, that's great to hear! Glad the 0.2 release fixed it!

So when is 0.2 coming out? I'm afraid we will have to disable all tracing until then, as you can see from the video...

There's an 0.2.0-alpha.1 version of tracing-subscriber already published; however, some important bug fixes have been merged since then. I can publish an 0.2.0-alpha.2 for you shortly if you're interested.

I'd like to get a release version of 0.2 out soon, but there are a couple of other potentially breaking changes I'd like to spend some more time looking into.


hawkw commented Dec 5, 2019

FYI, it seems this is only related to spans. I have another branch where I disabled all the spans but left in all the debug! and trace! calls, and it performs well and does not have the problem!

Yeah, that's not surprising — the issue is specifically in the code for storing in-progress spans in tracing_subscriber::fmt. Events (like debug!, info!, etc) and other subscriber implementations shouldn't be affected at all.
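
A rough sketch of the distinction (the names a_span and do_work here are just illustrative):

    use tracing::{debug, debug_span};

    fn do_work() {
        // Events are formatted and written immediately; they never touch
        // the subscriber's span storage, so debug!/trace! were unaffected.
        debug!("just an event");

        // Creating and entering a span goes through the span storage
        // (Store::new_span above), which is where the hot loop lived.
        let span = debug_span!("a_span");
        let _enter = span.enter();
        debug!("an event inside the span");
    }

    fn main() {
        tracing_subscriber::fmt::init();
        do_work();
    }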


rklaehn commented Dec 5, 2019

I think we would use 0.2.0-alpha.2 for now if you published it. Would I have to update the other tracing dependencies as well, or does tracing-subscriber 0.2.0 work with tracing 0.1? (It seems to in my test, but just asking...)


hawkw commented Dec 5, 2019

Would I have to update the other tracing dependencies as well, or does tracing-subscriber 0.2.0 work with tracing 0.1? (It seems to in my test, but just asking...)

It should be totally compatible with everything else. The only breaking changes in 0.2 are to some of the APIs for customizing the behaviour of the fmt subscriber — for example, if you are setting a custom formatter for events, the traits for that have changed slightly.
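
For instance, a plain setup like this sketch (the constructor is the one used in the crate's examples) should keep compiling unchanged; only code that implements the custom formatting traits needs updating:

    use tracing_subscriber::FmtSubscriber;

    fn main() {
        // Building and installing the default fmt subscriber is unaffected
        // by the 0.2 breaking changes; only custom formatter traits moved.
        let subscriber = FmtSubscriber::new();
        tracing::subscriber::set_global_default(subscriber)
            .expect("failed to install subscriber");

        tracing::info!("fmt subscriber installed");
    }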


rklaehn commented Dec 5, 2019

OK, then it would be very cool if you could publish 0.2.0-alpha.2. We would use that for now and update the rest later. Thanks a lot for the quick help!


rklaehn commented Dec 6, 2019

So for now I tried an explicit reference to the most recent tracing-subscriber on master, like this:

    tracing-subscriber = { git = "https://github.com/tokio-rs/tracing", rev = "43685e1c493f8e5979fbe75f797e23952592e843" }

This works fine on my machine and fixes the perf issues, but our CI fails for one of the ARM builds:

error[E0080]: could not evaluate constant
  --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sharded-slab-0.0.6/src/page/slot.rs:54:24
   |
54 |     const LEN: usize = (cfg::WIDTH - C::RESERVED_BITS) - Self::SHIFT;
   |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ attempt to subtract with overflow

error[E0080]: could not evaluate constant
   --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sharded-slab-0.0.6/src/lib.rs:691:26
    |
691 |         let shift = 1 << (Self::LEN - 1);
    |                          ^^^^^^^^^^^^^^^ referenced constant has errors

error[E0080]: could not evaluate constant
  --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sharded-slab-0.0.6/src/cfg.rs:40:30
   |
40 |     const USED_BITS: usize = Generation::<Self>::LEN + Generation::<Self>::SHIFT;
   |                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ referenced constant has errors

error[E0080]: could not evaluate constant
   --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sharded-slab-0.0.6/src/cfg.rs:142:34
    |
142 |             .field("used_bits", &C::USED_BITS)
    |                                  ^^^^^^^^^^^^ referenced constant has errors

error[E0080]: could not evaluate constant
   --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sharded-slab-0.0.6/src/lib.rs:704:25
    |
704 |     const MASK: usize = Self::BITS << Self::SHIFT;
    |                         ^^^^^^^^^^^^^^^^^^^^^^^^^ referenced constant has errors

error[E0080]: could not evaluate constant
   --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sharded-slab-0.0.6/src/page/slot.rs:560:24
    |
560 |     const LEN: usize = Generation::<C>::LEN;
    |                        ^^^^^^^^^^^^^^^^^^^^ referenced constant has errors

error[E0080]: could not evaluate constant
   --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sharded-slab-0.0.6/src/page/slot.rs:495:37
    |
495 |     const LEN: usize = cfg::WIDTH - (Lifecycle::<C>::LEN + Generation::<C>::LEN);
    |                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ referenced constant has errors

error[E0080]: could not evaluate constant
   --> /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sharded-slab-0.0.6/src/lib.rs:699:26
    |
699 |     const SHIFT: usize = Self::Prev::SHIFT + Self::Prev::LEN;
    |                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ referenced constant has errors

error: aborting due to 8 previous errors

Apparently this is due to the new dependency on sharded-slab:

    sharded-slab = { version = "0.0.6", optional = true }

Any way to get around this? It is marked as an optional dep.


hawkw commented Dec 6, 2019

error[E0080]: could not evaluate constant
 ...

Apparently this is due to the new dependency on sharded-slab

Yeah, this is a known issue (hawkw/sharded-slab#9): the bit-width constants underflow at compile time on 32-bit targets like your ARM build. A fix is already up; I can merge it as soon as it passes CI, then publish a new sharded-slab release and bump the dep in tracing-subscriber.

Any way to get around this? It is marked as an optional dep.

It's required by the fmt feature, which gives you the FmtSubscriber you're using. It's an optional dep because that feature can be disabled.
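
If you really needed to build without it, the only lever would be turning off tracing-subscriber's default features, which drops fmt (and with it sharded-slab) but also loses you the FmtSubscriber; a hypothetical sketch:

    # Hypothetical: no default features means no `fmt` feature, so no
    # sharded-slab, but also no FmtSubscriber.
    tracing-subscriber = { version = "0.2.0-alpha.1", default-features = false }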

BTW, the FmtSubscriber performance improvements in 0.2 are due to the switch from a custom span storage type to using sharded-slab, so that dependency is actually why this issue is fixed in 0.2.


rklaehn commented Dec 6, 2019

As soon as there is a new version I will try it out (on Monday) and see whether it causes any issues on our various build targets. If not, I would love to use it.

BTW, the FmtSubscriber performance improvements in 0.2 are due to the switch from a custom span storage type to using sharded-slab, so that dependency is actually why this issue is fixed in 0.2.

Yes, I saw that after investigating the feature dependencies more closely...


hawkw commented Dec 8, 2019

@rklaehn I published a new alpha: https://crates.io/crates/tracing-subscriber/0.2.0-alpha.2

Sorry it took a couple of days; I had to un-break some unrelated build failures first.


rklaehn commented Dec 13, 2019

Thanks a lot ❤️. I will check whether this works on all our targets.

Can you give a rough estimate of when tracing-subscriber 0.2.0 will be out? "When it's done" is also a perfectly valid answer.


hawkw commented Dec 13, 2019

Can you give a rough estimate of when tracing-subscriber 0.2.0 will be out? "When it's done" is also a perfectly valid answer.

"When it's done, but hopefully soon"? There are a couple of APIs I want to try to clean up before releasing 0.2, but I haven't had the time to work on that lately. If it ends up taking too long, I can probably push some of that to an eventual 0.3 — there's already a lot of new stuff in 0.2.


rklaehn commented Dec 19, 2019

OK, I have now switched to alpha.2; it passes all our tests and no longer has any perf impact. We are going to use it in production.

Is there a regression test to make sure the perf does not degrade, or is it such an obvious change that that is not necessary?

In any case, thanks for the quick help. Feel free to close this issue. For us it is solved with 0.2 alpha 2.


hawkw commented Dec 19, 2019

Is there a regression test to make sure the perf does not degrade, or is it such an obvious change that that is not necessary?

There are some benchmarks, but I'll have to double check that this specific case is covered.
