fix(subscriber): only send *new* tasks/resources/etc over the event channel #238
Conversation
This is currently a draft PR because there's still a decently large amount of code that's commented out. In particular, we don't currently record attribute updates, because I haven't written that part yet, and basically all of the JSON recording is broken by this change and needs to be rewritten. Also, I haven't cleaned up the code, added comments, or fixed most of the warnings yet. But, the console's basic functionality mostly works --- you can run this branch and see tasks and their poll durations and so on. I wanted to go ahead and open a PR to get some feedback from @seanmonstar and @zaharidichev so that we can agree that the overall approach here is a good idea. If you two (or anyone else) have a chance to take a look at this, I'd really appreciate it!
The plan looks good to me. It would be cool to measure the impact before and after, in case locking and atomic ops hurt the runtime, but mostly just so we are aware...
console-subscriber/src/record.rs (outdated)

```rust
at: *at,
op: *op,
},
// TODO(eliza): this more or less bricks recording, lol...put this back.
```
lol 😭
Yeah, it would definitely be good to come up with a benchmark setup for measuring the console instrumentation's performance cost...we haven't really done this yet, so it would be good to do regardless. FWIW, using the MPSC channel also involves locking and some atomic ops, so 🤷‍♀️
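For what it's worth, a crude first-order comparison could look something like the sketch below. This is just an illustration, not an agreed-on benchmark: the workload, the task count, and the use of `console_subscriber::init()` as the setup call are all assumptions.

```rust
use std::time::Instant;

// Spawn a bunch of short-lived tasks so the instrumentation's
// per-task/per-poll overhead dominates the measurement.
async fn workload() {
    let mut handles = Vec::with_capacity(10_000);
    for _ in 0..10_000 {
        handles.push(tokio::spawn(async {
            tokio::task::yield_now().await;
        }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}

#[tokio::main]
async fn main() {
    // Comment this line out (or gate it behind an env var) to get the
    // uninstrumented baseline for comparison.
    console_subscriber::init();

    let start = Instant::now();
    workload().await;
    println!("elapsed: {:?}", start.elapsed());
}
```

A real benchmark setup would probably want something like criterion and more representative workloads, but even a rough before/after number would show whether the locking and atomic ops are visible at all.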
Currently, `console-subscriber` contains a bunch of machinery for rewriting non-sequential `span::Id`s from `tracing` to sequential IDs (see #75). Upon thinking about this for a bit, I don't actually understand why this has to be done on the instrumentation side. This seems like extra work that's currently done in the instrumented application, when it really doesn't have to be. Instead, the client should be responsible for rewriting `tracing` IDs to pretty, sequential user-facing IDs. This would have a few advantages:

- it moves some work out of the application, which is always good
- if data is being emitted through an implementation other than `console-subscriber`, we will *still* get nicely ordered IDs
- this also makes some of the stuff I'm working on in #238 easier

This branch removes ID rewriting from `console-subscriber` and adds it to the `console` CLI's `state` module.

Closes #240

Signed-off-by: Eliza Weisman <[email protected]>
Okay, I've marked this as "ready for review", as I believe everything should work now and it compiles without warnings. @seanmonstar and @zaharidichev, whenever you two have time to take a look, I'd appreciate a careful review, as this is a kind of large change. I think it's possible to remove some of the mutexes; the change I discussed in #254 could remove the need for mutexes in timestamps. However, in practice, I think the mutexes should basically never be contended, as the aggregator task won't try to read from them unless the dirty bit is set, and it only gets set when we're done updating the data. The main overhead of the mutexes is probably adding a heap allocation/ptr dereference, which is fixed by enabling the `parking_lot` feature.
The approach looks great to me and makes sense. I think there is a problem, however, with the lifecycle of async ops. If you run the mutex example, you will see that the lock always stays in one state, and after it runs for a bit, the async ops associated with the Mutex resource just go to zero. On `main`, however, new ops continue to appear, and the `state` attribute of the resource changes from `locked=true` to `locked=false`.
```rust
let span = match ctx.span(id) {
    Some(span) => span,
    // XXX(eliza): no span exists for a resource ID, we should maybe
```
I think we should just go ahead and print an error. This kind of error can lead to wrong data and quite a few head-scratchers, so we'd better just print. On the other hand, having random output in any app that includes the crate is a bit annoying. I wonder whether we can introduce some more elegant way of handling these logic errors. WDYT?
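One possible shape for that, sketched here as a purely hypothetical helper (not what this PR does): panic in debug builds so the logic error surfaces in tests, and print at most once per call site in release builds so an instrumented app isn't spammed.

```rust
// Hypothetical helper: loud in debug builds, quiet-ish in release.
macro_rules! logic_error {
    ($($arg:tt)*) => {{
        if cfg!(debug_assertions) {
            panic!($($arg)*);
        } else {
            use std::sync::Once;
            // One `Once` per call site, so each distinct logic error
            // is reported at most once.
            static WARNED: Once = Once::new();
            WARNED.call_once(|| eprintln!($($arg)*));
        }
    }};
}

fn on_missing_span(id: u64) {
    logic_error!("no span exists for resource ID {}; this is probably a bug", id);
}
```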
```rust
pub(crate) fn drop_task(&self, dropped_at: SystemTime) {
    if self.is_dropped.swap(true, AcqRel) {
        // The task was already dropped.
        // TODO(eliza): this could maybe panic in debug mode...
```
Would be great to elaborate
Thanks, good catch! I must have missed that because I was mostly testing with the semaphore example. I'll try to figure out what's going on.

@hawkw I think that has something to do with the fact that you are creating resources, async ops, and tasks as not dirty. I wonder whether that is correct. The way you have it now, if an instrumented app simply creates a bunch of resources like locks, they will not show up in the console until they are used in some way. Is that what we want? In any case, if you flip the … The other problem with the …
Oh, that's definitely not right. I'll fix that.
thanks @zaharidichev for noticing that!
Very nice refactor, works great. I have tested it with most of the examples and things are as expected.
EDIT: after adding some printlns, it does appear that @zaharidichev is correct about this; it looks like it just happens that, because the mutex is re-acquired very quickly in the example, we never actually publish an update where it is in the unlocked state. If I modify the example a bit so that the tasks wait a variable amount of time before locking the mutex, and change the main task to not immediately lock the mutex in a loop, we see it in both the locked and unlocked states. I think this is probably fine.
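For reference, a sketch of the kind of modification described (the real `mutex` example differs; the delays and task count here are made up):

```rust
use std::{sync::Arc, time::Duration};
use tokio::{sync::Mutex, time::sleep};

#[tokio::main]
async fn main() {
    let mutex = Arc::new(Mutex::new(0u64));

    for i in 0..5u64 {
        let mutex = mutex.clone();
        tokio::spawn(async move {
            loop {
                // Stagger the lock attempts so the mutex spends
                // observable time in the *unlocked* state.
                sleep(Duration::from_millis(100 + i * 50)).await;
                *mutex.lock().await += 1;
            }
        });
    }

    // Unlike the original tight loop, the main task doesn't immediately
    // re-acquire the lock; it just keeps the program alive so the
    // console can watch the resource.
    sleep(Duration::from_secs(60)).await;
}
```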
@hawkw Yes, I just sat down to comment that this is exactly what I did as well to ensure that the update gets propagated. In this example it is just the case that the lock is heavily contended - there is basically always an async op awaiting it, so the moment it gets released, it is acquired again. So yes, all works as expected.
<a name="0.1.1"></a>

## 0.1.1 (2022-01-18)

#### Bug Fixes

* only send *new* tasks/resources/etc over the event channel (#238) ([fdc77e2](fdc77e2))
* increased default event buffer capacity (#235) ([0cf0aee](0cf0aee))
* use saturating arithmetic for attribute updates (#234) ([fe82e17](fe82e17))

#### Changes

* moved ID rewriting from `console-subscriber` to the client (#244) ([095b1ef](095b1ef))

## 0.1.0 (2021-12-16)

- Initial release! 🎉
Self-wakes were not being detected and displayed in the console. The `burn` task in the `app` example - which deliberately has many self-wakes - was not registering any. It appears that a logic error was present in the self-wake counting in `console-subscriber` since it was added in #238. When a self wake was detected, the `wakes` count was incremented a second time (the `wakes` count is incremented for all wakes before checking for a self wake), instead of incrementing the `self_wakes` count. This PR fixes the logic so that when a self wake is detected, the `self_wakes` count is incremented.
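A sketch of the corrected accounting (the stats type and field names here are illustrative, not the crate's actual ones):

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

// Illustrative stand-in for the per-task stats; not the real type.
#[derive(Default)]
struct TaskStats {
    wakes: AtomicU64,
    self_wakes: AtomicU64,
}

// `wakes` is bumped for every wake. On a self wake, `self_wakes` is
// bumped *in addition* -- the bug was incrementing `wakes` a second
// time here instead of `self_wakes`.
fn record_wake(stats: &TaskStats, is_self_wake: bool) {
    stats.wakes.fetch_add(1, Relaxed);
    if is_self_wake {
        stats.self_wakes.fetch_add(1, Relaxed);
    }
}
```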
When entering and exiting a span, the old code was also updating the parent stats. This was causing excessive polling to be reported for the parent tasks. See issue #378 for more details. The regression was introduced by the refactor in #238. This fixes the issue by limiting updates to the current span. Closes #378. Co-authored-by: Hayden Stainsby <[email protected]> Co-authored-by: Eliza Weisman <[email protected]>
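A sketch of the shape of the fix, using the public `tracing-subscriber` `Layer` API; the `Stats` type and what gets recorded are placeholders:

```rust
use std::sync::Arc;
use tracing::{span::Id, Subscriber};
use tracing_subscriber::{layer::Context, registry::LookupSpan, Layer};

// Placeholder for the per-span stats stored in span extensions.
struct Stats { /* ... */ }

struct InstrumentLayer;

impl<S> Layer<S> for InstrumentLayer
where
    S: Subscriber + for<'a> LookupSpan<'a>,
{
    fn on_enter(&self, id: &Id, ctx: Context<'_, S>) {
        // Only touch the stats attached to *this* span. The regression
        // was also updating ancestor spans' stats on every enter/exit,
        // inflating the parents' poll counts.
        if let Some(span) = ctx.span(id) {
            if let Some(stats) = span.extensions().get::<Arc<Stats>>() {
                let _ = stats; // record the poll start on `stats` here
            }
        }
    }
}
```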
Motivation
Currently, there are some rather bad issues that occur when the event
buffer is at capacity and events are dropped.
Completely losing data due to buffer capacity is relatively okay: if
we set a bound on how much memory the console can use, and we don't
record new things that occur when we've reached that limit, this is
correct and acceptable behavior. However, the current design can result
in incorrect data when events are lost due to the buffer being at
capacity.
This is because we currently record things like starting to poll a
task/resource, ending a poll, dropping a task/resource/async op, and
waker operations, as individual events in the buffer. This means that we
can have a situation where the creation of a task was recorded, because
there was buffer capacity at the time, but then when the task ended, the
buffer was full, so we never recorded its termination. That results in
the tasks appearing to run forever and never terminate --- see issue
#230. Similarly, if we record the beginning of a poll, but drop the
end-of-poll event because we're at capacity, this means we will
(incorrectly) record a poll that goes on forever, which is obviously
incorrect. I think this may also be the cause of the false positives
with the lost waker lint (#149), if we record a waker drop but missed a
waker clone that previously occurred.
The change in #212 fixed one category of issue that occurs due to event
buffer capacity --- when a task, resource, or async op's creation
event is dropped due to buffer capacity, we skip any subsequent events
related to that task/resource/op. However, this doesn't fix issues where
the subsequent events are the ones that are dropped.
Solution
This branch proposes a solution to this whole category of event buffer
capacity related issues. Unfortunately, this requires rewriting a lot
of `console-subscriber`'s internals.

In the new approach, we now only send events over the channel when
creating a new task, resource, or async op. Those events now contain
an `Arc` holding the stats for that entity. Another clone of the `Arc`
is stored in the `tracing_subscriber::Registry`'s span extensions for
the span corresponding to that entity. When the `ConsoleLayer` records
subsequent events for a particular entity, such as starting/ending a
poll, it looks up the span by ID, and updates the stats type stored in
its extensions. The aggregator stores its clone of the `Arc` in a map
of entities, just like it does currently, but no longer handles
actually updating the stats; it just builds wire-format updates from
any tracked entities whose data was updated by the layer.
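To make that data flow concrete, here's a minimal sketch of the pattern described above. The `TaskStats` and `Event` types, and the bare `std::sync::mpsc` channel, are stand-ins for illustration only, not the crate's actual types:

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering::Relaxed},
    mpsc::Sender,
    Arc,
};

// Illustrative per-task stats shared between the layer and aggregator.
#[derive(Default)]
struct TaskStats {
    polls: AtomicU64,
    wakes: AtomicU64,
}

// The only message that crosses the channel: the *creation* of a task,
// carrying a handle to its shared stats.
enum Event {
    NewTask { span_id: u64, stats: Arc<TaskStats> },
}

// Layer side: on a new task span, allocate the shared stats, send one
// clone to the aggregator, and keep the other (in the real code, in the
// span's extensions) for later updates.
fn record_new_task(span_id: u64, tx: &Sender<Event>) -> Arc<TaskStats> {
    let stats = Arc::new(TaskStats::default());
    let _ = tx.send(Event::NewTask {
        span_id,
        stats: stats.clone(),
    });
    stats
}

// Subsequent events never touch the channel; the layer just mutates the
// shared stats it finds in the span's extensions.
fn record_poll_start(stats: &TaskStats) {
    stats.polls.fetch_add(1, Relaxed);
}
```

In the real implementation, a send that fails because the channel is full would mean the entity is never tracked at all; the sketch glosses over that, but it's exactly the behavior described next.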
This should fix all issues where dropping something due to event buffer
capacity results in incorrect data. Once we have successfully recorded
the creation of a task, resource, or async op, any subsequent updates
to its stats are guaranteed to be reliable. If the channel is at
capacity and we fail to record a new resource/task/op, we never create a
stats extension for it, and we won't record anything for it at all.
Otherwise, it will always have correct data recorded.
When possible, the fields in the `Arc`ed stats are updated atomically.
In some cases, this isn't easily possible, and some fields of the
stats types are stored in a mutex. In particular, this is required for
storing timestamps. I don't really love that, but these mutices should
be contended very infrequently. Stats aren't marked as having unsent
updates until after the stats inside the mutices have been updated, so
the aggregator will not try to lock the mutex if the layer is
currently updating it; instead, it will simply be included in the next
update once the layer is no longer touching it. Mutices here will only
be contended when multiple threads are updating a task's stats at the
same time, which should occur very rarely...and in most cases, they
still won't have to contend a mutex, since access to most of the
mutices is guarded by an atomic variable for e.g. determining which
thread actually was the last to complete a concurrent poll. The
biggest performance downside of the mutices is probably not actually
contention, but the additional heap allocation required when using
`std::sync::Mutex`. However, since we have conditional `parking_lot`
support, `parking_lot` can be used to avoid requiring additional
allocations.
In the future, it's probably possible to make more of this atomic by
converting timestamps into integers and storing them in atomic
variables. I haven't done this yet because both the protobuf
timestamps and `std::time` timestamps are larger than a single 64-bit
number, and it might take a little extra work to ensure we can nicely
fit them in an `AtomicUsize`...but we can probably do that later.
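For what it's worth, a sketch of what that might look like. This is not implemented in this branch; the representation, the base-`Instant` trick, and all the names are assumptions:

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::time::{Duration, Instant};

// Store nanoseconds elapsed since a per-process base `Instant` in an
// AtomicU64 instead of a full timestamp behind a mutex. Zero is
// reserved to mean "not yet recorded".
struct AtomicTimestamp {
    base: Instant,
    nanos: AtomicU64,
}

impl AtomicTimestamp {
    fn new(base: Instant) -> Self {
        Self { base, nanos: AtomicU64::new(0) }
    }

    fn record_now(&self) {
        // Clamp to at least 1ns so a genuinely-zero elapsed time isn't
        // confused with "unset".
        let nanos = (self.base.elapsed().as_nanos() as u64).max(1);
        self.nanos.store(nanos, Relaxed);
    }

    fn since_base(&self) -> Option<Duration> {
        match self.nanos.load(Relaxed) {
            0 => None,
            n => Some(Duration::from_nanos(n)),
        }
    }
}
```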