fix(subscriber): only send *new* tasks/resources/etc over the event channel #238

Merged · 17 commits into main from eliza/less-buffer · Jan 12, 2022

Conversation

@hawkw (Member) commented Dec 21, 2021

Motivation

Currently, there are some rather bad issues that occur when the event
buffer is at capacity and events are dropped.

Completely losing data due to buffer capacity is relatively okay: if
we set a bound on how much memory the console can use, and we don't
record new things that occur when we've reached that limit, this is
correct and acceptable behavior. However, the current design can result
in incorrect data when events are lost due to the buffer being at
capacity.

This is because we currently record things like starting to poll a
task/resource, ending a poll, dropping a task/resource/async op, and
waker operations, as individual events in the buffer. This means that we
can have a situation where the creation of a task was recorded, because
there was buffer capacity at the time, but then when the task ended, the
buffer was full, so we never recorded its termination. That results in
the tasks appearing to run forever and never terminate --- see issue
#230. Similarly, if we record the beginning of a poll, but drop the
end-of-poll event because we're at capacity, we will record a poll that
appears to go on forever, which is obviously incorrect. I think this may
also be the cause of the false positives
with the lost waker lint (#149), if we record a waker drop but missed a
waker clone that previously occurred.

The change in #212 fixed one category of issue that occurs due to event
buffer capacity --- when a task, resource, or async op's creation
event is dropped due to buffer capacity, we skip any subsequent events
related to that task/resource/op. However, this doesn't fix issues where
the subsequent events are the ones that are dropped.

Solution

This branch proposes a solution to this whole category of event buffer
capacity related issues. Unfortunately, this requires rewriting a lot
of console-subscriber's internals.

In the new approach, we now only send events over the channel when
creating a new task, resource, or async op. Those events now contain an
Arc holding the stats for that entity. Another clone of the Arc is
stored in the tracing_subscriber::Registry's span extensions for the
span corresponding to that entity. When the ConsoleLayer records
subsequent events for a particular entity, such as starting/ending a
poll, it looks up the span by ID, and updates the stats type stored in
its extensions. The aggregator stores its clone of the Arc in a map of
entities, just as it does currently, but it no longer handles actually
updating the stats; it just builds wire-format updates from any tracked
entities whose data was updated by the layer.
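
To make the data flow concrete, here is a minimal sketch of what the layer
side of this design could look like. The names (`SketchLayer`, `Stats`,
`Event::NewTask`) are illustrative stand-ins, not the actual
`console-subscriber` types, and error handling is elided:

```rust
use std::sync::{
    atomic::{AtomicBool, AtomicU64, Ordering},
    Arc, Mutex,
};
use std::time::SystemTime;
use tracing_core::{span, Subscriber};
use tracing_subscriber::{layer::Context, registry::LookupSpan, Layer};

// Hypothetical per-entity stats shared between the layer and the aggregator.
#[derive(Default)]
struct Stats {
    polls: AtomicU64,
    last_poll_started: Mutex<Option<SystemTime>>,
    dirty: AtomicBool,
}

// Hypothetical event type; only *creation* of an entity is sent on the channel.
enum Event {
    NewTask { id: span::Id, stats: Arc<Stats> },
}

struct SketchLayer {
    // Bounded channel to the aggregator; if it's full, the event is dropped.
    tx: tokio::sync::mpsc::Sender<Event>,
}

impl<S> Layer<S> for SketchLayer
where
    S: Subscriber + for<'a> LookupSpan<'a>,
{
    fn on_new_span(&self, _attrs: &span::Attributes<'_>, id: &span::Id, ctx: Context<'_, S>) {
        let stats = Arc::new(Stats::default());
        // Send the creation event; only if that succeeds do we attach the stats
        // to the span, so a dropped creation means nothing is ever recorded for it.
        let event = Event::NewTask { id: id.clone(), stats: Arc::clone(&stats) };
        if self.tx.try_send(event).is_ok() {
            if let Some(span) = ctx.span(id) {
                span.extensions_mut().insert(stats);
            }
        }
    }

    fn on_enter(&self, id: &span::Id, ctx: Context<'_, S>) {
        // Subsequent events never touch the channel: look up the span by ID and
        // update the Arc'd stats stored in its extensions in place.
        if let Some(stats) = ctx
            .span(id)
            .and_then(|span| span.extensions().get::<Arc<Stats>>().cloned())
        {
            *stats.last_poll_started.lock().unwrap() = Some(SystemTime::now());
            stats.polls.fetch_add(1, Ordering::Relaxed);
            stats.dirty.store(true, Ordering::Release);
        }
    }
}
```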

This should fix all issues where dropping something due to event buffer
capacity results in incorrect data. Once we have successfully recorded
the creation of a task, resource, or async op, any subsequent updates
to its stats are guaranteed to be reliable. If the channel is at
capacity and we fail to record a new resource/task/op, we never create a
stats extension for it, and we won't record anything for it at all.
Otherwise, it will always have correct data recorded.

When possible, the fields of the Arced stats are updated atomically. In
some cases, this isn't easily possible, and some fields of the stats
types are stored in a mutex. In particular, this is required for storing
timestamps. I don't really love that, but these mutexes should be
contended very infrequently. Stats aren't marked as having unsent
updates until after the stats inside the mutexes have been updated, so
the aggregator will not try to lock a mutex while the layer is currently
updating it; instead, the change will simply be included in the next
update once the layer is no longer touching it. Mutexes here will only
be contended when multiple threads are updating a task's stats at the
same time, which should occur very rarely...and in most cases, they
still won't have to contend a mutex, since access to most of the mutexes
is guarded by an atomic variable, e.g. for determining which thread was
actually the last to complete a concurrent poll. The biggest performance
downside of the mutexes is probably not contention, but the additional
heap allocation required when using std::sync::Mutex. However, since we
have conditional parking_lot support, parking_lot can be used to avoid
the additional allocations.
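
As a small sketch of the ordering described above (with illustrative field
names, not the real ones), the layer finishes the mutex-guarded write before
it flips the dirty flag that tells the aggregator there is something new to
read:

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Mutex,
};
use std::time::SystemTime;

#[derive(Default)]
struct PollTimestamps {
    last_poll_started: Option<SystemTime>,
    last_poll_ended: Option<SystemTime>,
}

#[derive(Default)]
struct TaskStats {
    poll_times: Mutex<PollTimestamps>,
    dirty: AtomicBool,
}

impl TaskStats {
    fn end_poll(&self, at: SystemTime) {
        {
            // Update the timestamp while holding the lock...
            let mut times = self.poll_times.lock().unwrap();
            times.last_poll_ended = Some(at);
        } // ...release the lock here...
        // ...and only then mark the stats dirty, so an aggregator that checks
        // the flag before locking won't contend with this in-progress update.
        self.dirty.store(true, Ordering::Release);
    }
}
```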

In the future, it's probably possible to make more of this atomic by
converting timestamps into integers and storing them in atomic
variables. I haven't done this yet because both the protobuf timestamps
and std::time timestamps are larger than a single 64-bit number and it
might take a little extra work to ensure we can nicely fit them in an
AtomicUsize...but we can probably do that later.
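
For reference, one possible shape for that future change would be to store
nanoseconds relative to a per-process base time in an `AtomicU64`; this is
just a sketch of the idea, not something this branch implements:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, SystemTime};

/// A timestamp packed into a single atomic word as nanoseconds since `base`.
/// Zero means "not recorded yet"; u64 nanoseconds covers ~584 years of range.
struct AtomicTimestamp {
    base: SystemTime,
    nanos: AtomicU64,
}

impl AtomicTimestamp {
    fn new(base: SystemTime) -> Self {
        Self { base, nanos: AtomicU64::new(0) }
    }

    fn record(&self, at: SystemTime) {
        let nanos = at
            .duration_since(self.base)
            .unwrap_or(Duration::ZERO)
            .as_nanos() as u64;
        self.nanos.store(nanos, Ordering::Release);
    }

    fn get(&self) -> Option<SystemTime> {
        match self.nanos.load(Ordering::Acquire) {
            0 => None,
            n => Some(self.base + Duration::from_nanos(n)),
        }
    }
}
```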

@hawkw (Member, Author) commented Dec 21, 2021

This is currently a draft PR because there's still a decently large amount of code that's commented out. In particular, we don't currently record attribute updates, because I haven't written that part yet, and basically all of the JSON recording is broken by this change and needs to be rewritten. Also, I haven't cleaned up the code, added comments, or fixed most of the warnings yet.

But, the console's basic functionality mostly works --- you can run this branch and see tasks and their poll durations and so on. I wanted to go ahead and open a PR to get some feedback from @seanmonstar and @zaharidichev so that we can agree that the overall approach here is a good idea. If you two (or anyone else) have a chance to take a look at this, I'd really appreciate it!

@seanmonstar (Member) left a comment

The plan looks good to me. Would be cool to measure the impact before and after, in case locking and atomic ops hurt the runtime, but mostly just so we are aware...

at: *at,
op: *op,
},
// TODO(eliza): this more or less bricks recording, lol...put this back.
lol 😭

@hawkw (Member, Author) commented Dec 21, 2021

Would be cool to measure the impact before and after, in case locking and atomic ops hurt the runtime, but mostly just so we are aware...

Yeah, it would definitely be good to come up with a benchmark setup for measuring the console instrumentation's performance cost...we haven't really done this yet, so it would be good to do regardless.

FWIW, using the MPSC channel also involves locking and some atomic ops, so 🤷‍♀️

@hawkw hawkw force-pushed the eliza/less-buffer branch from 59f3996 to a4b9d2c on December 28, 2021 18:51
hawkw added a commit that referenced this pull request Dec 28, 2021
Currently, `console-subscriber` contains a bunch of machinery for
rewriting non-sequential `span::Id`s from `tracing` to sequential IDs
(see #75). Upon thinking about this for a bit, I don't actually
understand why this has to be done on the instrumentation-side. This
seems like extra work that's currently done in the instrumented
application, when it really doesn't have to be.

Instead, the client should be responsible for rewriting `tracing` IDs to
pretty, sequential user-facing IDs. This would have a few advantages:
- it moves some work out of the application, which is always good
- if data is being emitted through an implementation other than
  `console-subscriber`, we will *still* get nicely ordered ids
- this also makes some of the stuff I'm working on in #238 easier

This branch removes ID rewriting from `console-subscriber`, and adds it
to the `console` CLI's `state` module.

Closes #240

Signed-off-by: Eliza Weisman <[email protected]>
@hawkw hawkw force-pushed the eliza/less-buffer branch from 0a5c577 to 235b2c4 on December 28, 2021 20:33
@hawkw hawkw force-pushed the eliza/less-buffer branch 2 times, most recently from af8f9ee to b4549ed on January 7, 2022 17:24
@hawkw hawkw changed the title [WIP] fix(subscriber): only send *new* tasks/resources/etc over the event channel fix(subscriber): only send *new* tasks/resources/etc over the event channel Jan 7, 2022
@hawkw hawkw requested a review from seanmonstar January 7, 2022 20:26
@hawkw hawkw self-assigned this Jan 7, 2022
@hawkw (Member, Author) commented Jan 7, 2022

Okay, I've marked this as "ready for review", as I believe everything should work now and it compiles without warnings. @seanmonstar and @zaharidichev, whenever you two have time to take a look, I'd appreciate a careful review, as this is a kind of large change.

I think it's possible to remove some of the mutexes; the change I discussed in #254 could remove the need for mutexes in timestamps. However, in practice, I think the mutexes should basically never be contended, as the aggregator task won't try to read from them unless the dirty bit is set, and it only gets set when we're done updating the data. The main overhead of the mutexes is probably adding a heap allocation/ptr dereference, which is fixed by enabling parking_lot...
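
The aggregator-side counterpart of that dirty bit could look roughly like
this (again a sketch with hypothetical field names): the flag is checked and
cleared atomically, and the lock is only taken when the flag was set.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Mutex,
};
use std::time::SystemTime;

struct TaskStats {
    dirty: AtomicBool,
    last_poll_ended: Mutex<Option<SystemTime>>,
}

/// Called on each aggregator update tick for every tracked task.
fn take_update(stats: &TaskStats) -> Option<Option<SystemTime>> {
    // swap(false, AcqRel) checks and clears the flag in one step; if it wasn't
    // set, the layer hasn't published anything new, and we never touch the lock.
    if stats.dirty.swap(false, Ordering::AcqRel) {
        Some(*stats.last_poll_ended.lock().unwrap())
    } else {
        None
    }
}
```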

@zaharidichev (Collaborator) left a comment

The approach looks great to me and makes sense. I think there is a problem, however, with the lifecycle of async ops. If you run the mutex example, you will see that the lock always stays in one state, and after you run it for a bit, the async ops associated with the Mutex resource just go to zero. On main, however, new ops continue to appear and the state attribute of the resource changes from locked=true to locked=false.


let span = match ctx.span(id) {
    Some(span) => span,
    // XXX(eliza): no span exists for a resource ID, we should maybe

I think we should just go ahead and print an error. This kind of error can lead to wrong data and quite a few head-scratchers, so we'd better just print. On the other hand, emitting random output from any app that includes the crate is a bit annoying. I wonder whether we can introduce a more elegant way of handling these logic errors. WDYT?

pub(crate) fn drop_task(&self, dropped_at: SystemTime) {
    if self.is_dropped.swap(true, AcqRel) {
        // The task was already dropped.
        // TODO(eliza): this could maybe panic in debug mode...

Would be great to elaborate

@hawkw (Member, Author) commented Jan 10, 2022

If you run the mutex example, you will see that the lock always stays in one state, and after you run it for a bit, the async ops associated with the Mutex resource just go to zero

Thanks, good catch! I must have missed that because I was mostly testing with the semaphore example. I'll try to figure out what's going on.

@zaharidichev (Collaborator)

Thanks, good catch! I must have missed that because I was mostly testing with the semaphore example. I'll try to figure out what's going on.

@hawkw I think that has something to do with the fact that you are creating resources, async ops, and tasks as not dirty. I wonder whether that is correct. The way you have it now, if an instrumented app simply creates a bunch of resources like locks, they will not show up in the console until they are used in some way. Is that what we want? In any case, if you flip the is_dirty flag to true for new async ops, the problem I am describing goes away. I have not yet identified the actual root cause of this, though. I assume that even if an async op is not created as dirty, it needs to be set as dirty at some point when polled...
https://github.com/tokio-rs/console/pull/238/files#diff-096bbe8a08cf82daf24133d20a29936b0288c393537f10858ce731ca1338ba61R497
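
A minimal sketch of the initialization change being suggested here
(illustrative names, not the actual console-subscriber types):

```rust
use std::sync::atomic::AtomicBool;

struct AsyncOpStats {
    // Start out dirty so the new op is included in the *first* update sent to
    // clients, even before it is ever polled or otherwise touched.
    dirty: AtomicBool,
}

impl AsyncOpStats {
    fn new() -> Self {
        Self { dirty: AtomicBool::new(true) }
    }
}
```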

The other problem with the mutex example, where locked always stays true, is a false alarm, I think. It just happens that whenever a mutex guard is dropped, we immediately acquire the lock again and skip issuing the wire update with locked set to false. Does that make sense?

@hawkw (Member, Author) commented Jan 10, 2022

@hawkw I think that has something to do with the fact that you are creating resources, async ops, and tasks as not dirty. I wonder whether that is correct.

Oh, that's definitely not right. I'll fix that.

hawkw added 6 commits January 10, 2022 14:17
hawkw added 4 commits January 10, 2022 14:17
thanks @zaharidichev for noticing that!
@hawkw hawkw force-pushed the eliza/less-buffer branch from 453b878 to 78f3c4c on January 10, 2022 22:18
@zaharidichev (Collaborator) left a comment

Very nice refactor, works great. I have tested it with most of the examples and things are as expected

@hawkw (Member, Author) commented Jan 11, 2022

@zaharidichev

The other problem with the mutex example, where locked always stays true, is a false alarm, I think. It just happens that whenever a mutex guard is dropped, we immediately acquire the lock again and skip issuing the wire update with locked set to false. Does that make sense?

I'm not totally sure about this; the locked attribute changes on master, so I don't think the current behavior on this branch is correct. I'd like to look a little closer at it...

EDIT: after adding some printlns, it does appear that @zaharidichev is correct about this; it looks like it just happens that, because the mutex is re-acquired very quickly in the example, we never actually publish an update where it is in the unlocked state. If I modify the example a bit so that the tasks wait a variable amount of time before locking the mutex, and change the main task to not immediately lock the mutex in a loop, we see it in both the locked and unlocked state. I think this is probably fine.
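
Roughly the kind of modification described above, as a standalone sketch
rather than the repo's actual examples/mutex.rs: each task sleeps a
different amount of time before re-acquiring the lock, so the mutex spends
observable time in the unlocked state.

```rust
use std::{sync::Arc, time::Duration};
use tokio::{sync::Mutex, time::sleep};

#[tokio::main]
async fn main() {
    console_subscriber::init();
    let mutex = Arc::new(Mutex::new(0u64));

    for i in 1..=5u64 {
        let mutex = Arc::clone(&mutex);
        tokio::spawn(async move {
            loop {
                // Stagger lock attempts per task instead of hammering the lock,
                // so updates with locked=false actually get published.
                sleep(Duration::from_millis(100 * i)).await;
                let mut guard = mutex.lock().await;
                *guard += 1;
                sleep(Duration::from_millis(25)).await;
                drop(guard);
            }
        });
    }

    // Keep main alive without immediately re-locking the mutex in a tight loop.
    sleep(Duration::from_secs(120)).await;
}
```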

@zaharidichev (Collaborator)

after adding some printlns, it does appear that @zaharidichev is correct about this; it looks like it just happens that, because the mutex is re-acquired very quickly in the example, we never actually publish an update where it is in the unlocked state. If I modify the example a bit so that the tasks wait a variable amount of time before locking the mutex, and change the main task to not immediately lock the mutex in a loop, we see it in both the locked and unlocked state. I think this is probably fine.

@hawkw Yes, I just sat down to comment that this is exactly what I did as well to ensure that the update gets propagated. In this example it just happens that the lock is heavily contended - there is basically always an async op awaiting it, so the moment it gets released, it is acquired again. So yes, all works as expected.

@hawkw hawkw merged commit fdc77e2 into main Jan 12, 2022
@hawkw hawkw deleted the eliza/less-buffer branch January 12, 2022 18:59
@hawkw hawkw restored the eliza/less-buffer branch January 12, 2022 23:45
hawkw added a commit that referenced this pull request Jan 18, 2022
## 0.1.1 (2022-01-18)

#### Bug Fixes

*  only send *new* tasks/resources/etc over the event channel (#238) ([fdc77e2](fdc77e2))
*  increased default event buffer capacity (#235) ([0cf0aee](0cf0aee))
*  use saturating arithmetic for attribute updates (#234) ([fe82e17](fe82e17))

#### Changes

*  moved ID rewriting from `console-subscriber` to the client (#244) ([095b1ef](095b1ef))

## 0.1.0 (2021-12-16)

- Initial release! 🎉
hawkw pushed a commit that referenced this pull request Jun 1, 2023
Self-wakes were not being detected and displayed in the console. The
`burn` task in the `app` example - which deliberately has many
self-wakes - was not registering any.

It appears that a logic error was present in the self-wake counting in
`console-subscriber` since it was added in #238. When a self wake was
detected, the `wakes` count was incremented a second time (the `wakes`
count is incremented for all wakes before checking for a self wake),
instead of incrementing the `self_wakes` count.

This PR fixes the logic so that when a self wake is detected, the
`self_wakes` count is incremented.
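
For illustration, the corrected accounting looks roughly like this (field
and function names are illustrative, not the exact ones in
`console-subscriber`):

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

struct WakeStats {
    wakes: AtomicU64,
    self_wakes: AtomicU64,
}

impl WakeStats {
    fn record_wake(&self, waking_task: u64, woken_task: u64) {
        // Every wake bumps the total wake count...
        self.wakes.fetch_add(1, Relaxed);
        // ...and a self wake additionally bumps `self_wakes`, rather than
        // bumping `wakes` a second time (the bug being fixed here).
        if waking_task == woken_task {
            self.self_wakes.fetch_add(1, Relaxed);
        }
    }
}
```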
hawkw added a commit that referenced this pull request Jul 3, 2023
When entering and exiting a span, the old code was also updating the parent
spans' stats. This caused excessive polling to be reported for the parent
tasks. See issue #378 for more details. The regression was introduced by the
refactor in #238.
This fixes the issue by limiting updates to the current span.

Closes #378

Co-authored-by: Hayden Stainsby <[email protected]>
Co-authored-by: Eliza Weisman <[email protected]>
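
A sketch of what "limiting updates to the current span" means in layer
terms; the types here (`Stats`, `CurrentSpanOnlyLayer`) are hypothetical
stand-ins for the real console-subscriber internals:

```rust
use std::{
    sync::{Arc, Mutex},
    time::Instant,
};
use tracing_core::{span::Id, Subscriber};
use tracing_subscriber::{layer::Context, registry::LookupSpan, Layer};

#[derive(Default)]
struct Stats {
    last_poll_started: Mutex<Option<Instant>>,
}

struct CurrentSpanOnlyLayer;

impl<S> Layer<S> for CurrentSpanOnlyLayer
where
    S: Subscriber + for<'a> LookupSpan<'a>,
{
    fn on_enter(&self, id: &Id, ctx: Context<'_, S>) {
        // Only the span that was actually entered gets its stats updated.
        // Walking up to parent spans and updating them as well is what caused
        // the inflated poll stats reported in #378.
        if let Some(span) = ctx.span(id) {
            if let Some(stats) = span.extensions().get::<Arc<Stats>>().cloned() {
                *stats.last_poll_started.lock().unwrap() = Some(Instant::now());
            }
        }
    }
}
```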