warnings: False positives with LostWaker warning #149

Closed
hawkw opened this issue Sep 16, 2021 · 20 comments · Fixed by #151
Labels
A-warnings Area: warnings C-console Crate: console. S-bug Severity: bug

Comments

@hawkw
Member

hawkw commented Sep 16, 2021

Currently, the LostWaker warning seems to often display false positives. These seem to occur when a task has a single waker clone, is woken by value (dropping the waker), and yields after having been woken. This seems to occur the most frequently when a task self-wakes (and that may be the only instance where this occurs?).
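For illustration, a minimal self-waking future that produces this sequence might look something like the following (a sketch for discussion, not taken from a real reproduction):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// Yields once, waking itself by value before returning `Pending`.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // One waker clone, immediately consumed by `wake()` (dropping it),
            // and then the task yields: the waker count briefly hits zero even
            // though the task is already scheduled to be polled again.
            cx.waker().clone().wake();
            Poll::Pending
        }
    }
}
```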

It would be nice to fix this --- the warning should only be shown when a task definitely won't be woken again. However, I'm not immediately sure how we would determine this.

One potential angle: the false positives seem to be pretty transient. They flicker in and out of existence rapidly. We could potentially only display the lint if the task has remained in the 0-waker state for a long time.
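A rough sketch of that debounce approach, assuming we track when a task's waker count last dropped to zero (all names here are made up for illustration):

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-task state tracked for the warning.
struct WakerState {
    waker_count: usize,
    /// Set when the waker count drops to zero, cleared when it rises again.
    zero_wakers_since: Option<Instant>,
}

/// Only report a lost waker once the task has sat in the zero-waker state
/// longer than a grace period, so transient self-wake blips don't flicker
/// the warning in and out.
fn should_warn(state: &WakerState, grace: Duration) -> bool {
    match state.zero_wakers_since {
        Some(since) if state.waker_count == 0 => since.elapsed() > grace,
        _ => false,
    }
}
```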

hawkw added the S-bug, C-console, and A-warnings labels on Sep 16, 2021
@seanmonstar
Member

Maybe we can check if `last_wake` is after `last_poll_started`, and if so we can assume it will be polled again?
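A minimal sketch of that check, assuming the task stats expose those two timestamps (illustrative field names, not necessarily the real ones):

```rust
use std::time::Instant;

/// Hypothetical view of the stats the console keeps per task.
struct TaskStats {
    waker_clones: u64,
    waker_drops: u64,
    last_wake: Option<Instant>,
    last_poll_started: Option<Instant>,
}

/// Suppress the warning when the task was woken after its most recent poll
/// started: in that case the runtime will poll it again, so a waker count of
/// zero doesn't mean the task is stuck.
fn looks_like_lost_waker(stats: &TaskStats) -> bool {
    let no_live_wakers = stats.waker_clones == stats.waker_drops;
    let woken_since_poll_started = match (stats.last_wake, stats.last_poll_started) {
        (Some(wake), Some(poll)) => wake > poll,
        _ => false,
    };
    no_live_wakers && !woken_since_poll_started
}
```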

@Milo123459
Contributor

Sadly, this doesn't fix the issue I was running into; I still get a lost waker warning.

@hawkw
Member Author

hawkw commented Sep 19, 2021

Hmm, that's surprising!

hawkw reopened this on Sep 19, 2021
@hawkw
Member Author

hawkw commented Sep 21, 2021

On the latest main branch I see a ton of (apparent) false positives, just running the demo app like:

```sh
$ cargo run --example app &
$ cargo run
```

[screenshot]

I'm honestly quite surprised this lint doesn't just work with the naive implementation, not totally sure what's going on here!

@Milo123459
Contributor

Still not fixed... :(

@seanmonstar
Member

What's the behavior of it not working? I don't see any false positives in the example app, whereas I used to sometimes see it flicker.

@Milo123459
Contributor

The project it is happening on is closed source, but what's happening is that it "loses its waker" and then the task somehow ends. If it lost the waker, how can the task end?

@blt

blt commented Sep 21, 2021

For what it's worth, we see this in the vector project; our console is currently pinned to sha f2c30d52c9f22de69bac38009a9183135808806c. We'd be happy to provide any debug information, run experiments, etc.

@seanmonstar
Member

@blt what if you update that pin to current main? There have been 2 commits since then to address this issue.

blt added a commit to vectordotdev/vector that referenced this issue Sep 21, 2021
@blt

blt commented Sep 21, 2021

Still happening. I've updated vector to use the most up-to-date SHA and made sure my checkout of console is at the same SHA; draft PR here: vectordotdev/vector#9293. Screenshot:

[Screenshot from 2021-09-21 12-44-43]

Unfortunately I don't have a small reproduction case. My compile and run command for vector is:

```sh
> RUSTFLAGS="--cfg tokio_unstable" CARGO_PROFILE_RELEASE_DEBUG=2 cargo build --release --no-default-features --features "tokio-console,sources-datadog,sources-syslog,sources-internal_metrics,transforms-remap,sinks-prometheus,sinks-datadog" && RUST_BACKTRACE=1 firejail --noprofile --rlimit-as=2g --cpu=1,2,3,4,5,6,7,8 ./target/release/vector -qq --config ~/projects/us/troutwine/lading/configs/vector-ddlogs-remap.toml
```

where the reference config vector-ddlogs-remap.toml is:

```toml
data_dir = "/tmp/data/vector"

##
## Sources
##

[sources.internal_metrics]
type = "internal_metrics"

[sources.datadog_agent]
type = "datadog_agent"
acknowledgements = false
address = "0.0.0.0:8282"

##
## Transforms
##

[transforms.parse_message]
type = "remap"
inputs = ["datadog_agent"]
source = '''
pyld, err = parse_json(.message)
if err == null {
  .message = pyld.mineral
}
'''

##
## Sinks
##

[sinks.prometheus]
type = "prometheus_exporter"
inputs = ["internal_metrics"]
address = "0.0.0.0:9598"

[sinks.dd_logs_egress]
type = "datadog_logs"
inputs = ["parse_message"]
endpoint = "http://localhost:8080/v1/input"
default_api_key = "DEADBEEF"
```

I don't see waker loss when vector sits idle, only when I'm generating load. This is done with my lading project. I run vector against two programs from lading, http_gen and http_blackhole, run like so:

```sh
> cargo build --release && firejail --rlimit-as=8g --cpu=11,12 --noprofile ./target/release/http_gen --config-path configs/http_gen-dd.toml
> firejail --rlimit-as=2g --noprofile ./target/release/http_blackhole --worker-threads 16 --concurrent-requests-max 512 --binding-addr "0.0.0.0:8080" --prometheus-addr "0.0.0.0:9002"
```

The configuration http_gen-dd.toml is:

```toml
worker_threads = 2
prometheus_addr = "0.0.0.0:9001"

[targets.simple]
target_uri = "http://localhost:8282/v1/input"
bytes_per_second = "500 Mb"
parallel_connections = 10
method.type = "Post"
method.variant = "DatadogLog"
method.maximum_prebuild_cache_size_bytes = "1 Gb"

[targets.simple.headers]
dd-api-key = "DEADBEEF"
```

When load is generated by http_gen, the console reports steady growth in waker loss. It's entirely possible it's legitimate, though in some of the areas I've spot-checked I'm not sure how you'd lose your waker there.

seanmonstar reopened this on Sep 21, 2021
@seanmonstar
Member

This screenshot, and being able to check the source on GitHub, has been really useful. Here are some things I'm noticing:

- Several of the tasks with a warning don't have a single poll, so the console should be updated to not care before the first poll.
  - I noticed that one task didn't have a poll for over 150s, and that neighboring tasks had been busy for just as long. I think an upcoming warning/lint about blocking will notice that one :)
- One of the warned tasks is a `block_on`. We may need to check how that affects the lint.
- Another task (from topology `take_healthchecks`) has seen a few polls, and then been idle for a couple of minutes. It could be that the console has missed something, but I see there is a manual `impl Future for Task` in vector, so could that be losing the waker? (See the sketch below.)
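For reference, the classic way a hand-written future loses track of a waker is by stashing it on the first poll and never refreshing it. A deliberately buggy sketch (not vector's actual code):

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

struct Shared {
    done: bool,
    waker: Option<Waker>,
}

struct ManualTask {
    shared: Arc<Mutex<Shared>>,
}

impl Future for ManualTask {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut shared = self.shared.lock().unwrap();
        if shared.done {
            return Poll::Ready(());
        }
        // BUG: the waker is only captured on the first poll and never updated.
        // If a later poll passes a different waker, wakes go to the stale one
        // and the task may never be polled again. The fix is to store
        // `cx.waker().clone()` on every poll (or check `Waker::will_wake`).
        if shared.waker.is_none() {
            shared.waker = Some(cx.waker().clone());
        }
        Poll::Pending
    }
}
```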

@seanmonstar
Member

With the latest commit, the first class should be fixed. I haven't had time to investigate `block_on`. I do wonder if the 3rd class are real warnings :)

@blt

blt commented Sep 21, 2021

Tell you what, it's entirely possible console has turned up a real bug on that third point. I'll update and report back, hopefully later today but at most by tomorrow morning.

@blt

blt commented Sep 22, 2021

This is substantially improved! I've updated our referenced PR, results as follows:

[Screenshot from 2021-09-21 16-59-55]

I do believe the topology waker loss is a bug on our side. There are two waker loss dings from datadog/logs/sink.rs that I'm unable to explain. The first location:

https://github.com/vectordotdev/vector/blob/master/src/sinks/datadog/logs/sink.rs#L289

The second location:

https://github.com/vectordotdev/vector/blob/e4bc841cd8e93428d85eee7e4bb95c5e7a0790f3/src/sinks/datadog/logs/sink.rs#L380

@seanmonstar
Member

Hm, looking at the image, nothing else sticks out at me. Do those tasks eventually get polled again? Or are they stalled? (Does the number of polls go up even while the warning is present?)

@blt

blt commented Sep 22, 2021

The tasks from sink.rs are all short-lived and never get re-polled once they move into the wakeless state. There are more now than when I initially took my screen capture, 5 as compared to 2, but none have been polled more than twice. I did manage to catch a task that entered the warning state, was polled somewhat rapidly and then exited the warning state but I failed to notice what its code location was. I seem to recall there's not a capture mode for the console yet?

hawkw added a commit that referenced this issue Jan 12, 2022
…hannel (#238)

## Motivation

Currently, there are some rather bad issues that occur when the event
buffer is at capacity and events are dropped.

Completely *losing* data due to buffer capacity is relatively okay: if
we set a bound on how much memory the console can use, and we don't
record new things that occur when we've reached that limit, this is
correct and acceptable behavior. However, the current design can result
in *incorrect* data when events are lost due to the buffer being at
capacity.

This is because we currently record things like starting to poll a
task/resource, ending a poll, dropping a task/resource/async op, and
waker operations, as individual events in the buffer. This means that we
can have a situation where the creation of a task was recorded, because
there was buffer capacity at the time, but then when the task ended, the
buffer was full, so we never recorded its termination. That results in
the tasks appearing to run forever and never terminate --- see issue
#230. Similarly, if we record the beginning of a poll, but drop the
end-of-poll event because we're at capacity, this means we will
(incorrectly) record a poll that goes on forever, which is obviously
incorrect. I think this may also be the cause of the false positives
with the lost waker lint (#149), if we record a waker drop but missed a
waker clone that previously occurred.

The change in #212 fixed one category of issue that occurs due to event
buffer capacity --- when a task, resource, or async op's _creation_
event is dropped due to buffer capacity, we skip any subsequent events
related to that task/resource/op. However, this doesn't fix issues where
the subsequent events are the ones that are dropped.

## Solution

This branch proposes a solution to this whole category of event buffer
capacity related issues. Unfortunately, this requires rewriting a *lot*
of `console-subscriber`'s internals.

In the new approach, we now _only_ send events over the channel when
creating a new task, resource, or async op. Those events now contain an
`Arc` holding the stats for that entity. Another clone of the `Arc` is
stored in the `tracing_subscriber::Registry`'s [span extensions] for the
span corresponding to that entity. When the `ConsoleLayer` records
subsequent events for a particular entity, such as starting/ending a
poll, it looks up the span by ID, and updates the stats type stored in
its extensions. The aggregator stores its clone of the `Arc` in a map of
entities, just like it does currently, but no longer handles actually
updating the stats; just building wire format updates from any tracked
entities whose data was updated by the layer.

This should fix all issues where dropping something due to event buffer
capacity results in incorrect data. Once we have successfully recorded
the *creation* of a task, resource, or async op, any subsequent updates
to its stats are *guaranteed* to be reliable. If the channel is at
capacity and we fail to record a new resource/task/op, we never create a
stats extension for it, and we won't record anything for it at all.
Otherwise, it will always have correct data recorded.

When possible, the stats in the `Arc`ed stats are updated atomically. In
some cases, this isn't easily possible, and some fields of the stats
types are stored in a mutex. In particular, this is required for storing
timestamps. I don't really love that, but these mutices should be
contended very infrequently. Stats aren't marked as having unsent updates
until after the stats inside the mutices have been updated, so the
aggregator will not try to lock the mutex if the layer is currently
updating it; instead, it will simply be included in the next update once
the layer is no longer touching it. Mutices here will only be contended
when multiple threads are updating a task's stats at the same time,
which should occur very rarely...and in most cases, they *still* won't
have to contend a mutex, since access to most of the mutices is guarded
by an atomic variable for e.g. determining which thread actually was the
last to complete a concurrent poll. The biggest performance downside of
the mutices is probably not actually contention, but the additional heap
allocation required when using `std::sync::Mutex`. However, since we
have conditional `parking_lot` support, parking_lot can be used to avoid
requiring additional allocations.
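As a rough illustration of the shape described above (the names here are invented, not the actual `console-subscriber` types), the layer and aggregator share an `Arc` of a stats struct whose counters are atomics, with timestamps behind a mutex and a dirty flag set only after the mutex-guarded fields are consistent:

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Mutex;
use std::time::Instant;

/// Sketch of per-task stats shared between the layer and the aggregator via
/// an `Arc<TaskStats>` stored in the span's extensions.
struct TaskStats {
    polls: AtomicU64,
    wakes: AtomicU64,
    waker_clones: AtomicU64,
    waker_drops: AtomicU64,
    /// Timestamps don't fit in a single atomic, so they live behind a mutex.
    timestamps: Mutex<Timestamps>,
    /// Set *after* the mutex-guarded fields are updated, so the aggregator
    /// only includes this task in the next wire update once it's consistent.
    dirty: AtomicBool,
}

struct Timestamps {
    last_poll_started: Option<Instant>,
    last_poll_ended: Option<Instant>,
    last_wake: Option<Instant>,
}

impl TaskStats {
    fn start_poll(&self, at: Instant) {
        self.polls.fetch_add(1, Ordering::Release);
        self.timestamps.lock().unwrap().last_poll_started = Some(at);
        self.dirty.store(true, Ordering::Release);
    }
}
```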

In the future, it's probably possible to make more of this atomic by
converting timestamps into integers and storing them in atomic
variables. I haven't done this yet because both the protobuf timestamps
and `std::time` timestamps are larger than a single 64-bit number and it
might take a little extra work to ensure we can nicely fit them in an
`AtomicUsize`...but we can probably do that later.

[span extensions]: https://docs.rs/tracing-subscriber/latest/tracing_subscriber/registry/struct.SpanRef.html#method.extensions

Signed-off-by: Eliza Weisman <[email protected]>
@hawkw
Member Author

hawkw commented Jan 12, 2022

I believe #238 fixes almost all of the false positives we've seen with this lint. @Milo123459 and @blt, are either of you able to try the latest git revision of console-subscriber (fdc77e2) and confirm that this is the case? You'll also need to build the git version of the console CLI.

@Milo123459
Contributor

I've sadly temporarily stopped using tokio-console due to #184, so I can't really share much.

@blt

blt commented Jan 12, 2022

@hawkw I can give this a shot, yes.

@blt

blt commented Jan 12, 2022

I haven't been able to reproduce the same lost waker problem with vector, using the setup detailed here. PR on the vector side to include the changes: vectordotdev/vector#10825. I am using tokio-console fdc77e2. The proposed fix looks good!
