Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

task: add Tracing instrumentation to spawned tasks #2655

Merged
merged 14 commits into from
Jul 13, 2020
Merged

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Jul 11, 2020

Motivation

When debugging asynchronous systems, it can be very valuable to inspect
what tasks are currently active (see #2510). The tracing crate and
related libraries provide an interface for Rust libraries and
applications to emit and consume structured, contextual, and async-aware
diagnostic information. Because this diagnostic information is
structured and machine-readable, it is a better fit for the
task-tracking use case than textual logging — tracing spans can be
consumed to generate metrics ranging from a simple counter of active
tasks to histograms of poll durations, idle durations, and total task
lifetimes. This information is potentially valuable to both Tokio users
and to maintainers.

Additionally, tracing is maintained by the Tokio project and is
becoming widely adopted by other libraries in the "Tokio stack", such as
hyper, h2, and tonic and in other parts of the broader Rust
ecosystem. Therefore, it is suitable for use in Tokio itself.

Solution

This PR is an MVP for instrumenting Tokio with tracing spans. When the
"tracing" feature flag and the "tokio_unstable" cfg are both enabled,
every spawned future will be instrumented with a tracing span. If both
the cfg flag and the "tracing" feature flag are not set, tokio will
not depend on tracing, and no spans will be generated.

The generated spans are at the TRACE verbosity level, and have the
target "tokio::task", which may be used by consumers to filter whether
they should be recorded. They include fields for the type name of the
spawned future and for what kind of task the span corresponds to (a
standard spawned task, a local task spawned by spawn_local, or a
blocking task spawned by spawn_blocking). Because tracing has
separate concepts of "opening/closing" and "entering/exiting" a span, we
enter these spans every time the spawned task is polled. This allows
collecting data such as:

  • the total lifetime of the task from spawn to drop
  • the number of times the task was polled before it completed
  • the duration of each individual time that the span was polled (and
    therefore, aggregated metrics like histograms or averages of poll
    durations)
  • the total time a span was actively being polled, and the total time
    it was alive but not being polled
  • the time between when the task was spawned and the first poll

As an example, here is the output of a version of the chat example
instrumented with tracing:
image
And, with multiple connections actually sending messages:
trace_example_1

I haven't added any tracing spans in the example, only converted the
existing println!s to tracing::info and tracing::error for
consistency. The span durations in the above output are generated by
tracing-subscriber. Of course, Tokio-specific subscriber could
generate even more detailed statistics, but that's follow-up work once
basic tracing support has been added.

Note that the Instrumented type from tracing-futures, which attaches
a tracing span to a future, was reimplemented inside of Tokio to avoid
a dependency on that crate. tracing-futures has a feature flag that
enables an optional dependency on Tokio, and I believe that if another
crate in a dependency graph enables that feature while Tokio's tracing
support is also enabled, it would create a circular dependency that
Cargo wouldn't be able to handle. Also, it avoids a dependency for a
very small amount of code that is unlikely to ever change.

There is, of course, room for plenty of future work here. This might
include:

  • instrumenting other parts of tokio, such as I/O resources and
    channels (possibly via waker instrumentation)
  • instrumenting the threadpool so that the state of worker threads
    can be inspected
  • writing tracing-subscriber Layers to collect and display
    Tokio-specific data from these traces
  • using track_caller (when it's stable) to record where a task
    was spawned from
  • moving the feature out of "tokio_unstable"

However, this is intended as an MVP to get us started on that path.

Signed-off-by: Eliza Weisman [email protected]

The context shows the currently active task, so we _really_ only need
`new` and `close` events.

Signed-off-by: Eliza Weisman <[email protected]>
Signed-off-by: Eliza Weisman <[email protected]>
there is no option for open/close only, currently

Signed-off-by: Eliza Weisman <[email protected]>
@hawkw hawkw requested review from carllerche, davidbarsky and a team July 11, 2020 19:17
macro_rules! cfg_trace {
($($item:item)*) => {
$(
#[cfg(all(feature = "tracing", tokio_unstable))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked about this, but I don't think tracing merits being placed behind the tokio_unstable unstable flag. @carllerche will need to weigh in on this, but unlike the CancelationToken, changes to the tracing output aren't what the Tokio project has historically considered to be a "breaking change". An off-by-default (e.g., excluded from full) feature flag seems like a reasonable default.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at one point, @carllerche suggested that any dependency on a pre-1.0 crate should require the tokio_unstable flag...but, in this case, we aren't exposing anything from tracing in Tokio's public APIs. AFAICT, it shouldn't be possible for the tracing dependency to cause any compile time breakage --- the worst failure mode, if there is eventually a breaking change to the tracing-core crate that we can't work around, is that the tracing diagnostics might silently fail to be collected.

IMO, you're right, and an off-by-default optional dep should be fine. That would be pretty similar to what we do for parking_lot, which is also pre-1.0 (and churns versions somewhat frequently!), since it's also not exposed publicly. Of course, if we were to expose tracing types (such as Span or Dispatch) in a public API later on, those APIs should require the unstable flag.

With all of that said: I made the more conservative choice of opening this PR with the tokio_unstable flag required, because I want to bias towards getting this merged sooner rather than later, and continuing to iterate once it merges. We can always remove the flag in a follow-up, but it's harder to add if these changes ship in a published release...

tokio/src/util/trace.rs Outdated Show resolved Hide resolved
@jonhoo
Copy link
Contributor

jonhoo commented Jul 12, 2020

This is really cool, I'm excited!

I'll add to the future that from experience it'd also be really handy to have events for when a resource is notified.

examples/chat.rs Outdated Show resolved Hide resolved
@hawkw
Copy link
Member Author

hawkw commented Jul 12, 2020

I'll add to the future that from experience it'd also be really handy to have events for when a resource is notified.

@jonhoo Yup, the next major thing to do in Tokio itself is probably going to be adding instrumentation to Wakers, IMO. That should let us track when a task expresses interest in a resource, and when a resource notifies a task.

Copy link
Member

@carllerche carllerche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. We can remove the unstable flag.

tokio/Cargo.toml Outdated Show resolved Hide resolved
hawkw added a commit that referenced this pull request Jul 20, 2020
- docs: misc improvements (#2572, #2658, #2663, #2656, #2647, #2630, #2487, #2621,
  #2624, #2600, #2623, #2622, #2577, #2569, #2589, #2575, #2540, #2564, #2567,
  #2520, #2521, #2572, #2493)
- rt: allow calls to `block_on` inside calls to `block_in_place` that are
  themselves inside `block_on` (#2645)
- net: fix non-portable behavior when dropping `TcpStream` `OwnedWriteHalf` (#2597)
- io: improve stack usage by allocating large buffers on directly on the heap
  (#2634)
- io: fix unsound pin projection in `AsyncReadExt::read_buf` and
  `AsyncWriteExt::write_buf` (#2612)
- io: fix unnecessary zeroing for `AsyncRead` implementors (#2525)
- io: Fix `BufReader` not correctly forwarding `poll_write_buf` (#2654)

- coop: returning `Poll::Pending` no longer decrements the task budget (#2549)

- io: little-endian variants of `AsyncReadExt` and `AsyncWriteExt` methods
  (#1915)
- io: fix panic in `AsyncReadExt::read_line` (#2541)
- task: add [`tracing`] instrumentation to spawned tasks (#2655)
- sync: allow unsized types in `Mutex` and `RwLock` (via `default` constructors)
  (#2615)
- net: add `ToSocketAddrs` implementation for `&[SocketAddr]` (#2604)
- fs: add `OpenOptionsExt` for `OpenOptions` (#2515)
- fs: add `DirBuilder` (#2524)

[`tracing`]: https://crates.io/crates/tracing

Signed-off-by: Eliza Weisman <[email protected]>
hawkw added a commit that referenced this pull request Jul 22, 2020
# 0.2.22 (July 2!, 2020)

### Fixes
- docs: misc improvements (#2572, #2658, #2663, #2656, #2647, #2630, #2487, #2621,
  #2624, #2600, #2623, #2622, #2577, #2569, #2589, #2575, #2540, #2564, #2567,
  #2520, #2521, #2493)
- rt: allow calls to `block_on` inside calls to `block_in_place` that are
  themselves inside `block_on` (#2645)
- net: fix non-portable behavior when dropping `TcpStream` `OwnedWriteHalf` (#2597)
- io: improve stack usage by allocating large buffers on directly on the heap
  (#2634)
- io: fix unsound pin projection in `AsyncReadExt::read_buf` and
  `AsyncWriteExt::write_buf` (#2612)
- io: fix unnecessary zeroing for `AsyncRead` implementors (#2525)
- io: Fix `BufReader` not correctly forwarding `poll_write_buf` (#2654)
- io: fix panic in `AsyncReadExt::read_line` (#2541)

### Changes
- coop: returning `Poll::Pending` no longer decrements the task budget (#2549)

### Added
- io: little-endian variants of `AsyncReadExt` and `AsyncWriteExt` methods
  (#1915)
- task: add [`tracing`] instrumentation to spawned tasks (#2655)
- sync: allow unsized types in `Mutex` and `RwLock` (via `default` constructors)
  (#2615)
- net: add `ToSocketAddrs` implementation for `&[SocketAddr]` (#2604)
- fs: add `OpenOptionsExt` for `OpenOptions` (#2515)
- fs: add `DirBuilder` (#2524)

[`tracing`]: https://crates.io/crates/tracing

Signed-off-by: Eliza Weisman <[email protected]>
@Darksonn Darksonn added A-tokio Area: The main tokio crate C-enhancement Category: A PR with an enhancement or bugfix. M-tracing Tracing support in Tokio labels Jul 25, 2020
hawkw added a commit to linkerd/linkerd2-proxy that referenced this pull request Jul 31, 2020
## Motivation

When debugging proxy issues, it can be useful to inspect the list of
currently spawned Tokio tasks and their states. This can be used
similarly to the thread or coroutine dumps provided by other languages'
runtimes.

## Solution

This branch adds a new endpoint to the proxy's admin server, `/tasks`,
that returns a dump of all tasks currently spawned on the Tokio runtime,
using the new Tracing instrumentation added in tokio-rs/tokio#2655, and
a work-in-progress [`tokio-trace`] crate that provides Tokio-specific
Tracing layers. 

Currently, the `/tasks` admin endpoint records the following information
about each task:

* Whether it is a normal, local, or blocking task (not relevant to us 
  currently, since Linkerd does not use local or blocking tasks...but
  we might eventually!)
* Whether the task is active (currently being polled) or idle (waiting
  to be polled)
* The type of the future that was spawned
* The Tracing span context from which the task was spawned
* The total number of times the task has been polled
* Timing statistics about the task, including:
  - The time in nanoseconds between when the task was spawned and when
    it was first polled (essentially, measuring the Tokio scheduler's
    latency)
  - The total time in nanoseconds the task has existed
  - The task's _busy time_ in nanoseconds (time it was actively being
    polled)
  - The tasks _idle time_ in nanoseconds (time it was _not_ being 
    polled)

In the future, Tokio will likely expose additional Tracing information,
which we'll be able to collect as well.

The task dump can be accessed either as an HTML table or as JSON. JSON
is returned if the request has an `Accept: application/json` header, or
whenever the path `/tasks.json` is requested; otherwise, the data is
rendered as an HTML table. Like the `/proxy-log-level` endpoint, access
is denied to requests coming from sources other than localhost, to help
restrict access to authorized users (since a high volume of requests for
task dumps could be used to starve the proxy).

Example JSON output (in Firefox Dev Edition's extremely nice GUI 
JSON viewer):

![Screenshot_20200715_121938](https://user-images.githubusercontent.com/2796466/87598059-b9f68380-c6a7-11ea-8f21-842b57793baa.png)

Zoomed in on the timing data for a single task:
![Screenshot_20200715_122047](https://user-images.githubusercontent.com/2796466/87598089-c4b11880-c6a7-11ea-93ac-895f7ecee0f0.png)

And HTML:

![Screenshot_20200715_143155](https://user-images.githubusercontent.com/2796466/87598414-fe821f00-c6a7-11ea-93b8-d18e4837346c.png)


Because the task data is generated from Tracing spans emitted by Tokio,
the task spans must be enabled for it to be used. This can be done by
setting a trace filter that enables the `trace` level for the target
`tokio::task`, e.g.:
```
tokio::task=trace
```
or 
```
tokio=trace
```

## Notes

* This branch depends on unreleased code from upstream, including a
  Tokio change that has merged to master but not been published, and my
  unreleased work-in-progress [`tokio-trace`] crate. Therefore, I've
  pinned these upstreams to fixed Git SHAs, to guard against
  dependencies changing under us unexpectedly.
* I considered requiring a build-time feature flag to enable this
  feature, the way we do for the mock SO_ORIG_DST implementation for
  testing. However, this would make it harder to use task tracking to
  debug issues in proxies not built with the flag. I'm happy to change
  this code to be feature flagged if we think that's the right approach.

[`tokio-trace`]: https://github.com/hawkw/tokio-trace

Closes linkerd/linkerd2#3803

Signed-off-by: Eliza Weisman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate C-enhancement Category: A PR with an enhancement or bugfix. M-tracing Tracing support in Tokio
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants