-
Notifications
You must be signed in to change notification settings - Fork 475
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use tracing crate to have more context in log messages. #696
Conversation
@SomeoneToIgnore, you mentioned in the all hands today that you had started hacking on this. As it happens, so did I a couple of weeks ago but never got around to finish it into a PR. Here's what I ended up with, does this look similar to what you had in mind? |
We have some options on the output log format. This currently prints something very similar to what we have in 'main today:
The lines are very wide, which is not very nice on the terminal. But it's important to have the context there, and when you print the tenant and timeline id it just takes a lot of space, there's no way around that. Alternatively, we could switch to a more structured JSON format, the 'tracing' crate comes with a built-in formatter for that. I'm not sure what the state of the art with that is, but presumably there are some nice tools out there to parse and display JSON logs. Anyone have experience with that? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's my branch: https://github.com/zenithdb/zenith/compare/kb/attempt-to-tracing?expand=1
It's pretty small and concentrated on the "right" way to do things in logging.rs
lib file, so pick whatever you find useful and let me know when I can remove the branch.
tracing is somewhat odd when in comes to configuring, because there's so many moving parts that could be put together many different ways.
In my case, I've used a more formalized way to allow writing without blocks (supposed to be good, but needs benches and comparison) + rotate the files automatically (cool, not sure if super needed).
To sum this up:
- I like how simple
with_writer
setup is in ypur PR, I wonder, if we could try the same with the non-blocking option and check if it works and works better - we need to treat async code a bit more gently or do not place spans (logging is fine) inside that code
- a few extra deps in Cargo.toml are worth throwing off
zenith_utils/src/logging.rs
Outdated
// FIXME: It's actually nice to have them, so leave them on for now. Use | ||
// 'less -R' option to display them. | ||
// | ||
// .with_ansi(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would still use the non-ansi version by default due to many other tools literally interpreting our logs, examples:
- k8s by default
- CI logs: https://app.circleci.com/pipelines/github/zenithdb/zenith/2311/workflows/79be9d93-dc36-424d-89b7-ed09fe7e0dd2/jobs/16582/artifacts
- local test logs too, presumably?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, changed it to non-ansi.
However, that ran into a bug in 'tracing'. I'm currently actually using the 'pretty' format, so the log looks like this:
Oct 07 13:18:22.704 INFO GET /v1/branch/40d00a53756334d1a471043acce25557/test_config 200 OK
at zenith_utils/src/http/endpoint.rs:26
Oct 07 13:18:22.732 INFO taking basebackup lsn=0/0, prev_lsn=0/169C350
at pageserver/src/basebackup.rs:84
in basebackup with ESC[1mtimelineESC[0m: 1d97ed9047930b55c1c4535db09ec4f5, ESC[1mtenantESC[0m: 40d00a53756334d1a471043acce25557
Oct 07 13:18:22.733 INFO list_nonrels called at 0/169C350
at pageserver/src/layered_repository.rs:725
in basebackup with ESC[1mtimelineESC[0m: 1d97ed9047930b55c1c4535db09ec4f5, ESC[1mtenantESC[0m: 40d00a53756334d1a471043acce25557, ESC[1mlsnESC[0m: "0/169C350"
So even with with_ansi(false)
, there are still some ansi control chars in the context lines. There's a github issue about that at tokio-rs/tracing#1310. I propose that we go ahead and push this even though we still get some of those control chars, and wait/work on the upstream issue to be fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea to proceed anyway.
What if we disable the pretty logging too, at least for now?
What does that pretty
mode do, by the way? I've not used it before 😄
Docs say that it should be "pretty": https://docs.rs/tracing-subscriber/0.2.25/tracing_subscriber/fmt/struct.Layer.html#method.pretty
Sets the layer being built to use an excessively pretty, human-readable formatter.
which is not much, grepping along the repo shows this:
https://github.com/tokio-rs/tracing/blob/6b7051c9897a6ea2afd61cfa139714cfeed8bf4f/tracing-subscriber/src/fmt/mod.rs#L120
//! * [`format::Pretty`]: Emits excessively pretty, multi-line logs, optimized
//! for human readability. This is primarily intended to be used in local
//! development and debugging, or for command-line applications, where
//! automated analysis and compact storage of logs is less of a priority than
//! readability and visual appeal.
If that's really the case, I don't like it:
- it's not pretty right now
- multiline logging is often tricky to handle in production cloud environment, when tons of threads concurrently writing something: unless every that line is prepended with some id/guid/uuid/whatnot , it's impossible to distinguish one line from another. We can dare and explore tracing behavior, or we can prefer single lines everywhere.
zenith_utils/Cargo.toml
Outdated
slog-term = "2.8.0" | ||
slog = "2.7.0" | ||
tracing = "0.1.27" | ||
tracing-log = "0.1.2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the first bullet in
https://docs.rs/tracing-subscriber/0.2.24/tracing_subscriber/#optional-dependencies
tracing-subscriber with default features is enough to init tracing backend with log
redirected inside it, see my branch for details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried removing LogTracer::init() call from logging.rs, but a lot of log lines went missing. Clearly it is needed, it is not initialized automatically. I didn't see anything relevant in your branch, can you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I've missed that part too, because had called .finish();
on my subscriber builder...
Like I've mentioned, the configuration is somewhat convoluted here.
Here's the difference in the in their sources: https://github.com/tokio-rs/tracing/blob/6b7051c9897a6ea2afd61cfa139714cfeed8bf4f/tracing-subscriber/src/fmt/mod.rs#L556-L579
the main bit is
/// If the `tracing-log` feature is enabled, this will also install
/// the LogTracer to convert `Log` records into `tracing` `Event`s.
that feature is enabled with the default features, so you should have it, then you should either invoke the right method on the formatter, or do this manually:
https://github.com/tokio-rs/tracing/blob/master/tracing-subscriber/src/util.rs#L57-L67
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dang, nice! Nothing really to say that @SomeoneToIgnore hasn't already mentioned, just weighing in with my two cents on holding .entered()
guards at the top of functions.
IMO it looks a little weird. Proc macros have their downsides, but I can't think of any other clean ways to do it -- it's maybe worth putting some thought into it.
d06dea7
to
15f1bcc
Compare
2bfe4f6
to
78e4dbb
Compare
Whenever we start processing a request, we now enter a tracing "span" that includes context information like the tenant and timeline ID, and the operation we're performing. That context information gets attached to every log message we create within the span. That way, we don't need to include basic context information like that in every log message, and it also becomes easier to filter the logs programmatically. This removes the eplicit timeline and tenant IDs from most log messages, as you get that information from the enclosing span now. Also improve log messages in general, dialing down the level of some messages that are not very useful, and adding information to others. We now obey the RUST_LOG env variable, if it's set. The 'tracing' crate allows for different log formatters, like JSON or bunyan output. The one we use now is human-readable multi-line format, which is nice when reading the log directly, but hard for post-processing. For production, we'll probably want JSON output and some tools for working with it, but that's left as a TODO. The log format is easy to change.
The main open question here is the log format. I'm currently using the pretty format; I like it because it's very human-friendly. However, as @SomeoneToIgnore mentioned, as a multi-line format it's not suitable for post-processing. When this is deployed, we should probably use a JSON format to make automated filtering and post-processing easy. And if the tools to turn the JSON output into more human-readable format are readily available to all developers too, we can just always use JSON. Or maybe there's some other industry standard format we should be using? @cicdteam, thoughts on that? How are we going to collect and process logs in production? @SomeoneToIgnore's branch also contained changes to do log rotation. That's not included here, but we'll need it sooner or later. Not sure if the page server itself should be doing rotation, or we should use the system 'logrotate' tool, or what. The log format is easy to change when we know what we want. So I'm going to push this with the human-friendly multi-line format for now, and leave figuring out the format as a TODO. |
Whenever we start processing a request, we now enter a tracing "span"
that includes context information like the tenant and timeline ID,
and the operation we're performing. That context information gets
attached to every log message we create within the span. That way, we
don't need to include basic context information like that in every log
message, and it also becomes easier to filter the logs programmatically.
This removes the eplicit timeline and tenant IDs from most log messages,
as you get that information from the enclosing span now.
Also improve log messages in general, dialing down the level of some
messages that are not very useful, and adding information to others.
We now obey the RUST_LOG env variable, if it's set.
The 'tracing' crate allows for different log formatters, like JSON or
bunyan output. The one we use now is human-readable, with ANSI colors,
pretty close to the old format.