Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use tracing crate to have more context in log messages. #696

Merged
merged 1 commit into from
Oct 11, 2021
Merged

Conversation

hlinnaka
Copy link
Contributor

@hlinnaka hlinnaka commented Oct 1, 2021

Whenever we start processing a request, we now enter a tracing "span"
that includes context information like the tenant and timeline ID,
and the operation we're performing. That context information gets
attached to every log message we create within the span. That way, we
don't need to include basic context information like that in every log
message, and it also becomes easier to filter the logs programmatically.

This removes the eplicit timeline and tenant IDs from most log messages,
as you get that information from the enclosing span now.

Also improve log messages in general, dialing down the level of some
messages that are not very useful, and adding information to others.

We now obey the RUST_LOG env variable, if it's set.

The 'tracing' crate allows for different log formatters, like JSON or
bunyan output. The one we use now is human-readable, with ANSI colors,
pretty close to the old format.

@hlinnaka
Copy link
Contributor Author

hlinnaka commented Oct 1, 2021

@SomeoneToIgnore, you mentioned in the all hands today that you had started hacking on this. As it happens, so did I a couple of weeks ago but never got around to finish it into a PR. Here's what I ended up with, does this look similar to what you had in mind?

@hlinnaka
Copy link
Contributor Author

hlinnaka commented Oct 1, 2021

We have some options on the output log format. This currently prints something very similar to what we have in 'main today:

Oct 01 21:31:58.052 TRACE pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}: query: CopyData(b"\x02\x01\0\0\0\0\x01k\x1c\xc0\0\0\x06\x80\0\0\0\0\0\0\x04\xd1\0\0\0\0\x01")
Oct 01 21:31:58.052 TRACE pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}:get_page{rel=1664/0/1233 blkno=1 req_lsn=0/16B1CC0}: get_layer_for_read called for 1664/0/1233.0 at 0/16B1CC0
Oct 01 21:31:58.052 TRACE pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}:get_page{rel=1664/0/1233 blkno=1 req_lsn=0/16B1CC0}: found layer in cache: 4b1f73f3b2fe5c20727ab3532deaf3ee 0/169C350-0/169C351
Oct 01 21:31:58.052 TRACE pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}:get_page{rel=1664/0/1233 blkno=1 req_lsn=0/16B1CC0}: found page image for blk 1 in 1664/0/1233 at 0/16B1CC0, no WAL redo required
Oct 01 21:31:58.052 TRACE pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}: query: CopyData(b"\x02\x01\0\0\0\0\x01k\x1c\xc0\0\0\x06\x80\0\0\0\0\0\0\x04\xbe\0\0\0\0\0")
Oct 01 21:31:58.053 TRACE pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}:get_page{rel=1664/0/1214 blkno=0 req_lsn=0/16B1CC0}: get_layer_for_read called for 1664/0/1214.0 at 0/16B1CC0
Oct 01 21:31:58.053 TRACE pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}:get_page{rel=1664/0/1214 blkno=0 req_lsn=0/16B1CC0}: found layer in cache: 4b1f73f3b2fe5c20727ab3532deaf3ee 0/169C350-0/169C351
Oct 01 21:31:58.053 DEBUG pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}:get_page{rel=1664/0/1214 blkno=0 req_lsn=0/16B1CC0}: loaded from ./tenants/b044e526cebb15ee983d648614127f1a/timelines/4b1f73f3b2fe5c20727ab3532deaf3ee/rel_1664_0_1214_0_0_000000000169C350    
Oct 01 21:31:58.053 TRACE pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}:get_page{rel=1664/0/1214 blkno=0 req_lsn=0/16B1CC0}: found page image for blk 0 in 1664/0/1214 at 0/16B1CC0, no WAL redo required

The lines are very wide, which is not very nice on the terminal. But it's important to have the context there, and when you print the tenant and timeline id it just takes a lot of space, there's no way around that.

Alternatively, we could switch to a more structured JSON format, the 'tracing' crate comes with a built-in formatter for that. I'm not sure what the state of the art with that is, but presumably there are some nice tools out there to parse and display JSON logs. Anyone have experience with that?

Copy link
Contributor

@SomeoneToIgnore SomeoneToIgnore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my branch: https://github.com/zenithdb/zenith/compare/kb/attempt-to-tracing?expand=1
It's pretty small and concentrated on the "right" way to do things in logging.rs lib file, so pick whatever you find useful and let me know when I can remove the branch.

tracing is somewhat odd when in comes to configuring, because there's so many moving parts that could be put together many different ways.
In my case, I've used a more formalized way to allow writing without blocks (supposed to be good, but needs benches and comparison) + rotate the files automatically (cool, not sure if super needed).

To sum this up:

  • I like how simple with_writer setup is in ypur PR, I wonder, if we could try the same with the non-blocking option and check if it works and works better
  • we need to treat async code a bit more gently or do not place spans (logging is fine) inside that code
  • a few extra deps in Cargo.toml are worth throwing off

// FIXME: It's actually nice to have them, so leave them on for now. Use
// 'less -R' option to display them.
//
// .with_ansi(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still use the non-ansi version by default due to many other tools literally interpreting our logs, examples:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, changed it to non-ansi.

However, that ran into a bug in 'tracing'. I'm currently actually using the 'pretty' format, so the log looks like this:

  Oct 07 13:18:22.704  INFO GET /v1/branch/40d00a53756334d1a471043acce25557/test_config 200 OK
    at zenith_utils/src/http/endpoint.rs:26

  Oct 07 13:18:22.732  INFO taking basebackup lsn=0/0, prev_lsn=0/169C350
    at pageserver/src/basebackup.rs:84
    in basebackup with ESC[1mtimelineESC[0m: 1d97ed9047930b55c1c4535db09ec4f5, ESC[1mtenantESC[0m: 40d00a53756334d1a471043acce25557

  Oct 07 13:18:22.733  INFO list_nonrels called at 0/169C350
    at pageserver/src/layered_repository.rs:725
    in basebackup with ESC[1mtimelineESC[0m: 1d97ed9047930b55c1c4535db09ec4f5, ESC[1mtenantESC[0m: 40d00a53756334d1a471043acce25557, ESC[1mlsnESC[0m: "0/169C350"

So even with with_ansi(false), there are still some ansi control chars in the context lines. There's a github issue about that at tokio-rs/tracing#1310. I propose that we go ahead and push this even though we still get some of those control chars, and wait/work on the upstream issue to be fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea to proceed anyway.
What if we disable the pretty logging too, at least for now?


What does that pretty mode do, by the way? I've not used it before 😄
Docs say that it should be "pretty": https://docs.rs/tracing-subscriber/0.2.25/tracing_subscriber/fmt/struct.Layer.html#method.pretty

Sets the layer being built to use an excessively pretty, human-readable formatter.

which is not much, grepping along the repo shows this:
https://github.com/tokio-rs/tracing/blob/6b7051c9897a6ea2afd61cfa139714cfeed8bf4f/tracing-subscriber/src/fmt/mod.rs#L120

//! * [`format::Pretty`]: Emits excessively pretty, multi-line logs, optimized
//!   for human readability. This is primarily intended to be used in local
//!   development and debugging, or for command-line applications, where
//!   automated analysis and compact storage of logs is less of a priority than
//!   readability and visual appeal.

If that's really the case, I don't like it:

  • it's not pretty right now
  • multiline logging is often tricky to handle in production cloud environment, when tons of threads concurrently writing something: unless every that line is prepended with some id/guid/uuid/whatnot , it's impossible to distinguish one line from another. We can dare and explore tracing behavior, or we can prefer single lines everywhere.

pageserver/Cargo.toml Show resolved Hide resolved
pageserver/src/http/routes.rs Show resolved Hide resolved
pageserver/src/page_service.rs Show resolved Hide resolved
slog-term = "2.8.0"
slog = "2.7.0"
tracing = "0.1.27"
tracing-log = "0.1.2"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the first bullet in
https://docs.rs/tracing-subscriber/0.2.24/tracing_subscriber/#optional-dependencies

tracing-subscriber with default features is enough to init tracing backend with log redirected inside it, see my branch for details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried removing LogTracer::init() call from logging.rs, but a lot of log lines went missing. Clearly it is needed, it is not initialized automatically. I didn't see anything relevant in your branch, can you elaborate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I've missed that part too, because had called .finish(); on my subscriber builder...
Like I've mentioned, the configuration is somewhat convoluted here.

Here's the difference in the in their sources: https://github.com/tokio-rs/tracing/blob/6b7051c9897a6ea2afd61cfa139714cfeed8bf4f/tracing-subscriber/src/fmt/mod.rs#L556-L579

the main bit is

/// If the `tracing-log` feature is enabled, this will also install
/// the LogTracer to convert `Log` records into `tracing` `Event`s.

that feature is enabled with the default features, so you should have it, then you should either invoke the right method on the formatter, or do this manually:
https://github.com/tokio-rs/tracing/blob/master/tracing-subscriber/src/util.rs#L57-L67

Copy link
Member

@sharnoff sharnoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dang, nice! Nothing really to say that @SomeoneToIgnore hasn't already mentioned, just weighing in with my two cents on holding .entered() guards at the top of functions.

IMO it looks a little weird. Proc macros have their downsides, but I can't think of any other clean ways to do it -- it's maybe worth putting some thought into it.

zenith_utils/src/logging.rs Outdated Show resolved Hide resolved
@hlinnaka hlinnaka force-pushed the remove-wal-loading-leftovers branch 3 times, most recently from d06dea7 to 15f1bcc Compare October 6, 2021 12:59
Base automatically changed from remove-wal-loading-leftovers to main October 6, 2021 13:27
@hlinnaka hlinnaka force-pushed the tracing branch 10 times, most recently from 2bfe4f6 to 78e4dbb Compare October 10, 2021 20:17
Whenever we start processing a request, we now enter a tracing "span"
that includes context information like the tenant and timeline ID, and
the operation we're performing. That context information gets attached
to every log message we create within the span. That way, we don't need
to include basic context information like that in every log message, and
it also becomes easier to filter the logs programmatically.

This removes the eplicit timeline and tenant IDs from most log messages,
as you get that information from the enclosing span now.

Also improve log messages in general, dialing down the level of some
messages that are not very useful, and adding information to others.

We now obey the RUST_LOG env variable, if it's set.

The 'tracing' crate allows for different log formatters, like JSON or
bunyan output. The one we use now is human-readable multi-line format,
which is nice when reading the log directly, but hard for
post-processing.  For production, we'll probably want JSON output and
some tools for working with it, but that's left as a TODO. The log
format is easy to change.
@hlinnaka
Copy link
Contributor Author

The main open question here is the log format. I'm currently using the pretty format; I like it because it's very human-friendly. However, as @SomeoneToIgnore mentioned, as a multi-line format it's not suitable for post-processing. When this is deployed, we should probably use a JSON format to make automated filtering and post-processing easy. And if the tools to turn the JSON output into more human-readable format are readily available to all developers too, we can just always use JSON. Or maybe there's some other industry standard format we should be using? @cicdteam, thoughts on that? How are we going to collect and process logs in production?

@SomeoneToIgnore's branch also contained changes to do log rotation. That's not included here, but we'll need it sooner or later. Not sure if the page server itself should be doing rotation, or we should use the system 'logrotate' tool, or what.

The log format is easy to change when we know what we want. So I'm going to push this with the human-friendly multi-line format for now, and leave figuring out the format as a TODO.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants