Use tracing crate to have more context in log messages. #696

hlinnaka · 2021-10-01T18:34:27Z

Whenever we start processing a request, we now enter a tracing "span"
that includes context information like the tenant and timeline ID,
and the operation we're performing. That context information gets
attached to every log message we create within the span. That way, we
don't need to include basic context information like that in every log
message, and it also becomes easier to filter the logs programmatically.

This removes the eplicit timeline and tenant IDs from most log messages,
as you get that information from the enclosing span now.

Also improve log messages in general, dialing down the level of some
messages that are not very useful, and adding information to others.

We now obey the RUST_LOG env variable, if it's set.

The 'tracing' crate allows for different log formatters, like JSON or
bunyan output. The one we use now is human-readable, with ANSI colors,
pretty close to the old format.

hlinnaka · 2021-10-01T18:35:40Z

@SomeoneToIgnore, you mentioned in the all hands today that you had started hacking on this. As it happens, so did I a couple of weeks ago but never got around to finish it into a PR. Here's what I ended up with, does this look similar to what you had in mind?

hlinnaka · 2021-10-01T18:41:09Z

We have some options on the output log format. This currently prints something very similar to what we have in 'main today:

Oct 01 21:31:58.052 TRACE pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}: query: CopyData(b"\x02\x01\0\0\0\0\x01k\x1c\xc0\0\0\x06\x80\0\0\0\0\0\0\x04\xd1\0\0\0\0\x01")
Oct 01 21:31:58.052 TRACE pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}:get_page{rel=1664/0/1233 blkno=1 req_lsn=0/16B1CC0}: get_layer_for_read called for 1664/0/1233.0 at 0/16B1CC0
Oct 01 21:31:58.052 TRACE pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}:get_page{rel=1664/0/1233 blkno=1 req_lsn=0/16B1CC0}: found layer in cache: 4b1f73f3b2fe5c20727ab3532deaf3ee 0/169C350-0/169C351
Oct 01 21:31:58.052 TRACE pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}:get_page{rel=1664/0/1233 blkno=1 req_lsn=0/16B1CC0}: found page image for blk 1 in 1664/0/1233 at 0/16B1CC0, no WAL redo required
Oct 01 21:31:58.052 TRACE pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}: query: CopyData(b"\x02\x01\0\0\0\0\x01k\x1c\xc0\0\0\x06\x80\0\0\0\0\0\0\x04\xbe\0\0\0\0\0")
Oct 01 21:31:58.053 TRACE pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}:get_page{rel=1664/0/1214 blkno=0 req_lsn=0/16B1CC0}: get_layer_for_read called for 1664/0/1214.0 at 0/16B1CC0
Oct 01 21:31:58.053 TRACE pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}:get_page{rel=1664/0/1214 blkno=0 req_lsn=0/16B1CC0}: found layer in cache: 4b1f73f3b2fe5c20727ab3532deaf3ee 0/169C350-0/169C351
Oct 01 21:31:58.053 DEBUG pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}:get_page{rel=1664/0/1214 blkno=0 req_lsn=0/16B1CC0}: loaded from ./tenants/b044e526cebb15ee983d648614127f1a/timelines/4b1f73f3b2fe5c20727ab3532deaf3ee/rel_1664_0_1214_0_0_000000000169C350    
Oct 01 21:31:58.053 TRACE pagestream{timeline=4b1f73f3b2fe5c20727ab3532deaf3ee tenant=b044e526cebb15ee983d648614127f1a}:get_page{rel=1664/0/1214 blkno=0 req_lsn=0/16B1CC0}: found page image for blk 0 in 1664/0/1214 at 0/16B1CC0, no WAL redo required

The lines are very wide, which is not very nice on the terminal. But it's important to have the context there, and when you print the tenant and timeline id it just takes a lot of space, there's no way around that.

Alternatively, we could switch to a more structured JSON format, the 'tracing' crate comes with a built-in formatter for that. I'm not sure what the state of the art with that is, but presumably there are some nice tools out there to parse and display JSON logs. Anyone have experience with that?

SomeoneToIgnore

Here's my branch: https://github.com/zenithdb/zenith/compare/kb/attempt-to-tracing?expand=1
It's pretty small and concentrated on the "right" way to do things in logging.rs lib file, so pick whatever you find useful and let me know when I can remove the branch.

tracing is somewhat odd when in comes to configuring, because there's so many moving parts that could be put together many different ways.
In my case, I've used a more formalized way to allow writing without blocks (supposed to be good, but needs benches and comparison) + rotate the files automatically (cool, not sure if super needed).

To sum this up:

I like how simple with_writer setup is in ypur PR, I wonder, if we could try the same with the non-blocking option and check if it works and works better
we need to treat async code a bit more gently or do not place spans (logging is fine) inside that code
a few extra deps in Cargo.toml are worth throwing off

SomeoneToIgnore · 2021-10-01T19:54:16Z

zenith_utils/src/logging.rs

+            // FIXME: It's actually nice to have them, so leave them on for now. Use
+            // 'less -R' option to display them.
+            //
+            // .with_ansi(false)


I would still use the non-ansi version by default due to many other tools literally interpreting our logs, examples:

k8s by default

CI logs: https://app.circleci.com/pipelines/github/zenithdb/zenith/2311/workflows/79be9d93-dc36-424d-89b7-ed09fe7e0dd2/jobs/16582/artifacts

local test logs too, presumably?

Makes sense, changed it to non-ansi.

However, that ran into a bug in 'tracing'. I'm currently actually using the 'pretty' format, so the log looks like this:

Oct 07 13:18:22.704 INFO GET /v1/branch/40d00a53756334d1a471043acce25557/test_config 200 OK at zenith_utils/src/http/endpoint.rs:26 Oct 07 13:18:22.732 INFO taking basebackup lsn=0/0, prev_lsn=0/169C350 at pageserver/src/basebackup.rs:84 in basebackup with ESC[1mtimelineESC[0m: 1d97ed9047930b55c1c4535db09ec4f5, ESC[1mtenantESC[0m: 40d00a53756334d1a471043acce25557 Oct 07 13:18:22.733 INFO list_nonrels called at 0/169C350 at pageserver/src/layered_repository.rs:725 in basebackup with ESC[1mtimelineESC[0m: 1d97ed9047930b55c1c4535db09ec4f5, ESC[1mtenantESC[0m: 40d00a53756334d1a471043acce25557, ESC[1mlsnESC[0m: "0/169C350"

So even with with_ansi(false), there are still some ansi control chars in the context lines. There's a github issue about that at tokio-rs/tracing#1310. I propose that we go ahead and push this even though we still get some of those control chars, and wait/work on the upstream issue to be fixed.

I like the idea to proceed anyway.
What if we disable the pretty logging too, at least for now?

What does that pretty mode do, by the way? I've not used it before 😄
Docs say that it should be "pretty": https://docs.rs/tracing-subscriber/0.2.25/tracing_subscriber/fmt/struct.Layer.html#method.pretty

Sets the layer being built to use an excessively pretty, human-readable formatter.

which is not much, grepping along the repo shows this:
https://github.com/tokio-rs/tracing/blob/6b7051c9897a6ea2afd61cfa139714cfeed8bf4f/tracing-subscriber/src/fmt/mod.rs#L120

//! * [`format::Pretty`]: Emits excessively pretty, multi-line logs, optimized //! for human readability. This is primarily intended to be used in local //! development and debugging, or for command-line applications, where //! automated analysis and compact storage of logs is less of a priority than //! readability and visual appeal.

If that's really the case, I don't like it:

it's not pretty right now

multiline logging is often tricky to handle in production cloud environment, when tons of threads concurrently writing something: unless every that line is prepended with some id/guid/uuid/whatnot , it's impossible to distinguish one line from another. We can dare and explore tracing behavior, or we can prefer single lines everywhere.

pageserver/Cargo.toml

pageserver/src/http/routes.rs

pageserver/src/page_service.rs

SomeoneToIgnore · 2021-10-01T20:18:00Z

zenith_utils/Cargo.toml

-slog-term = "2.8.0"
-slog = "2.7.0"
+tracing = "0.1.27"
+tracing-log = "0.1.2"


See the first bullet in
https://docs.rs/tracing-subscriber/0.2.24/tracing_subscriber/#optional-dependencies

tracing-subscriber with default features is enough to init tracing backend with log redirected inside it, see my branch for details.

I tried removing LogTracer::init() call from logging.rs, but a lot of log lines went missing. Clearly it is needed, it is not initialized automatically. I didn't see anything relevant in your branch, can you elaborate?

Good catch, I've missed that part too, because had called .finish(); on my subscriber builder...
Like I've mentioned, the configuration is somewhat convoluted here.

Here's the difference in the in their sources: https://github.com/tokio-rs/tracing/blob/6b7051c9897a6ea2afd61cfa139714cfeed8bf4f/tracing-subscriber/src/fmt/mod.rs#L556-L579

the main bit is

/// If the `tracing-log` feature is enabled, this will also install /// the LogTracer to convert `Log` records into `tracing` `Event`s.

that feature is enabled with the default features, so you should have it, then you should either invoke the right method on the formatter, or do this manually:
https://github.com/tokio-rs/tracing/blob/master/tracing-subscriber/src/util.rs#L57-L67

sharnoff

Dang, nice! Nothing really to say that @SomeoneToIgnore hasn't already mentioned, just weighing in with my two cents on holding .entered() guards at the top of functions.

IMO it looks a little weird. Proc macros have their downsides, but I can't think of any other clean ways to do it -- it's maybe worth putting some thought into it.

zenith_utils/src/logging.rs

Whenever we start processing a request, we now enter a tracing "span" that includes context information like the tenant and timeline ID, and the operation we're performing. That context information gets attached to every log message we create within the span. That way, we don't need to include basic context information like that in every log message, and it also becomes easier to filter the logs programmatically. This removes the eplicit timeline and tenant IDs from most log messages, as you get that information from the enclosing span now. Also improve log messages in general, dialing down the level of some messages that are not very useful, and adding information to others. We now obey the RUST_LOG env variable, if it's set. The 'tracing' crate allows for different log formatters, like JSON or bunyan output. The one we use now is human-readable multi-line format, which is nice when reading the log directly, but hard for post-processing. For production, we'll probably want JSON output and some tools for working with it, but that's left as a TODO. The log format is easy to change.

hlinnaka · 2021-10-11T06:04:04Z

The main open question here is the log format. I'm currently using the pretty format; I like it because it's very human-friendly. However, as @SomeoneToIgnore mentioned, as a multi-line format it's not suitable for post-processing. When this is deployed, we should probably use a JSON format to make automated filtering and post-processing easy. And if the tools to turn the JSON output into more human-readable format are readily available to all developers too, we can just always use JSON. Or maybe there's some other industry standard format we should be using? @cicdteam, thoughts on that? How are we going to collect and process logs in production?

@SomeoneToIgnore's branch also contained changes to do log rotation. That's not included here, but we'll need it sooner or later. Not sure if the page server itself should be doing rotation, or we should use the system 'logrotate' tool, or what.

The log format is easy to change when we know what we want. So I'm going to push this with the human-friendly multi-line format for now, and leave figuring out the format as a TODO.

hlinnaka requested a review from sharnoff October 1, 2021 18:36

hlinnaka force-pushed the tracing branch from 7a78f55 to 1c049eb Compare October 1, 2021 18:36

hlinnaka force-pushed the tracing branch from 1c049eb to 6a48970 Compare October 1, 2021 18:45

SomeoneToIgnore reviewed Oct 1, 2021

View reviewed changes

sharnoff approved these changes Oct 1, 2021

View reviewed changes

zenith_utils/src/logging.rs Outdated Show resolved Hide resolved

hlinnaka force-pushed the remove-wal-loading-leftovers branch 3 times, most recently from d06dea7 to 15f1bcc Compare October 6, 2021 12:59

Base automatically changed from remove-wal-loading-leftovers to main October 6, 2021 13:27

hlinnaka force-pushed the tracing branch 10 times, most recently from 2bfe4f6 to 78e4dbb Compare October 10, 2021 20:17

hlinnaka force-pushed the tracing branch from 78e4dbb to 7216f22 Compare October 11, 2021 06:04

hlinnaka merged commit 7216f22 into main Oct 11, 2021

hlinnaka deleted the tracing branch October 11, 2021 06:04

SomeoneToIgnore mentioned this pull request Apr 15, 2022

Remove log crate from pageserver #1524

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use tracing crate to have more context in log messages. #696

Use tracing crate to have more context in log messages. #696

hlinnaka commented Oct 1, 2021

hlinnaka commented Oct 1, 2021

hlinnaka commented Oct 1, 2021

SomeoneToIgnore left a comment •

edited

Loading

SomeoneToIgnore Oct 1, 2021

hlinnaka Oct 7, 2021

SomeoneToIgnore Oct 7, 2021

SomeoneToIgnore Oct 1, 2021

hlinnaka Oct 7, 2021

SomeoneToIgnore Oct 7, 2021

sharnoff left a comment

hlinnaka commented Oct 11, 2021

Use tracing crate to have more context in log messages. #696

Use tracing crate to have more context in log messages. #696

Conversation

hlinnaka commented Oct 1, 2021

hlinnaka commented Oct 1, 2021

hlinnaka commented Oct 1, 2021

SomeoneToIgnore left a comment • edited Loading

Choose a reason for hiding this comment

SomeoneToIgnore Oct 1, 2021

Choose a reason for hiding this comment

hlinnaka Oct 7, 2021

Choose a reason for hiding this comment

SomeoneToIgnore Oct 7, 2021

Choose a reason for hiding this comment

SomeoneToIgnore Oct 1, 2021

Choose a reason for hiding this comment

hlinnaka Oct 7, 2021

Choose a reason for hiding this comment

SomeoneToIgnore Oct 7, 2021

Choose a reason for hiding this comment

sharnoff left a comment

Choose a reason for hiding this comment

hlinnaka commented Oct 11, 2021

SomeoneToIgnore left a comment •

edited

Loading