-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pkg/server/debug: add tenant ID filtering to debug/logspy #94777
Conversation
513a125
to
2a91f9c
Compare
e427a92
to
2a06808
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r1, 2 of 2 files at r2, 2 of 31 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @knz)
-- commits
line 104 at r3:
two questions here:
- Why even write errors to a channel? Can't we just log them and continue?
- Why should an error cause an exit in logspy? Can we just continue with the rest of the logs?
pkg/server/debug/logspy.go
line 293 at r3 (raw file):
var entry logpb.Entry if err := json.Unmarshal(jsonEntry, &entry); err != nil { i.errCh <- errors.NewAssertionErrorWithWrappedErrf(err, "could not unmarshal entry: %s", jsonEntry)
this channel is initialized unbuffered, which means this call may block. is that the behavior we want? I'm not entirely sure how Intercept
works but if it's in the path of logging, would that block logging?
pkg/server/debug/server.go
line 147 at r3 (raw file):
panic("programmer error: failed parsing ambientCtx tenantID during debug.Server initialization") } tID, err := roachpb.MakeTenantID(parsed)
you can use MustMakeTenantID
here if you want.
pkg/util/log/format_crdb_v1.go
line 438 at r1 (raw file):
} // Process the remaining tags. tagsWithoutTenantID := v2TenantIdTagSelect.ReplaceAllString(string(m[7]), "")
can you perhaps rename v2TenantIdTagSelect
since it's now used for v1 as well?
2a06808
to
0117074
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback, PTAL
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @dhartunian, and @knz)
Previously, dhartunian (David Hartunian) wrote…
two questions here:
- Why even write errors to a channel? Can't we just log them and continue?
- Why should an error cause an exit in logspy? Can we just continue with the rest of the logs?
- It changes the behavior a bit (before, we didn't tolerate unmarshal errors), but I don't see it as a big deal to change this... especially considering the introduction of the error channel comes with complexity & risk! Updated to just log, count the entry as dropped, and continue.
- It was the old behavior (not tolerating unmarshal errors) that I was trying to maintain for no good reason, but now we will just log & record that the message was dropped and continue.
pkg/server/debug/logspy.go
line 293 at r3 (raw file):
Previously, dhartunian (David Hartunian) wrote…
this channel is initialized unbuffered, which means this call may block. is that the behavior we want? I'm not entirely sure how
Intercept
works but if it's in the path of logging, would that block logging?
Removed the errCh entirely.
pkg/server/debug/server.go
line 147 at r3 (raw file):
Previously, dhartunian (David Hartunian) wrote…
you can use
MustMakeTenantID
here if you want.
Done - thank you!
pkg/util/log/format_crdb_v1.go
line 438 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
can you perhaps rename
v2TenantIdTagSelect
since it's now used for v1 as well?
Good idea - done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work on the architecture. No comment there.
I've noticed two small things that I believe are important.
Reviewed 2 of 3 files at r1, 34 of 34 files at r4, 2 of 2 files at r5, 31 of 31 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @dhartunian)
pkg/server/debug/logspy.go
line 288 at r6 (raw file):
if err := json.Unmarshal(jsonEntry, &entry); err != nil { // Failed to unmarshal, so just log & mark the entry as dropped and continue. log.Errorf(i.ctx, "logspy failed to unmarshal entry: %v - err: %v", entry, err)
that will cause a circular re-entry into logging and the interceptor, resulting in an infinite call stack (and a crash).
IMHO if you encounter an error here you should just inject it into i.jsonEntry
too.
pkg/util/log/format_crdb_v1.go
line 438 at r4 (raw file):
} // Process the remaining tags. tagsWithoutTenantID := crdbFmtTenantIdTagSelect.ReplaceAllString(string(m[7]), "")
since you guarangee the tag is at the beginning, could you perhaps do a slice truncation (up to first ,
) instead of applying a regex? the regex is an expensive solution and it results in a heap allocation, which i believe is not warranted here.
0117074
to
784fd3a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR everyone, RFAL
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @dhartunian, and @knz)
pkg/server/debug/logspy.go
line 288 at r6 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
that will cause a circular re-entry into logging and the interceptor, resulting in an infinite call stack (and a crash).
IMHO if you encounter an error here you should just inject it intoi.jsonEntry
too.
Thanks for catching this! If an error occurs here we now pass it along in the payload to i.jsonEntry
, and the consumer will return the error once consumed. Let me know if I misunderstood your suggestion.
pkg/util/log/format_crdb_v1.go
line 438 at r4 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
since you guarangee the tag is at the beginning, could you perhaps do a slice truncation (up to first
,
) instead of applying a regex? the regex is an expensive solution and it results in a heap allocation, which i believe is not warranted here.
Thanks for the advice - removed the TenantID tag regex for both the crdb_v1
and crdb_v2
formats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 35 files at r7, 2 of 2 files at r8, 31 of 31 files at r9, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @dhartunian)
pkg/util/log/format_crdb_v1.go
line 431 at r9 (raw file):
// Look for a tenant ID tag. Default to system otherwise. entry.TenantID = systemTenantID tagsToProcess := string(m[7])
Please avoid the conversion to string
which forces a heap allocation.
Instead, keep []byte
and use bytes.HasPrefix
and bytes.IndexByte
to search for the comma.
pkg/util/log/format_crdb_v2.go
line 700 at r9 (raw file):
// Strip out the tenant ID tag. We handle it separately in getTenantID(). if strings.HasPrefix(tagsStr, TenantIDLogTagKey) { tenantIDSplit := strings.SplitN(tagsStr, ",", 2)
strings.IndexByte
c093bbc
to
c1a92a3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @dhartunian, and @knz)
pkg/util/log/format_crdb_v1.go
line 431 at r9 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Please avoid the conversion to
string
which forces a heap allocation.
Instead, keep[]byte
and usebytes.HasPrefix
andbytes.IndexByte
to search for the comma.
Thanks for being vigilant here! Done. Let me know if you think there's any more efficiency to be gained.
pkg/util/log/format_crdb_v2.go
line 700 at r9 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
strings.IndexByte
Done.
c1a92a3
to
52f8d7a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 36 files at r10, 1 of 2 files at r11, 2 of 2 files at r12, 31 of 31 files at r13, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @dhartunian)
pkg/util/log/format_crdb_v1.go
line 437 at r13 (raw file):
if commaIndex >= 0 { // More tags exist than just the tenant ID tag. entry.TenantID = string(tagsToProcess[1:commaIndex])
I wonder if we can also get rid of this conversion. The tenant ID is a number. Can we maybe extract the number from the string and have an int instead of string in entry
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @dhartunian, and @knz)
pkg/util/log/format_crdb_v1.go
line 437 at r13 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
I wonder if we can also get rid of this conversion. The tenant ID is a number. Can we maybe extract the number from the string and have an int instead of string in
entry
?
I'm open to the idea! I think I chose to use a string because the ID is served as a string from the idPayload
in the context when encoding, but the decode side doesn't need to be influenced by the encode side of things. It would require some broader changes that fall outside the scope of this PR, so I'd like to handle that in its own thread.
I already got a couple good pieces of feedback from Andrei on the tenant ID log tagging stuff that I'm going to address (kind of like a "various QoL improvements" PR). I think this suggestion would fit nicely alongside it if you're okay with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that's good. I'm pulling on this thread because we know that log parsing has historically been a perf bottleneck so it's a worthy area to be careful about.
It's very much appreciated 🙏 thanks for the review! |
In a previous patch, we updated all log formats to always tag log entries with tenant IDs to prepare ourselves for multiple tenants per-process. When we did this work initially, the decision was made to not tag `crdb_v1` formatted logs with the tenant ID, since this is an older format that's not expected to be used in clusters running multiple tenants per-process. However, after doing some additional work, it was noticed that this format is still used frequently in debugging tools. Not having a tenant ID tag in the format therefore becomes confusing when analyzing logs for clusters running multiple tenants per-process, since the tenant IDs are missing per- entry. This patch updates the `crdb_v1` format to now always include a tenant ID tag as the first tag, making use of its existing tags structure. As a result, debug tools making use of the crdb_v1 format will be compatible with upcoming features such as filtering based on tenant ID. Release note (ops change): crdb_v1 formatted logs will now always contain a tag containing the tenant ID that the log entry pertains to. For example, for tenant ID 2, we include the tag `T2`: ``` W000101 00:00:12.300000 456 somefile.go:136 [T2,nsql?,othertag=somevalue] 2 this is a log message ``` If decoding an older crdb_v1 log message that does not contain a tenant ID tag, then we default to the primary system tenant.
Previously, a crdb_v2 log entry without any tags would use `[-]` in the log entry to indicate that the tags for the entry were empty. With recent changes ensuring that all logs are tagged with the tenant ID (crdb_v1, crdb_v2, json), we never expect a log entry with empty tags. When decoding the crdb_v2 format, we forgot to update the decoder to no longer set "-" in the tags field when no tags other than the tenant ID tag were available. This does not play nicely when re-formatting the logpb.Entry back into crdb_v1 format (commonly done in our debugging tools such as debug merge-logs) since it would create entries with tag strings such as: ``` [T1,-] ``` This patch updates the crdb_v2 format's decoder to no longer use `-` to denote empty tags. Instead, an empty string on the tags field is used which is much more compatible with writing the entry back into the crdb_v1 format. Note that the decoder still properly interprets `-` as an empty tags list, in case a newer CRDB version is used to decode crdb_v2 logs generated from an older version. Release note: none
52f8d7a
to
168cc04
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the last commit msg needs revising.
Reviewed 35 of 35 files at r7, 2 of 2 files at r8, 1 of 31 files at r9, 2 of 36 files at r10, 2 of 2 files at r11.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @knz)
-- commits
line 99 at r13:
nit: This is no longer true, right?
We need to ensure that any log-related debug endpoints that can be used by tenants (especially those available in DB Console) filter to the specific tenant that made the request. This is important to respect tenant boundaries when it comes to logging data. This patch updates the `debug/logspy` endpoint to filter logs to only include those belonging to the tenant handling the request. For the system tenant, we apply no such filtering. The system tenant has sufficient privilege to view logs across tenants, so we avoid filtering in this case. Another thing of note is that previously, the `log.Interceptor` had no chance of error generation. Instead, any errors would be created after consuming off the channel that the interceptor produced into. Specifically, if we ever failed to unmarshall the raw JSON bytes into a `logpb.Entry`, which was not a tolerated error. Now that the interceptor filters on tenant ID, it needs to unmarshall the raw JSON bytes for each log entry into a `logpb.Entry` so that it can filter against the tenant ID. This creates an opportunity for error within the interceptor that didn't exist before, and as before, we do not want to tolerate such errors in the logSpy system. As a solution, in the payload that we send from the interceptor to the log spy consumer, we optionally pass along any unmarshal errors for the consumer to handle (which in this case, the consumer does not tolerate). Release note: none
168cc04
to
8117ff0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR 💜
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @dhartunian, and @knz)
Previously, dhartunian (David Hartunian) wrote…
nit: This is no longer true, right?
Thanks for catching! Fixed
95016: pkg/cli: add tenant ID filter flag to `debug merge-logs` r=dhartunian a=abarganier **NB: Only review the final commit - all previous commits are being reviewed in #94777 With multiple tenants running per-process, log files are shared between tenants where tenant ID tags are used to distinguish which tenant a log entry originated from. When dealing with logs, especially in a debug.zip bundle, we need to provide tools that make it easy to filter logs to a specific tenant, or group of tenants. This patch updates the `debug merge-logs` command with a `--tenant-ids` flag which can be used to filter merged logs to specific tenant ID(s). For example, to only include logs for app tenants 2 and 3: ``` cockroach debug merge-logs ... --tenant-ids=2,3 ``` Release note (cli change): The `debug merge-logs` CLI command is now equipped with a `--tenant-ids` flag, which can be used to provide the command with a list of tenant IDs to filter logs on. For example, to only include logs for secondary tenants 2 and 3: ``` cockroach debug merge-logs ... --tenant-ids=2,3 ``` Note that when processing logs from older cluster versions, where a tenant ID tag may not be present on the log entry, the command will interpret any log without a tenant ID as belonging to the system tenant (tenant ID = 1). Epic: CRDB-14486 Addresses: https://cockroachlabs.atlassian.net/browse/CRDB-22152 95192: roachprod: disable ext4 reserved space r=herkolategan a=srosenberg By default, ext4 reserves 5% of blocks for the root user. The main rationale for this historical default is to reserve space for (a) important root processes, and (b) de-fragmentation. However, when this concerns disks which are used to store SST files, there is seemingly no good reason for this wasted space. Furthermore, debugging an out-of-disk space event becomes more surprising without a direct knowledge of this default. Consequently, it was decided in favor of removing this reserved space, but only for data disks. Please see the corresponding GH issue for more details. Epic: none Fixes: #93120 Release note: None Co-authored-by: Alex Barganier <[email protected]> Co-authored-by: Stan Rosenberg <[email protected]>
NB: The first commit updates crdb_v1 to always tag log entries with the tenant ID. See commit description for details.
We need to ensure that any log-related debug endpoints that
can be used by tenants (especially those available in DB
Console) filter to the specific tenant that made the request.
This is important to respect tenant boundaries when it comes
to logging data.
This patch updates the
debug/logspy
endpoint to filter logsto only include those belonging to the tenant handling the
request.
For the system tenant, we apply no such filtering. The system
tenant has sufficient privilege to view logs across tenants,
so we avoid filtering in this case.
Another thing of note is that previously, the
log.Interceptor
had no chance of error generation. Instead, any errors would
be created after consuming off the channel that the interceptor
produced into. Specifically, if we ever failed to unmarshall
the raw JSON bytes into a
logpb.Entry
, which was not atolerated error.
Now that the interceptor filters on tenant ID, it needs to
unmarshall the raw JSON bytes for each log entry into a
logpb.Entry
so that it can filter against the tenant ID.This creates an opportunity for error within the interceptor
that didn't exist before, and as before, we do not want to
tolerate such errors in the logSpy system. However, the
interceptor's interface does not allow us to return an error.
As a solution, the logSpyInterceptor now has an error channel
that the Interceptor can produce into if any errors occur
when unmarshalling JSON. The logSpy's run routine will
exit upon receiving an error on this channel. The behavior
should effectively be the same: we didn't tolerate parse
errors before, and we don't with these updates either.
Release note: none
Epic: CRDB-14486
Addresses #94779