
pkg/server/debug: add tenant ID filtering to debug/logspy #94777

Merged

merged 3 commits into cockroachdb:master from log_tenant_filter on Jan 12, 2023

Conversation

abarganier
Contributor

@abarganier abarganier commented Jan 5, 2023

NB: The first commit updates crdb_v1 to always tag log entries with the tenant ID. See commit description for details.

We need to ensure that any log-related debug endpoints that
can be used by tenants (especially those available in DB
Console) filter to the specific tenant that made the request.
This is important to respect tenant boundaries when it comes
to logging data.

This patch updates the debug/logspy endpoint to filter logs
to only include those belonging to the tenant handling the
request.

For the system tenant, we apply no such filtering. The system
tenant has sufficient privilege to view logs across tenants,
so we avoid filtering in this case.

Another thing of note: previously, the log.Interceptor
had no way to generate errors itself. Instead, any errors
would be created after consuming off the channel that the
interceptor produced into; specifically, whenever we failed
to unmarshal the raw JSON bytes into a logpb.Entry, which
was not a tolerated error.

Now that the interceptor filters on tenant ID, it needs to
unmarshal the raw JSON bytes for each log entry into a
logpb.Entry so that it can filter against the tenant ID.
This creates an opportunity for error within the interceptor
that didn't exist before, and as before, we do not want to
tolerate such errors in the logSpy system. However, the
interceptor's interface does not allow us to return an error.

As a solution, the logSpyInterceptor now has an error channel
that the Interceptor can produce into if any errors occur
when unmarshalling JSON. The logSpy's run routine will
exit upon receiving an error on this channel. The behavior
should effectively be the same: we didn't tolerate parse
errors before, and we don't with these updates either.

Release note: none

Epic: CRDB-14486

Addresses #94779
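The interceptor-plus-error-channel design described in the commit message could be sketched roughly as follows. This is an illustrative miniature, not the actual CockroachDB code: the type and field names (`logSpyInterceptor`, `entry`, `errCh`) are hypothetical stand-ins, and the real `log.Interceptor` and `logpb.Entry` interfaces differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// entry stands in for logpb.Entry; only the field we filter on is shown.
type entry struct {
	TenantID string `json:"tenant_id"`
	Message  string `json:"message"`
}

// logSpyInterceptor filters raw JSON log lines down to a single tenant.
// Because the Intercept signature returns no error, unmarshal failures
// are reported on errCh instead, per the design described above.
type logSpyInterceptor struct {
	tenantID string
	entries  chan entry
	errCh    chan error
}

func (i *logSpyInterceptor) Intercept(raw []byte) {
	var e entry
	if err := json.Unmarshal(raw, &e); err != nil {
		i.errCh <- fmt.Errorf("could not unmarshal entry: %w", err)
		return
	}
	if e.TenantID == i.tenantID {
		i.entries <- e
	}
}

func main() {
	i := &logSpyInterceptor{
		tenantID: "2",
		entries:  make(chan entry, 8), // buffered so Intercept never blocks here
		errCh:    make(chan error, 8),
	}
	i.Intercept([]byte(`{"tenant_id":"2","message":"kept"}`))
	i.Intercept([]byte(`{"tenant_id":"3","message":"filtered"}`))
	i.Intercept([]byte(`not json`))
	fmt.Println(len(i.entries), len(i.errCh)) // prints "1 1"
}
```

As the review below notes, this exact design was later dropped in favor of passing the error along in the payload itself.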

@abarganier abarganier requested review from knz, dhartunian, aadityasondhi and a team January 5, 2023 19:15
@abarganier abarganier requested a review from a team as a code owner January 5, 2023 19:15
@cockroach-teamcity
Member

This change is Reviewable

@abarganier abarganier changed the title from "pkg/util/log: tags crdb_v1 format logs with tenant ID" to "pkg/server/debug: add tenant ID filtering to debug/logspy" on Jan 5, 2023
@abarganier abarganier force-pushed the log_tenant_filter branch 3 times, most recently from e427a92 to 2a06808 Compare January 6, 2023 17:39
Collaborator

@dhartunian dhartunian left a comment


Reviewed 3 of 3 files at r1, 2 of 2 files at r2, 2 of 31 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @knz)


-- commits line 104 at r3:
two questions here:

  1. Why even write errors to a channel? Can't we just log them and continue?
  2. Why should an error cause an exit in logspy? Can we just continue with the rest of the logs?

pkg/server/debug/logspy.go line 293 at r3 (raw file):

	var entry logpb.Entry
	if err := json.Unmarshal(jsonEntry, &entry); err != nil {
		i.errCh <- errors.NewAssertionErrorWithWrappedErrf(err, "could not unmarshal entry: %s", jsonEntry)

this channel is initialized unbuffered, which means this call may block. is that the behavior we want? I'm not entirely sure how Intercept works but if it's in the path of logging, would that block logging?


pkg/server/debug/server.go line 147 at r3 (raw file):

		panic("programmer error: failed parsing ambientCtx tenantID during debug.Server initialization")
	}
	tID, err := roachpb.MakeTenantID(parsed)

you can use MustMakeTenantID here if you want.


pkg/util/log/format_crdb_v1.go line 438 at r1 (raw file):

		}
		// Process the remaining tags.
		tagsWithoutTenantID := v2TenantIdTagSelect.ReplaceAllString(string(m[7]), "")

can you perhaps rename v2TenantIdTagSelect since it's now used for v1 as well?

Contributor Author

@abarganier abarganier left a comment


Thanks for the feedback, PTAL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @dhartunian, and @knz)


-- commits line 104 at r3:

Previously, dhartunian (David Hartunian) wrote…

two questions here:

  1. Why even write errors to a channel? Can't we just log them and continue?
  2. Why should an error cause an exit in logspy? Can we just continue with the rest of the logs?

My answers:

1. It changes the behavior a bit (before, we didn't tolerate unmarshal errors), but I don't see it as a big deal to change this, especially considering the introduction of the error channel comes with complexity & risk! Updated to just log, count the entry as dropped, and continue.
2. It was the old behavior (not tolerating unmarshal errors) that I was trying to maintain for no good reason, but now we will just log & record that the message was dropped and continue.

pkg/server/debug/logspy.go line 293 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

this channel is initialized unbuffered, which means this call may block. is that the behavior we want? I'm not entirely sure how Intercept works but if it's in the path of logging, would that block logging?

Removed the errCh entirely.


pkg/server/debug/server.go line 147 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

you can use MustMakeTenantID here if you want.

Done - thank you!


pkg/util/log/format_crdb_v1.go line 438 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

can you perhaps rename v2TenantIdTagSelect since it's now used for v1 as well?

Good idea - done!

Contributor

@knz knz left a comment


Good work on the architecture. No comment there.
I've noticed two small things that I believe are important.

Reviewed 2 of 3 files at r1, 34 of 34 files at r4, 2 of 2 files at r5, 31 of 31 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @dhartunian)


pkg/server/debug/logspy.go line 288 at r6 (raw file):

	if err := json.Unmarshal(jsonEntry, &entry); err != nil {
		// Failed to unmarshal, so just log & mark the entry as dropped and continue.
		log.Errorf(i.ctx, "logspy failed to unmarshal entry: %v - err: %v", entry, err)

that will cause a circular re-entry into logging and the interceptor, resulting in an infinite call stack (and a crash).
IMHO if you encounter an error here you should just inject it into i.jsonEntry too.


pkg/util/log/format_crdb_v1.go line 438 at r4 (raw file):

		}
		// Process the remaining tags.
		tagsWithoutTenantID := crdbFmtTenantIdTagSelect.ReplaceAllString(string(m[7]), "")

since you guarantee the tag is at the beginning, could you perhaps do a slice truncation (up to the first `,`) instead of applying a regex? the regex is an expensive solution and it results in a heap allocation, which i believe is not warranted here.

Contributor Author

@abarganier abarganier left a comment


TFTR everyone, RFAL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @dhartunian, and @knz)


pkg/server/debug/logspy.go line 288 at r6 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

that will cause a circular re-entry into logging and the interceptor, resulting in an infinite call stack (and a crash).
IMHO if you encounter an error here you should just inject it into i.jsonEntry too.

Thanks for catching this! If an error occurs here we now pass it along in the payload to i.jsonEntry, and the consumer will return the error once consumed. Let me know if I misunderstood your suggestion.


pkg/util/log/format_crdb_v1.go line 438 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

since you guarantee the tag is at the beginning, could you perhaps do a slice truncation (up to the first `,`) instead of applying a regex? the regex is an expensive solution and it results in a heap allocation, which i believe is not warranted here.

Thanks for the advice - removed the TenantID tag regex for both the crdb_v1 and crdb_v2 formats.

@abarganier abarganier requested a review from knz January 11, 2023 18:45
Contributor

@knz knz left a comment


Reviewed 2 of 35 files at r7, 2 of 2 files at r8, 31 of 31 files at r9, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @dhartunian)


pkg/util/log/format_crdb_v1.go line 431 at r9 (raw file):

		// Look for a tenant ID tag. Default to system otherwise.
		entry.TenantID = systemTenantID
		tagsToProcess := string(m[7])

Please avoid the conversion to string which forces a heap allocation.
Instead, keep []byte and use bytes.HasPrefix and bytes.IndexByte to search for the comma.
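The allocation-free approach being suggested might look like the following sketch. The tag layout (`T2,n1,client`) is taken from the examples in this thread, but the helper name and exact signature are hypothetical, not the merged code:

```go
package main

import (
	"bytes"
	"fmt"
)

// splitTenantTag extracts the leading tenant ID tag from a tag list such
// as []byte("T2,n1,client"), returning the ID and the remaining tags. It
// stays on []byte (bytes.HasPrefix / bytes.IndexByte), so no intermediate
// string is allocated; only the final, small ID conversion allocates.
func splitTenantTag(tags []byte) (tenantID string, rest []byte) {
	if !bytes.HasPrefix(tags, []byte("T")) {
		return "1", tags // no tenant tag: default to the system tenant
	}
	if i := bytes.IndexByte(tags, ','); i >= 0 {
		return string(tags[1:i]), tags[i+1:] // more tags follow the tenant ID
	}
	return string(tags[1:]), nil // the tenant ID was the only tag
}

func main() {
	id, rest := splitTenantTag([]byte("T2,n1,client"))
	fmt.Printf("%s %s\n", id, rest) // prints "2 n1,client"
}
```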


pkg/util/log/format_crdb_v2.go line 700 at r9 (raw file):

	// Strip out the tenant ID tag. We handle it separately in getTenantID().
	if strings.HasPrefix(tagsStr, TenantIDLogTagKey) {
		tenantIDSplit := strings.SplitN(tagsStr, ",", 2)

strings.IndexByte

@abarganier abarganier force-pushed the log_tenant_filter branch 2 times, most recently from c093bbc to c1a92a3 Compare January 11, 2023 20:52
Contributor Author

@abarganier abarganier left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @dhartunian, and @knz)


pkg/util/log/format_crdb_v1.go line 431 at r9 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Please avoid the conversion to string which forces a heap allocation.
Instead, keep []byte and use bytes.HasPrefix and bytes.IndexByte to search for the comma.

Thanks for being vigilant here! Done. Let me know if you think there's any more efficiency to be gained.


pkg/util/log/format_crdb_v2.go line 700 at r9 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

strings.IndexByte

Done.

@abarganier abarganier requested a review from knz January 11, 2023 21:06
Contributor

@knz knz left a comment


Reviewed 2 of 36 files at r10, 1 of 2 files at r11, 2 of 2 files at r12, 31 of 31 files at r13, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @dhartunian)


pkg/util/log/format_crdb_v1.go line 437 at r13 (raw file):

			if commaIndex >= 0 {
				// More tags exist than just the tenant ID tag.
				entry.TenantID = string(tagsToProcess[1:commaIndex])

I wonder if we can also get rid of this conversion. The tenant ID is a number. Can we maybe extract the number from the string and have an int instead of string in entry?
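One way to drop that conversion entirely, sketched as a standalone helper (the reply below defers this to a follow-up PR, so this is not the merged code), is to fold the decimal digits straight out of the byte slice:

```go
package main

import "fmt"

// parseTenantID folds decimal digits out of a byte slice, e.g. the "2"
// in a "T2" tag, without allocating an intermediate string.
func parseTenantID(b []byte) (uint64, bool) {
	if len(b) == 0 {
		return 0, false
	}
	var n uint64
	for _, c := range b {
		if c < '0' || c > '9' {
			return 0, false
		}
		n = n*10 + uint64(c-'0')
	}
	return n, true
}

func main() {
	n, ok := parseTenantID([]byte("42"))
	fmt.Println(n, ok) // prints "42 true"
}
```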

Contributor Author

@abarganier abarganier left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @dhartunian, and @knz)


pkg/util/log/format_crdb_v1.go line 437 at r13 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I wonder if we can also get rid of this conversion. The tenant ID is a number. Can we maybe extract the number from the string and have an int instead of string in entry?

I'm open to the idea! I think I chose to use a string because the ID is served as a string from the idPayload in the context when encoding, but the decode side doesn't need to be influenced by the encode side of things. It would require some broader changes that fall outside the scope of this PR, so I'd like to handle that in its own thread.

I already got a couple good pieces of feedback from Andrei on the tenant ID log tagging stuff that I'm going to address (kind of like a "various QoL improvements" PR). I think this suggestion would fit nicely alongside it if you're okay with that.

@abarganier abarganier requested a review from knz January 11, 2023 21:21
Contributor

@knz knz left a comment


yes that's good. I'm pulling on this thread because we know that log parsing has historically been a perf bottleneck so it's a worthy area to be careful about.

@abarganier
Contributor Author

I'm pulling on this thread because we know that log parsing has historically been a perf bottleneck so it's a worthy area to be careful about.

It's very much appreciated 🙏 thanks for the review!

In a previous patch, we updated all log formats to always
tag log entries with tenant IDs to prepare ourselves for
multiple tenants per-process.

When we did this work initially, the decision was made to
not tag `crdb_v1` formatted logs with the tenant ID, since
this is an older format that's not expected to be used in
clusters running multiple tenants per-process.

However, after doing some additional work, it was noticed
that this format is still used frequently in debugging tools.
Not having a tenant ID tag in the format therefore becomes
confusing when analyzing logs for clusters running multiple
tenants per-process, since the tenant IDs are missing per-
entry.

This patch updates the `crdb_v1` format to now always include
a tenant ID tag as the first tag, making use of its existing
tags structure. As a result, debug tools making use of the
crdb_v1 format will be compatible with upcoming features such
as filtering based on tenant ID.

Release note (ops change): crdb_v1 formatted logs will now
always contain a tag containing the tenant ID that the log
entry pertains to. For example, for tenant ID 2, we include
the tag `T2`:
```
W000101 00:00:12.300000 456 somefile.go:136  [T2,nsql?,othertag=somevalue] 2 this is a log message
```
If decoding an older crdb_v1 log message that does not contain
a tenant ID tag, then we default to the primary system tenant.

Previously, a crdb_v2 log entry without any tags would use `[-]` in
the log entry to indicate that the tags for the entry were empty.

With recent changes ensuring that all logs are tagged with the
tenant ID (crdb_v1, crdb_v2, json), we never expect a log entry
with empty tags.

When decoding the crdb_v2 format, we forgot to update the decoder
to no longer set "-" in the tags field when no tags other than
the tenant ID tag were available. This does not play nicely
when re-formatting the logpb.Entry back into crdb_v1 format
(commonly done in our debugging tools such as debug merge-logs)
since it would create entries with tag strings such as:
```
[T1,-]
```

This patch updates the crdb_v2 format's decoder to no longer
use `-` to denote empty tags. Instead, an empty string on the
tags field is used which is much more compatible with writing
the entry back into the crdb_v1 format.

Note that the decoder still properly interprets `-` as an
empty tags list, in case a newer CRDB version is used to
decode crdb_v2 logs generated from an older version.

Release note: none
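The backward-compatible decode behavior described in this commit message amounts to a small normalization step. A sketch, with a hypothetical function name:

```go
package main

import "fmt"

// normalizeTags maps the legacy "-" marker for an empty tag list to the
// empty string, so that re-encoding into crdb_v1 never produces tag
// strings like [T1,-].
func normalizeTags(tags string) string {
	if tags == "-" {
		return ""
	}
	return tags
}

func main() {
	fmt.Printf("%q %q\n", normalizeTags("-"), normalizeTags("n1,client"))
	// prints `"" "n1,client"`
}
```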
Collaborator

@dhartunian dhartunian left a comment


:lgtm: I think the last commit msg needs revising.

Reviewed 35 of 35 files at r7, 2 of 2 files at r8, 1 of 31 files at r9, 2 of 36 files at r10, 2 of 2 files at r11.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @knz)


-- commits line 99 at r13:
nit: This is no longer true, right?

We need to ensure that any log-related debug endpoints that
can be used by tenants (especially those available in DB
Console) filter to the specific tenant that made the request.
This is important to respect tenant boundaries when it comes
to logging data.

This patch updates the `debug/logspy` endpoint to filter logs
to only include those belonging to the tenant handling the
request.

For the system tenant, we apply no such filtering. The system
tenant has sufficient privilege to view logs across tenants,
so we avoid filtering in this case.

Another thing of note: previously, the `log.Interceptor`
had no way to generate errors itself. Instead, any errors
would be created after consuming off the channel that the
interceptor produced into; specifically, whenever we failed
to unmarshal the raw JSON bytes into a `logpb.Entry`, which
was not a tolerated error.

Now that the interceptor filters on tenant ID, it needs to
unmarshal the raw JSON bytes for each log entry into a
`logpb.Entry` so that it can filter against the tenant ID.
This creates an opportunity for error within the interceptor
that didn't exist before, and as before, we do not want to
tolerate such errors in the logSpy system. As a solution,
in the payload that we send from the interceptor to the log
spy consumer, we optionally pass along any unmarshal errors
for the consumer to handle (which in this case, the consumer
does not tolerate).

Release note: none
Contributor Author

@abarganier abarganier left a comment


TFTR 💜

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @dhartunian, and @knz)


-- commits line 99 at r13:

Previously, dhartunian (David Hartunian) wrote…

nit: This is no longer true, right?

Thanks for catching! Fixed

@abarganier abarganier merged commit 5865718 into cockroachdb:master Jan 12, 2023
@abarganier abarganier deleted the log_tenant_filter branch January 12, 2023 18:25
craig bot pushed a commit that referenced this pull request Jan 13, 2023
95016: pkg/cli: add tenant ID filter flag to `debug merge-logs` r=dhartunian a=abarganier

NB: Only review the final commit - all previous commits are being reviewed in #94777

With multiple tenants running per-process, log files are
shared between tenants where tenant ID tags are used to
distinguish which tenant a log entry originated from.

When dealing with logs, especially in a debug.zip bundle,
we need to provide tools that make it easy to filter logs
to a specific tenant, or group of tenants.

This patch updates the `debug merge-logs` command with a
`--tenant-ids` flag which can be used to filter merged
logs to specific tenant ID(s). For example, to only include
logs for app tenants 2 and 3:

```
cockroach debug merge-logs ... --tenant-ids=2,3
```

Release note (cli change): The `debug merge-logs`
CLI command is now equipped with a `--tenant-ids`
flag, which can be used to provide the command with
a list of tenant IDs to filter logs on. For example,
to only include logs for secondary tenants 2 and 3:

```
cockroach debug merge-logs ... --tenant-ids=2,3
```

Note that when processing logs from older cluster
versions, where a tenant ID tag may not be present
on the log entry, the command will interpret any
log without a tenant ID as belonging to the system
tenant (tenant ID = 1).

Epic: CRDB-14486

Addresses: https://cockroachlabs.atlassian.net/browse/CRDB-22152
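The filtering implied by a flag like `--tenant-ids=2,3` can be sketched as a set-membership check over decoded entries. The types and names here are illustrative, not the actual `debug merge-logs` implementation:

```go
package main

import (
	"fmt"
	"strings"
)

type entry struct {
	TenantID string
	Message  string
}

// filterByTenants keeps only entries whose tenant ID is in the
// comma-separated allow list. Entries without an ID default to the
// system tenant ("1"), matching the behavior described in the
// release note for older, untagged logs.
func filterByTenants(entries []entry, idList string) []entry {
	allowed := map[string]bool{}
	for _, id := range strings.Split(idList, ",") {
		allowed[strings.TrimSpace(id)] = true
	}
	var out []entry
	for _, e := range entries {
		id := e.TenantID
		if id == "" {
			id = "1" // untagged entry from an older version: system tenant
		}
		if allowed[id] {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	entries := []entry{{"1", "sys"}, {"2", "a"}, {"3", "b"}, {"", "old"}}
	fmt.Println(len(filterByTenants(entries, "2,3"))) // prints "2"
}
```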

95192: roachprod: disable ext4 reserved space r=herkolategan a=srosenberg

By default, ext4 reserves 5% of blocks for the root user. The main rationale for this historical default is to reserve space for (a) important root processes, and (b) de-fragmentation. However, when this concerns disks which are used to store SST files, there is seemingly no good reason for this wasted space. Furthermore, debugging an out-of-disk space event becomes more surprising without a direct knowledge of this default.

Consequently, it was decided in favor of removing this reserved space, but only for data disks. Please see the corresponding GH issue for more details.

Epic: none
Fixes: #93120

Release note: None

Co-authored-by: Alex Barganier <[email protected]>
Co-authored-by: Stan Rosenberg <[email protected]>