
pkg/server/debug: add tenant ID filtering to debug/logspy #94777

Merged

merged 3 commits into cockroachdb:master from log_tenant_filter on Jan 12, 2023

Conversation

abarganier
Contributor

@abarganier abarganier commented Jan 5, 2023

NB: The first commit updates crdb_v1 to always tag log entries with the tenant ID. See commit description for details.

We need to ensure that any log-related debug endpoints that
can be used by tenants (especially those available in DB
Console) filter to the specific tenant that made the request.
This is important to respect tenant boundaries when it comes
to logging data.

This patch updates the debug/logspy endpoint to filter logs
to only include those belonging to the tenant handling the
request.

For the system tenant, we apply no such filtering. The system
tenant has sufficient privilege to view logs across tenants,
so we avoid filtering in this case.

Another thing of note: previously, the log.Interceptor
had no way to generate errors itself. Instead, any errors
would be created after consuming off the channel that the
interceptor produced into; specifically, whenever we failed
to unmarshal the raw JSON bytes into a logpb.Entry, which
was not a tolerated error.

Now that the interceptor filters on tenant ID, it needs to
unmarshal the raw JSON bytes for each log entry into a
logpb.Entry so that it can filter against the tenant ID.
This creates an opportunity for error within the interceptor
that didn't exist before, and as before, we do not want to
tolerate such errors in the logSpy system. However, the
interceptor's interface does not allow us to return an error.

As a solution, the logSpyInterceptor now has an error channel
that the Interceptor can produce into if any errors occur
when unmarshalling JSON. The logSpy's run routine will
exit upon receiving an error on this channel. The behavior
should effectively be the same: we didn't tolerate parse
errors before, and we don't with these updates either.

Release note: none

Epic: CRDB-14486

Addresses #94779
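The interceptor-plus-error-channel design described in the commit message could be sketched roughly as follows. This is an illustrative miniature, not the actual CockroachDB code: the type and field names (`logSpyInterceptor`, `entry`, `errCh`) are hypothetical stand-ins, and the real `log.Interceptor` and `logpb.Entry` interfaces differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// entry stands in for logpb.Entry; only the field we filter on is shown.
type entry struct {
	TenantID string `json:"tenant_id"`
	Message  string `json:"message"`
}

// logSpyInterceptor filters raw JSON log lines down to a single tenant.
// Because the Intercept signature returns no error, unmarshal failures
// are reported on errCh instead, per the design described above.
type logSpyInterceptor struct {
	tenantID string
	entries  chan entry
	errCh    chan error
}

func (i *logSpyInterceptor) Intercept(raw []byte) {
	var e entry
	if err := json.Unmarshal(raw, &e); err != nil {
		i.errCh <- fmt.Errorf("could not unmarshal entry: %w", err)
		return
	}
	if e.TenantID == i.tenantID {
		i.entries <- e
	}
}

func main() {
	i := &logSpyInterceptor{
		tenantID: "2",
		entries:  make(chan entry, 8), // buffered so Intercept never blocks here
		errCh:    make(chan error, 8),
	}
	i.Intercept([]byte(`{"tenant_id":"2","message":"kept"}`))
	i.Intercept([]byte(`{"tenant_id":"3","message":"filtered"}`))
	i.Intercept([]byte(`not json`))
	fmt.Println(len(i.entries), len(i.errCh)) // prints "1 1"
}
```

As the review below notes, this exact design was later dropped in favor of passing the error along in the payload itself.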

@abarganier abarganier requested review from knz, dhartunian, aadityasondhi and a team January 5, 2023 19:15
@abarganier abarganier requested a review from a team as a code owner January 5, 2023 19:15
@cockroach-teamcity
Member

This change is Reviewable

@abarganier abarganier changed the title from "pkg/util/log: tags crdb_v1 format logs with tenant ID" to "pkg/server/debug: add tenant ID filtering to debug/logspy" on Jan 5, 2023
@abarganier abarganier force-pushed the log_tenant_filter branch 3 times, most recently from e427a92 to 2a06808 Compare January 6, 2023 17:39
Collaborator

@dhartunian dhartunian left a comment


Reviewed 3 of 3 files at r1, 2 of 2 files at r2, 2 of 31 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @knz)


-- commits line 104 at r3:
two questions here:

  1. Why even write errors to a channel? Can't we just log them and continue?
  2. Why should an error cause an exit in logspy? Can we just continue with the rest of the logs?

pkg/server/debug/logspy.go line 293 at r3 (raw file):

	var entry logpb.Entry
	if err := json.Unmarshal(jsonEntry, &entry); err != nil {
		i.errCh <- errors.NewAssertionErrorWithWrappedErrf(err, "could not unmarshal entry: %s", jsonEntry)

this channel is initialized unbuffered, which means this call may block. is that the behavior we want? I'm not entirely sure how Intercept works but if it's in the path of logging, would that block logging?


pkg/server/debug/server.go line 147 at r3 (raw file):

		panic("programmer error: failed parsing ambientCtx tenantID during debug.Server initialization")
	}
	tID, err := roachpb.MakeTenantID(parsed)

you can use MustMakeTenantID here if you want.


pkg/util/log/format_crdb_v1.go line 438 at r1 (raw file):

		}
		// Process the remaining tags.
		tagsWithoutTenantID := v2TenantIdTagSelect.ReplaceAllString(string(m[7]), "")

can you perhaps rename v2TenantIdTagSelect since it's now used for v1 as well?

Contributor Author

@abarganier abarganier left a comment


Thanks for the feedback, PTAL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @dhartunian, and @knz)


-- commits line 104 at r3:

Previously, dhartunian (David Hartunian) wrote…

two questions here:

  1. Why even write errors to a channel? Can't we just log them and continue?
  2. Why should an error cause an exit in logspy? Can we just continue with the rest of the logs?

My answers:

1. It changes the behavior a bit (before, we didn't tolerate unmarshal errors), but I don't see it as a big deal to change this, especially considering the introduction of the error channel comes with complexity & risk! Updated to just log, count the entry as dropped, and continue.
2. It was the old behavior (not tolerating unmarshal errors) that I was trying to maintain for no good reason, but now we will just log & record that the message was dropped and continue.

pkg/server/debug/logspy.go line 293 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

this channel is initialized unbuffered, which means this call may block. is that the behavior we want? I'm not entirely sure how Intercept works but if it's in the path of logging, would that block logging?

Removed the errCh entirely.


pkg/server/debug/server.go line 147 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

you can use MustMakeTenantID here if you want.

Done - thank you!


pkg/util/log/format_crdb_v1.go line 438 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

can you perhaps rename v2TenantIdTagSelect since it's now used for v1 as well?

Good idea - done!

Contributor

@knz knz left a comment


Good work on the architecture. No comment there.
I've noticed two small things that I believe are important.

Reviewed 2 of 3 files at r1, 34 of 34 files at r4, 2 of 2 files at r5, 31 of 31 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @dhartunian)


pkg/server/debug/logspy.go line 288 at r6 (raw file):

	if err := json.Unmarshal(jsonEntry, &entry); err != nil {
		// Failed to unmarshal, so just log & mark the entry as dropped and continue.
		log.Errorf(i.ctx, "logspy failed to unmarshal entry: %v - err: %v", entry, err)

that will cause a circular re-entry into logging and the interceptor, resulting in an infinite call stack (and a crash).
IMHO if you encounter an error here you should just inject it into i.jsonEntry too.


pkg/util/log/format_crdb_v1.go line 438 at r4 (raw file):

		}
		// Process the remaining tags.
		tagsWithoutTenantID := crdbFmtTenantIdTagSelect.ReplaceAllString(string(m[7]), "")

since you guarantee the tag is at the beginning, could you perhaps do a slice truncation (up to the first `,`) instead of applying a regex? the regex is an expensive solution and it results in a heap allocation, which i believe is not warranted here.

Contributor Author

@abarganier abarganier left a comment


TFTR everyone, RFAL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @dhartunian, and @knz)


pkg/server/debug/logspy.go line 288 at r6 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

that will cause a circular re-entry into logging and the interceptor, resulting in an infinite call stack (and a crash).
IMHO if you encounter an error here you should just inject it into i.jsonEntry too.

Thanks for catching this! If an error occurs here we now pass it along in the payload to i.jsonEntry, and the consumer will return the error once consumed. Let me know if I misunderstood your suggestion.


pkg/util/log/format_crdb_v1.go line 438 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

since you guarantee the tag is at the beginning, could you perhaps do a slice truncation (up to the first `,`) instead of applying a regex? the regex is an expensive solution and it results in a heap allocation, which i believe is not warranted here.

Thanks for the advice - removed the TenantID tag regex for both the crdb_v1 and crdb_v2 formats.

@abarganier abarganier requested a review from knz January 11, 2023 18:45
Contributor

@knz knz left a comment


Reviewed 2 of 35 files at r7, 2 of 2 files at r8, 31 of 31 files at r9, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @dhartunian)


pkg/util/log/format_crdb_v1.go line 431 at r9 (raw file):

		// Look for a tenant ID tag. Default to system otherwise.
		entry.TenantID = systemTenantID
		tagsToProcess := string(m[7])

Please avoid the conversion to string which forces a heap allocation.
Instead, keep []byte and use bytes.HasPrefix and bytes.IndexByte to search for the comma.
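The allocation-free approach being suggested might look like the following sketch. The tag layout (`T2,n1,client`) is taken from the examples in this thread, but the helper name and exact signature are hypothetical, not the merged code:

```go
package main

import (
	"bytes"
	"fmt"
)

// splitTenantTag extracts the leading tenant ID tag from a tag list such
// as []byte("T2,n1,client"), returning the ID and the remaining tags. It
// stays on []byte (bytes.HasPrefix / bytes.IndexByte), so no intermediate
// string is allocated; only the final, small ID conversion allocates.
func splitTenantTag(tags []byte) (tenantID string, rest []byte) {
	if !bytes.HasPrefix(tags, []byte("T")) {
		return "1", tags // no tenant tag: default to the system tenant
	}
	if i := bytes.IndexByte(tags, ','); i >= 0 {
		return string(tags[1:i]), tags[i+1:] // more tags follow the tenant ID
	}
	return string(tags[1:]), nil // the tenant ID was the only tag
}

func main() {
	id, rest := splitTenantTag([]byte("T2,n1,client"))
	fmt.Printf("%s %s\n", id, rest) // prints "2 n1,client"
}
```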


pkg/util/log/format_crdb_v2.go line 700 at r9 (raw file):

	// Strip out the tenant ID tag. We handle it separately in getTenantID().
	if strings.HasPrefix(tagsStr, TenantIDLogTagKey) {
		tenantIDSplit := strings.SplitN(tagsStr, ",", 2)

strings.IndexByte

@abarganier abarganier force-pushed the log_tenant_filter branch 2 times, most recently from c093bbc to c1a92a3 Compare January 11, 2023 20:52
Contributor Author

@abarganier abarganier left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @dhartunian, and @knz)


pkg/util/log/format_crdb_v1.go line 431 at r9 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Please avoid the conversion to string which forces a heap allocation.
Instead, keep []byte and use bytes.HasPrefix and bytes.IndexByte to search for the comma.

Thanks for being vigilant here! Done. Let me know if you think there's any more efficiency to be gained.


pkg/util/log/format_crdb_v2.go line 700 at r9 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

strings.IndexByte

Done.

@abarganier abarganier requested a review from knz January 11, 2023 21:06
Contributor

@knz knz left a comment


Reviewed 2 of 36 files at r10, 1 of 2 files at r11, 2 of 2 files at r12, 31 of 31 files at r13, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @dhartunian)


pkg/util/log/format_crdb_v1.go line 437 at r13 (raw file):

			if commaIndex >= 0 {
				// More tags exist than just the tenant ID tag.
				entry.TenantID = string(tagsToProcess[1:commaIndex])

I wonder if we can also get rid of this conversion. The tenant ID is a number. Can we maybe extract the number from the string and have an int instead of string in entry?
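One way to drop that conversion entirely, sketched as a standalone helper (the reply below defers this to a follow-up PR, so this is not the merged code), is to fold the decimal digits straight out of the byte slice:

```go
package main

import "fmt"

// parseTenantID folds decimal digits out of a byte slice, e.g. the "2"
// in a "T2" tag, without allocating an intermediate string.
func parseTenantID(b []byte) (uint64, bool) {
	if len(b) == 0 {
		return 0, false
	}
	var n uint64
	for _, c := range b {
		if c < '0' || c > '9' {
			return 0, false
		}
		n = n*10 + uint64(c-'0')
	}
	return n, true
}

func main() {
	n, ok := parseTenantID([]byte("42"))
	fmt.Println(n, ok) // prints "42 true"
}
```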

Contributor Author

@abarganier abarganier left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @dhartunian, and @knz)


pkg/util/log/format_crdb_v1.go line 437 at r13 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I wonder if we can also get rid of this conversion. The tenant ID is a number. Can we maybe extract the number from the string and have an int instead of string in entry?

I'm open to the idea! I think I chose to use a string because the ID is served as a string from the idPayload in the context when encoding, but the decode side doesn't need to be influenced by the encode side of things. It would require some broader changes that fall outside the scope of this PR, so I'd like to handle that in its own thread.

I already got a couple good pieces of feedback from Andrei on the tenant ID log tagging stuff that I'm going to address (kind of like a "various QoL improvements" PR). I think this suggestion would fit nicely alongside it if you're okay with that.

@abarganier abarganier requested a review from knz January 11, 2023 21:21
Contributor

@knz knz left a comment


yes that's good. I'm pulling on this thread because we know that log parsing has historically been a perf bottleneck so it's a worthy area to be careful about.

@abarganier
Contributor Author

I'm pulling on this thread because we know that log parsing has historically been a perf bottleneck so it's a worthy area to be careful about.

It's very much appreciated 🙏 thanks for the review!

In a previous patch, we updated all log formats to always
tag log entries with tenant IDs to prepare ourselves for
multiple tenants per-process.

When we did this work initially, the decision was made to
not tag `crdb_v1` formatted logs with the tenant ID, since
this is an older format that's not expected to be used in
clusters running multiple tenants per-process.

However, after doing some additional work, it was noticed
that this format is still used frequently in debugging tools.
Not having a tenant ID tag in the format therefore becomes
confusing when analyzing logs for clusters running multiple
tenants per-process, since the tenant IDs are missing per-
entry.

This patch updates the `crdb_v1` format to now always include
a tenant ID tag as the first tag, making use of its existing
tags structure. As a result, debug tools making use of the
crdb_v1 format will be compatible with upcoming features such
as filtering based on tenant ID.

Release note (ops change): crdb_v1 formatted logs will now
always contain a tag containing the tenant ID that the log
entry pertains to. For example, for tenant ID 2, we include
the tag `T2`:
```
W000101 00:00:12.300000 456 somefile.go:136  [T2,nsql?,othertag=somevalue] 2 this is a log message
```
If decoding an older crdb_v1 log message that does not contain
a tenant ID tag, then we default to the primary system tenant.

Previously, a crdb_v2 log entry without any tags would use `[-]` in
the log entry to indicate that the tags for the entry were empty.

With recent changes ensuring that all logs are tagged with the
tenant ID (crdb_v1, crdb_v2, json), we never expect a log entry
with empty tags.

When decoding the crdb_v2 format, we forgot to update the decoder
to no longer set "-" in the tags field when no tags other than
the tenant ID tag were available. This does not play nicely
when re-formatting the logpb.Entry back into crdb_v1 format
(commonly done in our debugging tools such as debug merge-logs)
since it would create entries with tag strings such as:
```
[T1,-]
```

This patch updates the crdb_v2 format's decoder to no longer
use `-` to denote empty tags. Instead, an empty string on the
tags field is used which is much more compatible with writing
the entry back into the crdb_v1 format.

Note that the decoder still properly interprets `-` as an
empty tags list, in case a newer CRDB version is used to
decode crdb_v2 logs generated from an older version.

Release note: none
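The backward-compatible decode behavior described in this commit message amounts to a small normalization step. A sketch, with a hypothetical function name:

```go
package main

import "fmt"

// normalizeTags maps the legacy "-" marker for an empty tag list to the
// empty string, so that re-encoding into crdb_v1 never produces tag
// strings like [T1,-].
func normalizeTags(tags string) string {
	if tags == "-" {
		return ""
	}
	return tags
}

func main() {
	fmt.Printf("%q %q\n", normalizeTags("-"), normalizeTags("n1,client"))
	// prints `"" "n1,client"`
}
```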
Collaborator

@dhartunian dhartunian left a comment


:lgtm: I think the last commit msg needs revising.

Reviewed 35 of 35 files at r7, 2 of 2 files at r8, 1 of 31 files at r9, 2 of 36 files at r10, 2 of 2 files at r11.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, and @knz)


-- commits line 99 at r13:
nit: This is no longer true, right?

We need to ensure that any log-related debug endpoints that
can be used by tenants (especially those available in DB
Console) filter to the specific tenant that made the request.
This is important to respect tenant boundaries when it comes
to logging data.

This patch updates the `debug/logspy` endpoint to filter logs
to only include those belonging to the tenant handling the
request.

For the system tenant, we apply no such filtering. The system
tenant has sufficient privilege to view logs across tenants,
so we avoid filtering in this case.

Another thing of note: previously, the `log.Interceptor`
had no way to generate errors itself. Instead, any errors
would be created after consuming off the channel that the
interceptor produced into; specifically, whenever we failed
to unmarshal the raw JSON bytes into a `logpb.Entry`, which
was not a tolerated error.

Now that the interceptor filters on tenant ID, it needs to
unmarshal the raw JSON bytes for each log entry into a
`logpb.Entry` so that it can filter against the tenant ID.
This creates an opportunity for error within the interceptor
that didn't exist before, and as before, we do not want to
tolerate such errors in the logSpy system. As a solution,
in the payload that we send from the interceptor to the log
spy consumer, we optionally pass along any unmarshal errors
for the consumer to handle (which in this case, the consumer
does not tolerate).

Release note: none
Contributor Author

@abarganier abarganier left a comment


TFTR 💜

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @dhartunian, and @knz)


-- commits line 99 at r13:

Previously, dhartunian (David Hartunian) wrote…

nit: This is no longer true, right?

Thanks for catching! Fixed

@abarganier abarganier merged commit 5865718 into cockroachdb:master Jan 12, 2023
@abarganier abarganier deleted the log_tenant_filter branch January 12, 2023 18:25
craig bot pushed a commit that referenced this pull request Jan 13, 2023
95016: pkg/cli: add tenant ID filter flag to `debug merge-logs` r=dhartunian a=abarganier

NB: Only review the final commit - all previous commits are being reviewed in #94777

With multiple tenants running per-process, log files are
shared between tenants where tenant ID tags are used to
distinguish which tenant a log entry originated from.

When dealing with logs, especially in a debug.zip bundle,
we need to provide tools that make it easy to filter logs
to a specific tenant, or group of tenants.

This patch updates the `debug merge-logs` command with a
`--tenant-ids` flag which can be used to filter merged
logs to specific tenant ID(s). For example, to only include
logs for app tenants 2 and 3:

```
cockroach debug merge-logs ... --tenant-ids=2,3
```

Release note (cli change): The `debug merge-logs`
CLI command is now equipped with a `--tenant-ids`
flag, which can be used to provide the command with
a list of tenant IDs to filter logs on. For example,
to only include logs for secondary tenants 2 and 3:

```
cockroach debug merge-logs ... --tenant-ids=2,3
```

Note that when processing logs from older cluster
versions, where a tenant ID tag may not be present
on the log entry, the command will interpret any
log without a tenant ID as belonging to the system
tenant (tenant ID = 1).

Epic: CRDB-14486

Addresses: https://cockroachlabs.atlassian.net/browse/CRDB-22152
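The filtering implied by a flag like `--tenant-ids=2,3` can be sketched as a set-membership check over decoded entries. The types and names here are illustrative, not the actual `debug merge-logs` implementation:

```go
package main

import (
	"fmt"
	"strings"
)

type entry struct {
	TenantID string
	Message  string
}

// filterByTenants keeps only entries whose tenant ID is in the
// comma-separated allow list. Entries without an ID default to the
// system tenant ("1"), matching the behavior described in the
// release note for older, untagged logs.
func filterByTenants(entries []entry, idList string) []entry {
	allowed := map[string]bool{}
	for _, id := range strings.Split(idList, ",") {
		allowed[strings.TrimSpace(id)] = true
	}
	var out []entry
	for _, e := range entries {
		id := e.TenantID
		if id == "" {
			id = "1" // untagged entry from an older version: system tenant
		}
		if allowed[id] {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	entries := []entry{{"1", "sys"}, {"2", "a"}, {"3", "b"}, {"", "old"}}
	fmt.Println(len(filterByTenants(entries, "2,3"))) // prints "2"
}
```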

95192: roachprod: disable ext4 reserved space r=herkolategan a=srosenberg

By default, ext4 reserves 5% of blocks for the root user. The main rationale for this historical default is to reserve space for (a) important root processes, and (b) de-fragmentation. However, when this concerns disks which are used to store SST files, there is seemingly no good reason for this wasted space. Furthermore, debugging an out-of-disk space event becomes more surprising without a direct knowledge of this default.

Consequently, it was decided in favor of removing this reserved space, but only for data disks. Please see the corresponding GH issue for more details.

Epic: none
Fixes: #93120

Release note: None

Co-authored-by: Alex Barganier <[email protected]>
Co-authored-by: Stan Rosenberg <[email protected]>