ccl/sqlproxyccl: improve connection logging behavior #134613
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
efce681
to
065c944
Compare
Previously, for every successful connection, we would get at least 6 log lines:
With this change, this is what I'd expect to get (i.e. the number of log lines trimmed by half):
Similarly, for a connection with auth failure: Before:
After:
Note that the above scenarios are purely speculation and the desired behavior. Will have to test this manually to validate.

Update: Tested the above, and it works as expected.
Previously, the connection logging behavior in the proxy had several issues that led to unnecessary log spam:

1. **Excessive "registerAssignment" and "unregisterAssignment" logs**: These logs were emitted for every connection attempt (including migration) and were not useful under normal operation.
2. **Redundant error logging**: Most errors were logged twice -- once in the `handle` method, and again when it returns.
3. **Unfiltered error hints**: User-facing errors containing hints were logged line by line, cluttering the logs.
4. **Lack of context in error logs**: Errors logged from the proxy lacked tenant and cluster context, which made troubleshooting more difficult.

This commit addresses these issues as follows:

1. Reduced logging: "registerAssignment" and "unregisterAssignment" logs are now only shown with vmodule logging enabled.
2. Error logging improvements: `handle` no longer logs errors (with the exception of some cases); the caller is now responsible for logging. Additionally, hints attached to errors are no longer logged; only the main error is recorded. When those errors are logged, they will now include the tenant and cluster information where possible.

No release note as this is an internal change.

Epic: none
Release note: None
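The "caller is responsible for logging" arrangement described in this commit message can be illustrated with a minimal, standard-library-only sketch. Everything here (`connInfo`, `handle`, `serve`, `errAuthFailed`, and the tag format) is a hypothetical stand-in and not the proxy's actual code; the point is only that the inner function returns the error while the outer function emits one line that carries the cluster and tenant context.

```go
package main

import (
	"errors"
	"fmt"
)

// connInfo carries the context attached to the single log line. The type and
// its fields are hypothetical names chosen for this sketch.
type connInfo struct {
	cluster string
	tenant  uint64
}

// errAuthFailed is a hypothetical sentinel error.
var errAuthFailed = errors.New("authentication failed")

// handle does the work for one connection and returns any error to the
// caller without logging it.
func handle(authOK bool) error {
	if !authOK {
		return fmt.Errorf("authenticating: %w", errAuthFailed)
	}
	return nil
}

// serve is the only place that logs. It returns the lines it would emit so
// that the behavior is easy to check.
func serve(info connInfo, authOK bool) []string {
	var lines []string
	if err := handle(authOK); err != nil {
		// Exactly one line per failed connection, tagged with context.
		lines = append(lines, fmt.Sprintf(
			"[cluster=%s,tenant=%d] connection closed: %v",
			info.cluster, info.tenant, err))
	}
	return lines
}

func main() {
	for _, line := range serve(connInfo{cluster: "example-cluster", tenant: 7}, false) {
		fmt.Println(line)
	}
}
```

Keeping the logging in one place makes it straightforward to attach context once, instead of repeating it at every error site.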
ea30f75
to
981066a
Compare
ccl/sqlproxyccl: rate limit throttler errors from being logged

Previously, each request refused by the throttler would result in a "throttler refused connection" message being logged, generating a log entry for every rejected request.

The throttler service is responsible for rate limiting invalid login attempts from IP addresses, and in practice, it can generate a high volume of such traffic. To address this, errors are now rate limited in the logs to occur once every 5 minutes per (IP, tenant) pair, ensuring that only one log entry is generated within that time frame.

This change is internal, so no release note is required.

Epic: none
Release note: None
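As a rough illustration of limiting log entries to one per interval per (IP, tenant) pair, here is a standard-library sketch. It is not the proxy's implementation: `logKey`, `keyedLogLimiter`, `shouldLog`, and `demo` are hypothetical names, and the map is never pruned, which a real service would need to handle.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// logKey identifies one (IP, tenant) pair. Hypothetical type for this sketch.
type logKey struct {
	ip     string
	tenant uint64
}

// keyedLogLimiter allows at most one log entry per key per interval.
// Note: entries are never evicted in this sketch.
type keyedLogLimiter struct {
	mu       sync.Mutex
	interval time.Duration
	last     map[logKey]time.Time
}

func newKeyedLogLimiter(interval time.Duration) *keyedLogLimiter {
	return &keyedLogLimiter{interval: interval, last: make(map[logKey]time.Time)}
}

// shouldLog reports whether a message for key may be logged at time now, and
// records now as the last log time only when it returns true.
func (l *keyedLogLimiter) shouldLog(key logKey, now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if prev, ok := l.last[key]; ok && now.Sub(prev) < l.interval {
		return false
	}
	l.last[key] = now
	return true
}

// demo replays four refused requests against a 5-minute limiter and returns
// the decision made for each one.
func demo() []bool {
	limiter := newKeyedLogLimiter(5 * time.Minute)
	a := logKey{ip: "10.0.0.1", tenant: 42}
	b := logKey{ip: "10.0.0.2", tenant: 42}
	start := time.Date(2024, 11, 1, 0, 0, 0, 0, time.UTC)
	return []bool{
		limiter.shouldLog(a, start),                    // first entry for a
		limiter.shouldLog(a, start.Add(time.Minute)),   // suppressed
		limiter.shouldLog(b, start.Add(time.Minute)),   // different pair
		limiter.shouldLog(a, start.Add(6*time.Minute)), // interval elapsed
	}
}

func main() {
	fmt.Println(demo())
}
```

Passing the current time in as an argument keeps the limiter deterministic under test, in the same spirit as the injectable `timeSource` seen in the connection tracker.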
981066a
to
737dccb
Compare
if throttle.everyLog.ShouldLog() {
	// ctx should include logtags about the connection.
	log.Error(ctx, "throttler refused connection due to too many failed authentication attempts")
}
If we didn't want the service to probe into the internals of the throttle struct (e.g. everyLog), this could be its own method on throttle (e.g. reportThrottled), but I don't feel strongly.
Some basic questions, this is all new code to me
@@ -49,7 +54,7 @@ func NewConnTracker(
 		timeSource = timeutil.DefaultTimeSource{}
 	}

-	t := &ConnTracker{timeSource: timeSource}
+	t := &ConnTracker{timeSource: timeSource, verboseLogging: log.V(2)}
What is log level 2 and how did you pick it? Are there corresponding constants like error/warn/info/etc.?
Unfortunately, I don't think we have any rule of thumb around vmodule logging levels. See this internal Slack thread: https://cockroachlabs.slack.com/archives/C9TGBJB44/p1614100926026700. Each vmodule logging level seems to be specific to the file itself. I picked 2 somewhat arbitrarily; it's a middle value, and opens up opportunities for something less/more verbose as well.
For this case, if we wanted to display logs from the connection tracker, we would start the proxy with the following flag: --vmodule=conn_tracker=2 (i.e. --vmodule=FILE=LEVEL,FILE2=LEVEL2,...).
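To make the FILE=LEVEL idea concrete, the following standard-library sketch parses a spec of that shape and evaluates the verbosity check once at construction time, similar in spirit to storing `log.V(2)` in `verboseLogging`. It is a simplified stand-in, not the CockroachDB `log` package: `parseVModule`, `verbose`, and `connTracker` are hypothetical, and file names are matched exactly.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVModule parses a spec of the form FILE=LEVEL,FILE2=LEVEL2 into a map
// from file name to verbosity level. Hypothetical helper; exact match only.
func parseVModule(spec string) (map[string]int, error) {
	levels := make(map[string]int)
	if spec == "" {
		return levels, nil
	}
	for _, entry := range strings.Split(spec, ",") {
		file, level, ok := strings.Cut(entry, "=")
		if !ok {
			return nil, fmt.Errorf("malformed vmodule entry %q", entry)
		}
		n, err := strconv.Atoi(level)
		if err != nil {
			return nil, fmt.Errorf("bad level in %q: %w", entry, err)
		}
		levels[file] = n
	}
	return levels, nil
}

// verbose reports whether logging at the given level is enabled for file.
// Files that are not listed default to level 0.
func verbose(levels map[string]int, file string, level int) bool {
	return levels[file] >= level
}

// connTracker evaluates the verbosity check once, when it is constructed.
type connTracker struct {
	verboseLogging bool
}

func newConnTracker(levels map[string]int) *connTracker {
	return &connTracker{verboseLogging: verbose(levels, "conn_tracker", 2)}
}

func main() {
	levels, err := parseVModule("conn_tracker=2")
	if err != nil {
		panic(err)
	}
	if newConnTracker(levels).verboseLogging {
		fmt.Println("verbose connection tracker logging is enabled")
	}
}
```

Because the check is captured at construction, changing the verbosity later would not affect a tracker that already exists in this sketch.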
	return errors.Wrap(err, "extracting cluster identifier")
}

// Validate the incoming connection and ensure that the cluster name
Why did this block of code move?
I'd like to ensure that we only add logtags after validating the connection. If we left the validation in its original location, we may end up having a log line with a mismatched cluster name and tenant ID, leading to confusion during debugging. See the block that comes after the validation:
// Only add logtags after validating the connection. If the connection isn't
// validated, clusterName may not match the tenant ID, and this could cause
// confusion when analyzing logs.
ctx = logtags.AddTag(ctx, "cluster", clusterName)
ctx = logtags.AddTag(ctx, "tenant", tenID)
The proxy uses the tenant ID to spin up a new SQL server. Back then, there were concerns that users could iterate through --cluster=random-name-1 to --cluster=random-name-N, causing the operator to inadvertently provision resources for those tenants, and leading to potential resource exhaustion on the host. To address that, we implemented a validation check to ensure that "random-name" matches the actual cluster name associated with that tenant ID (as seen in the DnsPrefix field in the CrdbTenant spec, or the tenants table in CC). This validation guarantees that the user connecting to tenant N has some knowledge of the cluster, beyond just the tenant ID, and prevents random or malicious requests from triggering resource allocation.
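The ordering argument (validate first, tag afterwards) can be sketched with the standard library alone. The `directory` type, `validate`, `tagsFor`, and `errClusterMismatch` below are hypothetical stand-ins for the real tenant lookup and logtags calls; the sketch only shows that tags are produced exclusively for a cluster name that matches the tenant ID.

```go
package main

import (
	"errors"
	"fmt"
)

// directory maps tenant IDs to the cluster name registered for them. It is a
// hypothetical stand-in for the real tenant lookup.
type directory map[uint64]string

// errClusterMismatch is a hypothetical sentinel error.
var errClusterMismatch = errors.New("cluster name does not match tenant")

// validate checks that clusterName is the name registered for tenantID.
func (d directory) validate(clusterName string, tenantID uint64) error {
	want, ok := d[tenantID]
	if !ok || want != clusterName {
		return errClusterMismatch
	}
	return nil
}

// tagsFor returns log tags for a connection, but only after validation has
// passed, so a tag set never pairs a tenant ID with a name that is not its own.
func tagsFor(d directory, clusterName string, tenantID uint64) (map[string]string, error) {
	if err := d.validate(clusterName, tenantID); err != nil {
		return nil, err
	}
	return map[string]string{
		"cluster": clusterName,
		"tenant":  fmt.Sprint(tenantID),
	}, nil
}

func main() {
	d := directory{10: "example-cluster"}
	for _, name := range []string{"example-cluster", "random-name-1"} {
		tags, err := tagsFor(d, name, 10)
		fmt.Println(name, tags, err)
	}
}
```

A guessed name such as random-name-1 is rejected before any tags exist, so later log lines cannot carry a misleading cluster and tenant pairing.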
@@ -72,6 +72,8 @@ func NewLocalService(opts ...LocalOption) Service {
 	return s
 }

+var _ Service = (*localService)(nil)
What does this line do?
This ensures that localService implements the Service interface at compile time. This is the same as doing:
var _ Service = &localService{}
We use this pattern in CC as well.
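For readers new to the idiom, here is a small self-contained example of the compile-time assertion. The `Service` interface and `localService` type below are simplified stand-ins with a single made-up method, not the throttler's real definitions.

```go
package main

import "fmt"

// Service is a simplified stand-in interface with one hypothetical method.
type Service interface {
	Name() string
}

// localService is a simplified stand-in implementation.
type localService struct{}

// Name has a pointer receiver, so *localService (not localService)
// satisfies Service.
func (s *localService) Name() string { return "local" }

// Compile-time assertion: a typed nil pointer is assigned to a blank variable
// of the interface type. If *localService stops implementing Service, the
// build fails here, and nothing is allocated at run time.
var _ Service = (*localService)(nil)

// describe accepts any Service.
func describe(s Service) string { return "service: " + s.Name() }

func main() {
	fmt.Println(describe(&localService{}))
}
```

Removing or renaming `Name` on `*localService` would turn the assertion into a build error at the `var _` line, which is usually easier to diagnose than a failure at a distant call site.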
for key, value := range reqTags {
	ctx = logtags.AddTag(ctx, key, value)
}
// log.Infof automatically prints hints (one per line) that are
Can we also change the behavior of log.Infof or is that a much larger task?
The log package (https://github.com/cockroachdb/cockroach/tree/master/pkg/util/log) is utilized extensively throughout the CRDB codebase, and I'd imagine there could be a scenario where we actually want hints. For the proxy's use case, I think the right thing to do here is to avoid returning user-facing error messages from handle. This would require a deeper audit, and some refactoring of the logic and tests. For now, this approach works to address the unnecessary log spam issue.
Thanks for the reviews!
Thanks for all the explanations!
TFTR!

bors r=davidwding
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error setting reviewers, but backport branch blathers/backport-release-24.2-134613 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/134848/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. [] Backport to branch 24.2.x failed. See errors above. error setting reviewers, but backport branch blathers/backport-release-24.3-134613 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/134849/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. [] Backport to branch 24.3.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
In cockroachdb#134613, we introduced rate-limiting for throttler errors, but deferred handling of errors caused by connections blocked by misconfigured access control lists (ACLs). These errors, often due to incorrect CIDR ranges or private endpoints, can lead to excessive retries and logging noise.

This commit addresses that issue by introducing a new log-limiting mechanism for high-frequency errors based on an (IP, tenant) pair. The following cases are now covered:

1. Refused connections (ACL misconfigurations) - excessive retries from disallowed IPs or private endpoint IDs.
2. Auth throttling (invalid logins) - throttling errors due to invalid login attempts.
3. Deleted/invalid cluster - errors when a deleted tenant still receives requests.

This change is internal, so no release note is required.

Epic: none
Release note: None
135008: ccl/sqlproxyccl: throttle logging of all high-frequency errors r=davidwding a=jaylim-crl In #134613, we introduced rate-limiting for throttler errors, but deferred handling of errors caused by connections blocked by misconfigured access control lists (ACLs). These errors, often due to incorrect CIDR ranges or private endpoints, can lead to excessive retries and logging noise. This commit addresses that issue by introducing a new log-limiting mechanism for high-frequency errors based on an (IP, tenant) pair. The following cases are now covered: 1. Refused connections (ACL misconfigurations) - excessive retries from disallowed IPs or private endpoint IDs. 2. Auth throttling (invalid logins) - throttling errors due to invalid login attempts. 3. Deleted/invalid cluster - errors when a deleted tenant still receives request. This change is internal, so no release note is required. Epic: none Release note: None 135011: githubpost: plumb Side-Eye snapshot URL r=DarrylWong a=andreimatei The URL to the Side-Eye snapshot was not properly plumbed to the GitHub issue poster, resulting in the issues created for timing out roachtests not including the link to Side-Eye. Now, with the plumbing fixed, the respective issues should look like can be seen in the templates used by the roachtest unit tests -- e.g. https://github.com/cockroachdb/cockroach/blob/496c129415425d8b787e2d186508b68b03c782bc/pkg/cmd/roachtest/testdata/github/basic_test_create_post_request#L10 Epic: none Release note: None Co-authored-by: Jay Lim <[email protected]> Co-authored-by: Andrei Matei <[email protected]>