
ccl/sqlproxyccl: improve connection logging behavior #134613

Merged
2 commits merged into cockroachdb:master on Nov 11, 2024

Conversation

@jaylim-crl (Collaborator) commented Nov 8, 2024

ccl/sqlproxyccl: improve connection logging behavior

Previously, the connection logging behavior in the proxy had several issues that led to unnecessary log spam:

  1. Excessive "registerAssignment" and "unregisterAssignment" logs: These logs were emitted for every connection attempt (including migration) and were not useful under normal operation.
  2. Redundant error logging: Most errors were logged twice -- once in the handle method, and again by the caller when handle returned.
  3. Unfiltered error hints: User-facing errors containing hints were logged line by line, cluttering the logs.
  4. Lack of context in error logs: Errors logged from the proxy lacked tenant and cluster context, which made troubleshooting more difficult.

This commit addresses these issues as follows:

  1. Reduced logging: "registerAssignment" and "unregisterAssignment" logs are now only shown with vmodule logging enabled.
  2. Error logging improvements: handle no longer logs errors (except in a few cases); the caller is now responsible for logging. Additionally, errors with hints are no longer logged in full; only the main error is recorded. When those errors are logged, they now include the tenant and cluster information where possible. A sketch of this caller-side pattern is shown below.
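
For illustration only, here is a minimal sketch of the caller-side pattern described in item 2, using the log and logtags packages the proxy already depends on. The proxyServer, handleConn, and serveConn names are placeholders rather than the proxy's actual identifiers.

package sqlproxy

import (
    "context"
    "net"

    "github.com/cockroachdb/cockroach/pkg/util/log"
    "github.com/cockroachdb/logtags"
)

// proxyServer is a placeholder for the proxy's server type.
type proxyServer struct{}

// handleConn stands in for the handle method: it returns errors to its caller
// instead of logging them itself.
func (s *proxyServer) handleConn(ctx context.Context, conn net.Conn) error {
    // Connection handling elided.
    return nil
}

// serveConn is the caller: it attaches cluster/tenant logtags to the context
// and emits a single log line per failed connection.
func (s *proxyServer) serveConn(ctx context.Context, conn net.Conn, cluster string, tenantID uint64) {
    ctx = logtags.AddTag(ctx, "cluster", cluster)
    ctx = logtags.AddTag(ctx, "tenant", tenantID)
    if err := s.handleConn(ctx, conn); err != nil {
        log.Errorf(ctx, "connection error: %v", err)
    }
}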

No release note as this is an internal change.

Epic: none

Release note: None

ccl/sqlproxyccl: rate limit throttler errors from being logged

Previously, each request refused by the throttler resulted in a "throttler refused connection" message, so every rejected request generated a log entry.

The throttler service is responsible for rate limiting invalid login attempts from IP addresses, and in practice it can see a high volume of such traffic. To address this, these errors are now rate limited in the logs to one entry every 5 minutes per (IP, tenant) pair; a minimal sketch of the pattern follows this paragraph.
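
For illustration only, a minimal sketch of per-(IP, tenant) rate-limited logging. The actual change attaches the limiter to the throttler's existing per-client entry (the everyLog field quoted later in this review), so the logLimiter type and the map below are stand-ins for the idea rather than the real implementation.

package throttler

import (
    "context"
    "time"

    "github.com/cockroachdb/cockroach/pkg/util/log"
    "github.com/cockroachdb/cockroach/pkg/util/syncutil"
)

// limiterKey identifies one (IP, tenant) pair.
type limiterKey struct {
    ip       string
    tenantID uint64
}

// logLimiter allows at most one log entry per key within the given interval.
type logLimiter struct {
    interval time.Duration
    mu       syncutil.Mutex
    limiters map[limiterKey]*log.EveryN
}

func newLogLimiter(interval time.Duration) *logLimiter {
    return &logLimiter{interval: interval, limiters: make(map[limiterKey]*log.EveryN)}
}

// maybeLogRefused logs the refusal only if this (IP, tenant) pair has not
// produced a log entry within the interval.
func (l *logLimiter) maybeLogRefused(ctx context.Context, ip string, tenantID uint64) {
    key := limiterKey{ip: ip, tenantID: tenantID}
    l.mu.Lock()
    every, ok := l.limiters[key]
    if !ok {
        e := log.Every(l.interval)
        every = &e
        l.limiters[key] = every
    }
    l.mu.Unlock()
    if every.ShouldLog() {
        log.Error(ctx, "throttler refused connection due to too many failed authentication attempts")
    }
}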

This change is internal, so no release note is required.

Epic: none

Release note: None

blathers-crl bot commented Nov 8, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member) commented:

This change is Reviewable

@jaylim-crl added the backport-24.2.x and backport-24.3.x labels (flags PRs that need to be backported to 24.2 and 24.3) Nov 8, 2024
@jaylim-crl force-pushed the jay/241107-cleanup-sqlproxy-logs branch from efce681 to 065c944 on November 8, 2024 03:34
@jaylim-crl
Copy link
Collaborator Author

jaylim-crl commented Nov 8, 2024

Previously, for every successful connection, we would get at least 6 log lines:

I241108 01:27:16.332951 859 ccl/sqlproxyccl/balancer/conn_tracker.go:155  [client=x:54897,cluster=y,tenant=123] 20  registerAssignment: x.y.pod.cluster.local:26257
I241108 01:27:17.210178 859 ccl/sqlproxyccl/connector.go:170  [client=x:54897,cluster=y,tenant=123] 21  connected to x:26257 through normal auth
I241108 01:27:17.210311 859 ccl/sqlproxyccl/proxy_handler.go:501  [client=x:54897,cluster=y,tenant=123] 22  new connection
I241108 01:27:32.886097 859 ccl/sqlproxyccl/proxy_handler.go:504  [client=x:54897,cluster=y,tenant=123] 23  closing after 15.68s
I241108 01:27:32.886158 863 ccl/sqlproxyccl/balancer/conn_tracker.go:155  [client=x:54897,cluster=y,tenant=123] 24  unregisterAssignment: x.y.pod.cluster.local:26257
I241108 01:27:32.886377 859 ccl/sqlproxyccl/server.go:279  [client=x:54897] 25  connection error: codeClientReadFailed: unable to read from client: peeking message: EOF

With this change, this is what I'd expect to get (i.e. the number of log lines trimmed by half):

I241108 01:27:17.210178 859 ccl/sqlproxyccl/connector.go:170  [client=x:54897,cluster=y,tenant=123] 21  connected to x:26257 through normal auth
I241108 01:27:32.886097 859 ccl/sqlproxyccl/proxy_handler.go:504  [client=x:54897,cluster=y,tenant=123] 23  closing after 15.68s
I241108 01:27:32.886377 859 ccl/sqlproxyccl/server.go:279  [client=x:54897,cluster=y,tenant=123] 25  connection error: codeClientReadFailed: unable to read from client: peeking message: EOF

Similarly, for a connection with auth failure:

Before:

I241108 03:44:58.186986 6941 ccl/sqlproxyccl/balancer/conn_tracker.go:155  [client=x:61639,cluster=y,tenant=123] 57  registerAssignment: x.y.pod.cluster.local:26257
I241108 03:44:58.236704 6941 ccl/sqlproxyccl/balancer/conn_tracker.go:155  [client=x:61639,cluster=y,tenant=123] 58  unregisterAssignment: x.y.pod.cluster.local:26257
E241108 03:44:58.237019 6941 ccl/sqlproxyccl/proxy_handler.go:484  [client=x:61639,cluster=y,tenant=123] 59  could not connect to cluster: codeAuthFailed: authentication failed: password authentication failed for user foo
I241108 03:44:58.237296 6941 ccl/sqlproxyccl/server.go:279  [client=x:61639] 60  connection error: codeAuthFailed: authentication failed: password authentication failed for user foo

After:

I241108 03:44:58.237296 6941 ccl/sqlproxyccl/server.go:279  [client=x:61639,cluster=y,tenant=123] 60  connection error: codeAuthFailed: authentication failed: password authentication failed for user foo

Note that the above scenarios are speculative and describe the desired behavior; I will have to test this manually to validate.

Update: Tested the above, and it works as expected.

@jaylim-crl requested a review from DuskEagle on November 8, 2024 03:52
@jaylim-crl marked this pull request as ready for review on November 8, 2024 03:53
@jaylim-crl requested review from a team as code owners on November 8, 2024 03:53
@jaylim-crl force-pushed the jay/241107-cleanup-sqlproxy-logs branch 2 times, most recently from ea30f75 to 981066a on November 8, 2024 05:45
@jaylim-crl force-pushed the jay/241107-cleanup-sqlproxy-logs branch from 981066a to 737dccb on November 8, 2024 05:50
Comment on lines +101 to +104
if throttle.everyLog.ShouldLog() {
    // ctx should include logtags about the connection.
    log.Error(ctx, "throttler refused connection due to too many failed authentication attempts")
}
@jaylim-crl (Collaborator, Author):

If we didn't want the service to probe into the internals of the throttle struct (e.g. everyLog), this could be its own method on throttle (e.g. reportThrottled), but I don't feel strongly.
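
For reference, that alternative might look roughly like the following, where reportThrottled is the hypothetical method name suggested above and the throttle struct is pared down to just its limiter field:

package throttler

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/util/log"
)

// throttle stands in for the throttler's per-(IP, tenant) entry; only the
// log limiter field is sketched here.
type throttle struct {
    everyLog log.EveryN
}

// reportThrottled keeps everyLog private to the throttle type, so callers
// never need to probe its internals.
func (l *throttle) reportThrottled(ctx context.Context) {
    if l.everyLog.ShouldLog() {
        // ctx should include logtags about the connection.
        log.Error(ctx, "throttler refused connection due to too many failed authentication attempts")
    }
}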

@davidwding (Collaborator) left a comment:

Some basic questions; this is all new code to me.

@@ -49,7 +54,7 @@ func NewConnTracker(
timeSource = timeutil.DefaultTimeSource{}
}

-t := &ConnTracker{timeSource: timeSource}
+t := &ConnTracker{timeSource: timeSource, verboseLogging: log.V(2)}
A reviewer (Collaborator) asked:

What is log level 2 and how did you pick it? Are there corresponding constants like error/warn/info/etc.?

@jaylim-crl (Collaborator, Author) replied:

Unfortunately, I don't think we have any rule of thumb around vmodule logging levels. See this internal Slack thread: https://cockroachlabs.slack.com/archives/C9TGBJB44/p1614100926026700. Each vmodule logging level seems to be specific to the file itself. I picked 2 somewhat arbitrarily; it's a middle value, and opens up opportunities for something less/more verbose as well.

For this case, if we wanted to display logs from the connection tracker, we would start the proxy with the following flag: --vmodule=conn_tracker=2 (i.e. --vmodule=FILE=LEVEL,FILE2=LEVEL2,...).
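
To make that concrete, here is a pared-down sketch of how the captured vmodule check gates these logs; the real ConnTracker and its constructor have more fields and parameters than shown here:

package balancer

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/util/log"
)

// ConnTracker is reduced to the verbosity flag added in this PR.
type ConnTracker struct {
    // verboseLogging is captured once at construction; it is true only when
    // this file runs at vmodule level 2 or higher, e.g. --vmodule=conn_tracker=2.
    verboseLogging bool
}

func NewConnTracker() *ConnTracker {
    return &ConnTracker{verboseLogging: log.V(2)}
}

func (t *ConnTracker) registerAssignment(ctx context.Context, podAddr string) {
    if t.verboseLogging {
        log.Infof(ctx, "registerAssignment: %s", podAddr)
    }
}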

return errors.Wrap(err, "extracting cluster identifier")
}

// Validate the incoming connection and ensure that the cluster name
A reviewer (Collaborator) asked:

Why did this block of code move?

@jaylim-crl (Collaborator, Author) replied:

I'd like to ensure that we only add logtags after validating the connection. If we left the validation in its original location, we could end up with a log line containing a mismatched cluster name and tenant ID, leading to confusion during debugging. See the block that comes after the validation:

// Only add logtags after validating the connection. If the connection isn't
// validated, clusterName may not match the tenant ID, and this could cause
// confusion when analyzing logs.
ctx = logtags.AddTag(ctx, "cluster", clusterName)
ctx = logtags.AddTag(ctx, "tenant", tenID)

The proxy uses the tenant ID to spin up a new SQL server. Back then, there were concerns that users could iterate through --cluster=random-name-1 to --cluster=random-name-N, causing the operator to inadvertently provision resources for those tenants, and leading to potential resource exhaustion on the host. To address that, we implemented a validation check to ensure that "random-name" matches the actual cluster name associated with that tenant ID (as seen in the DnsPrefix field in the CrdbTenant spec, or the tenants table in CC). This validation guarantees that the user connecting to tenant N has some knowledge of the cluster, beyond just the tenant ID, and prevents random or malicious requests from triggering resource allocation.
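
Roughly, the resulting order of operations looks like the sketch below; extractClusterIdentifier and validateClusterName are illustrative stand-ins for the proxy's actual parsing and validation helpers:

package sqlproxy

import (
    "context"

    "github.com/cockroachdb/errors"
    "github.com/cockroachdb/logtags"
)

// Stub stand-ins for the proxy's real parsing and validation logic.
func extractClusterIdentifier(ctx context.Context) (clusterName string, tenID uint64, err error) {
    return "", 0, nil
}

func validateClusterName(ctx context.Context, clusterName string, tenID uint64) error {
    return nil
}

// tagAfterValidation sketches the ordering described above: validate first,
// then attach logtags, so a log line never carries a mismatched cluster name
// and tenant ID.
func tagAfterValidation(ctx context.Context) (context.Context, error) {
    clusterName, tenID, err := extractClusterIdentifier(ctx)
    if err != nil {
        return ctx, errors.Wrap(err, "extracting cluster identifier")
    }
    // Confirm that clusterName really belongs to tenID (e.g. matches the
    // tenant's DNS prefix) before trusting it anywhere, including in log tags.
    if err := validateClusterName(ctx, clusterName, tenID); err != nil {
        return ctx, err
    }
    ctx = logtags.AddTag(ctx, "cluster", clusterName)
    ctx = logtags.AddTag(ctx, "tenant", tenID)
    return ctx, nil
}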

@@ -72,6 +72,8 @@ func NewLocalService(opts ...LocalOption) Service {
return s
}

var _ Service = (*localService)(nil)
A reviewer (Collaborator) asked:

What does this line do?

@jaylim-crl (Collaborator, Author) replied:

This ensures that localService implements the Service interface at compile time. This is the same as doing:

var _ Service = &localService{}

We use this pattern in CC as well.
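
A tiny self-contained illustration of the pattern; the Service interface and localService type here are stand-ins rather than the proxy's real definitions:

package example

// Service is a stand-in interface; the real Service in the proxy has a
// different method set.
type Service interface {
    Start()
}

type localService struct{}

func (s *localService) Start() {}

// Compile-time assertion: if *localService ever stops implementing Service,
// this declaration fails to build. The (*localService)(nil) form performs the
// check without allocating a value.
var _ Service = (*localService)(nil)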

for key, value := range reqTags {
    ctx = logtags.AddTag(ctx, key, value)
}
// log.Infof automatically prints hints (one per line) that are
A reviewer (Collaborator) asked:

Can we also change the behavior of log.Infof or is that a much larger task?

@jaylim-crl (Collaborator, Author) replied:

The log package (https://github.com/cockroachdb/cockroach/tree/master/pkg/util/log) is utilized extensively throughout the CRDB codebase, and I'd imagine there could be a scenario where we actually want hints. For the proxy's use case, I think the right thing to do here is to avoid returning user-facing error messages from handle. This would require a deeper audit and some refactoring of the logic and tests. For now, this approach works to address the unnecessary log spam issue.
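
For illustration, one way the "only the main error is recorded" behavior can be expressed with the cockroachdb errors package; maybeLogError is a hypothetical helper, not the proxy's actual code:

package sqlproxy

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/util/log"
    "github.com/cockroachdb/errors"
)

// maybeLogError records only the main error text. For errors carrying
// user-facing hints (attached via errors.WithHint), the hint lines that the
// log package would otherwise print, as noted in the comment above, are
// dropped by logging the flat message instead of the full error.
func maybeLogError(ctx context.Context, err error) {
    if len(errors.GetAllHints(err)) > 0 {
        log.Errorf(ctx, "connection error: %s", err.Error())
        return
    }
    log.Errorf(ctx, "connection error: %v", err)
}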

@jaylim-crl (Collaborator, Author):

Thanks for the reviews!

@davidwding (Collaborator) left a comment:

Thanks for all the explanations!

@jaylim-crl (Collaborator, Author):

TFTR!

bors r=davidwding

@craig (craig bot) merged commit 82d9ea8 into cockroachdb:master Nov 11, 2024
23 checks passed
blathers-crl bot commented Nov 11, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-24.2-134613 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/134848/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 24.2.x failed. See errors above.


error setting reviewers, but backport branch blathers/backport-release-24.3-134613 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/134849/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 24.3.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Nov 12, 2024
In cockroachdb#134613, we introduced rate-limiting for throttler errors, but deferred
handling of errors caused by connections blocked by misconfigured access
control lists (ACLs). These errors, often due to incorrect CIDR ranges or
private endpoints, can lead to excessive retries and logging noise.

This commit addresses that issue by introducing a new log-limiting mechanism
for high-frequency errors based on an (IP, tenant) pair. The following cases
are now covered:

1. Refused connections (ACL misconfigurations) - excessive retries from
   disallowed IPs or private endpoint IDs.
2. Auth throttling (invalid logins) - throttling errors due to invalid login
   attempts.
3. Deleted/invalid cluster - errors when a deleted tenant still receives
   requests.

This change is internal, so no release note is required.

Epic: none

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Nov 12, 2024
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Nov 13, 2024
craig bot pushed a commit that referenced this pull request Nov 14, 2024
135008: ccl/sqlproxyccl: throttle logging of all high-frequency errors r=davidwding a=jaylim-crl

In #134613, we introduced rate-limiting for throttler errors, but deferred handling of errors caused by connections blocked by misconfigured access control lists (ACLs). These errors, often due to incorrect CIDR ranges or private endpoints, can lead to excessive retries and logging noise.

This commit addresses that issue by introducing a new log-limiting mechanism for high-frequency errors based on an (IP, tenant) pair. The following cases are now covered:

1. Refused connections (ACL misconfigurations) - excessive retries from disallowed IPs or private endpoint IDs.
2. Auth throttling (invalid logins) - throttling errors due to invalid login attempts.
3. Deleted/invalid cluster - errors when a deleted tenant still receives requests.

This change is internal, so no release note is required.

Epic: none

Release note: None

135011: githubpost: plumb Side-Eye snapshot URL r=DarrylWong a=andreimatei

The URL to the Side-Eye snapshot was not properly plumbed to the GitHub issue poster, resulting in the issues created for timing out roachtests not including the link to Side-Eye. Now, with the plumbing fixed, the respective issues should look like what can be seen in the templates used by the roachtest unit tests -- e.g. https://github.com/cockroachdb/cockroach/blob/496c129415425d8b787e2d186508b68b03c782bc/pkg/cmd/roachtest/testdata/github/basic_test_create_post_request#L10

Epic: none
Release note: None

Co-authored-by: Jay Lim <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Nov 14, 2024
blathers-crl bot pushed a commit that referenced this pull request Nov 14, 2024
Labels: backport-24.2.x, backport-24.3.x