
ccl/sqlproxyccl: improve connection logging behavior #134613

Merged
2 commits merged into cockroachdb:master on Nov 11, 2024

Conversation

@jaylim-crl (Collaborator) commented Nov 8, 2024

ccl/sqlproxyccl: improve connection logging behavior

Previously, the connection logging behavior in the proxy had several issues that led to unnecessary log spam:

  1. Excessive "registerAssignment" and "unregisterAssignment" logs: These logs were emitted for every connection attempt (including migration) and were not useful under normal operation.
  2. Redundant error logging: Most errors were logged twice -- once in the handle method, and again by the caller when handle returned.
  3. Unfiltered error hints: User-facing errors containing hints were logged line by line, cluttering the logs.
  4. Lack of context in error logs: Errors logged from the proxy lacked tenant and cluster context, which made troubleshooting more difficult.

This commit addresses these issues as follows:

  1. Reduced logging: "registerAssignment" and "unregisterAssignment" logs are now only shown with vmodule logging enabled.
  2. Error logging improvements: handle no longer logs errors (except in a few cases); the caller is now responsible for logging. Additionally, errors with hints are no longer logged in full; only the main error is recorded. When those errors are logged, they now include the tenant and cluster information where possible. A sketch of this caller-side pattern is shown below.
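
For illustration only, here is a minimal sketch of the caller-side pattern described in item 2, using the log and logtags packages the proxy already depends on. The proxyServer, handleConn, and serveConn names are placeholders rather than the proxy's actual identifiers.

package sqlproxy

import (
    "context"
    "net"

    "github.com/cockroachdb/cockroach/pkg/util/log"
    "github.com/cockroachdb/logtags"
)

// proxyServer is a placeholder for the proxy's server type.
type proxyServer struct{}

// handleConn stands in for the handle method: it returns errors to its caller
// instead of logging them itself.
func (s *proxyServer) handleConn(ctx context.Context, conn net.Conn) error {
    // Connection handling elided.
    return nil
}

// serveConn is the caller: it attaches cluster/tenant logtags to the context
// and emits a single log line per failed connection.
func (s *proxyServer) serveConn(ctx context.Context, conn net.Conn, cluster string, tenantID uint64) {
    ctx = logtags.AddTag(ctx, "cluster", cluster)
    ctx = logtags.AddTag(ctx, "tenant", tenantID)
    if err := s.handleConn(ctx, conn); err != nil {
        log.Errorf(ctx, "connection error: %v", err)
    }
}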

No release note as this is an internal change.

Epic: none

Release note: None

ccl/sqlproxyccl: rate limit throttler errors from being logged

Previously, each request refused by the throttler resulted in a "throttler refused connection" message, so every rejected request generated a log entry.

The throttler service is responsible for rate limiting invalid login attempts from IP addresses, and in practice it can see a high volume of such traffic. To address this, these errors are now rate limited in the logs to one entry every 5 minutes per (IP, tenant) pair; a minimal sketch of the pattern follows this paragraph.
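
For illustration only, a minimal sketch of per-(IP, tenant) rate-limited logging. The actual change attaches the limiter to the throttler's existing per-client entry (the everyLog field quoted later in this review), so the logLimiter type and the map below are stand-ins for the idea rather than the real implementation.

package throttler

import (
    "context"
    "time"

    "github.com/cockroachdb/cockroach/pkg/util/log"
    "github.com/cockroachdb/cockroach/pkg/util/syncutil"
)

// limiterKey identifies one (IP, tenant) pair.
type limiterKey struct {
    ip       string
    tenantID uint64
}

// logLimiter allows at most one log entry per key within the given interval.
type logLimiter struct {
    interval time.Duration
    mu       syncutil.Mutex
    limiters map[limiterKey]*log.EveryN
}

func newLogLimiter(interval time.Duration) *logLimiter {
    return &logLimiter{interval: interval, limiters: make(map[limiterKey]*log.EveryN)}
}

// maybeLogRefused logs the refusal only if this (IP, tenant) pair has not
// produced a log entry within the interval.
func (l *logLimiter) maybeLogRefused(ctx context.Context, ip string, tenantID uint64) {
    key := limiterKey{ip: ip, tenantID: tenantID}
    l.mu.Lock()
    every, ok := l.limiters[key]
    if !ok {
        e := log.Every(l.interval)
        every = &e
        l.limiters[key] = every
    }
    l.mu.Unlock()
    if every.ShouldLog() {
        log.Error(ctx, "throttler refused connection due to too many failed authentication attempts")
    }
}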

This change is internal, so no release note is required.

Epic: none

Release note: None

blathers-crl bot commented Nov 8, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member) commented:

This change is Reviewable

@jaylim-crl added the backport-24.2.x and backport-24.3.x labels (flags PRs that need to be backported to 24.2 and 24.3) Nov 8, 2024
@jaylim-crl force-pushed the jay/241107-cleanup-sqlproxy-logs branch from efce681 to 065c944 on November 8, 2024 03:34
@jaylim-crl
Copy link
Collaborator Author

jaylim-crl commented Nov 8, 2024

Previously, for every successful connection, we would get at least 6 log lines:

I241108 01:27:16.332951 859 ccl/sqlproxyccl/balancer/conn_tracker.go:155  [client=x:54897,cluster=y,tenant=123] 20  registerAssignment: x.y.pod.cluster.local:26257
I241108 01:27:17.210178 859 ccl/sqlproxyccl/connector.go:170  [client=x:54897,cluster=y,tenant=123] 21  connected to x:26257 through normal auth
I241108 01:27:17.210311 859 ccl/sqlproxyccl/proxy_handler.go:501  [client=x:54897,cluster=y,tenant=123] 22  new connection
I241108 01:27:32.886097 859 ccl/sqlproxyccl/proxy_handler.go:504  [client=x:54897,cluster=y,tenant=123] 23  closing after 15.68s
I241108 01:27:32.886158 863 ccl/sqlproxyccl/balancer/conn_tracker.go:155  [client=x:54897,cluster=y,tenant=123] 24  unregisterAssignment: x.y.pod.cluster.local:26257
I241108 01:27:32.886377 859 ccl/sqlproxyccl/server.go:279  [client=x:54897] 25  connection error: codeClientReadFailed: unable to read from client: peeking message: EOF

With this change, this is what I'd expect to get (i.e. the number of log lines trimmed by half):

I241108 01:27:17.210178 859 ccl/sqlproxyccl/connector.go:170  [client=x:54897,cluster=y,tenant=123] 21  connected to x:26257 through normal auth
I241108 01:27:32.886097 859 ccl/sqlproxyccl/proxy_handler.go:504  [client=x:54897,cluster=y,tenant=123] 23  closing after 15.68s
I241108 01:27:32.886377 859 ccl/sqlproxyccl/server.go:279  [client=x:54897,cluster=y,tenant=123] 25  connection error: codeClientReadFailed: unable to read from client: peeking message: EOF

Similarly, for a connection with auth failure:

Before:

I241108 03:44:58.186986 6941 ccl/sqlproxyccl/balancer/conn_tracker.go:155  [client=x:61639,cluster=y,tenant=123] 57  registerAssignment: x.y.pod.cluster.local:26257
I241108 03:44:58.236704 6941 ccl/sqlproxyccl/balancer/conn_tracker.go:155  [client=x:61639,cluster=y,tenant=123] 58  unregisterAssignment: x.y.pod.cluster.local:26257
E241108 03:44:58.237019 6941 ccl/sqlproxyccl/proxy_handler.go:484  [client=x:61639,cluster=y,tenant=123] 59  could not connect to cluster: codeAuthFailed: authentication failed: password authentication failed for user foo
I241108 03:44:58.237296 6941 ccl/sqlproxyccl/server.go:279  [client=x:61639] 60  connection error: codeAuthFailed: authentication failed: password authentication failed for user foo

After:

I241108 03:44:58.237296 6941 ccl/sqlproxyccl/server.go:279  [client=x:61639,cluster=y,tenant=123] 60  connection error: codeAuthFailed: authentication failed: password authentication failed for user foo

Note that the above scenarios are speculative and describe the desired behavior; I will have to test this manually to validate.

Update: Tested the above, and it works as expected.

@jaylim-crl requested a review from DuskEagle on November 8, 2024 03:52
@jaylim-crl marked this pull request as ready for review on November 8, 2024 03:53
@jaylim-crl requested review from a team as code owners on November 8, 2024 03:53
@jaylim-crl force-pushed the jay/241107-cleanup-sqlproxy-logs branch 2 times, most recently from ea30f75 to 981066a on November 8, 2024 05:45
@jaylim-crl force-pushed the jay/241107-cleanup-sqlproxy-logs branch from 981066a to 737dccb on November 8, 2024 05:50
Comment on lines +101 to +104
if throttle.everyLog.ShouldLog() {
    // ctx should include logtags about the connection.
    log.Error(ctx, "throttler refused connection due to too many failed authentication attempts")
}
@jaylim-crl (Collaborator, Author):

If we didn't want the service to probe into the internals of the throttle struct (e.g. everyLog), this could be its own method on throttle (e.g. reportThrottled), but I don't feel strongly.
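
For reference, that alternative might look roughly like the following, where reportThrottled is the hypothetical method name suggested above and the throttle struct is pared down to just its limiter field:

package throttler

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/util/log"
)

// throttle stands in for the throttler's per-(IP, tenant) entry; only the
// log limiter field is sketched here.
type throttle struct {
    everyLog log.EveryN
}

// reportThrottled keeps everyLog private to the throttle type, so callers
// never need to probe its internals.
func (l *throttle) reportThrottled(ctx context.Context) {
    if l.everyLog.ShouldLog() {
        // ctx should include logtags about the connection.
        log.Error(ctx, "throttler refused connection due to too many failed authentication attempts")
    }
}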

@davidwding (Collaborator) left a comment:

Some basic questions; this is all new code to me.

@@ -49,7 +54,7 @@ func NewConnTracker(
timeSource = timeutil.DefaultTimeSource{}
}

-t := &ConnTracker{timeSource: timeSource}
+t := &ConnTracker{timeSource: timeSource, verboseLogging: log.V(2)}
A reviewer (Collaborator) asked:

What is log level 2 and how did you pick it? Are there corresponding constants like error/warn/info/etc.?

@jaylim-crl (Collaborator, Author) replied:

Unfortunately, I don't think we have any rule of thumb around vmodule logging levels. See this internal Slack thread: https://cockroachlabs.slack.com/archives/C9TGBJB44/p1614100926026700. Each vmodule logging level seems to be specific to the file itself. I picked 2 somewhat arbitrarily; it's a middle value, and opens up opportunities for something less/more verbose as well.

For this case, if we wanted to display logs from the connection tracker, we would start the proxy with the following flag: --vmodule=conn_tracker=2 (i.e. --vmodule=FILE=LEVEL,FILE2=LEVEL2,...).
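
To make that concrete, here is a pared-down sketch of how the captured vmodule check gates these logs; the real ConnTracker and its constructor have more fields and parameters than shown here:

package balancer

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/util/log"
)

// ConnTracker is reduced to the verbosity flag added in this PR.
type ConnTracker struct {
    // verboseLogging is captured once at construction; it is true only when
    // this file runs at vmodule level 2 or higher, e.g. --vmodule=conn_tracker=2.
    verboseLogging bool
}

func NewConnTracker() *ConnTracker {
    return &ConnTracker{verboseLogging: log.V(2)}
}

func (t *ConnTracker) registerAssignment(ctx context.Context, podAddr string) {
    if t.verboseLogging {
        log.Infof(ctx, "registerAssignment: %s", podAddr)
    }
}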

return errors.Wrap(err, "extracting cluster identifier")
}

// Validate the incoming connection and ensure that the cluster name
A reviewer (Collaborator) asked:

Why did this block of code move?

@jaylim-crl (Collaborator, Author) replied:

I'd like to ensure that we only add logtags after validating the connection. If we left the validation in its original location, we could end up with a log line containing a mismatched cluster name and tenant ID, leading to confusion during debugging. See the block that comes after the validation:

// Only add logtags after validating the connection. If the connection isn't
// validated, clusterName may not match the tenant ID, and this could cause
// confusion when analyzing logs.
ctx = logtags.AddTag(ctx, "cluster", clusterName)
ctx = logtags.AddTag(ctx, "tenant", tenID)

The proxy uses the tenant ID to spin up a new SQL server. Back then, there were concerns that users could iterate through --cluster=random-name-1 to --cluster=random-name-N, causing the operator to inadvertently provision resources for those tenants, and leading to potential resource exhaustion on the host. To address that, we implemented a validation check to ensure that "random-name" matches the actual cluster name associated with that tenant ID (as seen in the DnsPrefix field in the CrdbTenant spec, or the tenants table in CC). This validation guarantees that the user connecting to tenant N has some knowledge of the cluster, beyond just the tenant ID, and prevents random or malicious requests from triggering resource allocation.
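
Roughly, the resulting order of operations looks like the sketch below; extractClusterIdentifier and validateClusterName are illustrative stand-ins for the proxy's actual parsing and validation helpers:

package sqlproxy

import (
    "context"

    "github.com/cockroachdb/errors"
    "github.com/cockroachdb/logtags"
)

// Stub stand-ins for the proxy's real parsing and validation logic.
func extractClusterIdentifier(ctx context.Context) (clusterName string, tenID uint64, err error) {
    return "", 0, nil
}

func validateClusterName(ctx context.Context, clusterName string, tenID uint64) error {
    return nil
}

// tagAfterValidation sketches the ordering described above: validate first,
// then attach logtags, so a log line never carries a mismatched cluster name
// and tenant ID.
func tagAfterValidation(ctx context.Context) (context.Context, error) {
    clusterName, tenID, err := extractClusterIdentifier(ctx)
    if err != nil {
        return ctx, errors.Wrap(err, "extracting cluster identifier")
    }
    // Confirm that clusterName really belongs to tenID (e.g. matches the
    // tenant's DNS prefix) before trusting it anywhere, including in log tags.
    if err := validateClusterName(ctx, clusterName, tenID); err != nil {
        return ctx, err
    }
    ctx = logtags.AddTag(ctx, "cluster", clusterName)
    ctx = logtags.AddTag(ctx, "tenant", tenID)
    return ctx, nil
}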

@@ -72,6 +72,8 @@ func NewLocalService(opts ...LocalOption) Service {
return s
}

var _ Service = (*localService)(nil)
A reviewer (Collaborator) asked:

What does this line do?

@jaylim-crl (Collaborator, Author) replied:

This ensures that localService implements the Service interface at compile time. This is the same as doing:

var _ Service = &localService{}

We use this pattern in CC as well.
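
A tiny self-contained illustration of the pattern; the Service interface and localService type here are stand-ins rather than the proxy's real definitions:

package example

// Service is a stand-in interface; the real Service in the proxy has a
// different method set.
type Service interface {
    Start()
}

type localService struct{}

func (s *localService) Start() {}

// Compile-time assertion: if *localService ever stops implementing Service,
// this declaration fails to build. The (*localService)(nil) form performs the
// check without allocating a value.
var _ Service = (*localService)(nil)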

for key, value := range reqTags {
    ctx = logtags.AddTag(ctx, key, value)
}
// log.Infof automatically prints hints (one per line) that are
A reviewer (Collaborator) asked:

Can we also change the behavior of log.Infof or is that a much larger task?

@jaylim-crl (Collaborator, Author) replied:

The log package (https://github.com/cockroachdb/cockroach/tree/master/pkg/util/log) is utilized extensively throughout the CRDB codebase, and I'd imagine there could be a scenario where we actually want hints. For the proxy's use case, I think the right thing to do here is to avoid returning user-facing error messages from handle. This would require a deeper audit and some refactoring of the logic and tests. For now, this approach works to address the unnecessary log spam issue.
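
For illustration, one way the "only the main error is recorded" behavior can be expressed with the cockroachdb errors package; maybeLogError is a hypothetical helper, not the proxy's actual code:

package sqlproxy

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/util/log"
    "github.com/cockroachdb/errors"
)

// maybeLogError records only the main error text. For errors carrying
// user-facing hints (attached via errors.WithHint), the hint lines that the
// log package would otherwise print, as noted in the comment above, are
// dropped by logging the flat message instead of the full error.
func maybeLogError(ctx context.Context, err error) {
    if len(errors.GetAllHints(err)) > 0 {
        log.Errorf(ctx, "connection error: %s", err.Error())
        return
    }
    log.Errorf(ctx, "connection error: %v", err)
}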

@jaylim-crl (Collaborator, Author):

Thanks for the reviews!

@davidwding (Collaborator) left a comment:

Thanks for all the explanations!

@jaylim-crl (Collaborator, Author):

TFTR!

bors r=davidwding

@craig (craig bot) merged commit 82d9ea8 into cockroachdb:master Nov 11, 2024
23 checks passed
blathers-crl bot commented Nov 11, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-24.2-134613 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/134848/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 24.2.x failed. See errors above.


error setting reviewers, but backport branch blathers/backport-release-24.3-134613 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/134849/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 24.3.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Nov 12, 2024
In cockroachdb#134613, we introduced rate-limiting for throttler errors, but deferred
handling of errors caused by connections blocked by misconfigured access
control lists (ACLs). These errors, often due to incorrect CIDR ranges or
private endpoints, can lead to excessive retries and logging noise.

This commit addresses that issue by introducing a new log-limiting mechanism
for high-frequency errors based on an (IP, tenant) pair. The following cases
are now covered:

1. Refused connections (ACL misconfigurations) - excessive retries from
   disallowed IPs or private endpoint IDs.
2. Auth throttling (invalid logins) - throttling errors due to invalid login
   attempts.
3. Deleted/invalid cluster - errors when a deleted tenant still receives
   requests.

This change is internal, so no release note is required.

Epic: none

Release note: None
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Nov 12, 2024
jaylim-crl added a commit to jaylim-crl/cockroach that referenced this pull request Nov 13, 2024
craig bot pushed a commit that referenced this pull request Nov 14, 2024
135008: ccl/sqlproxyccl: throttle logging of all high-frequency errors r=davidwding a=jaylim-crl

In #134613, we introduced rate-limiting for throttler errors, but deferred handling of errors caused by connections blocked by misconfigured access control lists (ACLs). These errors, often due to incorrect CIDR ranges or private endpoints, can lead to excessive retries and logging noise.

This commit addresses that issue by introducing a new log-limiting mechanism for high-frequency errors based on an (IP, tenant) pair. The following cases are now covered:

1. Refused connections (ACL misconfigurations) - excessive retries from disallowed IPs or private endpoint IDs.
2. Auth throttling (invalid logins) - throttling errors due to invalid login attempts.
3. Deleted/invalid cluster - errors when a deleted tenant still receives requests.

This change is internal, so no release note is required.

Epic: none

Release note: None

135011: githubpost: plumb Side-Eye snapshot URL r=DarrylWong a=andreimatei

The URL to the Side-Eye snapshot was not properly plumbed to the GitHub issue poster, resulting in the issues created for timing out roachtests not including the link to Side-Eye. Now, with the plumbing fixed, the respective issues should look like what can be seen in the templates used by the roachtest unit tests -- e.g. https://github.com/cockroachdb/cockroach/blob/496c129415425d8b787e2d186508b68b03c782bc/pkg/cmd/roachtest/testdata/github/basic_test_create_post_request#L10

Epic: none
Release note: None

Co-authored-by: Jay Lim <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Nov 14, 2024
blathers-crl bot pushed a commit that referenced this pull request Nov 14, 2024
Labels: backport-24.2.x, backport-24.3.x