merge release-23.2.12-rc to release-23.2: released CockroachDB version 23.2.12. Next version: 23.2.13 #131582
Merged: vidit-bhat merged 58 commits into cockroachdb:release-23.2 from cockroach-teamcity:merge-release-23.2.12-rc-to-release-23.2-nu6b on Sep 30, 2024
Conversation
Introduce the `ranges.decommissioning` gauge metric, which represents the number of ranges with at least one replica on a decommissioning node. The metric is reported by the leaseholder, or if there is no valid leaseholder, the first live replica in the descriptor, similar to (under|over)-replication metrics. The metric can be used to approximately identify the distribution of decommissioning work remaining across nodes, as the leaseholder replica is responsible for triggering the replacement of decommissioning replicas for its own range. Informs: cockroachdb#130085 Release note (ops change): The `ranges.decommissioning` metric is added, representing the number of ranges which have a replica on a decommissioning node.
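The gauge described above counts ranges with at least one replica on a decommissioning node, as reported by the leaseholder. A minimal illustrative sketch of that counting logic (Python, with hypothetical names; the actual implementation lives in CockroachDB's Go kvserver code):

```python
def count_decommissioning_ranges(range_replicas, decommissioning_nodes):
    """Count ranges that have at least one replica on a decommissioning node.

    range_replicas: dict mapping range ID -> list of node IDs holding replicas.
    decommissioning_nodes: set of node IDs currently decommissioning.
    """
    return sum(
        1
        for replicas in range_replicas.values()
        if any(node in decommissioning_nodes for node in replicas)
    )
```

In the real system each leaseholder reports this for its own ranges, so summing the gauge across nodes approximates the remaining decommissioning work and its distribution.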
When `kv.enqueue_in_replicate_queue_on_problem.interval` is set to a positive value, leaseholder replicas of ranges that have decommissioning replicas are enqueued into the replicate queue at that interval. When the setting is 0, no enqueueing on decommissioning takes place outside of the regular replica scanner. For users enabling the enqueue, a value of at least 15 minutes is recommended, e.g.:
```
SET CLUSTER SETTING kv.enqueue_in_replicate_queue_on_problem.interval = '900s'
```
Resolves: cockroachdb#130085 Informs: cockroachdb#130199 Release note: None
…2.11-rc-130117 release-23.2.11-rc: kvserver: enqueue decom ranges at an interval behind a setting
This reverts commit 1a5279e. Release note: None.
This change extends the response of the `uiconfig` endpoint, which now contains information about the license type and the time until the license expires. Release note: None
With this change, a new alert message is shown in DB Console when the license is expired or fewer than 15 days remain before it expires. This change doesn't affect clusters that don't have any license set. Release note (ui change): show an alert message in DB Console when the license is expired or fewer than 15 days remain before it expires.
This change adds a dismissible alert to the Overview page of DB Console that informs users about upcoming license changes. This popup is only shown if the cluster does not have an active "Enterprise" license. The popup links to this page: "https://www.cockroachlabs.com/enterprise-license-update/". When the popup is dismissed, the dismissal is stored in the DB for this user and they don't see this notification again. Resolves: CRDB-40939 Release note (ui change): DB Console will show a notification alerting customers without an Enterprise license to upcoming license changes, with a link to more information.
…comparator-revert release-23.2.11-rc: Revert "release-23.2: storage: fix comparison of suffixes"
…-rc-120475-120490-129420 release-23.2.11-rc: ui: add license change notification to db console
Previously the cidr metrics were only started for the system tenant. This was problematic for SQL tenants since the mapping wouldn't be updated. Fixes: cockroachdb#130708 Release note: None
This commit adds a cluster setting (turned off by default) that sets the period at which manual liveness range compactions are done. This is done in a goroutine rather than in the MVCC GC queue because: 1) This is meant to be a stop gap, as it is not needed from 24.3 onwards; a simple change like this should achieve the goal. 2) The MVCC GC queue runs against leaseholder replicas only, which means we would need to send a compaction request to the other liveness replicas. Fixes: cockroachdb#128968 Epic: None Release note: None
Previously we didn't include the locality of the remote node when we dialed a node. This prevented us from capturing locality aware stats for the connections. Epic: CRDB-41138 Release note: none
This commit adds the nodes locality information into the ContextOptions. This allows metrics to consult this to determine if a connection is from a remote locality. Epic: CRDB-41138 Release note: None
Some of the places that call UnvalidatedDial have the locality. By passing it in when it is known they will more accurately update the statistics. Epic: CRDB-41138 Release note: None
Extract a constant to make it easier to change the expected count. Epic: CRDB-41138 Release note: None
Previously we didn't track bytes sent and received per node. This commit adds metrics for these. Additionally it adds a metric for the connected count from a TCP perspective, as this may differ from the healthy or unhealthy counts. Epic: CRDB-41138 Release note (ops change): Adds three new network tracking metrics. `rpc.connection.connected` is the number of gRPC TCP-level connections established to remote nodes. `rpc.client.bytes.egress` is the number of TCP bytes sent via gRPC on connections we initiated. `rpc.client.bytes.ingress` is the number of TCP bytes received via gRPC on connections we initiated.
Previously the metric was not thread-safe, which prevented it from being shared by multiple connections. The value is only updated on heartbeat messages, so adding synchronization here should not cause any performance issues. Epic: CRDB-41138 Release note: None
The Metrics object stores all the metrics for the connection and was previously passed by value. This PR passes it by pointer instead to allow more complex state within the Metrics object in future commits. Epic: CRDB-41138 Release note: None
Previously the metrics for ConnectionHealthy, ConnectionUnhealthy and ConnectionInactive were manually set to 0 or 1. This prevented easily aggregating the peer metrics by something (like locality). This PR changes the way those three metrics are handled to only increment or decrement rather than setting to 0/1. Epic: CRDB-41138 Release note: None
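The switch from set-to-0/1 to increment/decrement is what makes aggregation work: per-peer +1/-1 transitions sum correctly across peers, while absolute 0/1 values do not. An illustrative Python sketch of this pattern (names are hypothetical; the real metrics are Go gauges in the rpc package):

```python
class ConnectionGauges:
    """Per-peer connection-state gauges maintained by increment/decrement.

    Because each state change is a +1/-1 transition rather than an absolute
    set-to-0/1, gauges from many peers can be summed when aggregated by a
    label such as locality.
    """

    def __init__(self):
        self.healthy = 0
        self.unhealthy = 0

    def on_state_change(self, old, new):
        # Leave the old state (if any) and enter the new one.
        if old is not None:
            setattr(self, old, getattr(self, old) - 1)
        setattr(self, new, getattr(self, new) + 1)
```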
Retain metrics across dropped/re-established connections. Previously, if we deleted and unlinked the counters and later recreated them, the counts would restart at zero after a reconnect. Instead we track the counters in a map and reuse them later if the key matches. Epic: CRDB-41138 Release note: None
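The keyed-map reuse described above can be sketched as follows (illustrative Python; the class and key shape are hypothetical, the real code keeps Go counter objects keyed by peer):

```python
class CounterRegistry:
    """Keep per-connection counters in a map keyed by peer identity.

    On reconnect, the same key returns the existing counter, so totals
    survive a dropped/re-established connection instead of resetting to zero.
    """

    def __init__(self):
        self._counters = {}

    def get(self, key):
        # setdefault returns the existing counter for a known key,
        # or creates and stores a fresh one for a new key.
        return self._counters.setdefault(key, {"bytes_egress": 0})
```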
Previously we would publish network stats broken down by every remote node. This would result in a large number of stats for large clusters. In practice we can aggregate them by remote localities. This reduces the number of stats with only a minimal loss of visibility into how the network is being used. Epic: CRDB-41138 Release note: None
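Rolling per-node stats up to per-locality stats is a simple keyed aggregation; a hedged Python sketch (hypothetical names; the real code aggregates Go metric children by locality labels):

```python
def aggregate_by_locality(per_node_bytes, node_locality):
    """Roll per-node byte counts up into per-locality totals.

    Nodes without a known locality fall into an "unknown" bucket so no
    traffic is silently dropped from the totals.
    """
    totals = {}
    for node, count in per_node_bytes.items():
        locality = node_locality.get(node, "unknown")
        totals[locality] = totals.get(locality, 0) + count
    return totals
```

For a large cluster this turns O(nodes) metric series into O(localities), with only a small loss of per-node visibility.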
This test flakes under stress/race conditions due to the use of network ports. Skipping under stress. Epic: None Release note: None
For certain tooling it is important to differentiate the locality tag of the local node from the locality tag of the remote node. By adding both the local and the destination localities, those tools can understand the source and destination of the connections. Epic: CRDB-41138 Release note: None
This commit adds a new utility which can store and efficiently process a large number of CIDR records by mapping them to a unique name. Epic: CRDB-41142 Release note: None
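The core operation of such a utility is longest-prefix matching of an address against a set of named CIDR blocks. A minimal sketch using Python's stdlib `ipaddress` module (the real utility is the Go `cidr` package and is optimized for large record counts; function names here are hypothetical):

```python
import ipaddress

def build_lookup(mappings):
    """mappings: dict of CIDR string -> name.

    Sort by descending prefix length so the most specific block wins.
    """
    nets = [(ipaddress.ip_network(cidr), name) for cidr, name in mappings.items()]
    nets.sort(key=lambda entry: entry[0].prefixlen, reverse=True)
    return nets

def lookup(nets, ip):
    """Return the name of the most specific block containing ip, or 'unknown'."""
    addr = ipaddress.ip_address(ip)
    for net, name in nets:
        if addr.version == net.version and addr in net:
            return name
    return "unknown"
```

A production implementation would use a trie or sorted-prefix binary search rather than a linear scan, but the matching semantics are the same.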
This commit wires the cidr.Lookup into the sql server ExecutorConfig and evalContext. Additionally this enables the cidr mapping and adds a new configuration parameter for it. Epic: CRDB-41142 Release note (ops change): Adds a new configuration parameter, `server.cidr_mapping_url`, which maps IPv4 CIDR blocks to arbitrary tag names.
Previously the writeBuffer required a specific Metric implementation as one of its parameters. This made it more complicated to change the metric type that was passed in. Epic: none Release note: None
This commit changes the SQL byte metrics to be broken down by cidr block of the source. Epic: CRDB-41142 Release note (ops change): Modifies metrics sql.bytesin and sql.bytesout to be agg metrics if child metrics are enabled.
Previously the test assumed the setting would propagate synchronously. This could fail under stress and race conditions. Epic: none Release note: None
Adds a utility to cidrLookup to create a DialContext that is tracked based on the lookup. This works for any third-party library that exposes a way to set the DialContext. Epic: none Release note: None
Unfortunately, the WithHTTPClient option overrides all other options when constructing a GCS client. As a result, it appears we can not both set credentials options and set an HTTP client with custom configs via the primary API. Here, we construct a transport using the SDK which allows us to attach the relevant credential options to the transport directly before making the HTTP client. The downside here is that the SDK's NewTransport's documentation says that it is not intended for end-user use -- so we may expect breakage in the future. Epic: none Release note: None
Previously the cidr http check did not return true when it completed. This adds the true return and additionally adds logging for other failure cases. Epic: none Release note: None
Previously if the `server.cidr_mapping_url` was set and a node restarted, there was a race condition where `SetOnChange` for the setting could be called before the `Start` was called. This could result in it blocking while attempting to submit to the channel. Fixes: cockroachdb#130589 Release note: None
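The class of race described above (a settings callback firing before the consumer starts draining its channel) is commonly fixed by making the notification send non-blocking. An illustrative Python sketch of that pattern (hypothetical names; the actual fix is in the Go `cidr` code and uses a Go channel):

```python
import queue

def notify_setting_changed(updates):
    """Deliver a settings-change notification without blocking.

    If the consumer hasn't started draining the channel yet, drop the
    notification instead of blocking the settings callback; the consumer
    reads the current setting value when it does start.
    """
    try:
        updates.put_nowait("changed")
        return True
    except queue.Full:
        return False
```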
For changefeeds we need some additional network wrapping methods. This commit also adds testing to Wrap, WrapTLS and WrapDialer. Part of: cockroachdb#130097 Epic: none Release note: None
Adds network tracing infrastructure to changefeeds. This commit adds the metrics but does not populate them for any of the existing changefeeds. Part of: cockroachdb#130097 Epic: none Release note (ops change): This commit adds two metrics, changefeed.network.bytes_in and changefeed.network.bytes_out. These metrics track the number of bytes sent by individual changefeeds to different sinks.
Add support for the cidr network metrics to the kafka v1 sink. Part of: cockroachdb#130097 Release note (enterprise change): Added network metrics to the kafka v1 sink.
Add network metrics to cdc webhook sinks. Part of cockroachdb#130097 Release note (enterprise change): Added network metrics to webhook sinks.
Add support for the cidr network metrics to the pubsub sinks. Part of: cockroachdb#130097 Release note (enterprise change): Added network metrics to the pubsub sinks.
Add support for the cidr network metrics to the sql sink. Part of: cockroachdb#130097 Release note (enterprise change): Added network metrics to the sql sink.
This commit adds the network metrics for the kafka sink. Epic: none Release note: None
….12-rc-130521-130528-130664 release-23.2.12-rc: all of the network metrics
…2.12 Release note: None Epic: None Release justification: non-production (release infra) change.
…ort-release-23.2.12-rc-130709 release-23.2.12-rc: util: publish cidr metrics for tenants
The `TestServerController` test server stops quickly (due to deferred stop) after executing `CREATE TENANT hello` while the creation of the tenant is ongoing in `newTenantServer`. This causes `baseCfg.CidrLookup.Start` in `newTenantServer` to fail with `ErrUnavailable` because `s.runPrelude()` in `stopper.RunAsyncTask` returns true if a server is stopping: https://github.com/cockroachdb/cockroach/blob/3bf34dc3a192d7efeee8aa97e46bf73f817b2b9b/pkg/util/stop/stopper.go#L469-L471. Fixes: cockroachdb#130757 Epic: CRDB-42208 Release note: None
…-rc-129827 release-23.2.12-rc: kvserver: compact liveness range periodically
Narrowed down scope of counter filters in order to not catch stray increment events from background queries. Resolves: cockroachdb#128045, cockroachdb#128171 Release note: None
This test flakes in cases where we run a query and expect the `sql.plan.type.force-custom` to not get incremented. This can't be guaranteed as this is the default counter and it occasionally gets bumped by background operations. There's no easy way to prevent these from happening so these cases are removed from this suite. Resolves: cockroachdb#128523, cockroachdb#128640 Epic: None Release note: None
…e-23.2.12-rc-128383-128715 release-23.2.12-rc: telemetryccl_test: fix TestTelemetry
…ort-release-23.2.12-rc-130850 release-23.2.12-rc: server, util: fix failing TestServerController
This commit adds a new changefeed testing knob, AsyncFlushSync, which can be used to introduce a synchronization point between goroutines during an async flush. It's currently only used in the cloud storage sink. Epic: none Release note: none
Adds a test that reproduces a memory leak from pgzip, the library used for fast gzip compression for changefeeds using cloud storage sinks. The leak was caused by a race condition between Flush/flushTopicVersions and the async flusher: if the Flush clears files before the async flusher closes the compression codec as part of flushing the files, and the flush returns an error, the compression codec will not be closed properly. This test uses the AsyncFlushSync testing knob to introduce synchronization points between these two goroutines to trigger the regression. Co-authored-by: wenyihu6 Epic: none Release note: none
When using the cloud storage sink with fast gzip and async flush enabled, changefeeds could leak memory from the pgzip library if a write error to the sink occurred. This was due to a race condition when flushing, where the goroutine initiating the flush cleared the files before the async flusher had cleaned up the compression codec and received the error from the sink. This fix clears the files after waiting for the async flusher to finish flushing the files, so that if an error occurs the files can be closed when the sink is closed. Co-authored-by: wenyihu6 Epic: none Fixes: cockroachdb#129947 Release note (bug fix): Fixes a potential memory leak in changefeeds using a cloud storage sink. The memory leak could occur if both changefeed.fast_gzip.enabled and changefeed.cloudstorage.async_flush.enabled are true and the changefeed received an error while attempting to write to the cloud storage sink.
…12-rc-130204 release-23.2.12-rc: changefeedccl: fix memory leak in cloud storage sink with fast gzip
…ort-release-23.2.12-rc-130789 release-23.2.12-rc: released CockroachDB version 23.2.11. Next version: 23.2.12
Previously the code would encounter an index out of bounds if the cidr mapping file had a cidr length greater than 32 bits. This could only happen with IPv6 addresses. Note that if there are any invalid mappings, the code displays the error in the logs but won't process any of the file. The code already handled mapping lookups for IPv6, but these changes also make that more explicit. Epic: none Informs: cockroachdb#130814 Release note: None
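The validate-then-reject-whole-file behavior can be sketched in Python with the stdlib `ipaddress` module, which handles IPv6 prefix lengths up to /128 natively (hypothetical names and file format; the real parser is in CockroachDB's Go `cidr` utility):

```python
import ipaddress

def parse_mappings(lines):
    """Parse 'CIDR,name' lines; reject the whole file on any invalid entry.

    IPv6 entries with prefix lengths above 32 (e.g. '2001:db8::/48') parse
    fine here, while a malformed entry such as an IPv4 '10.0.0.0/40'
    raises and causes the entire file to be skipped, mirroring the
    all-or-nothing behavior described in the commit.
    """
    result = {}
    for line in lines:
        cidr, _, name = line.partition(",")
        try:
            net = ipaddress.ip_network(cidr.strip())
        except ValueError as err:
            raise ValueError(f"invalid mapping {line!r}: {err}")
        result[net] = name.strip()
    return result
```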
We don't use the WrapTLS method, so it is better to remove it; if we need it in the future we can bring it back. Epic: none Release note: None
…ort-release-23.2.12-rc-131221 release-23.2.12-rc: util: don't panic on IPv6 entries in cidr mapping
…n 23.2.12. Next version: 23.2.13 Release note: None Epic: None Release justification: non-production (release infra) change.
vidit-bhat approved these changes on Sep 30, 2024