merge release-23.2.12-rc to release-23.2: released CockroachDB version 23.2.12. Next version: 23.2.13 #131582

cockroach-teamcity

Release note: None
Epic: None
Release justification: non-production (release infra) change.

kvoli and others added 30 commits September 10, 2024 09:10
Introduce the `ranges.decommissioning` gauge metric, which represents
the number of ranges with at least one replica on a decommissioning
node.

The metric is reported by the leaseholder, or if there is no valid
leaseholder, the first live replica in the descriptor, similar to
(under|over)-replication metrics.

The metric can be used to approximately identify the distribution of
decommissioning work remaining across nodes, as the leaseholder replica
is responsible for triggering the replacement of decommissioning
replicas for its own range.

Informs: cockroachdb#130085
Release note (ops change): The `ranges.decommissioning` metric is added,
representing the number of ranges which have a replica on a
decommissioning node.
When `kv.enqueue_in_replicate_queue_on_problem.interval` is set to a
positive value, leaseholder replicas of ranges that have
decommissioning replicas are enqueued into the replicate queue once
every interval.

When `kv.enqueue_in_replicate_queue_on_problem.interval` is set to 0,
no enqueueing on decommissioning takes place outside of the regular
replica scanner.

The recommended value for users enabling the enqueue (non-zero) is at
least 15 minutes, e.g.,

```
SET CLUSTER SETTING kv.enqueue_in_replicate_queue_on_problem.interval = '900s';
```

Resolves: cockroachdb#130085
Informs: cockroachdb#130199
Release note: None
…2.11-rc-130117

release-23.2.11-rc: kvserver: enqueue decom ranges at an interval behind a setting
This change extends the response of the `uiconfig` endpoint, which
now contains the license type and the time until the
license expires.

Release note: None
With this change, a new alert message is shown in DB Console
when the license has expired or has fewer than 15 days left before
it expires.
This change doesn't affect clusters that don't have
any license set.

Release note (ui change): DB Console shows an alert message
when the license has expired or has fewer than 15 days left before
it expires.
This change adds a dismissable alert to the Overview page of DB
Console that informs users about upcoming license changes.

This popup is only shown if the cluster does not have an active
"Enterprise" license.

The popup links to this page:
"https://www.cockroachlabs.com/enterprise-license-update/"

When the popup is dismissed, the dismissal is stored in the DB for
this user and they don't see this notification again.

Resolves: CRDB-40939

Release note (ui change): DB Console will show a notification alerting
customers without an Enterprise license to upcoming license changes,
with a link to more information.
…comparator-revert

release-23.2.11-rc: Revert "release-23.2: storage: fix comparison of suffixes"
…-rc-120475-120490-129420

release-23.2.11-rc: ui: add license change notification to db console
Previously the cidr metrics were only started for the system tenant.
This was problematic for SQL tenants since the mapping wouldn't be
updated.

Fixes: cockroachdb#130708

Release note: None
This commit adds a cluster setting (turned off by default) that sets the
period at which manual liveness range compactions are done.

This is done in a goroutine rather than in MVCC GC queue because:

1) This is meant to be a stopgap, as it is not needed in 24.3 onwards.
Therefore, a simple change like this should achieve the goal.

2) The MVCC GC queue runs against leaseholder replicas only. This means
that we need to send a compaction request to the other liveness
replicas.

Fixes: cockroachdb#128968

Epic: None

Release note: None
Previously we didn't include the locality of the remote node when we
dialed a node. This prevented us from capturing locality aware stats for
the connections.

Epic: CRDB-41138

Release note: none
This commit adds the node's locality information to the ContextOptions.
This allows metrics to consult it to determine whether a connection is
from a remote locality.

Epic: CRDB-41138

Release note: None
Some of the places that call UnvalidatedDial have the locality.
Passing it in when it is known lets them update the statistics more
accurately.

Epic: CRDB-41138

Release note: None
Extract a constant to make it easier to change the expected count.

Epic: CRDB-41138

Release note: None
Previously we didn't track bytes sent and received per node. This commit
adds metrics for these. Additionally, it adds a metric for the
connected count from a TCP perspective, as this may differ from the
healthy or unhealthy counts.

Epic: CRDB-41138

Release note (ops change): Adds three new network tracking metrics.
`rpc.connection.connected` is the number of gRPC TCP-level connections
established to remote nodes. `rpc.client.bytes.egress` is the number of
TCP bytes sent via gRPC on connections we initiated.
`rpc.client.bytes.ingress` is the number of TCP bytes received via gRPC
on connections we initiated.
Previously the metric was not thread-safe, which prevented it from
being shared by multiple connections. The value is only updated on
heartbeat messages, so adding synchronization here should not cause any
performance issues.

Epic: CRDB-41138

Release note: None
The Metrics object stores all the metrics for the connection and was
previously passed by value. This PR passes it by pointer instead to
allow more complex state within the Metrics object in future commits.

Epic: CRDB-41138

Release note: None
Previously the metrics for ConnectionHealthy, ConnectionUnhealthy and
ConnectionInactive were manually set to 0 or 1. This prevented easily
aggregating the peer metrics by something (like locality). This PR
changes the way those three metrics are handled to only increment or
decrement rather than setting to 0/1.

Epic: CRDB-41138

Release note: None
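The move from setting gauges to 0/1 toward incrementing and decrementing them, as described in the commit above, can be sketched as below. The type and function names (`connState`, `localityGauges`, `peer`, `transition`) are hypothetical and chosen for illustration; this is a sketch of the technique, not the rpc package's code.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// connState is a hypothetical enum of peer connection states.
type connState int

const (
	stateInactive connState = iota
	stateHealthy
	stateUnhealthy
)

// localityGauges is a hypothetical set of gauges shared by all peers that
// aggregate into the same bucket (for example, one locality).
type localityGauges struct {
	healthy, unhealthy, inactive atomic.Int64
}

// gauge returns the shared gauge that corresponds to a state.
func (g *localityGauges) gauge(s connState) *atomic.Int64 {
	switch s {
	case stateHealthy:
		return &g.healthy
	case stateUnhealthy:
		return &g.unhealthy
	default:
		return &g.inactive
	}
}

// peer remembers its own state so that each transition moves exactly one unit
// from the old gauge to the new one.
type peer struct {
	gauges  *localityGauges
	state   connState
	counted bool // false until the peer has been added to a gauge
}

// transition decrements the gauge for the previous state (if any) and
// increments the gauge for the new state. Setting a shared gauge to 0 or 1
// would instead overwrite the contribution of every other peer.
func (p *peer) transition(to connState) {
	if p.counted {
		p.gauges.gauge(p.state).Add(-1)
	}
	p.gauges.gauge(to).Add(1)
	p.state, p.counted = to, true
}

func main() {
	g := &localityGauges{}
	a, b := &peer{gauges: g}, &peer{gauges: g}
	a.transition(stateHealthy)
	b.transition(stateHealthy)
	b.transition(stateUnhealthy)
	fmt.Println(g.healthy.Load(), g.unhealthy.Load(), g.inactive.Load())
}
```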
Retain metrics across dropped and re-established connections. If we
delete and unlink the counters and later recreate them, the counts start
at zero after a reconnect. Instead, we track the counters in a map and
reuse them later if the key matches.

Epic: CRDB-41138

Release note: None
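A minimal sketch of the counter-reuse idea from the commit above, assuming a hypothetical `peerKey` and `counterRegistry`; the actual key and metric types in the codebase are not shown in this log.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// peerKey is a hypothetical identity for a connection's metrics.
type peerKey struct {
	nodeID int
	class  string
}

// counterRegistry hands out counters by key and never drops them, so a
// reconnect with the same key continues from the previous value.
type counterRegistry struct {
	mu       sync.Mutex
	counters map[peerKey]*atomic.Int64
}

func newCounterRegistry() *counterRegistry {
	return &counterRegistry{counters: make(map[peerKey]*atomic.Int64)}
}

// acquire returns the existing counter for k, or creates one on first use.
func (r *counterRegistry) acquire(k peerKey) *atomic.Int64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	c, ok := r.counters[k]
	if !ok {
		c = new(atomic.Int64)
		r.counters[k] = c
	}
	return c
}

func main() {
	reg := newCounterRegistry()
	k := peerKey{nodeID: 2, class: "default"}

	reg.acquire(k).Add(10) // first connection sends 10 bytes
	// The connection drops and is re-established with the same key.
	reg.acquire(k).Add(5)

	fmt.Println(reg.acquire(k).Load())
}
```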
Previously we would publish network stats broken down by every remote
node. This would result in a large number of stats for large clusters.
In practice we can aggregate them by remote localities. This reduces
the number of stats with only a minimal loss of visibility into how the
network is being used.

Epic: CRDB-41138

Release note: None
This test flakes under stress/race conditions due to the use of network
ports. Skipping under stress.

Epic: None

Release note: None
For certain tooling it is important to differentiate the locality tag
of the local node from the locality tag of the remote node. Adding both
the local and the destination locality allows those tools to
understand the source and destination of the connections.

Epic: CRDB-41138

Release note: None
This commit adds a new utility which can store and efficiently process a
large number of CIDR records by mapping them to a unique name.

Epic: CRDB-41142

Release note: None
This commit constructs the cidr.Lookup into the sql server
ExecutorConfig and evalContext. Additionally this enables the cidr
mapping and adds a new configuration parameter for it.

Epic: CRDB-41142

Release note (ops change): Adds a new configuration parameter,
`server.cidr_mapping_url`, which maps IPv4 CIDR blocks to arbitrary tag
names.
Previously the writeBuffer required a specific Metric implementation as
one of its parameters. This made it more complicated to change the
metric type that was passed in.

Epic: none

Release note: None
This commit changes the SQL byte metrics to be broken down by cidr block
of the source.

Epic: CRDB-41142

Release note (ops change): Modifies metrics sql.bytesin and sql.bytesout
to be agg metrics if child metrics are enabled.
Previously the test assumed the setting would propagate
synchronously. This could fail under stress and race conditions.

Epic: none

Release note: None
Adds a utility to cidrLookup to create a DialContext that is tracked
based on the lookup. This works for any third-party libraries that
expose a way to set the DialContext.

Epic: none

Release note: None
Unfortunately, the WithHTTPClient option overrides all other options
when constructing a GCS client. As a result, it appears we cannot
both set credential options and set an HTTP client with custom
configs via the primary API.

Here, we construct a transport using the SDK which allows us to attach
the relevant credential options to the transport directly before
making the HTTP client.

The downside here is that the SDK's NewTransport's documentation says
that it is not intended for end-user use -- so we may expect breakage
in the future.

Epic: none

Release note: None
andrewbaptist and others added 27 commits September 13, 2024 16:47
Previously the cidr http check did not return true when it completed.
This adds the true return and additionally adds logging for other
failure cases.

Epic: none

Release note: None
Previously if the `server.cidr_mapping_url` was set and a node
restarted, there was a race condition where `SetOnChange` for the
setting could be called before the `Start` was called. This could result
in it blocking while attempting to submit to the channel.

Fixes: cockroachdb#130589

Release note: None
For changefeeds we need some additional network wrapping methods. This
commit also adds testing to Wrap, WrapTLS and WrapDialer.

Part of: cockroachdb#130097

Epic: none

Release note: None
Adds network tracing infrastructure to changefeeds. This commit adds the
metrics but does not populate them for any of the existing changefeeds.

Part of: cockroachdb#130097

Epic: none

Release note (ops change): This commit adds two metrics
changefeed.network.bytes_in and changefeed.network.bytes_out. These
metrics track the number of bytes sent by the individual changefeeds to
different sinks.
Add support for the cidr network metrics to the kafka v1 sink.

Part of: cockroachdb#130097

Release note (enterprise change): Added network metrics to the kafka v1 sink.
Add network metrics to cdc webhook sinks.

Part of cockroachdb#130097

Release note (enterprise change): Added network metrics to webhook sinks.
Add support for the cidr network metrics to the pubsub sinks.

Part of: cockroachdb#130097

Release note (enterprise change): Added network metrics to the pubsub sinks.
Add support for the cidr network metrics to the sql sink.

Part of: cockroachdb#130097

Release note (enterprise change): Added network metrics to the sql sink.
This commit adds the network metrics for the kafka sink.

Epic: none

Release note: None
….12-rc-130521-130528-130664

release-23.2.12-rc: all of the network metrics
…2.12

Release note: None
Epic: None
Release justification: non-production (release infra) change.
…ort-release-23.2.12-rc-130709

release-23.2.12-rc: util: publish cidr metrics for tenants
The `TestServerController` test server stops quickly (due to deferred stop)
after executing `CREATE TENANT hello` while the creation of the tenant is
ongoing in `newTenantServer`. This causes `baseCfg.CidrLookup.Start` in
`newTenantServer` to fail with `ErrUnavailable` because `s.runPrelude()` in
`stopper.RunAsyncTask` returns true if a server is stopping:
https://github.com/cockroachdb/cockroach/blob/3bf34dc3a192d7efeee8aa97e46bf73f817b2b9b/pkg/util/stop/stopper.go#L469-L471.

Fixes: cockroachdb#130757
Epic: CRDB-42208
Release note: None
…-rc-129827

release-23.2.12-rc: kvserver: compact liveness range periodically
Narrowed down scope of counter filters in order to not catch stray
increment events from background queries.

Resolves: cockroachdb#128045, cockroachdb#128171

Release note: None
This test flakes in cases where we run a query and expect the
`sql.plan.type.force-custom` to not get incremented. This can't be
guaranteed as this is the default counter and it occasionally gets
bumped by background operations.

There's no easy way to prevent these from happening so these cases are
removed from this suite.

Resolves: cockroachdb#128523, cockroachdb#128640
Epic: None

Release note: None
…e-23.2.12-rc-128383-128715

release-23.2.12-rc: telemetryccl_test: fix TestTelemetry
…ort-release-23.2.12-rc-130850

release-23.2.12-rc: server, util: fix failing TestServerController
This commit adds a new changefeed testing knob, AsyncFlushSync, which
can be used to introduce a synchronization point between goroutines
during an async flush. It's currently only used in the cloud storage
sink.

Epic: none

Release note: none
Adds a test that reproduces a memory leak from pgzip, the library used
for fast gzip compression for changefeeds using cloud storage sinks. The
leak was caused by a race condition between Flush/flushTopicVersions and
the async flusher: if the Flush clears files before the async flusher
closes the compression codec as part of flushing the files, and the
flush returns an error, the compression codec will not be closed
properly. This test uses the AsyncFlushSync testing knob to introduce
synchronization points between these two goroutines to trigger the
regression.

Co-authored by: wenyihu6

Epic: none

Release note: none
When using the cloud storage sink with fast gzip and async flush
enabled, changefeeds could leak memory from the pgzip library if a write
error to the sink occurred. This was due to a race condition when
flushing, if the goroutine initiating the flush cleared the files before
the async flusher had cleaned up the compression codec and received the
error from the sink.

This fix clears the files after waiting for the async flusher to finish
flushing the files, so that if an error occurs the files can be closed
when the sink is closed.

Co-authored by: wenyihu6

Epic: none
Fixes: cockroachdb#129947

Release note (bug fix): Fixes a potential memory leak in changefeeds using
a cloud storage sink. The memory leak could occur if both
changefeed.fast_gzip.enabled and
changefeed.cloudstorage.async_flush.enabled are true and the changefeed
received an error while attempting to write to the cloud storage sink.
…12-rc-130204

release-23.2.12-rc: changefeedccl: fix memory leak in cloud storage sink with fast gzip
…ort-release-23.2.12-rc-130789

release-23.2.12-rc: released CockroachDB version 23.2.11. Next version: 23.2.12
Previously the code would encounter an index out of bounds if the cidr
mapping file had a cidr length greater than 32 bits. This could only
happen with IPv6 addresses. Note that if there are any invalid mappings,
the code will display the error in the logs but won't process any of the
file.

The code already handled mapping lookups for IPv6, but these code
changes also make that more explicit.

Epic: none
Informs: cockroachdb#130814

Release note: None
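The all-or-nothing handling of the mapping file mentioned in the commit above can be sketched with `net/netip`, which accepts prefix lengths up to 32 for IPv4 and up to 128 for IPv6. The helper names `parseAll` and `reload` are hypothetical, and keeping the previous mapping on error is an assumption made for this sketch.

```go
package main

import (
	"fmt"
	"net/netip"
)

// parseAll parses every CIDR string, or fails without returning a partial
// result.
func parseAll(cidrs []string) ([]netip.Prefix, error) {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, s := range cidrs {
		p, err := netip.ParsePrefix(s)
		if err != nil {
			return nil, fmt.Errorf("invalid CIDR %q: %w", s, err)
		}
		out = append(out, p.Masked())
	}
	return out, nil
}

// reload swaps in the new mapping only if all of it parsed. On error it
// returns the current mapping unchanged (an assumption of this sketch),
// together with the error for the caller to log.
func reload(current []netip.Prefix, cidrs []string) ([]netip.Prefix, error) {
	next, err := parseAll(cidrs)
	if err != nil {
		return current, err
	}
	return next, nil
}

func main() {
	mapping, err := reload(nil, []string{"10.0.0.0/8", "2001:db8::/64"})
	fmt.Println(len(mapping), err)

	// One bad entry (an IPv4 prefix longer than 32 bits) rejects the whole file.
	mapping, err = reload(mapping, []string{"192.168.0.0/16", "10.0.0.0/40"})
	fmt.Println(len(mapping), err != nil)
}
```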
We don't use the WrapTLS method, so remove it. It can be brought back
if we need it in the future.

Epic: none

Release note: None
…ort-release-23.2.12-rc-131221

release-23.2.12-rc: util: don't panic on IPv6 entries in cidr mapping
…n 23.2.12. Next version: 23.2.13

Release note: None
Epic: None
Release justification: non-production (release infra) change.
@vidit-bhat vidit-bhat merged commit 830c239 into cockroachdb:release-23.2 Sep 30, 2024
5 of 6 checks passed