-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: raft transport metrics #83917
Comments
cc @cockroachdb/replication |
Seems reasonable, but why do we need to add the |
Ah right, it looks like extending the interface might be needed for plumbing. I'll explore it in more detail. |
85680: amazon,externalconn: add s3 support to External Connections r=adityamaru a=adityamaru This change registers s3 as a URI that can be represented as an External Connection. Most notably we take a page from the CDC book and switch the s3 parse function to check for invalid parameters, and configurations. This allows us to catch certain misconfiguration at the time we create the external connection. Informs: #84753 Release note (sql change): Users can now `CREATE EXTERNAL CONNECTION` to represent an s3 URI. 85849: sql/catalog/tabledesc: relaxed validation for virtual col in SUFFIX cols r=Xiang-Gu a=Xiang-Gu One of the validation rule says that "we don't allow virtual columns to be in the SUFFIX columns of a secondary index", except for one case: `ALTER PRIMARY KYE USING HASH`, where the implicitly created virtual, shard column, will need to appear in the SUFFIX columns of the implicitly created unique, secondary index of the old PK key columns ( which btw is a CRDB unique feature). The validation has logic to exempt us from this special case but it's written specifically to the legacy schema changer. Namely, it uses the canonical `PrimaryKeySwap` mutation type as the signal but we don't have that in the declarative schema changer. This PR addresses this issue and allows the validation logic to also exempt the exact same case but in the declarative schema changer. Release note: None 85862: ui: new insights table component r=maryliag a=maryliag Creation of the base insights table, that can be used on both schema insights and statement details page. This commit only introduce the table, with the different types, but still needs the proper cell formating for each type. Part of #83783 Release note: None 85865: rowexec: use the sorter's own context r=yuzefovich a=yuzefovich Release note: none 85866: streamingccl, rowexec: correctly mark eventStream as "streaming" r=yuzefovich a=yuzefovich This commit fixes an incomplete solution of 9bb5d30 which attempted to mark `eventStream` generator builtin as "streaming" so that the columnarizer on top of the `projectSetProcessor` would not buffer anything. As found by Steven, the solution in that commit was incomplete since the generators array is not populated at the time where `MustBeStreaming` check is performed. This commit fixes that oversight by using a different way of propagating the property - via the function properties. Release note: None 85885: kvserver: add queue size metric to RaftTransport r=tbg a=pavelkalinnikov Currently, there is a `RaftEnqueuePending` metric in `StoreMetrics` which exposes the `RaftTransport` outgoing queue size. However, this is a per-`Store` metric, so it duplicates the same variable across all the stores. The metric should be tracked on a per-node/transport basis. This PR introduces a per-transport `SendQueueSize` metric to replace the old `RaftEnqueuePending`. It also does the necessary plumbing to include it to the exported metrics, since it's the first metric in `RaftTransport`. <img width="1350" alt="Screenshot 2022-08-10 at 15 37 47" src="https://user-images.githubusercontent.com/3757441/183929666-0f6523d7-5877-4f2f-8460-4f013f87966d.png"> Touches #83917 Release note: None 85906: eval: don't always ignore error from ResolveFunctionByOID r=chengxiong-ruan a=rafiss This addresses a comment from the review of #85656. Release note: None 85914: systemschema_test: fix timestamp stripping regex r=postamar a=postamar The test in this package was flaky because hlc.Timestamp values inside descriptors (modification time, for instance) with a logical component would be improperly redacted. This commit fixes this. Fixes #85799. Release note: None Co-authored-by: Aditya Maru <[email protected]> Co-authored-by: Xiang Gu <[email protected]> Co-authored-by: Marylia Gutierrez <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Pavel Kalinnikov <[email protected]> Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Marius Posta <[email protected]>
84107: sql, builtins: implement format r=[rafiss] a=HonoreDB Closes #68330. Unfortunately hooking into fmt by implementing the Formatter interface doesn't give us the postgres-style syntax, so instead I've implemented my own very stripped-down version of fmt by copying golang's. Release note (sql change): Added the format builtin function. format interpolates arguments into a string in the style of C's sprintf. For example, format('Hello, %s', 'world') returns 'Hello, world'. 85580: storage: remove key comparison in MVCC stats computations r=jbowens a=erikgrinaker This patch restructures MVCC stats computations to use an MVCC iterator with appropriate bounds. This allows omitting a key comparison in the hot path, yielding a ~10% performance improvement. The new stats functions are: * `ComputeStats`: calculates stats for a `Reader` key span. * `ComputeStatsWithVisitors`: calculates stats for a `Reader` key span, calling the given visitor callbacks for every point/range key. * `ComputeStatsForIter`: calculates stats for a given `MVCCIterator`, by scanning it until exhausted. It also removes the `MVCCIterator.ComputeStats` method, which had no business being part of the interface. ``` name old time/op new time/op delta MVCCComputeStats_Pebble/valueSize=64-24 130ms ± 1% 119ms ± 1% -8.77% (p=0.000 n=10+10) name old speed new speed delta MVCCComputeStats_Pebble/valueSize=64-24 516MB/s ± 1% 565MB/s ± 1% +9.61% (p=0.000 n=10+10) ``` Resolves #41899. Touches #84544. Release note: None 85778: sql/catalog/tabledesc: fixed a bug in `maybeAddConstraintIDs` r=Xiang-Gu a=Xiang-Gu Previously, when we RunPostDeserializationChanges, one of the step is to fill in constraint IDs if they are not already set, which is done in function `maybeAddConstraintIDs`. That function however is buggy in that, if there is constraint mutation on the table descriptor, we directly go assign a constraint ID (from `tbl.nextConstraintID`) to it, without checking whether it already has a non-zero constraint ID or not. This PR fixes that. Release note: None 85943: sql: don't create idx recommendations for internal queries r=maryliag a=maryliag Previously we were generating recommendations for internal queries. This commit adds a check and if is internal, it won't generate a recommendation. Partially addresses #85934 Release note: None 85951: kvserver: replace ad-hoc stats in RaftTransport by metrics r=tbg a=pavelkalinnikov These stats were only printed in text to the logs, and hard to use and match with other existing metrics in monitoring. There is a queue size metric introduced in 5d51ab3 which properly exports the `queue` counter to monitoring. The exported graphs allow observing both the current, and historical queue sizes (including the approx. max size), so this PR safely removes both stats as obsolete. This PR also similarly replaces all the remaining ad-hoc counters with proper metrics, and removes the plumbing and data structures that were previously maintaining those stats. The granularity of the new metrics is per-node/transport. This is due to the metrics infrastructure not being particularly good at handling high cardinality combinations of labels such as O(N^2) node ID pairs. Part of #83917 Release note: None Co-authored-by: Aaron Zinger <[email protected]> Co-authored-by: Erik Grinaker <[email protected]> Co-authored-by: Xiang Gu <[email protected]> Co-authored-by: Marylia Gutierrez <[email protected]> Co-authored-by: Pavel Kalinnikov <[email protected]>
86039: kvserver: record paused replica message drops to a metric r=tbg a=pavelkalinnikov Part of #83917 Release justification: The metric may help investigating issues with paused replicas Release note: None 86200: sql/schemachanger: Initial support for generating a corpus r=fqazi a=fqazi These changes introduce the following: 1. Support for gathering declarative schema changer states (corpus) from logictests to help backwards/mixed version compatibility testing 2. Debug command support for replaying corpuses 3. Optional test for testing a generated corpus when a parameter is specified Release justification: Mainly test enhancements that are disabled by default, which will allow us to have an automated workload to test declarative schema changer state between versions. Co-authored-by: Pavel Kalinnikov <[email protected]> Co-authored-by: Faizan Qazi <[email protected]>
@tbg The stats->metrics replacement is done. We can also explore what other Raft transport metrics would be useful. Previously we discussed that it would be nice to track the queue byte size, in addition to counting the messages. I think we can do it as part of this issue. WDYT? |
Made an issue for you :-) Let's close this one. |
Is your feature request related to a problem? Please describe.
We currently have metrics for received messages:
cockroach/pkg/kv/kvserver/metrics.go
Lines 1589 to 1609 in f744376
but nothing for outgoing messages. In particular, we don't expose the dropping of outgoing messages, which can occur in
cockroach/pkg/kv/kvserver/raft_transport.go
Line 677 in ac77889
and
cockroach/pkg/kv/kvserver/raft_transport.go
Lines 609 to 615 in 7cfa5d4
as well as
cockroach/pkg/kv/kvserver/replica_raft.go
Lines 1503 to 1505 in 571bfa3
Describe the solution you'd like
Let
RaftMessageHandler
export aMetrics() *RaftTransportMetrics
method where*RaftTransportMetrics
absorbs the recv-related fields in the snippet above, and also starts tracking metrics for outgoing fields.Retire (i.e. delete) the existing internal
*raftTransportStats
which weren't exposed anywhere (outside ofV(1)
logging) and are difficult to map onto our metrics model where nodes/stores only report datapoints about themselves. (The stats are reporting for the remote node).Describe alternatives you've considered
Additional context
Came up in #83851 where we're intentionally dropping messages.
Jira issue: CRDB-17351
The text was updated successfully, but these errors were encountered: