Enable region filter on SQL Activity for Serverless #89949
Labels
A-sql-observability
Related to observability of the SQL layer
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Comments
maryliag added the C-enhancement, A-sql-observability, and T-sql-observability labels on Oct 13, 2022
@andy-kimball on dedicated we show both region and nodes in the same column. Can you confirm that for Serverless it should be just the region, and we should continue to hide the node info? Or do you also want to see the node info (filter and column)?

Correct, we only want to select the region, not the node.
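The decision above is to surface only the region tier of a locality and keep node-level detail hidden on Serverless. As a minimal sketch of that idea, the helper below extracts just the "region" tier from a comma-separated locality string such as `region=us-east1,az=us-east1-b`. The function name and parsing are illustrative assumptions, not the actual CockroachDB API:

```go
package main

import (
	"fmt"
	"strings"
)

// regionFromLocality returns only the "region" tier of a locality
// string, dropping node-level tiers such as "az". Hypothetical helper
// for illustration.
func regionFromLocality(locality string) string {
	for _, tier := range strings.Split(locality, ",") {
		if kv := strings.SplitN(tier, "=", 2); len(kv) == 2 && kv[0] == "region" {
			return kv[1]
		}
	}
	return "" // no region tier present
}

func main() {
	fmt.Println(regionFromLocality("region=us-east1,az=us-east1-b"))
}
```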
craig bot pushed a commit that referenced this issue on Nov 16, 2022
90244: pkg/ui: Make tracez v2 page node-aware r=benbardin a=benbardin

Release note: None Demo video below. Note that it starts on the v1 trace page, and I navigate to the new one by typing in the address bar. https://user-images.githubusercontent.com/261508/196736069-952dadde-dcb7-4060-8563-e98727b48a48.mov Part of #83679

92007: ui: tidy some SQL Activity code r=matthewtodd a=matthewtodd

Cleanup encountered along the way to adding region columns & filters for serverless. Simplest to review commit by commit. The changes are small, but the commit message notes help add context. Part of #89949

Co-authored-by: Ben Bardin <[email protected]> Co-authored-by: Matthew Todd <[email protected]>
This was referenced Nov 21, 2022
craig bot pushed a commit that referenced this issue on Dec 9, 2022
93268: statusccl: temporarily serve /_status/nodes to tenants r=matthewtodd a=matthewtodd

Fixes #92065 Part of #89949

Our current UI architecture has many of the SQL Activity pages mapping node or SQL instance IDs to regions at page load time, using the response from this `/_status/nodes` endpoint to get the information it needs. This approach is problematic for ephemeral serverless tenants and will need a broader reimagining, probably ending with us storing locality information directly in many of the sqlstats tables, which will probably require something like sending instance localities along in the distsql tracing. An initial exploration in this direction happened in #92259.

In the meantime, as a stop-gap measure, we implement a reduced version of this `/_status/nodes` endpoint for tenants, keeping the SQL Activity pages somewhat working for multiregion tenants. Release note: None

Co-authored-by: Matthew Todd <[email protected]>
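The "map node IDs to regions at page load time" approach described in the commit above can be sketched as follows. The `nodeStatus` type and field names here are simplified stand-ins for entries in the `/_status/nodes` response, not the actual protobuf schema:

```go
package main

import "fmt"

// nodeStatus is a pared-down, illustrative stand-in for an entry in
// the /_status/nodes response.
type nodeStatus struct {
	NodeID   int
	Locality map[string]string // e.g. {"region": "us-east1"}
}

// regionsByNodeID builds the node-to-region mapping the SQL Activity
// pages compute from the /_status/nodes response at page load time.
// This breaks when an ephemeral tenant instance has already spun down
// and no longer appears in the response.
func regionsByNodeID(nodes []nodeStatus) map[int]string {
	m := make(map[int]string, len(nodes))
	for _, n := range nodes {
		m[n.NodeID] = n.Locality["region"]
	}
	return m
}

func main() {
	nodes := []nodeStatus{
		{NodeID: 1, Locality: map[string]string{"region": "us-east1"}},
		{NodeID: 2, Locality: map[string]string{"region": "eu-west1"}},
	}
	fmt.Println(regionsByNodeID(nodes)[2])
}
```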
craig bot pushed a commit that referenced this issue on Dec 19, 2022
90830: gc: use clear range operations to remove multiple garbage keys across keys and versions r=erikgrinaker a=aliher1911

Summary: GC will track consecutive keys and issue GCClearRange requests to remove multiple keys at once if it can find more than a configured number of keys. This covers multiple versions of the same key as well as multiple consecutive keys.

On the GC side, it counts consecutive keys while building "points batches", i.e. batches containing MVCC keys to delete. Those batches are capped to fit into the resulting raft entries once the deletion is executed, so GC may accumulate more than a single batch in memory. Once the necessary number of consecutive keys is found, a GCClearRange is issued and pending batches are discarded. If non-gc-able data is found, pending batches are GC'd and clear range tracking is restarted. To protect GC from consuming too much memory, for example when collecting a range with very large keys, it uses a configurable memory limit on how many key bytes can be held in pending batches. If this limit is reached before GC reaches the minimum desired number of keys, the oldest batch is GC'd and the starting point for the clear range is adjusted.

On the evaluation side, GC requests for ranges obtain write latches over the removed range to prevent other requests from writing in the middle of it. However, if only a single key is touched (i.e. multiple versions are deleted), no latches are obtained, just as when deleting individual versions, since all the removed data is below the GC threshold.

GC exposes a metric `queue.gc.info.numclearconsecutivekeys` that shows how many times GCClearRange was used, and adds a new system setting `kv.gc.clear_range_min_keys` that sets the minimum number of keys eligible for GCClearRange.

---- batcheval: handle ClearSubRange request fields in GC requests

Handle ClearSubRange GC request fields to use GCClearRange storage functions for ranges of keys that contain no visible data, instead of removing each key individually. This PR only enables processing of the request field; the actual GC and storage functionality was added in previous PRs. Release note: None

---- gc: add version gate to GC clear range requests

ClearRange fields of GCRequests are ignored by earlier versions, so this feature needs to be version gated to let GC work when talking to a previous version if GC is run from a different node. Release note: None

---- gc: add metrics for clear range requests in GC

This commit adds a metric for clear range functionality in GC: queue.gc.info.numclearconsecutivekeys is incremented every time GC sends a ClearRange request for a run of consecutive keys to a replica. Release note (ops change): Metric queue.gc.info.numclearconsecutivekeys added. It exposes the number of clear range requests issued by MVCC garbage collection to remove consecutive keys.

---- gc: GC use delete_range_keys for consecutive ranges of keys

When deleting large chunks of keys, usually produced by range tombstones, GC previously used point deletion for all versions found within the replica key range. That kind of operation generates unnecessary load on storage because each individual version must be deleted independently. This patch adds the ability for GC to find ranges of consecutive keys which are not interleaved with any newer data and to create delete_range_key GC requests to remove them. Release note: None

---- rditer: helper to visit multiple replica key spans

Add an IterateMVCCReplicaKeySpans helper to complement IterateReplicaKeySpans. Existing tests are moved from ReplicaMVCCDataIterator to IterateMVCCReplicaKeySpans, as the former is being sunsetted and only a single call site exists. Release note: None

---- storage: add gc operation that uses clear range

Deleting multiple versions of the same key or continuous ranges of point keys with pebble tombstones can be inefficient when a range has lots of garbage. This PR adds a storage GC operation that uses pebble range tombstones to remove data. Release note: None

---- gc: add delete range parameters to gc request

This commit adds clear_sub_range_key parameters to the GC request. clear_sub_range_keys is used by the GC queue to remove chunks of consecutive keys where no new data was written above the GC threshold and storage can optimize deletions with range tombstones. To support the new types of keys, the GCer interface is also updated to pass the provided keys down to the request. Release note: None

Fixes: #84560, #61455

93732: dev: refactor `doctor` r=healthy-pod a=rickystewart

Create a new interactive `dev doctor` experience, so you don't have to copy-paste commands from `dev doctor` and it can solve problems for you. `dev doctor` is now interactive by default and asks for user input to confirm the actions it's going to take. You can also set `--autofix` to automatically agree to some default actions without having to interact. Interactive doctor can be disabled with `--interactive=false`. Epic: none Release note: None

93814: workload: fix TPC-H generator to generate correct values for p_name and ps_supplycost r=rytaft a=rytaft

**workload: fix TPC-H generator to generate correct values for p_name**

The TPC-H generator creates values for `p_name` in the `part` table by selecting 5 elements from a list of words called `randPartNames`. It is supposed to select 5 random unique elements from the full list, but prior to this commit, it only selected a permutation of the first 5 elements. This commit fixes the logic and adds a unit test. Epic: None Release note: None

**workload: fix TPC-H generator value of ps_supplycost**

Prior to this commit, the TPC-H generator was generating a value between 0 and 10 for `ps_supplycost` in the `partsupp` table. However, the spec calls for a random value in the range `[1.00 .. 1,000.00]`. This commit fixes the oversight.
Epic: None Release note: None

93852: build-and-publish-patched-go: upgrade from 1.19.1 to 1.19.4 r=ajwerner a=ajwerner

This picks up a mix of security and compiler fixes. The compiler fixes I'm interested in relate to generic data structures, but the other fixes seem worthwhile too.

* [x] Adjust the Pebble tests to run in the new version.
* [x] Adjust the version in the Docker image ([source](./builder/Dockerfile)).
* [x] Adjust the version in the TeamCity agent image ([setup script](./packer/teamcity-agent.sh)).
* [x] Rebuild and push the Docker image (following [Basic Process](#basic-process)).
* [x] Update `build/teamcity/internal/release/build-and-publish-patched-go/impl.sh` with the new version and adjust SHA256 sums as necessary.
* [x] Run the `Internal / Release / Build and Publish Patched Go` build configuration in TeamCity with your latest version of the script above. This will print out the new URLs and SHA256 sums for the patched Go that you built above.
* [x] Bump the version in `WORKSPACE` under `go_download_sdk`. You may need to bump [rules_go](https://github.com/bazelbuild/rules_go/releases). Also edit the filenames listed in `sdks` and update all the hashes to match what you built in the step above.
* [x] Run `./dev generate bazel` to refresh `distdir_files.bzl`, then `bazel fetch @distdir//:archives` to ensure you've updated all hashes to the correct value.
* [x] Bump the version in `builder.sh` accordingly ([source](./builder.sh#L6)).
* [x] Bump the version in `go-version-check.sh` ([source](./go-version-check.sh)), unless bumping to a new patch release.
* [x] Bump the Go version in `go.mod`.
* [x] Bump the default installed version of Go in `bootstrap-debian.sh` ([source](./bootstrap/bootstrap-debian.sh)).
* [x] Replace other mentions of the older version of Go (grep for `golang:<old_version>` and `go<old_version>`).
* [ ] Update the `builder.dockerImage` parameter in the TeamCity [`Cockroach`](https://teamcity.cockroachdb.com/admin/editProject.html?projectId=Cockroach&tab=projectParams) and [`Internal`](https://teamcity.cockroachdb.com/admin/editProject.html?projectId=Internal&tab=projectParams) projects.
* [ ] Ask the Developer Infrastructure team to deploy new TeamCity agent images according to [packer/README.md](./packer/README.md).

Epic: None Release note: None

93914: ui: refresh nodes on tenants r=matthewtodd a=matthewtodd

Part of #89949 Now that we can show meaningful region information for tenants, we need to actually trigger the fetching of that information. Release note: None

Co-authored-by: Oleg Afanasyev <[email protected]> Co-authored-by: Ricky Stewart <[email protected]> Co-authored-by: Rebecca Taft <[email protected]> Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: Matthew Todd <[email protected]>
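The consecutive-key tracking in the 90830 gc commit above boils down to: point-delete garbage keys until a long enough run of consecutive gc-able keys is found, then switch to a single clear-range operation; live data interrupts the run. A simplified sketch of that decision logic, with illustrative names rather than the actual GC code (batch memory limits and raft-entry capping are omitted):

```go
package main

import "fmt"

// gcPlan separates garbage into individually deleted keys and
// [start, end] spans removed with one clear-range each.
type gcPlan struct {
	pointKeys   []string
	clearRanges [][2]string
}

// planGC walks keys in order, accumulating runs of consecutive garbage
// keys. A run of at least minRunForClearRange becomes a clear-range;
// shorter runs fall back to point deletions. Live keys reset the run.
func planGC(keys []string, garbage func(string) bool, minRunForClearRange int) gcPlan {
	var plan gcPlan
	var run []string
	flush := func() {
		if len(run) >= minRunForClearRange {
			plan.clearRanges = append(plan.clearRanges, [2]string{run[0], run[len(run)-1]})
		} else {
			plan.pointKeys = append(plan.pointKeys, run...)
		}
		run = nil
	}
	for _, k := range keys {
		if garbage(k) {
			run = append(run, k)
		} else if len(run) > 0 {
			flush() // live data interrupts the consecutive run
		}
	}
	if len(run) > 0 {
		flush()
	}
	return plan
}

func main() {
	garbage := func(k string) bool { return k != "d" } // "d" is still live
	plan := planGC([]string{"a", "b", "c", "d", "e"}, garbage, 3)
	fmt.Println(plan.pointKeys, plan.clearRanges)
}
```

Here the run a..c is long enough for a clear-range, while the lone garbage key e after the live key d is point-deleted.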
craig bot pushed a commit that referenced this issue on Dec 21, 2022
89613: gossip: remove frequent gossiping of gossip client connections r=irfansharif a=a-robinson

These gossip-clients keys make up two thirds or more of the gossip traffic in various large clusters I've inspected, consuming almost an entire CPU core in the worst case I've seen. They don't provide enough value to justify that sort of ongoing cost, so this commit removes them entirely, as well as the periodic logging of the gossip network and the crdb_internal.gossip_network table, both of which relied on them.

Release note (backward-incompatible change): We've stopped supporting/populating the crdb_internal.gossip_network table. It was an internal table with no API guarantees (so perhaps not meriting a release note?). Release note (performance improvement): Significantly reduced CPU usage of the underlying gossip network in large clusters.

Informs #51838 (largely fixes it for practical purposes, although there's likely still more that could be done). This is clearly going to break [the gossip roachtest](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/gossip.go#L50-L58), but between `@irfansharif` kindly volunteering to fix that up separately and his existing TODO in that file, I've left that out of this change. I don't know if completely removing the gossip_network table is really the best idea, or if it should just be left in and only populated with the clients from the local node. For example, when run in a mixed version cluster, does `debug zip` run all of its SQL commands against the local node, or does it run some against remote nodes? If an old node ever tries to query the `gossip_network` table on a different node, it could have a bad time. `@irfansharif` `@kvoli`

93985: ui: degrade gracefully when regions aren't known r=matthewtodd a=matthewtodd

Part of #89949 Previously, when a tenant SQL instance had spun down (leaving us no way to remember which region it had been in), the SQL Activity pages would claim that statements and transactions had occurred in an "undefined" region. This change moves from saying "undefined" to saying nothing at all, a slightly nicer user experience. This broader problem of losing the region mapping has been described in #93268; we'll begin addressing it shortly. Release note: None

93989: leaktest: exclude long running logging goroutines r=srosenberg a=nicktrav

The `leaktest` package detects potential goroutine leaks by snapshotting the set of goroutines running when `leaktest.AfterTest(t)` is called, returning a closure, and comparing against the set of goroutines when the closure is called (typically `defer`'d). A race condition was uncovered in #93849 whereby logging-related goroutines scheduled by an `init` function in `pkg/util/logging` can sometimes be spawned _after_ the `AfterTest` function is run. When the test completes and the closure is run, the test fails due to a difference in the before/after goroutine snapshots. This mode of failure is a false positive: the logging goroutines are intended to live for the duration of the process, but exactly _when_ the goroutines scheduled in the `init` functions actually start running, and hence show up in the goroutine snapshots, is non-deterministic. Exclude the logging goroutines from the `leaktest` checks to reduce the flakiness of tests. Closes #93849. Release note: None.

Epic: CRDB-20293

Co-authored-by: Alex Robinson <[email protected]> Co-authored-by: Matthew Todd <[email protected]> Co-authored-by: Nick Travers <[email protected]>
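The leaktest mechanism described in the 93989 commit above (snapshot before, closure compares after, with an exclusion list for expected long-lived goroutines) can be illustrated with a small sketch. The helper name and the string labels standing in for goroutine stacks are hypothetical; this is not the real `leaktest` API:

```go
package main

import (
	"fmt"
	"sort"
)

// afterSnapshot records the goroutines (here, arbitrary labels)
// present now and returns a closure that reports anything new that is
// not covered by the excluded predicate, mirroring how leaktest
// compares before/after snapshots and how the commit above excludes
// long-running logging goroutines.
func afterSnapshot(before []string, excluded func(string) bool) func(after []string) []string {
	seen := make(map[string]bool, len(before))
	for _, g := range before {
		seen[g] = true
	}
	return func(after []string) []string {
		var leaked []string
		for _, g := range after {
			if !seen[g] && !excluded(g) {
				leaked = append(leaked, g)
			}
		}
		sort.Strings(leaked)
		return leaked
	}
}

func main() {
	check := afterSnapshot([]string{"test-main"}, func(g string) bool { return g == "log-flusher" })
	// "log-flusher" appears after the snapshot but is excluded; "worker" is a leak.
	fmt.Println(check([]string{"test-main", "log-flusher", "worker"}))
}
```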
adityamaru pushed a commit to adityamaru/cockroach that referenced this issue on Dec 22, 2022
Part of cockroachdb#89949 Previously, when a tenant SQL instance had spun down (leaving us no way to remember which region it had been in), the SQL Activity pages would claim that statements and transactions had occurred in an "undefined" region. This change moves from saying "undefined" to saying nothing at all, a slightly nicer user experience. This broader problem of losing the region mapping has been described in cockroachdb#93268; we'll begin addressing it shortly. Release note: None
This was referenced Jan 11, 2023
craig bot pushed a commit that referenced this issue on Mar 6, 2023
95449: sqlstats: include regions in statement_statistics r=matthewtodd a=matthewtodd

Part of #89949. This completes the work abandoned in #92259, properly storing the regions in which a statement was executed, rather than waiting until UI render time to try to map (dangerously ephemeral) node IDs to their respective regions. With this work in place, we will be able to remove both the UI mapping code (usages of this [selector][]) and the stop-gap work of #93268.

[selector]: https://github.com/cockroachdb/cockroach/blob/0f6333cbd6c78264a59dc435324c0c33d75ea148/pkg/ui/workspaces/cluster-ui/src/store/nodes/nodes.selectors.ts#L52-L64

Release note (sql change): A regions field was added to the statistics column of crdb_internal.statement_statistics, reporting the regions of the nodes on which the statement was executed.

Co-authored-by: Matthew Todd <[email protected]>
craig bot pushed a commit that referenced this issue on Apr 26, 2023
102192: sqlstats: only include local region in statement_statistics r=j82w a=matthewtodd

Part of #89949. Addresses #98020. Addresses #99563. Related to cockroachdb/roachperf#129. Related to #102170.

Previously, we attempted to record all the regions hit in a single statement execution in the sqlstats tables, leaning on the sqlAddressResolver to map traced nodeIDs to localities at execution time. While the sqlAddressResolver is generally non-blocking, the introduction of this code did cause some of the multiregion "this query shouldn't span regions" tests to start [flaking][], and it's more recently been [implicated][] in a 2.5% performance regression. Given that the probabilistic nature of the tracing meant we generally weren't capturing all the relevant nodeIDs anyway, the most prudent thing to do here is to take a step back and regroup. In the short term, let's stop trying to gather all these regions. In the medium/long term, let's see if we can find a better approach.

[flaking]: #98020
[implicated]: https://github.com/cockroachdb/roachperf/pull/129

Release note: None

Co-authored-by: Matthew Todd <[email protected]>
This was referenced Apr 26, 2023
Currently, the filter for regions on the SQL Activity pages is disabled. Once Multi-Region is enabled on Serverless, this filter should be enabled.
The column showing this info should also be enabled.
Jira issue: CRDB-20510
Epic CRDB-20437