
Enable region filter on SQL Activity for Serverless #89949

Closed
maryliag opened this issue Oct 13, 2022 · 2 comments · Fixed by #92357
Labels
A-sql-observability Related to observability of the SQL layer C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@maryliag
Contributor

maryliag commented Oct 13, 2022

Currently, the region filter on the SQL Activity pages is disabled. Once Multi-Region is enabled on Serverless, this filter should be enabled.
The column showing this information should also be enabled.

Jira issue: CRDB-20510

Epic CRDB-20437

@maryliag maryliag added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-observability Related to observability of the SQL layer T-sql-observability labels Oct 13, 2022
@maryliag
Contributor Author

@andy-kimball on Dedicated we show both regions and nodes in the same column. Can you confirm that for Serverless it should be just the region, and that we should continue to hide the node info? Or do you also want to see the node info (filter and column)?

@andy-kimball
Contributor

Correct, we only want to select the region, not the node.

craig bot pushed a commit that referenced this issue Nov 16, 2022
90244: pkg/ui: Make tracez v2 page node-aware r=benbardin a=benbardin

Release note: None

Demo video below. Note that it starts on the v1 trace page, and I navigate to the new one by typing in the address bar.

https://user-images.githubusercontent.com/261508/196736069-952dadde-dcb7-4060-8563-e98727b48a48.mov

Part of #83679


92007: ui: tidy some SQL Activity code r=matthewtodd a=matthewtodd

Cleanup encountered along the way to adding region columns & filters for serverless.

It's simplest to review commit by commit. The changes are small, but the commit messages help add context.

Part of #89949

Co-authored-by: Ben Bardin <[email protected]>
Co-authored-by: Matthew Todd <[email protected]>
craig bot pushed a commit that referenced this issue Dec 9, 2022
93268: statusccl: temporarily serve /_status/nodes to tenants r=matthewtodd a=matthewtodd

Fixes #92065
Part of #89949

Our current UI architecture has many of the SQL Activity pages mapping node or SQL instance IDs to regions at page load time, using the response from this `/_status/nodes` endpoint to get the information they need.

This approach is problematic for ephemeral serverless tenants and will need a broader reimagining, probably ending with us storing locality information directly in many of the sqlstats tables, which will probably require something like sending instance localities along in the DistSQL tracing. An initial exploration in this direction happened in #92259.

In the meantime, as a stop-gap measure, we implement a reduced version of this `/_status/nodes` endpoint for tenants, keeping the SQL Activity pages somewhat working for multiregion tenants.
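To illustrate the mapping involved, here is a minimal Go sketch of deriving a node-to-region lookup from a `/_status/nodes`-style response; the types are simplified stand-ins, not the actual `serverpb` definitions:

```go
// Sketch only: Tier and NodeStatus are simplified stand-ins for the
// locality structures in a /_status/nodes response.
package main

import "fmt"

type Tier struct{ Key, Value string }

type NodeStatus struct {
	NodeID   int
	Locality []Tier
}

// regionsByNodeID extracts each node's "region" locality tier, if any,
// mirroring what the SQL Activity pages do at page load time.
func regionsByNodeID(nodes []NodeStatus) map[int]string {
	m := make(map[int]string, len(nodes))
	for _, n := range nodes {
		for _, t := range n.Locality {
			if t.Key == "region" {
				m[n.NodeID] = t.Value
				break
			}
		}
	}
	return m
}

func main() {
	nodes := []NodeStatus{
		{NodeID: 1, Locality: []Tier{{Key: "region", Value: "us-east1"}}},
		{NodeID: 2, Locality: []Tier{{Key: "region", Value: "europe-west1"}}},
	}
	fmt.Println(regionsByNodeID(nodes)) // map[1:us-east1 2:europe-west1]
}
```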

Release note: None

Co-authored-by: Matthew Todd <[email protected]>
@craig craig bot closed this as completed in 297b73b Dec 14, 2022
craig bot pushed a commit that referenced this issue Dec 19, 2022
90830: gc: use clear range operations to remove multiple garbage keys across keys and versions r=erikgrinaker a=aliher1911

Summary:

GC will track consecutive keys and issue GCClearRange requests to remove multiple keys at once if it can find more than a configured number of keys. This covers multiple versions of the same key as well as multiple consecutive keys.

On the GC side, it counts consecutive keys while building "points batches", i.e. batches containing MVCC keys to delete. Those batches are capped so that they fit into the resulting Raft entries once the deletion is executed, so GC may accumulate more than a single batch in memory. Once the necessary number of consecutive keys is found, a GCClearRange is issued and the pending batches are discarded. If non-gc-able data is found, the pending batches are GC'd and clear-range tracking is restarted.
To protect GC from consuming too much memory, for example when it collects a range with very large keys, it uses a configurable memory limit on how many key bytes may be held in pending batches. If this limit is reached before GC finds the minimum desired number of consecutive keys, the oldest batch is GC'd and the starting point for the clear range is adjusted.
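As a rough sketch of that batching logic (simplified types, with plain string keys and prints standing in for the real MVCC machinery and requests):

```go
// Simplified sketch of the consecutive-key tracking described above.
// Real GC operates on MVCC keys and issues GCClearRange requests; here
// plain strings and prints stand in for both.
package main

import "fmt"

type gcRun struct {
	minClearRangeKeys int      // cf. the kv.gc.clear_range_min_keys setting
	maxPendingBytes   int      // memory cap on buffered point-delete keys
	pending           []string // "points batches", flattened to one slice
	pendingBytes      int
	rangeStart        string // first key of the candidate clear range
}

func (g *gcRun) addGarbageKey(key string) {
	if len(g.pending) == 0 {
		g.rangeStart = key
	}
	g.pending = append(g.pending, key)
	g.pendingBytes += len(key)
	// Memory limit reached before the clear-range threshold: GC the
	// oldest buffered keys and advance the candidate starting point.
	for g.pendingBytes > g.maxPendingBytes && len(g.pending) > 1 {
		g.pendingBytes -= len(g.pending[0])
		fmt.Println("point-delete:", g.pending[0])
		g.pending = g.pending[1:]
		g.rangeStart = g.pending[0]
	}
	// Enough consecutive keys: issue one clear range, drop the batches.
	if len(g.pending) >= g.minClearRangeKeys {
		fmt.Printf("clear-range: [%s, %s]\n", g.rangeStart, key)
		g.pending, g.pendingBytes = nil, 0
	}
}

// addLiveKey models hitting non-gc-able data: pending batches are GC'd
// as point deletions and clear-range tracking restarts.
func (g *gcRun) addLiveKey() {
	for _, k := range g.pending {
		fmt.Println("point-delete:", k)
	}
	g.pending, g.pendingBytes = nil, 0
}

func main() {
	g := &gcRun{minClearRangeKeys: 3, maxPendingBytes: 1 << 10}
	for _, k := range []string{"a", "b", "c"} { // prints clear-range: [a, c]
		g.addGarbageKey(k)
	}
	g.addLiveKey()
}
```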

On the evaluation side, GC requests for ranges obtain write latches over the removed range to prevent other requests from writing in the middle of it. However, if only a single key is touched, i.e. multiple versions of one key are deleted, then no latches are obtained, just as when deleting individual versions, since all the removed data is below the GC threshold.

GC exposes a metric, `queue.gc.info.numclearconsecutivekeys`, that shows how many times GCClearRange was used.
GC adds a new cluster setting, `kv.gc.clear_range_min_keys`, that sets the minimum number of consecutive keys eligible for GCClearRange.


----

batcheval: handle ClearSubRange request fields in GC requests

Handle the ClearSubRange GC request fields, using the GCClearRange storage
functions for ranges of keys that contain no visible data instead of
removing each key individually.
This PR only enables processing of the request fields, while the actual
GC and storage functionality was added in previous PRs.

Release note: None

----

gc: add version gate to GC clear range requests

The ClearRange fields of GCRequests are ignored by earlier versions,
so this feature needs to be version-gated to let GC keep working when
talking to a previous version, e.g. when GC is run from a different node.

Release note: None
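Schematically, such a gate follows the usual cluster-version pattern; this is a sketch with stand-in types and a hypothetical gate key name, not the actual clusterversion code:

```go
// Sketch only: versionHandle and gcUseClearRange are stand-ins for the
// real clusterversion machinery and constant.
package main

import (
	"context"
	"fmt"
)

type versionHandle struct{ active map[string]bool }

func (h versionHandle) IsActive(_ context.Context, key string) bool { return h.active[key] }

const gcUseClearRange = "GCClearRange" // hypothetical gate key

func buildGCRequest(ctx context.Context, h versionHandle) string {
	if h.IsActive(ctx, gcUseClearRange) {
		// All nodes are new enough to understand the new fields.
		return "GCRequest with ClearRange fields set"
	}
	// Mixed-version cluster: an older node would silently ignore the
	// ClearRange fields, so fall back to per-key deletions.
	return "GCRequest with per-key deletions only"
}

func main() {
	h := versionHandle{active: map[string]bool{gcUseClearRange: true}}
	fmt.Println(buildGCRequest(context.Background(), h))
}
```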

----

gc: add metrics for clear range requests in GC

This commit adds metrics for the clear range functionality in GC:

queue.gc.info.numclearconsecutivekeys is incremented every time GC sends
a GCClearRange request for a run of consecutive keys to a replica.

Release note (ops change): The metric queue.gc.info.numclearconsecutivekeys
was added. It exposes the number of clear range requests issued by MVCC
garbage collection to remove consecutive keys.

----

gc: use delete_range_keys for consecutive ranges of keys

When deleting large chunks of keys, usually produced by range tombstones,
GC previously used point deletions for all versions found within the
replica key range.
That kind of operation generates unnecessary load on the storage layer
because each individual version must be deleted independently.
This patch adds the ability for GC to find ranges of consecutive keys
which are not interleaved with any newer data and to create
delete_range_key GC requests to remove them.

Release note: None

----

rditer: helper to visit multiple replica key spans

Add an IterateMVCCReplicaKeySpans helper to complement
IterateReplicaKeySpans. Existing tests are moved from
ReplicaMVCCDataIterator to IterateMVCCReplicaKeySpans, as the former is
being sunsetted and only a single call site exists.

Release note: None

----

storage: add gc operation that uses clear range

Deleting multiple versions of the same key, or contiguous ranges of
point keys, with individual Pebble tombstones can be inefficient when a
range has lots of garbage.
This commit adds a storage GC operation that uses Pebble range
tombstones to remove the data.

Release note: None

----

gc: add delete range parameters to gc request

This commit adds clear_sub_range_keys parameters to the GC request.
clear_sub_range_keys is used by the GC queue to remove chunks of
consecutive keys where no new data was written above the GC
threshold, so that storage can optimize deletions with range
tombstones.
To support the new types of keys, the GCer interface is also updated
to pass the provided keys down to the request.

Release note: None

Fixes: #84560, #61455

93732: dev: refactor `doctor` r=healthy-pod a=rickystewart

Create a new interactive `dev doctor` experience, so you don't have to copy-paste commands from `dev doctor`; it can solve problems for you.

Now `dev doctor` is interactive by default and asks for user input to confirm the actions it's going to take. You can also set `--autofix` to automatically agree to some default actions without having to interact. Interactive doctor can be disabled with `--interactive=false`.
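As a sketch of that flow (not dev's actual code; the flag names match the description above, while the doctor "action" and prompt wording are illustrative):

```go
// Minimal sketch of the confirm/autofix/interactive flow described above.
package main

import (
	"bufio"
	"flag"
	"fmt"
	"os"
	"strings"
)

func main() {
	autofix := flag.Bool("autofix", false, "agree to default actions without prompting")
	interactive := flag.Bool("interactive", true, "ask before taking actions")
	flag.Parse()

	action := "set up the Bazel cache" // stand-in for a real doctor fix
	switch {
	case *autofix:
		fmt.Println("applying:", action)
	case *interactive:
		fmt.Printf("doctor wants to %s; proceed? [y/N] ", action)
		line, _ := bufio.NewReader(os.Stdin).ReadString('\n')
		if strings.HasPrefix(strings.ToLower(strings.TrimSpace(line)), "y") {
			fmt.Println("applying:", action)
		} else {
			fmt.Println("skipped:", action)
		}
	default:
		// --interactive=false without --autofix: report only.
		fmt.Println("found an issue; re-run with --autofix or fix manually:", action)
	}
}
```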

Epic: none
Release note: None

93814: workload: fix TPC-H generator to generate correct values for p_name and ps_supplycost r=rytaft a=rytaft

**workload: fix TPC-H generator to generate correct values for p_name**

The TPC-H generator creates values for `p_name` in the `part` table by selecting 5 elements from a list of words called `randPartNames`. It is supposed to select 5 random unique elements from the full list, but prior to this commit, it only selected a permutation of the first 5 elements. This commit fixes the logic and adds a unit test.
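For reference, the fixed selection amounts to a partial Fisher–Yates shuffle; this is a sketch with an illustrative word list, not the workload package's actual code:

```go
// Sketch of drawing k unique random words from the whole list. The
// earlier bug effectively permuted only the first k elements.
package main

import (
	"fmt"
	"math/rand"
)

func randDistinctWords(rng *rand.Rand, words []string, k int) []string {
	w := append([]string(nil), words...) // copy; leave caller's slice intact
	for i := 0; i < k; i++ {
		// Partial Fisher–Yates: swap a random element from the
		// remaining tail w[i:] into position i.
		j := i + rng.Intn(len(w)-i)
		w[i], w[j] = w[j], w[i]
	}
	return w[:k]
}

func main() {
	rng := rand.New(rand.NewSource(1))
	words := []string{"almond", "antique", "aquamarine", "azure", "beige", "bisque"}
	fmt.Println(randDistinctWords(rng, words, 5)) // 5 distinct words
}
```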

Epic: None
Release note: None

**workload: fix TPC-H generator value of ps_supplycost**

Prior to this commit, the TPC-H generator was generating
a value between 0 and 10 for `ps_supplycost` in the `partsupp`
table. However, the spec calls for a random value in the range 
`[1.00 .. 1,000.00]`. This commit fixes the oversight.
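A sketch of the corrected generation, assuming the usual trick of drawing a whole number of cents (illustrative, not the workload code):

```go
// Draw an integer number of cents in [100, 100000] and scale, giving a
// uniform value in the spec's [1.00, 1000.00] range.
package main

import (
	"fmt"
	"math/rand"
)

func randSupplyCost(rng *rand.Rand) float64 {
	cents := 100 + rng.Intn(100000-100+1)
	return float64(cents) / 100.0
}

func main() {
	rng := rand.New(rand.NewSource(1))
	fmt.Println(randSupplyCost(rng)) // a value in [1.00, 1000.00]
}
```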

Epic: None
Release note: None

93852: build-and-publish-patched-go: upgrade from 1.19.1 to 1.19.4 r=ajwerner a=ajwerner

This picks up a mix of security and compiler fixes. The compiler fixes I'm interested in relate to generic data structures, but the other fixes seem worthwhile too.

* [x] Adjust the Pebble tests to run in new version.
* [x] Adjust version in Docker image ([source](./builder/Dockerfile)).
* [x] Adjust version in the TeamCity agent image ([setup script](./packer/teamcity-agent.sh))
* [x] Rebuild and push the Docker image (following [Basic Process](#basic-process))
* [x] Update `build/teamcity/internal/release/build-and-publish-patched-go/impl.sh` with the new version and adjust SHA256 sums as necessary.
* [x] Run the `Internal / Release / Build and Publish Patched Go` build configuration in TeamCity with your latest version of the script above. This will print out the new URLs and SHA256 sums for the patched Go that you built above.
* [x] Bump the version in `WORKSPACE` under `go_download_sdk`. You may need to bump [rules_go](https://github.com/bazelbuild/rules_go/releases). Also edit the filenames listed in `sdks` and update all the hashes to match what you built in the step above.
* [x] Run `./dev generate bazel` to refresh `distdir_files.bzl`, then `bazel fetch @distdir//:archives` to ensure you've updated all hashes to the correct value.
* [x] Bump the version in `builder.sh` accordingly ([source](./builder.sh#L6)).
* [x] Bump the version in `go-version-check.sh` ([source](./go-version-check.sh)), unless bumping to a new patch release.
* [x] Bump the go version in `go.mod`.
* [x] Bump the default installed version of Go in `bootstrap-debian.sh` ([source](./bootstrap/bootstrap-debian.sh)).
* [x] Replace other mentions of the older version of go (grep for `golang:<old_version>` and `go<old_version>`).
* [ ] Update the `builder.dockerImage` parameter in the TeamCity [`Cockroach`](https://teamcity.cockroachdb.com/admin/editProject.html?projectId=Cockroach&tab=projectParams) and [`Internal`](https://teamcity.cockroachdb.com/admin/editProject.html?projectId=Internal&tab=projectParams) projects.
* [ ] Ask the Developer Infrastructure team to deploy new TeamCity agent images according to [packer/README.md](./packer/README.md)


Epic: None

Release note: None

93914: ui: refresh nodes on tenants r=matthewtodd a=matthewtodd

Part of #89949

Now that we can show meaningful region information for tenants, we need to actually trigger the fetching of that information.

Release note: None

Co-authored-by: Oleg Afanasyev <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Matthew Todd <[email protected]>
craig bot pushed a commit that referenced this issue Dec 21, 2022
89613: gossip: remove frequent gossiping of gossip client connections r=irfansharif a=a-robinson

These gossip-clients keys make up two thirds or more of the gossip traffic in various large clusters I've inspected, consuming almost an entire CPU core in the worst case I've seen. They don't provide enough value to justify that sort of ongoing cost, so this commit removes them entirely as well as the periodic logging of the gossip network and the crdb_internal.gossip_network table, both of which relied on them.


Release note (backward-incompatible change): We've stopped
supporting/populating the crdb_internal.gossip_network table. It was an
internal table with no API guarantees (so perhaps not meriting a
release note?).

Release note (performance improvement): Significantly reduced CPU usage
of the underlying gossip network in large clusters.

---

Informs #51838 (largely fixes it for practical purposes, although there's likely still more that could be done)

This is clearly going to break [the gossip roachtest](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/gossip.go#L50-L58), but between `@irfansharif` kindly volunteering to fix that up separately and his existing TODO in that file I've left that out of this change.

I don't know if completely removing the gossip_network table is really the best idea or if it should just be left in and only populated with the clients from the local node. For example, when run in a mixed version cluster does `debug zip` run all of its sql commands against the local node or does it run some against remote nodes? If an old node ever tries to query the `gossip_network` table on a different node it could have a bad time.

`@irfansharif` `@kvoli` 

93985: ui: degrade gracefully when regions aren't known r=matthewtodd a=matthewtodd

Part of #89949

Previously, when a tenant SQL instance had spun down (leaving us no way to remember which region it had been in), the SQL Activity pages would claim that statements and transactions had occurred in an "undefined" region.

This change moves from saying "undefined" to saying nothing at all, a slightly nicer user experience.

This broader problem of losing the region mapping has been described in #93268; we'll begin addressing it shortly.

Release note: None

93989: leaktest: exclude long running logging goroutines r=srosenberg a=nicktrav

The `leaktest` package detects potential goroutine leaks by snapshotting the set of goroutines running when `leaktest.AfterTest(t)` is called, returning a closure, and comparing the set of goroutines when the closure is called (typically `defer`'d).
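Typical usage looks like this sketch (assuming the repo's `pkg/util/leaktest` import path):

```go
// AfterTest snapshots the running goroutines now and returns a closure;
// deferring the closure takes a second snapshot at test exit and fails
// the test on any difference.
package mypkg_test // illustrative package

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/util/leaktest"
)

func TestNoGoroutineLeak(t *testing.T) {
	defer leaktest.AfterTest(t)()
	// Test body: any goroutine spawned here must exit before the
	// deferred closure runs, or it is reported as leaked.
}
```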

A race condition was uncovered in #93849 whereby logging-related goroutines that are scheduled by an `init` function in `pkg/util/log` can sometimes be spawned _after_ the `AfterTest` function is run. When the test completes and the closure is run, the test fails due to a difference in the before / after goroutine snapshots.

This mode of failure is deemed a false positive. The intention is that the logging goroutines live for the duration of the process. However, exactly _when_ the goroutines scheduled in the `init` functions actually start running, and hence show up in the goroutine snapshots, is non-deterministic.

Exclude the logging goroutines from the `leaktest` checks to reduce the flakiness of tests.

Closes #93849.

Release note: None.

Epic: CRDB-20293

Co-authored-by: Alex Robinson <[email protected]>
Co-authored-by: Matthew Todd <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
adityamaru pushed a commit to adityamaru/cockroach that referenced this issue Dec 22, 2022
craig bot pushed a commit that referenced this issue Mar 6, 2023
95449: sqlstats: include regions in statement_statistics r=matthewtodd a=matthewtodd

Part of #89949.

This completes the work abandoned in #92259, properly storing the regions in which a statement was executed, rather than waiting until UI render time to try to map (dangerously ephemeral) node IDs to their respective regions.

With this work in place, we will be able to remove both the UI mapping code (usages of this [selector][]) and the stop-gap work of #93268.

[selector]: https://github.com/cockroachdb/cockroach/blob/0f6333cbd6c78264a59dc435324c0c33d75ea148/pkg/ui/workspaces/cluster-ui/src/store/nodes/nodes.selectors.ts#L52-L64

Release note (sql change): A regions field was added to the statistics column of crdb_internal.statement_statistics, reporting the regions of the nodes on which the statement was executed.
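A sketch of reading the new field from Go, assuming it appears in the statistics JSON at `statistics->'statistics'->'regions'` (the exact JSON path and connection string are assumptions for illustration; inspect the column to confirm):

```go
// Connects to a local insecure cluster and prints each statement
// fingerprint with its recorded regions.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(`
		SELECT metadata->>'query', statistics->'statistics'->'regions'
		FROM crdb_internal.statement_statistics
		LIMIT 10`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var query, regions sql.NullString
		if err := rows.Scan(&query, &regions); err != nil {
			log.Fatal(err)
		}
		fmt.Println(query.String, regions.String)
	}
}
```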

Co-authored-by: Matthew Todd <[email protected]>
craig bot pushed a commit that referenced this issue Apr 26, 2023
102192: sqlstats: only include local region in statement_statistics r=j82w a=matthewtodd

Part of #89949.

Addresses #98020.
Addresses #99563.

Related to cockroachdb/roachperf#129.
Related to #102170.

Previously, we attempted to record all the regions hit in a single statement execution in the sqlstats tables, leaning on the sqlAddressResolver to map traced nodeIDs to localities at execution time.

While the sqlAddressResolver is generally non-blocking, the introduction of this code did cause some of the multiregion "this query shouldn't span regions" tests to start [flaking][] and it's more recently been [implicated][] in a 2.5% performance regression.

Given that the probabilistic nature of the tracing meant that we generally weren't capturing all the relevant nodeIDs anyway, it seems like the most prudent thing to do here is take a step back and regroup.

In the short term, let's stop even trying to gather all these regions. In the medium/long term, let's see if we can find a better approach.

[flaking]: #98020
[implicated]: https://github.com/cockroachdb/roachperf/pull/129

Release note: None

Co-authored-by: Matthew Todd <[email protected]>