Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: information returned by SpanStats should be authoritative #103957

Closed
zachlite opened this issue May 26, 2023 · 0 comments
Closed

server: information returned by SpanStats should be authoritative #103957

zachlite opened this issue May 26, 2023 · 0 comments
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@zachlite
Copy link
Contributor

zachlite commented May 26, 2023

The work done to improve server.SpanStats in #96223 and #101378 uses instances of kvcoord.RangeIterator so that each node may collect range descriptors relevant to the request span(s).

The problem is that kvcoord.RangeIterator uses kvcoord.DistSender's caching infrastructure to find these descriptors, and the state of the cache on each node may be different at any given time. Although the caches will eventually be consistent, the risk of inconsistency is not acceptable for this endpoint. This has come to light in #103128 where server.SpanStats has become a dependency of SHOW RANGES, which must be authoritative.

Some additional points from @knz:

  • the “new” code for the stats-for-span function is deeply embedded in the obs API server, which is architecturally unsound since the functionality is a pure KV function and should be usable from SQL without interfacing with the API server at all.
  • we want to have a version of this code that is guaranteed to operate on a consistent view of the entire span (i.e. not a mix of cached data for part of the span, and uncached for another part)
  • we want the API to be able to reuse an existing client.Txn object (instead of synthetisizing its own)

This issue revisits CRDB-22711 and CRDB-17463

Jira issue: CRDB-28270
Epic: CRDB-24928

@zachlite zachlite added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-cluster-observability labels May 26, 2023
@zachlite zachlite self-assigned this May 26, 2023
zachlite pushed a commit to zachlite/cockroach that referenced this issue Jun 6, 2023
This commit guarantees that SpanStats uses up-to-date range
descriptors on all nodes. The dependency on DistSender is removed
and is replaced with a dependency on ScanMetaKVs.

ScanMetaKVs is used to:
1) Locate nodes that house replicas of a span to avoid cluster-wide fan-outs.
2) Find Range Descriptors that touch a span, to build a SpanStatsResponse.

This commit also fixes cockroachdb#103809, where a SpanStatsResponse incorrectly returned
the replica_count for a span, instead of the range_count.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-24928
Issue: cockroachdb#103957
Release note (bug fix): SpanStats is no longer subject to stale
information, and should be considered authoritative.
zachlite pushed a commit to zachlite/cockroach that referenced this issue Jun 9, 2023
This commit guarantees that SpanStats uses up-to-date range
descriptors on all nodes. The dependency on DistSender is removed
and is replaced with a dependency on rangedesc.Scanner.

rangedesc.Scanner is used to:
1) Locate nodes that house replicas of a span to avoid cluster-wide fan-outs.
2) Find Range Descriptors that touch a span, to build a SpanStatsResponse.

This commit also fixes cockroachdb#103809, where a SpanStatsResponse incorrectly returned
the replica_count for a span, instead of the range_count.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-24928
Issue: cockroachdb#103957
Release note (bug fix): SpanStats is no longer subject to stale
information, and should be considered authoritative.
zachlite pushed a commit to zachlite/cockroach that referenced this issue Jun 15, 2023
This commit guarantees that SpanStats uses up-to-date range
descriptors on all nodes. The dependency on DistSender is removed
and is replaced with a dependency on rangedesc.Scanner.

rangedesc.Scanner is used to:
1) Locate nodes that house replicas of a span to avoid cluster-wide fan-outs.
2) Find Range Descriptors that touch a span, to build a SpanStatsResponse.

This commit also fixes cockroachdb#103809, where a SpanStatsResponse incorrectly returned
the replica_count for a span, instead of the range_count.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-24928
Issue: cockroachdb#103957
Release note (bug fix): SpanStats is no longer subject to stale
information, and should be considered authoritative.
craig bot pushed a commit that referenced this issue Jun 15, 2023
103945: jobs: add a Profiler tab to the job details page r=maryliag a=adityamaru

This change adds a Profiler tab to the job details page. This change also adds a row that allows collection of a cluster-wide CPU profile for 5 seconds, of all the samples corresponding to the job's execution.

Fixes: #102735

Release note (ui change): job details page now has a profiler tab for more advanced observability into a job's execution. Currently, we support collecting a cluster-wide CPU profile of the job.

104423: server: make SpanStats authoritative r=zachlite a=zachlite

This commit guarantees that SpanStats uses up-to-date range
descriptors on all nodes. The dependency on DistSender is removed
and is replaced with a dependency on rangedesc.Scanner.

rangedesc.Scanner is used to:
1) Locate nodes that house replicas of a span to avoid cluster-wide fan-outs.
2) Find Range Descriptors that touch a span, to build a SpanStatsResponse.

This commit also fixes #103809, where a SpanStatsResponse incorrectly returned
the replica_count for a span, instead of the range_count.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-24928
Issue: #103957
Release note (bug fix): SpanStats is no longer subject to stale
information, and should be considered authoritative.

104933: roachtest: fix regressions introduced by metamorphic arm64 and fips c… r=yuzefovich,renatolabs a=srosenberg

…hanges

There were two regressions introduced in [1]. This PR resolves them in a quick way. Subsequent PRs will address cpu architecture validation for attaching to pre-existing clusters, as well as enable benchmarks to run on arm64.

When using "local mode", don't fall back to amd64 if a test happens to be a benchmark. When attaching to a pre-existing cluster, pretend that its (cpu) architecture is the same as the one that was chosen for the given test.

[1] #103710

Epic: none
Fixes: #104678

Release note: None

104982: go.mod: bump Pebble to e9a280333ef7 r=jbowens a=jbowens

```
e9a28033 db: tighten missized tombstone metric count
17e0ab3c sharedcache: use a pool of workers for writes
3726f551 db: don't use internal types in ScanInternal
2fed2e71 sharedcache: refactor block math
```

Epic: none
Release note: None

104986: changefeedccl: add version gate to `SHOW CHANGEFEED JOBS` r=Xiang-Gu a=jayshrivastava

On 23.1 binaries, `SHOW CHANGEFEED JOBS` queries `crdb_internal.system_jobs`, which was added in 23.1. In 22.2-23.1 mixed version clusters, this query fails.

This change adds a version gate such that before an upgrade to 23.1 is finalized, `SHOW CHANGEFEED JOBS` will query `system.jobs` instead.

This change also adds regression testing in the `changefeed` logic test, ensuring that it runs in a mixed version environment.

Release note (bug fix): `SHOW CHANGEFEED JOBS` no
longer fails on 22.2, 23.1 mixed version clusters.

Epic: None

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Zach Lite <[email protected]>
Co-authored-by: Stan Rosenberg <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Jun 15, 2023
This commit guarantees that SpanStats uses up-to-date range
descriptors on all nodes. The dependency on DistSender is removed
and is replaced with a dependency on rangedesc.Scanner.

rangedesc.Scanner is used to:
1) Locate nodes that house replicas of a span to avoid cluster-wide fan-outs.
2) Find Range Descriptors that touch a span, to build a SpanStatsResponse.

This commit also fixes #103809, where a SpanStatsResponse incorrectly returned
the replica_count for a span, instead of the range_count.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-24928
Issue: #103957
Release note (bug fix): SpanStats is no longer subject to stale
information, and should be considered authoritative.
zachlite pushed a commit to zachlite/cockroach that referenced this issue Jun 20, 2023
This commit guarantees that SpanStats uses up-to-date range
descriptors on all nodes. The dependency on DistSender is removed
and is replaced with a dependency on rangedesc.Scanner.

rangedesc.Scanner is used to:
1) Locate nodes that house replicas of a span to avoid cluster-wide fan-outs.
2) Find Range Descriptors that touch a span, to build a SpanStatsResponse.

This commit also fixes cockroachdb#103809, where a SpanStatsResponse incorrectly returned
the replica_count for a span, instead of the range_count.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-24928
Issue: cockroachdb#103957
Release note (bug fix): SpanStats is no longer subject to stale
information, and should be considered authoritative.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

No branches or pull requests

1 participant