release-22.1: storage: skip TestPebbleMetricEventListener #99734

Conversation

@jbowens jbowens commented Mar 27, 2023

Informs #97843.
Release note: None.
Release justification: Non-production code changes.

jbowens and others added 30 commits March 20, 2023 18:13
This reduces the amount of work.

Informs cockroachdb#98904.
Epic: None
Release note: None
This patch allows tsdb to support tenant-separated timeseries metrics.
The changes can be summarized as:

(1) Changes in the write path:
In `pkg/status` we now store metrics corresponding to non-system tenants
with their tenant IDs. This makes the `source` component of the ts
key include both node IDs and tenant IDs:
```
.../[NodeID]-[TenantID]
```
Note: the system tenant does not include the tenant ID to allow for
backwards compatibility.

(2) Changes in the query path:
(2.1) as system tenant:
If the request to ts server comes annotated as system tenant, we
aggregate timeseries across all tenants in that metric name's keyspace.

(2.2) as secondary tenant:
When a secondary tenant queries the server, it is routed through the
tenant connector and checked for having the tenant capability for viewing
tsdb data. During this stage the request is annotated with the Tenant ID.
When the system tenant ts server receives this query with a Tenant ID set:
- If no sources are requested as part of the query, it will aggregate
  across all sources that match the tenant ID.
- If specific sources are specified along with the tenant ID, it will
  scan the keyspace for the NodeID and then filter out results that do
not match with the TenantID.

These changes ensure that the system tenant is able to have a view of
metrics across all tenants and application tenants have access to their
own metrics. These changes are all done entirely server-side so no
client changes are needed. The client will be automatically served the
timeseries data it has access to based on its tenant capability.

Fixes: cockroachdb#96438.

Release note (ui change): Timeseries metrics in DB Console will show
tenant-specific metrics. For the system tenant, these timeseries will be
aggregated across all tenants. For a secondary tenant, only metrics
belonging to that particular tenant will be shown.
Previously, both the stmt and txn fingerprints
pages were using the same api response from
/statements. This was because we need the stmts
response to build the txns pages. However, having to
fetch both types of stats can slow down the request.
Now that we can specify the fetch_mode as part of
the request, we can split the two into their own api
calls and redux fields.

Epic: none
Part of: cockroachdb#97875

Release note: None
This commit adds limit and sort fields to the combined
statements request. These params can be used to specify
how much data is returned and the priority at which to
return data (e.g. top-k). The current sort options are:
- service latency
- cpu time
- p99 (statements only)
- contention time
- execution count
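The limit+sort behavior above amounts to a server-side top-k over per-fingerprint aggregates. A minimal sketch, with hypothetical field names standing in for the real API fields:

```go
package main

import (
	"fmt"
	"sort"
)

// stmtStats holds per-fingerprint aggregates a request can sort on.
// (Field names are illustrative, not the actual API schema.)
type stmtStats struct {
	Fingerprint    string
	SvcLatMean     float64
	CPUTimeMean    float64
	ContentionMean float64
	ExecCount      int64
}

// topK returns the k statements with the largest value of the requested
// sort key, modeling the limit and sort params described above.
func topK(stats []stmtStats, key func(stmtStats) float64, k int) []stmtStats {
	sorted := append([]stmtStats(nil), stats...) // don't mutate the input
	sort.Slice(sorted, func(i, j int) bool { return key(sorted[i]) > key(sorted[j]) })
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}

func main() {
	stats := []stmtStats{
		{Fingerprint: "SELECT _", SvcLatMean: 0.8},
		{Fingerprint: "INSERT _", SvcLatMean: 2.5},
		{Fingerprint: "UPDATE _", SvcLatMean: 1.1},
	}
	top := topK(stats, func(s stmtStats) float64 { return s.SvcLatMean }, 2)
	fmt.Println(top[0].Fingerprint, top[1].Fingerprint) // INSERT _ UPDATE _
}
```

Each sort option (service latency, cpu time, contention time, execution count) maps to a different key function over the same aggregates.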

Epic: none
Part of: cockroachdb#97876
Part of: cockroachdb#97875

Release note: None
This commit changes the table used for the combined
stats endpoint from the view combining persisted and
in-memory stats to the view that is just a wrapper
around the system table. Thus, we are now reading
only persisted stats from the system table for the
combined stats endpoint.

This commit updates tests using the combined api to
flush stats before using the api.

Epic: None
Release note (ui change): On the SQL Activity fingerprints
pages, users will not see stats that have not yet been
flushed to disk.
This commit adds new knobs to the sql stats
fingerprint pages. Users can now specify a
limit and sort priority with their sql stats
fingerprints requests.

Closes: cockroachdb#97876
Part of: cockroachdb#97875

Release note: None
Previously, the calculation for the `% of runtime` column
for the sql activity pages was performed in the client app
by summing (execCount * avgSvcLat) over all statements
returned. This was okay when we were returning a large number
of stmts (20,000), but now that we have reduced that limit
significantly (default 100, with a max provided option of 500),
performing this calculation on the client side won't give us
the full picture of the total runtime of the workload in the
requested time interval.

This commit modifies the statements API to return the total
estimated runtime of the workload so that we can continue to
show the `% of runtime` column. We also now provide this as a
server sort option, so that users can request the top-k stmts or
txns by % of runtime.
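The runtime-share computation above is just execCount × avgSvcLat per fingerprint, divided by the server-reported workload total. A minimal sketch with illustrative names (the real API returns richer aggregates):

```go
package main

import "fmt"

// fingerprint holds the fields needed for the runtime share computation.
type fingerprint struct {
	Name      string
	ExecCount int64
	AvgSvcLat float64 // mean service latency, seconds
}

// runtime estimates total time spent in this fingerprint.
func runtime(f fingerprint) float64 { return float64(f.ExecCount) * f.AvgSvcLat }

// pctOfRuntime computes the fingerprint's share of totalRuntime, where
// totalRuntime now comes from the server rather than from summing the
// (truncated) client-side result set.
func pctOfRuntime(f fingerprint, totalRuntime float64) float64 {
	if totalRuntime == 0 {
		return 0
	}
	return 100 * runtime(f) / totalRuntime
}

func main() {
	f := fingerprint{Name: "SELECT _", ExecCount: 100, AvgSvcLat: 0.02}
	// Suppose the server reports 10 seconds of total workload runtime.
	fmt.Printf("%.0f%%\n", pctOfRuntime(f, 10)) // 20%
}
```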

Epic: none

Release note (ui change): Users can request top-k stmts by
% runtime in sql activity fingerprints pages.
Create a new Search Criteria component, and add it
to the Statements and Transactions pages.
This commit also updates the UI with the new
positions of the filters and reset sql stats.

Part of: cockroachdb#98891

Release note (ui change): Add Search Criteria to the Statements and
Transactions pages, and update the UX with improvements.
97789: docs: Pass in Jira tickets in the body of product change issues r=nickvigilante a=nickvigilante

Epic and issue references are now pointing to the associated Jira
tickets.

Fixes DOC-7019

Release note: None

98873: storage: add storage.ingest_as_flushable.enabled setting r=nicktrav a=jbowens

Add a cluster setting controlling whether to allow lazy ingestion of sstables that overlap with a memtable. This cluster setting defaults to enabling lazy ingestion, but is only respected by Pebble if the cluster version is ratcheted appropriately high.

Epic: None
Close cockroachdb#97194.
Release note: None

98874: opt: set udf CalledOnNullInput correctly for scalar UDFs r=DrewKimball a=DrewKimball

Previously, the `UDFPrivate.CalledOnNullInput` field would be set to true for scalar UDFs, since strict behavior is handled using a CASE statement. However, this is confusing and it is also correct to just set `CalledOnNullInput` to the value implied by the function's definition.

Informs CRDB-20535

Release note: None

Co-authored-by: Nick Vigilante <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Drew Kimball <[email protected]>
Fixes a hint for some instances of an "unknown function" error
that was not properly applied previously.

Informs: cockroachdb#99002
Epic: None

Release note: None
This adds two new system privileges:

- `REPLICATION`: Allows the user to call the internal functions that
produce a cross-cluster replication stream.

- `MANAGETENANT`: Allows the user to create and manage
  tenants.

A user with the MANAGETENANT privilege is now able to execute
the following statements:

- SHOW TENANT
- SHOW TENANTS
- CREATE TENANT
- CREATE TENANT FROM REPLICATION STREAM
- DROP TENANT (if it is part of an active stream)
- ALTER TENANT

A user with the REPLICATION privilege is able to call the following
functions:

- crdb_internal.start_replication_stream
- crdb_internal.replication_stream_progress
- crdb_internal.stream_partition
- crdb_internal.replication_stream_spec
- crdb_internal.complete_replication_stream

Fixes cockroachdb#95425

Release note: None
…db#99062

98353: kvserver: make some cluster settings system only r=andrewbaptist a=kvoli

Update cluster settings in the `kv/kvserver` pkg to be `SystemOnly`.
Previously, there were many cluster settings which were
`TenantWritable` or `TenantReadOnly`. These settings, even if altered
by a tenant, have no effect.

There are settings which are not updated, due to tests relying on
modifying the setting value using a non-system tenant. We ignore these
in this commit and defer to cockroachdb#98723 for handling them.

These settings are updated to be `SystemOnly`:

```
kv.bulk_io_write.max_rate
kv.bulk_sst.max_allowed_overage
kv.bulk_sst.target_size
kv.closed_timestamp.follower_reads_enabled
kv.log_range_and_node_events.enabled
kv.range_split.by_load_enabled
kv.range_split.load_cpu_threshold
kv.range_split.load_qps_threshold
kv.replica_stats.addsst_request_size_factor
kv.replication_reports.interval
server.shutdown.lease_transfer_wait
```

Resolves: cockroachdb#98347

Release note (ops change): Some KV server cluster settings are now system
only. These settings could previously be written or read by
tenants, however writing to these settings had no effect.

99037: kvserver: skip `TestStoreRangeSplitRaceUninitializedRHS` under race/deadlock r=erikgrinaker a=erikgrinaker

The Raft groups are unable to maintain quorum when stressed under race/deadlock.

Resolves cockroachdb#98840.

Epic: none
Release note: None

99052: sql: add `switchToAnotherPortal` signal for result consumer r=ZhouXing19 a=ZhouXing19

This PR is part of the effort to implement the multiple active portals. (Extracted from cockroachdb#96358)

---

### sql/settings: add sql.pgwire.multiple_active_portals.enabled cluster setting


This commit adds a non-public `sql.pgwire.multiple_active_portals.enabled`
setting. This setting is only for a PREVIEWABLE feature.
With it set to true, all non-internal portals with read-only queries without
sub/post queries can be paused and resumed in an interleaved manner, but are
executed with a local plan.

---
### sql: add switchToAnotherPortal signal for result consumer

Previously, after pushing the result to the consumer, we only accepted the
following actions as the next step:
1. Pushing more data from the same source to the same consumer;
2. Drain or close the pipeline.

This commit adds another option: pause the current execution and switch
to the execution of another portal. I.e. when receiving an `ExecPortal` cmd
for another portal, we do nothing and return control to the connExecutor.
This allows us to execute different portals in an interleaved manner.
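The decision described above can be modeled as a small state function: the consumer either keeps pushing, drains/closes, or hands control back when the incoming `ExecPortal` targets a different portal. The names below mirror the description but are not the exact CockroachDB types.

```go
package main

import "fmt"

// consumerSignal models the signals a result consumer can return to the
// connExecutor; switchToAnotherPortal is the new option.
type consumerSignal int

const (
	pushMoreData consumerSignal = iota
	drainOrClose
	switchToAnotherPortal
)

// nextSignal decides what the consumer does with an incoming command: an
// ExecPortal for a different portal pauses execution and returns control
// to the connExecutor instead of draining the pipeline.
func nextSignal(cmdPortal, currentPortal string, done bool) consumerSignal {
	switch {
	case done:
		return drainOrClose
	case cmdPortal != currentPortal:
		return switchToAnotherPortal
	default:
		return pushMoreData
	}
}

func main() {
	// An ExecPortal for portal p2 arrives while p1 is executing:
	fmt.Println(nextSignal("p2", "p1", false) == switchToAnotherPortal) // true
}
```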

Epic: CRDB-17622

Release note (sql change): add a non-public `sql.pgwire.multiple_active_portals.enabled` setting. This setting is only for a PREVIEWABLE feature. With it set to true, all non-internal portals with read-only queries without sub/post queries can be paused and resumed in an interleaved manner, but are executed with a local plan.


99062: sql: deflake `TestTrace` r=yuzefovich a=erikgrinaker

This has been seen to flake in CI:

```
=== RUN   TestTrace/ShowTraceForVectorized/TracingOff/node-1
    trace_test.go:386: expected span: "session recording", got: "pendingLeaseRequest: requesting lease"
    trace_test.go:397: remaining span: "session recording"
    trace_test.go:397: remaining span: "sql query"
    trace_test.go:397: remaining span: "sql txn"
    trace_test.go:397: remaining span: "txn coordinator send"
            --- FAIL: TestTrace/ShowTraceForVectorized/TracingOff/node-1 (0.02s)
```

There was already an exception for this span, but with a `storage.` prefix. This patch removes the prefix, and makes it match on substrings.

This flake has possibly been made worse with the introduction of a metamorphic setting to only use expiration-based leases in ecc931b.

Resolves cockroachdb#98971.

Epic: none
Release note: None

Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Jane Xing <[email protected]>
98833: kv: don't signal raft scheduler when already on scheduler goroutine r=erikgrinaker a=nvanbenschoten

Informs cockroachdb#96800.

This commit avoids unnecessary interactions with the Raft scheduler as a way to relieve pressure on the scheduler's mutex. In cockroachdb#96800 (comment), we found that calls into the scheduler in response to synchronously handled `MsgStorageApply`s were responsible for **17.8%** of raftScheduler enqueue calls. These enqueue calls were unnecessary because they were performed while already running on the raft scheduler, so the corresponding local messages were already going to be flushed and processed before the scheduler yielded. By recognizing when we are in these situations, we can omit the unnecessary scheduler interactions.

```
# benchmarking on AWS with instance stores
# see cockroachdb#96800 for why this has little to no effect on GCP or AWS with EBS

name                          old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=32   51.5k ± 2%   57.3k ± 3%  +11.18%  (p=0.008 n=5+5)

name                          old p50(ms)  new p50(ms)  delta
kv0/enc=false/nodes=3/cpu=32    2.40 ± 0%    2.10 ± 5%  -12.50%  (p=0.000 n=4+5)

name                          old p99(ms)  new p99(ms)  delta
kv0/enc=false/nodes=3/cpu=32    11.0 ± 0%     9.6 ± 4%  -12.36%  (p=0.000 n=4+5)
```

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Previously COPY would allow a wide range of syntax in the COPY
TO substatement. Now, like PG, we limit it to a few things.

PG grammar is:
```
PreparableStmt:
			SelectStmt
			| InsertStmt
			| UpdateStmt
			| DeleteStmt
			| MergeStmt
```

And now we do something similar. This prevents the wheels from coming
off when the RSG generates EXPLAINs in the substatement, for instance.

Release note: none
Epic: none
Release note: none.
Epic: none.
Release note: none.
Epic: none.
99031: ccl/multiregionccl: skip flaky regional_by_table test r=matthewtodd a=matthewtodd

Part of cockroachdb#98020.

Release note: None

Co-authored-by: Matthew Todd <[email protected]>
98993: jobs: enhance the InfoStorage API r=adityamaru a=knz

Needed for cockroachdb#98459.
Epic: CRDB-23559

98998: kvserver: unskip TestBehaviorDuringLeaseTransfer r=shralex a=shralex

Unskip this test; it's not failing locally. Previous TeamCity logs are no longer available.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-21501
Fixes: cockroachdb#91948
Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: shralex <[email protected]>
Before this change, we could encounter internal errors while attempting
to add result columns during a `CREATE OR REPLACE VIEW` if the number of
columns in the new view was less than the number of columns in the old
view. This led to an inconsistency with postgres, which would only
return the error `cannot drop columns from view`.

This PR moves the check comparing the number of columns before and after
the view replacement earlier, so that the correct error is returned.

Co-authored-by: [email protected]

Fixes: cockroachdb#99000
Epic: None

Release note (bug fix): Fixes an internal error that can occur when
`CREATE OR REPLACE VIEW` replaces a view with fewer columns and another
entity depended on the view.
98077: ts: add support for multitenant timeseries metrics r=aadityasondhi a=aadityasondhi

This patch allows tsdb to support tenant-separated timeseries metrics.
The changes can be summarized as:

(1) Changes in the write path:
In `pkg/status` we now store metrics corresponding to non-system tenants
with their tenant IDs. This makes the `source` component of the ts
key include both node IDs and tenant IDs:
```
.../[NodeID]-[TenantID]
```
Note: the system tenant does not include the tenant ID to allow for
backwards compatibility.

(2) Changes in the query path:
(2.1) as system tenant:
If the request to ts server comes annotated as system tenant, we
aggregate timeseries across all tenants in that metric name's keyspace.

(2.2) as secondary tenant:
When a secondary tenant queries the server, it is routed through the
tenant connector and checked for having the tenant capability for viewing
tsdb data. During this stage the request is annotated with the Tenant ID.
When the system tenant ts server receives this query with a Tenant ID set:
- If no sources are requested as part of the query, it will aggregate
  across all sources that match the tenant ID.
- If specific sources are specified along with the tenant ID, it will
  scan the keyspace for the NodeID and then filter out results that do
not match with the TenantID.

These changes ensure that the system tenant is able to have a view of
metrics across all tenants and application tenants have access to their
own metrics. These changes are all done entirely server-side so no
client changes are needed. The client will be automatically served the
timeseries data it has access to based on its tenant capability.

Fixes: cockroachdb#96438.

Release note (ui change): Timeseries metrics in DB Console will show
tenant-specific metrics. For the system tenant, these timeseries will be
aggregated across all tenants. For a secondary tenant, only metrics
belonging to that particular tenant will be shown.

98815: server, ui: add sort and limit to sql activity pages, switch sql activity pages to read only from system table r=maryliag a=xinhaoz

See individual commits.

DB-Console: https://www.loom.com/share/73248244baa54cdf80389d0eec788447
DB-Console Details pages: https://www.loom.com/share/2d74c6cf082f4dcfb5e7f690166c2baf

Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: maryliag <[email protected]>
98897: server: fix tenant auth for status server r=knz,THardy98 a=dhartunian

Previously, the authentication for gRPC endpoints that are exposed
via HTTP on the tenant was not implemented correctly. Because the HTTP
session is decoded into gRPC metadata, and that metadata was contained
in the Context object passed through to the Tenant Connector, the
username from the tenant could leak into the kv layer and be treated
as an authenticated username. If that username happened to match one
in the system tenant it would be accepted as valid.

Additionally, some endpoints were missing their authentication code.
This did not break functionality because a gRPC request without any
metadata is treated as an internal request with admin permissions.

*Warning*: If a request contains a validated username as part of gRPC
metadata and that metadata is preserved as the request is handed down
to the KV layer, it could be interpreted as a valid user on the system
tenant and cause an escalation of privileges.

This commit adds authentication to the HotRangesV2 endpoint and
SpanStats endpoints which were missing it, and contains tests that
ensure that the endpoints return errors when the user does not have
the correct permissions.
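The escalation risk in the warning above comes down to forwarding identity-bearing metadata across the tenant boundary. A simplified sketch of the defensive idea, using a plain map as a stand-in (real gRPC metadata is `metadata.MD`, and the key names here are illustrative):

```go
package main

import "fmt"

// forwardMetadata returns the metadata that may safely accompany a request
// forwarded from a tenant toward the KV layer. Identity-bearing keys are
// dropped so a username validated on the tenant can never be mistaken for
// an authenticated system-tenant user downstream.
func forwardMetadata(md map[string]string) map[string]string {
	out := make(map[string]string, len(md))
	for k, v := range md {
		if k == "websessionuser" || k == "websessionid" {
			continue // never forward tenant-side identity downstream
		}
		out[k] = v
	}
	return out
}

func main() {
	md := map[string]string{"websessionuser": "alice", "trace-id": "abc"}
	fmt.Println(forwardMetadata(md)) // map[trace-id:abc]
}
```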

Epic: CRDB-12100

Release note: None

98958: sql,backupccl: set system table user ID columns to be NOT NULL r=rafiss,stevendanna a=andyyang890

This PR sets the user ID columns in system tables to be NOT NULL
and when applicable, updates the `RESTORE` logic to account for
the case where a backup may have been created before the user ID
column was added.

Part of cockroachdb#87079

Release note: None

Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Andy Yang <[email protected]>
This patch adds a missing case for the public role in the backfilling
portion of the system.privileges user ID migration.

Release note: None
This patch adds a missing case for the empty role in the backfilling
portion of the system.database_role_settings user ID migration.

Release note: None
98854: kvserver: shard Raft scheduler r=erikgrinaker a=erikgrinaker

The Raft scheduler mutex can become very contended on machines with many cores and high range counts. This patch shards the scheduler by allocating ranges and workers to individual scheduler shards.

By default, we create a new shard for every 16 workers, and distribute workers evenly. We spin up 8 workers per CPU core, capped at 96, so 16 is equivalent to 2 CPUs per shard, or a maximum of 6 shards. This significantly relieves contention at high core counts, while also avoiding starvation by excessive sharding. The shard size can be adjusted via `COCKROACH_SCHEDULER_SHARD_SIZE`.
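The sizing arithmetic above (8 workers per CPU capped at 96, a shard per 16 workers, so at most 6 shards) can be sketched directly. `shardFor` is an illustrative stable assignment; the actual distribution of ranges to shards may differ.

```go
package main

import "fmt"

// numShards computes how many scheduler shards to create: one shard per
// shardSize workers, rounded up, always at least one.
func numShards(workers, shardSize int) int {
	n := (workers + shardSize - 1) / shardSize
	if n < 1 {
		n = 1
	}
	return n
}

// shardFor assigns a range to a shard; a simple modulus keeps each range
// on a stable shard so its work contends on only one shard mutex.
func shardFor(rangeID int64, shards int) int {
	return int(rangeID % int64(shards))
}

func main() {
	fmt.Println(numShards(96, 16)) // 6 shards at the 96-worker cap
	fmt.Println(numShards(24, 16)) // 2 shards on a smaller node
	fmt.Println(shardFor(42, 6))
}
```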

This results in a substantial performance improvement on high-CPU nodes:

```
name                                    old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=4              7.71k ± 5%   7.93k ± 4%     ~     (p=0.310 n=5+5)
kv0/enc=false/nodes=3/cpu=8              15.6k ± 3%   14.8k ± 7%     ~     (p=0.095 n=5+5)
kv0/enc=false/nodes=3/cpu=32             43.4k ± 2%   45.0k ± 3%   +3.73%  (p=0.032 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws   40.5k ± 2%   61.7k ± 1%  +52.53%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce   35.6k ± 4%   44.5k ± 0%  +24.99%  (p=0.008 n=5+5)

name                                    old p50      new p50      delta
kv0/enc=false/nodes=3/cpu=4               21.2 ± 6%    20.6 ± 3%     ~     (p=0.397 n=5+5)
kv0/enc=false/nodes=3/cpu=8               10.5 ± 0%    11.2 ± 8%     ~     (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=32              4.16 ± 1%    4.00 ± 5%     ~     (p=0.143 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws    3.00 ± 0%    2.00 ± 0%     ~     (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    4.70 ± 0%    4.10 ± 0%  -12.77%  (p=0.000 n=5+4)

name                                    old p95      new p95      delta
kv0/enc=false/nodes=3/cpu=4               61.6 ± 5%    60.8 ± 3%     ~     (p=0.762 n=5+5)
kv0/enc=false/nodes=3/cpu=8               28.3 ± 4%    30.4 ± 0%   +7.34%  (p=0.016 n=5+4)
kv0/enc=false/nodes=3/cpu=32              7.98 ± 2%    7.60 ± 0%   -4.76%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws    12.3 ± 2%     6.8 ± 0%  -44.72%  (p=0.000 n=5+4)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    10.2 ± 3%     8.9 ± 0%  -12.75%  (p=0.000 n=5+4)

name                                    old p99      new p99      delta
kv0/enc=false/nodes=3/cpu=4               89.8 ± 7%    88.9 ± 6%     ~     (p=0.921 n=5+5)
kv0/enc=false/nodes=3/cpu=8               46.1 ± 0%    48.6 ± 5%   +5.47%  (p=0.048 n=5+5)
kv0/enc=false/nodes=3/cpu=32              11.5 ± 0%    11.0 ± 0%   -4.35%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws    14.0 ± 3%    12.1 ± 0%  -13.32%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    14.3 ± 3%    13.1 ± 0%   -8.55%  (p=0.000 n=4+5)

```

The basic cost of enqueueing ranges in the scheduler (without workers or contention) only increases slightly in absolute terms, thanks to `raftSchedulerBatch` pre-sharding the enqueued ranges:

```
name                                                                 old time/op  new time/op  delta
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=8-24         457ns ± 2%   564ns ± 2%   +23.36%  (p=0.001 n=7+7)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=16-24        461ns ± 3%   563ns ± 2%   +22.14%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=32-24        459ns ± 2%   591ns ± 2%   +28.63%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=64-24        455ns ± 0%   776ns ± 5%   +70.60%  (p=0.001 n=6+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=128-24       456ns ± 2%  1058ns ± 1%  +132.13%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=8-24    7.15ms ± 1%  8.18ms ± 1%   +14.33%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=16-24   7.13ms ± 1%  8.18ms ± 1%   +14.77%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=32-24   7.12ms ± 2%  7.86ms ± 1%   +10.30%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=64-24   7.20ms ± 1%  7.11ms ± 1%    -1.27%  (p=0.001 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=128-24  7.12ms ± 2%  7.16ms ± 3%      ~     (p=0.721 n=8+8)
```

Furthermore, on an idle 24-core 3-node cluster with 50,000 unquiesced ranges, this reduced CPU usage from 12% to 10%.

Resolves cockroachdb#98811.
Touches cockroachdb#96800.

Hat tip to `@nvanbenschoten` for the initial prototype.

Epic: none
Release note (performance improvement): The Raft scheduler is now sharded to relieve contention during range Raft processing, which can significantly improve performance at high CPU core counts.

Co-authored-by: Erik Grinaker <[email protected]>
98225: ui: show normalized CPU Usage metric on Node Map r=koorosh a=koorosh

Before, the Node Map (on the Overview page) displayed current system and user CPU usage
that didn't represent the same data as the CPU Percent metric on the Metrics page.
Now, the Node Map displays the same metric to provide users consistent information.

Release note (admin ui, bug fix): show normalized CPU usage on the Node Map.
Resolves: cockroachdb#87664

98985: storage: scan local keyspace in `TestMVCCHistories` r=erikgrinaker a=erikgrinaker

This patch processes the local keyspace along with the user keyspace in `TestMVCCHistories`, which is useful for MVCC stats tests of system keys (e.g. `SysBytes`). This was motivated by tracking down an observed discrepancy in `SysBytes`, but that hasn't borne fruit yet.

Touches cockroachdb#93896.

Epic: none
Release note: None

99050: ccl/sqlproxyccl/acl: fix TestParsingErrorHandling flake r=adityamaru,jaylim-crl a=pjtatlow

TestParsingErrorHandling asserts that the error count metric is updated
correctly when reading a file fails or succeeds, and the files are checked
at a regular interval. For the tests, that interval is set to 100ms, and we
waited 200ms to ensure the metric would be updated, but that seems not to be
reliable. This change increases the wait to 500ms, which should ensure the
file is re-read before we check the value of the error metrics.

Fixes cockroachdb#98839

Co-authored-by: Andrii Vorobiov <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: PJ Tatlow <[email protected]>
This code change adds a `binary_version` column to the instances table.

This is done by adding the column to the bootstrap schema for system.sql_instances,
and piggy-backing on the existing code for the V23_1_SystemRbrReadNew migration
that overwrites the live schema for this table by the bootstrap copy.

This redefinition of the meaning of the V23_1_SystemRbrReadNew is backward-incompatible
and is only possible because this commit is merged before the v23.1 branch is cut.

Release note: None
Epic: CRDB-20860
99051: admin: statement diagnostics uses correct auth helpers r=knz a=dhartunian

Previously, the statement diagnostics HTTP handler was initialized using the incorrect `ctx` value. This caused the HTTP request context to not be correctly handed down to the handler. Furthermore, the call to `userFromIncomingRPCContext` was incorrect in this scenario since it relied on a gRPC context being populated with HTTP session information, which did not exist. That code sets a `root` user when context is missing because the gRPC handlers are used for inter-node communication *and* HTTP with the DB console and external tools.

This commit attaches the request context to the diagnostics bundle handler correctly, and amends the authorization code to use `userFromHTTPAuthInfoContext` which correctly reads the session cookie info from the request (like many `api/v2` handlers do since those exist outside the gRPC infrastructure).

Resolves cockroachdb#99049

Epic: None

Release note (security update): Previously, users could gain unauthorized access to statement diagnostic bundles they did not create if they requested the bundle through an HTTP request to `/_admin/v1/stmtbundle/<id>` and correctly guessed its (non-secret) ID.  This change locks down this endpoint behind the usual SQL gating that correctly uses the SQL user in the HTTP session as identified by their cookie.

Co-authored-by: David Hartunian <[email protected]>
98906: backupccl: move openSSTs call into the restore workers r=lidorcarmel a=lidorcarmel

Currently we see openSSTs being a bottleneck during restore when using more than 4 workers.

This patch moves the openSSTs call into the worker itself, so that this work can be parallelized. This is needed for later PR which will increase the number of workers.

Also, this change simplifies the code a bit and makes it easier to implement cockroachdb#93324, because in that PR we want to produce a few partial SSTs that need to be processed serially. Before this patch it wasn't trivial to make sure that the N workers will not process those partial SSTs in the wrong order, and with this patch each worker will process a single mergedSST, and therefore can serialize the partial SSTs created from that mergedSST.

Tested by running a roachtest (4 nodes, 8 cores) with and without this change. The fixed version was faster: 80MB/s/node vs 60, but some of that is noise; we do expect a perf improvement once more workers are used and other params tuned, which is the next step.

Informs: cockroachdb#98015
Epic: CRDB-20916

Release note: None

Co-authored-by: Lidor Carmel <[email protected]>
This change removes the following log spam:
```
could not run claimed job 102: no resumer is available for AUTO CONFIG RUNNER
```

Release note: None
@jbowens jbowens requested a review from a team as a code owner March 27, 2023 21:32
@jbowens jbowens requested a review from bananabrick March 27, 2023 21:32
@cockroach-teamcity

This change is Reviewable

@jbowens jbowens changed the title storage: skip TestPebbleMetricEventListener release-22.1: storage: skip TestPebbleMetricEventListener Mar 27, 2023
@jbowens jbowens changed the base branch from master to release-22.1 March 27, 2023 21:34
@jbowens jbowens requested a review from a team as a code owner March 27, 2023 21:34
@jbowens jbowens requested a review from a team March 27, 2023 21:34
@jbowens jbowens requested review from a team as code owners March 27, 2023 21:34
@jbowens jbowens requested review from a team March 27, 2023 21:34
@jbowens jbowens requested review from a team as code owners March 27, 2023 21:34
@jbowens jbowens requested a review from a team March 27, 2023 21:34
@jbowens jbowens requested review from a team as code owners March 27, 2023 21:34
@jbowens jbowens requested review from mgartner, rhu713 and samiskin and removed request for a team March 27, 2023 21:34
@jbowens jbowens closed this Mar 27, 2023
@jbowens jbowens deleted the skip-test-TestPebbleMetricEventListener branch March 27, 2023 21:34
@jbowens jbowens commented Mar 27, 2023

Sorry for the pings; disastrous skip-test with the wrong base.
