Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtests: introduce admission-control/elastic-backup #89482

Merged
merged 1 commit into from
Oct 6, 2022

Conversation

irfansharif
Copy link
Contributor

Informs #89208. This test sets up a 3-node CRDB cluster on 8vCPU machines running 1000-warehouse TPC-C with an aggressive (every 20m) full backup schedule. We've observed latency spikes during backups because of its CPU-heavy nature -- it can elevate CPU scheduling latencies which in turn translates to an increase in foreground latency. In #86638 we introduced admission control mechanisms to dynamically pace such work while maintaining acceptable CPU scheduling latencies (sub millisecond p99s). This roachtest exercises that machinery. In future commits we'll add libraries to the roachtest package to automatically spit out the degree to which {CPU-scheduler,foreground} latencies are protected.

Release note: None

Informs cockroachdb#89208. This test sets up a 3-node CRDB cluster on 8vCPU
machines running 1000-warehouse TPC-C with an aggressive (every 20m)
full backup schedule. We've observed latency spikes during backups
because of its CPU-heavy nature -- it can elevate CPU scheduling
latencies which in turn translates to an increase in foreground latency.
In cockroachdb#86638 we introduced admission control mechanisms to dynamically pace
such work while maintaining acceptable CPU scheduling latencies (sub
millisecond p99s). This roachtest exercises that machinery. In future
commits we'll add libraries to the roachtest package to automatically
spit out the degree to which {CPU-scheduler,foreground} latencies are
protected.

Release note: None
@irfansharif irfansharif requested a review from a team October 6, 2022 12:28
@irfansharif irfansharif requested a review from a team as a code owner October 6, 2022 12:28
@irfansharif irfansharif requested review from renatolabs and removed request for a team October 6, 2022 12:28
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@irfansharif
Copy link
Contributor Author

12:42:25 util.go:40: waiting for initial up-replication... (<2m0s)
12:42:25 util.go:59: still waiting for full replication (52 ranges left)
12:42:26 util.go:59: still waiting for full replication (42 ranges left)
...
12:43:08 util.go:59: still waiting for full replication (2 ranges left)
12:43:09 util.go:59: still waiting for full replication (1 ranges left)
12:43:10 util.go:55: up-replication complete
12:43:10 tpcc.go:175: test status: loading fixture (<10m0s)
12:49:42 tpcc.go:191: test status: finished tpc-c setup
12:49:42 tpcc.go:254: test worker status: running tpcc worker=0 warehouses=1000 ramp=5m0s duration=1h30m0s on {pgurl:1-3} (<1m0s)
12:49:42 admission_control_elastic_backup.go:90: test status: during: enabling admission control (<30s)
12:49:42 admission_control_elastic_backup.go:95: test status: during: creating full backup schedule to run every 20m (<1m0s)
12:49:42 admission_control_elastic_backup.go:104: test status: during: waiting for workload to finish (<1h30m0s)
14:24:50 test_runner.go:942: tearing down after success; see teardown.log

Here's a prometheus dump of an experiment run. To see it, run the contained ./prometheus-docker-run.sh and point to localhost:9090 through your grafana instance (brew install grafana; brew services start grafana + localhost:3000 if you're on macs). http://go.crdb.dev/p/backup-admission-control-grafana is the set of dashboards I've been using.

elastic-backup.zip

image

Copy link
Collaborator

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 7 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @renatolabs)


pkg/cmd/roachtest/tests/admission_control_snapshot_overload.go line 92 at r1 (raw file):

			t.Status(fmt.Sprintf("setting up prometheus/grafana (<%s)", 2*time.Minute))
			{

Nit: Remove this change from this PR, or look at my other PR which pulls this into admission_control.go.


pkg/cmd/roachtest/tests/util.go line 121 at r1 (raw file):

		"admission.sql_kv_response.enabled",
		"admission.sql_sql_response.enabled",
		"admission.elastic_cpu.enabled",

Its not clear that we want to enable this in all tests where admission control is enabled. Calling either true or false for this method does not result in the same as not setting anything. I'm concerned that users of this method won't do the right thing.

Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the speedy review!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @renatolabs)


pkg/cmd/roachtest/tests/admission_control_snapshot_overload.go line 92 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

Nit: Remove this change from this PR, or look at my other PR which pulls this into admission_control.go.

If you notice I'm using a different grafana dashboard for the backup roachtest vs. the snapshot one. Quickly skimming through #89324, I think instead of a shared constructor for prometheus.Config (which itself is a constructor param for setting up prometheus), perhaps we ought to add some smarts to setupPrometheusForRoachtest or create a "default prometheus.Config" that assumes: - last-node-houses-prom/workload-generator

  • every-other-node-runs-crdb, and
  • every-other-node-should-run-node-exporter

And then tests can use that + annotate it with their custom dashboards. I'll keep this patch as is and leave the refactoring to a future commit/#89324.


pkg/cmd/roachtest/tests/util.go line 121 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

Its not clear that we want to enable this in all tests where admission control is enabled. Calling either true or false for this method does not result in the same as not setting anything. I'm concerned that users of this method won't do the right thing.

The only uses of this are tests we already own or shall soon enough 😛
The current uses look ok to me.

@craig
Copy link
Contributor

craig bot commented Oct 6, 2022

Build succeeded:

@craig craig bot merged commit 516293f into cockroachdb:master Oct 6, 2022
@irfansharif irfansharif deleted the 221002.export-ac-roachtest branch October 6, 2022 21:12
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 17, 2022
Fixes cockroachdb#89600. This test makes use of `workload querybench`, which is
only available in the `workload` binary. This test started failing after
\cockroachdb#89482 where we (attempted) to replace all uses of `./workload` with
`./cockroach workload`.

Release note: None
craig bot pushed a commit that referenced this pull request Oct 17, 2022
89813: kvserver: remove legacy snapshot and diff code r=erikgrinaker a=pavelkalinnikov

This commit removes the proto and code for `RaftDataSnapshot`. This type was used for reporting replica snapshots and computing diffs, but now this functionality has been removed in favor of storage checkpoints and offline tooling.

Part of #21128
Epic: none

Release note: None

90006: roachtest: Introduce a test to overwhelm nodes r=irfansharif a=andrewbaptist

This test is here to check behavior of the system as the SQL load greatly exceeds what the nodes are able to handle. In the future we want to evaulate how this is handled, but today this will cause nodes to OOM.

Informs #89142.

Release note: None

90063: roachtest: de-flake admission-control/tpcc-olap r=irfansharif a=irfansharif

Fixes #89600. This test makes use of `workload querybench`, which is only available in the `workload` binary. This test started failing after \#89482 where we (attempted) to replace all uses of `./workload` with `./cockroach workload`.

Release note: None

Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants