Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: introduce admission-control/elastic-cdc #89656

Merged
merged 1 commit into from
Oct 17, 2022

Conversation

irfansharif
Copy link
Contributor

@irfansharif irfansharif commented Oct 10, 2022

Part of #89208. This test sets up a 3-node CRDB cluster on 8vCPU machines running 1000-warehouse TPC-C, and kicks off a few changefeed backfills concurrently. We've observed latency spikes during backfills because of its CPU/scan-heavy nature -- it can elevate CPU scheduling latencies which in turn translates to an increase in foreground latency.

Also in this commit: routing std{err,out} from prometheus/grafana setup that roachtests do to the logger in scope.

Release note: None

@irfansharif irfansharif requested a review from a team October 10, 2022 14:26
@irfansharif irfansharif requested a review from a team as a code owner October 10, 2022 14:26
@irfansharif irfansharif requested review from smg260 and removed request for a team October 10, 2022 14:26
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@irfansharif irfansharif removed the request for review from smg260 October 10, 2022 14:26
@irfansharif
Copy link
Contributor Author

05:02:26 util.go:55: up-replication complete
05:02:26 tpcc.go:175: test status: loading fixture (<10m0s)
05:08:53 tpcc.go:191: test status: finished tpc-c setup
05:08:53 tpcc.go:255: test worker status: running tpcc worker=0 warehouses=1000 ramp=5m0s duration=1h30m0s on {pgurl:1-3} (<1m0s)
05:08:53 admission_control_elastic_cdc.go:84: test status: during: cluster settings (<30s)
05:08:53 admission_control_elastic_cdc.go:94: test status: during: setting performance baseline (<5m0s)
05:13:53 admission_control_elastic_cdc.go:129: test status: during: waiting for workload to finish (<1h30m0s)
05:13:53 admission_control_elastic_cdc.go:113: test status: during: creating changefeed 0 (<1m0s)
05:18:53 admission_control_elastic_cdc.go:113: test status: during: creating changefeed 1 (<1m0s)
05:23:53 admission_control_elastic_cdc.go:113: test status: during: creating changefeed 2 (<1m0s)
05:28:53 admission_control_elastic_cdc.go:113: test status: during: creating changefeed 3 (<1m0s)
05:33:53 admission_control_elastic_cdc.go:113: test status: during: creating changefeed 4 (<1m0s)
06:44:02 test_runner.go:942: tearing down after success; see teardown.log

Here's a prometheus dump of an experiment run. To see it, run the contained ./prometheus-docker-run.sh and point to localhost:9090 through your grafana instance (brew install grafana; brew services start grafana + localhost:3000 if you're on macs). http://go.crdb.dev/p/changefeed-admission-control-grafana is the set of dashboards I've been using.

elastic-cdc.zip

image

image

@irfansharif irfansharif force-pushed the 221007.rangefeed-ac-test branch 2 times, most recently from a3f7ea1 to 36d5dd8 Compare October 11, 2022 00:21
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 11, 2022
Part of cockroachdb#65957.

Changefeed backfills, given their scan-heavy nature, can be fairly
CPU-intensive. In cockroachdb#89656 we introduced a roachtest demonstrating the
latency impact backfills can have on a moderately CPU-saturated cluster.
Similar to what we saw for backups, this CPU heavy nature can elevate Go
scheduling latencies which in turn translates to foreground latency
impact. This commit integrates rangefeed catchup scan with the elastic
CPU limiter we introduced in cockroachdb#86638; this is one of two optional halves
of changefeed backfills. The second half is the initial scan -- scan
requests issued over some keyspan as of some timestamp. For that we
simply rely on the existing slots mechanism but now setting a lower
priority bit (BulkNormalPri). Experimentally we observed that during
initial scans the encoding routines in changefeedccl are the most
impactful CPU-wise, something cockroachdb#89589 can help with. We leave admission
integration of parallel worker goroutines to future work.

Unlike export requests rangefeed catchup scans are non-premptible. The
rangefeed RPC is a streaming one, and the catchup scan is done during
stream setup. So we don't have resumption tokens to propagate up to the
caller like we did for backups. We still want CPU-bound work going
through admission control to only use 100ms of CPU time, to exercise
better control over scheduling latencies. To that end, we introduce the
following component used within the rangefeed catchup iterator.

    // Pacer is used in tight loops (CPU-bound) for non-premptible
    // elastic work. Callers are expected to invoke Pace() every loop
    // iteration and Close() once done. Internally this type integrates
    // with elastic CPU work queue, acquiring tokens for the CPU work
    // being done, and blocking if tokens are unavailable. This allows
    // for a form of cooperative scheduling with elastic CPU granters.
    type Pacer struct
    func (p *Pacer) Pace(ctx context.Context) error { ... }
    func (p *Pacer) Close() { ... }

Release note: None
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! minor comments.
:lgtm:

Reviewed 5 of 5 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/cmd/roachtest/tests/admission_control_elastic_cdc.go line 106 at r1 (raw file):

					time.Sleep(padDuration)

					stopFeeds(db)

Is this stopping feeds before starting the experiment below since there may be some leftover from before?


pkg/cmd/roachtest/tests/admission_control_elastic_cdc.go line 117 at r1 (raw file):

							t.Status(fmt.Sprintf("during: round %d: stopping extant changefeeds (<%s)", i, stopFeedsDuration))
							stopFeeds(db)
							time.Sleep(stopFeedsDuration)

The logic here could use an overview comment explaining what it is trying to do. And explain stopFeedsDuration and padDuration.
Is the number of changefeeds > 1 so that they will all do their independent catchup scan? How does this cause the "Changefeeds running" graph to show an increasing step function?


pkg/cmd/roachtest/tests/admission_control_elastic_cdc.go line 124 at r1 (raw file):

									CREATE CHANGEFEED FOR tpcc.order_line, tpcc.stock, tpcc.customer
									INTO 'null://' WITH cursor = '-%ds'
								`, int64(float64(i+1)*padDuration.Seconds()))

Are we increasing how far back we need to scan for the catchup scan?

Part of cockroachdb#89208. This test sets up a 3-node CRDB cluster on 8vCPU
machines running 1000-warehouse TPC-C, and kicks off a few changefeed
backfills concurrently. We've observed latency spikes during backfills
because of its CPU/scan-heavy nature -- it can elevate CPU scheduling
latencies which in turn translates to an increase in foreground latency.

Also in this commit: routing std{err,out} from prometheus/grafana setup
that roachtests do to the logger in scope.

Release note: None
@irfansharif irfansharif force-pushed the 221007.rangefeed-ac-test branch from 36d5dd8 to 42d08d8 Compare October 17, 2022 20:13
Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @shermanCRL and @sumeerbhola)


pkg/cmd/roachtest/tests/admission_control_elastic_cdc.go line 106 at r1 (raw file):

Previously, sumeerbhola wrote…

Is this stopping feeds before starting the experiment below since there may be some leftover from before?

Yes, was handy when constantly re-running against the same cluster. Added a comment


pkg/cmd/roachtest/tests/admission_control_elastic_cdc.go line 117 at r1 (raw file):
Added comments.

Is the number of changefeeds > 1 so that they will all do their independent catchup scan?

Yes, and just to really swamp the cluster. Even this week @shermanCRL pointed me to this case internally where a customer was trying to run 50+ changefeeds all at once and being worried about the CPU effects. See this slack thread.

How does this cause the "Changefeeds running" graph to show an increasing step function?

The screenshot above was from a different version of the test. The most recent version/data dumps are here where we don't have these step functions: #89709 (comment).


pkg/cmd/roachtest/tests/admission_control_elastic_cdc.go line 124 at r1 (raw file):

Previously, sumeerbhola wrote…

Are we increasing how far back we need to scan for the catchup scan?

Yes, trying to scan as far back as possible (~ when the workload started). This test doesn't care about the work getting done or not (feeds are cancelled by the next iteration), just what the effect is on CPU/scheduling latencies. Added a comment.

@irfansharif
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Oct 17, 2022

Build succeeded:

@craig craig bot merged commit 55aca65 into cockroachdb:master Oct 17, 2022
@irfansharif irfansharif deleted the 221007.rangefeed-ac-test branch October 17, 2022 21:28
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 24, 2022
Part of cockroachdb#65957.

Changefeed backfills, given their scan-heavy nature, can be fairly
CPU-intensive. In cockroachdb#89656 we introduced a roachtest demonstrating the
latency impact backfills can have on a moderately CPU-saturated cluster.
Similar to what we saw for backups, this CPU heavy nature can elevate Go
scheduling latencies which in turn translates to foreground latency
impact. This commit integrates rangefeed catchup scan with the elastic
CPU limiter we introduced in cockroachdb#86638; this is one of two optional halves
of changefeed backfills. The second half is the initial scan -- scan
requests issued over some keyspan as of some timestamp. For that we
simply rely on the existing slots mechanism but now setting a lower
priority bit (BulkNormalPri) -- cockroachdb#88733. Experimentally we observed that
during initial scans the encoding routines in changefeedccl are the most
impactful CPU-wise, something cockroachdb#89589 can help with. We leave admission
integration of parallel worker goroutines to future work (cockroachdb#90089).

Unlike export requests rangefeed catchup scans are non-premptible. The
rangefeed RPC is a streaming one, and the catchup scan is done during
stream setup. So we don't have resumption tokens to propagate up to the
caller like we did for backups. We still want CPU-bound work going
through admission control to only use 100ms of CPU time, to exercise
better control over scheduling latencies. To that end, we introduce the
following component used within the rangefeed catchup iterator.

    // Pacer is used in tight loops (CPU-bound) for non-premptible
    // elastic work. Callers are expected to invoke Pace() every loop
    // iteration and Close() once done. Internally this type integrates
    // with elastic CPU work queue, acquiring tokens for the CPU work
    // being done, and blocking if tokens are unavailable. This allows
    // for a form of cooperative scheduling with elastic CPU granters.
    type Pacer struct
    func (p *Pacer) Pace(ctx context.Context) error { ... }
    func (p *Pacer) Close() { ... }

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 28, 2022
Part of cockroachdb#65957.

Changefeed backfills, given their scan-heavy nature, can be fairly
CPU-intensive. In cockroachdb#89656 we introduced a roachtest demonstrating the
latency impact backfills can have on a moderately CPU-saturated cluster.
Similar to what we saw for backups, this CPU heavy nature can elevate Go
scheduling latencies which in turn translates to foreground latency
impact. This commit integrates rangefeed catchup scan with the elastic
CPU limiter we introduced in cockroachdb#86638; this is one of two optional halves
of changefeed backfills. The second half is the initial scan -- scan
requests issued over some keyspan as of some timestamp. For that we
simply rely on the existing slots mechanism but now setting a lower
priority bit (BulkNormalPri) -- cockroachdb#88733. Experimentally we observed that
during initial scans the encoding routines in changefeedccl are the most
impactful CPU-wise, something cockroachdb#89589 can help with. We leave admission
integration of parallel worker goroutines to future work (cockroachdb#90089).

Unlike export requests rangefeed catchup scans are non-premptible. The
rangefeed RPC is a streaming one, and the catchup scan is done during
stream setup. So we don't have resumption tokens to propagate up to the
caller like we did for backups. We still want CPU-bound work going
through admission control to only use 100ms of CPU time, to exercise
better control over scheduling latencies. To that end, we introduce the
following component used within the rangefeed catchup iterator.

    // Pacer is used in tight loops (CPU-bound) for non-premptible
    // elastic work. Callers are expected to invoke Pace() every loop
    // iteration and Close() once done. Internally this type integrates
    // with elastic CPU work queue, acquiring tokens for the CPU work
    // being done, and blocking if tokens are unavailable. This allows
    // for a form of cooperative scheduling with elastic CPU granters.
    type Pacer struct
    func (p *Pacer) Pace(ctx context.Context) error { ... }
    func (p *Pacer) Close() { ... }

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Nov 2, 2022
Part of cockroachdb#65957.

Changefeed backfills, given their scan-heavy nature, can be fairly
CPU-intensive. In cockroachdb#89656 we introduced a roachtest demonstrating the
latency impact backfills can have on a moderately CPU-saturated cluster.
Similar to what we saw for backups, this CPU heavy nature can elevate Go
scheduling latencies which in turn translates to foreground latency
impact. This commit integrates rangefeed catchup scan with the elastic
CPU limiter we introduced in cockroachdb#86638; this is one of two optional halves
of changefeed backfills. The second half is the initial scan -- scan
requests issued over some keyspan as of some timestamp. For that we
simply rely on the existing slots mechanism but now setting a lower
priority bit (BulkNormalPri) -- cockroachdb#88733. Experimentally we observed that
during initial scans the encoding routines in changefeedccl are the most
impactful CPU-wise, something cockroachdb#89589 can help with. We leave admission
integration of parallel worker goroutines to future work (cockroachdb#90089).

Unlike export requests rangefeed catchup scans are non-premptible. The
rangefeed RPC is a streaming one, and the catchup scan is done during
stream setup. So we don't have resumption tokens to propagate up to the
caller like we did for backups. We still want CPU-bound work going
through admission control to only use 100ms of CPU time, to exercise
better control over scheduling latencies. To that end, we introduce the
following component used within the rangefeed catchup iterator.

    // Pacer is used in tight loops (CPU-bound) for non-premptible
    // elastic work. Callers are expected to invoke Pace() every loop
    // iteration and Close() once done. Internally this type integrates
    // with elastic CPU work queue, acquiring tokens for the CPU work
    // being done, and blocking if tokens are unavailable. This allows
    // for a form of cooperative scheduling with elastic CPU granters.
    type Pacer struct
    func (p *Pacer) Pace(ctx context.Context) error { ... }
    func (p *Pacer) Close() { ... }

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Nov 3, 2022
Part of cockroachdb#65957.

Changefeed backfills, given their scan-heavy nature, can be fairly
CPU-intensive. In cockroachdb#89656 we introduced a roachtest demonstrating the
latency impact backfills can have on a moderately CPU-saturated cluster.
Similar to what we saw for backups, this CPU heavy nature can elevate Go
scheduling latencies which in turn translates to foreground latency
impact. This commit integrates rangefeed catchup scan with the elastic
CPU limiter we introduced in cockroachdb#86638; this is one of two optional halves
of changefeed backfills. The second half is the initial scan -- scan
requests issued over some keyspan as of some timestamp. For that we
simply rely on the existing slots mechanism but now setting a lower
priority bit (BulkNormalPri) -- cockroachdb#88733. Experimentally we observed that
during initial scans the encoding routines in changefeedccl are the most
impactful CPU-wise, something cockroachdb#89589 can help with. We leave admission
integration of parallel worker goroutines to future work (cockroachdb#90089).

Unlike export requests rangefeed catchup scans are non-premptible. The
rangefeed RPC is a streaming one, and the catchup scan is done during
stream setup. So we don't have resumption tokens to propagate up to the
caller like we did for backups. We still want CPU-bound work going
through admission control to only use 100ms of CPU time, to exercise
better control over scheduling latencies. To that end, we introduce the
following component used within the rangefeed catchup iterator.

    // Pacer is used in tight loops (CPU-bound) for non-premptible
    // elastic work. Callers are expected to invoke Pace() every loop
    // iteration and Close() once done. Internally this type integrates
    // with elastic CPU work queue, acquiring tokens for the CPU work
    // being done, and blocking if tokens are unavailable. This allows
    // for a form of cooperative scheduling with elastic CPU granters.
    type Pacer struct
    func (p *Pacer) Pace(ctx context.Context) error { ... }
    func (p *Pacer) Close() { ... }

Release note: None
craig bot pushed a commit that referenced this pull request Nov 4, 2022
89709: kv,rangefeed: integrate catchup scans with elastic cpu r=irfansharif a=irfansharif

Part of #65957. First commit is from #89656.

Changefeed backfills, given their scan-heavy nature, can be fairly
CPU-intensive. In #89656 we introduced a roachtest demonstrating the
latency impact backfills can have on a moderately CPU-saturated cluster.
Similar to what we saw for backups, this CPU heavy nature can elevate Go
scheduling latencies which in turn translates to foreground latency
impact. This commit integrates rangefeed catchup scan with the elastic
CPU limiter we introduced in #86638; this is one of two optional halves
of changefeed backfills. The second half is the initial scan -- scan
requests issued over some keyspan as of some timestamp. For that we
simply rely on the existing slots mechanism but now setting a lower
priority bit (BulkNormalPri). Experimentally we observed that during
initial scans the encoding routines in changefeedccl are the most
impactful CPU-wise, something #89589 can help with. We leave admission
integration of parallel worker goroutines to future work.

Unlike export requests, rangefeed catchup scans are non-premptible. The
rangefeed RPC is a streaming one, and the catchup scan is done during
stream setup. So we don't have resumption tokens to propagate up to the
caller like we did for backups. We still want CPU-bound work going
through admission control to use at most 100ms of CPU time (it makes
for an ineffective controller otherwise), and to that end we introduce
the following component used within the rangefeed catchup iterator:

    // Pacer is used in tight loops (CPU-bound) for non-premptible
    // elastic work. Callers are expected to invoke Pace() every loop
    // iteration and Close() once done. Internally this type integrates
    // with elastic CPU work queue, acquiring tokens for the CPU work
    // being done, and blocking if tokens are unavailable. This allows
    // for a form of cooperative scheduling with elastic CPU granters.
    type Pacer struct
    func (p *Pacer) Pace(ctx context.Context) error { ... }
    func (p *Pacer) Close() { ... }

Release note: None

Co-authored-by: irfan sharif <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Nov 24, 2022
Part of cockroachdb#65957.

Changefeed backfills, given their scan-heavy nature, can be fairly
CPU-intensive. In cockroachdb#89656 we introduced a roachtest demonstrating the
latency impact backfills can have on a moderately CPU-saturated cluster.
Similar to what we saw for backups, this CPU heavy nature can elevate Go
scheduling latencies which in turn translates to foreground latency
impact. This commit integrates rangefeed catchup scan with the elastic
CPU limiter we introduced in cockroachdb#86638; this is one of two optional halves
of changefeed backfills. The second half is the initial scan -- scan
requests issued over some keyspan as of some timestamp. For that we
simply rely on the existing slots mechanism but now setting a lower
priority bit (BulkNormalPri) -- cockroachdb#88733. Experimentally we observed that
during initial scans the encoding routines in changefeedccl are the most
impactful CPU-wise, something cockroachdb#89589 can help with. We leave admission
integration of parallel worker goroutines to future work (cockroachdb#90089).

Unlike export requests rangefeed catchup scans are non-premptible. The
rangefeed RPC is a streaming one, and the catchup scan is done during
stream setup. So we don't have resumption tokens to propagate up to the
caller like we did for backups. We still want CPU-bound work going
through admission control to only use 100ms of CPU time, to exercise
better control over scheduling latencies. To that end, we introduce the
following component used within the rangefeed catchup iterator.

    // Pacer is used in tight loops (CPU-bound) for non-premptible
    // elastic work. Callers are expected to invoke Pace() every loop
    // iteration and Close() once done. Internally this type integrates
    // with elastic CPU work queue, acquiring tokens for the CPU work
    // being done, and blocking if tokens are unavailable. This allows
    // for a form of cooperative scheduling with elastic CPU granters.
    type Pacer struct
    func (p *Pacer) Pace(ctx context.Context) error { ... }
    func (p *Pacer) Close() { ... }

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants