roachperf: regression in tpcc on 2023/04/11 #101494

Closed · nvanbenschoten opened this issue Apr 13, 2023 · 7 comments · Fixed by #102193
Labels: branch-master (Failures and bugs on the master branch) · C-performance (Perf of queries or internals; solution not expected to change functional behavior) · T-testeng (TestEng Team)

Comments

nvanbenschoten (Member) commented Apr 13, 2023

Our performance dropped on tpccbench/nodes=3/cpu=16 across AWS and GCP on 2023/04/11. We should understand why and resolve the regression.

[Screenshot: tpccbench/nodes=3/cpu=16 throughput drop on 2023/04/11]

Jira issue: CRDB-26971

nvanbenschoten added the C-performance, branch-master, and T-testeng labels Apr 13, 2023
blathers-crl bot commented Apr 13, 2023

cc @cockroachdb/test-eng

srosenberg (Member) commented Apr 14, 2023

Below are the results of an automated bisection of kv95/enc=false/nodes=3/cpu=32 (GCE) using 4 runs, each with a duration of 10 minutes.

Average throughput per commit hash:

{
  "hashResults": {
    "85e41ca8d3d": "110214",
    "a5d61d2cf2b": "94856",
    "61946783bc6": "111257",
    "b7702e1ebf1": "110730",
    "668ec8dbd7b": "95100",
    "dd237e6b8d9": "107229",
    "1654c7d2504": "107733",
    "51ace6504e2": "108278",
    "b9b8da67d55": "107492",
    "bdf3f6262e3": "95087"
  },
  "goodThreshold": "107229",
  "badThreshold": "95100"
}

Bisection log:

git bisect start
# status: waiting for both good and bad commits
# good: [85e41ca8d3d9edacf5ee3061a2591b159b9b0502] Merge #100964
git bisect good 85e41ca8d3d9edacf5ee3061a2591b159b9b0502
# status: waiting for bad commit, 1 good commit known
# bad: [a5d61d2cf2bfc11bd6bdc27dcab32467334fb123] Merge #101148
git bisect bad a5d61d2cf2bfc11bd6bdc27dcab32467334fb123
# good: [61946783bc69e4fe9dab0f7702e7a0ae550af84d] Merge #100925 #100993
git bisect good 61946783bc69e4fe9dab0f7702e7a0ae550af84d
# good: [b7702e1ebf1d41aadc4fe409f763256a1a853eaa] Merge #99719 #100921 #100976 #101128 #101129
git bisect good b7702e1ebf1d41aadc4fe409f763256a1a853eaa
# bad: [668ec8dbd7bc3148faba5293dbccfa2c7e62830a] Merge #100533 #100776 #101076 #101078
git bisect bad 668ec8dbd7bc3148faba5293dbccfa2c7e62830a
# good: [dd237e6b8d9053ad3d993bf7a593bbf3ad229075] Merge #100406 #101160 #101161
git bisect good dd237e6b8d9053ad3d993bf7a593bbf3ad229075
# good: [1654c7d2504b2ead7b03bf5ba7bc9708762ba06f] autoconfig: transactional SQL should wait on leases
git bisect good 1654c7d2504b2ead7b03bf5ba7bc9708762ba06f
# good: [51ace6504e2b122274d993e06b32ce94e109281c] roachtest: move copyfrom test suite to SQL Queries
git bisect good 51ace6504e2b122274d993e06b32ce94e109281c
# good: [b9b8da67d55451c2e322223aa3b83d5a9be20d15] opt: fix ordering-related optimizer panics
git bisect good b9b8da67d55451c2e322223aa3b83d5a9be20d15
# bad: [bdf3f6262e380b867aa8884016eae53d28d074a4] workload: jitter the teardown of connections to prevent thundering herd
git bisect bad bdf3f6262e380b867aa8884016eae53d28d074a4
# first bad commit: [bdf3f6262e380b867aa8884016eae53d28d074a4] workload: jitter the teardown of connections to prevent thundering herd
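
For reference, a bisection like this can be scripted with git bisect run plus a small classifier that maps measured throughput to an exit code. Below is a minimal sketch in Go (a hypothetical helper, not the tooling that produced the log above), assuming each step's average throughput has been written to /tmp/avg_throughput.txt:

// bisect_classify.go (hypothetical helper): exits 0 if the measured throughput
// looks good, 1 if it looks bad, and 125 to ask `git bisect run` to skip the
// commit. The thresholds mirror goodThreshold/badThreshold above.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const (
	goodThreshold = 107229.0 // average throughput at or above this => good
	badThreshold  = 95100.0  // average throughput at or below this => bad
)

func main() {
	raw, err := os.ReadFile("/tmp/avg_throughput.txt") // assumed benchmark output
	if err != nil {
		os.Exit(125) // no result for this commit; skip it
	}
	avg, err := strconv.ParseFloat(strings.TrimSpace(string(raw)), 64)
	if err != nil {
		os.Exit(125)
	}
	switch {
	case avg >= goodThreshold:
		fmt.Printf("good: %.0f\n", avg)
		os.Exit(0)
	case avg <= badThreshold:
		fmt.Printf("bad: %.0f\n", avg)
		os.Exit(1)
	default:
		fmt.Printf("inconclusive: %.0f\n", avg)
		os.Exit(125) // between thresholds; skip rather than guess
	}
}

It would be wired up along the lines of git bisect run sh -c 'run_benchmark && go run ./bisect_classify.go', where run_benchmark stands in for whatever produces /tmp/avg_throughput.txt.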

The culprit appears to be bdf3f62, and the prime suspect is the mutex added [1] to fix the race condition in Record. Could it be that lock contention is effectively rate-limiting the workload generator?

[1] https://github.com/cockroachdb/cockroach/blame/master/pkg/workload/histogram/histogram.go#L76
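
One way to sanity-check that hypothesis is a parallel micro-benchmark of a mutex-guarded Record-style hot path. The following is a self-contained sketch (not the actual histogram code) comparing a single shared mutex against atomics; put it in a _test.go file and run something like go test -bench . -cpu 8,32:

package histbench

import (
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

// lockedRecorder mimics a Record method guarded by a single shared mutex.
type lockedRecorder struct {
	mu    sync.Mutex
	count int64
	sum   int64
}

func (r *lockedRecorder) Record(d time.Duration) {
	r.mu.Lock()
	r.count++
	r.sum += int64(d)
	r.mu.Unlock()
}

// atomicRecorder does the same bookkeeping without a shared lock.
type atomicRecorder struct {
	count atomic.Int64
	sum   atomic.Int64
}

func (r *atomicRecorder) Record(d time.Duration) {
	r.count.Add(1)
	r.sum.Add(int64(d))
}

func BenchmarkLockedRecord(b *testing.B) {
	var r lockedRecorder
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			r.Record(time.Microsecond)
		}
	})
}

func BenchmarkAtomicRecord(b *testing.B) {
	var r atomicRecorder
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			r.Record(time.Microsecond)
		}
	})
}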

srosenberg (Member) commented Apr 25, 2023

The mutex that was added to fix the race condition is not on a hot path; it accounts for ~1% of CPU,

workload_jitters_flame

The actual culprit turned out to be another mutex, which happens to be on a hot path,

// Method returns the query execution mode of the connection pool.
func (m *MultiConnPool) Method() pgx.QueryExecMode {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.mu.method
}
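
If the query exec mode never changes after the pool is constructed (an assumption here), the lock on this read path can be dropped entirely by treating the field as immutable. A minimal sketch:

package workload

import "github.com/jackc/pgx/v5"

// Sketch only: assumes the query exec mode is fixed once the pool is built,
// so the read path needs no synchronization at all.
type MultiConnPool struct {
	method pgx.QueryExecMode // set once during pool construction, read-only afterwards
	// other fields elided
}

// Method returns the query execution mode of the connection pool
// without taking a lock on the hot path.
func (m *MultiConnPool) Method() pgx.QueryExecMode {
	return m.method
}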

Additionally, Tracer results in non-trivial allocations and accounts for ~5% of CPU,

NewMultiConnPoolAfter

cf. before the change,

NewMultiConnPoolBefore
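
If the generator doesn't need per-query tracing, those allocations can be avoided by simply not installing a Tracer on the connection config. A minimal pgxpool sketch (the connection string below is a placeholder):

package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
)

func newPool(ctx context.Context, connString string) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig(connString)
	if err != nil {
		return nil, err
	}
	// Leave cfg.ConnConfig.Tracer unset (nil): no per-query tracing hooks,
	// and no tracing-related allocations on the hot path.
	cfg.ConnConfig.Tracer = nil
	return pgxpool.NewWithConfig(ctx, cfg)
}

func main() {
	pool, err := newPool(context.Background(), "postgres://root@localhost:26257/defaultdb")
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()
}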

srosenberg (Member) commented

Removing the mutex (in Method) and the Tracer altogether removes the bottleneck in the workload generator. The observed average throughput in kv95 is essentially equivalent to that of the previous workload generator, which used pgx v4.

kv95_pgx_v4_vs_v5

srosenberg (Member) commented Apr 25, 2023

Note that there are observable runtime differences between pgx v4 and v5. The runtime plots below correspond to the above two runs; the left side is pgx v4, the right side is pgx v5.

memory_pgx_v4_vs_v5

goroutines_pgx_v4_vs_v5

We can see that v5 uses slightly more memory. The number of goroutines is also more stable in v4, where it's nearly a flat line; cf. v5, where the number of goroutines oscillates roughly between 360 and 390.

NOTE: Owing to the concurrency setting (3 nodes x 64), there are 192 workload goroutines. Thus, the remaining majority is spawned by pgx.

srosenberg (Member) commented Apr 25, 2023

One major difference is reflected in the flamegraphs below: pgx v5 uses an internal buffer in conjunction with non-blocking I/O. This also results in slightly better latency (see below).

v4 with blocking IO,

execPrepared_v4

v5 with non-blocking IO,

execPrepared_v5

kv95 read latency,

kv95_read_latency_v4_vs_v5
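
To illustrate the buffering side of that difference in isolation: coalescing protocol messages in an in-memory buffer and flushing them with one write replaces a write syscall per message with one per batch. The sketch below is a generic illustration of that idea, not pgx's implementation:

package main

import (
	"bufio"
	"fmt"
	"net"
)

// sendBuffered coalesces multiple protocol messages in a single in-memory
// buffer and flushes them with one write, instead of one write syscall per
// message. Generic illustration only; pgx v5's actual implementation also
// uses non-blocking reads on the same connection.
func sendBuffered(conn net.Conn, msgs [][]byte) error {
	w := bufio.NewWriterSize(conn, 8192)
	for _, m := range msgs {
		if _, err := w.Write(m); err != nil {
			return err
		}
	}
	return w.Flush() // a single write for the whole batch
}

func main() {
	client, server := net.Pipe()
	go func() {
		_ = sendBuffered(client, [][]byte{[]byte("SELECT 1;"), []byte("SELECT 2;")})
		client.Close()
	}()
	buf := make([]byte, 64)
	n, _ := server.Read(buf) // both messages arrive in one read
	fmt.Printf("received %d bytes in one read\n", n)
}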

srosenberg (Member) commented

With respect to the number of goroutines being unstable in v5, I did some profiling by sampling all goroutines at different times,

curl "http://34.73.104.183:33333/debug/pprof/goroutine?debug=2  >/tmp/out"
panicparse /tmp/out

The root cause is ContextWatcher [1], which seems to correspond to closed connections. I am not exactly sure why more (TCP) connections are being closed, possibly due to timeout differences, but it doesn't appear to be a cause for concern. The network metrics below don't show a large disparity in re-established connections.

kv95_tcp_conn_v4_vs_v5

[1] https://github.com/jackc/pgx/blob/master/pgconn/internal/ctxwatch/context_watcher.go#L46-L54
