roachperf: regression in tpcc on 2023/04/11 #101494

Closed · nvanbenschoten opened this issue Apr 13, 2023 · 7 comments · Fixed by #102193
Labels: branch-master (Failures and bugs on the master branch) · C-performance (Perf of queries or internals; solution not expected to change functional behavior) · T-testeng (TestEng Team)

Comments

nvanbenschoten (Member) commented Apr 13, 2023

Our performance dropped on tpccbench/nodes=3/cpu=16 across AWS and GCP on 2023/04/11. We should understand why and resolve the regression.

[Screenshot: tpccbench/nodes=3/cpu=16 throughput drop on 2023/04/11]

Jira issue: CRDB-26971

nvanbenschoten added the C-performance, branch-master, and T-testeng labels Apr 13, 2023
blathers-crl bot commented Apr 13, 2023

cc @cockroachdb/test-eng

srosenberg (Member) commented Apr 14, 2023

Below are the results of an automated bisection of kv95/enc=false/nodes=3/cpu=32 (GCE) using 4 runs, each with a duration of 10 minutes.

Average throughput per commit hash:

{
  "hashResults": {
    "85e41ca8d3d": "110214",
    "a5d61d2cf2b": "94856",
    "61946783bc6": "111257",
    "b7702e1ebf1": "110730",
    "668ec8dbd7b": "95100",
    "dd237e6b8d9": "107229",
    "1654c7d2504": "107733",
    "51ace6504e2": "108278",
    "b9b8da67d55": "107492",
    "bdf3f6262e3": "95087"
  },
  "goodThreshold": "107229",
  "badThreshold": "95100"
}

Bisection log:

git bisect start
# status: waiting for both good and bad commits
# good: [85e41ca8d3d9edacf5ee3061a2591b159b9b0502] Merge #100964
git bisect good 85e41ca8d3d9edacf5ee3061a2591b159b9b0502
# status: waiting for bad commit, 1 good commit known
# bad: [a5d61d2cf2bfc11bd6bdc27dcab32467334fb123] Merge #101148
git bisect bad a5d61d2cf2bfc11bd6bdc27dcab32467334fb123
# good: [61946783bc69e4fe9dab0f7702e7a0ae550af84d] Merge #100925 #100993
git bisect good 61946783bc69e4fe9dab0f7702e7a0ae550af84d
# good: [b7702e1ebf1d41aadc4fe409f763256a1a853eaa] Merge #99719 #100921 #100976 #101128 #101129
git bisect good b7702e1ebf1d41aadc4fe409f763256a1a853eaa
# bad: [668ec8dbd7bc3148faba5293dbccfa2c7e62830a] Merge #100533 #100776 #101076 #101078
git bisect bad 668ec8dbd7bc3148faba5293dbccfa2c7e62830a
# good: [dd237e6b8d9053ad3d993bf7a593bbf3ad229075] Merge #100406 #101160 #101161
git bisect good dd237e6b8d9053ad3d993bf7a593bbf3ad229075
# good: [1654c7d2504b2ead7b03bf5ba7bc9708762ba06f] autoconfig: transactional SQL should wait on leases
git bisect good 1654c7d2504b2ead7b03bf5ba7bc9708762ba06f
# good: [51ace6504e2b122274d993e06b32ce94e109281c] roachtest: move copyfrom test suite to SQL Queries
git bisect good 51ace6504e2b122274d993e06b32ce94e109281c
# good: [b9b8da67d55451c2e322223aa3b83d5a9be20d15] opt: fix ordering-related optimizer panics
git bisect good b9b8da67d55451c2e322223aa3b83d5a9be20d15
# bad: [bdf3f6262e380b867aa8884016eae53d28d074a4] workload: jitter the teardown of connections to prevent thundering herd
git bisect bad bdf3f6262e380b867aa8884016eae53d28d074a4
# first bad commit: [bdf3f6262e380b867aa8884016eae53d28d074a4] workload: jitter the teardown of connections to prevent thundering herd
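
For reference, a bisection like this can be scripted with git bisect run plus a small classifier that maps measured throughput to an exit code. Below is a minimal sketch in Go (a hypothetical helper, not the tooling that produced the log above), assuming each step's average throughput has been written to /tmp/avg_throughput.txt:

// bisect_classify.go (hypothetical helper): exits 0 if the measured throughput
// looks good, 1 if it looks bad, and 125 to ask `git bisect run` to skip the
// commit. The thresholds mirror goodThreshold/badThreshold above.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const (
	goodThreshold = 107229.0 // average throughput at or above this => good
	badThreshold  = 95100.0  // average throughput at or below this => bad
)

func main() {
	raw, err := os.ReadFile("/tmp/avg_throughput.txt") // assumed benchmark output
	if err != nil {
		os.Exit(125) // no result for this commit; skip it
	}
	avg, err := strconv.ParseFloat(strings.TrimSpace(string(raw)), 64)
	if err != nil {
		os.Exit(125)
	}
	switch {
	case avg >= goodThreshold:
		fmt.Printf("good: %.0f\n", avg)
		os.Exit(0)
	case avg <= badThreshold:
		fmt.Printf("bad: %.0f\n", avg)
		os.Exit(1)
	default:
		fmt.Printf("inconclusive: %.0f\n", avg)
		os.Exit(125) // between thresholds; skip rather than guess
	}
}

It would be wired up along the lines of git bisect run sh -c 'run_benchmark && go run ./bisect_classify.go', where run_benchmark stands in for whatever produces /tmp/avg_throughput.txt.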

The culprit appears to be bdf3f62, and the prime suspect is the mutex added [1] to fix the race condition in Record. Could it be that lock contention is effectively rate-limiting the workload generator?

[1] https://github.com/cockroachdb/cockroach/blame/master/pkg/workload/histogram/histogram.go#L76
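
One way to sanity-check that hypothesis is a parallel micro-benchmark of a mutex-guarded Record-style hot path. The following is a self-contained sketch (not the actual histogram code) comparing a single shared mutex against atomics; put it in a _test.go file and run something like go test -bench . -cpu 8,32:

package histbench

import (
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

// lockedRecorder mimics a Record method guarded by a single shared mutex.
type lockedRecorder struct {
	mu    sync.Mutex
	count int64
	sum   int64
}

func (r *lockedRecorder) Record(d time.Duration) {
	r.mu.Lock()
	r.count++
	r.sum += int64(d)
	r.mu.Unlock()
}

// atomicRecorder does the same bookkeeping without a shared lock.
type atomicRecorder struct {
	count atomic.Int64
	sum   atomic.Int64
}

func (r *atomicRecorder) Record(d time.Duration) {
	r.count.Add(1)
	r.sum.Add(int64(d))
}

func BenchmarkLockedRecord(b *testing.B) {
	var r lockedRecorder
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			r.Record(time.Microsecond)
		}
	})
}

func BenchmarkAtomicRecord(b *testing.B) {
	var r atomicRecorder
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			r.Record(time.Microsecond)
		}
	})
}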

srosenberg (Member) commented Apr 25, 2023

The mutex that was added to fix the race condition is not on a hot path; it accounts for ~1% of CPU,

workload_jitters_flame

The actual culprit turned out to be another mutex, which happens to be on a hot path,

// Method returns the query execution mode of the connection pool.
func (m *MultiConnPool) Method() pgx.QueryExecMode {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.mu.method
}
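
If the query exec mode never changes after the pool is constructed (an assumption here), the lock on this read path can be dropped entirely by treating the field as immutable. A minimal sketch:

package workload

import "github.com/jackc/pgx/v5"

// Sketch only: assumes the query exec mode is fixed once the pool is built,
// so the read path needs no synchronization at all.
type MultiConnPool struct {
	method pgx.QueryExecMode // set once during pool construction, read-only afterwards
	// other fields elided
}

// Method returns the query execution mode of the connection pool
// without taking a lock on the hot path.
func (m *MultiConnPool) Method() pgx.QueryExecMode {
	return m.method
}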

Additionally, Tracer results in non-trivial allocations and accounts for ~5% of CPU,

NewMultiConnPoolAfter

cf. before the change,

NewMultiConnPoolBefore
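
If the generator doesn't need per-query tracing, those allocations can be avoided by simply not installing a Tracer on the connection config. A minimal pgxpool sketch (the connection string below is a placeholder):

package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
)

func newPool(ctx context.Context, connString string) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig(connString)
	if err != nil {
		return nil, err
	}
	// Leave cfg.ConnConfig.Tracer unset (nil): no per-query tracing hooks,
	// and no tracing-related allocations on the hot path.
	cfg.ConnConfig.Tracer = nil
	return pgxpool.NewWithConfig(ctx, cfg)
}

func main() {
	pool, err := newPool(context.Background(), "postgres://root@localhost:26257/defaultdb")
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()
}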

srosenberg (Member) commented

Removing the mutex (in Method) and the Tracer altogether removes the bottleneck in the workload generator. The observed average throughput in kv95 is essentially equivalent to that of the previous workload generator, which used pgx v4.

kv95_pgx_v4_vs_v5

srosenberg (Member) commented Apr 25, 2023

Note that there are observable runtime differences between pgx v4 and v5. The runtime plots below correspond to the above two runs; the left side is pgx v4, the right side is pgx v5.

memory_pgx_v4_vs_v5

goroutines_pgx_v4_vs_v5

We can see that v5 uses slightly more memory. The number of goroutines is also more stable in v4, where it's nearly a flat line; cf. v5, where the number of goroutines oscillates roughly between 360 and 390.

NOTE: Owing to the concurrency setting (3 nodes x 64), there are 192 workload goroutines. Thus, the remaining majority is spawned by pgx.

srosenberg (Member) commented Apr 25, 2023

One major difference is reflected in the flamegraphs below: pgx v5 uses an internal buffer in conjunction with non-blocking I/O. This also results in slightly better latency (see below).

v4 with blocking IO,

execPrepared_v4

v5 with non-blocking IO,

execPrepared_v5

kv95 read latency,

kv95_read_latency_v4_vs_v5
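
To illustrate the buffering side of that difference in isolation: coalescing protocol messages in an in-memory buffer and flushing them with one write replaces a write syscall per message with one per batch. The sketch below is a generic illustration of that idea, not pgx's implementation:

package main

import (
	"bufio"
	"fmt"
	"net"
)

// sendBuffered coalesces multiple protocol messages in a single in-memory
// buffer and flushes them with one write, instead of one write syscall per
// message. Generic illustration only; pgx v5's actual implementation also
// uses non-blocking reads on the same connection.
func sendBuffered(conn net.Conn, msgs [][]byte) error {
	w := bufio.NewWriterSize(conn, 8192)
	for _, m := range msgs {
		if _, err := w.Write(m); err != nil {
			return err
		}
	}
	return w.Flush() // a single write for the whole batch
}

func main() {
	client, server := net.Pipe()
	go func() {
		_ = sendBuffered(client, [][]byte{[]byte("SELECT 1;"), []byte("SELECT 2;")})
		client.Close()
	}()
	buf := make([]byte, 64)
	n, _ := server.Read(buf) // both messages arrive in one read
	fmt.Printf("received %d bytes in one read\n", n)
}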

srosenberg (Member) commented

With respect to the number of goroutines being unstable in v5, I did some profiling by sampling all goroutines at different times,

curl "http://34.73.104.183:33333/debug/pprof/goroutine?debug=2  >/tmp/out"
panicparse /tmp/out

The root cause is ContextWatcher [1], which seems to correspond to closed connections. I am not exactly sure why more (TCP) connections are being closed, possibly due to timeout differences, but it doesn't appear to be a cause for concern. The network metrics below don't show a large disparity in re-established connections.

kv95_tcp_conn_v4_vs_v5

[1] https://github.com/jackc/pgx/blob/master/pgconn/internal/ctxwatch/context_watcher.go#L46-L54
