roachtest: tpccbench/nodes=3/cpu=16 failed #59424
Very similar symptoms to what we're seeing in #58298. It's unclear to me what's causing the crash. The Heap profile doesn't seem to indicate any foulness and looks like 64d6d87 is doing what it's supposed to. However, I am seeing a whole lot of
Should no longer be a release blocker as long as the release is picking up #59431.
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@5971ecb9dd1a25c81cd6012d6be1ff922802eae5:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@3c223f5f5162103110a790743b687ef2bf952489:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@83e70ce84b740e27e721c3b73c38a4b8b515094a:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash
The last two failures are likely fallout from #60765.
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@64c4aef909f4382523cd9248341ca9f4448d841a:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@bf9744bad5a416a4b06907f0f3dd42896f7342f3:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@5cfd7e5553a3072a1490d392390dddf968844215:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash
I think this is a real failure that needs to be investigated. Partway through the workload we seem to hit 0 QPS due to crashed nodes. Except the logs for these nodes aren't available, because the machines are gone? Could be an infra flake, but I'm not sure why this is the only test (that I'm aware of) that's seeing a pattern like this.
These are exactly the symptoms we would expect to see due to the bug fixed by #60992, and the timing lines up. So I'd optimistically sit on this and wait to see if it continues reproducing before spending time investigating.
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@ec011620c7cf299fdbb898db692b36454defc4a2:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@c7e088826bc079620dfd3b5ae75d1c15cd9cd16d:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@6601d827b814d4e85a1081b03bf2562d8ac2a4ab:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@9595a158f0233e1c3d86786ec4462dd39c7beb20:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash
Thanks for the find about mistakenly attaching flow stats regardless of whether execution stats are collected. I opened up a PR to fix it. @irfansharif #61532 was recently merged, which changed the code quoted here. We should now be using the span created here for stats propagation: cockroach/pkg/sql/conn_executor.go Lines 1467 to 1471 in c7f9851
Gotcha, thanks for the heads up. I think we'll want to revert #61532. I'm assuming the original motivation was to reduce the overhead of sampling by not creating any new spans? Well, with #61777 the statement's span will be a no-op one, so when we know we're sampling, we should go and create a new span. I think in the end we'd be doing no worse for sampled statements (relative to #61532).
This isn't something I am planning to finish, but I wanted to see if I would immediately run into any snags when trying to get `roachprod start` to spin up a Docker container instead of running bare-metal.

The motivation is twofold:

- I'm generally interested in how we evolve roachprod over time. Specifically, the question is whether in the long run systems testing should be done in k8s and which, if any, stepping stones there are to evolve what we currently do in that direction.
- The current bare-metal setup goes [unresponsive] when CRDB goes into overdrive, and tests fail in the most opaque ways. We then incur a large tax for debugging these situations since we can't access the cluster in that state.

Work that would need to be done to really finish this:

- Maintain ubuntu images that come with docker installed (right now it needs to be set up manually), the `ubuntu` user set up, and the images we need cached.
- We hit the old problem of having to pass the uid/gid to the container to avoid creating files as root. I hacked around it by hard-coding them.
- Setting the cluster settings, etc., is all done via the `./cockroach` binary, but that is no longer a thing. Ideally that should use SQL.
- The whole premise of uploading binaries is out of the window. We need to deal in CRDB containers exclusively, though we could conceivably use a wrapper container running the uploaded binary if we wanted to retain how roachprod/test work.

[unresponsive]: cockroachdb#59424 (comment)

Release note: None
This drastically reduces the memory overhead for tracing we're observing in cockroachdb#59424. This commit does a few disparate things to make it happen:

1. We now access the tracing span through txnState.Ctx exclusively. This gives us a single point to hijack, which we'll later do. By default, txns are initialized with a no-op span. If session tracing is enabled later on, we'll create a real (verbose) span and swap it in for the txn's no-op one. This gives us the same semantics as earlier, and on the plus side, we're not re-using the same tracing span when session tracing is toggled.
2. Harden tracing methods to work with no-op spans, specifically GetRecording and TraceID.
3. Remove a crash vector through crdb_internal.trace_id. It was previously reaching into the first recording to retrieve a trace ID, but it's not guaranteed that recordings are non-empty. This could be used to induce panics in the server.

This PR will need to get backported to 21.1.

Fixes cockroachdb#59424.

Release note: None
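The three points above can be illustrated with a small sketch. It is written under loud assumptions: `Span`, `RecordedSpan`, and this `txnState` are simplified, hypothetical stand-ins (the real code reaches the span through txnState.Ctx, which is modeled here as a plain field), and a nil `*Span` plays the role of the no-op span. The point is only that `TraceID` and `GetRecording` stay safe on a no-op span, and that the trace ID is read from the span itself rather than from the first element of a possibly empty recording.

```go
package main

import "fmt"

// RecordedSpan is a hypothetical, simplified recording entry.
type RecordedSpan struct {
	TraceID uint64
	Message string
}

// Span is a hypothetical tracing span. A nil *Span acts as the no-op span.
type Span struct {
	traceID   uint64
	recording []RecordedSpan
}

// TraceID is safe on the no-op span and never indexes into the recording,
// so an empty recording cannot cause a panic.
func (s *Span) TraceID() uint64 {
	if s == nil {
		return 0
	}
	return s.traceID
}

// GetRecording returns nil for the no-op span.
func (s *Span) GetRecording() []RecordedSpan {
	if s == nil {
		return nil
	}
	return s.recording
}

// Record appends a message to a real span and is a no-op otherwise.
func (s *Span) Record(msg string) {
	if s == nil {
		return
	}
	s.recording = append(s.recording, RecordedSpan{TraceID: s.traceID, Message: msg})
}

// txnState is the single access point for the span (assumption: a plain
// field here instead of a context).
type txnState struct {
	sp *Span
}

// enableSessionTracing swaps the no-op span for a fresh real one.
func (ts *txnState) enableSessionTracing(traceID uint64) {
	ts.sp = &Span{traceID: traceID}
}

func main() {
	ts := &txnState{} // starts with the no-op span
	ts.sp.Record("dropped")
	fmt.Println(ts.sp.TraceID(), len(ts.sp.GetRecording()))

	ts.enableSessionTracing(42)
	ts.sp.Record("kept")
	fmt.Println(ts.sp.TraceID(), len(ts.sp.GetRecording()))
}
```

Because every caller goes through `ts.sp`, swapping the span in one place is enough for the rest of the code to start recording.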
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@4b98115dfda02a9498f566958bd915c45ec7e449:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash
61777: sql: only create real spans when session tracing/sampling r=irfansharif a=irfansharif
+cc @cockroachdb/kv-east.
Co-authored-by: irfan sharif <[email protected]>
We arrived at the previous default rate of 10% back in cockroachdb#59379. This was back when we were creating real tracing spans for all statements, and for sampled statements, we were propagating additional stats payloads. Consequently, what cockroachdb#59379 ended up measuring (and finding acceptable) was the performance hit we would incur for propagating stats payloads for statements already using real tracing spans.

Since then, the landscape has changed. Notably we introduced cockroachdb#61777, which made it so that we were only using real tracing spans for sampled statements. This was done after performance analysis in cockroachdb#59424 showed that the use of real tracing spans in all statements resulted in tremendous overhead, for no real benefit.

What this now leaves us with is a sampling rate that was tuned by only considering the stats payload overhead. What we want now is to also consider the overhead of using real tracing spans for sampled statements, vs. not. Doing this analysis gives us a very different picture for what the default rate should be.

---

To find out what the overhead for sampled statements currently is, we experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple benchmark that does little more than one-off statements, so it should give us a concise picture of the sampling overhead. We ran six experiments in total (each corresponding to a pair of read+write rows), done in groups of three (each group corresponding to a table below). Each run in turn comprises 10 iterations of kv95, and what's varied between each run is the default sampling rate. We pin a sampling rate of 0.0 as the baseline that effectively switches off sampling entirely (and tracing), and measure the throughput degradation as we vary the sampling rate.

                    |     ops/sec      |     ops/sec
--------------------|------------------|------------------
rate   op      grp  | median     diff  | mean       diff
--------------------|------------------|------------------
0.00 / read  / #1   | 69817.90         | 69406.37
0.01 / read  / #1   | 69300.35  -0.74% | 68717.23  -0.99%
0.10 / read  / #1   | 67743.35  -2.97% | 67601.81  -2.60%
0.00 / write / #1   |  3672.55         |  3653.63
0.01 / write / #1   |  3647.65  -0.68% |  3615.90  -1.03%
0.10 / write / #1   |  3567.20  -2.87% |  3558.90  -2.59%

                    |     ops/sec      |     ops/sec
--------------------|------------------|------------------
rate   op      grp  | median     diff  | mean       diff
--------------------|------------------|------------------
0.00 / read  / #2   | 69440.80         | 68893.24
0.01 / read  / #2   | 69481.55  +0.06% | 69463.13  +0.82%  (probably in the noise margin)
0.10 / read  / #2   | 67841.80  -2.30% | 66992.55  -2.76%
0.00 / write / #2   |  3652.45         |  3625.24
0.01 / write / #2   |  3657.55  +0.14% |  3654.34  +0.80%
0.10 / write / #2   |  3570.75  -2.24% |  3526.04  -2.74%

The results above suggest that the current default rate of 10% is too high, and a 1% rate is much more acceptable.

---

The fact that the cost of sampling is largely dominated by tracing is extremely unfortunate. We have ideas for how that can be improved (prototyped in cockroachdb#62227), but they're much too invasive to backport to 21.1.

It's unfortunate that we only discovered the overhead this late in the development cycle. It was due to two major reasons:

- cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by propagating real tracing spans across rpc boundaries). We had done sanity checking for the tracing overhead before this point, but failed to realize that cockroachdb#59992 would merit re-analysis.
- The test that alerted us to the degradation (tpccbench) had been persistently failing for a myriad of other reasons, so we didn't learn until too late that tracing was the latest offender. tpccbench also doesn't deal with VM overload well (something cockroachdb#62361 hopes to address), and after tracing was enabled for realsies, this was the dominant failure mode. This resulted in perf data not making its way to roachperf, which further hid possible indicators that we had a major regression on our hands. We also didn't have a healthy process looking at roachperf on a continual basis, something we're looking to rectify going forward. We would've picked up on this regression had we been closely monitoring the kv95 charts.

Release note: None
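For readers who want to check the "diff" columns, here is a minimal sketch of the arithmetic, assuming each diff is the relative change in ops/sec against the 0.00-rate baseline of the same group. The helper name `pctDiff` is made up for illustration; the input values are the group #1 median read throughputs from the table above.

```go
package main

import "fmt"

// pctDiff returns the relative change of got versus baseline, in percent.
// Negative values mean throughput dropped relative to the baseline.
func pctDiff(baseline, got float64) float64 {
	return (got - baseline) / baseline * 100
}

func main() {
	// Median read throughput (ops/sec) for group #1 at sampling rate 0.00.
	baseline := 69817.90

	runs := []struct {
		rate float64
		ops  float64
	}{
		{0.01, 69300.35},
		{0.10, 67743.35},
	}
	for _, r := range runs {
		fmt.Printf("rate=%.2f diff=%+.2f%%\n", r.rate, pctDiff(baseline, r.ops))
	}
	// Output:
	// rate=0.01 diff=-0.74%
	// rate=0.10 diff=-2.97%
}
```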
62998: sql: lower default sampling rate to 1% r=irfansharif a=irfansharif
Co-authored-by: irfan sharif <[email protected]>
(roachtest).tpccbench/nodes=3/cpu=16 failed on master@d86781c07065421f4a4d8bf5d988900ab07fdce5:
Artifacts: /tpccbench/nodes=3/cpu=16
See this test on roachdash