-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: tpcc/mixed-headroom/n5cpu16 failed #36983
Comments
@jordanlewis mind taking a look? |
SHA: https://github.com/cockroachdb/cockroach/commits/99306ec3e9fcbba01c05431cbf496e8b5b8954b4 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1260033&tab=buildLog
|
Hmm, the original failure is an odd one. I think it means there was an error in a statement in a transaction but the client didn't do the right thing and roll back - instead it tried to do another statement as normal. @andreimatei does that match your guess? |
SHA: https://github.com/cockroachdb/cockroach/commits/dff4132a80e62c6c5ad603ff6c608b09419d4e3e Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1264632&tab=buildLog
|
Yup The other failures are about us not having updated some infrastructure for 19.2 I think. I'll deal with it. |
SHA: https://github.com/cockroachdb/cockroach/commits/efb45869b242137e5c178b10c646c3ed025fff36 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1266041&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/dab475d78289721714b9c3b5b649dbb4794fc01f Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1268212&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/d554884a4e474cc06213230d5ba7d757a88e9e46 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1286052&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/46b82eb637a3f6a1ae563fcad7448fa5458f8298 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1290161&tab=buildLog
|
I think this is a bug in the TPCC load generator, not Cockroach, but I'm not sure. |
SHA: https://github.com/cockroachdb/cockroach/commits/a65ec243adaa11ea951f02319dc2e02463461ab2 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1290143&tab=buildLog
|
I need to move the ssh debug output above the error or it'll be annoying to debug these going forward. |
@tbg do we expect this test to pass? I'm still not really sure what it does - that error means the cluster was overloaded enough for a time that a query failed. |
We expect this to pass, the headroom test intentionally runs with a much lower warehouse count than we think we support (1300+).
|
Oh, |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Previous two issues addressed by #37701. |
SHA: https://github.com/cockroachdb/cockroach/commits/1810a4eaa07b412b2d0899d25bb16a28a2746d48 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1300948&tab=buildLog
|
@ajwerner this looks like the issue you just fixed. Has that change been backported to 19.1 yet? |
This comment has been minimized.
This comment has been minimized.
The most recent failures are due to the non-nullable fields in |
38465: roachpb: Change RangeDescriptor.StickyBit and ReplicaDescriptor.Type to nullable r=jeffrey-xiao a=jeffrey-xiao On a mixed version cluster, only having #38302 does not guarantee backwards compatibility for non-nullable fields in RangeDescriptor. Suppose the replicas of a range are of mixed version and the current leaseholder of the range is on newest version. If the leaseholder splits that range and writes to the non-nullable fields in RangeDescriptor and the leaseholder transfers to a node on an older version without #38302, then all CPuts would fail on that node. We must guarantee that #38302 is on all nodes before making the fields in RangeDescriptor non-nullable. Fixes #36983. Fixes #38471. Release note: None Co-authored-by: Jeffrey Xiao <[email protected]>
This comment has been minimized.
This comment has been minimized.
38465: roachpb: Change RangeDescriptor.StickyBit and ReplicaDescriptor.Type to nullable r=jeffrey-xiao a=jeffrey-xiao On a mixed version cluster, only having #38302 does not guarantee backwards compatibility for non-nullable fields in RangeDescriptor. Suppose the replicas of a range are of mixed version and the current leaseholder of the range is on newest version. If the leaseholder splits that range and writes to the non-nullable fields in RangeDescriptor and the leaseholder transfers to a node on an older version without #38302, then all CPuts would fail on that node. We must guarantee that #38302 is on all nodes before making the fields in RangeDescriptor non-nullable. Fixes #36983. Fixes #38471. Release note: None Co-authored-by: Jeffrey Xiao <[email protected]>
SHA: https://github.com/cockroachdb/cockroach/commits/537767ac9daa52b0026bb957d7010e3b88b61071 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364821&tab=buildLog
|
This comment has been minimized.
This comment has been minimized.
SHA: https://github.com/cockroachdb/cockroach/commits/86154ae6ae36e286883d8a6c9a4111966198201d Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1367379&tab=buildLog
|
This comment has been minimized.
This comment has been minimized.
38593: storage: sort InternalReplicas before comparing r=jeffrey-xiao a=jeffrey-xiao Fixes an issue where the in-memory copy of RangeDescriptor is out-of-sync with the copy in kv. Should resolve the failure in this #36983 (comment) The import step of `tpcc/mixed-headroom/n5cpu16` passes on `2811e0f` + #38465 + this PR. cc @asubiotto Co-authored-by: Jeffrey Xiao <[email protected]>
Fixes cockroachdb#34180. Fixes cockroachdb#35493. Fixes cockroachdb#36983. Fixes cockroachdb#37108. Fixes cockroachdb#37371. Fixes cockroachdb#37384. Fixes cockroachdb#37551. Fixes cockroachdb#37879. Fixes cockroachdb#38095. Fixes cockroachdb#38131. Fixes cockroachdb#38136. Fixes cockroachdb#38549. Fixes cockroachdb#38552. Fixes cockroachdb#38555. Fixes cockroachdb#38560. Fixes cockroachdb#38562. Fixes cockroachdb#38563. Fixes cockroachdb#38569. Fixes cockroachdb#38578. Fixes cockroachdb#38600. _A for of the early issues fixed by this had previous failures, but nothing very recent or actionable. I think it's worth closing them now that they should be fixed in the short term._ This fixes a bug introduced in 1ff3556 where Raft proposal quota is no longer released when Replica.propose fails. This used to happen [here](cockroachdb@1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316), but that code was accidentally lost in the rewrite. I tracked this down by running a series of `import/tpch/nodes=4` and `scrub/all-checks/tpcc/w=100` roachtests. About half the time, the import would stall after a few hours and the roachtest health reports would start logging lines like: `n1/s1 2.00 metrics requests.slow.latch`. I tracked the stalled latch acquisition to a stalled proposal quota acquisition by a conflicting command. The range debug page showed the following: <image> We see that the leaseholder of the Range has no pending commands but also no available proposal quota. This indicates a proposal quota leak, which led to me finding the lost release in this error case. The (now confirmed) theory for what went wrong in these roachtests is that they are performing imports, which generate a large number of AddSSTRequests. These requests are typically larger than the available proposal quota for a range, meaning that they request all of its available quota. The effect of this is that if even a single byte of quota is leaked, the entire range will seize up and stall when an AddSSTRequests is issued. Instrumentation revealed that a ChangeReplicas request with a quota size equal to the leaked amount was failing due to the error: ``` received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder) ``` Because of the missing error handling, this quota was not being released back into the pool, causing future requests to get stuck indefinitely waiting for leaked quota, stalling the entire import. Release note: None
Fixes cockroachdb#34180. Fixes cockroachdb#35493. Fixes cockroachdb#36983. Fixes cockroachdb#37108. Fixes cockroachdb#37371. Fixes cockroachdb#37384. Fixes cockroachdb#37551. Fixes cockroachdb#37879. Fixes cockroachdb#38095. Fixes cockroachdb#38131. Fixes cockroachdb#38136. Fixes cockroachdb#38549. Fixes cockroachdb#38552. Fixes cockroachdb#38555. Fixes cockroachdb#38560. Fixes cockroachdb#38562. Fixes cockroachdb#38563. Fixes cockroachdb#38569. Fixes cockroachdb#38578. Fixes cockroachdb#38600. _A lot of the early issues fixed by this had previous failures, but nothing very recent or actionable. I think it's worth closing them now that they should be fixed in the short term._ This fixes a bug introduced in 1ff3556 where Raft proposal quota is no longer released when Replica.propose fails. This used to happen [here](cockroachdb@1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316), but that code was accidentally lost in the rewrite. I tracked this down by running a series of `import/tpch/nodes=4` and `scrub/all-checks/tpcc/w=100` roachtests. About half the time, the import would stall after a few hours and the roachtest health reports would start logging lines like: `n1/s1 2.00 metrics requests.slow.latch`. I tracked the stalled latch acquisition to a stalled proposal quota acquisition by a conflicting command. The range debug page showed the following: <image> We see that the leaseholder of the Range has no pending commands but also no available proposal quota. This indicates a proposal quota leak, which led to me finding the lost release in this error case. The (now confirmed) theory for what went wrong in these roachtests is that they are performing imports, which generate a large number of AddSSTRequests. These requests are typically larger than the available proposal quota for a range, meaning that they request all of its available quota. The effect of this is that if even a single byte of quota is leaked, the entire range will seize up and stall when an AddSSTRequests is issued. Instrumentation revealed that a ChangeReplicas request with a quota size equal to the leaked amount was failing due to the error: ``` received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder) ``` Because of the missing error handling, this quota was not being released back into the pool, causing future requests to get stuck indefinitely waiting for leaked quota, stalling the entire import. Release note: None
38632: storage: release quota on failed Raft proposals r=tbg a=nvanbenschoten Fixes #34180. Fixes #35493. Fixes #36983. Fixes #37108. Fixes #37371. Fixes #37384. Fixes #37551. Fixes #37879. Fixes #38095. Fixes #38131. Fixes #38136. Fixes #38549. Fixes #38552. Fixes #38555. Fixes #38560. Fixes #38562. Fixes #38563. Fixes #38569. Fixes #38578. Fixes #38600. _A lot of the early issues fixed by this had previous failures, but nothing very recent or actionable. I think it's worth closing them now that they should be fixed in the short term._ This fixes a bug introduced in 1ff3556 where Raft proposal quota is no longer released when `Replica.propose` fails. This used to happen [here](1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316), but that code was accidentally lost in the rewrite. I tracked this down by running a series of `import/tpch/nodes=4` and `scrub/all-checks/tpcc/w=100` roachtests. About half the time, the import would stall after a few hours and the roachtest health reports would start logging lines like: `n1/s1 2.00 metrics requests.slow.latch`. I tracked the stalled latch acquisition to a stalled proposal quota acquisition by a conflicting command. The range debug page showed the following: ![Screenshot_2019-07-01 r56 Range Debug Cockroach Console](https://user-images.githubusercontent.com/5438456/60554197-8519c780-9d04-11e9-8cf5-6c46ffbcf820.png) We see that the Leaseholder of the Range has no pending commands but also no available proposal quota. This indicates a proposal quota leak, which led to me finding the lost release in this error case. The (now confirmed) theory for what went wrong in these roachtests is that they are performing imports, which generate a large number of AddSSTRequests. These requests are typically larger than the available proposal quota for a range, meaning that they request all of its available quota. The effect of this is that if even a single byte of quota is leaked, the entire range will seize up and stall when an AddSSTRequests is issued. Instrumentation revealed that a ChangeReplicas request with a quota size equal to the leaked amount was failing due to the error: ``` received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder) ``` Because of the missing error handling, this quota was not being released back into the pool, causing future requests to get stuck indefinitely waiting for leaked quota, stalling the entire import. Co-authored-by: Nathan VanBenschoten <[email protected]>
SHA: https://github.com/cockroachdb/cockroach/commits/2c865eeb3e3b244468ffc509a62778bd1f46740f Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1370685&tab=buildLog
|
It looks like that's because there was an 8-hour gap between detecting the crash and trying to grab any artifacts:
|
SHA: https://github.com/cockroachdb/cockroach/commits/df200cbf3f407dbf349aa601ff9036b4dff88e83
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1252822&tab=buildLog
The text was updated successfully, but these errors were encountered: