roachtest: tpcc/headroom failed during release qualification #97141
Comments
cc @cockroachdb/test-eng |
There are a few violations right next to each other.
|
Note that we just merged a change that parallelizes the execution of FK and UNIQUE constraint checks. There is a possibility that that change introduced a false positive (by reporting a violation that doesn't exist), although it seems unlikely. In particular, I'm a little worried about the fact that the change made it so that we use the RootTxn for the main mutation query while concurrently using LeafTxns for the read-only post-query FK checks. I believe this is the first time we're using multiple LeafTxns concurrently on the same node. |
cc @cockroachdb/replication |
That change is indeed in 51ed100, so it seems like a likely candidate. I'll kick off 10 runs to try a repro, and another before that change. |
First 10 runs passed, doing another 10. |
Got 2 failed runs on the second set of 10. Going to dig in after dinner, and kick off 20 runs without that PR. |
Ok, so in one of the failures we had four queries (on different worker threads) failing around the same time: INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info) VALUES
(3018,1,164,9,16315,164,3,57.060000,'uDNpeXv99YnTpFFIno09VeRX'), (3018,1,164,7,16476,164,5,479.900000,'FJR7pY1nRRsLWoQhCKMiiMai'),
(3018,1,164,2,20564,164,5,375.500000,'pY1nRRsLWoQhCKMiiMaivJB3'), (3018,1,164,4,45130,164,6,236.820000,'b5tj983VuDNpeXv99YnTpFFI'),
(3018,1,164,6,46028,164,6,431.160000,'FIno09VeRXedB2hFJR7pY1nR'), (3018,1,164,5,55888,164,2,183.320000,'YXyj7y5ZabrkWjb5tj983VuD'),
(3018,1,164,1,78362,164,5,386.000000,'FJR7pY1nRRsLWoQhCKMiiMai'), (3018,1,164,8,80284,164,7,7.070000,'hFJR7pY1nRRsLWoQhCKMiiMa'),
(3018,1,164,3,89690,164,3,173.640000,'R7pY1nRRsLWoQhCKMiiMaivJ');
INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info) VALUES
(3011,10,155,11,4528,155,1,4.990000,'djpo78MLwpf515JKgEE833Tk'), (3011,10,155,1,12956,155,2,163.620000,'78MLwpf515JKgEE833TkhoGv'),
(3011,10,155,3,15291,155,2,159.280000,'khoGvrlqNpF8Avrzpv6a2qvn'), (3011,10,155,2,22088,155,7,328.090000,'9pev7QzXOmlmDehlYdjpo78M'),
(3011,10,155,8,23628,155,1,82.660000,'3TkhoGvrlqNpF8Avrzpv6a2q'), (3011,10,155,5,32284,155,4,318.960000,'8MLwpf515JKgEE833TkhoGvr'),
(3011,10,155,10,35858,155,4,287.240000,'33TkhoGvrlqNpF8Avrzpv6a2'), (3011,10,155,6,39967,155,5,485.600000,'7QzXOmlmDehlYdjpo78MLwpf'),
(3011,10,155,12,53338,155,9,627.750000,'515JKgEE833TkhoGvrlqNpF8'), (3011,10,155,13,57436,155,1,62.290000,'AiFH5D9pev7QzXOmlmDehlYd'),
(3011,10,155,4,80172,155,3,186.510000,'qNpF8Avrzpv6a2qvntOzC0cs'), (3011,10,155,9,90162,155,9,244.980000,'5JKgEE833TkhoGvrlqNpF8Av'),
(3011,10,155,7,98228,155,6,468.600000,'MLwpf515JKgEE833TkhoGvrl');
INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info) VALUES
(3017,6,167,13,8284,167,6,133.200000,'4O5u71QNRaMByMkHFSpeaCg8'), (3017,6,167,7,24602,167,4,61.120000,'1ud46F10D7Km4O5u71QNRaMB'),
(3017,6,167,10,30283,167,9,706.140000,'10D7Km4O5u71QNRaMByMkHFS'), (3017,6,167,14,31228,167,5,82.450000,'snxgy9Mi8zsN6n1ud46F10D7'),
(3017,6,167,9,37381,167,4,383.720000,'i8zsN6n1ud46F10D7Km4O5u7'), (3017,6,167,12,40616,167,3,139.770000,'O5u71QNRaMByMkHFSpeaCg83'),
(3017,6,167,11,41043,167,2,53.420000,'xgy9Mi8zsN6n1ud46F10D7Km'), (3017,6,167,4,51284,167,2,55.960000,'NRaMByMkHFSpeaCg838uvUzB'),
(3017,6,167,3,61516,167,1,65.980000,'njVQU3Msnxgy9Mi8zsN6n1ud'), (3017,6,167,5,63490,167,10,12.800000,'Km4O5u71QNRaMByMkHFSpeaC'),
(3017,6,167,6,64976,167,2,97.900000,'Cg838uvUzBLjmGdvOYaCNRzu'), (3017,6,167,8,65628,167,4,89.800000,'n1ud46F10D7Km4O5u71QNRaM'),
(3017,6,167,2,72730,167,1,47.700000,'NRaMByMkHFSpeaCg838uvUzB'), (3017,6,167,1,96252,167,4,134.200000,'MkHFSpeaCg838uvUzBLjmGdv');
INSERT INTO order_line(ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info) VALUES
(3015,2,155,1,55888,155,8,733.280000,'LWoQhCKMiiMaivJB3MCD5cR8'), (3015,2,155,6,57428,155,8,373.040000,'Ino09VeRXedB2hFJR7pY1nRR'),
(3015,2,155,5,65521,155,4,230.440000,'iiMaivJB3MCD5cR8QiEfNvZT'), (3015,2,155,9,67674,155,5,277.550000,'pY1nRRsLWoQhCKMiiMaivJB3'),
(3015,2,155,4,69628,155,9,677.700000,'oQhCKMiiMaivJB3MCD5cR8Qi'), (3015,2,155,7,81947,155,2,51.460000,'RXedB2hFJR7pY1nRRsLWoQhC'),
(3015,2,155,8,89692,155,10,34.400000,'2hFJR7pY1nRRsLWoQhCKMiiM'), (3015,2,155,3,90203,155,2,92.740000,'2hFJR7pY1nRRsLWoQhCKMiiM'),
(3015,2,155,2,97354,155,9,344.700000,'Ino09VeRXedB2hFJR7pY1nRR');
The violated index is always
Notice how the indexed values are the same in all the rows for each statement. This is because this query is part of the cockroach/pkg/workload/tpcc/new_order.go Lines 399 to 424 in 1c66c95
The referenced order is created in the same transaction: cockroach/pkg/workload/tpcc/new_order.go Lines 365 to 371 in 1c66c95
There are a few possibilities here:
I'm going to see if the 20 runs without #96123 succeed. The clusters are running on grinaker-1676463254-02-n4cpu16 and grinaker-1676463254-04-n4cpu16 if anyone wants to poke, but I didn't build them with UI. Have artifacts locally. |
FWIW, I'm seeing these transaction aborts at the same time, which is possibly entirely expected and not significant. It's interesting that these are all on the same node though; there were no errors elsewhere around that time.
The logged errors were at:
|
How about this explanation: up until #96123 we always ran the FK checks serially and were using the RootTxn which is capable of transparent retries in some cases; with #96123 merged we now run the FK checks concurrently and are using the LeafTxn. If a retryable error is encountered by the LeafTxn, it's always returned to the client. Does this sound reasonable? |
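As a rough illustration of why the kind of error returned to the client matters here, the sketch below shows a client-side loop that retries only on SQLSTATE 40001 (the code CockroachDB uses for retryable transaction errors) and gives up immediately on anything else, such as a foreign key violation (23503). The `sqlStater` interface, `runWithRetry`, and `fakeErr` are hypothetical names introduced for this example; they are not taken from the tpcc workload or from any particular driver.

```go
package main

import (
	"errors"
	"fmt"
)

// sqlStater is a hypothetical interface for driver errors that expose a
// SQLSTATE code. It is an assumption made for this sketch.
type sqlStater interface {
	error
	SQLState() string
}

const (
	serializationFailure = "40001" // retryable transaction error
	foreignKeyViolation  = "23503" // constraint error, not retryable
)

// isRetryable reports whether err carries the retryable SQLSTATE.
func isRetryable(err error) bool {
	var se sqlStater
	return errors.As(err, &se) && se.SQLState() == serializationFailure
}

// runWithRetry calls fn until it succeeds, returns a non-retryable error,
// or maxAttempts is reached. It returns the number of attempts made.
func runWithRetry(maxAttempts int, fn func() error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = fn()
		if err == nil || !isRetryable(err) {
			return attempt, err
		}
	}
	return maxAttempts, err
}

// fakeErr is a test double standing in for a driver error.
type fakeErr struct{ code string }

func (e fakeErr) Error() string    { return "sqlstate " + e.code }
func (e fakeErr) SQLState() string { return e.code }

func main() {
	calls := 0
	attempts, err := runWithRetry(5, func() error {
		calls++
		if calls < 3 {
			return fakeErr{code: serializationFailure}
		}
		return nil
	})
	fmt.Println("retryable case:", attempts, err)

	attempts, err = runWithRetry(5, func() error {
		return fakeErr{code: foreignKeyViolation}
	})
	fmt.Println("FK violation case:", attempts, err)
}
```

Under this model, a spurious foreign key violation is treated as a hard failure on the first attempt, whereas a retry error would simply have caused another attempt.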
Hoping to shake out something relevant for cockroachdb#97102 or cockroachdb#97141. Not intended for merge. Can maybe turn this into a metamorphic var down the road. For now running manually via `./experiment.sh`. To date, it hasn't produced anything. For the first runs, I ~immediately got a closed timestamp regression. This was puzzling, but then also disappeared, so I think my OSX clock might have been adjusting over those few minutes. Epic: none Release note: None
Another 20/20 passes on b4cfb27, I'm calling it.
I'm not really familiar with the finer points of leaf txns, so don't know how much help I can be. Assuming the leaves got There's this: cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go Lines 883 to 889 in 3e1cdb0
And also this: Lines 40 to 45 in 3e1cdb0
|
I think we have a good understanding of what is going on here. We reproduced this with tracing turned on (attaching a trace of an offending query that violates an FK check below). @nvanbenschoten (thanks for the help!) and I looked at this together. We see that this transaction was aborted -- it's detected both by the root transaction's heartbeater:
And while constructing one of the leaf transactions to perform the FK checks:
When we detect that the transaction is aborted, we try to handle this retryable error here: Lines 1268 to 1275 in 736a67e
When a transaction is aborted, the By the time we come around here for the second time, to create a leaf transaction for the second FK check, we end up creating one for the new transaction. We see that further down in the trace:
Note the ID here in the
Eventually, when the
The issue here seems to be that we're trying to handle the retryable error when setting up flows, instead of bubbling it all the way up to the conn executor. @nvanbenschoten mentioned this likely looks like a vestige of a time before we cleaned these things up. We think the following diff should help fix this issue:
I'll kick off some roachtest runs to verify. |
Ran with the diff above, and all runs passed -- this seems to be it. I'll send out a patch. |
Nice find! Do you think this change will be safe to backport (after some baking period on master)? |
I think it should be safe to backport, but like you said, let's let it bake for a bit. |
Removing the call to |
@andreimatei as of #74563, the executor will be the one taking action in response to a txn abort. The change proposed above will eliminate all calls to |
Checks out! |
Previously, if we detected that the transaction was aborted when trying to construct leaf transaction state, we would handle the retry error instead of bubbling it up to the caller. When a transaction is aborted, the `TransactionRetryWithProtoRefreshError` carries with it a new transaction that should be used for subsequent attempts. Handling the retry error entailed swapping out the old `TxnCoordSender` with a new one -- one that is associated with this new transaction.

This is bug-prone when trying to create multiple leaf transactions in parallel if the root has been aborted. We would expect the first leaf transaction to handle the error and all subsequent leaf transactions to point to the new transaction, as the `TxnCoordSender` has been swapped out. This wasn't an issue before as we never really created multiple leaf transactions in parallel. This recently changed in 0f4b431, which started parallelizing FK and uniqueness checks. With this change, we could see FK or uniqueness violations when in fact the transaction needed to be retried.

This patch fixes the issue described above by not handling the retry error when creating leaf transactions. Instead, we expect the ConnExecutor to retry the entire transaction and prepare it for another iteration.

Fixes cockroachdb#97141
Epic: none
Release note: None
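The failure mode described in this commit message can be sketched with a toy model. This is not CockroachDB code: `txn`, `handle`, `leafSwapping`, `leafBubbling`, and `runChecks` are hypothetical names invented for illustration, and the real `TxnCoordSender` is far more involved. The sketch only assumes what the message states: the old path swapped in a new transaction while returning the retry error, and the fixed path returns the error without touching the handle.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errRetry stands in for a retryable "transaction aborted" error.
var errRetry = errors.New("retry: transaction aborted")

// txn is a toy transaction record.
type txn struct {
	id      int
	aborted bool
}

// handle is a toy stand-in for a root transaction handle whose
// underlying transaction can be swapped out.
type handle struct {
	mu  sync.Mutex
	cur *txn
}

// leafSwapping models the old behavior: on an aborted transaction it
// installs a fresh transaction in the handle before returning the error.
func (h *handle) leafSwapping() (int, error) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.cur.aborted {
		h.cur = &txn{id: h.cur.id + 1}
		return 0, errRetry
	}
	return h.cur.id, nil
}

// leafBubbling models the fixed behavior: the error is returned and the
// handle is left untouched, so every caller sees the abort.
func (h *handle) leafBubbling() (int, error) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.cur.aborted {
		return 0, errRetry
	}
	return h.cur.id, nil
}

// runChecks creates n leaves concurrently and reports how many creations
// failed and which transaction IDs the successful leaves are bound to.
func runChecks(n int, newLeaf func() (int, error)) (failed int, ids []int) {
	var wg sync.WaitGroup
	var mu sync.Mutex
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			id, err := newLeaf()
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				failed++
				return
			}
			ids = append(ids, id)
		}()
	}
	wg.Wait()
	return failed, ids
}

func main() {
	old := &handle{cur: &txn{id: 1, aborted: true}}
	failed, ids := runChecks(3, old.leafSwapping)
	fmt.Println("swapping:", failed, "failed; leaves bound to txns", ids)

	fixed := &handle{cur: &txn{id: 1, aborted: true}}
	failed, ids = runChecks(3, fixed.leafBubbling)
	fmt.Println("bubbling:", failed, "failed; leaves bound to txns", ids)
}
```

In the swapping model, only one leaf creation fails and the remaining leaves are silently bound to the replacement transaction, which has not performed the original transaction's writes; a check run there can report a missing parent row. In the bubbling model, every leaf creation fails with the retry error, so the whole transaction is retried instead.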
96811: loqrecovery: support mixed version recovery r=erikgrinaker a=aliher1911

This commit adds mixed version support for half-online loss of quorum recovery service and cli tools. This change would allow user to use loq recovery in partially upgraded clusters by tracking version that generated data and produce recovery plans which will have identical version so that versions could be verified on all steps of recovery. General rule is you can use data from the cluster that is not newer than a binary version to avoid new information being dropped. This rule applies to planning process where planner should understand replica info and also to cockroach node that applies the plan, which should be created by equal or lower version. Additional restriction is on planner to preserve version in the plan and don't use any new features if processed info is older than the binary version. This is no different on what version gates do in cockroach.

Release note: None
Fixes #95344

98707: keyvisualizer: pre-aggregate ranges r=zachlite a=zachlite

Previously, there was no bound on the number of ranges that could be propagated to the collectors. After collection, data was downsampled using a simple heuristic to decide if a bucket was worth keeping or if it should be aggregated with its neighbor. In this commit, I've introduced a function, `maybeAggregateBoundaries`, to prevent more than `keyvisualizer.max_buckets` from being propagated to collectors. This pre-aggregation takes the place of the post-collection downsampling. For the first stable release of the key visualizer, I am intentionally sacrificing dynamic resolution and prioritizing boundary stability instead. This trade-off means that the key visualizer will demand less network, memory, and storage resources from the cluster while operating. Additionally, this PR drops the sample retention time from 14 days to 7 days, and ensures that `keyvisualizer.max_buckets` is bounded between [1, 1024].

Resolves: #96740
Epic: None
Release note: None

98713: sql,kv: bubble up retry errors when creating leaf transactions r=arulajmani a=arulajmani

Previously, if we detected that the transaction was aborted when trying to construct leaf transaction state, we would handle the retry error instead of bubbling it up to the caller. When a transaction is aborted, the `TransactionRetryWithProtoRefreshError` carries with it a new transaction that should be used for subsequent attempts. Handling the retry error entailed swapping out the old `TxnCoordSender` with a new one -- one that is associated with this new transaction. This is bug-prone when trying to create multiple leaf transactions in parallel if the root has been aborted. We would expect the first leaf transaction to handle the error and all subsequent leaf transactions to point to the new transaction, as the `TxnCoordSender` has been swapped out. This wasn't an issue before as we never really created multiple leaf transactions in parallel. This recently changed in 0f4b431, which started parallelizing FK and uniqueness checks. With this change, we could see FK or uniqueness violations when in fact the transaction needed to be retried. This patch fixes the issue described above by not handling the retry error when creating leaf transactions. Instead, we expect the ConnExecutor to retry the entire transaction and prepare it for another iteration.

Fixes #97141
Epic: none
Release note: None

98732: cloud/gcp_test: add weird code 0/ok error to regex r=dt a=dt

Still unsure why we sometimes see this instead of the other more informative errors but in the meantime, make the test pass.

Release note: none.
Epic: none.

Co-authored-by: Oleg Afanasyev <[email protected]>
Co-authored-by: zachlite <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: David Taylor <[email protected]>
master on
refs/tags/v23.1.0-alpha.2-263-g51ed100a46 (51ed100)
During release qualification for the 23.1.0-alpha.3 release, the tpcc/headroom roachtest failed because the tpcc workload got a foreign key violation error:
A subsequent run of that test passed, so this failure is non-deterministic. However, the workload shouldn't be seeing these types of errors.
Build: https://teamcity.cockroachdb.com/buildConfiguration/Internal_Release_Process_RoachtestReleaseQualificationV222/8702283?buildTab=overview&showRootCauses=true&expandBuildProblemsSection=true&expandBuildTestsSection=true&expandBuildChangesSection=true#%2F%25Y%25m%25d-8702283;%2Ftpcc%2Fheadroom%2Fn4cpu16%2Frun_1%2Fartifacts.zip
Jira issue: CRDB-24539