-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv/kvnemesis: TestKVNemesisMultiNode failed #101721
Comments
This is reminiscent of #93896, where we saw |
kv/kvnemesis.TestKVNemesisMultiNode failed with artifacts on master @ a4ab6b6243f509e62993b4d81855f23470d9901a:
Parameters: Same failure on other branches
|
Both of these failures are with |
cc @cockroachdb/replication |
kv/kvnemesis.TestKVNemesisMultiNode failed with artifacts on master @ 10665f9856d9c7ca51557bdf5b48fe9df796228b:
Parameters: Same failure on other branches
|
kv/kvnemesis.TestKVNemesisMultiNode failed with artifacts on master @ 1f3419e178bdba544f74d9c9e14a4682efd18028:
Parameters: Same failure on other branches
|
kv/kvnemesis.TestKVNemesisMultiNode failed with artifacts on master @ 1f3419e178bdba544f74d9c9e14a4682efd18028:
Parameters: Same failure on other branches
|
kv/kvnemesis.TestKVNemesisMultiNode failed with artifacts on master @ 9fb03ee9b365d656cc26728cc87f682a385d511b:
Parameters: Same failure on other branches
|
kv/kvnemesis.TestKVNemesisMultiNode failed with artifacts on master @ 9e0b06bf61700d5192f171c6ebaec7e8a10fa6e6:
Parameters: Same failure on other branches
|
kv/kvnemesis.TestKVNemesisMultiNode failed with artifacts on master @ 69875da00208dd792cb92aee54811e8a660773a9:
Parameters: Same failure on other branches
|
kv/kvnemesis.TestKVNemesisMultiNode failed with artifacts on master @ 69875da00208dd792cb92aee54811e8a660773a9:
Parameters: Same failure on other branches
|
My current hypothesis is that the discrepancy comes from the fact that lease extensions doesn't obtain latches and could could change lease and mvcc stats in parallel to other requests. The only requests that touches lease are likely range split and range merge which are done as a commit trigger when executing EndTxnRequest. There was a fix for merge op already improving that and I think current failure is happening during splits. Current repro of the failure shows discrepancy of 6 bytes on SystemBytes stat. This always happen when there's a lease extension happening in parallel to split. This change itself is mostly harmless as size of lease value shouldn't change, but it changes if we have a local component to the timestamp. Leases that have local component on commit and expiration timestamps are 6 bytes larger than without and that matches the size difference. Moreover failure always happens when lease extension happening close to the split changes from using local to local == 0. As an experiment, resetting local component to 0 makes failure go away. We can't use it as a fix as it is not correct and could cause problems. Current understanding of the cause: cockroach/pkg/kv/kvserver/batcheval/split_stats_helper.go Lines 148 to 176 in b34c137
h.in.AbsPreSplitBothEstimated contains stats from replica obtained by:cockroach/pkg/kv/kvserver/replica.go Lines 1302 to 1309 in b34c137
which is an in-memory most recent state of stats. We then compute(scan) LHS stats of snapshot and we end up with full stats that are inconsistent with LHS stats. We then compute RHS using full and LHS which refer different points in time that can have lease extension inbetween. That would produce skewed RHS. We then use absPostSplitLeft incockroach/pkg/kv/kvserver/batcheval/split_stats_helper.go Lines 212 to 220 in b34c137
to produce final stats delta that we return as a result of execution. This would contain delta between split time snapshot and newer version of stats from replica. |
Previouly split trigger was using in-memory copy of mvccstats from replica. This value could diverge from store snapshot used by request in special cases of lease extensions that don't hold latches. This was causing stats for the replica to diverge. This PR changes base stats to be read from request reader which provides consistent stats value with content of splitted ranges. Touches: cockroachdb#101721 Release note: None
@nvanbenschoten can you look at #101721 (comment) if it looks familiar in any way? This issue has bunch of different failures. Top level ones come are for mvcc stats, but transaction commit might be connected with changes you merged recently. |
Previouly split trigger was using in-memory copy of mvccstats from replica. This value could diverge from store snapshot used by request in special cases of lease extensions that don't hold latches. This was causing stats for the replica to diverge. This PR changes base stats to be read from request reader which provides consistent stats value with content of splitted ranges. Touches: cockroachdb#101721 Release note: None
Previouly split trigger was using in-memory copy of mvccstats from replica. This value could diverge from store snapshot used by request in special cases of lease extensions that don't hold latches. This was causing stats for the replica to diverge. This PR changes base stats to be read from request reader which provides consistent stats value with content of splitted ranges. Touches: cockroachdb#101721 Release note: None
103719: kvserver/batcheval: fix mvcc stats computaion on split r=nvanbenschoten a=aliher1911 Previouly split trigger was using in-memory copy of mvccstats from replica. This value could diverge from store snapshot used by request in special cases of lease extensions that don't hold latches. This was causing stats for the replica to diverge. This PR changes base stats to be read from request reader which provides consistent stats value with content of splitted ranges. Touches: #101721 Fixes: #102410 Release note: None Co-authored-by: Oleg Afanasyev <[email protected]>
We havn't seen many failures after stats were fixed. I managed to repro transaction push timeout with 50k reruns while testing stats:
|
I'll put issue back into backlog. |
kv/kvnemesis.TestKVNemesisMultiNode failed with artifacts on master @ e57e9742146b7fac7f7530fa6224e11121ada92f: Fatal error:
Stack:
Log preceding fatal error
Parameters: |
The last span use after finish might be the same issue as #107242. |
…iority DNM: needs more testing. Fixes cockroachdb#105330. Fixes cockroachdb#101721. This commit updates the transaction push logic to stop pushes from completing successfully when the pushee is STAGING and the pusher has equal priority. Prior to this change, the pusher would win in these cases when using a PUSH_TIMESTAMP if at least one of the two transactions involved used a weak isolation level. This had two undesirable effects: - if the pushee was still possibly committable and requiring recovery (`!(knownHigherTimestamp || knownHigherEpoch)` in the code) then the pusher would immediately begin parallel commit recovery, attempting to disrupt the commit and abort the pushee. This is undesirable because the pushee may still be in progress and in cases of equal priority, we'd like to wait for the parallel commit to complete before kicking off recovery, deferring recovery to only the cases where the pushee/committers's heartbeat has expired. - if the pushee was known too be uncommittable (`knownHigherTimestamp || knownHigherEpoch` in the code), then txn recovery was not kicked off but the pusher still could not perform the PUSH_TIMESTAMP (see e40c1b4), so it would return a `TransactionPushError`. This confused logic in `handleTransactionPushError`, allowing the error to escape to the client. This commit resolves both issues by considering the pushee's transaction status in `txnwait.ShouldPushImmediately`. Release note: None
Fixes cockroachdb#105330. Fixes cockroachdb#101721. This commit updates the transaction push logic to stop pushes from completing successfully when the pushee is STAGING and the pusher has equal priority. Prior to this change, the pusher would win in these cases when using a PUSH_TIMESTAMP if at least one of the two transactions involved used a weak isolation level. This had two undesirable effects: - if the pushee was still possibly committable and requiring recovery (`!(knownHigherTimestamp || knownHigherEpoch)` in the code) then the pusher would immediately begin parallel commit recovery, attempting to disrupt the commit and abort the pushee. This is undesirable because the pushee may still be in progress and in cases of equal priority, we'd like to wait for the parallel commit to complete before kicking off recovery, deferring recovery to only the cases where the pushee/committers's heartbeat has expired. - if the pushee was known too be uncommittable (`knownHigherTimestamp || knownHigherEpoch` in the code), then txn recovery was not kicked off but the pusher still could not perform the PUSH_TIMESTAMP (see e40c1b4), so it would return a `TransactionPushError`. This confused logic in `handleTransactionPushError`, allowing the error to escape to the client. This commit resolves both issues by considering the pushee's transaction status in `txnwait.ShouldPushImmediately`. Release note: None
107882: kv: don't let pusher win when pushing STAGING txn with equal priority r=miraradeva a=nvanbenschoten Fixes #105330. Fixes #101721. This commit updates the transaction push logic to stop pushes from completing successfully when the pushee is STAGING and the pusher has equal priority. Prior to this change, the pusher would win in these cases when using a PUSH_TIMESTAMP if at least one of the two transactions involved used a weak isolation level. This had two undesirable effects: - if the pushee was still possibly committable and requiring recovery (`!(knownHigherTimestamp || knownHigherEpoch)` in the code) then the pusher would immediately begin parallel commit recovery, attempting to disrupt the commit and abort the pushee. This is undesirable because the pushee may still be in progress and in cases of equal priority, we'd like to wait for the parallel commit to complete before kicking off recovery, deferring recovery to only the cases where the pushee/committers's heartbeat has expired. - if the pushee was known to be uncommittable (`knownHigherTimestamp || knownHigherEpoch` in the code), then txn recovery was not kicked off but the pusher still could not perform the PUSH_TIMESTAMP (see e40c1b4), so it would return a `TransactionPushError`. This confused logic in `handleTransactionPushError`, allowing the error to escape to the client. This commit resolves both issues by considering the pushee's transaction status in `txnwait.ShouldPushImmediately`. Release note: None 107962: docs: do not generate docs for crdb_internal functions r=knz a=rafiss Epic: None Release note: None 108062: server: Add rangefeed metrics r=miretskiy a=miretskiy Add rangefeed related metrics to keep track of the number of actively running rangefeeds on the server. Epic: CRDB-26372 Release note: None 108071: stmtdiagnostics: use background context when building the bundle r=yuzefovich a=yuzefovich When the context is canceled, we still want to build the bundle as best as possible. Over in 532274b we introduced the usage of the background context in order to insert the bundle into the system tables, but we still built the bundle with the canceled context. This commit fixes that oversight - in particular, we should now get `env.sql` correctly. Informs: https://github.com/cockroachlabs/support/issues/2494. Epic: None Release note: None 108085: builtins: improve docs and refactor code for unordered_unique_id r=rafiss,rharding6373 a=andyyang890 **builtins: improve documentation for unordered_unique_rowid** This patch clarifies that `unordered_unique_rowid` generates unique IDs that are statistically likely to be unordered because it bit-reverses the insert timestamp for use as the ID's timestamp bit segment. Release note: None ---- **builtinconstants: define constants for unique int bit segments** This patch defines constants for the sizes and bitmasks for each bit segment in a unique int generated for row IDs. Release note: None ---- **builtins: refactor bit shifting logic in mapToUnorderedUniqueInt** This patch refactors the bit shifting logic in `mapToUnorderedUniqueInt` to use constants instead of magic numbers. Release note: None ---- Epic: None Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Yevgeniy Miretskiy <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Andy Yang <[email protected]>
kv/kvnemesis.TestKVNemesisMultiNode failed with artifacts on master @ 1cd507a7c6582fd91bb56123e62bd4fde45c4d22:
Parameters:
TAGS=bazel,gss,deadlock
Help
See also: How To Investigate a Go Test Failure (internal)
Same failure on other branches
This test on roachdash | Improve this report!
Jira issue: CRDB-27104
Epic CRDB-27234
The text was updated successfully, but these errors were encountered: