Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

txn: persist fair lock type in lock information and handle stale fair lock resolve #14692

Merged
merged 9 commits into from
May 29, 2023

Conversation

cfzjywxk
Copy link
Collaborator

@cfzjywxk cfzjywxk commented May 6, 2023

What is changed and how it works?

Issue Number: Ref #13298 pingcap/tidb#43540

What's Changed:

  1. Persist the lock with conflict information in the pessimistic lock type.
  2. Check the persisted transaction information first resolving the primary pessimistic lock, if the lock is stale return
    the actual transaction status and unlock the stale lock.
  3. Check the persisted transaction information first checking secondary locks, if the lock is stale return
    the actual transaction status and unlock the stale lock.

Related changes

  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test

Side effects

  • Performance regression
    • Consumes more CPU if there are lots of conflict pessimistic lock to resolve.

Release note

Fix the incorrect transaction status returned processing stale pessimistic lock with conflict.

@cfzjywxk cfzjywxk added sig/transaction SIG: Transaction type/bugfix This PR fixes a bug. labels May 6, 2023
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented May 6, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • tonyxuqqi
  • you06

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 6, 2023
@cfzjywxk cfzjywxk marked this pull request as ready for review May 6, 2023 02:28
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 6, 2023
@cfzjywxk
Copy link
Collaborator Author

cfzjywxk commented May 6, 2023

/test

@@ -20,13 +20,14 @@ pub enum LockType {
Put,
Delete,
Lock,
Pessimistic,
Pessimistic(bool),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to comment the field.

Ok((TxnStatus::TtlExpire, released))
};
if lock.is_pessimistic_lock() {
let (status, released) = check_txn_status_from_pessimistic_primary_lock(
Copy link
Contributor

@ekexium ekexium May 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can only guarantee it is primary if verify_is_primary is true. What should we do if it's false?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, In my opinion, for the newer version, the kv client should always set it to true as indicated in https://github.com/tikv/tikv/blob/master/src/storage/txn/commands/check_txn_status.rs#L57. For older versions, resolving the primary-key switching issue was not possible.

Another potential solution is to never place a rollback record on the pessimistic lock primary key. This would prevent breaking transaction status. However, if during prewrite phase, the prewrite request on the primary key is lost and remains a pessimistic lock while secondary locks are resolved as prewrite ones, it could neither be rolled back nor committed since we cannot determine its transaction status.

/cc @MyonKeminta

Copy link
Contributor

@MyonKeminta MyonKeminta May 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, my original idea is to check write cf to see if there is any record of this transaction (get_txn_commit_record) if check_txn_status finds the primary has a pessimistic lock, and it's marked to be locked with conflict.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed while considering about this, that there exists some more problem. When we call check_txn_status on a primary (maybe due to encountering an optimistic lock), the primary may be locked by a stale pessimistic lock request whose primary points to another key. For this case, it seems that it can be covered by performing the logic that checking locked-with-conflict flag and checking txn status from write cf before verifying primary at line 115.

However, I'm afraid there can still be way more complicated corner cases starting from this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's much more complicated than before because we have to make sure if the current primary key and lock on the key are valid, the transaction status should be determined only if the information are valid.

Comment on lines 38 to 39
TxnStatus::RolledBack => (TxnStatus::RolledBack, true),
TxnStatus::Committed { commit_ts } => (TxnStatus::committed(commit_ts), true),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TxnStatus::RolledBack => (TxnStatus::RolledBack, true),
TxnStatus::Committed { commit_ts } => (TxnStatus::committed(commit_ts), true),
t @ TxnStatus::RolledBack | t @ TxnStatus::Committed{..} => (t, true),

}
},
Err(err) => match err {
// Continue to process if there's no correspond persistent information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Continue to process if there's no correspond persistent information.
// Continue to process if there's no corresponding persistent information.

@cfzjywxk cfzjywxk force-pushed the persist_fair_lock_type branch from 8b34647 to 21244cb Compare May 6, 2023 14:46
@ti-chi-bot ti-chi-bot bot added the status/LGT1 Indicates that a PR has LGTM 1. label May 10, 2023
Copy link
Contributor

@you06 you06 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation LGTM, but I'm considering can we reject a pessimistic acquiring request when the recent write record's start_ts is equal to the txn's start_ts, which means the transaction is already committed. If so, the format might not be changed, I'm not clear enough with rejecting pessimistic lock way, need more thinking.

@ti-chi-bot ti-chi-bot bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels May 10, 2023
@cfzjywxk
Copy link
Collaborator Author

cfzjywxk commented May 11, 2023

@you06
Checking if the transction commit status is ok theoratically, the only problem is the extra cost to get the committed key with certain "start_ts". Actually, for prewrite secondary index key even the write conflict check is skipped because of the cost and it brings us many corner cases.

@you06
Copy link
Contributor

you06 commented May 12, 2023

Another thought is can we accept the conflict pessimistic lock from resumed requests only? So that we can reject the stale pessimistic lock because it's not in TiKV's waiting queue.

@cfzjywxk
Copy link
Collaborator Author

Another thought is can we accept the conflict pessimistic lock from resumed requests only? So that we can reject the stale pessimistic lock because it's not in TiKV's waiting queue.

@you06 I dont't quite get it, do you mean we check if it is a force lock by checking its existency in the waitting queue? Could you explain more about it in details?

@you06
Copy link
Contributor

you06 commented May 12, 2023

@cfzjywxk If the a request acquired a lock successfully without awaking(from the fair lock queue), it's for_update_ts should be greater than the exist commit_ts, on the other word, if for_update_ts < commit_ts and the lock is granted, the request is awaken by AcquirePessimisticLockResumed. The stale pessimistic is called by AcquirePessimisticLock, which should not be able to acquire pessimistic locks because for_update_ts < commit_ts.

Copy link
Contributor

@MyonKeminta MyonKeminta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

..., but I'm considering can we reject a pessimistic acquiring request when the recent write record's start_ts is equal to the txn's start_ts, ...

Apart from the extra performance cost, also note that if the transaction is committed, it might not always be the most recent write record since it's theoretically possible (though the probability is very low) that another transaction writes this key before the stale request arrives.

Another thought is can we accept the conflict pessimistic lock from resumed requests only? So that we can reject the stale pessimistic lock because it's not in TiKV's waiting queue.

Allowing locking with conflict no matter if it's resumed is an important idea in my opinion. When a new pessimistic lock request arrives, the previous transaction might be either ongoing or finished. My original expectation is to reduce the extra latency from statement-retrying in both these cases. But since in our high-concurrency single-row conflict scenarios the row is continuously locked, disallowing locking with conflict for non-resumed request might not cause any problem.
However a more important problem is that, when a stale pessimistic lock request arrives, it's possible that the key is locked by another transaction, which will put the stale request to waiting status. Then when it's resumed, it will still acquire the lock that we don't want.

}

const FLAG_PUT: u8 = b'P';
const FLAG_DELETE: u8 = b'D';
const FLAG_LOCK: u8 = b'L';
const FLAG_PESSIMISTIC: u8 = b'S';
const FLAG_PESSIMISTIC_WTIH_CONFLICT: u8 = b'F';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it needs carefully handling the compatibility. It shouldn't be written unless all components in the cluster were upgraded to a version that supports this new lock type. I think more simple approach to do that is to add a flag to indicate whether the lock is acquired with conflict. The flag can also be derived to the optimistic lock after prewritting, so it's possible to know if it's locked with conflict after prewriting if we need in the future. Then the only compatibility issue we need to concern about is to inform the TiFlash team to recognize this new flag.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think more simple approach to do that is to add a flag to indicate whether the lock is acquired with conflict

Do you mean adding the flag somewhere else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/WTIH/WITH

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think more simple approach to do that is to add a flag to indicate whether the lock is acquired with conflict

Do you mean adding the flag somewhere else?

Yes. Add an extra flag instead of a new lock type. It would be much easier to handle compatibility stuff, I think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MyonKeminta
It's updated, PTAL.

@cfzjywxk cfzjywxk force-pushed the persist_fair_lock_type branch from c09f3f0 to edab787 Compare May 23, 2023 12:01
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 23, 2023
@cfzjywxk cfzjywxk requested review from MyonKeminta and you06 May 23, 2023 12:10
"current_ts" => current_ts,
"resolving_pessimistic_lock" => ?resolving_pessimistic_lock,
);
let released = txn.unlock_key(primary_key, true, TimeStamp::zero());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second argument commit_ts is used as the conflicting_commit_ts information for pessimistic lock requests that are waiting for the lock (if any). But in this case it doesn't make sense by passing either zero or the commit ts of the record we found, as there's possibly a new write record on the key with large commit ts. Perhaps we'd better consider adding an unknown state for it to make the semantics more clear. (nothing to do with this PR though, just to mention)

Copy link
Collaborator Author

@cfzjywxk cfzjywxk May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's needed to do some refactor about this interface and the commit_ts parameter.

Comment on lines 81 to 64
return if resolving_pessimistic_lock {
let released = txn.unlock_key(primary_key, true, TimeStamp::zero());
MVCC_CHECK_TXN_STATUS_COUNTER_VEC.pessimistic_rollback.inc();
Ok((TxnStatus::PessimisticRollBack, released))
} else {
let released = rollback_lock(txn, reader, primary_key, lock, true, true)?;
MVCC_CHECK_TXN_STATUS_COUNTER_VEC.rollback.inc();
Ok((TxnStatus::TtlExpire, released))
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me some time to realize this is actually correct...

(By the way I think it's better to comment at line 76 about what kinds of cases may reach this piece of code)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments are added above.

reader,
primary_key.clone(),
None,
MissingLockAction::ReturnError,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we try to assert that check_txn_status_missing_lock called here won't perform any write operation?
It seems it won't write anything (to the txn) currently. I'm kind of afraid that if in the future we mistakenly writes the samething (the same physical key) more than once in a single MvccTxn object, the final result might be undefined

/// 'start_ts'.
///
/// 1. Validate whether the existing lock indeed corresponds to the
/// primary lock. The primary key may switch under certain circumstances. If
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// primary lock. The primary key may switch under certain circumstances. If
/// primary lock. The primary key may switch under certain circumstances. If

///
/// 1. Validate whether the existing lock indeed corresponds to the
/// primary lock. The primary key may switch under certain circumstances. If
/// it's a stale lock, the transaction status should not be determined by it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// it's a stale lock, the transaction status should not be determined by it.
/// it's a stale lock, the transaction status should not be determined by it.

/// 1. Validate whether the existing lock indeed corresponds to the
/// primary lock. The primary key may switch under certain circumstances. If
/// it's a stale lock, the transaction status should not be determined by it.
/// Refer to https://github.com/tikv/tikv/issues/14636 for additional information.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Refer to https://github.com/tikv/tikv/issues/14636 for additional information.
/// Refer to https://github.com/pingcap/tidb/issues/42937 for additional information.

#14636 doesn't say anything in detail. I think linking to the corresponding issue in TiDB repo is better.

)?;
// Return if the primary lock is stale or the transaction status is decided.
if (status.is_decided() || status == TxnStatus::PessimisticRollBack) && released.is_some() {
return Ok((status, released));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we assert released is empty if we are not returning here? We should be careful not to write the same thing to MvccTxn twice or put the same thing to ReleasedLocks twice.

Comment on lines +1293 to +1294
// 1. Test resolve the stale pessimistic primary lock. Note the force lock
// could succeed only if there's no corresponding rollback record.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When there is a rollback record, it seems it can still succeed if there is a newer record written by another transaction. This case is better to be covered too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check https://github.com/tikv/tikv/blob/master/src/storage/txn/actions/acquire_pessimistic_lock.rs#L282-L297 seems to prevent lock acquiring when there's a corresponding rollback record.

@cfzjywxk cfzjywxk force-pushed the persist_fair_lock_type branch from be67276 to 8d69553 Compare May 25, 2023 13:13
@cfzjywxk cfzjywxk requested a review from MyonKeminta May 25, 2023 13:18
Copy link
Contributor

@MyonKeminta MyonKeminta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

// rollback record in the write CF, if so the current primary
// pessimistic lock is stale. Otherwise the primary pessimistic lock is
// regarded as valid, and the transaction status is determined by it.
let (txn_status, is_status_decided) = check_txn_status_from_storage(reader, &primary_key)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both locks and writes comes from the storage. Perhaps we can call it check_determined_txn_status and return Option<TxnStatus> which is None if the record isn't found in write cf. The similar to check_secondary_locks

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both are refactored, PTAL.

@@ -54,6 +54,78 @@ enum SecondaryLockStatus {
RolledBack,
}

// The returned `bool` indicates whether the rollback record should be written,
// it should be false if and only if the txn commit record is not found.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be false if ...

Did you mean it should be true ?


if lock.is_pessimistic_lock() {
let released_lock = txn.unlock_key(key.clone(), true, TimeStamp::zero());
let overlapped_write = reader.get_txn_commit_record(key)?.unwrap_none(region_id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_txn_commit_record might be called twice for a force-locked pessimstic lock. I think in this case it should be possible to reuse the record info already got at line 101. It doesn't seem to matter much though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored, PTAL.

@cfzjywxk cfzjywxk force-pushed the persist_fair_lock_type branch from 8d69553 to 569e864 Compare May 26, 2023 10:44
@cfzjywxk cfzjywxk requested a review from MyonKeminta May 26, 2023 10:46
Copy link
Contributor

@MyonKeminta MyonKeminta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

if lock.is_pessimistic_lock() {
let released_lock = txn.unlock_key(key.clone(), true, TimeStamp::zero());
let overlapped_write_res = if lock.is_pessimistic_lock_with_conflict() {
overlapped_write
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider comment about that overlapped_write is already loaded if the above lock_with_conflcit related code is executed so that we can use it directly.

cfzjywxk added 9 commits May 29, 2023 14:03
Signed-off-by: cfzjywxk <[email protected]>
Signed-off-by: cfzjywxk <[email protected]>
Signed-off-by: cfzjywxk <[email protected]>
Signed-off-by: cfzjywxk <[email protected]>
Signed-off-by: cfzjywxk <[email protected]>
Signed-off-by: cfzjywxk <[email protected]>
Signed-off-by: cfzjywxk <[email protected]>
Signed-off-by: cfzjywxk <[email protected]>
@cfzjywxk cfzjywxk force-pushed the persist_fair_lock_type branch from 569e864 to 54a4ecb Compare May 29, 2023 06:17
@cfzjywxk
Copy link
Collaborator Author

/merge

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented May 29, 2023

@cfzjywxk: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented May 29, 2023

This pull request has been accepted and is ready to merge.

Commit hash: 54a4ecb

@ti-chi-bot ti-chi-bot bot added the status/can-merge Indicates a PR has been approved by a committer. label May 29, 2023
@ti-chi-bot ti-chi-bot bot merged commit a24d9d6 into tikv:master May 29, 2023
@ti-chi-bot ti-chi-bot bot added this to the Pool milestone May 29, 2023
@cfzjywxk cfzjywxk deleted the persist_fair_lock_type branch May 29, 2023 06:32
@CalvinNeo
Copy link
Member

How will this affect TiFlash, is it safe to be ignored?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/transaction SIG: Transaction size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2. type/bugfix This PR fixes a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants