
kv/concurrency: remove TODO about impossible deadlock scenario #47616

Merged

Conversation

nvanbenschoten
Member

This commit removes a TODO that described a potential scenario in which a request-only dependency cycle could deadlock due to a simultaneous transaction abort. I realized after writing a full fix (9a9182f) that such a deadlock was not achievable in practice due to the semantics of the lockTable.

It may appear that there is a bug here in the handling of request-only
dependency cycles. If such a cycle was broken by simultaneously aborting
the transactions responsible for each of the requests, there would be no
guarantee that an aborted pusher would notice that its own transaction
was aborted before it notices that its pushee's transaction was aborted.
For example, in the simplest case, imagine two requests deadlocked on
each other. If their transactions are both aborted and each push notices
the pushee is aborted first, they will both return here triumphantly and
wait for the other to exit its lock wait-queues, leading to deadlock.
Even if they eventually pushed each other again, there would be no
guarantee that the same thing wouldn't happen.

However, such a situation is not possible in practice because such a
dependency cycle is never constructed by the lockTable. The lockTable
assigns each request a monotonically increasing sequence number upon its
initial entrance to the lockTable. This sequence number is used to
straighten out dependency chains of requests such that a request only
waits on conflicting requests with lower sequence numbers than its own
sequence number. This behavior guarantees that request-only dependency
cycles are never constructed by the lockTable. Put differently, all
dependency cycles must include at least one dependency on a lock and,
therefore, one call to pushLockTxn. Unlike pushRequestTxn, pushLockTxn
actively removes the conflicting lock and removes the dependency when it
determines that its pushee transaction is aborted. This means that the
call to pushLockTxn will continue to make forward progress in the case of
a simultaneous abort of all transactions behind the members of the cycle,
preventing such a hypothesized deadlock from ever materializing.
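
To make the sequence-number argument concrete, here is a minimal, self-contained Go sketch. The `request`, `waitQueue`, and `waitsOn` names are purely illustrative and are not the actual lockTable types; the point is only that if a request may wait solely on requests with strictly lower sequence numbers, the waits-on relation over requests alone can never contain a cycle.

```go
package main

import "fmt"

// request stands in for a lockTable request; in this sketch each request
// receives a monotonically increasing sequence number on its first entry.
type request struct {
	seqNum uint64
}

// waitQueue is a toy per-key wait-queue.
type waitQueue struct {
	queued []*request
}

// waitsOn returns the queued requests that a newly arriving request is allowed
// to wait on: only those with strictly lower sequence numbers. Every waits-on
// edge therefore points from a higher seqNum to a lower one, so a cycle made up
// solely of requests would need seqA < seqB < ... < seqA, which is impossible.
func (q *waitQueue) waitsOn(r *request) []*request {
	var deps []*request
	for _, other := range q.queued {
		if other.seqNum < r.seqNum {
			deps = append(deps, other)
		}
	}
	return deps
}

func main() {
	q := &waitQueue{queued: []*request{{seqNum: 1}, {seqNum: 3}}}
	newcomer := &request{seqNum: 2}
	for _, dep := range q.waitsOn(newcomer) {
		fmt.Printf("request 2 waits on request %d\n", dep.seqNum) // prints only request 1
	}
}
```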

The PR also includes a good amount of other cleanup that I was intending to land with the fix to this deadlock. For instance, it splits off a LockAcquisition type from the LockUpdate type. It also cleans up the internals of lock_table.go.

@nvanbenschoten nvanbenschoten requested a review from a team as a code owner April 17, 2020 17:56
@cockroach-teamcity
Member

This change is Reviewable

@blathers-crl

blathers-crl bot commented Apr 17, 2020

❌ The GitHub CI (Cockroach) build has failed on ae7781b4.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/lockTableWaiterCancel3 branch from ae7781b to 561b783 Compare April 17, 2020 20:48
Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm:

Reviewed 18 of 18 files at r1, 3 of 3 files at r2, 14 of 15 files at r3, 4 of 4 files at r4, 2 of 2 files at r5, 1 of 1 files at r6.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/concurrency/lock_table.go, line 1122 at r4 (raw file):

	g.key = l.key
	g.mu.startWait = true
	if reservedBySelfTxn {

I like the reduction in local variables in the refactorings.


pkg/kv/kvserver/concurrency/lock_table.go, line 1054 at r5 (raw file):

		// reservation if the reservation has a lower seqNum. For reads, the
		// non-transactional and transactional behavior is equivalent and
		// handled later in this function.

the comment is now inconsistent with the change that removed sa == spanset.SpanReadWrite from the if-condition.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 454 at r6 (raw file):

	// call to pushLockTxn will continue to make forward progress in the case of
	// a simultaneous abort of all transactions behind the members of the cycle,
	// preventing such a hypothesized deadlock from ever materializing.

An example here would be helpful. And it may help clarify the definition of "request-only dependency cycle", which is quite narrow. Perhaps something like:

req(1, txn1), req(1, txn2) are both waiting on a lock held by txn3, and they respectively hold a reservation on key "a" and key "b".
req(2, txn2) queues up behind the reservation on key "a" and req(2, txn1) queues up behind the reservation on key "b". Now the dependency cycle between txn1 and txn2 only involves requests, but some of the requests here also depend on a lock. So when both txn1, txn2 are aborted, the req(1, txn1), req(1, txn2) will exit the lockTable, allowing req(2, txn1) and req(2, txn2) to get the reservation and now they no longer depend on each other.


pkg/roachpb/data.proto, line 569 at r1 (raw file):

// A LockAcquisition represents the action of a Transaction acquiring a lock
// with a specified durability level over a Span of keys.

is this using a Span to be future-proof?

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/lockTableWaiterCancel3 branch from 561b783 to d4cbe15 Compare April 21, 2020 17:06
Member Author

@nvanbenschoten nvanbenschoten left a comment


TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola)


pkg/kv/kvserver/concurrency/lock_table.go, line 1054 at r5 (raw file):

Previously, sumeerbhola wrote…

the comment is now inconsistent with the change that removed sa == spanset.SpanReadWrite from the if-condition.

Done.


pkg/kv/kvserver/concurrency/lock_table_waiter.go, line 454 at r6 (raw file):

Previously, sumeerbhola wrote…

An example here would be helpful. And it may help clarify the definition of "request-only dependency cycle", which is quite narrow. Perhaps something like:

req(1, txn1), req(1, txn2) are both waiting on a lock held by txn3, and they respectively hold a reservation on key "a" and key "b".
req(2, txn2) queues up behind the reservation on key "a" and req(2, txn1) queues up behind the reservation on key "b". Now the dependency cycle between txn1 and txn2 only involves requests, but some of the requests here also depend on a lock. So when both txn1, txn2 are aborted, the req(1, txn1), req(1, txn2) will exit the lockTable, allowing req(2, txn1) and req(2, txn2) to get the reservation and now they no longer depend on each other.

Done.


pkg/roachpb/data.proto, line 569 at r1 (raw file):

Previously, sumeerbhola wrote…

is this using a Span to be future-proof?

Yes, we're likely to eventually introduce ranged lock acquisition.

@blathers-crl

blathers-crl bot commented Apr 21, 2020

❌ The GitHub CI (Cockroach) build has failed on d4cbe154.


@nvanbenschoten
Member Author

All of the roachtests in CI failed to start with the error invalid version string 'd4cbe15'. I'm guessing this is related to some of the GitHub flakiness. I'll try again.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/lockTableWaiterCancel3 branch from d4cbe15 to ebcba16 Compare April 21, 2020 17:27
@craig
Contributor

craig bot commented Apr 21, 2020

Canceled

@nvanbenschoten
Member Author

bors r+

@blathers-crl

blathers-crl bot commented Apr 21, 2020

❌ The GitHub CI (Cockroach) build has failed on ebcba168.


Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewed 2 of 18 files at r7, 1 of 2 files at r11, 1 of 1 files at r12.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola)

@craig
Contributor

craig bot commented Apr 21, 2020

Build failed (retrying...)

@craig
Contributor

craig bot commented Apr 21, 2020

Build failed (retrying...)

@craig
Contributor

craig bot commented Apr 21, 2020

Build failed

6f81abe split off a new LockUpdate type from the existing Intent type.
This was an improvement because, to that point, the roles of the Intent
type had been seriously overloaded, leading to complexity and confusion.
6f81abe has a good commit message explaining this in detail.

This separation moved us in the right direction, but it has become clear
over the past month that it wasn't complete. Specifically, LockUpdate
was still used for two distinct roles:
- talking about updating existing locks
- talking about acquiring new locks

Using LockUpdate for both of these roles still left some room for
ambiguity. The type contained fields that clearly only made sense for
one of the two roles. For instance, IgnoredSeqnums was clearly only
useful for the first role. More concerning, the type contained fields
that arguably could have made sense for both roles but weren't actually
used for both. This was the case for the Durability field. Originally,
we planned on being able to resolve intents at specific durabilities,
but it became clear that this was not a good idea. Because of this, the
field was only actually used for the second role and was ignored by the
IntentResolver.

This commit addresses these concerns by splitting off a LockAcquisition
type from the LockUpdate type. LockUpdate now exclusively serves the
first role while LockAcquisition serves the second. This added type
safety avoids confusion and room for bugs.

NOTE: LockUpdate is not sent across the wire, so we're able to remove a
field from it without reserving the field number.
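
As a rough illustration of the two roles described above, the split might be pictured like this in Go. This is a hedged sketch only: the real LockUpdate and LockAcquisition types are protobuf messages in pkg/roachpb/data.proto, and the field names and types below are simplified stand-ins, not the actual definitions.

```go
package main

import "fmt"

// Illustrative sketches only: the real LockUpdate and LockAcquisition types are
// protobuf messages in pkg/roachpb/data.proto, and their exact fields differ.

// lockAcquisitionSketch captures the second role: a transaction acquiring a
// lock with a given durability over a span of keys. Durability belongs here.
type lockAcquisitionSketch struct {
	span       string // key span being locked (simplified to a string)
	txn        string // acquiring transaction (simplified)
	durability string // e.g. replicated vs. unreplicated
}

// lockUpdateSketch captures the first role: updating existing locks. Ignored
// sequence numbers belong here, while a durability field does not.
type lockUpdateSketch struct {
	span           string
	txn            string
	ignoredSeqNums []int32
}

func main() {
	acq := lockAcquisitionSketch{span: "a", txn: "txn1", durability: "unreplicated"}
	upd := lockUpdateSketch{span: "a", txn: "txn1", ignoredSeqNums: []int32{2}}
	fmt.Printf("acquire: %+v\nupdate:  %+v\n", acq, upd)
}
```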
These were unused and it's not clear when they would be useful.
They were an added burden to the implementation of lockTable,
so it's better to remove them.

This field was used, but none of the uses were necessary. The field was
an added burden to the implementation of lockTable, so it's better to
remove it.

Strictly renames and adds comments. Just a refactor.

This makes some of the logic easier to understand.
This commit removes a TODO that described a potential scenario in which
a request-only dependency cycle could deadlock due to a simultaneous
transaction abort. I realized after writing a full fix (<add ref>) that
such a deadlock was not achievable in practice due to the semantics of
the lockTable.

> It may appear that there is a bug here in the handling of request-only
> dependency cycles. If such a cycle was broken by simultaneously aborting
> the transactions responsible for each of the requests, there would be no
> guarantee that an aborted pusher would notice that its own transaction
> was aborted before it notices that its pushee's transaction was aborted.
> For example, in the simplest case, imagine two requests deadlocked on
> each other. If their transactions are both aborted and each push notices
> the pushee is aborted first, they will both return here triumphantly and
> wait for the other to exit its lock wait-queues, leading to deadlock.
> Even if they eventually pushed each other again, there would be no
> guarantee that the same thing wouldn't happen.
>
> However, such a situation is not possible in practice because such a
> dependency cycle is never constructed by the lockTable. The lockTable
> assigns each request a monotonically increasing sequence number upon its
> initial entrance to the lockTable. This sequence number is used to
> straighten out dependency chains of requests such that a request only
> waits on conflicting requests with lower sequence numbers than its own
> sequence number. This behavior guarantees that request-only dependency
> cycles are never constructed by the lockTable. Put differently, all
> dependency cycles must include at least one dependency on a lock and,
> therefore, one call to pushLockTxn. Unlike pushRequestTxn, pushLockTxn
> actively removes the conflicting lock and removes the dependency when it
> determines that its pushee transaction is aborted. This means that the
> call to pushLockTxn will continue to make forward progress in the case of
> a simultaneous abort of all transactions behind the members of the cycle,
> preventing such a hypothesized deadlock from ever materializing.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/lockTableWaiterCancel3 branch from ebcba16 to 6ac38be Compare April 22, 2020 18:02
@nvanbenschoten
Member Author

bors r+

@craig
Contributor

craig bot commented Apr 22, 2020

Build failed (retrying...)

@craig
Contributor

craig bot commented Apr 22, 2020

Build succeeded

@craig craig bot merged commit a1296f9 into cockroachdb:master Apr 22, 2020
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/lockTableWaiterCancel3 branch April 23, 2020 02:01