Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: divergent ideas about whether a txn is "read-only" between the Txn and the txnIntentCollector #28256

Closed
andreimatei opened this issue Aug 3, 2018 · 1 comment
Assignees
Labels
A-kv-client Relating to the KV client and the KV interface. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases.
Milestone

Comments

@andreimatei
Copy link
Contributor

The intent collector interceptor complains if it receives an EndTransaction without it having collected any intents prior. It returns an error saying that it's a "readonly txn" and the EndTxn should have been elided above it. That error is supposed to signal bugs.

return nil, roachpb.NewErrorf("cannot commit a read-only transaction")

Unfortunately, that assumption is not quite copacetic with what the layers above it do. There's currently two problems:

  1. If the txn ever encountered an error, the Txn moves to the txnError error, at which point it forgets if it ever performed any writes (i.e. if it came from state txnReadonly or txnWriting, etc..), and so it will never elide future EndTransactions (rollbacks).
  2. If the BeginTxn batch is rejected below the Txn layer, in the TxnCoordSender, then the client.Txn considers the transaction to be writing (cause it sent a BeginTxn), but the intent collector considers it read-only (cause it never saw said BeginTxn). This happens for example if the Stopper is stopped when the BeginTransaction is sent, and the TCS fails to start the heartbeat loop and rejects the batch.

The first problem goes away in #28185 because that PR brakes apart the different txn states and correctly tracks if a BeginTxn was ever sent.

The second problem is more fundamental, caused by the separate tracking of the BeginTxn done in both the TCS and the intent collector.
It's not very clear to me what to do about it. We could try to either share the "did we send a BeginTxn" state. This is a bit complicated by the fact that, in #28185, the BeginTxn tracking is not done by the TCS directly, but by the heartbeat interceptor. So we'd need to create a communication channel between two interceptors. I think what we'd want is for the txnLockGatekeeper to keep track of whether a BeginTxn is really sent to the server, and for the heartbeat interceptor to use that to dictate whether EndTransaction can be elided. Then any interceptor anywhere in the stack can retain the right to reject batches (as they tend to do already) and the intent collector can be left alone with the current assumptions - that if it sees and EndTxn it must have accumulated some intents.
Alternatively, the intent collector could get its own logic for eliding the end transaction, duplicating the existing one.
Yet alternatively, the intent collector could stop assuming anything and conservatively forward unnecessary EndTxns.

@nvanbenschoten for thoughts, if any.
Thanks @jordanlewis for seeing something and saying something.

The error can be demonstrated with

func TestXXX(t *testing.T) {
	defer leaktest.AfterTest(t)()

	s, _, db := serverutils.StartServer(t, base.TestServerArgs{})
	ctx := context.Background()
	s.Stopper().Stop(ctx)

	err := db.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
		key := roachpb.Key("a")
		return txn.Put(ctx, key, "val")
	})
	if err != nil {
		t.Fatal(err)
	}
}
@andreimatei andreimatei added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-client Relating to the KV client and the KV interface. S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. labels Aug 3, 2018
@andreimatei andreimatei self-assigned this Aug 3, 2018
andreimatei added a commit to andreimatei/cockroach that referenced this issue Aug 3, 2018
This patch moves most of the logic from the client.Txn into the
kv.TxnCoordSender and reorganizes much of the TxnCoordSender in the
process.
The split between the client.Txn and the TxnCoordSender caused a lot of
grief historically. The main problem is that both the Txn and the TCS
each have their own copy of the roachpb.Transaction proto. They both
use their copy for different things. We attempt to keep the two protos
in sync, but we can't ensure that as there's no common locking between
the two layers.
This patch keeps the client.Txn as a mostly stateless shim, allowing
one to mock everything underneath. This is nice, as previously "mocking
KV" was a less clear proposition - does one mock all the logic in the
Txn or just the TCS? Now the TCS has all the logic and all the locking
necessary for serializing accesses to the "transaction state" - notably
the proto.
The Txn and TCS communicate through a (now expanded) client.TxnSender
interface.

Within the TCS, the biggest change is that everything that has to do
with the heartbeat loop has been moved to a new interceptor.

Fixes cockroachdb#28256

Release note: none
@andreimatei
Copy link
Contributor Author

I've done something in #28185 - the gatekeeper is now in charge of tracking whether we ever sent a BeginTxn. Wasn't that bad.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Aug 7, 2018
This patch moves most of the logic from the client.Txn into the
kv.TxnCoordSender and reorganizes much of the TxnCoordSender in the
process.
The split between the client.Txn and the TxnCoordSender caused a lot of
grief historically. The main problem is that both the Txn and the TCS
each have their own copy of the roachpb.Transaction proto. They both
use their copy for different things. We attempt to keep the two protos
in sync, but we can't ensure that as there's no common locking between
the two layers.
This patch keeps the client.Txn as a mostly stateless shim, allowing
one to mock everything underneath. This is nice, as previously "mocking
KV" was a less clear proposition - does one mock all the logic in the
Txn or just the TCS? Now the TCS has all the logic and all the locking
necessary for serializing accesses to the "transaction state" - notably
the proto.
The Txn and TCS communicate through a (now expanded) client.TxnSender
interface.

Within the TCS, the biggest change is that everything that has to do
with the heartbeat loop has been moved to a new interceptor.
The metrics generation has also been extracted into a new interceptor.

One behavior change introduced by this patch is that heartbeat loops are
no longer started for (what the TCS hopes will be) 1PC txns. The
motivation was concern over the price of spawning a (shortlived)
heartbeat goroutine per txn in the 1PC-heavy "kv" workload.
Another one is that the TxnCoordSender doesn't inherit the old Txn logic
for swallowing errors on rollbacks. Instead, we're relying on a recent
server change to not return errors on rollbacks when the txn record is
missing - which was the reason for said swallowing.

Fixes cockroachdb#28256

Release note: none
@tbg tbg added this to the 2.1 milestone Aug 8, 2018
andreimatei added a commit to andreimatei/cockroach that referenced this issue Aug 8, 2018
This patch moves most of the logic from the client.Txn into the
kv.TxnCoordSender and reorganizes much of the TxnCoordSender in the
process.
The split between the client.Txn and the TxnCoordSender caused a lot of
grief historically. The main problem is that both the Txn and the TCS
each have their own copy of the roachpb.Transaction proto. They both
use their copy for different things. We attempt to keep the two protos
in sync, but we can't ensure that as there's no common locking between
the two layers.
This patch keeps the client.Txn as a mostly stateless shim, allowing
one to mock everything underneath. This is nice, as previously "mocking
KV" was a less clear proposition - does one mock all the logic in the
Txn or just the TCS? Now the TCS has all the logic and all the locking
necessary for serializing accesses to the "transaction state" - notably
the proto.
The Txn and TCS communicate through a (now expanded) client.TxnSender
interface.

Within the TCS, the biggest change is that everything that has to do
with the heartbeat loop has been moved to a new interceptor.
The metrics generation has also been extracted into a new interceptor.

One behavior change introduced by this patch is that heartbeat loops are
no longer started for (what the TCS hopes will be) 1PC txns. The
motivation was concern over the price of spawning a (shortlived)
heartbeat goroutine per txn in the 1PC-heavy "kv" workload.
Another one is that the TxnCoordSender doesn't inherit the old Txn logic
for swallowing errors on rollbacks. Instead, we're relying on a recent
server change to not return errors on rollbacks when the txn record is
missing - which was the reason for said swallowing.

Fixes cockroachdb#28256

Release note: none
andreimatei added a commit to andreimatei/cockroach that referenced this issue Aug 9, 2018
This patch moves most of the logic from the client.Txn into the
kv.TxnCoordSender and reorganizes much of the TxnCoordSender in the
process.
The split between the client.Txn and the TxnCoordSender caused a lot of
grief historically. The main problem is that both the Txn and the TCS
each have their own copy of the roachpb.Transaction proto. They both
use their copy for different things. We attempt to keep the two protos
in sync, but we can't ensure that as there's no common locking between
the two layers.
This patch keeps the client.Txn as a mostly stateless shim, allowing
one to mock everything underneath. This is nice, as previously "mocking
KV" was a less clear proposition - does one mock all the logic in the
Txn or just the TCS? Now the TCS has all the logic and all the locking
necessary for serializing accesses to the "transaction state" - notably
the proto.
The Txn and TCS communicate through a (now expanded) client.TxnSender
interface.

Within the TCS, the biggest change is that everything that has to do
with the heartbeat loop has been moved to a new interceptor.
The metrics generation has also been extracted into a new interceptor.

One behavior change introduced by this patch is that heartbeat loops are
no longer started for (what the TCS hopes will be) 1PC txns. The
motivation was concern over the price of spawning a (shortlived)
heartbeat goroutine per txn in the 1PC-heavy "kv" workload.
Another one is that the TxnCoordSender doesn't inherit the old Txn logic
for swallowing errors on rollbacks. Instead, we're relying on a recent
server change to not return errors on rollbacks when the txn record is
missing - which was the reason for said swallowing.

Fixes cockroachdb#28256

Release note: none
craig bot pushed a commit that referenced this issue Aug 9, 2018
28185:  client, kv: move logic out of Txn, rewrite some of the TxnCoordSender r=andreimatei a=andreimatei

This patch moves most of the logic from the client.Txn into the
kv.TxnCoordSender and reorganizes much of the TxnCoordSender in the
process.
The split between the client.Txn and the TxnCoordSender caused a lot of
grief historically. The main problem is that both the Txn and the TCS
each have their own copy of the roachpb.Transaction proto. They both
use their copy for different things. We attempt to keep the two protos
in sync, but we can't ensure that as there's no common locking between
the two layers.
This patch keeps the client.Txn as a mostly stateless shim, allowing
one to mock everything underneath. This is nice, as previously "mocking
KV" was a less clear proposition - does one mock all the logic in the
Txn or just the TCS? Now the TCS has all the logic and all the locking
necessary for serializing accesses to the "transaction state" - notably
the proto.
The Txn and TCS communicate through a (now expanded) client.TxnSender
interface.

Within the TCS, the biggest change is that everything that has to do
with the heartbeat loop has been moved to a new interceptor.
The metrics generation has also been extracted into a new interceptor.

One behavior change introduced by this patch is that heartbeat loops are
no longer started for (what the TCS hopes will be) 1PC txns. The
motivation was concern over the price of spawning a (shortlived)
heartbeat goroutine per txn in the 1PC-heavy "kv" workload.
Another one is that the TxnCoordSender doesn't inherit the old Txn logic
for swallowing errors on rollbacks. Instead, we're relying on a recent
server change to not return errors on rollbacks when the txn record is
missing - which was the reason for said swallowing.

Fixes #28256

Release note: none

Co-authored-by: Andrei Matei <[email protected]>
@craig craig bot closed this as completed in #28185 Aug 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-client Relating to the KV client and the KV interface. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases.
Projects
None yet
Development

No branches or pull requests

2 participants