Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: kv/contention/nodes=4 failed #53518

Closed
cockroach-teamcity opened this issue Aug 26, 2020 · 7 comments · Fixed by #54230
Closed

roachtest: kv/contention/nodes=4 failed #53518

cockroach-teamcity opened this issue Aug 26, 2020 · 7 comments · Fixed by #54230
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

(roachtest).kv/contention/nodes=4 failed on provisional_202008261913_v20.2.0-beta.1@eaa939ce6548a54a23970814ff00f30ad87680ac:

The test failed on branch=provisional_202008261913_v20.2.0-beta.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/kv/contention/nodes=4/run_1
	ts_util.go:116,kv.go:260,cluster.go:2591,errgroup.go:57: average txns per second 93.895596 was under target 100.000000

	cluster.go:2613,kv.go:263,test_runner.go:754: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2601
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2609
		  | main.registerKVContention.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/kv.go:263
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:754
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2657
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2571
		  | runtime.doInit
		  | 	/usr/local/go/src/runtime/proc.go:5228
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:190
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

More

Artifacts: /kv/contention/nodes=4
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-provisional_202008261913_v20.2.0-beta.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Aug 26, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone Aug 26, 2020
@tbg
Copy link
Member

tbg commented Aug 28, 2020

@nvanbenschoten can you take a look at this one on Monday as well? Seems less dramatic than the conc mgr panic but it's also on the 20.2-beta list.

@knz
Copy link
Contributor

knz commented Aug 31, 2020

FWIW the exact error is average txns per second 93.895596 was under target 100.000000

@nvanbenschoten
Copy link
Member

I was looking into this last week. There's definitely something real here. About 20% of tests seem to see 10 minute transaction stalls repeatedly, which is what we would expect if transactions end up waiting for someone else to expire. This could be an indication of an abandoned transaction that is not cleaned up eagerly. Or something else.

I also stumbled upon #53677 while looking into this, which is moderately concerning.

@nvanbenschoten
Copy link
Member

TL;DR: not a release blocker.

I think I see what's going on here. In #45568, we added the COCKROACH_TXN_HEARTBEAT_DURING_1PC testing knob to ensure that in-flight 1PC transactions never expired while contending with other txns and waiting in wait-queues. The effect of this was that a transaction could create its transaction record through a HeartbeatTxn request and then perform a 1PC transaction. Because the 1PC path doesn't check for a persistent transaction record (because one should never exist without this testing knob), this results in a pretty scary situation - we end up with a committed transaction that still has a PENDING transaction record.

What we were seeing in this test is that this situation would occasionally cause issues if an in-flight PushTxn request misses the concurrency.Manager.OnTransactionUpdated notification. When this happens, the PushTxn will enqueue in the txnWaitQueue and begin querying the already-committed transaction's record using QueryTxn requests. Because the record is present and PENDING, the request will stall until the record expires, which the test dramatically delays (COCKROACH_TXN_LIVENESS_HEARTBEAT_MULTIPLIER=600).

So we could fix this by ensuring that we delete any existing transaction record on a 1PC transaction, but this actually seems misguided. Doing so is an added cost that we only need with COCKROACH_TXN_HEARTBEAT_DURING_1PC. Instead, I plan to remove the COCKROACH_TXN_HEARTBEAT_DURING_1PC knob entirely. It was misguided and the test will still fail on any deadlocks even without this now that we increased the txn expiration from 2m to 10m in 3956bb6.

I'll also make a note that heartbeating a transaction that is attempting a 1PC batch is not only unnecessary, but is also unsafe. I don't think it could actually cause correctness problems because a 1PC txn leaves no intents, but it opens the door to all kinds of weird issues like this.

@nvanbenschoten nvanbenschoten removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Sep 10, 2020
@nvanbenschoten
Copy link
Member

nvanbenschoten commented Sep 10, 2020

TL;DR: not a release blocker.

I take this back. As of #53132, we actually begin heartbeating before the 1PC batch in this test. This is because we acquire unreplicated locks during the initial Scan batch and so COCKROACH_TXN_HEARTBEAT_DURING_1PC isn't even applicable to the test anymore.

For the same reason we need to inform the concurrency manager of an updated txn on 1PC transactions, we'll also need to handle the case where a 1PC transaction already has a PENDING record:

// Even though the transaction is 1PC and hasn't written any intents, it may
// have acquired unreplicated locks, so inform the concurrency manager that
// it is finalized and than any unreplicated locks that it has acquired can
// be released.
res.Local.UpdatedTxns = []*roachpb.Transaction{clonedTxn}
res.Local.ResolvedLocks = make([]roachpb.LockUpdate, len(etArg.LockSpans))
for i, sp := range etArg.LockSpans {
res.Local.ResolvedLocks[i] = roachpb.LockUpdate{
Span: sp,
Txn: clonedTxn.TxnMeta,
Status: clonedTxn.Status,
IgnoredSeqNums: clonedTxn.IgnoredSeqNums,
}
}

This is an important optimization that allows us to acquire unreplicated locks and then remove them during a 1PC batch. It's the only reason we perform decently well on YCSB these days.

But as mentioned above, this isn't really safe. So what we could do is perform a read for a transaction record in evaluate1PC and delete it if necessary. Or we could bail back to the non-1PC path if one exists. But I'd like to avoid adding in a new disk read on most 1PC transactions just to protect against a rare case.

So to avoid a new disk read for 1PC transactions that have not previously acquired unreplicated locks, my plan is:

  1. stop using the COCKROACH_TXN_HEARTBEAT_DURING_1PC env var in this test, it's useless
  2. add a new TxnHeartbeating flag to EndTxnRequest that is set by the txnHeartbeater
  3. consult this flag in Replica.canAttempt1PCEvaluation. If it is set, look for a transaction record
  4. if a transaction record exists, don't attempt a 1PC evaluation
  5. as a bonus, gate the UpdatedTxns and ResolvedLocks handling referenced above on the TxnHeartbeating flag
    6. as a bonus in the future, avoid reading the transaction record when evaluating an EndTxn request with the TxnHeartbeating flag set to false. not safe for non-1PC batches

Luckily, this is all an optimization to avoid waiting out a transaction's expiration, so we can introduce it and backport it without a migration.

cc. @andreimatei in case you have any thoughts here.

@nvanbenschoten nvanbenschoten added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Sep 10, 2020
@andreimatei
Copy link
Contributor

Not attempting 1PC evaluation if we've created the txn record sounds right to me. What's the alternative? Some bastardized 1PC evaluation that also deletes the txn record? Seems weird to me.

@tbg
Copy link
Member

tbg commented Sep 10, 2020

SGTM as well.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 11, 2020
Fixes cockroachdb#53403.
Fixes cockroachdb#53518.
Fixes cockroachdb#53772
Fixes cockroachdb#54094.

This commit disables one-phase commit evaluation for transactions with
existing PENDING transaction records. As we saw in cockroachdb#53518 (comment),
failing to do so can lead to a transaction that commits on the one-phase
commit fast-path but still has a PENDING transaction record. It doesn't
seem like this can actually cause serious correctness issues today other
than in the presence of replays because a 1PC transaction does not have
remote intents (by definition). However, it was creating the appearance
of abandoned transaction records in `kv/contention/nodes=4` and causing
that test to fail.

This commit needs a backport to release-20.2 and release-20.1. This was
not an issue before release-20.1 because before then, we would never
begin a transaction's heartbeat loop for 1PC transactions. This changed
in v20.1 because of unreplicated locks. We allow transactions that
acquire unreplicated locks to still hit the one-phase commit fast-path,
but we also need to start heartbeating once a transaction has acquired
any locks so that it doesn't get aborted by conflicting transactions.

In the vast majority of these cases, the heartbeat loop will never
actually fire (for any txn that takes less than 1s), so with this
change, we'll still be able to perform a 1PC evaluation. However, this
is adding in a disk read for those cases, which is a little
disappointing but doesn't seem easy to avoid without disabling the
heartbeat loop before issuing the 1PC batch (another alternative, happy
to discuss). The upside of this is that we now have enough information
on the server to avoid a bit of work for 1PC txns that have not
previously acquired locks (see TODO in evaluate1PC).
craig bot pushed a commit that referenced this issue Sep 14, 2020
54230: kv: disallow 1PC evaluation when heartbeating and txn record present r=nvanbenschoten a=nvanbenschoten

Fixes #53403.
Fixes #53518.
Fixes #53772
Fixes #54094.

This commit disables one-phase commit evaluation for transactions with existing PENDING transaction records. As we saw in #53518 (comment), failing to do so can lead to a transaction that commits on the one-phase commit fast-path but still has a PENDING transaction record. It doesn't seem like this can actually cause serious correctness issues today other than in the presence of replays because a 1PC transaction does not have remote intents (by definition). However, it was creating the appearance of abandoned transaction records in `kv/contention/nodes=4` and causing that test to fail.

This commit needs a backport to release-20.2 and release-20.1. This was not an issue before release-20.1 because before then, we would never begin a transaction's heartbeat loop for 1PC transactions. This changed in v20.1 because of unreplicated locks. We allow transactions that acquire unreplicated locks to still hit the one-phase commit fast-path, but we also need to start heartbeating once a transaction has acquired any locks so that it doesn't get aborted by conflicting transactions.

In the vast majority of these cases, the heartbeat loop will never actually fire (for any txn that takes less than 1s), so with this change, we'll still be able to perform a 1PC evaluation. However, this is adding in a disk read for those cases, which is a little disappointing but doesn't seem easy to avoid without disabling the heartbeat loop before issuing the 1PC batch (another alternative, happy to discuss). The upside of this is that we now have enough information on the server to avoid a bit of work for 1PC txns that have not previously acquired locks (see TODO in evaluate1PC).

54325: opt: unwrap explain.Node in ConstructScanBuffer r=RaduBerinde a=RaduBerinde

When the new explain infrastructure is in use, the plan is built
against an explain.Factory but the "inner" recursive CTE plan is built
against a normal factory. This leads to an internal error. To avoid
this, we unwrap the node in `ConstructScanBuffer`.

Note that the new explain infrastructure is used automatically for the
first instance of a query fingerprint, in order to populate the plan
in the UI.

Fixes #54324.

Release note (bug fix): fixed an internal error in some cases when
recursive CTEs are used.


54350: roachtest: disable load-based splitting in copy/bank/rows=10000000,nodes=9,txn=false r=nvanbenschoten a=nvanbenschoten

Fixes #54301.

Speculative fix for that roachtest.

54351: workload/schemachange: add enum support r=otan a=ajwerner

This commit adds support for creating enums, adding enum columns, and using
enums to insert data into rows. This found a bug so I'm pleased.

Release note: None

54352: kv: increment bytesSent, not batchSize in kvBatchSnapshotStrategy.sendBatch r=nvanbenschoten a=nvanbenschoten

Partly responsible for #54311.

This commit fixes a bug introduced in #48579 where the snapshot
batch size was increased when each batch was sent instead of the
bytesSent metric. This had two effects:
1. it undermined the memory footprint limit (256 KB) placed on
   snapshot senders by doubling the batch size on each subsequent
   batch.
2. it failed to track the snapshot data rate properly, so the log
   message introduced in #48579 always contained "0 B/s".

This needs to be backported to release-20.1 and release-20.2.

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
@craig craig bot closed this as completed in b2ea96f Sep 14, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 18, 2020
Fixes cockroachdb#53403.
Fixes cockroachdb#53518.
Fixes cockroachdb#53772
Fixes cockroachdb#54094.

This commit disables one-phase commit evaluation for transactions with
existing PENDING transaction records. As we saw in cockroachdb#53518 (comment),
failing to do so can lead to a transaction that commits on the one-phase
commit fast-path but still has a PENDING transaction record. It doesn't
seem like this can actually cause serious correctness issues today other
than in the presence of replays because a 1PC transaction does not have
remote intents (by definition). However, it was creating the appearance
of abandoned transaction records in `kv/contention/nodes=4` and causing
that test to fail.

This commit needs a backport to release-20.2 and release-20.1. This was
not an issue before release-20.1 because before then, we would never
begin a transaction's heartbeat loop for 1PC transactions. This changed
in v20.1 because of unreplicated locks. We allow transactions that
acquire unreplicated locks to still hit the one-phase commit fast-path,
but we also need to start heartbeating once a transaction has acquired
any locks so that it doesn't get aborted by conflicting transactions.

In the vast majority of these cases, the heartbeat loop will never
actually fire (for any txn that takes less than 1s), so with this
change, we'll still be able to perform a 1PC evaluation. However, this
is adding in a disk read for those cases, which is a little
disappointing but doesn't seem easy to avoid without disabling the
heartbeat loop before issuing the 1PC batch (another alternative, happy
to discuss). The upside of this is that we now have enough information
on the server to avoid a bit of work for 1PC txns that have not
previously acquired locks (see TODO in evaluate1PC).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants