
Re-introduce "kv,migration: rm code handling legacy raft truncated st… #70464

Merged

Conversation

@irfansharif (Contributor)

This reverts commit ef1dd6f. #70432 reverted #69887 as a temporary
stop-gap until we release the first 21.2 beta. See the discussion over on
#70432 for why we want to queue up this revert of the original revert;
this should only be merged after #69826 lands.

Release note: None

@irfansharif irfansharif requested reviews from ajwerner, tbg, and several teams as code owners September 21, 2021 01:16
@cockroach-teamcity (Member)

This change is Reviewable

@ajwerner (Contributor) left a comment

Sorry for the dance and thank you! I know there was toil.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

@irfansharif irfansharif force-pushed the 210920.reintroduce-trunc-state branch from cfd9d5a to 4df8ac2 Compare October 25, 2021 16:00
@irfansharif (Contributor, Author) commented Oct 25, 2021

Rebased now that #69826 landed; this should be safe to merge (I think). Can I has a rubber stamp?

@irfansharif irfansharif requested a review from ajwerner October 25, 2021 16:01
@irfansharif (Contributor, Author)
Thanks Erik!

bors r+

@craig (craig bot) commented Oct 26, 2021

Build failed:

@irfansharif (Contributor, Author)
=== RUN   TestReplicateQueueRebalance
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestReplicateQueueRebalance127981417
    test_log_scope.go:80: use -show-logs to present logs inline
    replicate_queue_test.go:98: split off r44:/{Table/50-Max} [(n1,s1):1, (n4,s4):2, (n2,s2):3, (n6,s6):4, (n3,s3):5, next=6, gen=9, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:98: split off r46:/{Table/51-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=16, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:98: split off r47:/{Table/52-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=17, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:98: split off r48:/{Table/53-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=18, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:98: split off r49:/{Table/54-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=19, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:98: split off r50:/{Table/55-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=20, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:98: split off r51:/{Table/56-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=21, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:98: split off r52:/{Table/57-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=22, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:98: split off r53:/{Table/58-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=23, sticky=9223372036.854775807,2147483647]
    replicate_queue_test.go:98: split off r54:/{Table/59-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=24, sticky=9223372036.854775807,2147483647]
    panic.go:965: -- test log scope end --

ERROR: a panic has occurred!
Details cannot be printed yet because we are still unwinding.
Hopefully the test harness prints the panic below, otherwise check the test logs.

test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestReplicateQueueRebalance127981417
I211026 13:00:51.077397 1273396 2@rpc/context.go:1045  [n3] 13  dialing n6: 127.0.0.1:44371 (system)
    panic.go:965: runtime error: index out of range [5] with length 5
        goroutine 1111300 [running]:
        runtime/debug.Stack(0xc00aa66bf0, 0x46c5520, 0xc01f76f8c0)
        	/usr/local/go/src/runtime/debug/stack.go:24 +0x9f
        github.com/cockroachdb/cockroach/pkg/util/leaktest.AfterTest.func1()
        	/go/src/github.com/cockroachdb/cockroach/pkg/util/leaktest/leaktest.go:110 +0x27e
        panic(0x46c5520, 0xc01f76f8c0)
        	/usr/local/go/src/runtime/panic.go:965 +0x1b9
        github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop(0xc00edaa5a0, 0x5f4c7e8, 0xc000124010)
        	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:551 +0x2d3
        panic(0x46c5520, 0xc01f76f8c0)
        	/usr/local/go/src/runtime/panic.go:965 +0x1b9
        github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestReplicateQueueRebalance.func2.1(0xc00e854400, 0x42baa5, 0xc00aa670a8)
        	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replicate_queue_test.go:110 +0x78
        github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores.func1(0x6, 0xc00e854400, 0xc00aa670a8)
        	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:148 +0x38
        github.com/cockroachdb/cockroach/pkg/util/syncutil.(*IntMap).Range(0xc00b71bd80, 0xc00aa67138)
        	/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/int_map.go:352 +0x133
        github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores(0xc00b71bd50, 0xc00aa671c8, 0x0, 0x0)
        	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:147 +0x75
        github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestReplicateQueueRebalance.func2(0x2, 0xc00aa67238, 0x4c0816)
        	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replicate_queue_test.go:109 +0x10d
        github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestReplicateQueueRebalance.func3(0x3f31aff00000010, 0x6177fc03)
        	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replicate_queue_test.go:130 +0x53
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithinError.func1(0x3f31aff, 0xed909f303)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:71 +0x32
        github.com/cockroachdb/cockroach/pkg/util/retry.ForDuration(0xa7a358200, 0xc00aa673f8, 0x0, 0xc00ffc00e8)
        	/go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:197 +0xeb
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithinError(0xc00aa67a68, 0xa7a358200, 0xc0056c4780, 0xc00e86a1e0)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:77 +0x87
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsWithin(0x5f75858, 0xc0056c4780, 0xc00aa67a68, 0xa7a358200)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:58 +0x50
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon(0x5f75858, 0xc0056c4780, 0xc00aa67a68)
        	/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:41 +0x65
        github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestReplicateQueueRebalance(0xc0056c4780)
        	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replicate_queue_test.go:129 +0x9ab
        testing.tRunner(0xc0056c4780, 0x4fb9728)
        	/usr/local/go/src/testing/testing.go:1193 +0xef
        created by testing.(*T).Run
        	/usr/local/go/src/testing/testing.go:1238 +0x2b3
--- FAIL: TestReplicateQueueRebalance (6.89s)

Unrelated flake. Looks like this test isn't expecting non-contiguous store IDs.
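For illustration, here is a minimal Go reduction of the suspected failure mode (hypothetical code, not the actual test): sizing a per-store counter slice by the number of stores and then indexing it by StoreID-1 panics as soon as store IDs are non-contiguous, which matches the "index out of range [5] with length 5" above (s5 missing, s6 present).

    package main

    import "fmt"

    func main() {
        // Store IDs as observed in the failing cluster: s5 was never
        // allocated, so IDs are non-contiguous.
        storeIDs := []int{1, 2, 3, 4, 6}
        counts := make([]int, len(storeIDs)) // length 5
        for _, id := range storeIDs {
            counts[id-1]++ // id=6 -> index 5: out of range [5] with length 5
        }
        fmt.Println(counts) // never reached: the loop panics first
    }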

bors r+

@craig (craig bot) commented Oct 26, 2021

Build succeeded:

@craig craig bot merged commit fa26501 into cockroachdb:master Oct 26, 2021
@irfansharif irfansharif deleted the 210920.reintroduce-trunc-state branch October 26, 2021 14:35
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 29, 2021
Fixes cockroachdb#72029.

using_applied_state_key was previously used to check whether the Range
had upgraded to begin using the RangeAppliedState key. When set to true,
replicas receiving the state (through a ReplicatedEvalResult) knew to
begin using the new key. In cockroachdb#58088 (in 21.1) we introduced a migration
to iterate through all ranges in the system and have them start using
the RangeAppliedState key. In 21.2, this field was always set to true --
21.2 nodes were already using the RangeAppliedState key, or receiving
messages from 21.1 nodes that were also using the RangeAppliedState key.
In 21.2 (and in 21.1 for that matter) we didn't need to trigger the "if
set to true in an incoming message, start using the RangeAppliedState
key" code path.

When looking to get rid of this field in 22.1 (cockroachdb#70464), we observed that
there was an unintentional read of this field in 21.2 nodes (see cockroachdb#72029 and
cockroachdb#72222); the saga is as follows:
 - Removing this field in 22.1 meant it was set as false when received at
   21.2 nodes.
 - This should've been fine! We weren't using this field to trigger any
   upgrade code paths (like it was originally intended for).
 - It turns out that in 21.2 we were using the ReplicaState from the
   incoming snapshot to update our in-memory replica state.
 - Because the proto field was being phased out, there was now a divergence
   between the on-disk state (field set to true, from earlier 21.2
   operations) and the in-memory state (field set to false, because sent
   from a version that attempted to get rid of this field).

Removing proto fields from the replica state is not possible until we stop
using the protobuf copy of the replica state when applying a snapshot
(cockroachdb#72222). Once that's done, we should be able to stop sending the
replica state as part of the snapshot in the subsequent release.

Release note: None
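To make the failure mode above concrete, here is a minimal sketch (a hypothetical struct standing in for the real proto, not the actual kvserver code) of how applying the snapshot's proto copy to in-memory state diverges from disk once the sender drops the field. The key proto3 behavior: an absent field decodes as its zero value, so a sender that removed the field effectively transmits false.

    package main

    import "fmt"

    // ReplicaState stands in for the real proto message.
    type ReplicaState struct {
        UsingAppliedStateKey bool // the field being phased out
    }

    func main() {
        onDisk := ReplicaState{UsingAppliedStateKey: true} // earlier 21.2 writes

        // Snapshot from a 22.1 sender that removed the field: decodes as false.
        incoming := ReplicaState{}

        // 21.2 updated its in-memory state from the snapshot's proto copy...
        inMemory := incoming

        // ...so memory and disk now disagree about the same replica.
        fmt.Println(onDisk.UsingAppliedStateKey, inMemory.UsingAppliedStateKey) // true false
    }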
craig bot pushed a commit that referenced this pull request Oct 29, 2021
72132: coldata: fix BenchmarkAppend so that it could be run with multiple count r=yuzefovich a=yuzefovich

Previously, we could get an index out of bounds because we'd allocate
larger `Bytes.data` slice than `int32` offset can support.

Release note: None

72239: kvserver: resurrect using_applied_state_key in ReplicaState r=irfansharif a=irfansharif

Fixes #72029. (Same commit message as quoted in the commit referenced above.)

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
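As an aside, the 72132 fix in the batch above is a classic overflow shape. A minimal sketch (hypothetical, not the actual coldata code) of why a flat buffer addressed by int32 offsets breaks once it outgrows math.MaxInt32 bytes:

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        var offset int32 = math.MaxInt32 // last representable byte offset
        offset++                         // wraps to math.MinInt32
        fmt.Println(offset)              // -2147483648: any index via this offset panics
    }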