Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: TestLearnerSnapshotFailsRollback takes 90s because of SucceedsSoon #74621

Closed
tbg opened this issue Jan 10, 2022 · 1 comment · Fixed by #75248
Closed

kvserver: TestLearnerSnapshotFailsRollback takes 90s because of SucceedsSoon #74621

tbg opened this issue Jan 10, 2022 · 1 comment · Fixed by #75248
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. A-testing Testing tools and infrastructure C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@tbg
Copy link
Member

tbg commented Jan 10, 2022

See #69519 (comment).

The test injects an error in the snapshot path but some well-meaning code in our test helpers retries that error for a total of 45s until it returns it to the test. This happens twice for a total of 90s.

Jira issue: CRDB-12176

@tbg tbg added A-kv-replication Relating to Raft, consensus, and coordination. A-testing Testing tools and infrastructure C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) labels Jan 10, 2022
tbg added a commit to tbg/cockroach that referenced this issue Jan 20, 2022
We were previously using a "Message" string to indicate details
about an error. We can do so much better now and actually encode
the error. This wasn't possible when this field was first added,
but it is now, so let's use it. As always, there's a migration
concern, which means the old field stays around & is populated
as well as interpreted for one release.

I picked this up for cockroachdb#74621, for which I intend to send a follow-up.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 20, 2022
We were previously using a "Message" string to indicate details about an
error. We can do so much better now and actually encode the error. This
wasn't possible when this field was first added, but it is now, so let's
use it. As always, there's a migration concern, which means the old
field stays around & is populated as well as interpreted for one
release.

We then use this new-found freedom to improve which errors were marked
as "failed snapshot" errors. Previously, any error coming in on a
`SnapshotResponse` were considered snapshot errors and were considered
retriable. This was causing `TestLearnerSnapshotFailsRollback` to run
for 90s, as `TestCluster`'s replication changes use a [SucceedsSoon] to
retry snapshot errors - but that test actually injects an error that it
wants to fail-fast.  Now, since snapshot error marks propagate over the
wire, we can do the marking on the *sender* of the SnapshotResponse, and
we can only mark messages that correspond to an actual failure to apply
the snapshot (as opposed to an injected error, or a hard error due to a
malformed request). The test now takes around one second, for a rare 90x
speed-up.

As a drive-by, we're also removing `errMalformedSnapshot`, which became
unused when we stopped sending the raft log in raft snaps a few releases
back, and which had managed to hide from the `unused` lint.

[SucceedsSoon]: https://github.com/cockroachdb/cockroach/blob/37175f77bf374d1bcb76bc39a65149788be06134/pkg/testutils/testcluster/testcluster.go#L628-L631

Fixes cockroachdb#74621.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 21, 2022
We were previously using a "Message" string to indicate details about an
error. We can do so much better now and actually encode the error. This
wasn't possible when this field was first added, but it is now, so let's
use it. As always, there's a migration concern, which means the old
field stays around & is populated as well as interpreted for one
release.

We then use this new-found freedom to improve which errors were marked
as "failed snapshot" errors. Previously, any error coming in on a
`SnapshotResponse` were considered snapshot errors and were considered
retriable. This was causing `TestLearnerSnapshotFailsRollback` to run
for 90s, as `TestCluster`'s replication changes use a [SucceedsSoon] to
retry snapshot errors - but that test actually injects an error that it
wants to fail-fast.  Now, since snapshot error marks propagate over the
wire, we can do the marking on the *sender* of the SnapshotResponse, and
we can only mark messages that correspond to an actual failure to apply
the snapshot (as opposed to an injected error, or a hard error due to a
malformed request). The test now takes around one second, for a rare 90x
speed-up.

As a drive-by, we're also removing `errMalformedSnapshot`, which became
unused when we stopped sending the raft log in raft snaps a few releases
back, and which had managed to hide from the `unused` lint.

[SucceedsSoon]: https://github.com/cockroachdb/cockroach/blob/37175f77bf374d1bcb76bc39a65149788be06134/pkg/testutils/testcluster/testcluster.go#L628-L631

Fixes cockroachdb#74621.

Release note: None
@tbg
Copy link
Member Author

tbg commented Mar 16, 2022

The linked PR will close this.

tbg added a commit to tbg/cockroach that referenced this issue Mar 17, 2022
We were previously using a "Message" string to indicate details about an
error. We can do so much better now and actually encode the error. This
wasn't possible when this field was first added, but it is now, so let's
use it. As always, there's a migration concern, which means the old
field stays around & is populated as well as interpreted for one
release.

We then use this new-found freedom to improve which errors were marked
as "failed snapshot" errors. Previously, any error coming in on a
`SnapshotResponse` were considered snapshot errors and were considered
retriable. This was causing `TestLearnerSnapshotFailsRollback` to run
for 90s, as `TestCluster`'s replication changes use a [SucceedsSoon] to
retry snapshot errors - but that test actually injects an error that it
wants to fail-fast.  Now, since snapshot error marks propagate over the
wire, we can do the marking on the *sender* of the SnapshotResponse, and
we can only mark messages that correspond to an actual failure to apply
the snapshot (as opposed to an injected error, or a hard error due to a
malformed request). The test now takes around one second, for a rare 90x
speed-up.

As a drive-by, we're also removing `errMalformedSnapshot`, which became
unused when we stopped sending the raft log in raft snaps a few releases
back, and which had managed to hide from the `unused` lint.

[SucceedsSoon]: https://github.com/cockroachdb/cockroach/blob/37175f77bf374d1bcb76bc39a65149788be06134/pkg/testutils/testcluster/testcluster.go#L628-L631

Fixes cockroachdb#74621.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Mar 17, 2022
We were previously using a "Message" string to indicate details about an
error. We can do so much better now and actually encode the error. This
wasn't possible when this field was first added, but it is now, so let's
use it. As always, there's a migration concern, which means the old
field stays around & is populated as well as interpreted for one
release.

We then use this new-found freedom to improve which errors were marked
as "failed snapshot" errors. Previously, any error coming in on a
`SnapshotResponse` were considered snapshot errors and were considered
retriable. This was causing `TestLearnerSnapshotFailsRollback` to run
for 90s, as `TestCluster`'s replication changes use a [SucceedsSoon] to
retry snapshot errors - but that test actually injects an error that it
wants to fail-fast.  Now, since snapshot error marks propagate over the
wire, we can do the marking on the *sender* of the SnapshotResponse, and
we can only mark messages that correspond to an actual failure to apply
the snapshot (as opposed to an injected error, or a hard error due to a
malformed request). The test now takes around one second, for a rare 90x
speed-up.

As a drive-by, we're also removing `errMalformedSnapshot`, which became
unused when we stopped sending the raft log in raft snaps a few releases
back, and which had managed to hide from the `unused` lint.

[SucceedsSoon]: https://github.com/cockroachdb/cockroach/blob/37175f77bf374d1bcb76bc39a65149788be06134/pkg/testutils/testcluster/testcluster.go#L628-L631

Fixes cockroachdb#74621.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Mar 30, 2022
We were previously using a "Message" string to indicate details about an
error. We can do so much better now and actually encode the error. This
wasn't possible when this field was first added, but it is now, so let's
use it. As always, there's a migration concern, which means the old
field stays around & is populated as well as interpreted for one
release.

We then use this new-found freedom to improve which errors were marked
as "failed snapshot" errors. Previously, any error coming in on a
`SnapshotResponse` were considered snapshot errors and were considered
retriable. This was causing `TestLearnerSnapshotFailsRollback` to run
for 90s, as `TestCluster`'s replication changes use a [SucceedsSoon] to
retry snapshot errors - but that test actually injects an error that it
wants to fail-fast.  Now, since snapshot error marks propagate over the
wire, we can do the marking on the *sender* of the SnapshotResponse, and
we can only mark messages that correspond to an actual failure to apply
the snapshot (as opposed to an injected error, or a hard error due to a
malformed request). The test now takes around one second, for a rare 90x
speed-up.

As a drive-by, we're also removing `errMalformedSnapshot`, which became
unused when we stopped sending the raft log in raft snaps a few releases
back, and which had managed to hide from the `unused` lint.

[SucceedsSoon]: https://github.com/cockroachdb/cockroach/blob/37175f77bf374d1bcb76bc39a65149788be06134/pkg/testutils/testcluster/testcluster.go#L628-L631

Fixes cockroachdb#74621.

Release note: None
@tbg tbg self-assigned this Sep 5, 2022
@exalate-issue-sync exalate-issue-sync bot assigned tbg and unassigned tbg Sep 20, 2022
tbg added a commit to tbg/cockroach that referenced this issue Sep 28, 2022
We were previously using a "Message" string to indicate details about an
error. We can do so much better now and actually encode the error. This
wasn't possible when this field was first added, but it is now, so let's
use it. As always, there's a migration concern, which means the old
field stays around & is populated as well as interpreted for one
release.

We then use this new-found freedom to improve which errors were marked
as "failed snapshot" errors. Previously, any error coming in on a
`SnapshotResponse` were considered snapshot errors and were considered
retriable. This was causing `TestLearnerSnapshotFailsRollback` to run
for 90s, as `TestCluster`'s replication changes use a [SucceedsSoon] to
retry snapshot errors - but that test actually injects an error that it
wants to fail-fast.  Now, since snapshot error marks propagate over the
wire, we can do the marking on the *sender* of the SnapshotResponse, and
we can only mark messages that correspond to an actual failure to apply
the snapshot (as opposed to an injected error, or a hard error due to a
malformed request). The test now takes around one second, for a rare 90x
speed-up.

As a drive-by, we're also removing `errMalformedSnapshot`, which became
unused when we stopped sending the raft log in raft snaps a few releases
back, and which had managed to hide from the `unused` lint.

[SucceedsSoon]: https://github.com/cockroachdb/cockroach/blob/37175f77bf374d1bcb76bc39a65149788be06134/pkg/testutils/testcluster/testcluster.go#L628-L631

Fixes cockroachdb#74621.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Oct 31, 2022
We were previously using a "Message" string to indicate details about an
error. We can do so much better now and actually encode the error. This
wasn't possible when this field was first added, but it is now, so let's
use it. As always, there's a migration concern, which means the old
field stays around & is populated as well as interpreted for one
release.

We then use this new-found freedom to improve which errors were marked
as "failed snapshot" errors. Previously, any error coming in on a
`SnapshotResponse` were considered snapshot errors and were considered
retriable. This was causing `TestLearnerSnapshotFailsRollback` to run
for 90s, as `TestCluster`'s replication changes use a [SucceedsSoon] to
retry snapshot errors - but that test actually injects an error that it
wants to fail-fast.  Now, since snapshot error marks propagate over the
wire, we can do the marking on the *sender* of the SnapshotResponse, and
we can only mark messages that correspond to an actual failure to apply
the snapshot (as opposed to an injected error, or a hard error due to a
malformed request). The test now takes around one second, for a rare 90x
speed-up.

As a drive-by, we're also removing `errMalformedSnapshot`, which became
unused when we stopped sending the raft log in raft snaps a few releases
back, and which had managed to hide from the `unused` lint.

[SucceedsSoon]: https://github.com/cockroachdb/cockroach/blob/37175f77bf374d1bcb76bc39a65149788be06134/pkg/testutils/testcluster/testcluster.go#L628-L631

Fixes cockroachdb#74621.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Oct 31, 2022
We were previously using a "Message" string to indicate details about an
error. We can do so much better now and actually encode the error. This
wasn't possible when this field was first added, but it is now, so let's
use it. As always, there's a migration concern, which means the old
field stays around & is populated as well as interpreted for one
release.

We then use this new-found freedom to improve which errors were marked
as "failed snapshot" errors. Previously, any error coming in on a
`SnapshotResponse` were considered snapshot errors and were considered
retriable. This was causing `TestLearnerSnapshotFailsRollback` to run
for 90s, as `TestCluster`'s replication changes use a [SucceedsSoon] to
retry snapshot errors - but that test actually injects an error that it
wants to fail-fast.  Now, since snapshot error marks propagate over the
wire, we can do the marking on the *sender* of the SnapshotResponse, and
we can only mark messages that correspond to an actual failure to apply
the snapshot (as opposed to an injected error, or a hard error due to a
malformed request). The test now takes around one second, for a rare 90x
speed-up.

As a drive-by, we're also removing `errMalformedSnapshot`, which became
unused when we stopped sending the raft log in raft snaps a few releases
back, and which had managed to hide from the `unused` lint.

[SucceedsSoon]: https://github.com/cockroachdb/cockroach/blob/37175f77bf374d1bcb76bc39a65149788be06134/pkg/testutils/testcluster/testcluster.go#L628-L631

Fixes cockroachdb#74621.

Release note: None
craig bot pushed a commit that referenced this issue Nov 3, 2022
75248: kvserver: use EncodedError in SnapshotResponse r=aayushshah15 a=tbg

We were previously using a "Message" string to indicate details about an
error. We can do so much better now and actually encode the error. This
wasn't possible when this field was first added, but it is now, so let's
use it. As always, there's a migration concern, which means the old
field stays around & is populated as well as interpreted for one
release.

We then use this new-found freedom to improve which errors were marked
as "failed snapshot" errors. Previously, any error coming in on a
`SnapshotResponse` were considered snapshot errors and were considered
retriable. This was causing `TestLearnerSnapshotFailsRollback` to run
for 90s, as `TestCluster`'s replication changes use a [SucceedsSoon] to
retry snapshot errors - but that test actually injects an error that it
wants to fail-fast.  Now, since snapshot error marks propagate over the
wire, we can do the marking on the *sender* of the SnapshotResponse, and
we can only mark messages that correspond to an actual failure to apply
the snapshot (as opposed to an injected error, or a hard error due to a
malformed request). The test now takes around one second, for a rare 90x
speed-up.

As a drive-by, we're also removing `errMalformedSnapshot`, which became
unused when we stopped sending the raft log in raft snaps a few releases
back, and which had managed to hide from the `unused` lint.

[SucceedsSoon]: https://github.com/cockroachdb/cockroach/blob/37175f77bf374d1bcb76bc39a65149788be06134/pkg/testutils/testcluster/testcluster.go#L628-L631

Fixes #74621.
Fixes #87337.

Release note: None


90377: roachtest: configure ignore and block lists for 23.1 in roachtests r=rhu713 a=rhu713

Configure ignore and block lists for 23.1 in roachtests.

Release note: None
Epic: RE-253

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
@craig craig bot closed this as completed in 4490ef1 Nov 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. A-testing Testing tools and infrastructure C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants