
Stabilize testRerouteRecovery throttle testing #100788

Conversation

@DiannaHohensee (Contributor) commented Oct 12, 2023

Refactor testRerouteRecovery, pulling out testing of shard recovery
throttling into separate targeted tests. Now there are two additional
tests, one testing source node throttling, and another testing target
node throttling. Throttling both nodes at once leads to primarily the
source node registering throttling, while the target node mostly has
no cause to instigate throttling.

Closes #99941

@DiannaHohensee added the >test, :Distributed Indexing/Recovery, and Team:Distributed (Obsolete) labels Oct 12, 2023
@DiannaHohensee self-assigned this Oct 12, 2023
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@DiannaHohensee added the >test-failure label and removed the >test label Oct 12, 2023
@DiannaHohensee (Contributor, Author)

Is the >test-failure label used for test fixes?

I've run the old and new tests a few times each at 100 iterations. I did a little tidying in the test file, too: let me know if it should be done separately.

I didn't modify the recovery path, though I did move it into a helper function. I figure it's the same recovery functionality that's being tested.

@DaveCTurner (Contributor)

>test is the label to use for changes that fix (or add) tests.

The cleanups/renamings/comment additions look good, but a separate PR would make them a little easier to review, and would also eliminate them as a source of problems if we ever have to git bisect back to this change. Not a huge deal, though; I can see what's going on.

@DiannaHohensee added the >test label and removed the >test-failure label Oct 13, 2023
@DaveCTurner (Contributor) left a comment

New tests look good; I left only a few superficial comments. I have a hunch that splitting out the renamings and other tidyups will make this a lot more readable - there's too much similarity between the old and new tests, and too many differences between the old version of the old test and its new version here, for the diff rendering to work nicely.

Comment on lines 472 to 475
IndicesService indicesService = internalCluster().getInstance(IndicesService.class, nodeA);
assertThat(indicesService.indexServiceSafe(index).getShard(0).recoveryStats().currentAsSource(), equalTo(1));
indicesService = internalCluster().getInstance(IndicesService.class, nodeB);
assertThat(indicesService.indexServiceSafe(index).getShard(0).recoveryStats().currentAsTarget(), equalTo(1));
@DaveCTurner (Contributor):
Removing this wait looks suspicious to me. It's not throttling-related, it's waiting for both nodes to start handling the recovery and that looks important for the next few lines of the test.
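For context, the wait in question is a poll-until-condition step; in the Elasticsearch test framework this kind of wait is typically expressed with ESTestCase.assertBusy. The following is a generic, stand-alone sketch of the same pattern, not code from the PR; awaitCondition and the simulated recovery stat are illustrative names:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class AwaitDemo {
    // Poll until the condition holds or the timeout expires,
    // roughly what ESTestCase.assertBusy does before failing the test.
    static boolean awaitCondition(BooleanSupplier condition, long timeout, TimeUnit unit)
            throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) {
                return true; // e.g. both nodes now report the recovery in their stats
            }
            Thread.sleep(10); // back off briefly before re-checking
        }
        return condition.getAsBoolean(); // one final check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a stat that only reaches the expected value after a few polls,
        // like currentAsSource() flipping to 1 once the source node starts the recovery.
        AtomicInteger pollCount = new AtomicInteger(0);
        boolean ok = awaitCondition(() -> pollCount.incrementAndGet() >= 3, 5, TimeUnit.SECONDS);
        System.out.println(ok); // prints true
    }
}
```

The point of the reviewer's objection is that dropping such a wait makes the subsequent assertions race against the recovery actually starting on both nodes.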

@DiannaHohensee (Contributor, Author):
Hmm. I added this back to the testRerouteRecovery test. Originally I thought you meant in the new tests, because of the diff -- this is in the new tests.

I basically undid what the original patch to add throttling added to this test. I don't have strong opinions, though.

Comment on lines 503 to 509
assertThat("node A should have ongoing recovery as source", recoveryStats.currentAsSource(), equalTo(1));
assertThat("node A should not have ongoing recovery as target", recoveryStats.currentAsTarget(), equalTo(0));
nodeAThrottling = recoveryStats.throttleTime().millis();
}
if (nodeStats.getNode().getName().equals(nodeB)) {
assertThat("node B should not have ongoing recovery as source", recoveryStats.currentAsSource(), equalTo(0));
assertThat("node B should have ongoing recovery as target", recoveryStats.currentAsTarget(), equalTo(1));
@DaveCTurner (Contributor):
I think we should keep the check that A has 1 source and 0 targets and B has 0 sources and 1 target. We can drop the bit about the throttling here tho, and probably the assertBusy().

@DiannaHohensee (Contributor, Author):
Done

Comment on lines 564 to 565
assertThat(recoveryStats.currentAsSource(), equalTo(0));
assertThat(recoveryStats.currentAsTarget(), equalTo(0));
@DaveCTurner (Contributor):
Likewise with this bit, I think we should still be waiting for the recoveries to reach zero according to both nodes.

@DiannaHohensee (Contributor, Author):
Done

Comment on lines 305 to 306
long nodeAThrottling = Long.MAX_VALUE;
long nodeBThrottling = Long.MAX_VALUE;
@DaveCTurner (Contributor):
These are unused here it seems.

@DiannaHohensee (Contributor, Author):
Thanks, fixed

logger.info(
"--> restarting node A with recovery throttling settings. Index shard size (MB) is `{}`. Throttling down to a "
+ "chunk of size `{}` (MB) per second. This will slow recovery of the shard to 10 seconds.",
shardSize.getMb(),
@DaveCTurner (Contributor):
Suggest just shardSize here, its toString() will yield something readable.

@DiannaHohensee (Contributor, Author):
Oh, that is pretty nifty. Sure.

"--> restarting node A with recovery throttling settings. Index shard size (MB) is `{}`. Throttling down to a "
+ "chunk of size `{}` (MB) per second. This will slow recovery of the shard to 10 seconds.",
shardSize.getMb(),
chunkSize / 1024.0 / 1024.0 /* converting bytes to megabytes */
@DaveCTurner (Contributor):
Likewise, ByteSizeValue.ofBytes(chunkSize) will do something sensible.
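For readers following the arithmetic in the log message: the manual conversion in the original code, and the "10 seconds" expectation, work out as below. This is a stand-alone sketch; the 10 MB shard size and 1 MB chunk are assumed illustrative values, not taken from the test:

```java
public class ThrottleMath {
    public static void main(String[] args) {
        long chunkSizeBytes = 1024L * 1024L;        // throttle allows one chunk per second
        // Same conversion the test performs inline: bytes -> megabytes.
        double chunkSizeMb = chunkSizeBytes / 1024.0 / 1024.0;
        long shardSizeMb = 10;                      // assumed shard size for illustration
        // Recovery time is roughly shard size divided by the throttled transfer rate.
        double expectedRecoverySeconds = shardSizeMb / chunkSizeMb;
        System.out.println(chunkSizeMb);            // prints 1.0
        System.out.println(expectedRecoverySeconds); // prints 10.0
    }
}
```

Within the Elasticsearch codebase, ByteSizeValue.ofBytes(chunkSize) renders the same quantity in a human-readable form, which is what the reviewer is suggesting instead of the inline division.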

@DiannaHohensee (Contributor, Author):
Done

@DiannaHohensee (Contributor, Author) left a comment
I've updated per review comments here and here. I also pulled out the name changes for a follow up patch, but I think more importantly I moved the new tests below the old one so that the diff is much clearer.


@DaveCTurner (Contributor) left a comment
LGTM.

Since it's only test changes (and fixes a known test bug) I think it would be good to backport to 8.11, 8.10 and 7.17 too.

…recovery/IndexRecoveryIT.java

Co-authored-by: David Turner <[email protected]>
@DiannaHohensee added the v8.11.1, v7.17.15, v8.10.5, and auto-backport labels Oct 13, 2023
@DiannaHohensee merged commit 323d936 into elastic:main Oct 13, 2023
DiannaHohensee added a commit to DiannaHohensee/elasticsearch that referenced this pull request Oct 13, 2023
@elasticsearchmachine (Collaborator)

💔 Backport failed

Branch results:
- 8.11
- 7.17: Commit could not be cherry-picked due to conflicts
- 8.10

You can use sqren/backport to backport manually by running `backport --upstream elastic/elasticsearch --pr 100788`

DiannaHohensee added a commit to DiannaHohensee/elasticsearch that referenced this pull request Oct 13, 2023
DiannaHohensee added a commit that referenced this pull request Oct 16, 2023
(cherry picked from commit 323d936)
DiannaHohensee added a commit that referenced this pull request Oct 16, 2023

(cherry picked from commit 323d936)
DiannaHohensee added a commit that referenced this pull request Oct 16, 2023

(cherry picked from commit 323d936)
mfussenegger added a commit to crate/crate that referenced this pull request Oct 16, 2023
The test is currently flaky, and it was flaky upstream as well:

- elastic/elasticsearch#99941

In ES they split out the throttle checks into dedicated tests:

- elastic/elasticsearch#100788

We can't copy them due to the License, but we can at least remove the
broken checks to make the test non-flaky.
mfussenegger added a commit to crate/crate that referenced this pull request Oct 16, 2023
Successfully merging this pull request may close these issues.

[CI] IndexRecoveryIT testRerouteRecovery failing