Stabilize testRerouteRecovery throttle testing #100788
Conversation
Refactor testRerouteRecovery, pulling out testing of shard recovery throttling into separate targeted tests. Now there are two additional tests, one testing source node throttling, and another testing target node throttling. Throttling both nodes at once leads to primarily the source node registering throttling, while the target node mostly has no cause to instigate throttling.
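The throttle rates in these tests are chosen so a shard of known size takes a predictable time to copy (the log messages later in the diff mention "slow recovery of the shard to 10 seconds"). A minimal, standalone sketch of that arithmetic (not code from the PR; the class and method names are illustrative):

```java
// Sketch: back-of-the-envelope estimate of how long a throttled recovery
// should take, i.e. the arithmetic behind picking a rate that stretches
// shard recovery to roughly 10 seconds.
public class RecoveryTimeEstimate {
    // Expected recovery duration in seconds for a shard of `shardBytes`,
    // copied at a throttled `maxBytesPerSec` rate.
    static double expectedSeconds(long shardBytes, long maxBytesPerSec) {
        return (double) shardBytes / maxBytesPerSec;
    }

    public static void main(String[] args) {
        long shardBytes = 10L * 1024 * 1024; // e.g. a 10 MB shard
        long maxBytesPerSec = 1024 * 1024;   // throttle to 1 MB/s
        System.out.println(expectedSeconds(shardBytes, maxBytesPerSec)); // 10.0
    }
}
```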
Pinging @elastic/es-distributed (Team:Distributed)
I've run the old and new tests a few times each at 100 iterations. I did a little tidying in the test file, too: let me know if it should be done separately. I didn't modify the recovery path, though I did move it into a helper function. I figure it's the same recovery functionality that's being tested.
The cleanups/renamings/comment additions look good but a separate PR would make it a little easier to review, and also to eliminate them as a source of problems if we ever have to
New tests look good; I left only a few superficial comments. I have a hunch that splitting out the renamings and other tidyups will make this a lot more readable - there's too much similarity between the old and new tests, and too many differences between the old version of the old test and its new version here, for the diff rendering to work nicely.
IndicesService indicesService = internalCluster().getInstance(IndicesService.class, nodeA);
assertThat(indicesService.indexServiceSafe(index).getShard(0).recoveryStats().currentAsSource(), equalTo(1));
indicesService = internalCluster().getInstance(IndicesService.class, nodeB);
assertThat(indicesService.indexServiceSafe(index).getShard(0).recoveryStats().currentAsTarget(), equalTo(1));
Removing this wait looks suspicious to me. It's not throttling-related, it's waiting for both nodes to start handling the recovery and that looks important for the next few lines of the test.
Hmm. I added this back to the testRerouteRecovery test. Originally I thought you meant in the new tests, because of the diff -- this is in the new tests.
I basically undid what the original patch to add throttling added to this test. I don't have strong opinions, though.
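The wait under discussion is the assertBusy-style pattern: poll an assertion (here, that both nodes report the recovery as in flight) until it passes or a timeout expires. A self-contained sketch of that polling idea, with hypothetical names (`BusyWait`, `waitUntil`) rather than the actual test-framework helper:

```java
import java.util.function.BooleanSupplier;

public class BusyWait {
    // Poll `condition` every 10 ms until it holds or `timeoutMillis` elapses.
    // Returns true if the condition became true in time.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(10);
        }
        return condition.getAsBoolean(); // one last check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Stand-in for "both nodes report the recovery as ongoing": the
        // condition only becomes true after ~50 ms have passed.
        boolean ok = waitUntil(() -> System.currentTimeMillis() - start > 50, 1000);
        System.out.println(ok);
    }
}
```

The point of polling rather than asserting once is exactly what the review comment raises: recovery starts asynchronously, so a one-shot check races against the nodes beginning to handle it.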
assertThat("node A should have ongoing recovery as source", recoveryStats.currentAsSource(), equalTo(1));
assertThat("node A should not have ongoing recovery as target", recoveryStats.currentAsTarget(), equalTo(0));
nodeAThrottling = recoveryStats.throttleTime().millis();
}
if (nodeStats.getNode().getName().equals(nodeB)) {
assertThat("node B should not have ongoing recovery as source", recoveryStats.currentAsSource(), equalTo(0));
assertThat("node B should have ongoing recovery as target", recoveryStats.currentAsTarget(), equalTo(1));
I think we should keep the check that A has 1 source and 0 targets and B has 0 sources and 1 target. We can drop the bit about the throttling here though, and probably the assertBusy().
Done
assertThat(recoveryStats.currentAsSource(), equalTo(0));
assertThat(recoveryStats.currentAsTarget(), equalTo(0));
Likewise with this bit, I think we should still be waiting for the recoveries to reach zero according to both nodes.
Done
long nodeAThrottling = Long.MAX_VALUE;
long nodeBThrottling = Long.MAX_VALUE;
These are unused here it seems.
Thanks, fixed
logger.info(
    "--> restarting node A with recovery throttling settings. Index shard size (MB) is `{}`. Throttling down to a "
        + "chunk of size `{}` (MB) per second. This will slow recovery of the shard to 10 seconds.",
    shardSize.getMb(),
Suggest just `shardSize` here; its `toString()` will yield something readable.
Oh, that is pretty nifty. Sure.
    "--> restarting node A with recovery throttling settings. Index shard size (MB) is `{}`. Throttling down to a "
        + "chunk of size `{}` (MB) per second. This will slow recovery of the shard to 10 seconds.",
    shardSize.getMb(),
    chunkSize / 1024.0 / 1024.0 /* converting bytes to megabytes */
Likewise, `ByteSizeValue.ofBytes(chunkSize)` will do something sensible.
Done
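The suggestion here is to log the byte-size value object directly instead of dividing by 1024 by hand, because its string form is already human-readable. A simplified, self-contained stand-in for that kind of formatting (not Elasticsearch's actual `ByteSizeValue` code, whose output also handles fractional values):

```java
public class ByteSizeFormat {
    // Simplified stand-in for ByteSizeValue#toString: render a byte count
    // with a binary-unit suffix, in the spirit of "512kb" or "10mb".
    static String format(long bytes) {
        if (bytes >= 1024L * 1024 * 1024) return (bytes / (1024L * 1024 * 1024)) + "gb";
        if (bytes >= 1024L * 1024) return (bytes / (1024L * 1024)) + "mb";
        if (bytes >= 1024L) return (bytes / 1024L) + "kb";
        return bytes + "b";
    }

    public static void main(String[] args) {
        System.out.println(format(512L * 1024));       // 512kb
        System.out.println(format(10L * 1024 * 1024)); // 10mb
    }
}
```

The design point stands regardless of the exact formatting: letting the value type render itself keeps the log message and the unit math from drifting apart.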
LGTM.
Since it's only test changes (and fixes a known test bug) I think it would be good to backport to 8.11, 8.10 and 7.17 too.
…recovery/IndexRecoveryIT.java Co-authored-by: David Turner <[email protected]>
💔 Backport failed
You can use sqren/backport to manually backport by running
The test is currently flaky and it was upstream as well: - elastic/elasticsearch#99941 In ES they split out the throttle checks into dedicated tests: - elastic/elasticsearch#100788 We can't copy them due to the License, but we can at least remove the broken checks to make the test non-flaky.
Closes #99941