
Stabilize testRerouteRecovery throttle testing #100788

Conversation

@DiannaHohensee (Contributor) commented Oct 12, 2023

Refactor testRerouteRecovery, pulling out testing of shard recovery
throttling into separate targeted tests. Now there are two additional
tests, one testing source node throttling, and another testing target
node throttling. Throttling both nodes at once leads to primarily the
source node registering throttling, while the target node mostly has
no cause to instigate throttling.

Closes #99941

@DiannaHohensee added the >test, :Distributed Indexing/Recovery, and Team:Distributed (Obsolete) labels Oct 12, 2023
@DiannaHohensee self-assigned this Oct 12, 2023
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@DiannaHohensee added the >test-failure label and removed the >test label Oct 12, 2023
@DiannaHohensee (Contributor, Author)

Is the >test-failure label used for test fixes?

I've run the old and new tests a few times each at 100 iterations. I did a little tidying in the test file, too: let me know if it should be done separately.

I didn't modify the recovery path, though I did move it into a helper function. I figure it's the same recovery functionality that's being tested.

@DaveCTurner (Contributor)

>test is the label to use for changes that fix (or add) tests.

The cleanups/renamings/comment additions look good, but a separate PR would make them a little easier to review, and would also eliminate them as a source of problems if we ever have to git bisect back to this change. Not a huge deal, though; I can see what's going on.

@DiannaHohensee added the >test label and removed the >test-failure label Oct 13, 2023
@DaveCTurner (Contributor) left a comment

New tests look good; I left only a few superficial comments. I have a hunch that splitting out the renamings and other tidyups will make this a lot more readable - there's too much similarity between the old and new tests, and too many differences between the old version of the old test and its new version here, for the diff rendering to work nicely.

Comment on lines 472 to 475
IndicesService indicesService = internalCluster().getInstance(IndicesService.class, nodeA);
assertThat(indicesService.indexServiceSafe(index).getShard(0).recoveryStats().currentAsSource(), equalTo(1));
indicesService = internalCluster().getInstance(IndicesService.class, nodeB);
assertThat(indicesService.indexServiceSafe(index).getShard(0).recoveryStats().currentAsTarget(), equalTo(1));
@DaveCTurner (Contributor):
Removing this wait looks suspicious to me. It's not throttling-related, it's waiting for both nodes to start handling the recovery and that looks important for the next few lines of the test.
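For context, the wait in question is a poll-until-condition step; in the Elasticsearch test framework this kind of wait is typically expressed with ESTestCase.assertBusy. The following is a generic, stand-alone sketch of the same pattern, not code from the PR; awaitCondition and the simulated recovery stat are illustrative names:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class AwaitDemo {
    // Poll until the condition holds or the timeout expires,
    // roughly what ESTestCase.assertBusy does before failing the test.
    static boolean awaitCondition(BooleanSupplier condition, long timeout, TimeUnit unit)
            throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) {
                return true; // e.g. both nodes now report the recovery in their stats
            }
            Thread.sleep(10); // back off briefly before re-checking
        }
        return condition.getAsBoolean(); // one final check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a stat that only reaches the expected value after a few polls,
        // like currentAsSource() flipping to 1 once the source node starts the recovery.
        AtomicInteger pollCount = new AtomicInteger(0);
        boolean ok = awaitCondition(() -> pollCount.incrementAndGet() >= 3, 5, TimeUnit.SECONDS);
        System.out.println(ok); // prints true
    }
}
```

The point of the reviewer's objection is that dropping such a wait makes the subsequent assertions race against the recovery actually starting on both nodes.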

@DiannaHohensee (Contributor, Author):
Hmm. I added this back to the testRerouteRecovery test. Originally I thought you meant in the new tests, because of the diff -- this is in the new tests.

I basically undid what the original patch to add throttling added to this test. I don't have strong opinions, though.

Comment on lines 503 to 509
assertThat("node A should have ongoing recovery as source", recoveryStats.currentAsSource(), equalTo(1));
assertThat("node A should not have ongoing recovery as target", recoveryStats.currentAsTarget(), equalTo(0));
nodeAThrottling = recoveryStats.throttleTime().millis();
}
if (nodeStats.getNode().getName().equals(nodeB)) {
assertThat("node B should not have ongoing recovery as source", recoveryStats.currentAsSource(), equalTo(0));
assertThat("node B should have ongoing recovery as target", recoveryStats.currentAsTarget(), equalTo(1));
@DaveCTurner (Contributor):
I think we should keep the check that A has 1 source and 0 targets and B has 0 sources and 1 target. We can drop the bit about the throttling here tho, and probably the assertBusy().

@DiannaHohensee (Contributor, Author):
Done

Comment on lines 564 to 565
assertThat(recoveryStats.currentAsSource(), equalTo(0));
assertThat(recoveryStats.currentAsTarget(), equalTo(0));
@DaveCTurner (Contributor):
Likewise with this bit, I think we should still be waiting for the recoveries to reach zero according to both nodes.

@DiannaHohensee (Contributor, Author):
Done

Comment on lines 305 to 306
long nodeAThrottling = Long.MAX_VALUE;
long nodeBThrottling = Long.MAX_VALUE;
@DaveCTurner (Contributor):
These are unused here it seems.

@DiannaHohensee (Contributor, Author):
Thanks, fixed

logger.info(
"--> restarting node A with recovery throttling settings. Index shard size (MB) is `{}`. Throttling down to a "
+ "chunk of size `{}` (MB) per second. This will slow recovery of the shard to 10 seconds.",
shardSize.getMb(),
@DaveCTurner (Contributor):
Suggest just shardSize here, its toString() will yield something readable.

@DiannaHohensee (Contributor, Author):
Oh, that is pretty nifty. Sure.

"--> restarting node A with recovery throttling settings. Index shard size (MB) is `{}`. Throttling down to a "
+ "chunk of size `{}` (MB) per second. This will slow recovery of the shard to 10 seconds.",
shardSize.getMb(),
chunkSize / 1024.0 / 1024.0 /* converting bytes to megabytes */
@DaveCTurner (Contributor):
Likewise, ByteSizeValue.ofBytes(chunkSize) will do something sensible.
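For readers following the arithmetic in the log message: the manual conversion in the original code, and the "10 seconds" expectation, work out as below. This is a stand-alone sketch; the 10 MB shard size and 1 MB chunk are assumed illustrative values, not taken from the test:

```java
public class ThrottleMath {
    public static void main(String[] args) {
        long chunkSizeBytes = 1024L * 1024L;        // throttle allows one chunk per second
        // Same conversion the test performs inline: bytes -> megabytes.
        double chunkSizeMb = chunkSizeBytes / 1024.0 / 1024.0;
        long shardSizeMb = 10;                      // assumed shard size for illustration
        // Recovery time is roughly shard size divided by the throttled transfer rate.
        double expectedRecoverySeconds = shardSizeMb / chunkSizeMb;
        System.out.println(chunkSizeMb);            // prints 1.0
        System.out.println(expectedRecoverySeconds); // prints 10.0
    }
}
```

Within the Elasticsearch codebase, ByteSizeValue.ofBytes(chunkSize) renders the same quantity in a human-readable form, which is what the reviewer is suggesting instead of the inline division.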

@DiannaHohensee (Contributor, Author):
Done

@DiannaHohensee (Contributor, Author) left a comment
I've updated per review comments here and here. I also pulled out the name changes for a follow up patch, but I think more importantly I moved the new tests below the old one so that the diff is much clearer.


@DaveCTurner (Contributor) left a comment
LGTM.

Since it's only test changes (and fixes a known test bug) I think it would be good to backport to 8.11, 8.10 and 7.17 too.

…recovery/IndexRecoveryIT.java

Co-authored-by: David Turner <[email protected]>
@DiannaHohensee added the v8.11.1, v7.17.15, v8.10.5, and auto-backport labels Oct 13, 2023
@DiannaHohensee merged commit 323d936 into elastic:main Oct 13, 2023
DiannaHohensee added a commit to DiannaHohensee/elasticsearch that referenced this pull request Oct 13, 2023
@elasticsearchmachine (Collaborator)

💔 Backport failed

Branch results:
- 8.11
- 7.17: Commit could not be cherry-picked due to conflicts
- 8.10

You can use sqren/backport to backport manually by running `backport --upstream elastic/elasticsearch --pr 100788`

DiannaHohensee added a commit to DiannaHohensee/elasticsearch that referenced this pull request Oct 13, 2023
DiannaHohensee added a commit that referenced this pull request Oct 16, 2023
(cherry picked from commit 323d936)
DiannaHohensee added a commit that referenced this pull request Oct 16, 2023

(cherry picked from commit 323d936)
DiannaHohensee added a commit that referenced this pull request Oct 16, 2023

(cherry picked from commit 323d936)
mfussenegger added a commit to crate/crate that referenced this pull request Oct 16, 2023
The test is currently flaky, and it was flaky upstream as well:

- elastic/elasticsearch#99941

In ES they split out the throttle checks into dedicated tests:

- elastic/elasticsearch#100788

We can't copy them due to the License, but we can at least remove the
broken checks to make the test non-flaky.
mfussenegger added a commit to crate/crate that referenced this pull request Oct 16, 2023
Successfully merging this pull request may close these issues.

[CI] IndexRecoveryIT testRerouteRecovery failing