
[CI] DocsClientYamlTestSuiteIT test {yaml=reference/watcher/example-watches/example-watch-clusterstatus/line_137} failing - (#115809) #117354

Merged
merged 9 commits into elastic:main on Dec 4, 2024

Conversation

lukewhiting (Contributor) commented Nov 22, 2024

Sometimes a watch can run during a YAML test, creating watch history indices. During test cleanup these get deleted, but the delete request can emit a system-index-access warning when such indices exist and are removed.

This patch introduces code to swallow (but log) system index access warnings during YAML test cluster cleanup.

Fixes #115809
Fixes #116884
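
Pieced together from the diff fragments quoted later in this thread, the change is roughly the following sketch (a reconstruction, not the exact merged code; the logger name is illustrative):

final Request deleteRequest = new Request("DELETE", Strings.collectionToCommaDelimitedString(indexPatterns));
deleteRequest.addParameter("expand_wildcards", "open,closed,hidden");

// If the only warnings are about system index access, log them and let the
// cleanup continue; any other warning still fails the request.
deleteRequest.setOptions(RequestOptions.DEFAULT.toBuilder().setWarningsHandler(warnings -> {
    for (String warning : warnings) {
        if (warning.startsWith("this request accesses system indices:")) {
            logger.warn("Ignoring system index access warning during test cleanup: {}", warning);
        } else {
            return true; // unexpected warning: fail the request
        }
    }
    return false; // only system index access warnings were seen
}));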

@elasticsearchmachine added the Team:Data Management and needs:risk labels Nov 22, 2024
elasticsearchmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

@lukewhiting added the >test label and removed the >test-failure and needs:risk labels Nov 22, 2024
nielsbauman (Contributor)

Don't forget to remove the entries from muted-tests.yml.

@@ -1131,6 +1135,19 @@ protected static void wipeAllIndices(boolean preserveSecurityIndices) throws IOE
}
final Request deleteRequest = new Request("DELETE", Strings.collectionToCommaDelimitedString(indexPatterns));
deleteRequest.addParameter("expand_wildcards", "open,closed,hidden");

// If system index warning, ignore but log
deleteRequest.setOptions(RequestOptions.DEFAULT.toBuilder().setWarningsHandler(warnings -> {
nielsbauman (Contributor)

I know we talked about this yesterday in the team meeting, but I don't remember if we talked about excluding the triggered-watches data stream from this DELETE request. Is that not an option?

lukewhiting (Contributor, Author)

I don't remember discussing that, but I was a bit hesitant to do it, as I think there's logic that reads that index, and it may affect the execution / retrieval of watches if it's left over between tests.

Perhaps @masseyke can weigh in here on whether he thinks it's safe to keep between tests?

lukewhiting (Contributor, Author) commented Nov 25, 2024

Did a bit more digging and yes, .triggered-watches hanging around between tests does affect watch execution (see

/**
* Checks if any of the loaded watches has been put into the triggered watches index for immediate execution
*
* Note: This is executing a blocking call over the network, thus a potential source of problems
*
* @param watches The list of watches that will be loaded here
* @param clusterState The current cluster state
* @return A list of triggered watches that have been started to execute somewhere else but not finished
*/
) so I think it's best we tidy it up.

@nielsbauman Are you happy with that conclusion?

lukewhiting (Contributor, Author)

Don't forget to remove the entries from muted-tests.yml.

🤦🏻‍♂️ Yep totally forgot...

nielsbauman (Contributor) left a comment

Thanks for investigating, @lukewhiting! This LGTM, but I think it makes sense if @masseyke also has a quick look anyway, since this concerns watcher.

masseyke (Member) left a comment

I think it seems reasonable. It's worth pinging @smalyshev since we're kind of expanding the fix he just did in #117301.

// and: https://github.com/elastic/elasticsearch/issues/115809
deleteRequest.setOptions(RequestOptions.DEFAULT.toBuilder().setWarningsHandler(warnings -> {
for (String warning : warnings) {
if (warning.startsWith("this request accesses system indices:")) {
smalyshev (Contributor)

I wonder if shutting down warnings about all system indices is going a bit too far here. Maybe having a list of allowed ones would be better? Also, maybe there's a way to clean that up before we get here? For async, unfortunately, there isn't, since those indices can be created by multiple APIs which don't have cleanup functions right now, but I'm not sure what the status is with watches.

lukewhiting (Contributor, Author) commented Nov 26, 2024

Thanks for taking a look @smalyshev! To follow up on your points:

I wonder if shutting down warnings about all system indexes is going a bit too far here

I did wonder that too, but I was struggling to think of a case where failing the test due to a warning on cleanup would be helpful information... The request succeeded and the index is deleted; the warning is a known side effect of what we are doing here, so this just feels like noise...

Maybe having an list of allowed ones would be better?

That's possible but difficult in our case. .triggered-watches is a data stream (I think? Or at least ILM-managed), so the actual index name in the error will change over time. We could do some pattern matching on the warning text, but that starts to feel quite fragile...

maybe there's a way to clean up that before we get here? For async unfortunately there isn't

Alas, this is one of those async cases... Watches can run in the background between tests, causing this index to appear and need cleaning up at any time.

I'm happy to add some more complex pattern matching to this, but I would be interested to know what problem it would be solving (not saying there isn't one; more likely I'm just too new here to spot it yet 😅).

lukewhiting (Contributor, Author) commented Dec 3, 2024

Reply by @smalyshev via DM:

Stas Malyshev (19:29):

Two notes there: 1. I'd like to keep the warnings handler code separate if possible; code with multiple levels of lambdas nested into each other is kinda hard to read, imho.

2. Even if .triggered-watches is a data stream, we could still scan the list and use startsWith to check for specific warnings, no? If it's too hard though, then I think it's probably ok to ignore all system indices there.

lukewhiting (Contributor, Author)

I have moved the warning handler to its own method to stop the nested lambdas.

The problem with checking which index threw the warning is that the index name is partway through the warning message, so there would need to be code to extract it before doing the startsWith check.

I feel that, given the fragility of such string parsing code (especially when we have no guarantees on the future consistency of that warning message) and the probability of other indices cropping up in the future that need ignoring, it's best to just deal with this once and be done with it.

That said, if anyone can come up with a downside to muting all warnings during this cleanup, I'm happy to do a u-turn and get that ignore-list function in place (a sketch of what that might look like follows below).
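
For reference, a minimal sketch of the ignore-list variant being discussed, assuming a hypothetical shouldIgnoreWarning helper and an illustrative set of allowed index prefixes; the message parsing is exactly the fragility noted above, and this is not what was merged:

// Hypothetical ignore-list variant (not merged): only suppress system index
// warnings whose index names match an allowed prefix.
private static final String SYSTEM_INDEX_WARNING_PREFIX = "this request accesses system indices:";
private static final List<String> ALLOWED_INDEX_PREFIXES = List.of(".triggered-watches", ".watcher-history");

private static boolean shouldIgnoreWarning(String warning) {
    if (warning.startsWith(SYSTEM_INDEX_WARNING_PREFIX) == false) {
        return false;
    }
    // Index names sit partway through the message, so they must be parsed out;
    // this split is best-effort and would break if the message format changes.
    String indexList = warning.substring(SYSTEM_INDEX_WARNING_PREFIX.length());
    return Arrays.stream(indexList.split("[,\\s\\[\\]]+"))
        .filter(token -> token.isEmpty() == false)
        .allMatch(index -> ALLOWED_INDEX_PREFIXES.stream().anyMatch(index::startsWith));
}

The blanket handler would then call shouldIgnoreWarning per warning instead of the bare startsWith check.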

smalyshev (Contributor)

The tag v6.8.18 looks weird though - did you mean 8.18?

@lukewhiting enabled auto-merge (squash) December 4, 2024 09:46
@lukewhiting added the auto-backport label Dec 4, 2024
@lukewhiting merged commit cda2fe6 into elastic:main Dec 4, 2024
16 checks passed
elasticsearchmachine (Collaborator)

💔 Backport failed

Branch: 8.x
Result: Commit could not be cherry-picked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 117354

lukewhiting added a commit to lukewhiting/elasticsearch that referenced this pull request Dec 4, 2024
…atches/example-watch-clusterstatus/line_137} failing - (elastic#115809) (elastic#117354)

* Ignore system index access errors in YAML test index cleanup method

* Remove test mute

* Swap the logic back as it was right the first time

* Resolve conflict with latest merge

* Move warning handler into it's own method to reduce nesting

(cherry picked from commit cda2fe6)
elasticsearchmachine pushed a commit that referenced this pull request Dec 4, 2024
…atches/example-watch-clusterstatus/line_137} failing - (#115809) (#117354) (#117972)

* Ignore system index access errors in YAML test index cleanup method

* Remove test mute

* Swap the logic back as it was right the first time

* Resolve conflict with latest merge

* Move warning handler into it's own method to reduce nesting

(cherry picked from commit cda2fe6)
@lukewhiting deleted the 115809-watcher-docs-test-fail branch December 4, 2024 12:43
Labels: auto-backport, backport pending, :Data Management/Watcher, Team:Data Management, >test, v8.18.0, v9.0.0