
[CI] DocsClientYamlTestSuiteIT test {yaml=reference/watcher/example-watches/example-watch-clusterstatus/line_137} failing - (#115809) #117354

Merged
merged 9 commits into elastic:main on Dec 4, 2024

Conversation

lukewhiting (Contributor) commented Nov 22, 2024

Sometimes a watch can run during a YAML test, creating watch history indices. During test cleanup these get deleted, but the delete request can emit a system-index-access warning when such indices exist and are removed.

This patch introduces code to swallow (but log) system index access warnings during YAML test cluster cleanup.

Fixes #115809
Fixes #116884
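
Pieced together from the diff fragments quoted later in this thread, the change is roughly the following sketch (a reconstruction, not the exact merged code; the logger name is illustrative):

final Request deleteRequest = new Request("DELETE", Strings.collectionToCommaDelimitedString(indexPatterns));
deleteRequest.addParameter("expand_wildcards", "open,closed,hidden");

// If the only warnings are about system index access, log them and let the
// cleanup continue; any other warning still fails the request.
deleteRequest.setOptions(RequestOptions.DEFAULT.toBuilder().setWarningsHandler(warnings -> {
    for (String warning : warnings) {
        if (warning.startsWith("this request accesses system indices:")) {
            logger.warn("Ignoring system index access warning during test cleanup: {}", warning);
        } else {
            return true; // unexpected warning: fail the request
        }
    }
    return false; // only system index access warnings were seen
}));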

@elasticsearchmachine added the Team:Data Management and needs:risk labels Nov 22, 2024
elasticsearchmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

@lukewhiting added the >test label and removed the >test-failure and needs:risk labels Nov 22, 2024
nielsbauman (Contributor)

Don't forget to remove the entries from muted-tests.yml.

@@ -1131,6 +1135,19 @@ protected static void wipeAllIndices(boolean preserveSecurityIndices) throws IOE
}
final Request deleteRequest = new Request("DELETE", Strings.collectionToCommaDelimitedString(indexPatterns));
deleteRequest.addParameter("expand_wildcards", "open,closed,hidden");

// If system index warning, ignore but log
deleteRequest.setOptions(RequestOptions.DEFAULT.toBuilder().setWarningsHandler(warnings -> {
nielsbauman (Contributor)

I know we talked about this yesterday in the team meeting, but I don't remember if we talked about excluding the triggered-watches data stream from this DELETE request. Is that not an option?

lukewhiting (Contributor, Author)

I don't remember discussing that, but I was a bit hesitant to do it, as I think there's logic that reads that index, and it may affect the execution / retrieval of watches if it's left over between tests.

Perhaps @masseyke can weigh in here on whether he thinks it's safe to keep between tests?

lukewhiting (Contributor, Author) commented Nov 25, 2024

Did a bit more digging and yes, .triggered-watches hanging around between tests does affect watch execution (see

/**
* Checks if any of the loaded watches has been put into the triggered watches index for immediate execution
*
* Note: This is executing a blocking call over the network, thus a potential source of problems
*
* @param watches The list of watches that will be loaded here
* @param clusterState The current cluster state
* @return A list of triggered watches that have been started to execute somewhere else but not finished
*/
) so I think it's best we tidy it up.

@nielsbauman Are you happy with that conclusion?

lukewhiting (Contributor, Author)

Don't forget to remove the entries from muted-tests.yml.

🤦🏻‍♂️ Yep totally forgot...

nielsbauman (Contributor) left a comment

Thanks for investigating, @lukewhiting! This LGTM, but I think it makes sense if @masseyke also has a quick look anyway, since this concerns watcher.

masseyke (Member) left a comment

I think it seems reasonable. It's worth pinging @smalyshev since we're kind of expanding the fix he just did in #117301.

// and: https://github.com/elastic/elasticsearch/issues/115809
deleteRequest.setOptions(RequestOptions.DEFAULT.toBuilder().setWarningsHandler(warnings -> {
for (String warning : warnings) {
if (warning.startsWith("this request accesses system indices:")) {
smalyshev (Contributor)

I wonder if shutting down warnings about all system indices is going a bit too far here. Maybe having a list of allowed ones would be better? Also, maybe there's a way to clean that up before we get here? For async, unfortunately, there isn't, since those indices can be created by multiple APIs which don't have cleanup functions right now, but I'm not sure what the status is with watches.

lukewhiting (Contributor, Author) commented Nov 26, 2024

Thanks for taking a look @smalyshev! To follow up on your points:

I wonder if shutting down warnings about all system indexes is going a bit too far here

I did wonder that too, but I was struggling to think of a case where failing the test due to a warning on cleanup would be helpful information... The request succeeded and the index is deleted; the warning is a known side effect of what we are doing here, so this just feels like noise...

Maybe having an list of allowed ones would be better?

That's possible but difficult in our case. .triggered-watches is a data stream (I think? Or at least ILM-managed), so the actual index name in the error will change over time. We could do some pattern matching on the warning text, but that starts to feel quite fragile...

maybe there's a way to clean up that before we get here? For async unfortunately there isn't

Alas, this is one of those async cases... Watches can run in the background between tests, causing this index to appear and need cleaning up at any time.

I'm happy to add some more complex pattern matching to this, but I would be interested to know what problem it would be solving (not saying there isn't one; more likely I'm just too new here to spot it yet 😅).

lukewhiting (Contributor, Author) commented Dec 3, 2024

Reply by @smalyshev via DM:

Stas Malyshev (19:29):

Two notes there: 1. I'd like to keep the warnings handler code separate if possible; code with multiple levels of lambdas nested into each other is kinda hard to read, imho.

2. Even if .triggered-watches is a data stream, we could still scan the list and use startsWith to check for specific warnings, no? If it's too hard though, then I think it's probably ok to ignore all system indices there.

lukewhiting (Contributor, Author)

I have moved the warning handler to its own method to stop the nested lambdas.

The problem with checking which index threw the warning is that the index name is partway through the warning message, so there would need to be code to extract it before doing the startsWith check.

I feel that, given the fragility of such string parsing code (especially when we have no guarantees on the future consistency of that warning message) and the probability of other indices cropping up in the future that need ignoring, it's best to just deal with this once and be done with it.

That said, if anyone can come up with a downside to muting all warnings during this cleanup, I'm happy to do a u-turn and get that ignore-list function in place (a sketch of what that might look like follows below).
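
For reference, a minimal sketch of the ignore-list variant being discussed, assuming a hypothetical shouldIgnoreWarning helper and an illustrative set of allowed index prefixes; the message parsing is exactly the fragility noted above, and this is not what was merged:

// Hypothetical ignore-list variant (not merged): only suppress system index
// warnings whose index names match an allowed prefix.
private static final String SYSTEM_INDEX_WARNING_PREFIX = "this request accesses system indices:";
private static final List<String> ALLOWED_INDEX_PREFIXES = List.of(".triggered-watches", ".watcher-history");

private static boolean shouldIgnoreWarning(String warning) {
    if (warning.startsWith(SYSTEM_INDEX_WARNING_PREFIX) == false) {
        return false;
    }
    // Index names sit partway through the message, so they must be parsed out;
    // this split is best-effort and would break if the message format changes.
    String indexList = warning.substring(SYSTEM_INDEX_WARNING_PREFIX.length());
    return Arrays.stream(indexList.split("[,\\s\\[\\]]+"))
        .filter(token -> token.isEmpty() == false)
        .allMatch(index -> ALLOWED_INDEX_PREFIXES.stream().anyMatch(index::startsWith));
}

The blanket handler would then call shouldIgnoreWarning per warning instead of the bare startsWith check.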

smalyshev (Contributor)

The tag v6.8.18 looks weird though - did you mean 8.18?

@lukewhiting enabled auto-merge (squash) December 4, 2024 09:46
@lukewhiting added the auto-backport label Dec 4, 2024
@lukewhiting merged commit cda2fe6 into elastic:main Dec 4, 2024
16 checks passed
elasticsearchmachine (Collaborator)

💔 Backport failed

Branch: 8.x
Result: Commit could not be cherry-picked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 117354

lukewhiting added a commit to lukewhiting/elasticsearch that referenced this pull request Dec 4, 2024
…atches/example-watch-clusterstatus/line_137} failing - (elastic#115809) (elastic#117354)

* Ignore system index access errors in YAML test index cleanup method

* Remove test mute

* Swap the logic back as it was right the first time

* Resolve conflict with latest merge

* Move warning handler into it's own method to reduce nesting

(cherry picked from commit cda2fe6)
elasticsearchmachine pushed a commit that referenced this pull request Dec 4, 2024
…atches/example-watch-clusterstatus/line_137} failing - (#115809) (#117354) (#117972)

* Ignore system index access errors in YAML test index cleanup method

* Remove test mute

* Swap the logic back as it was right the first time

* Resolve conflict with latest merge

* Move warning handler into it's own method to reduce nesting

(cherry picked from commit cda2fe6)
@lukewhiting deleted the 115809-watcher-docs-test-fail branch December 4, 2024 12:43
Labels: auto-backport, backport pending, :Data Management/Watcher, Team:Data Management, >test, v8.18.0, v9.0.0