Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

org.elasticsearch.xpack.transform.integration.TestFeatureResetIT #100596

Closed
benwtrent opened this issue Oct 10, 2023 · 17 comments · Fixed by #104870
Closed

org.elasticsearch.xpack.transform.integration.TestFeatureResetIT #100596

benwtrent opened this issue Oct 10, 2023 · 17 comments · Fixed by #104870
Labels
low-risk An open issue or test failure that is a low risk to future releases :ml/Transform Transform Team:ML Meta label for the ML team >test-failure Triaged test failures from CI

Comments

@benwtrent
Copy link
Member

CI Link

https://gradle-enterprise.elastic.co/s/wmkqnkogroefy

Repro line

./gradlew ':x-pack:plugin:transform:qa:multi-node-tests:javaRestTest' --tests "org.elasticsearch.xpack.transform.integration.TestFeatureResetIT.testTransformFeatureReset" -Dtests.seed=A5D75D753CF49FA6 -Dtests.locale=cs -Dtests.timezone=Asia/Famagusta -Druntime.java=21

Does it reproduce?

No

Applicable branches

main

Failure history

No response

Failure excerpt

It does seem that a transforms index is being created again after the feature reset. Here are the cluster logs from this failure:
transform.zip

org.elasticsearch.xpack.transform.integration.TestFeatureResetIT > testTransformFeatureReset FAILED |  
-- | --
  | java.lang.AssertionError: |  
  | Expected: is a map with size <0> |  
  | but: map size was <1> |  
  | at __randomizedtesting.SeedInfo.seed([A5D75D753CF49FA6:B8C2ACD965E57A12]:0) |  
  | at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) |  
  | at org.junit.Assert.assertThat(Assert.java:956) |  
  | at org.junit.Assert.assertThat(Assert.java:923) |  
  | at org.elasticsearch.xpack.transform.integration.TestFeatureResetIT.testTransformFeatureReset(TestFeatureResetIT.java:117)
@benwtrent benwtrent added >test-failure Triaged test failures from CI :ml/Transform Transform low-risk An open issue or test failure that is a low risk to future releases labels Oct 10, 2023
@benwtrent
Copy link
Member Author

marked as low-risk as feature-reset is a test focused thing.

@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Oct 10, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@przemekwitek
Copy link
Contributor

This is a likely consequence of merging #100203
Hopefully with #100287 I'll be able to debug this test failure.

@droberts195
Copy link
Contributor

droberts195 commented Oct 10, 2023

@przemekwitek I am guessing this is a notification hitting the notifications index after the test completes.

Could there be a /* missing from here?

The ML equivalent has a /* at the end:

Or, if it's not that then it must be that we're auditing a notification message after all the transform tasks have finished.

@przemekwitek
Copy link
Contributor

I am guessing this is a notification hitting the notifications index after the test completes.

Yes, I was thinking of a notification too. Unfortunately, the test assertion didn't tell us which transform-* index existed where it should not. I'll fix this.

Could there be a /* missing from here?

Possibly yes. TransformField.TASK_NAME resolves to "data_frame/transforms".
When I query tasks (on my Cloud cluster) using actions=data_frame/transforms it returns nothing:

k8s-gitops-control-plane [main]$ curl -k -u $USER:$PASSWORD "$ES_URL/_tasks?actions=data_frame/transforms" | jq  
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    12  100    12    0     0     20      0 --:--:-- --:--:-- --:--:--    20
{
  "nodes": {}
}

while when I query using wildcard actions=data_frame/transforms*, it returns all the transform persistent tasks.

So it may be just a "broken reset" issue, not a "broken force stop" issue.
Fix is on the way.

@przemekwitek
Copy link
Contributor

Yes, I was thinking of a notification too. Unfortunately, the test assertion didn't tell us which transform-* index existed where it should not. I'll fix this.

So I now have confirmation that it is indeed .transform-notifications-000002 index being created:

org.elasticsearch.xpack.transform.integration.TestFeatureResetIT > testTransformFeatureReset FAILED
    java.lang.AssertionError: Indices were: {.transform-notifications-000002={settings={index={routing={allocation={include={_tier_preference=data_content}}}, hidden=true, number_of_shards=1, auto_expand_replicas=0-1, provided_name=.transform-notifications-000002, creation_date=1697002490163, number_of_replicas=1, uuid=6t9YgmmmQ_2xvGuhTaLWLA, version={created=8500003}}}, mappings={_meta={managed_index_mappings_version=1, version=8.11.0}, dynamic=false, properties={level={type=keyword}, transform_id={type=keyword}, node_name={type=keyword}, message={type=text, fields={raw={ignore_above=1024, type=keyword}}}, timestamp={type=date}}}, aliases={.transform-notifications-read={is_hidden=true}}}}
    Expected: is a map with size <0>
         but: map size was <1>
        at __randomizedtesting.SeedInfo.seed([61A84AAC7843C044:7CBDBB00215225F0]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.elasticsearch.xpack.transform.integration.TestFeatureResetIT.testTransformFeatureReset(TestFeatureResetIT.java:118)

Surprisingly, it didn't have any documents (audit messages) in it. Otherwise, the previous assertion (that searches for index contents) would fail while it didn't.
So I'm looking for a place where an empty notifications index is created.

@droberts195
Copy link
Contributor

droberts195 commented Oct 11, 2023

So I'm looking for a place where an empty notifications index is created.

It could be that, but if you cannot find such a place then it could be that the test just did its check at the instant where the index existed but the document wasn't searchable. Remember that documents aren't searchable after indexing for a second or two.

@przemekwitek
Copy link
Contributor

przemekwitek commented Oct 11, 2023

The answer to the question "why it didn't have any documents" is below. In short, the .transform-notifications-000002 got created but the shards were not yet initialized so the search succeeded.

[2023-10-11T05:34:50,057][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.transform-notifications-000002/vI8nYPvASRCW2A-4mMnyIw] deleting index
[2023-10-11T05:34:50,058][INFO ][o.e.c.m.MetadataDeleteIndexService] [javaRestTest-0] [.transform-internal-007/Y8tiTWeTSMygYzE_7-pOAg] deleting index
[2023-10-11T05:34:50,169][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-0] [.transform-notifications-000002] creating index, cause [auto(bulk api)], templates [.transform-notifications-000002], shards [1]/[1]
[2023-10-11T05:34:50,215][INFO ][o.e.i.m.MapperService    ] [javaRestTest-0] [.transform-notifications-000002] reloading search analyzers
[2023-10-11T05:34:50,215][WARN ][r.suppressed             ] [javaRestTest-0] path: .transform-*/_search, params: {index=.transform-*} Failed to execute phase [query], all shards failed; shardFailures {[_na_][.transform-notifications-000002][0]: org.elasticsearch.action.NoShardAvailableActionException 

I think for debugging, I'll wait for these shards to show up just to see what documents (audit messages) get written to this index.

Nevertheless, I'm almost sure that it is indexer not fully shutdown and writing something when the pTask is already gone. Maybe we'll need to accept that with force == true, the wait_for_completion sometimes doesn't wait for the real completion.

@benwtrent
Copy link
Member Author

failed again: https://gradle-enterprise.elastic.co/s/cjccpnkbkle5w

y'all good with me muting this thing?

benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Oct 11, 2023
@benwtrent
Copy link
Member Author

Muting it: #100729

elasticsearchmachine pushed a commit that referenced this issue Oct 11, 2023
pxsalehi pushed a commit to pxsalehi/elasticsearch that referenced this issue Oct 24, 2023
elasticsearchmachine pushed a commit that referenced this issue Oct 24, 2023
muting test for gh:
#100596

Co-authored-by: Benjamin Trent <[email protected]>
@ldematte
Copy link
Contributor

@cbuescher
Copy link
Member

Doesn't seem to be muted on 8.10 yet, I'm doing this now since I got this today.

@ldematte
Copy link
Contributor

ldematte commented Nov 2, 2023

Thank you for doing that @cbuescher, I was waiting for a "yes" and then forgot.

@przemekwitek
Copy link
Contributor

przemekwitek commented Jan 26, 2024

Thanks to the functionality we have in our tests' cleanUp method, the contents of notifications index are logged when the test is being cleaned up.
In one of these failing tests I have observed:

  1> [2023-10-11T16:15:25,664][INFO ][o.e.x.t.i.TestFeatureResetIT] [testTransformFeatureReset] writing audit messages to the log
  1> [2023-10-11T16:15:25,705][INFO ][o.e.x.t.i.TestFeatureResetIT] [testTransformFeatureReset] Transform audit: [2023-10-11T19:15:25.525Z] [continuous-transform-feature-reset] [Received abort request, stopping transform.] [javaRestTest-2]

which comes from TransformIndexer.onAbort method.
This is hint no. 1.

The hint no. 2 is the fact that we only log audit messages when not in reset mode.
While the _reset request is being processed, we are in reset mode so no audit messages should be logged. I added a log (#104763) line whenever TransformAuditor notices change of isResetMode bit in cluster state.
I also unmuted the test to see the exact timing of the failure related to the change in the isResetMode bit value.

@przemekwitek
Copy link
Contributor

Thanks for reporting @joegallo .
Was it a PR CI workflow? I cannot see this failure in #es-test-triage slack channel.

@joegallo
Copy link
Contributor

Yeah, it was a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
low-risk An open issue or test failure that is a low risk to future releases :ml/Transform Transform Team:ML Meta label for the ML team >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants