Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Use feature reset API in ML REST test cleanup #71552

Merged

Conversation

droberts195
Copy link
Contributor

@droberts195 droberts195 commented Apr 12, 2021

Now that we have a feature reset API, we should use
this for cleaning up in between tests instead of running
lots of bespoke cleanup code.

During testing of this change we found we need to
delete custom cluster state as part of the reset process,
so this PR also implements that.

Additionally we no longer assign persistent tasks
during feature reset.

Now that we have a feature reset API, we should use
this for cleaning up in between tests instead of running
lots of bespoke cleanup code.
@droberts195 droberts195 added >test Issues or PRs that are addressing/adding tests :ml Machine learning v8.0.0 v7.13.0 labels Apr 12, 2021
@droberts195
Copy link
Contributor Author

As of the first commit on this PR this doesn't actually work. Several tests fail with:

cannot assign model_alias [my-regression] to model_id [a-unused-regression-model1] as model_alias already refers to [a-unused-regression-model1]. Set parameter [reassign] to [true] if model_alias should be reassigned."

It reveals that the reset is not removing ML trained model aliases.

/cc @pheyos

@droberts195
Copy link
Contributor Author

We should be able to use this API to clean up between Java integration tests as well as REST integration tests. However, I am just doing the REST tests in this PR because I suspect there will be unforeseen side effects that cause CI noise so it's better to move gradually to minimise the number of side effects that need fixing at a particular point in time.

Comment on lines 1403 to 1406
final ClusterState.Builder builder = ClusterState.builder(currentState);
builder.metadata(Metadata.builder(currentState.getMetadata())
.putCustom(ModelAliasMetadata.NAME, ModelAliasMetadata.EMPTY).build());
return builder.build();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, could we simply delete the entry for ModelAliasMetadata.NAME? I am thinking of the scenario where a user hasn't ever created a model alias (and consequently, the metadata hasn't even been added to the cluster state yet). This reset call will actually be the first thing to create the metadata entry.

Needed because the feature reset API causes creation
of transform custom cluster state.
This represents quite a major change of approach.

The feature reset API is NOT a master node action.  Therefore
any cluster state updates that are done as part of it need to
be separate master node actions.  However, it's wasteful to do
lots of cluster state updates during a reset because we want to
end up with each successfully reset feature's cluster state
being non-existent.

Therefore, this change reuses the master node action that unsets
reset mode after a successful reset to wipe the relevant section(s)
of custom cluster state.
@droberts195 droberts195 marked this pull request as ready for review April 12, 2021 16:33
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Apr 12, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@droberts195
Copy link
Contributor Author

Just wanted to call out that there's been a major change of approach in the later commits.

The feature reset API is NOT a master node action. Therefore any cluster state updates that are done as part of it need to
be separate master node actions. This meant that the approach in the second commit didn't work reliably in a multi-node cluster. I could have introduced another internal master node action to reset cluster states. However, it's wasteful to do lots of cluster state updates during a reset because we want to end up with each successfully reset feature's cluster state being non-existent, and where do you stop with the granularity.

Therefore, the latest approach is to adapt the "set reset mode" actions so that when unsetting reset mode due to a successful reset, instead of resetting the flag the relevent section of cluster state is removed.

The question is, as a result of this change, should we take the opportunity to rename that action from "set reset mode" to something that conveys that it's also responsible for final wiping of the custom cluster state? We can do this before 7.13 feature freeze, but after that we'll have to stick with the action name.

@benwtrent benwtrent self-requested a review April 12, 2021 17:26
The transform portion of the feature reset API calls other
transform endpoints, and will do so even if transforms has
never ever been used.  We don't want it to warn about lack
of transform nodes in this case.

(This problem shows up in ML test clusters that use the
feature reset API for cleanup but don't contain any transform
nodes.)

private static final ParseField ENABLED = new ParseField("enabled");
private static final ParseField RESET_SUCCESSFUL = new ParseField("reset_successful");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be "delete metadata" or something. Reset successful doesn't make sense in this context, at least not to me.

Especially since you could have reset_successful be true and have enabled to the `true.

@droberts195
Copy link
Contributor Author

Current test failures are all caused by:

  1> [2021-04-13T01:17:54,433][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled] before test
  1> [2021-04-13T01:18:57,287][WARN ][o.e.c.RestClient         ] [test] request [DELETE http://127.0.0.1:41000/*,-.ds-ilm-history-*?expand_wildcards=open%2Cclosed%2Chidden] returned 1 warnings: [299 Elasticsearch-8.0.0-SNAPSHOT-85b239713eb39d02cf1dc13585e4db4e53a72475 "this request accesses system indices: [.ml-config], but in a future major version, direct access to system indices will be prevented by default"]
  1> [2021-04-13T01:18:57,476][INFO ][o.e.x.t.r.XPackRestIT    ] [test] There are still tasks running after this test that might break subsequent tests [cluster:admin/features/reset, cluster:admin/xpack/ml/job/close, xpack/ml/job[c]].
  1> [2021-04-13T01:18:57,477][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled] after test

Since we reset custom cluster state by completely removing it
now at the end of a successful reset.
@droberts195
Copy link
Contributor Author

Jenkins run elasticsearch-ci/2

@droberts195
Copy link
Contributor Author

@elasticmachine update branch

@droberts195 droberts195 merged commit c436458 into elastic:master Apr 13, 2021
@droberts195 droberts195 deleted the use_feature_reset_in_ml_cleanup branch April 13, 2021 15:05
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Apr 15, 2021
Now that we have a feature reset API, we should use
this for cleaning up in between tests instead of running
lots of bespoke cleanup code.

During testing of this change we found we need to
delete custom cluster state as part of the reset process,
so this PR also implements that.

Additionally we no longer assign persistent tasks
during feature reset.

Backport of elastic#71552
droberts195 added a commit that referenced this pull request Apr 15, 2021
Now that we have a feature reset API, we should use
this for cleaning up in between tests instead of running
lots of bespoke cleanup code.

During testing of this change we found we need to
delete custom cluster state as part of the reset process,
so this PR also implements that.

Additionally we no longer assign persistent tasks
during feature reset.

Backport of #71552
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning Team:ML Meta label for the ML team >test Issues or PRs that are addressing/adding tests v7.13.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants