[ML] complete machine learning plugin feature state clean up integration #71011
Conversation
Pinging @elastic/ml-core (Team:ML)

```java
Map<String, Boolean> results = new ConcurrentHashMap<>();

ActionListener<ListTasksResponse> afterWaitingForTasks = ActionListener.wrap(
    listTasksResponse -> {
        listTasksResponse.rethrowFailures("Waiting for indexing requests for .ml-* indices");
```
If this throws, `unsetResetModeListener` will not be called. Is that intentional? All other paths call `unsetResetModeListener`.
elasticsearch/server/src/main/java/org/elasticsearch/action/ActionListener.java, lines 128 to 150 in 318ae89:

```java
static <Response> ActionListener<Response> wrap(CheckedConsumer<Response, ? extends Exception> onResponse,
                                                Consumer<Exception> onFailure) {
    return new ActionListener<Response>() {
        @Override
        public void onResponse(Response response) {
            try {
                onResponse.accept(response);
            } catch (Exception e) {
                onFailure(e);
            }
        }

        @Override
        public void onFailure(Exception e) {
            onFailure.accept(e);
        }

        @Override
        public String toString() {
            return "WrappedActionListener{" + onResponse + "}{" + onFailure + "}";
        }
    };
}
```

is the definition of the action listener created here. The contents of `onResponse` are all wrapped in a try/catch, so an exception thrown by the handler is routed to `onFailure`, which does call `unsetResetModeListener`.
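To make the exception-routing concrete, here is a minimal standalone sketch that mirrors the `wrap` semantics quoted above. This is not the real Elasticsearch class: `Listener` and `wrap` are simplified stand-ins, and the `Consumer`-based signature drops the `CheckedConsumer` type the real code uses.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Simplified stand-in for ActionListener.wrap (NOT the real Elasticsearch
// class): onResponse runs inside a try/catch, so an exception thrown by the
// response handler is delivered to the failure handler instead of escaping.
public class WrapSketch {
    interface Listener<R> {
        void onResponse(R response);
        void onFailure(Exception e);
    }

    static <R> Listener<R> wrap(Consumer<R> onResponse, Consumer<Exception> onFailure) {
        return new Listener<R>() {
            @Override
            public void onResponse(R response) {
                try {
                    onResponse.accept(response);
                } catch (Exception e) {
                    onFailure(e); // an exception from the handler becomes a failure
                }
            }

            @Override
            public void onFailure(Exception e) {
                onFailure.accept(e);
            }
        };
    }

    public static void main(String[] args) {
        AtomicReference<String> outcome = new AtomicReference<>();
        Listener<String> listener = wrap(
            response -> { throw new IllegalStateException("rethrowFailures blew up"); },
            e -> outcome.set("failed: " + e.getMessage()));
        // The response handler throws, so the failure path runs instead.
        listener.onResponse("ok");
        System.out.println(outcome.get());
    }
}
```

So a throw from `rethrowFailures` inside the response lambda still ends up on the failure path of the wrapped listener.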
```java
    .setWaitForCompletion(true)
    .execute(ActionListener.wrap(
        listMlTasks -> {
            listMlTasks.rethrowFailures("Waiting for machine learning tasks");
```
Is the idea to leave the plugin in reset mode after a timeout or something so it can be tried again?
No, the idea is for reset mode to be unset. See elasticsearch/server/src/main/java/org/elasticsearch/action/ActionListener.java, lines 128 to 150 in 318ae89 (the `wrap` definition quoted above): an exception thrown from the response handler is caught and routed to `onFailure`.
```diff
@@ -151,6 +151,10 @@ private void triggerTasks() {
         LOGGER.warn("skipping scheduled [ML] maintenance tasks because upgrade mode is enabled");
         return;
     }
+    if (MlMetadata.getMlMetadata(clusterService.state()).isResetMode()) {
+        LOGGER.warn("skipping scheduled [ML] maintenance tasks because reset mode is enabled");
```
++ good catch. This means we have to be careful not to leave reset mode enabled, and should try to ensure `cleanUpFeature` does not leave it set.
```java
@Override
protected AcknowledgedResponse newResponse(boolean acknowledged) {
    logger.trace("Cluster update response built: " + acknowledged);
```
Suggested change:

```diff
-logger.trace("Cluster update response built: " + acknowledged);
+logger.trace(() -> "Cluster update response built: " + acknowledged);
```
Or alternatively:

```diff
-logger.trace("Cluster update response built: " + acknowledged);
+logger.trace("Cluster update response built: {}", acknowledged);
```
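Both suggestions avoid building the message string eagerly when trace logging is disabled. A minimal plain-Java sketch of why that matters is below; `traceEager`/`traceLazy` are hypothetical stand-ins for a logger, not the real Log4j API.

```java
import java.util.function.Supplier;

// Hypothetical mini-logger (NOT the real Log4j API) illustrating why a
// Supplier-based or parameterized trace call skips message construction
// when the trace level is disabled.
public class LazyTraceSketch {
    static int concatenations = 0;

    static String expensiveMessage(boolean acknowledged) {
        concatenations++; // count every eager string build
        return "Cluster update response built: " + acknowledged;
    }

    // Eager form: the argument is evaluated even when trace is off.
    static void traceEager(boolean enabled, String message) {
        if (enabled) System.out.println(message);
    }

    // Lazy form: the Supplier is only invoked when trace is on.
    static void traceLazy(boolean enabled, Supplier<String> message) {
        if (enabled) System.out.println(message.get());
    }

    public static void main(String[] args) {
        boolean traceEnabled = false; // typical production setting
        traceEager(traceEnabled, expensiveMessage(true)); // string built anyway
        int afterEager = concatenations;
        traceLazy(traceEnabled, () -> expensiveMessage(true)); // nothing built
        System.out.println("eager=" + afterEager + " lazy=" + (concatenations - afterEager));
    }
}
```

With trace disabled, the eager call still pays for one concatenation while the lazy call pays for none, which is the whole point of both suggestions.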
Looks good. I just left some comments on a few details.
```java
public class SetResetModeAction extends ActionType<AcknowledgedResponse> {

    public static final SetResetModeAction INSTANCE = new SetResetModeAction();
    public static final String NAME = "cluster:admin/xpack/ml/reset_mode";
```
I think we should use `internal` instead of `admin` here, to make absolutely clear that this is not intended to be called from outside the cluster. This is what our other actions that would be dangerous to call from outside the cluster have, e.g. line 21 in a92a647:

```java
public static final String NAME = "cluster:internal/xpack/ml/job/kill/process";
```
```java
    )
);
client().execute(DeletePipelineAction.INSTANCE, new DeletePipelineRequest("feature_reset_failure_inference_pipeline")).actionGet();
```
Building on what Dave K said, all the tests in this class could have a final assertion that the reset flag is false in the ML custom cluster state. You could have a helper method that gets the cluster state using ClusterStateAction and returns the value of the ML reset flag. Then every test can assert that the return value of that method is false.
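The suggested helper might look roughly like the sketch below. `MlMetadataStub` and `ClusterStateStub` are hypothetical stand-ins for the real Elasticsearch classes; a real implementation would fetch the state via a ClusterStateAction call against the cluster and read the ML custom metadata from it.

```java
// Sketch of the proposed test helper, with stub types standing in for the
// real Elasticsearch ClusterState/MlMetadata classes. In an actual
// integration test the state would come from a ClusterStateAction call.
public class ResetFlagAssertionSketch {
    // Stand-in for the ML custom metadata held in cluster state.
    static class MlMetadataStub {
        final boolean resetMode;
        MlMetadataStub(boolean resetMode) { this.resetMode = resetMode; }
        boolean isResetMode() { return resetMode; }
    }

    // Stand-in for the cluster state a ClusterStateAction call returns.
    static class ClusterStateStub {
        final MlMetadataStub mlMetadata;
        ClusterStateStub(MlMetadataStub m) { this.mlMetadata = m; }
    }

    // The helper every test would call in its final assertion.
    static boolean isMlResetMode(ClusterStateStub state) {
        return state.mlMetadata.isResetMode();
    }

    public static void main(String[] args) {
        // After a successful feature reset the flag should be cleared.
        ClusterStateStub afterReset = new ClusterStateStub(new MlMetadataStub(false));
        System.out.println("resetMode=" + isMlResetMode(afterReset));
    }
}
```

Each test would end with an assertion that the helper returns false, so a reset that leaves the flag set fails every test rather than silently poisoning later ones.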
```diff
     ClusterService clusterService,
     Client client,
-    ActionListener<ResetFeatureStateResponse.ResetFeatureStateStatus> listener) {
+    ActionListener<ResetFeatureStateResponse.ResetFeatureStateStatus> finalListener) {
     logger.info("Starting machine learning cleanup");
```
Suggested change:

```diff
-logger.info("Starting machine learning cleanup");
+logger.info("Starting machine learning feature reset");
```

(Just to make it crystal clear to support that this was triggered by a user calling the reset API.)
```java
if (numberInferenceProcessors > 0) {
    unsetResetModeListener.onFailure(
        new RuntimeException(
            "Unable to reset component as there are ingest pipelines still referencing trained machine learning models"
```
Does this mean we're still going to need custom cleanup in between integration tests? If so it devalues this API a lot.
I would say that if we blow away ML entirely we blow away the ingest pipelines that are referencing ML. To paraphrase Tony Blair: tough on ML, tough on the users of ML.
@droberts195 that feels very rough to me.

All other ML interactions are sort of their own thing. If we go and delete ingest pipelines, that could blow up a user's cluster. Maybe they don't honestly know if a pipeline is in use.

I think safety here is better than causing a user to potentially lose data.

Also, we already have this custom cleanup code in the tests that use pipelines (we have this outside of the normal cleanup code).
> Also, we already have this custom cleanup code in the tests that use pipelines (we have this outside of the normal clean up code).
OK, in that case this PR can stay as-is and we can decide what to do about ingest pipelines separately.
Once this PR is merged we should be able to dramatically simplify our test cleanup and see if anything breaks as a result (which would reveal more things our reset needs to do).
```diff
 } else {
     final List<String> failedComponents = results.entrySet().stream()
         .filter(result -> result.getValue() == false)
         .map(Map.Entry::getKey)
         .collect(Collectors.toList());
-    listener.onFailure(new RuntimeException("Some components failed to reset: " + failedComponents));
+    unsetResetModeListener.onFailure(new RuntimeException("Some components failed to reset: " + failedComponents));
```
Suggested change:

```diff
-unsetResetModeListener.onFailure(new RuntimeException("Some components failed to reset: " + failedComponents));
+unsetResetModeListener.onFailure(new RuntimeException("Some machine learning components failed to reset: " + failedComponents));
```

Since the user could be resetting multiple features simultaneously, I think all the error responses should make clear they relate to ML.
```java
failure -> client.execute(SetResetModeAction.INSTANCE, SetResetModeAction.Request.disabled(), ActionListener.wrap(
    resetSuccess -> finalListener.onFailure(failure),
    resetFailure -> {
        logger.warn("failed to disable reset mode after state clean up failure", resetFailure);
```
Suggested change:

```diff
-logger.warn("failed to disable reset mode after state clean up failure", resetFailure);
+logger.error("failed to disable reset mode after machine learning reset failure", resetFailure);
```
```java
success -> client.execute(SetResetModeAction.INSTANCE, SetResetModeAction.Request.disabled(), ActionListener.wrap(
    resetSuccess -> finalListener.onResponse(success),
    resetFailure -> {
        logger.warn("failed to disable reset mode after state clean up success", resetFailure);
```
Suggested change:

```diff
-logger.warn("failed to disable reset mode after state clean up success", resetFailure);
+logger.error("failed to disable reset mode after otherwise successful machine learning reset", resetFailure);
```
```java
    resetSuccess -> finalListener.onResponse(success),
    resetFailure -> {
        logger.warn("failed to disable reset mode after state clean up success", resetFailure);
        finalListener.onResponse(success);
```
I think this should count as a failure, as it leaves ML in a sub-optimal state - there will be no notifications and no nightly cleanup. So we'd want the user to try the reset again to have another go at removing the reset-in-progress flag.
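The pattern under discussion (always disable reset mode first, then propagate the original cleanup outcome) can be sketched standalone. `finishCleanup` and `disableResetMode` below are simplified stand-ins, not the real Elasticsearch listener plumbing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Simplified sketch of the "disable reset mode, then propagate the original
// outcome" wrapping used in cleanUpFeature. NOT the real Elasticsearch types.
public class UnsetResetModeSketch {
    static List<String> log = new ArrayList<>();
    static boolean resetModeEnabled = true;

    // Stand-in for the SetResetModeAction.Request.disabled() call.
    static void disableResetMode() {
        resetModeEnabled = false;
        log.add("reset mode disabled");
    }

    // Wraps the final callbacks so the cleanup of the flag runs on BOTH paths.
    static void finishCleanup(boolean cleanupSucceeded,
                              Consumer<String> onSuccess,
                              Consumer<String> onFailure) {
        disableResetMode(); // runs whether cleanup succeeded or failed
        if (cleanupSucceeded) {
            onSuccess.accept("reset complete");
        } else {
            onFailure.accept("some components failed to reset");
        }
    }

    public static void main(String[] args) {
        finishCleanup(false,
            msg -> log.add("success: " + msg),
            msg -> log.add("failure: " + msg));
        System.out.println(String.join("; ", log) + "; resetModeEnabled=" + resetModeEnabled);
    }
}
```

Even on the failure path the flag is cleared before the caller hears the result, which is why the remaining question is only what to report when clearing the flag itself fails.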
```diff
@@ -151,6 +151,10 @@ private void triggerTasks() {
         LOGGER.warn("skipping scheduled [ML] maintenance tasks because upgrade mode is enabled");
         return;
     }
+    if (MlMetadata.getMlMetadata(clusterService.state()).isResetMode()) {
+        LOGGER.warn("skipping scheduled [ML] maintenance tasks because reset mode is enabled");
```
Suggested change:

```diff
-LOGGER.warn("skipping scheduled [ML] maintenance tasks because reset mode is enabled");
+LOGGER.warn("skipping scheduled [ML] maintenance tasks because machine learning feature reset is in progress");
```

(Because "reset mode" won't be a documented thing, but the "feature reset API" will be.)
```java
case 2:
    metadataBuilder.isUpgradeMode(isUpgrade == false);
    break;
case 3:
    metadataBuilder.isResetMode(isReset == false);
    break;
```
🚀
LGTM
LGTM
…ion (elastic#71011)

This completes the machine learning feature state cleanup integration. This commit handles waiting for machine learning tasks to complete and adds a new field to the ML Metadata cluster state to indicate when a reset is in progress for machine learning.

relates: elastic#70008
…tegration (#71011) (#71071)

* [ML] complete machine learning plugin feature state clean up integration (#71011)

  This completes the machine learning feature state cleanup integration. This commit handles waiting for machine learning tasks to complete and adds a new field to the ML Metadata cluster state to indicate when a reset is in progress for machine learning.

  relates: #70008

* [ML] fixing feature reset integration tests (#71081)

  Previously created pipelines referencing ML models were not being appropriately deleted in upstream tests. This commit ensures that machine learning removes relevant pipelines from cluster state after tests complete.

  closes #71072
This completes the machine learning feature state cleanup integration.
This commit handles waiting for machine learning tasks to complete and adds a new
field to the ML Metadata cluster state to indicate when a reset is in progress for machine
learning.
relates: #70008