Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Adds basic notifications for trained model deployments #88214

Conversation

dimitris-athanasiou
Copy link
Contributor

For specific models:

  • deployment started
  • deployment stopped

System notifications when rebalance occurrs with reasons:

  • model deployment started
  • model deployment stopped
  • nodes changed

For specific models:
  - deployment started
  - deployment stopped

System notifications when rebalance occurrs with reasons:
  - model deployment started
  - model deployment stopped
  - nodes changed
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jun 30, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@benwtrent benwtrent self-requested a review June 30, 2022 15:57
@dimitris-athanasiou
Copy link
Contributor Author

@elasticmachine update branch

Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Will you add system audit messages for node changes?

Sorry I missed that the message was already there


assertNotificationsContain(modelId1, "Started deployment", "Stopped deployment");
assertNotificationsContain(modelId2, "Started deployment", "Stopped deployment");
assertSystemNotificationsContain("Rebalanced trained model allocations because [model deployment started]");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't there also be a Rebalanced trained model allocations because [model deployment STOPPED] message can you test for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because the remaining model is satisfied when the other gets stopped. We only proceed with the rebalance if there are unsatisfied models after a deployment is stopped.

@@ -414,7 +415,12 @@ public ClusterState execute(ClusterState currentState) {

if (areClusterStatesCompatibleForRebalance(clusterState, currentState)) {
isUpdated = true;
return update(currentState, rebalancedMetadata);
ClusterState updatedState = update(currentState, rebalancedMetadata);
if (TrainedModelAssignmentMetadata.fromState(currentState)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the update() method return a Tuple<Boolean, ClusterState> where the boolean indicates a change. It saves extracting the metadata from cluster state again and the equals() check

ClusterState updatedState = update(currentState, rebalancedMetadata);
if (TrainedModelAssignmentMetadata.fromState(currentState)
.equals(TrainedModelAssignmentMetadata.fromState(updatedState)) == false) {
systemAuditor.info(Messages.getMessage(Messages.INFERENCE_DEPLOYMENT_REBALANCED, reason));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to be safe and not do the auditing inside the ClusterStateUpdateTask::execute when it could be done in clusterStateProcessed using the ML utility thread pool.

The auditor can potentially do a lot of work if it is the first time a message is audited

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I have pushed a commit that addresses this.

Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM2

@dimitris-athanasiou dimitris-athanasiou merged commit 7e9a6fe into elastic:master Jul 5, 2022
@dimitris-athanasiou dimitris-athanasiou deleted the audit-messages-for-trained-model-deployments branch July 5, 2022 06:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning >non-issue Team:ML Meta label for the ML team v8.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants