Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Changes to calendars or filters do not update autodetect if handled on non-master node #31803

Closed
dimitris-athanasiou opened this issue Jul 4, 2018 · 1 comment · Fixed by #31804
Assignees
Labels
>bug :ml Machine learning

Comments

@dimitris-athanasiou
Copy link
Contributor

In a multi-node cluster, any changes to calendars or filters used by a running job will not update the autodetect process if the corresponding actions are run on a non-master node.

The reason for this is because those updates are submitted to the UpdateJobProcessNotifier which is only running on the master node. So, even though it exists on non-master nodes, it's not running and the updates fall through.

@dimitris-athanasiou dimitris-athanasiou added >bug :ml Machine learning labels Jul 4, 2018
@dimitris-athanasiou dimitris-athanasiou self-assigned this Jul 4, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Jul 5, 2018
Job updates or changes to calendars or filters may
result into updating the job process if it has been
running. To preserve the order of updates, process
updates are queued through the UpdateJobProcessNotifier
which is only running on the master node. All actions
performing such updates must run on the master node.

However, the CRUD actions for calendars and filters
are not master node actions. They have been submitting
the updates to the UpdateJobProcessNotifier even though
it might have not been running (given the action was
run on a non-master node). When that happens, the update
never reaches the process.

This commit fixes this problem by ensuring the notifier
runs on all nodes and by ensuring the process update action
gets the resources again before updating the process
(instead of having those resources passed in the request).

This ensures that even if the order of the updates
gets messed up, the latest update will read the latest
state of those resource and the process will get back
in sync.

This leaves us with 2 types of updates:

  1. updates to the job config should happen on the master
  node. This is because we cannot refetch the entire job
  and update it. We need to know the parts that have been changed.

  2. updates to resources the job uses. Those can be handled
  on non-master nodes but they should be re-fetched by the
  update process action.

Closes elastic#31803
dimitris-athanasiou added a commit that referenced this issue Jul 5, 2018
Job updates or changes to calendars or filters may
result into updating the job process if it has been
running. To preserve the order of updates, process
updates are queued through the UpdateJobProcessNotifier
which is only running on the master node. All actions
performing such updates must run on the master node.

However, the CRUD actions for calendars and filters
are not master node actions. They have been submitting
the updates to the UpdateJobProcessNotifier even though
it might have not been running (given the action was
run on a non-master node). When that happens, the update
never reaches the process.

This commit fixes this problem by ensuring the notifier
runs on all nodes and by ensuring the process update action
gets the resources again before updating the process
(instead of having those resources passed in the request).

This ensures that even if the order of the updates
gets messed up, the latest update will read the latest
state of those resource and the process will get back
in sync.

This leaves us with 2 types of updates:

  1. updates to the job config should happen on the master
  node. This is because we cannot refetch the entire job
  and update it. We need to know the parts that have been changed.

  2. updates to resources the job uses. Those can be handled
  on non-master nodes but they should be re-fetched by the
  update process action.

Closes #31803
dimitris-athanasiou added a commit that referenced this issue Jul 5, 2018
Job updates or changes to calendars or filters may
result into updating the job process if it has been
running. To preserve the order of updates, process
updates are queued through the UpdateJobProcessNotifier
which is only running on the master node. All actions
performing such updates must run on the master node.

However, the CRUD actions for calendars and filters
are not master node actions. They have been submitting
the updates to the UpdateJobProcessNotifier even though
it might have not been running (given the action was
run on a non-master node). When that happens, the update
never reaches the process.

This commit fixes this problem by ensuring the notifier
runs on all nodes and by ensuring the process update action
gets the resources again before updating the process
(instead of having those resources passed in the request).

This ensures that even if the order of the updates
gets messed up, the latest update will read the latest
state of those resource and the process will get back
in sync.

This leaves us with 2 types of updates:

  1. updates to the job config should happen on the master
  node. This is because we cannot refetch the entire job
  and update it. We need to know the parts that have been changed.

  2. updates to resources the job uses. Those can be handled
  on non-master nodes but they should be re-fetched by the
  update process action.

Closes #31803
dimitris-athanasiou added a commit that referenced this issue Jul 5, 2018
Job updates or changes to calendars or filters may
result into updating the job process if it has been
running. To preserve the order of updates, process
updates are queued through the UpdateJobProcessNotifier
which is only running on the master node. All actions
performing such updates must run on the master node.

However, the CRUD actions for calendars and filters
are not master node actions. They have been submitting
the updates to the UpdateJobProcessNotifier even though
it might have not been running (given the action was
run on a non-master node). When that happens, the update
never reaches the process.

This commit fixes this problem by ensuring the notifier
runs on all nodes and by ensuring the process update action
gets the resources again before updating the process
(instead of having those resources passed in the request).

This ensures that even if the order of the updates
gets messed up, the latest update will read the latest
state of those resource and the process will get back
in sync.

This leaves us with 2 types of updates:

  1. updates to the job config should happen on the master
  node. This is because we cannot refetch the entire job
  and update it. We need to know the parts that have been changed.

  2. updates to resources the job uses. Those can be handled
  on non-master nodes but they should be re-fetched by the
  update process action.

Closes #31803
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :ml Machine learning
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants