Error upgrading from (<= 6.5) to 6.7.x #41090

benwtrent · 2019-04-10T19:32:28Z

While upgrading from version 6.5.x (or a previous version) to 6.7.x, there is a bug that puts a null value into the cluster metadata state in certain situations.

If any ML anomaly jobs or datafeeds exist and there are NO persistent tasks in the cluster state, then the ML config migration process puts a null value into the new cluster state and attempts to update the state. This fails because the Diff check sees the null value and throws an NPE.
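To make the failure mode concrete, here is a toy sketch of a diff that calls a method on every value in the new state. It is not Elasticsearch code: the `Section` interface, `countChanged`, and the `"persistent_tasks"` key are made up for illustration, and the real Diff logic is more involved. It only shows why a single null entry is enough to produce an NPE.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a diff over named metadata sections; NOT Elasticsearch code. */
final class NullCustomDiffSketch {

    /** Hypothetical stand-in for a metadata section that can be compared with its predecessor. */
    interface Section {
        boolean sameAs(Section other);
    }

    /** Counts sections that changed between two states, calling a method on every new value. */
    static int countChanged(Map<String, Section> before, Map<String, Section> after) {
        int changed = 0;
        for (Map.Entry<String, Section> entry : after.entrySet()) {
            Section previous = before.get(entry.getKey());
            // Dereferences the new value: throws NullPointerException if it is null.
            if (!entry.getValue().sameAs(previous)) {
                changed++;
            }
        }
        return changed;
    }

    /** Returns true if diffing a state that contains a null section throws an NPE. */
    static boolean diffFailsOnNullSection() {
        Map<String, Section> before = new HashMap<>();
        Map<String, Section> after = new HashMap<>();
        after.put("persistent_tasks", null); // illustrative key; the null is the bug in miniature
        try {
            countChanged(before, after);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("diff failed with NPE: " + diffFailsOnNullSection());
    }
}
```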

This issue can be worked around in any of the following ways:

- Delete jobs and datafeeds.
- Set the cluster setting xpack.ml.enable_config_migration to false. It is a dynamic setting.
- Open any ML job so that a persistent task gets created.

Before setting xpack.ml.enable_config_migration back to true, either delete the jobs or open one of them. That will then allow the migration to take place.
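For the settings-based workaround, the sketch below builds the JSON body you would send to the cluster update settings API (`PUT /_cluster/settings`). The setting name comes from this issue; using the `persistent` scope rather than `transient` is my assumption, and the class and method names are made up. It only builds the string, so sending it (curl, Kibana console, a client library) is up to you.

```java
/** Builds request bodies for the cluster update settings API (PUT /_cluster/settings). */
final class ConfigMigrationSetting {

    /** Setting name as given in this issue. */
    static final String SETTING = "xpack.ml.enable_config_migration";

    /** JSON body that sets the flag; the "persistent" scope is an assumption. */
    static String body(boolean enabled) {
        return "{\"persistent\":{\"" + SETTING + "\":" + enabled + "}}";
    }

    public static void main(String[] args) {
        // Disable migration before the upgrade, re-enable it once a job has been opened or deleted.
        System.out.println("PUT /_cluster/settings");
        System.out.println(body(false));
    }
}
```

Calling `body(true)` gives the body for turning migration back on afterwards.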

elasticmachine · 2019-04-10T19:32:30Z

Pinging @elastic/ml-core

droberts195 · 2019-04-10T19:40:53Z

> NO persistent tasks in the cluster state

... and never ever have been I think. If you have ever created a persistent task but just don’t have any right now then the corresponding custom metadata will exist, just storing the highest ever allocation ID, and then the collection that’s null when the problem occurs will instead be an empty collection.
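The distinction above (section never existed, so it is null, versus section exists but is empty) suggests a guard along these lines. This is only a sketch of the idea in the fix's title ("checking if p-tasks metadata is null before updating state"), not the merged code; `buildNewState` and the plain `Map` standing in for cluster state metadata are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

/** Toy sketch of a null guard when copying a metadata section; NOT the merged fix. */
final class SkipNullSectionSketch {

    /** Copies the section into the new state only when it actually exists. */
    static Map<String, Object> buildNewState(Map<String, Object> current, String key, Object section) {
        Map<String, Object> next = new HashMap<>(current);
        if (section != null) {
            next.put(key, section); // an empty collection is fine, only null is skipped
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, Object> empty = new HashMap<>();
        // Never had persistent tasks: nothing is written.
        System.out.println(buildNewState(empty, "persistent_tasks", null).containsKey("persistent_tasks"));
        // Had tasks once, none now: the empty collection is written.
        System.out.println(buildNewState(empty, "persistent_tasks", new ArrayList<String>()).containsKey("persistent_tasks"));
    }
}
```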

droberts195 · 2019-04-11T08:13:04Z

Although this cannot affect rolling upgrades to 7.x (because to do such an upgrade you have to first upgrade to 6.7), I don't see why it couldn't affect full cluster restart upgrades from 6.0-6.5 direct to 7.x. Maybe I am missing something though.

Since the fix is so tiny I think it would be best to put it into master, 7.x and 7.0 as well as 6.7. It cannot hurt even if it's unnecessary.

One more point of note: if this is theoretically a problem for full cluster restart upgrades from 6.0-6.5 direct to 7.x then at least it will not affect Cloud, because I believe Cloud is forcing all upgrades to 7.x to go via 6.7.

benwtrent · 2019-04-11T13:12:54Z

@droberts195 for sure. I am going to open PRs for adding the fix to 7.0, 7.X and master. I wanted to get 6.7 opened first as it was where it was discovered first.

Today if an exception is thrown when serializing a cluster state during publication then the master enters a poisoned state where it cannot publish any more cluster states, but nor does it stand down as master, yielding repeated exceptions of the following form:

```
failed to commit cluster state version [12345]
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publishing failed
    at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1045) ~[elasticsearch-7.0.0.jar:7.0.0]
    at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:252) [elasticsearch-7.0.0.jar:7.0.0]
    at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:238) [elasticsearch-7.0.0.jar:7.0.0]
    at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.0.0.jar:7.0.0]
    at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.0.0.jar:7.0.0]
    at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.0.0.jar:7.0.0]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-7.0.0.jar:7.0.0]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.0.0.jar:7.0.0]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.0.0.jar:7.0.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: cannot start publishing next value before accepting previous one
    at org.elasticsearch.cluster.coordination.CoordinationState.handleClientValue(CoordinationState.java:280) ~[elasticsearch-7.0.0.jar:7.0.0]
    at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1030) ~[elasticsearch-7.0.0.jar:7.0.0]
    ... 11 more
```

This is because it already created the publication request using `CoordinationState#handleClientValue()` but then it fails before accepting it. This commit addresses this by performing the serialization before calling `handleClientValue()`. Relates elastic#41090, which was the source of such a serialization exception.
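The ordering change described in that commit message can be illustrated with a toy publisher that allows one publication in flight at a time. None of this is the real `Coordinator` or `CoordinationState`: the `Publisher` class, its `reserve()` step (standing in for `handleClientValue()`), and the null check used as a fake serialization failure are all assumptions made for the sketch.

```java
import java.nio.charset.StandardCharsets;

/** Toy model of the ordering problem; NOT the Elasticsearch Coordinator. */
final class SerializeFirstSketch {

    /** Hypothetical publisher that allows only one publication in flight at a time. */
    static final class Publisher {
        private boolean inFlight = false;

        /** Old ordering: reserve the publication, then serialize. */
        void publishReserveFirst(Object state) {
            reserve();
            serialize(state); // if this throws, inFlight stays true forever
            inFlight = false; // publication accepted
        }

        /** New ordering: serialize, then reserve. */
        void publishSerializeFirst(Object state) {
            serialize(state); // if this throws, nothing has been reserved yet
            reserve();
            inFlight = false; // publication accepted
        }

        private void reserve() {
            if (inFlight) {
                throw new IllegalStateException("cannot start publishing next value before accepting previous one");
            }
            inFlight = true;
        }

        private static byte[] serialize(Object state) {
            if (state == null) {
                throw new NullPointerException("cannot serialize null state");
            }
            return state.toString().getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Publishes a bad state, then a good one, and reports whether the publisher recovered. */
    static String outcome(boolean serializeFirst) {
        Publisher publisher = new Publisher();
        try {
            publish(publisher, null, serializeFirst);
        } catch (NullPointerException expected) {
            // the simulated serialization failure
        }
        try {
            publish(publisher, "good state", serializeFirst);
            return "recovered";
        } catch (IllegalStateException stuck) {
            return "poisoned";
        }
    }

    private static void publish(Publisher publisher, Object state, boolean serializeFirst) {
        if (serializeFirst) {
            publisher.publishSerializeFirst(state);
        } else {
            publisher.publishReserveFirst(state);
        }
    }

    public static void main(String[] args) {
        System.out.println("reserve first:   " + outcome(false));
        System.out.println("serialize first: " + outcome(true));
    }
}
```

The general point is to do the work that can fail before mutating any state that later calls depend on.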

Today you can add a null `Custom` to the cluster state or its metadata, but attempting to publish such a cluster state will fail. Unfortunately, the publication-time failure gives very little information about the source of the problem. This change causes the failure to manifest earlier and adds information about which `Custom` was null in order to simplify the investigation. Relates elastic#41090.
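A minimal sketch of what "reject at build time, and say which one was null" can look like, using `Objects.requireNonNull` with a message. The `StateBuilderSketch` class, its `putCustom` method, and the message wording are hypothetical and are not taken from the Elasticsearch builders.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Toy builder that refuses null sections up front; NOT the Elasticsearch builder. */
final class StateBuilderSketch {

    private final Map<String, Object> customs = new HashMap<>();

    /** Fails immediately, naming the offending section, instead of failing later at publication. */
    StateBuilderSketch putCustom(String type, Object custom) {
        customs.put(type, Objects.requireNonNull(custom, "custom [" + type + "] must not be null"));
        return this;
    }

    /** Returns a copy of the sections collected so far. */
    Map<String, Object> build() {
        return new HashMap<>(customs);
    }

    public static void main(String[] args) {
        try {
            new StateBuilderSketch().putCustom("ml", null);
        } catch (NullPointerException e) {
            System.out.println(e.getMessage());
        }
    }
}
```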

benwtrent added >bug :ml Machine learning v6.7.2 labels Apr 10, 2019

benwtrent mentioned this issue Apr 10, 2019

[ML] checking if p-tasks metadata is null before updating state #41091

Merged

This was referenced Apr 11, 2019

[ML] checking if p-tasks metadata is null before updating state (#41091) #41122

Merged

[ML] checking if p-tasks metadata is null before updating state (#41091) #41123

Merged

[ML] checking if p-tasks metadata is null before updating state (#41091) #41124

Merged

benwtrent closed this as completed in #41124 Apr 11, 2019

DaveCTurner mentioned this issue May 3, 2019

Handle serialization exceptions during publication #41781

Merged

DaveCTurner mentioned this issue May 3, 2019

Reject null customs at build time #41782

Merged
