Enable rolling upgrades from default distribution prior to 6.3.0 to default distribution post 6.3.0 #30731

jasontedor · 2018-05-18T15:13:21Z

This is a meta-issue to track the work needed to enable smooth upgrades from the default distribution prior to 6.3.0 to the default distribution post 6.3.0. The sub-tasks are:

add detection when we are communicating with a >= 6.3 transport client so that we can avoid sending it pieces of the cluster state that it would not be able to deserialize (the problem here is when an OSS transport client connects to a 6.3.0 default cluster) and avoid sending pieces of the cluster state that clients might not be able to understand Introduce client feature tracking #31020 @jasontedor
same thing for < 6.3 transport clients @ywelsch Fix interoperability with < 6.3 transport clients #30971
add a node attribute that X-Pack injects into the node attributes so that we can detect when we are connected to a node that understands X-Pack metadata in the cluster state @ywelsch Only allow x-pack metadata if all nodes are ready #30743 Allow rollup job creation only if cluster is x-pack ready #30963
reject cluster state updates that contain X-Pack metadata when the cluster is not prepared for it (i.e., if there is not already X-Pack metadata in the cluster state and there are nodes in the cluster state than can not understand X-Pack metadata based on the above attribute) @ywelsch Only allow x-pack metadata if all nodes are ready #30743
fix watcher template that uses custom index setting as an OSS master in a mixed cluster will not be able to add this template and repeatedly fail the PUT template request initiated by the x-pack node, spamming the logs @ywelsch Move watcher-history version setting to _meta field #30832
ensure that all features (e.g., ML today) that rely on custom cluster state metadata can properly handle that metadata not being present (e.g., during a rolling upgrade scenario) @droberts195 [ML] Don't install empty ML metadata on startup #30751
add a rolling upgrade test for 3 nodes (we have one for 2 nodes today) to trigger this scenario during testing @nik9000 QA: Switch rolling upgrade to 3 nodes #30728
fix PersistentTaskParams so that it knows about the versions etc. Assume for example a mixed 6.2 / 6.3 x-pack cluster. If you start a rollup task, this will be put as persistent task into the cluster state. A 6.2 x-pack node cannot deserialize this task. @bleskes Make Persistent Tasks implementations version and feature aware #31045
add more tests (e.g. that rollups cannot be created in a mixed OSS/X-Pack cluster. Mixed-cluster X-pack tests) @nik9000 QA: Check rollup job creation safety #31036

elasticmachine · 2018-05-18T15:13:25Z

Pinging @elastic/es-core-infra

elasticmachine · 2018-05-18T15:13:29Z

Pinging @elastic/es-distributed

elasticmachine · 2018-05-18T15:13:33Z

Pinging @elastic/ml-core

This change is to support rolling upgrade from a pre-6.3 default distribution (i.e. without X-Pack) to a 6.3+ default distribution (i.e. with X-Pack). The ML metadata is no longer eagerly added to the cluster state as soon as the master node has X-Pack available. Instead, it is added when the first ML job is created. As a result all methods that get the ML metadata need to be able to handle the situation where there is no ML metadata in the current cluster state. They do this by behaving as though an empty ML metadata was present. This logic is encapsulated by always asking for the current ML metadata using a static method on the MlMetadata class. Relates elastic#30731

bleskes · 2018-05-21T11:32:32Z

Thank for putting this list up. I think we should also deal with the TokenMetaData injected here when called from here.

This change is to support rolling upgrade from a pre-6.3 default distribution (i.e. without X-Pack) to a 6.3+ default distribution (i.e. with X-Pack). The ML metadata is no longer eagerly added to the cluster state as soon as the master node has X-Pack available. Instead, it is added when the first ML job is created. As a result all methods that get the ML metadata need to be able to handle the situation where there is no ML metadata in the current cluster state. They do this by behaving as though an empty ML metadata was present. This logic is encapsulated by always asking for the current ML metadata using a static method on the MlMetadata class. Relates #30731

bleskes · 2018-05-21T14:16:15Z

I think we should also deal with the TokenMetaData injected here when called from here.

It looks like @ywelsch took care of it in https://github.com/elastic/elasticsearch/pull/30743/files

This change is to support rolling upgrade from a pre-6.3 default distribution (i.e. without X-Pack) to a 6.3+ default distribution (i.e. with X-Pack). The ML metadata is no longer eagerly added to the cluster state as soon as the master node has X-Pack available. Instead, it is added when the first ML job is created. As a result all methods that get the ML metadata need to be able to handle the situation where there is no ML metadata in the current cluster state. They do this by behaving as though an empty ML metadata was present. This logic is encapsulated by always asking for the current ML metadata using a static method on the MlMetadata class. Relates #30731

nik9000 · 2018-05-23T17:41:19Z

Now that #30743 is merged I wanted to test this. The 6.3 branch works perfectly for me. The 6.x branch is failing though. That probably isn't a 6.3 release blocker but it is weird. The failure comes during the 5.6.10 upgrade to 6.x. The failure is:

[2018-05-23T12:59:33,101][ERROR][o.e.x.w.s.WatcherIndexTemplateRegistry] [node-0] Error adding watcher template [.watch-history-8]
org.elasticsearch.transport.RemoteTransportException: [node-2][127.0.0.1:41731][indices:admin/template/put]
Caused by: java.lang.IllegalArgumentException: unknown setting [index.xpack.watcher.template.version] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
        at org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:293) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:256) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:246) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.action.admin.indices.template.put.TransportPutIndexTemplateAction.masterOperation(TransportPutIndexTemplateAction.java:80) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.action.admin.indices.template.put.TransportPutIndexTemplateAction.masterOperation(TransportPutIndexTemplateAction.java:42) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:87) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]

The test that actually fails is " org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Test old multi type stuff}" but it only fails because one of its actions times out because the cluster is busy trying the thing above over and over again.

ywelsch · 2018-05-23T18:41:28Z

@nik9000 This is a real problem (the x-pack node tries to add a template with x-pack only settings, and the OSS master rejects it). I'm not sure why it's not triggered by the test on the 6.3 branch as the same templating behavior exists there (for Watcher, Security etc.) as well. I consider this a blocker for 6.3 and will work on a solution tomorrow.

nik9000 · 2018-05-23T18:45:09Z

I consider this a blocker for 6.3 and will work on a solution tomorrow.

❤️

It might not come up in 6.3 because I got lucky the couple of times I ran it. Maybe the 6.3 node won the master election.

It might still be a problem in the 6.3 branch but not bad enough to slow the tests down enough to fail. I'll see if I can write a test that outright fails without this. I didn't want to have to use watcher in these tests because it has so much state, but I suspect I have no choice here.

jasontedor · 2018-05-23T18:52:26Z

I consider this a blocker for 6.3 and will work on a solution tomorrow.

+1, this is a blocker.

ywelsch · 2018-05-24T07:49:51Z

The issue with Watcher is that it uses a custom setting in its template. I've gone through the other XPack services to check if they present the same issue:

SecurityIndexManager adds templates (security-index-template.json) but no custom settings in template
IndexAuditTrail adds templates (security_audit_log.json) but no custom settings in template
LocalExporter adds templates (monitoring-*.json) but no custom settings. It also adds an ingest pipeline, but again no customs.
WatcherIndexTemplateRegistry adds templates (triggered-watches.json, watch-history.json, watches.json). Only watch-history uses a custom setting (xpack.watcher.template.version).

I'll explore getting rid of the xpack.watcher.template.version setting, and using the same approach as has been used for the other templates (e.g. security-index-template or security_audit_log.json) where there's a custom _meta field in the mapping.

ywelsch · 2018-05-24T08:36:21Z

I've opened #30832 for the watcher issue.

ywelsch · 2018-05-24T09:07:56Z

It might not come up in 6.3 because I got lucky the couple of times I ran it. Maybe the 6.3 node won the master election.

@nik9000 I've run the mixed-cluster tests a few times on 6.3, and I've seen this exception spamming the logs. The tests not failing on 6.3 are more of an indication that we need to add more tests.

nik9000 · 2018-05-24T12:19:43Z

@nik9000 I've run the mixed-cluster tests a few times on 6.3, and I've seen this exception spamming the logs. The tests not failing on 6.3 are more of an indication that we need to add more tests.

I figured as much. Earlier I'd said:

It might still be a problem in the 6.3 branch but not bad enough to slow the tests down enough to fail. I'll see if I can write a test that outright fails without this.

and that is still my plan. I got distracted by other things and didn't end up writing the test.

Adds a test that we create the appropriate x-pack templates during the rolling restart from the pre-6.2 OSS-zip distribution to the new zip distribution that contains xpack. This is one way to answer the question "does xpack acting sanely during the rolling upgrade and after it?" It isn't as good as full exercising xpack but it is fairly simple and would have caught elastic#30832. Relates to elastic#30731

nik9000 · 2018-05-24T21:08:45Z

So in my grand tradition of finding things, I believe the following is flaky:

git checkout 6.x
while ./gradlew -p qa/rolling-upgrade/ check -Dtests.distribution=zip; do echo ok; done

On my desktop about half of those runs fail with:

[2018-05-24T16:51:55,044][INFO ][o.e.c.s.MasterService    ] [node-0] zen-disco-elected-as-master ([2] nodes joined)[, ], reason: new_master {node-0}{EI796DscQYWC8OejKoXa5Q}{ikSXu_0GTiaygCs4aLzfug}{127.0.0.1}{127
.0.0.1:34129}{ml.machine_memory=33651564544, xpack.installed=true, testattr=test, ml.max_open_jobs=20, ml.enabled=true}
[2018-05-24T16:51:55,060][INFO ][o.e.c.s.ClusterApplierService] [node-0] new_master {node-0}{EI796DscQYWC8OejKoXa5Q}{ikSXu_0GTiaygCs4aLzfug}{127.0.0.1}{127.0.0.1:34129}{ml.machine_memory=33651564544, xpack.insta
lled=true, testattr=test, ml.max_open_jobs=20, ml.enabled=true}, reason: apply cluster state (from master [master {node-0}{EI796DscQYWC8OejKoXa5Q}{ikSXu_0GTiaygCs4aLzfug}{127.0.0.1}{127.0.0.1:34129}{ml.machine_m
emory=33651564544, xpack.installed=true, testattr=test, ml.max_open_jobs=20, ml.enabled=true} committed version [751] source [zen-disco-elected-as-master ([2] nodes joined)[, ]]])
[2018-05-24T16:51:55,085][WARN ][o.e.d.z.ZenDiscovery     ] [node-0] zen-disco-failed-to-publish, current nodes: nodes: 
   {node-0}{GsF5lxmmSQiemBzLM6Csbw}{lu_kNBYlROqRavg-7XxHSg}{127.0.0.1}{127.0.0.1:40615}{testattr=test, ml.machine_memory=33651564544, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
   {node-0}{EI796DscQYWC8OejKoXa5Q}{ikSXu_0GTiaygCs4aLzfug}{127.0.0.1}{127.0.0.1:34129}{ml.machine_memory=33651564544, xpack.installed=true, testattr=test, ml.max_open_jobs=20, ml.enabled=true}, local, master
   {node-2}{ZOcWRGW1QNaBzGNXwigiBw}{xiHQaAk_ToS9Y52SnpY4Ww}{127.0.0.1}{127.0.0.1:42023}{testattr=test, gen=old, ml.machine_memory=33651564544, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}

[2018-05-24T16:51:55,085][WARN ][o.e.c.s.MasterService    ] [node-0] failing [maybe generate license for cluster]: failed to commit cluster state version [752]
org.elasticsearch.discovery.Discovery$FailedToCommitClusterStateException: failed to get enough masters to ack sent cluster state. [1] left
        at org.elasticsearch.discovery.zen.PublishClusterStateAction$SendingController.waitForCommit(PublishClusterStateAction.java:525) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.PublishClusterStateAction.innerPublish(PublishClusterStateAction.java:196) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.PublishClusterStateAction.publish(PublishClusterStateAction.java:161) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.ZenDiscovery.publish(ZenDiscovery.java:336) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:225) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:132) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:625) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:844) [?:?]
[2018-05-24T16:51:55,086][ERROR][o.e.l.StartupSelfGeneratedLicenseTask] [node-0] unexpected failure during [maybe generate license for cluster]
org.elasticsearch.discovery.Discovery$FailedToCommitClusterStateException: failed to get enough masters to ack sent cluster state. [1] left
        at org.elasticsearch.discovery.zen.PublishClusterStateAction$SendingController.waitForCommit(PublishClusterStateAction.java:525) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.PublishClusterStateAction.innerPublish(PublishClusterStateAction.java:196) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.PublishClusterStateAction.publish(PublishClusterStateAction.java:161) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.ZenDiscovery.publish(ZenDiscovery.java:336) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:225) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:132) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:625) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:844) [?:?]

Which looks pretty incriminating. There are totally three 3 available, master eligible nodes. It just logged about them.

But this is 6.x testing the upgrade from 6.3. I've seen no trouble going from 5.6 to 6.x or 6.2 going to 6.x. Go figure.

nik9000 · 2018-05-24T21:11:07Z

The other nodes see NotMasterException, but I think they are deserializing that from the master. Which doesn't think it is a master.

ywelsch · 2018-05-25T10:38:31Z

@nik9000 With my fix in #30859, I have this consistently passing now

ywelsch · 2018-05-25T11:19:12Z

The fix only helps in case of a rolling restart, so the mixed-cluster tests still fail with this fix on 6.x. I will have to discuss with @elastic/es-security what can be done about the mixed-cluster situation.

Adds a test that we create the appropriate x-pack templates during the rolling restart from the pre-6.2 OSS-zip distribution to the new zip distribution that contains xpack. This is one way to answer the question "does xpack acting sanely during the rolling upgrade and after it?" It isn't as good as full exercising xpack but it is fairly simple and would have caught #30832. Relates to #30731

ywelsch · 2018-05-30T16:23:44Z

Reminder to self: We also need to fix PersistentTaskParams so that it knows about the versions etc. Assume for example a mixed 6.2 / 6.3 x-pack cluster. If you start a rollup task, this will be put as persistent task into the cluster state. A 6.2 x-pack node cannot deserialize this task.

With the default distribution changing in 6.3, clusters might now contain custom metadata that a pure OSS transport client cannot deserialize. As this can break transport clients when accessing the cluster state or reroute APIs, we've decided to exclude any custom metadata that the transport client might not be able to deserialize. This will ensure compatibility between a < 6.3 transport client and a 6.3 default distribution cluster. Note that this PR only covers interoperability with older clients, another follow-up PR will cover full interoperability for >= 6.3 transport clients where we will make it possible again to get the custom metadata from the cluster state. Relates to #30731

nik9000 · 2018-06-01T14:51:29Z

add more tests (e.g. that rollups cannot be created in a mixed OSS/X-Pack cluster. Mixed-cluster X-pack tests)

I'll take this.

With #31020 we introduced the ability for transport clients to indicate what features they support in order to make sure we don't serialize object to them they don't support. This PR adapts the serialization logic of persistent tasks to be aware of those features and not serialize tasks that aren't supported. Also, a version check is added for the future where we may add new tasks implementations and need to be able to indicate they shouldn't be serialized both to nodes and clients. As the implementation relies on the interface of `PersistentTaskParams`, these are no longer optional. That's acceptable as all current implementation have them and we plan to make `PersistentTaskParams` more central in the future. Relates to #30731

danielmitterdorfer · 2018-06-04T11:50:45Z

We have another test failure that I believe belongs here as well (there are several tests failing but they appear to do so for the same reason).

The first one of them has the reproduction line:

./gradlew :qa:rolling-upgrade:v6.3.0-SNAPSHOT#twoThirdsUpgradedTestRunner -Dtests.seed=F070E6B561B268B2 -Dtests.class=org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT -Dtests.method="test {p0=mixed_cluster/10_basic/Verify custom cluster metadata still exists during upgrade}" -Dtests.security.manager=true -Dtests.locale=nl-BE -Dtests.timezone=America/Cuiaba -Dtests.distribution=zip -Dtests.rest.suite=mixed_cluster

Click arrow for failure details

07:23:15 ERROR   67.1s | UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Verify custom cluster metadata still exists during upgrade} <<< FAILURES!
07:23:15    > Throwable #1: org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:40576], URI [/], status line [HTTP/1.1 503 Service Unavailable]
07:23:15    > {
07:23:15    >   "name" : "node-0",
07:23:15    >   "cluster_name" : "rolling-upgrade",
07:23:15    >   "cluster_uuid" : "OhN80TdsRXmqjvybzQA48A",
07:23:15    >   "version" : {
07:23:15    >     "number" : "6.4.0",
07:23:15    >     "build_flavor" : "default",
07:23:15    >     "build_type" : "zip",
07:23:15    >     "build_hash" : "1eede11",
07:23:15    >     "build_date" : "2018-06-04T06:30:39.454194Z",
07:23:15    >     "build_snapshot" : true,
07:23:15    >     "lucene_version" : "7.4.0",
07:23:15    >     "minimum_wire_compatibility_version" : "5.6.0",
07:23:15    >     "minimum_index_compatibility_version" : "5.0.0"
07:23:15    >   },
07:23:15    >   "tagline" : "You Know, for Search"
07:23:15    > }
07:23:15    > 	at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:821)
07:23:15    > 	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:182)
07:23:15    > 	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:227)
07:23:15    > 	at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.readVersionsFromInfo(ESClientYamlSuiteTestCase.java:282)
07:23:15    > 	at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.initAndResetContext(ESClientYamlSuiteTestCase.java:106)
07:23:15    > 	at java.lang.Thread.run(Thread.java:748)
07:23:15    > Caused by: org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:40576], URI [/], status line [HTTP/1.1 503 Service Unavailable]
07:23:15    > {
07:23:15    >   "name" : "node-0",
07:23:15    >   "cluster_name" : "rolling-upgrade",
07:23:15    >   "cluster_uuid" : "OhN80TdsRXmqjvybzQA48A",
07:23:15    >   "version" : {
07:23:15    >     "number" : "6.4.0",
07:23:15    >     "build_flavor" : "default",
07:23:15    >     "build_type" : "zip",
07:23:15    >     "build_hash" : "1eede11",
07:23:15    >     "build_date" : "2018-06-04T06:30:39.454194Z",
07:23:15    >     "build_snapshot" : true,
07:23:15    >     "lucene_version" : "7.4.0",
07:23:15    >     "minimum_wire_compatibility_version" : "5.6.0",
07:23:15    >     "minimum_index_compatibility_version" : "5.0.0"
07:23:15    >   },
07:23:15    >   "tagline" : "You Know, for Search"
07:23:15    > }
07:23:15    > 	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:495)
07:23:15    > 	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:484)
07:23:15    > 	at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
07:23:15    > 	at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
07:23:15    > 	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
07:23:15    > 	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
07:23:15    > 	at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
07:23:15    > 	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
07:23:15    > 	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
07:23:15    > 	at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
07:23:15    > 	at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
07:23:15    > 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
07:23:15    > 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
07:23:15    > 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
07:23:15    > 	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
07:23:15    > 	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
07:23:15    > 	... 1 moreThrowable #2: java.lang.AssertionError: there are still running tasks:
07:23:15    > {time_in_queue=15ms, time_in_queue_millis=15, source=zen-disco-elected-as-master ([2] nodes joined), executing=true, priority=URGENT, insert_order=185}
07:23:15    > {time_in_queue=1ms, time_in_queue_millis=1, source=install-token-metadata, executing=false, priority=URGENT, insert_order=186}
07:23:15    > {time_in_queue=0s, time_in_queue_millis=0, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=187}
07:23:15    > 	at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:767)
07:23:15    > 	at org.elasticsearch.test.rest.ESRestTestCase.waitForClusterStateUpdatesToFinish(ESRestTestCase.java:338)
07:23:15    > 	at org.elasticsearch.test.rest.ESRestTestCase.cleanUpCluster(ESRestTestCase.java:151)
07:23:15    > 	at java.lang.Thread.run(Thread.java:748)
07:23:15    > 	Suppressed: java.lang.AssertionError: there are still running tasks:
07:23:15    > {time_in_queue=86ms, time_in_queue_millis=86, source=cluster_reroute(async_shard_fetch), executing=true, priority=HIGH, insert_order=42}
07:23:15    > {time_in_queue=71ms, time_in_queue_millis=71, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46}
07:23:15    > {time_in_queue=4ms, time_in_queue_millis=4, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47}
07:23:15    > 		at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
07:23:15    > 		... 37 more
07:23:15    > 	Suppressed: java.lang.AssertionError: there are still running tasks:
07:23:15    > {time_in_queue=96ms, time_in_queue_millis=96, source=cluster_reroute(async_shard_fetch), executing=true, priority=HIGH, insert_order=42}
07:23:15    > {time_in_queue=81ms, time_in_queue_millis=81, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46}
07:23:15    > {time_in_queue=14ms, time_in_queue_millis=14, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47}
07:23:15    > 		at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
07:23:15    > 		... 37 more
07:23:15    > 	Suppressed: java.lang.AssertionError: there are still running tasks:
07:23:15    > {time_in_queue=102ms, time_in_queue_millis=102, source=cluster_reroute(async_shard_fetch), executing=true, priority=HIGH, insert_order=42}
07:23:15    > {time_in_queue=86ms, time_in_queue_millis=86, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46}
07:23:15    > {time_in_queue=20ms, time_in_queue_millis=20, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47}
07:23:15    > 		at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
07:23:15    > 		... 37 more
07:23:15    > 	Suppressed: java.lang.AssertionError: there are still running tasks:
07:23:15    > {time_in_queue=109ms, time_in_queue_millis=109, source=cluster_reroute(async_shard_fetch), executing=true, priority=HIGH, insert_order=42}
07:23:15    > {time_in_queue=94ms, time_in_queue_millis=94, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46}
07:23:15    > {time_in_queue=27ms, time_in_queue_millis=27, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47}
07:23:15    > 		at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
07:23:15    > 		... 37 more
07:23:15    > 	Suppressed: java.lang.AssertionError: there are still running tasks:
07:23:15    > {time_in_queue=106ms, time_in_queue_millis=106, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46}
07:23:15    > {time_in_queue=40ms, time_in_queue_millis=40, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47}
07:23:15    > {time_in_queue=0s, time_in_queue_millis=0, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=49}
07:23:15    > 		at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347)
07:23:15    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755)
07:23:15    > 		... 37 more

Full cluster logs are available in rolling-upgrade-cluster-logs.zip

With #31020 we introduced the ability for transport clients to indicate what features they support in order to make sure we don't serialize object to them they don't support. This PR adapts the serialization logic of persistent tasks to be aware of those features and not serialize tasks that aren't supported. Also, a version check is added for the future where we may add new tasks implementations and need to be able to indicate they shouldn't be serialized both to nodes and clients. As the implementation relies on the interface of `PersistentTaskParams`, these are no longer optional. That's acceptable as all current implementation have them and we plan to make `PersistentTaskParams` more central in the future. Relates to #30731

nik9000 · 2018-06-04T15:41:30Z

I had a look at @danielmitterdorfer's failure. A few things:

This is an xpack to xpack upgrade from 6.3 to 6.x
I read the logs to say that:
a. Everything goes fine until the we upgrade the second node.
b. Once the second upgraded node comes online the first upgraded node is elected the master.
c. It starts doing housekeeping like upgrading templates.
d. It can't sync the cluster state

I'd expect there to be a failure somewhere in the log describing how the cluster state sync failed. But I can't find one. All of the exceptions have to do with the restarts and the cluster not having a valid master after the incident.

ywelsch · 2018-06-05T09:44:38Z

@nik9000 @danielmitterdorfer this will be fixed by #30859. It's not blocking the 6.3 release, but the 6.4 release.

Allows rolling restart from 6.3 to 6.4. Relates to #30731 and #30251

nik9000 · 2018-06-05T18:49:00Z

I've opened #31112 to make the x-pack upgrade tests (all three of them) use three nodes. It isn't perfect but it is about as complex as I'd like to get and still backport to 6.3.

nik9000 · 2018-06-07T14:49:34Z

So I merged #31112 to master and 6.x yesterday but that caused all kinds of problems. I'm trying to un-break them now. I'll merge to 6.3 once everything is calmer in the branches I've already merged to.

nik9000 · 2018-06-08T18:05:00Z

I've finished backporting #31112 to 6.3. We can see how it does over the weekend.

#31195 is still open to enable one of the tests after the backport but it is a upgrade from 5.6.10 to 6.3 test so I think we're fairly ok. It is almost certainly a test bug.

nik9000 · 2018-06-11T15:11:40Z

The weekend went well as far as the backwards compatibility builds goes! I'm happy to say that the upgrades looked great. I think I'm done here.

jasontedor · 2018-06-12T00:47:39Z

Thank you all that contributed to the effort here, this was a great effort all around.

Closed by the hard work of a lot of people

ywelsch mentioned this issue May 19, 2018

Only allow x-pack metadata if all nodes are ready #30743

Merged

droberts195 mentioned this issue May 21, 2018

[ML] Don't install empty ML metadata on startup #30751

Merged

ywelsch mentioned this issue May 24, 2018

Move watcher-history version setting to _meta field #30832

Merged

nik9000 mentioned this issue May 24, 2018

QA: Test template creation during rolling restart #30850

Merged

jasontedor mentioned this issue May 25, 2018

Do not return metadata customs by default #30857

Closed

This was referenced May 29, 2018

Use dedicated ML APIs in tests #30941

Merged

Remove metadata customs that can break serialization #30945

Closed

ywelsch mentioned this issue May 30, 2018

Fix interoperability with < 6.3 transport clients #30971

Merged

jasontedor mentioned this issue Jun 1, 2018

Introduce client feature tracking #31020

Merged

jasontedor closed this as completed in #31020 Jun 1, 2018

ywelsch reopened this Jun 1, 2018

bleskes mentioned this issue Jun 2, 2018

Make Persistent Tasks implementations version and feature aware #31045

Merged

ywelsch added a commit that referenced this issue Jun 5, 2018

Only auto-update license signature if all nodes ready (#30859)

3b98c26

Allows rolling restart from 6.3 to 6.4. Relates to #30731 and #30251

ywelsch added a commit that referenced this issue Jun 5, 2018

Only auto-update license signature if all nodes ready (#30859)

97b3b2d

Allows rolling restart from 6.3 to 6.4. Relates to #30731 and #30251

jasontedor closed this as completed Jun 12, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable rolling upgrades from default distribution prior to 6.3.0 to default distribution post 6.3.0 #30731

Enable rolling upgrades from default distribution prior to 6.3.0 to default distribution post 6.3.0 #30731

jasontedor commented May 18, 2018 •

edited by nik9000

Loading

elasticmachine commented May 18, 2018

elasticmachine commented May 18, 2018

elasticmachine commented May 18, 2018

bleskes commented May 21, 2018

bleskes commented May 21, 2018

nik9000 commented May 23, 2018

ywelsch commented May 23, 2018

nik9000 commented May 23, 2018

jasontedor commented May 23, 2018

ywelsch commented May 24, 2018

ywelsch commented May 24, 2018

ywelsch commented May 24, 2018

nik9000 commented May 24, 2018

nik9000 commented May 24, 2018

nik9000 commented May 24, 2018

ywelsch commented May 25, 2018

ywelsch commented May 25, 2018

ywelsch commented May 30, 2018

nik9000 commented Jun 1, 2018

danielmitterdorfer commented Jun 4, 2018

nik9000 commented Jun 4, 2018

ywelsch commented Jun 5, 2018

nik9000 commented Jun 5, 2018

nik9000 commented Jun 7, 2018 •

edited

Loading

nik9000 commented Jun 8, 2018

nik9000 commented Jun 11, 2018

jasontedor commented Jun 12, 2018

Enable rolling upgrades from default distribution prior to 6.3.0 to default distribution post 6.3.0 #30731

Enable rolling upgrades from default distribution prior to 6.3.0 to default distribution post 6.3.0 #30731

Comments

jasontedor commented May 18, 2018 • edited by nik9000 Loading

elasticmachine commented May 18, 2018

elasticmachine commented May 18, 2018

elasticmachine commented May 18, 2018

bleskes commented May 21, 2018

bleskes commented May 21, 2018

nik9000 commented May 23, 2018

ywelsch commented May 23, 2018

nik9000 commented May 23, 2018

jasontedor commented May 23, 2018

ywelsch commented May 24, 2018

ywelsch commented May 24, 2018

ywelsch commented May 24, 2018

nik9000 commented May 24, 2018

nik9000 commented May 24, 2018

nik9000 commented May 24, 2018

ywelsch commented May 25, 2018

ywelsch commented May 25, 2018

ywelsch commented May 30, 2018

nik9000 commented Jun 1, 2018

danielmitterdorfer commented Jun 4, 2018

nik9000 commented Jun 4, 2018

ywelsch commented Jun 5, 2018

nik9000 commented Jun 5, 2018

nik9000 commented Jun 7, 2018 • edited Loading

nik9000 commented Jun 8, 2018

nik9000 commented Jun 11, 2018

jasontedor commented Jun 12, 2018

jasontedor commented May 18, 2018 •

edited by nik9000

Loading

nik9000 commented Jun 7, 2018 •

edited

Loading