-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable rolling upgrades from default distribution prior to 6.3.0 to default distribution post 6.3.0 #30731
Comments
Pinging @elastic/es-core-infra |
Pinging @elastic/es-distributed |
Pinging @elastic/ml-core |
This change is to support rolling upgrade from a pre-6.3 default distribution (i.e. without X-Pack) to a 6.3+ default distribution (i.e. with X-Pack). The ML metadata is no longer eagerly added to the cluster state as soon as the master node has X-Pack available. Instead, it is added when the first ML job is created. As a result all methods that get the ML metadata need to be able to handle the situation where there is no ML metadata in the current cluster state. They do this by behaving as though an empty ML metadata was present. This logic is encapsulated by always asking for the current ML metadata using a static method on the MlMetadata class. Relates elastic#30731
Thank for putting this list up. I think we should also deal with the TokenMetaData injected here when called from here. |
This change is to support rolling upgrade from a pre-6.3 default distribution (i.e. without X-Pack) to a 6.3+ default distribution (i.e. with X-Pack). The ML metadata is no longer eagerly added to the cluster state as soon as the master node has X-Pack available. Instead, it is added when the first ML job is created. As a result all methods that get the ML metadata need to be able to handle the situation where there is no ML metadata in the current cluster state. They do this by behaving as though an empty ML metadata was present. This logic is encapsulated by always asking for the current ML metadata using a static method on the MlMetadata class. Relates #30731
It looks like @ywelsch took care of it in https://github.com/elastic/elasticsearch/pull/30743/files |
This change is to support rolling upgrade from a pre-6.3 default distribution (i.e. without X-Pack) to a 6.3+ default distribution (i.e. with X-Pack). The ML metadata is no longer eagerly added to the cluster state as soon as the master node has X-Pack available. Instead, it is added when the first ML job is created. As a result all methods that get the ML metadata need to be able to handle the situation where there is no ML metadata in the current cluster state. They do this by behaving as though an empty ML metadata was present. This logic is encapsulated by always asking for the current ML metadata using a static method on the MlMetadata class. Relates #30731
This change is to support rolling upgrade from a pre-6.3 default distribution (i.e. without X-Pack) to a 6.3+ default distribution (i.e. with X-Pack). The ML metadata is no longer eagerly added to the cluster state as soon as the master node has X-Pack available. Instead, it is added when the first ML job is created. As a result all methods that get the ML metadata need to be able to handle the situation where there is no ML metadata in the current cluster state. They do this by behaving as though an empty ML metadata was present. This logic is encapsulated by always asking for the current ML metadata using a static method on the MlMetadata class. Relates #30731
Now that #30743 is merged I wanted to test this. The 6.3 branch works perfectly for me. The 6.x branch is failing though. That probably isn't a 6.3 release blocker but it is weird. The failure comes during the 5.6.10 upgrade to 6.x. The failure is:
The test that actually fails is " org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Test old multi type stuff}" but it only fails because one of its actions times out because the cluster is busy trying the thing above over and over again. |
@nik9000 This is a real problem (the x-pack node tries to add a template with x-pack only settings, and the OSS master rejects it). I'm not sure why it's not triggered by the test on the 6.3 branch as the same templating behavior exists there (for Watcher, Security etc.) as well. I consider this a blocker for 6.3 and will work on a solution tomorrow. |
❤️ It might not come up in 6.3 because I got lucky the couple of times I ran it. Maybe the 6.3 node won the master election. It might still be a problem in the 6.3 branch but not bad enough to slow the tests down enough to fail. I'll see if I can write a test that outright fails without this. I didn't want to have to use watcher in these tests because it has so much state, but I suspect I have no choice here. |
+1, this is a blocker. |
The issue with Watcher is that it uses a custom setting in its template. I've gone through the other XPack services to check if they present the same issue:
I'll explore getting rid of the |
I've opened #30832 for the watcher issue. |
@nik9000 I've run the mixed-cluster tests a few times on 6.3, and I've seen this exception spamming the logs. The tests not failing on 6.3 are more of an indication that we need to add more tests. |
I figured as much. Earlier I'd said:
and that is still my plan. I got distracted by other things and didn't end up writing the test. |
Adds a test that we create the appropriate x-pack templates during the rolling restart from the pre-6.2 OSS-zip distribution to the new zip distribution that contains xpack. This is one way to answer the question "does xpack acting sanely during the rolling upgrade and after it?" It isn't as good as full exercising xpack but it is fairly simple and would have caught elastic#30832. Relates to elastic#30731
So in my grand tradition of finding things, I believe the following is flaky:
On my desktop about half of those runs fail with:
Which looks pretty incriminating. There are totally three 3 available, master eligible nodes. It just logged about them. But this is 6.x testing the upgrade from 6.3. I've seen no trouble going from 5.6 to 6.x or 6.2 going to 6.x. Go figure. |
The other nodes see |
The fix only helps in case of a rolling restart, so the mixed-cluster tests still fail with this fix on 6.x. I will have to discuss with @elastic/es-security what can be done about the mixed-cluster situation. |
Adds a test that we create the appropriate x-pack templates during the rolling restart from the pre-6.2 OSS-zip distribution to the new zip distribution that contains xpack. This is one way to answer the question "does xpack acting sanely during the rolling upgrade and after it?" It isn't as good as full exercising xpack but it is fairly simple and would have caught #30832. Relates to #30731
Adds a test that we create the appropriate x-pack templates during the rolling restart from the pre-6.2 OSS-zip distribution to the new zip distribution that contains xpack. This is one way to answer the question "does xpack acting sanely during the rolling upgrade and after it?" It isn't as good as full exercising xpack but it is fairly simple and would have caught #30832. Relates to #30731
Reminder to self: We also need to fix |
With the default distribution changing in 6.3, clusters might now contain custom metadata that a pure OSS transport client cannot deserialize. As this can break transport clients when accessing the cluster state or reroute APIs, we've decided to exclude any custom metadata that the transport client might not be able to deserialize. This will ensure compatibility between a < 6.3 transport client and a 6.3 default distribution cluster. Note that this PR only covers interoperability with older clients, another follow-up PR will cover full interoperability for >= 6.3 transport clients where we will make it possible again to get the custom metadata from the cluster state. Relates to #30731
With the default distribution changing in 6.3, clusters might now contain custom metadata that a pure OSS transport client cannot deserialize. As this can break transport clients when accessing the cluster state or reroute APIs, we've decided to exclude any custom metadata that the transport client might not be able to deserialize. This will ensure compatibility between a < 6.3 transport client and a 6.3 default distribution cluster. Note that this PR only covers interoperability with older clients, another follow-up PR will cover full interoperability for >= 6.3 transport clients where we will make it possible again to get the custom metadata from the cluster state. Relates to #30731
With the default distribution changing in 6.3, clusters might now contain custom metadata that a pure OSS transport client cannot deserialize. As this can break transport clients when accessing the cluster state or reroute APIs, we've decided to exclude any custom metadata that the transport client might not be able to deserialize. This will ensure compatibility between a < 6.3 transport client and a 6.3 default distribution cluster. Note that this PR only covers interoperability with older clients, another follow-up PR will cover full interoperability for >= 6.3 transport clients where we will make it possible again to get the custom metadata from the cluster state. Relates to #30731
I'll take this. |
With #31020 we introduced the ability for transport clients to indicate what features they support in order to make sure we don't serialize object to them they don't support. This PR adapts the serialization logic of persistent tasks to be aware of those features and not serialize tasks that aren't supported. Also, a version check is added for the future where we may add new tasks implementations and need to be able to indicate they shouldn't be serialized both to nodes and clients. As the implementation relies on the interface of `PersistentTaskParams`, these are no longer optional. That's acceptable as all current implementation have them and we plan to make `PersistentTaskParams` more central in the future. Relates to #30731
We have another test failure that I believe belongs here as well (there are several tests failing but they appear to do so for the same reason). The first one of them has the reproduction line:
Click arrow for failure details07:23:15 ERROR 67.1s | UpgradeClusterClientYamlTestSuiteIT.test {p0=mixed_cluster/10_basic/Verify custom cluster metadata still exists during upgrade} <<< FAILURES! 07:23:15 > Throwable #1: org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:40576], URI [/], status line [HTTP/1.1 503 Service Unavailable] 07:23:15 > { 07:23:15 > "name" : "node-0", 07:23:15 > "cluster_name" : "rolling-upgrade", 07:23:15 > "cluster_uuid" : "OhN80TdsRXmqjvybzQA48A", 07:23:15 > "version" : { 07:23:15 > "number" : "6.4.0", 07:23:15 > "build_flavor" : "default", 07:23:15 > "build_type" : "zip", 07:23:15 > "build_hash" : "1eede11", 07:23:15 > "build_date" : "2018-06-04T06:30:39.454194Z", 07:23:15 > "build_snapshot" : true, 07:23:15 > "lucene_version" : "7.4.0", 07:23:15 > "minimum_wire_compatibility_version" : "5.6.0", 07:23:15 > "minimum_index_compatibility_version" : "5.0.0" 07:23:15 > }, 07:23:15 > "tagline" : "You Know, for Search" 07:23:15 > } 07:23:15 > at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:821) 07:23:15 > at org.elasticsearch.client.RestClient.performRequest(RestClient.java:182) 07:23:15 > at org.elasticsearch.client.RestClient.performRequest(RestClient.java:227) 07:23:15 > at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.readVersionsFromInfo(ESClientYamlSuiteTestCase.java:282) 07:23:15 > at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.initAndResetContext(ESClientYamlSuiteTestCase.java:106) 07:23:15 > at java.lang.Thread.run(Thread.java:748) 07:23:15 > Caused by: org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:40576], URI [/], status line [HTTP/1.1 503 Service Unavailable] 07:23:15 > { 07:23:15 > "name" : "node-0", 07:23:15 > "cluster_name" : "rolling-upgrade", 07:23:15 > "cluster_uuid" : "OhN80TdsRXmqjvybzQA48A", 07:23:15 > "version" : { 07:23:15 > "number" : "6.4.0", 07:23:15 > "build_flavor" : "default", 07:23:15 > "build_type" : "zip", 07:23:15 > "build_hash" : "1eede11", 07:23:15 > "build_date" : "2018-06-04T06:30:39.454194Z", 07:23:15 > "build_snapshot" : true, 07:23:15 > "lucene_version" : "7.4.0", 07:23:15 > "minimum_wire_compatibility_version" : "5.6.0", 07:23:15 > "minimum_index_compatibility_version" : "5.0.0" 07:23:15 > }, 07:23:15 > "tagline" : "You Know, for Search" 07:23:15 > } 07:23:15 > at org.elasticsearch.client.RestClient$1.completed(RestClient.java:495) 07:23:15 > at org.elasticsearch.client.RestClient$1.completed(RestClient.java:484) 07:23:15 > at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119) 07:23:15 > at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177) 07:23:15 > at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436) 07:23:15 > at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326) 07:23:15 > at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265) 07:23:15 > at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) 07:23:15 > at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) 07:23:15 > at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114) 07:23:15 > at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) 07:23:15 > at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) 07:23:15 > at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) 07:23:15 > at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) 07:23:15 > at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) 07:23:15 > at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) 07:23:15 > ... 1 moreThrowable #2: java.lang.AssertionError: there are still running tasks: 07:23:15 > {time_in_queue=15ms, time_in_queue_millis=15, source=zen-disco-elected-as-master ([2] nodes joined), executing=true, priority=URGENT, insert_order=185} 07:23:15 > {time_in_queue=1ms, time_in_queue_millis=1, source=install-token-metadata, executing=false, priority=URGENT, insert_order=186} 07:23:15 > {time_in_queue=0s, time_in_queue_millis=0, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=187} 07:23:15 > at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347) 07:23:15 > at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:767) 07:23:15 > at org.elasticsearch.test.rest.ESRestTestCase.waitForClusterStateUpdatesToFinish(ESRestTestCase.java:338) 07:23:15 > at org.elasticsearch.test.rest.ESRestTestCase.cleanUpCluster(ESRestTestCase.java:151) 07:23:15 > at java.lang.Thread.run(Thread.java:748) 07:23:15 > Suppressed: java.lang.AssertionError: there are still running tasks: 07:23:15 > {time_in_queue=86ms, time_in_queue_millis=86, source=cluster_reroute(async_shard_fetch), executing=true, priority=HIGH, insert_order=42} 07:23:15 > {time_in_queue=71ms, time_in_queue_millis=71, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46} 07:23:15 > {time_in_queue=4ms, time_in_queue_millis=4, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47} 07:23:15 > at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347) 07:23:15 > at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755) 07:23:15 > ... 37 more 07:23:15 > Suppressed: java.lang.AssertionError: there are still running tasks: 07:23:15 > {time_in_queue=96ms, time_in_queue_millis=96, source=cluster_reroute(async_shard_fetch), executing=true, priority=HIGH, insert_order=42} 07:23:15 > {time_in_queue=81ms, time_in_queue_millis=81, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46} 07:23:15 > {time_in_queue=14ms, time_in_queue_millis=14, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47} 07:23:15 > at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347) 07:23:15 > at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755) 07:23:15 > ... 37 more 07:23:15 > Suppressed: java.lang.AssertionError: there are still running tasks: 07:23:15 > {time_in_queue=102ms, time_in_queue_millis=102, source=cluster_reroute(async_shard_fetch), executing=true, priority=HIGH, insert_order=42} 07:23:15 > {time_in_queue=86ms, time_in_queue_millis=86, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46} 07:23:15 > {time_in_queue=20ms, time_in_queue_millis=20, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47} 07:23:15 > at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347) 07:23:15 > at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755) 07:23:15 > ... 37 more 07:23:15 > Suppressed: java.lang.AssertionError: there are still running tasks: 07:23:15 > {time_in_queue=109ms, time_in_queue_millis=109, source=cluster_reroute(async_shard_fetch), executing=true, priority=HIGH, insert_order=42} 07:23:15 > {time_in_queue=94ms, time_in_queue_millis=94, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46} 07:23:15 > {time_in_queue=27ms, time_in_queue_millis=27, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47} 07:23:15 > at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347) 07:23:15 > at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755) 07:23:15 > ... 37 more 07:23:15 > Suppressed: java.lang.AssertionError: there are still running tasks: 07:23:15 > {time_in_queue=106ms, time_in_queue_millis=106, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=46} 07:23:15 > {time_in_queue=40ms, time_in_queue_millis=40, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=47} 07:23:15 > {time_in_queue=0s, time_in_queue_millis=0, source=maybe generate license for cluster, executing=false, priority=NORMAL, insert_order=49} 07:23:15 > at org.elasticsearch.test.rest.ESRestTestCase.lambda$waitForClusterStateUpdatesToFinish$0(ESRestTestCase.java:347) 07:23:15 > at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:755) 07:23:15 > ... 37 more Full cluster logs are available in rolling-upgrade-cluster-logs.zip |
With #31020 we introduced the ability for transport clients to indicate what features they support in order to make sure we don't serialize object to them they don't support. This PR adapts the serialization logic of persistent tasks to be aware of those features and not serialize tasks that aren't supported. Also, a version check is added for the future where we may add new tasks implementations and need to be able to indicate they shouldn't be serialized both to nodes and clients. As the implementation relies on the interface of `PersistentTaskParams`, these are no longer optional. That's acceptable as all current implementation have them and we plan to make `PersistentTaskParams` more central in the future. Relates to #30731
With #31020 we introduced the ability for transport clients to indicate what features they support in order to make sure we don't serialize object to them they don't support. This PR adapts the serialization logic of persistent tasks to be aware of those features and not serialize tasks that aren't supported. Also, a version check is added for the future where we may add new tasks implementations and need to be able to indicate they shouldn't be serialized both to nodes and clients. As the implementation relies on the interface of `PersistentTaskParams`, these are no longer optional. That's acceptable as all current implementation have them and we plan to make `PersistentTaskParams` more central in the future. Relates to #30731
I had a look at @danielmitterdorfer's failure. A few things:
I'd expect there to be a failure somewhere in the log describing how the cluster state sync failed. But I can't find one. All of the exceptions have to do with the restarts and the cluster not having a valid master after the incident. |
@nik9000 @danielmitterdorfer this will be fixed by #30859. It's not blocking the 6.3 release, but the 6.4 release. |
I've opened #31112 to make the x-pack upgrade tests (all three of them) use three nodes. It isn't perfect but it is about as complex as I'd like to get and still backport to 6.3. |
So I merged #31112 to master and 6.x yesterday but that caused all kinds of problems. I'm trying to un-break them now. I'll merge to 6.3 once everything is calmer in the branches I've already merged to. |
The weekend went well as far as the backwards compatibility builds goes! I'm happy to say that the upgrades looked great. I think I'm done here. |
Thank you all that contributed to the effort here, this was a great effort all around. Closed by the hard work of a lot of people |
This is a meta-issue to track the work needed to enable smooth upgrades from the default distribution prior to 6.3.0 to the default distribution post 6.3.0. The sub-tasks are:
The text was updated successfully, but these errors were encountered: