
Features not matching version after an upgrade to 8.13+ #109254

Closed
thbkrkr opened this issue May 31, 2024 · 16 comments · Fixed by #110710
Assignees: thecoop
Labels: >bug, :Core/Infra/Core, :Distributed Coordination/Cluster Coordination, low-risk, Team:Core/Infra, Team:Distributed, v8.13.0

Comments

@thbkrkr
Contributor

thbkrkr commented May 31, 2024

Elasticsearch Version

8.13.x

  • Java Version: bundled

  • OS Version: N/A (different k8s versions)

Problem Description

ECK operator 2.12.1 fails to upgrade Elasticsearch to 8.13+ because it stalls on the following error when calling the desired nodes API:

400 Bad Request: {
    Reason: [node_version] field is required and must have a valid value 
    Type: x_content_parse_exception
    ...
}
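
For context, the request in question looks roughly like the sketch below. This is a hypothetical reproduction in Python rather than the operator's actual (Go) code; the endpoint path and field names follow the documented desired nodes API, but the concrete values and credentials are made up.

    # Hypothetical reproduction: send a desired-nodes update without the
    # deprecated node_version field, the way ECK >= 2.12.1 does against an
    # 8.13+ cluster.
    import requests

    ES_URL = "https://localhost:9200"   # adjust for your cluster
    HISTORY_ID = "eck-example"          # hypothetical history id
    VERSION = 1                         # monotonically increasing version

    payload = {
        "nodes": [
            {
                "settings": {"node.name": "example-es-default-0"},
                "processors": 4,
                "memory": "8gb",
                "storage": "100gb",
                # no "node_version" here: if the cluster-wide
                # desired_node.version_deprecated feature check fails because
                # some nodes are missing the feature, the request is rejected
                # with the x_content_parse_exception shown above
            }
        ]
    }

    resp = requests.put(
        f"{ES_URL}/_internal/desired_nodes/{HISTORY_ID}/{VERSION}",
        json=payload,
        auth=("elastic", "<password>"),
        verify=False,  # example only; verify certificates in real use
    )
    print(resp.status_code, resp.text)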

Steps to Reproduce

The steps should be (not tested yet):

  • Deploy ECK 2.12.1+
  • Deploy Elasticsearch 8.11.x?
  • Upgrade to Elasticsearch 8.13+

Notes

Note: it is starting from ECK 2.12.1 that the operator stops using the deprecated node_version field when the cluster is 8.13+.
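
Conceptually, the operator-side gate looks something like the illustrative Python sketch below (ECK itself is written in Go; the function and field names here are hypothetical):

    # Illustrative sketch of the operator-side gate, not actual ECK code.
    def build_desired_node(node_spec: dict, cluster_version: tuple[int, int, int]) -> dict:
        desired = {
            "settings": node_spec["settings"],
            "processors": node_spec["processors"],
            "memory": node_spec["memory"],
            "storage": node_spec["storage"],
        }
        # From ECK 2.12.1 onwards, node_version is only sent to clusters
        # below 8.13.0, where the field is still required.
        if cluster_version < (8, 13, 0):
            desired["node_version"] = ".".join(str(p) for p in cluster_version)
        return desired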

Two occurrences of this issue have been reported, for these version upgrades:

  • 8.11.x -> 8.13.3
  • 8.11.4 -> 8.13.2

Each time, the users confirmed that:

  • some nodes did not have the desired_node.version_deprecated feature (a way to check this is sketched below)
  • a restart fixed the issue
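
One way to check the first point is to look at which nodes the elected master reports the feature for. The sketch below is a rough Python diagnostic; the nodes_features key is an assumption about how node features are serialized in the cluster state response and may differ between versions.

    # Diagnostic sketch: list which nodes report desired_node.version_deprecated.
    # The "nodes_features" key name is an assumption, not a documented contract.
    import requests

    ES_URL = "https://localhost:9200"
    FEATURE = "desired_node.version_deprecated"

    state = requests.get(
        f"{ES_URL}/_cluster/state",
        auth=("elastic", "<password>"),
        verify=False,  # example only
    ).json()

    for entry in state.get("nodes_features", []):
        node_id = entry.get("node_id")
        status = "OK" if FEATURE in entry.get("features", []) else "MISSING"
        print(f"{node_id}: {status}")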
@thbkrkr thbkrkr added the >bug and needs:triage labels on May 31, 2024
@rjernst rjernst added the :Core/Infra/Core label and removed the needs:triage label on May 31, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@thecoop
Member

thecoop commented Jun 13, 2024

@thbkrkr To help debug this, the next time this occurs it would be really helpful to have the order in which the nodes were upgraded, as well as logs from all the upgraded nodes.

@thecoop thecoop added the medium-risk label on Jun 13, 2024
@thecoop
Member

thecoop commented Jun 13, 2024

We already have some accounting to make sure features are set properly when a master node is upgraded, but it looks like this is going awry.

@thecoop thecoop self-assigned this Jun 13, 2024
@thecoop
Member

thecoop commented Jun 13, 2024

This does not reproduce readily, I suspect it very much depends on the order nodes are restarted. Continuing to investigate.

@mikeprince3

mikeprince3 commented Jun 29, 2024

I encountered this issue today while migrating a cluster into k8s using ECK and version 8.14.1 of Elasticsearch. If you can provide an idea of what logs you would like, I can get them for you. For context, this is a 15 node cluster running on GKE. The three dedicated master nodes did have the desired_node.version_deprecated feature but none of the other nodes did. Killing the pods one by one did resolve the problem as indicated in a comment above.

In the screenshot below, lXbHD9JaRlSP51wL2D442Q is a node with a data role (not master-eligible) and tTBB9pOuRN6Q1LpVUJvujQ is a dedicated master node.

Note: this was an empty cluster (v8.14.1) and then I restored a snapshot to it from a v7.17.22 cluster. ECK v2.13.0

[screenshot: per-node features reported for the two node IDs above]

@thecoop
Member

thecoop commented Jul 1, 2024

I just need all the logs of the nodes from when the upgrade started to when it completed

@thecoop
Member

thecoop commented Jul 1, 2024

@mikeprince3 Just spotted you're not an Elastician, so don't have access to our infrastructure. If you can provide a tarball of the log files somewhere that would be very helpful; if you don't have anywhere to upload them to please let me know and we can sort something out.

@mikeprince3

@thecoop Everything is logged to GCP, but I'll see what I can do to share or export that. I haven't tried attaching to the pods to pull the logs directly either, but I'll give that a shot too. I assume you're looking for logs from both ECK and the ES nodes?

@thecoop
Member

thecoop commented Jul 2, 2024

Just the ES nodes will do; this is a bug in Elasticsearch.

@mikeprince3

@thecoop I pinged you on LinkedIn. I can send you a link to the logs through a private message there. I'm open to alternatives if you prefer.

@mikeprince3

@thecoop uploaded as requested. There are 15 nodes in the cluster so I only included logs from the two node ids in my screenshot above. The tTBB logs are for one of the masters and the lXbH ones are for one of the non-master-eligible data nodes that was missing the features. The logs are exports from GCP in both csv and json format. Let me know if you need logs from some of the other nodes.

@thecoop
Member

thecoop commented Jul 3, 2024

@mikeprince3 Thanks for the logs. This bug is around the exact order in which nodes are restarted and get elected to master, so could you send the logs for the other two master nodes, and maybe one node that was unaffected by this bug for comparison?

@mikeprince3

@thecoop No problem. I was able to combine the logs across the three masters and export them as a single file. Hopefully you'll be able to view the logs as they happened instead of jumping between log files (plus this was way easier for me). I was also able to pull the first 10k log records for the entire cluster during the initial startup so maybe that context can help as well.

Regarding the unaffected nodes, the three masters were the only ones that seemed to have any entries in their features array. I don't have a screenshot of it, but the rest of the nodes were all blank, as best I can recall.

Note: I don't think this part is relevant but wanted to explain what you will see in the logs. Our cluster has 6 nodes called elasticsearch-es-data and another 6 called elasticsearch-es-utility. The only difference between those two sets of nodes is that the utility nodes have the attribute node.attr.purpose: utility, which allows us to allocate certain indices to those nodes.

@thecoop thecoop added the low-risk and :Distributed Coordination/Cluster Coordination labels and removed the medium-risk label on Jul 4, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed label on Jul 4, 2024
@thecoop
Member

thecoop commented Jul 4, 2024

Looks like this bug happens when a cluster containing non-master-eligible nodes is upgraded. The non-master-eligible nodes do not go through the same codepath in NodeJoinExecutor on the upgraded master as the master-eligible nodes, leading to features not being upgraded as expected. Restarting the relevant nodes afterwards leads to the correct features being registered.
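
For readers following along, here is a much-simplified illustration of the mechanism described above. The real logic lives in Elasticsearch's NodeJoinExecutor (Java); the Python below, including every name in it, is an illustrative sketch rather than the actual implementation.

    # Simplified model of join-time feature handling, not Elasticsearch code.
    ASSUMED_FEATURES = {"desired_node.version_deprecated"}

    def process_join(cluster_features: dict[str, set[str]], node_id: str,
                     declared_features: set[str], via_join_executor: bool) -> None:
        features = set(declared_features)
        if via_join_executor:
            # An upgraded master fills in features that pre-upgrade nodes are
            # assumed to support when it processes their join.
            features |= ASSUMED_FEATURES
        # The bug described above: non-master-eligible nodes could end up in
        # the cluster state without passing through this inference, leaving
        # their feature set empty until the node was restarted.
        cluster_features[node_id] = features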

@thecoop
Member

thecoop commented Jul 16, 2024

Thanks for the logs @mikeprince3. We've merged a fix that will be in 8.15.
