
Features not matching version after an upgrade to 8.13+ #109254

Closed
thbkrkr opened this issue May 31, 2024 · 16 comments · Fixed by #110710
Assignees: thecoop
Labels: >bug, :Core/Infra/Core, :Distributed Coordination/Cluster Coordination, low-risk, Team:Core/Infra, Team:Distributed, v8.13.0

Comments

@thbkrkr
Contributor

thbkrkr commented May 31, 2024

Elasticsearch Version

8.13.x

  • Java Version: bundled

  • OS Version: N/A (different k8s versions)

Problem Description

ECK operator 2.12.1 fails to upgrade Elasticsearch to 8.13+ because it stalls on the following error when calling the desired nodes API:

400 Bad Request: {
    Reason: [node_version] field is required and must have a valid value 
    Type: x_content_parse_exception
    ...
}
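
For context, the request in question looks roughly like the sketch below. This is a hypothetical reproduction in Python rather than the operator's actual (Go) code; the endpoint path and field names follow the documented desired nodes API, but the concrete values and credentials are made up.

    # Hypothetical reproduction: send a desired-nodes update without the
    # deprecated node_version field, the way ECK >= 2.12.1 does against an
    # 8.13+ cluster.
    import requests

    ES_URL = "https://localhost:9200"   # adjust for your cluster
    HISTORY_ID = "eck-example"          # hypothetical history id
    VERSION = 1                         # monotonically increasing version

    payload = {
        "nodes": [
            {
                "settings": {"node.name": "example-es-default-0"},
                "processors": 4,
                "memory": "8gb",
                "storage": "100gb",
                # no "node_version" here: if the cluster-wide
                # desired_node.version_deprecated feature check fails because
                # some nodes are missing the feature, the request is rejected
                # with the x_content_parse_exception shown above
            }
        ]
    }

    resp = requests.put(
        f"{ES_URL}/_internal/desired_nodes/{HISTORY_ID}/{VERSION}",
        json=payload,
        auth=("elastic", "<password>"),
        verify=False,  # example only; verify certificates in real use
    )
    print(resp.status_code, resp.text)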

Steps to Reproduce

The steps should be (not tested yet):

  • Deploy ECK 2.12.1+
  • Deploy Elasticsearch 8.11.x?
  • Upgrade to Elasticsearch 8.13+

Notes

Note: it is starting from ECK 2.12.1 that the operator stops using the deprecated node_version field when the cluster is 8.13+.
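
Conceptually, the operator-side gate looks something like the illustrative Python sketch below (ECK itself is written in Go; the function and field names here are hypothetical):

    # Illustrative sketch of the operator-side gate, not actual ECK code.
    def build_desired_node(node_spec: dict, cluster_version: tuple[int, int, int]) -> dict:
        desired = {
            "settings": node_spec["settings"],
            "processors": node_spec["processors"],
            "memory": node_spec["memory"],
            "storage": node_spec["storage"],
        }
        # From ECK 2.12.1 onwards, node_version is only sent to clusters
        # below 8.13.0, where the field is still required.
        if cluster_version < (8, 13, 0):
            desired["node_version"] = ".".join(str(p) for p in cluster_version)
        return desired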

Two occurrences of this issue have been reported, for these version upgrades:

  • 8.11.x -> 8.13.3
  • 8.11.4 -> 8.13.2

Each time, the users confirmed that:

  • some nodes did not have the desired_node.version_deprecated feature (a way to check this is sketched below)
  • a restart fixed the issue
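
One way to check the first point is to look at which nodes the elected master reports the feature for. The sketch below is a rough Python diagnostic; the nodes_features key is an assumption about how node features are serialized in the cluster state response and may differ between versions.

    # Diagnostic sketch: list which nodes report desired_node.version_deprecated.
    # The "nodes_features" key name is an assumption, not a documented contract.
    import requests

    ES_URL = "https://localhost:9200"
    FEATURE = "desired_node.version_deprecated"

    state = requests.get(
        f"{ES_URL}/_cluster/state",
        auth=("elastic", "<password>"),
        verify=False,  # example only
    ).json()

    for entry in state.get("nodes_features", []):
        node_id = entry.get("node_id")
        status = "OK" if FEATURE in entry.get("features", []) else "MISSING"
        print(f"{node_id}: {status}")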
@thbkrkr thbkrkr added the >bug and needs:triage labels on May 31, 2024
@rjernst rjernst added the :Core/Infra/Core label and removed the needs:triage label on May 31, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@thecoop
Member

thecoop commented Jun 13, 2024

@thbkrkr To help debug this, the next time this occurs it would be really helpful to have the order in which the nodes were upgraded, as well as logs from all the upgraded nodes.

@thecoop thecoop added the medium-risk label on Jun 13, 2024
@thecoop
Member

thecoop commented Jun 13, 2024

We already have some accounting to make sure features are set properly when a master node is upgraded, but it looks like this is going awry.

@thecoop thecoop self-assigned this Jun 13, 2024
@thecoop
Member

thecoop commented Jun 13, 2024

This does not reproduce readily, I suspect it very much depends on the order nodes are restarted. Continuing to investigate.

@mikeprince3

mikeprince3 commented Jun 29, 2024

I encountered this issue today while migrating a cluster into k8s using ECK and version 8.14.1 of Elasticsearch. If you can provide an idea of what logs you would like, I can get them for you. For context, this is a 15 node cluster running on GKE. The three dedicated master nodes did have the desired_node.version_deprecated feature but none of the other nodes did. Killing the pods one by one did resolve the problem as indicated in a comment above.

In the screenshot below, lXbHD9JaRlSP51wL2D442Q is a node with a data role (not master-eligible) and tTBB9pOuRN6Q1LpVUJvujQ is a dedicated master node.

Note: this was an empty cluster (v8.14.1) and then I restored a snapshot to it from a v7.17.22 cluster. ECK v2.13.0

[screenshot: per-node features reported for the two node IDs above]

@thecoop
Member

thecoop commented Jul 1, 2024

I just need all the logs of the nodes from when the upgrade started to when it completed

@thecoop
Member

thecoop commented Jul 1, 2024

@mikeprince3 Just spotted you're not an Elastician, so don't have access to our infrastructure. If you can provide a tarball of the log files somewhere that would be very helpful; if you don't have anywhere to upload them to please let me know and we can sort something out.

@mikeprince3

@thecoop Everything is logged to GCP, but I'll see what I can do to share or export that. I haven't tried attaching to the pods to pull the logs directly either, but I'll give that a shot too. I assume you're looking for logs from both ECK and the ES nodes?

@thecoop
Member

thecoop commented Jul 2, 2024

Just the ES nodes will do; this is a bug in Elasticsearch.

@mikeprince3

@thecoop I pinged you on LinkedIn. I can send you a link to the logs through a private message there. I'm open to alternatives if you prefer.

@mikeprince3

@thecoop uploaded as requested. There are 15 nodes in the cluster so I only included logs from the two node ids in my screenshot above. The tTBB logs are for one of the masters and the lXbH ones are for one of the non-master-eligible data nodes that was missing the features. The logs are exports from GCP in both csv and json format. Let me know if you need logs from some of the other nodes.

@thecoop
Member

thecoop commented Jul 3, 2024

@mikeprince3 Thanks for the logs. This bug is around the exact order in which nodes are restarted and get elected to master, so could you send the logs for the other two master nodes, and maybe one node that was unaffected by this bug for comparison?

@mikeprince3

@thecoop No problem. I was able to combine the logs across the three masters and export them as a single file. Hopefully you'll be able to view the logs as they happened instead of jumping between log files (plus this was way easier for me). I was also able to pull the first 10k log records for the entire cluster during the initial startup so maybe that context can help as well.

Regarding the unaffected nodes, the three masters were the only ones that seemed to have any entries in their features array. I don't have a screenshot of it, but the rest of the nodes were all blank, as best I can recall.

Note: I don't think this part is relevant but wanted to explain what you will see in the logs. Our cluster has 6 nodes called elasticsearch-es-data and another 6 called elasticsearch-es-utility. The only difference between those two sets of nodes is that the utility nodes have the attribute node.attr.purpose: utility, which allows us to allocate certain indices to those nodes.

@thecoop thecoop added the low-risk and :Distributed Coordination/Cluster Coordination labels and removed the medium-risk label on Jul 4, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed label on Jul 4, 2024
@thecoop
Member

thecoop commented Jul 4, 2024

Looks like this bug happens when a cluster containing non-master-eligible nodes is upgraded. The non-master-eligible nodes do not go through the same codepath in NodeJoinExecutor on the upgraded master as the master-eligible nodes, leading to features not being upgraded as expected. Restarting the relevant nodes afterwards leads to the correct features being registered.
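
For readers following along, here is a much-simplified illustration of the mechanism described above. The real logic lives in Elasticsearch's NodeJoinExecutor (Java); the Python below, including every name in it, is an illustrative sketch rather than the actual implementation.

    # Simplified model of join-time feature handling, not Elasticsearch code.
    ASSUMED_FEATURES = {"desired_node.version_deprecated"}

    def process_join(cluster_features: dict[str, set[str]], node_id: str,
                     declared_features: set[str], via_join_executor: bool) -> None:
        features = set(declared_features)
        if via_join_executor:
            # An upgraded master fills in features that pre-upgrade nodes are
            # assumed to support when it processes their join.
            features |= ASSUMED_FEATURES
        # The bug described above: non-master-eligible nodes could end up in
        # the cluster state without passing through this inference, leaving
        # their feature set empty until the node was restarted.
        cluster_features[node_id] = features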

@thecoop
Member

thecoop commented Jul 16, 2024

Thanks for the logs @mikeprince3. We've merged a fix that will be in 8.15.
