failed to submit plan for evaluation: ... no such key "<snip>" in keyring error after moving cluster to 1.4.1 #14981
Hi @bfqrst! Thanks for opening this issue. The keyring being referred to here is new in 1.4.x and supports the new Workload Identity feature. When a new server joins the cluster, it streams the raft snapshot from the old cluster and also starts keyring replication from it. The keyring replication loop on the server reads key metadata from raft, sees a key it doesn't have in its local keyring, and then sends an RPC to the leader to get that key (falling back to polling all the other peers for the key if the leader doesn't have it, so that we can still get the key even if a leader election immediately followed a new key). What seems to be happening in your case is that the new servers aren't replicating the keyring, which means the leader can't sign the workload identity for the replacement allocations. Do you have any server logs containing the word

We've had a similar report from an internal user in their staging environment as well. Their workflow seems somewhat similar to what you're reporting here, so I want to double-check that:
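For anyone else hitting this, a quick way to look for keyring replication problems in the server logs is a filter like the one below (a minimal sketch, assuming the servers run under a systemd unit named nomad; adjust the unit name or log source for your setup):

```sh
# Scan the last day of server logs for keyring-related errors,
# e.g. replication failures or "no such key" messages.
journalctl -u nomad --since "-24 hours" | grep -iE 'keyring|no such key'
```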
Also, it might help if we could get a stack trace from the servers. You can trigger this via SIGQUIT, which will dump it to stderr. If it's really long, you can email it to [email protected] with a subject line pointing to this issue and I'll see it.
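A rough sketch of capturing that dump, assuming the server runs under systemd and its stderr goes to the journal (adjust unit names and paths for your environment):

```sh
# SIGQUIT makes the Go runtime print all goroutine stack traces to stderr
# and then exit, so expect the server process to restart afterwards.
sudo kill -QUIT "$(pidof nomad)"

# Pull the dump back out of the journal once it has been written.
journalctl -u nomad --since "-10 minutes" > nomad-goroutines.txt
```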
Thanks for looking into it @tgross! So
After that it's basically failed to submit plan for evaluation until infinity.
Nope, that would be one at a time.
Hmm, I don't think it's actively ensured, but between server instance switches I think there's a healthy 5 minutes on a cluster with very few jobs... |
The total lack of |
Okay @tgross I did
From that goroutine dump I see a couple of important bits of information:
Next step for me is to go dig into that rate limiting library and make sure we're not misusing it in some way. I'll report back here later today. |
I've got a draft PR open with one fix I see we need, but I don't think that's at all the cause here: #14987. Investigation continues! |
Just wanted to give a summary of where this is at. #14987 fixes some real bugs I've found while investigating, but I'm not 100% sure that it's the fix we need until I can reproduce and so far I've been unable to. Not having that I'm going to pick this up again tomorrow and consult with some of my colleagues again and see what we can come up with. |
Appreciate the update and the investigation! For the time being, as I have a non-working cluster, what's the best course of action? Since I thankfully don't use any of the new 1.4 features, I'm inclined to roll everything back to 1.3.6... That should work, right?
Once any new raft log type has been written, you can't downgrade. That includes the keyring metadata in this case. 😦
Yes, that will force a rotation and re-replication of the key.
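For reference, a minimal sketch of that workaround (assuming the CLI can reach a server and has the necessary ACL permissions):

```sh
# Generate a new root encryption key on the leader and mark it active.
nomad operator root keyring rotate

# List the known keys and their state to confirm the new key shows up.
nomad operator root keyring list
```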
Have you seen that happen without adding new servers, @doppelc? If so, that's a useful new data point for my investigation.
Yeah, I had to redo the cluster in question; @doppelc's response came after I started cleaning up. It's kinda funny from a reproducibility standpoint, really. It literally hit me in each of my three stages when I went up to 1.4.x. This time around is the fourth occurrence. At first I thought it was the bug that was in 1.4.0 and brushed it off... For now, good to know that I'm able to help myself to some extent with the
Yes, it has happened in both scenarios. |
Based on some feedback from the internal user, I've got a clue that the underlying issue here is actually related to a bug in garbage collection, which is why I wasn't able to reproduce it with fresh test clusters. I'm working on verifying that and I'm hoping to have a fix for that later this afternoon. |
GC behavior fixes are #15009. I still haven't been able to reproduce the reported bug exactly, but this at least explains where the extraneous keys were coming from: not from leader election but from misbehaving GC. That gives me a lot more confidence in the fix in #14987 even if I don't have an exact reproduction yet. |
While I think the GC is triggered by default every 48 hours or so, I have the habit of manually firing |
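For anyone following along, garbage collection can also be forced from the CLI; this is a general illustration rather than necessarily the exact command used above:

```sh
# Force an immediate garbage collection of jobs, evaluations, allocations
# and nodes on the servers (needs management privileges if ACLs are on).
nomad system gc
```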
I have the same issue. After upgrading from 1.3.6 to 1.4.1 it happened right after the upgrade on two separate clusters. It's also happening at seemingly random times, like the middle of the night yesterday. |
We've landed our fixes for this and that'll go out in Nomad 1.4.2, which I'm in the process of prepping. Expect this shortly, and thanks for your patience all! |
Several keyring fixes have shipped in Nomad 1.4.2 which should put this issue to rest. I'm going to close this out but please feel free to let us know here if you're encountering the issue again! Thanks! |
What is the interim solution here? Downgrading to pre-1.4? Today this happened in our cluster, running 1.4.2, and jobs stopped being evaluated. I rotated the keys just as @HINT-SJ did, but it's obviously not an ideal production setup. Can we turn this feature off somehow with settings?
Here's the current status on this issue:
There are a number of interim solutions, in increasing order of complexity/annoyance:
If you've got a staging/test environment where you can do (3) while you're doing (1), that'd be helpful too!
The keyring is used for providing workload identities to allocations and for encrypting Nomad Variables, so there isn't a configuration setting to turn it off.
I did (2), killing the whole prod cluster :D 🤦♂️ Took me only about an hour to get it back up, so not a big deal. (1) unfortunately stopped working after about 4 or 5 rotations. I listed the keys and tried to remove the old ones, but I wasn't able to, the

We currently only have one Nomad deployment; only occasionally do I set up a simple dev cluster when playing with things, but in general no workloads run there so it wouldn't test much. For now I downgraded to the latest 1.3.x release, but man, I'm missing the new 1.4.x UI improvements.
We've got Nomad 1.4.2 running in Dev and this just happened to us, effectively knocking the cluster out. No new or existing jobs can be scheduled and failed evaluations are piling up. This is Dev so I've not scrubbed anything:
@seanamos you didn't provide information on what led to the state (e.g. were new nodes added to the cluster, was there a leader election, or something else?). But it might help me out if you could run
@tgross Unfortunately the timing of this couldn't be worse, as I was busy reworking the monitoring of the Dev cluster today. I will gather what I can. I have already done a

It does look like the keystore is being updated after rotation.
While log collection wasn't working at the time, metrics gathering was. There was a leadership election a couple of hours before the outage (or at least before we noticed it). The point where the raft transaction time starts climbing is where the outage was noticed; it drops back down after I ran the keyring rotation.
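For anyone trying to correlate the same symptom, the raft timing samples can be pulled straight from the agent's metrics endpoint; a rough sketch, assuming the default address, no TLS or ACL token, and jq installed:

```sh
# Snapshot the agent's in-memory metrics and pick out the raft timing
# samples (e.g. nomad.raft.commitTime) to watch for the climb described above.
curl -s http://localhost:4646/v1/metrics | \
  jq '[.Samples[] | select(.Name | contains("raft"))]'
```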
Ok, if you're willing to try out a build from |
With the help of our internal user, I was finally able to reproduce this and even write a test that can cause it to happen consistently. PR is up at #15227. We're going to test this out internally (but please feel free to try that PR in your own acceptance environments!) early next week and hopefully be able to finally put this one to bed. Thanks for your patience folks! |
We've got #15227 merged and will be doing some additional testing internally before 1.4.3 ships. We can't give an exact date on that other than to say we're moving as quickly as feasible on it. |
@tgross is there any sort of hacky fix before 1.4.3 is released? I'm seeing this in a staging environment that I can start over from scratch if I have to, but would rather not if there's a hammer I can bang it with. |
@chrismo there's two sets of workarounds, depending on where you're at, described in #14981 (comment) and #14981 (comment) |
Just a heads up that we're kicking off the process for releasing 1.4.3 shortly, so expect the patch soon. Thanks again for your patience with this, all! |
Hey @tgross, here's a quick report on how the update of the first cluster to v1.4.3 went... TL;DR: had to redo it!

Status quo: up until today, basically after server leader elections, I would check the clients tab in the GUI, and if the #alloc counter on any of the clients read 0, I would routinely issue the

At this point only one job has nomadVars going... So to recap:
At this point I'm pack(er)ing new AMIs with v1.4.3 and terraforming the ASGs. First server goes -> okay, second server goes -> okay, third server goes -> the WI job is listed as pending. Server logs show something along the lines of

Present hour: now I'm a bit reluctant to move on to the second cluster. A couple of thoughts here:
Sooo yeah, not optimal... Are there any pointers as to how we should go about updating this thing securely? Cheers |
Hi @bfqrst! Thanks for the feedback on that. Something that's a little unclear for me here is what happened when you replaced the 3rd server. You said "WI job listed as pending". Was that job's deployment in-flight, or did it get rescheduled for some reason after the 3rd server was replaced, or...? My suspicion here is that the job had allocations with a WI signed by the missing key (or a variable encrypted by it), which wasn't around in the keystore. In 1.4.3 we track a |
As noted, 1.4.3 is out. We'll keep this issue pinned for a little while so that folks can see it if they're having upgrade issues. I'll keep an eye on this thread. |
No, it definitely wasn't in flight, it was already there and stopped the moment the last server went away, and it specifically complained about a key that wasn't there anymore.
So this would be my suspicion too, although I would have thought that an inactive key is truly obsolete and can be safely removed without compromising already-signed vars. But I'm just wild guessing and assuming here... I wasn't aware of the
...on to the next cluster today then...
These are the keys:
Result:
EDIT: keep in mind this was a

Sorry if I'm overly chatty on this matter, but I am trying to keep others from messing up their clusters... Let me know if there's something I can assist with. Cheers
I'm sorry @bfqrst, I'm realizing I gave you bad advice to do a rekey. If you have missing keys, doing a rekey will never work, because we need to decrypt each Variable and then re-encrypt it. We obviously can't decrypt a Variable if we're missing the key we used to encrypt it!
If you've got key material that doesn't exist anywhere on the cluster but still have the key metadata, there's no path to recovering that key or any variables/allocations that need it. So you should be able to do

Your active key
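A sketch of what that cleanup might look like; the commands are the stock keyring CLI, but treat this as an assumption about the intended steps rather than a definitive recovery procedure (<key_id> is a placeholder for whichever key the list reports as unusable):

```sh
# List the keys the cluster knows about, including their state.
nomad operator root keyring list

# Remove the metadata for a key whose material is gone for good.
nomad operator root keyring remove <key_id>
```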
Not at all, this is all helpful! We screwed up the initial implementation, and since this is a stateful system, simply upgrading won't fix everything, so I'm glad we're chatting through recovery here in a way that other folks can use.
That is correct, yes, the log is from a new v1.4.3 server cluster member...
And this file exists, so thumbs up! BUT there is a second file present even after deleting the key that was stuck in
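For context on what "this file" refers to: as far as I understand it, each server keeps its key material on disk under a keystore directory inside its data_dir, one file per key. A quick way to compare what's on disk with what the cluster metadata lists (the /opt/nomad/data prefix is an assumption; substitute your configured data_dir):

```sh
# Key material is stored as one file per key ID under the server's keystore.
ls -l /opt/nomad/data/server/keystore/*.nks.json

# Compare the file names (key IDs) with what the cluster reports.
nomad operator root keyring list
```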
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v1.4.1 (2aa7e66)
Operating system and Environment details
Ubuntu 22.04, Nomad 1.4.1
Issue
After moving the Nomad server and clients to v1.4.1, I noticed that sometimes (unfortunately not always) after cycling Nomad server ASGs and Nomad client ASGs with new AMIs, jobs scheduled on the workers can't be allocated. So to be precise:
This literally never happened before 1.4.X
Client output looks like this:
nomad eval list
ID Priority Triggered By Job ID Namespace Node ID Status Placement Failures
427e9905 50 failed-follow-up plugin-aws-ebs-nodes default pending false
35f4fdfb 50 failed-follow-up plugin-aws-efs-nodes default pending false
46152dcd 50 failed-follow-up spot-drainer default pending false
71e3e58a 50 failed-follow-up plugin-aws-ebs-nodes default pending false
e86177a6 50 failed-follow-up plugin-aws-efs-nodes default pending false
2289ba5f 50 failed-follow-up spot-drainer default pending false
da3fdad6 50 failed-follow-up plugin-aws-ebs-nodes default pending false
b445b976 50 failed-follow-up plugin-aws-efs-nodes default pending false
48a6771e 50 failed-follow-up ingress default pending false
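One way to dig into one of these stuck evaluations further, using the first ID from the list above (assuming the default agent address and no ACL token required):

```sh
# Show the details of a pending failed-follow-up evaluation, including
# any placement failures recorded for it.
nomad eval status 427e9905
```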
Reproduction steps
Unclear at this point. I seem to be able to somewhat force the issue when I cycle the Nomad server ASG with updated AMIs.
Expected Result
Client work that was lost should be rescheduled once the Nomad client comes up and reports readiness.
Actual Result
Lost jobs can't be allocated on workers with an updated AMI.
nomad status
ID Type Priority Status Submit Date
auth-service service 50 pending 2022-10-09T11:32:57+02:00
ingress service 50 pending 2022-10-17T14:57:26+02:00
plugin-aws-ebs-controller service 50 running 2022-10-09T14:48:11+02:00
plugin-aws-ebs-nodes system 50 running 2022-10-09T14:48:11+02:00
plugin-aws-efs-nodes system 50 running 2022-10-09T11:37:04+02:00
prometheus service 50 pending 2022-10-18T21:19:24+02:00
spot-drainer system 50 running 2022-10-11T18:04:49+02:00
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)