Allocation stopped immediately after start #14850
Hi @valodzka and thanks for raising this issue.
This certainly looks identical to #12797, as you pointed out, so cross-linking from this issue is useful; it will provide additional context when that item of work gets roadmapped.
This message is emitted during the allocation reconciliation phase, which is triggered when the scheduler receives an updated job specification. On the surface it therefore seems that something triggered an update, so could you share more information about the deployment process you use for Nomad jobs?
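Not part of the original comment, but one way to check whether the deployment pipeline actually changed the job specification is to inspect the job's version history and diffs (the job name below is a placeholder):

```shell
# Show recent versions of the job along with the diff between each
# submission; an unexpected diff would explain a triggered update.
nomad job history -p example-job

# Compare against one specific earlier version:
nomad job history -p -version=3 example-job
```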
If you have logs available from both the leader and the client where the allocation was placed and immediately stopped, they would provide useful additional context. There may also be some non-obvious messages that give an idea of what is happening.
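As an illustrative sketch (not from the original comment) of how these logs could be gathered, assuming the agents run under systemd:

```shell
# Stream debug-level agent logs live; run once against the leader
# and once against the client that stopped the allocation.
nomad monitor -log-level=DEBUG

# Or pull the persisted agent logs from the journal
# (the date here is illustrative):
journalctl -u nomad --since "2022-10-07" > nomad-agent.log
```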
We use a simple automated system that calls
Leader:
Client:
The issue appeared again. Unfortunately, the allocations were garbage collected before I could inspect them, but in the leader log I see the same message (and it is the first time it has appeared in the log since 2022-10-07T15:09:44.828Z), so it might be related:
Hi @jrasell, I tried to create a reproducer for this issue with Vagrant, and it turns out it's relatively easy to reproduce. Steps:
Hi @valodzka and thanks for the additional details and context.
This is unfortunately a known bug which, despite significant engineering effort, has yet to be resolved. The documentation link within the log line has more information on this and on available workarounds.
This reproducer is very useful, and thanks for taking the time to create it. It will help engineers looking into this issue further down the line. I am now going to move this issue from initial triage onto our backlog.
@jrasell thank you for the reply. I have already read the documentation about this log message, and if I understood correctly it describes a different bug. It states that "it is possible for these log lines to occur infrequently due to normal cluster conditions", and that there is a problem only if there are "repeated log lines". I see this log message only once every few days. Is this considered repeated? Would the workaround with a plan rejection tracker that remembers plans for only 5 minutes help in this case?
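For reference, the plan rejection tracker mentioned here lives in the server stanza of the Nomad agent configuration; the values below are illustrative defaults, not a recommendation from this thread:

```hcl
server {
  enabled = true

  # Tracks repeated plan rejections per node; a node exceeding the
  # threshold within the window is marked ineligible for scheduling.
  plan_rejection_tracker {
    enabled        = true
    node_threshold = 100
    node_window    = "5m"
  }
}
```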
Hi @valodzka, sorry about the delayed response on this. I've picked this problem back up and your reproduction is going to be very helpful. In particular this line may be a vital clue for us, as one of the larger Nomad users who's hit this issue also has
The behavior we've seen from some users is that a particular node gets "stuck" in a state where the planner consistently rejects plans for a given job until the client is restarted. If the retried allocation gets placed successfully on the next plan, you may not be hitting the specific issue that the plan rejection tracker is intended to fix.
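A rough sketch of what the client-restart workaround described above could look like (not prescribed in this thread; the node ID is a placeholder, and the systemd unit name may differ per installation):

```shell
# Drain the stuck node so its running work migrates elsewhere:
nomad node drain -enable -yes <node-id>

# Restart the Nomad client agent on the affected machine:
sudo systemctl restart nomad

# Re-enable scheduling on the node:
nomad node drain -disable -yes <node-id>
```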
This will be closed by #16401, which will ship in Nomad 1.5.1 (plus backports).
Nomad version
1.3.6
Issue
Sometimes during a deployment I see that a freshly started allocation is immediately stopped by Nomad. During the same deployment, multiple allocations with an identical alloc index (name) are created. Example: all allocs share the same Eval ID and Job Version (full alloc status here: https://gist.github.com/valodzka/231f3942203a5f39528b52241905a06e):
Alloc 1:
Alloc 2:
Alloc 3:
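The per-allocation details elided above can be gathered with commands like the following (the IDs are placeholders):

```shell
# Full status for one allocation, including placement metrics:
nomad alloc status -verbose <alloc-id>

# All allocations created by the shared evaluation:
nomad eval status <eval-id>
```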
What is strange here:
Dimension "network: port collision" exhausted on 1 nodes
Not sure if it's relevant here, but I guess it might be related. Issue #12797 might also be relevant.
Expected Result
Nomad Server/Clients logs
I checked logs for both the client and the server and didn't find anything relevant.