Nomad Batch Job Inaccurate Job Summary #13519

Open
espey opened this issue Jun 28, 2022 · 2 comments
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/batch Issues related to batch jobs and scheduling theme/job-summary type/bug

Comments


espey commented Jun 28, 2022

Nomad version

Nomad Version 1.3.1

Issue

One of our Nomad batch jobs seems to have ended up in a strange state with regard to the number of Dead and Running jobs in the Nomad job summary. Below is a screenshot of this state from our cluster:

[Screenshot: Screen Shot 2022-06-28 at 5 36 51 PM]

When running nomad operator api /v1/job/<job_id>/summary, this is the output:
{"JobID":"<job_id>","Namespace":"default","Summary":{"<job>":{"Queued":0,"Complete":0,"Failed":0,"Running":0,"Starting":0,"Lost":0,"Unknown":0}},"Children":{"Pending":0,"Running":-182185,"Dead":891696},"CreateIndex":1514,"ModifyIndex":15192955}
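The anomaly in this output is the negative Children.Running value. A minimal sketch of how such a summary could be checked automatically is shown below. The helper name find_negative_counts is hypothetical and not part of Nomad; the JSON is the output quoted above.

```python
import json


def find_negative_counts(summary):
    """Return (path, value) pairs for every counter that is below zero."""
    problems = []
    # Per-task-group allocation counters live under "Summary".
    for group, counts in (summary.get("Summary") or {}).items():
        for name, value in counts.items():
            if value < 0:
                problems.append((f"Summary.{group}.{name}", value))
    # Child-job counters live under "Children".
    for name, value in (summary.get("Children") or {}).items():
        if value < 0:
            problems.append((f"Children.{name}", value))
    return problems


# Output of `nomad operator api /v1/job/<job_id>/summary` as reported in this issue.
raw = (
    '{"JobID":"<job_id>","Namespace":"default",'
    '"Summary":{"<job>":{"Queued":0,"Complete":0,"Failed":0,"Running":0,'
    '"Starting":0,"Lost":0,"Unknown":0}},'
    '"Children":{"Pending":0,"Running":-182185,"Dead":891696},'
    '"CreateIndex":1514,"ModifyIndex":15192955}'
)

negative = find_negative_counts(json.loads(raw))
print(negative)  # [('Children.Running', -182185)]
```

A count can never legitimately be negative, so any non-empty result is a sign that the stored summary has drifted.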

After running nomad system reconcile summaries, the summary looks much healthier. Running nomad operator api /v1/job/<job_id>/summary now produces this output:
{"JobID":"<job_id>","Namespace":"default","Summary":{},"Children":{"Pending":0,"Running":0,"Dead":132},"CreateIndex":1514,"ModifyIndex":15193265}
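For anyone who wants to trigger the same reconciliation from a script rather than the CLI, here is a hedged sketch that uses Nomad's HTTP endpoint PUT /v1/system/reconcile/summaries. The function names are hypothetical, the default address assumes a local agent, and the token handling assumes ACLs are enabled and a suitably privileged token is available in NOMAD_TOKEN.

```python
import os
import urllib.request


def build_reconcile_request(addr=None, token=None):
    """Build (but do not send) the request that reconciles job summaries."""
    # Assumption: fall back to a local agent when NOMAD_ADDR is unset.
    addr = (addr or os.environ.get("NOMAD_ADDR", "http://127.0.0.1:4646")).rstrip("/")
    token = token or os.environ.get("NOMAD_TOKEN")
    req = urllib.request.Request(
        f"{addr}/v1/system/reconcile/summaries", method="PUT"
    )
    if token:
        # Nomad reads the ACL token from this header.
        req.add_header("X-Nomad-Token", token)
    return req


def reconcile_summaries(addr=None, token=None, timeout=30):
    """Send the request and return the HTTP status code."""
    req = build_reconcile_request(addr, token)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Calling reconcile_summaries() sends the request to the cluster, so it should only be run deliberately. Fetching the job summary again afterwards shows whether the counters were corrected.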

How does Nomad get into this state, and does this mean we will need to run reconcile summaries again at some point in the future?

@espey espey added the type/bug label Jun 28, 2022
Member

jrasell commented Jun 29, 2022

Hi @espey, and thanks for raising this issue. This is certainly a bug, but we will need to investigate what is causing the miscalculation before we can say how the summary gets into this state.

Reconcile summaries resolves this issue because it recalculates all job summaries, which replaces the previous incorrect values. I am not sure whether this will fix the underlying problem. If the summaries do become incorrect again, please let us know, as it will help narrow down the investigation.
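To illustrate why a full recalculation can repair counters that have drifted, here is a toy model. It is not Nomad's code and not a diagnosis of this bug. It assumes the child counters are updated incrementally on each status transition, and it uses a duplicated transition as one hypothetical way for them to go wrong.

```python
from collections import Counter


def apply_transition(summary, old, new):
    """Incrementally update the counters for one child status change."""
    if old is not None:
        summary[old.capitalize()] -= 1
    summary[new.capitalize()] += 1


def recount_children(child_statuses):
    """Rebuild the counters from scratch from the current child statuses."""
    counts = Counter(child_statuses)
    return {
        "Pending": counts["pending"],
        "Running": counts["running"],
        "Dead": counts["dead"],
    }


summary = {"Pending": 0, "Running": 0, "Dead": 0}
apply_transition(summary, None, "running")    # child job starts
apply_transition(summary, "running", "dead")  # child job finishes
apply_transition(summary, "running", "dead")  # same event applied twice (hypothetical fault)
print(summary)                     # {'Pending': 0, 'Running': -1, 'Dead': 2}

# Recounting from the single child that actually exists ignores the bad history.
print(recount_children(["dead"]))  # {'Pending': 0, 'Running': 0, 'Dead': 1}
```

In this model the drifted counters show a negative Running and an inflated Dead count, the same shape as the summary reported above. A recount fixes the numbers without addressing whatever caused the drift.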

Member

tgross commented Jul 25, 2022

Some other issues that look related to this one: #10222 #4731 #10338 #13897.

@mikenomitch mikenomitch added the theme/batch Issues related to batch jobs and scheduling label Dec 6, 2022
@tgross tgross moved this to Needs Roadmapping in Nomad - Community Issues Triage Jun 24, 2024