Negative numbers in "Children Job Summary" for periodic tasks #4731
Comments
Do you have reproduction steps?
We just start and stop VirtualBox instances in various non-deterministic sequences to see how the evaluations are handled by our cluster autoscaling solution.
We then get the following evaluation log:
So after some time we see that 2 child jobs were launched, despite the fact that
Hope this helps. See the log above.
On the leader, at the moment the buggy allocation was launched, the logs contained the following,
but at this time (15:17) we did not start/stop any instances, so this looks like lag in the VirtualBox network.
At the same time, none of the agent nodes had network problems, since their logs are empty at that moment (15:17). First agent instance:
Second agent instance:
Third agent instance:
And the last instance:
@dadgar We reproduced this on a test stand. What we did:
And we again see in the logs:
Our job definition is:
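The definition itself is not captured in this copy of the thread. For illustration only (every name and value below is hypothetical), the kind of job this issue is about — a periodic batch job with prohibit_overlap set, the field the leader.go patch below checks — can be sketched with the Go API client like this:

```go
package main

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

func boolPtr(b bool) *bool    { return &b }
func strPtr(s string) *string { return &s }

func main() {
	// Hypothetical stand-in for a periodic job whose children may outlive the
	// launch interval; this is NOT the job definition used by the reporter.
	job := api.NewBatchJob("periodic-task", "periodic-task", "global", 50)
	job.Periodic = &api.PeriodicConfig{
		Enabled:         boolPtr(true),
		SpecType:        strPtr("cron"),
		Spec:            strPtr("*/10 * * * *"), // launch every 10 minutes
		ProhibitOverlap: boolPtr(true),          // the setting the patch below guards on
	}
	job.AddTaskGroup(api.NewTaskGroup("work", 1))

	fmt.Println("prohibit_overlap:", *job.Periodic.ProhibitOverlap)
}
```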
@dadgar After applying the patch below, an overlapped periodic task is no longer launched unnecessarily when leadership moves to another node, but the patch does not fix the statistics.

diff --git a/nomad/leader.go b/nomad/leader.go
index 119e72ebf..64f782394 100644
--- a/nomad/leader.go
+++ b/nomad/leader.go
@@ -388,6 +388,15 @@ func (s *Server) restorePeriodicDispatcher() error {
continue
}
+ if job.Periodic.ProhibitOverlap {
+ running, err := s.RunningChildren(job)
+ if err != nil {
+ s.logger.Printf("[WARN] Cannot determine whether the periodic job already has running children: %v", err)
+ } else if running {
+ continue
+ }
+ }
+
// If the periodic job has never been launched before, launch will hold
// the time the periodic job was added. Otherwise it has the last launch
// time of the periodic job.

In our case a periodic task may run longer than its launch interval: for example, a task is launched every 10 minutes but runs for 20-30 minutes. In that situation, when leadership moves, an unpatched Nomad launches an additional child job, and this is wrong because a launched job already exists.
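To make the timing concrete, here is a small, self-contained Go illustration of the scenario described above; the interval, run time, and clock values are hypothetical stand-ins, not taken from the logs:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	interval := 10 * time.Minute    // periodic launch interval
	runDuration := 25 * time.Minute // a child actually runs 20-30 minutes

	lastLaunch := time.Date(2018, 10, 8, 15, 0, 0, 0, time.UTC) // last child launched at 15:00
	transition := lastLaunch.Add(17 * time.Minute)              // leadership moves at 15:17

	nextScheduled := lastLaunch.Add(interval)                           // 15:10, already in the past at 15:17
	childStillRunning := transition.Before(lastLaunch.Add(runDuration)) // the 15:00 child ends around 15:25

	// On restore, the dispatcher only sees that a scheduled launch time has passed;
	// without a guard like the one in the patch above it force-launches a second
	// child even though prohibit_overlap is set and the first child is still running.
	fmt.Println("launch time already missed at transition:", nextScheduled.Before(transition)) // true
	fmt.Println("previous child still running:", childStillRunning)                            // true
}
```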
To fix the statistics, an additional patch is needed:

diff --git a/nomad/state/state_store.go b/nomad/state/state_store.go
index af0f0bfa4..3310aef70 100644
--- a/nomad/state/state_store.go
+++ b/nomad/state/state_store.go
@@ -2958,19 +2958,39 @@ func (s *StateStore) ReconcileJobSummaries(index uint64) error {
if err != nil {
return err
}
- for {
- rawJob := iter.Next()
- if rawJob == nil {
- break
- }
+
+ parentChildSummaryMap := make(map[structs.NamespacedID]*structs.JobChildrenSummary)
+
+ for rawJob := iter.Next(); rawJob != nil; rawJob = iter.Next() {
job := rawJob.(*structs.Job)
+ if job.ParentID != "" {
+ parentChildSummary, ok := parentChildSummaryMap[structs.NamespacedID{Namespace: job.Namespace, ID: job.ParentID}]
+ if !ok {
+ parentChildSummary = new(structs.JobChildrenSummary)
+ parentChildSummaryMap[structs.NamespacedID{Namespace: job.Namespace, ID: job.ParentID}] = parentChildSummary
+ }
+
+ // Increment new status
+ switch job.Status {
+ case structs.JobStatusPending:
+ parentChildSummary.Pending++
+ case structs.JobStatusRunning:
+ parentChildSummary.Running++
+ case structs.JobStatusDead:
+ parentChildSummary.Dead++
+ default:
+ return fmt.Errorf("unknown new job status %q", job.Status)
+ }
+ }
+
// Create a job summary for the job
summary := &structs.JobSummary{
JobID: job.ID,
Namespace: job.Namespace,
Summary: make(map[string]structs.TaskGroupSummary),
}
+
for _, tg := range job.TaskGroups {
summary.Summary[tg.Name] = structs.TaskGroupSummary{}
}
@@ -3029,6 +3049,27 @@ func (s *StateStore) ReconcileJobSummaries(index uint64) error {
}
}
+ for namespacedID, parentChildSummary := range parentChildSummaryMap {
+ summaryRaw, err := txn.First("job_summary", "id", namespacedID.Namespace, namespacedID.ID)
+ if err != nil {
+ return fmt.Errorf("unable to retrieve summary for parent job: %v", err)
+ }
+
+ if summaryRaw != nil {
+ existing := summaryRaw.(*structs.JobSummary)
+ pSummary := existing.Copy()
+ pSummary.Children = parentChildSummary
+
+ // Update the index
+ pSummary.ModifyIndex = index
+
+ // Insert the summary
+ if err := txn.Insert("job_summary", pSummary); err != nil {
+ return fmt.Errorf("job summary insert failed: %v", err)
+ }
+ }
+ }
+
// Update the indexes table for job summary
if err := txn.Insert("index", &IndexEntry{"job_summary", index}); err != nil {
return fmt.Errorf("index update failed: %v", err)
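As far as I can tell, ReconcileJobSummaries is the code path behind Nomad's summary-reconciliation endpoint (PUT /v1/system/reconcile/summaries), so with a patch like the one above in place the children counters can be rebuilt on demand. A minimal sketch, assuming a server on the default address and no ACLs:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Ask the local Nomad server to rebuild all job summaries. With the
	// state_store.go patch above applied, this also recomputes the
	// parent/children counters instead of leaving them negative.
	req, err := http.NewRequest(http.MethodPut, "http://127.0.0.1:4646/v1/system/reconcile/summaries", nil)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("reconcile summaries:", resp.Status)
}
```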
I'm seeing this issue as well on 1.3.1:
The number of dead jobs is also a bit surprising, but it might be correct (this job has been deployed for a long time now). I'm sure there are currently 0 running jobs, though.
Nomad version
Nomad v0.8.6 (fcc4149+CHANGES)
Issue
Is it normal that pending and running are negative?
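For anyone who wants to check these counters programmatically rather than in the UI, here is a minimal sketch using the official Go API client (github.com/hashicorp/nomad/api); the job ID is a placeholder:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "periodic-task" is a placeholder for the parent (periodic) job ID.
	summary, _, err := client.Jobs().Summary("periodic-task", nil)
	if err != nil {
		log.Fatal(err)
	}

	if summary.Children != nil {
		// These are the "Children Job Summary" numbers that go negative in this issue.
		fmt.Printf("pending=%d running=%d dead=%d\n",
			summary.Children.Pending, summary.Children.Running, summary.Children.Dead)
	}
}
```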