Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More then one matching SLURM task #163

Closed
albertz opened this issue Dec 18, 2023 · 6 comments · Fixed by #169
Closed

More then one matching SLURM task #163

albertz opened this issue Dec 18, 2023 · 6 comments · Fixed by #169

Comments

@albertz
Copy link
Member

albertz commented Dec 18, 2023

[2023-12-18 15:13:42,425] INFO: queue(5) runnable(1) running(10) waiting(2191)                                                                          
[2023-12-18 15:13:42,448] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ run [1]                                  
[2023-12-18 15:13:42,469] INFO: Submitted with job_id: ['3767058'] i6_core.returnn.training.ReturnnTrainingJob.Mxt0Fq5EAPMJ.run                         
[2023-12-18 15:13:43,234] INFO: Experiment directory: /u/zeyer/setups/combined/2021-05-31      Call: ./sis m recipe/i6_experiments/users/zeyer/experimen
ts/exp2023_04_25_rf/i6.py                                                                                                                               
...
[2023-12-18 15:13:43,408] INFO: queue(6) running(10) waiting(2191)
[2023-12-18 16:07:58,088] INFO: Experiment directory: /u/zeyer/setups/combined/2021-05-31      Call: ./sis m recipe/i6_experiments/users/zeyer/experimen
ts/exp2023_04_25_rf/i6.py
[2023-12-18 16:07:58,114] INFO: running: Job<alias/exp2023_04_25_rf/chunked_aed_import/chunk-C20-R15-H2-bs22k/train work/i6_core/returnn/training/Return
nTrainingJob.yr9RPZ4KpDXG> {ep 32/2000} 
[2023-12-18 16:07:58,129] INFO: running: Job<alias/exp2023_04_25_rf/chunked_ctc/chunk-C20-R15-H2-11gb-f32-bs8k-wrongLr-accgrad1-mgpu4-p100/train work/i6
_core/returnn/training/ReturnnTrainingJob.nGxCOVlNDroS> {ep 13/500} 
[2023-12-18 16:07:58,145] INFO: running: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-lrlin1e
_5_295k/train work/i6_core/returnn/training/ReturnnTrainingJob.6KHl6rvV9hvx> {ep 24/500} 
[2023-12-18 16:07:58,164] INFO: running: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_4-
lrlin1e_5_295k-run2/train work/i6_core/returnn/training/ReturnnTrainingJob.6PFOVJ9fX451> {ep 24/500} 
[2023-12-18 16:07:58,186] INFO: running: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_4-
lrlin1e_5_295k-run3/train work/i6_core/returnn/training/ReturnnTrainingJob.gddacwOynWbN> {ep 24/500} 
[2023-12-18 16:07:58,206] INFO: running: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_5-
lrlin1e_5_295k/train work/i6_core/returnn/training/ReturnnTrainingJob.IeBoLNDUwgWo> {ep 23/500} 
[2023-12-18 16:07:58,235] INFO: running: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg1000-wd1e_4
-lrlin1e_5_295k/train work/i6_core/returnn/training/ReturnnTrainingJob.RrU9BMvLf8gm> {ep 196/500} 
[2023-12-18 16:07:58,292] INFO: running: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg500-wd1e_4-
lrlin1e_5_295k/train work/i6_core/returnn/training/ReturnnTrainingJob.ltAtV1rNBhC6> {ep 202/500} 
[2023-12-18 16:07:58,383] INFO: running: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/v6-11gb-f32-bs15k-accgrad1-mgpu4-wd1e_4-lrlin1e_
5_295k/train work/i6_core/returnn/training/ReturnnTrainingJob.PzLC0RKyvMLA> {ep 491/500} 
[2023-12-18 16:07:58,600] INFO: running: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/v6-11gb-f32-bs15k-accgrad4-mgpu2/train work/i6_c
ore/returnn/training/ReturnnTrainingJob.Iy77vquwmeqX> {ep 1162/2000} 
[2023-12-18 16:07:58,600] INFO: runnable: Job<alias/exp2023_04_25_rf/chunked_ctc/chunk-C20-R15-H2-bs22k/train work/i6_core/returnn/training/ReturnnTrain
ingJob.jMkDWpu5R0VV>
[2023-12-18 16:07:58,600] INFO: runnable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-lrlin1e_5_100k/train work/i6_core/
returnn/training/ReturnnTrainingJob.l2dwBB9n7TqS>
[2023-12-18 16:07:58,600] INFO: runnable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-lrlin1e_5_20k/train work/i6_core/r
eturnn/training/ReturnnTrainingJob.1Bar8z0wWtjq>
[2023-12-18 16:07:58,600] INFO: runnable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-lrlin1e_5_50k/train work/i6_core/r
eturnn/training/ReturnnTrainingJob.por8G6TNlO9E>
[2023-12-18 16:07:58,600] INFO: runnable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-lrlin1e_5_600k/train work/i6_core/
returnn/training/ReturnnTrainingJob.VAf79gjGIxyn>
[2023-12-18 16:07:58,600] INFO: runnable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-warmup100k/train work/i6_core/retu
rnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ>
[2023-12-18 16:07:58,600] INFO: runnable(6) running(10) waiting(2191)
[2023-12-18 16:07:58,603] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ run [1]                                  
[2023-12-18 16:07:58,617] INFO: Submitted with job_id: ['3767268'] i6_core.returnn.training.ReturnnTrainingJob.Mxt0Fq5EAPMJ.run                         
[2023-12-18 16:07:58,893] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.l2dwBB9n7TqS run [1]                                  
[2023-12-18 16:07:58,896] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.1Bar8z0wWtjq run [1]
[2023-12-18 16:07:58,896] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.VAf79gjGIxyn run [1]
[2023-12-18 16:07:58,904] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.por8G6TNlO9E run [1]
[2023-12-18 16:07:58,907] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.jMkDWpu5R0VV run [1]
[2023-12-18 16:07:58,911] INFO: Submitted with job_id: ['3767269'] i6_core.returnn.training.ReturnnTrainingJob.l2dwBB9n7TqS.run
[2023-12-18 16:07:58,911] INFO: Submitted with job_id: ['3767270'] i6_core.returnn.training.ReturnnTrainingJob.VAf79gjGIxyn.run
[2023-12-18 16:07:58,912] INFO: Submitted with job_id: ['3767271'] i6_core.returnn.training.ReturnnTrainingJob.1Bar8z0wWtjq.run
[2023-12-18 16:07:58,918] INFO: Submitted with job_id: ['3767272'] i6_core.returnn.training.ReturnnTrainingJob.por8G6TNlO9E.run
[2023-12-18 16:07:58,920] INFO: Submitted with job_id: ['3767273'] i6_core.returnn.training.ReturnnTrainingJob.jMkDWpu5R0VV.run
[2023-12-18 16:07:59,951] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.l2dwBB9n7TqS.run',
 1) > matches: [('3765376', 'PENDING'), ('3767269', 'PENDING')]
[2023-12-18 16:07:59,954] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.VAf79gjGIxyn.run',
 1) > matches: [('3765374', 'PENDING'), ('3767270', 'PENDING')]
[2023-12-18 16:07:59,956] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.jMkDWpu5R0VV.run',
 1) > matches: [('3764451', 'PENDING'), ('3767273', 'PENDING')]
[2023-12-18 16:07:59,976] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.1Bar8z0wWtjq.run',
 1) > matches: [('3765375', 'PENDING'), ('3767271', 'PENDING')]
[2023-12-18 16:07:59,977] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.Mxt0Fq5EAPMJ.run',
 1) > matches: [('3767058', 'PENDING'), ('3767268', 'PENDING')]
[2023-12-18 16:07:59,996] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.por8G6TNlO9E.run',
 1) > matches: [('3765373', 'PENDING'), ('3767272', 'PENDING')]
[2023-12-18 16:08:01,587] INFO: Experiment directory: /u/zeyer/setups/combined/2021-05-31      Call: ./sis m recipe/i6_experiments/users/zeyer/experimen
ts/exp2023_04_25_rf/i6.py

This is the same message as in #156, but I think the bug (problem) here is different, because from the log, it seems there is no "Error to submit job", and the problem in #156 is also fixed already by #157, and I have seen multiple times that #157 is indeed working as intended.

Maybe it is relevant that when I interrupted the manager here, I got the crash from #164.

@albertz
Copy link
Member Author

albertz commented Dec 18, 2023

This seems really strange to me. You see in the log:

Submitted with job_id: ['3767058'] ...
Submitted with job_id: ['3767268'] ...
...
More then one matching SLURM task ... matches: [('3767058', 'PENDING'), ('3767268', 'PENDING')]

The "Submitted with job_id" message is correct, it is indeed submitted.

@albertz
Copy link
Member Author

albertz commented Dec 18, 2023

Maybe this is some multi-threading bug? I see that submit_call is called from multiple threads.

@critias
Copy link
Contributor

critias commented Jan 2, 2024

Very strange, especially since both submissions are nearly an hour apart make. This basically eliminates the possibility of caching effects. Looks more like sisyphus did not find the job when checking if the job was submitted, but for some reason it found when checking the current status. Which is strange...

@albertz
Copy link
Member Author

albertz commented Jan 9, 2024

I got this again:

...
[2024-01-09 16:52:19,402] WARNING: interrupted_resumable: Job<alias/exp2023_04_25_rf/espnet/v6-24gb-bs30k-wd1e_6-lrlin1e_5_587k-EBranchformer/train work/i6_core/returnn/training/ReturnnTrainingJob.AituPEamOqTq>                                                                                                              
[2024-01-09 16:52:19,402] INFO: error(52) interrupted_resumable(1) queue(4) running(6) waiting(838)                                                             
[2024-01-09 16:52:22,662] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.AituPEamOqTq run [1]                                          
[2024-01-09 16:52:22,678] INFO: Submitted with job_id: ['4057250'] i6_core.returnn.training.ReturnnTrainingJob.AituPEamOqTq.run      
...
[2024-01-09 19:58:31,189] WARNING: interrupted_resumable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-lrlin1e_5_20k/train work/i6_core/returnn/training/ReturnnTrainingJob.1Bar8z0wWtjq>                                                                                                        
[2024-01-09 19:58:31,189] WARNING: interrupted_resumable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-lrlin1e_5_50k/train work/i6_core/returnn/training/ReturnnTrainingJob.por8G6TNlO9E>                                                                                                        
[2024-01-09 19:58:31,189] WARNING: interrupted_resumable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-lrlin1e_5_600k/train work/i6_core/returnn/training/ReturnnTrainingJob.VAf79gjGIxyn>                                                                                                       
[2024-01-09 19:58:31,189] WARNING: interrupted_resumable: Job<alias/exp2023_04_25_rf/conformer_import_moh_att_2023_06_30/base-24gb-v6-warmup100k/train work/i6_core/returnn/training/ReturnnTrainingJob.Mxt0Fq5EAPMJ>                                                                                                           
[2024-01-09 19:58:31,189] WARNING: interrupted_resumable: Job<alias/exp2023_04_25_rf/espnet/v6-24gb-bs30k-wd1e_6-lrlin1e_5_587k-EBranchformer/train work/i6_core/returnn/training/ReturnnTrainingJob.AituPEamOqTq>                                                                                                              
[2024-01-09 19:58:31,189] INFO: error(52) interrupted_resumable(5) running(6) waiting(838)                
[2024-01-09 19:59:03,386] INFO: Submit to queue: work/i6_core/returnn/training/ReturnnTrainingJob.AituPEamOqTq run [1]
[2024-01-09 19:59:03,402] INFO: Submitted with job_id: ['4059231'] i6_core.returnn.training.ReturnnTrainingJob.AituPEamOqTq.run
[2024-01-09 19:59:03,608] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.AituPEamOqTq.run', 1) > matches: [('4057250', 'PENDING'), ('4059231', 'PENDING')]                                                                                                         
[2024-01-09 19:59:04,184] INFO: Experiment directory: /u/zeyer/setups/combined/2021-05-31      Call: ./sis m recipe/i6_experiments/users/zeyer/experiments/exp2023_04_25_rf/i6.py
...
[2024-01-09 19:59:04,186] INFO: queue: Job<alias/exp2023_04_25_rf/espnet/v6-24gb-bs30k-wd1e_6-lrlin1e_5_587k-EBranchformer/train work/i6_core/returnn/training/ReturnnTrainingJob.AituPEamOqTq>
[2024-01-09 19:59:04,211] INFO: running: Job<alias/exp2023_04_25_rf/aed/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_2-lrlin1e_5_295k/train work/i6_core/returnn/training/ReturnnTrainingJob.4Y1gqDiY2DHe> {ep 365/500} 
...
[2024-01-09 19:59:04,306] INFO: error(52) queue(5) running(6) waiting(838)
[2024-01-09 19:59:14,683] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.AituPEamOqTq.run', 1) > matches: [('4057250', 'PENDING'), ('4059231', 'PENDING')]
[2024-01-09 19:59:25,633] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.AituPEamOqTq.run', 1) > matches: [('4057250', 'PENDING'), ('4059231', 'PENDING')]
[2024-01-09 19:59:38,399] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.AituPEamOqTq.run', 1) > matches: [('4057250', 'PENDING'), ('4059231', 'PENDING')]
[2024-01-09 19:59:49,348] WARNING: More then one matching SLURM task, use first match < ('i6_core.returnn.training.ReturnnTrainingJob.AituPEamOqTq.run', 1) > matches: [('4057250', 'PENDING'), ('4059231', 'PENDING')]
...

So it looks a bit like it incorrectly identified it as WARNING: interrupted_resumable?

@albertz
Copy link
Member Author

albertz commented Jan 9, 2024

I wondered how it can be in this state interrupted_resumable. In Task._get_state_helper:

  • self.finished(task_id) return False
  • self.error(task_id) return False
  • And then I assume it follows this path:
                        engine_state = engine.task_state(self, task_id)
...
                if engine_state == gs.STATE_UNKNOWN:
                    if self.started(task_id):
                        # check again if it finished or crashed while retrieving the state
                        if self.finished(task_id):
                            return gs.STATE_FINISHED
                        elif self.error(task_id):
                            return gs.STATE_ERROR
                        # job logging file got updated recently, assume job is still running.
                        # used to avoid wrongly marking jobs as interrupted do to slow filesystem updates
                        elif self.running(task_id):
                            return gs.STATE_RUNNING
                        history = [] if engine is None else engine.get_submit_history(self)
                        if history and len(history[task_id]) > gs.MAX_SUBMIT_RETRIES:
                            # More then three tries to run this task, something is wrong
                            return gs.STATE_RETRY_ERROR
                        else:
                            # Task was started, but isn't running anymore => interrupted
                            if self._resume is None:
                                return gs.STATE_INTERRUPTED_NOT_RESUMABLE
                            else:
                                return gs.STATE_INTERRUPTED_RESUMABLE   # <-- going here

self.running(task_id) returns False if the job logging file has not been updated recently, which sometimes can happen at startup where it sometimes hangs a bit.

So, the main problem here is that engine.task_state(self, task_id) probably/assumably returned gs.STATE_UNKNOWN.

@albertz
Copy link
Member Author

albertz commented Jan 9, 2024

Ah, I think I found the problem:

In submit_helper:

                logging.info("Submitted with job_id: %s %s" % (job_id, name))
                for task_id in range(start_id, end_id, step_size):
                    self._task_info_cache[(name, task_id)].append((job_id, "PD"))

But then in task_state, when it uses the _task_info_cache:

        state = qs[0][1]
        if state in ["RUNNING", "COMPLETING"]:
            return STATE_RUNNING
        elif state in ["PENDING", "CONFIGURING"]:
            return STATE_QUEUE
        else:
            return STATE_UNKNOWN

I.e. "PD" is simply an unknown state. I guess it should be "PENDING".

albertz added a commit that referenced this issue Jan 9, 2024
submit_helper writes an invalid state to _task_info_cache,
and then task_state returns STATE_UNKNOWN,
which causes STATE_INTERRUPTED_RESUMABLE,
which causes a resubmit.

Fix #163
albertz added a commit that referenced this issue Jan 9, 2024
submit_helper writes an invalid state to _task_info_cache,
and then task_state returns STATE_UNKNOWN,
which causes STATE_INTERRUPTED_RESUMABLE,
which causes a resubmit.

Fix #163
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants