qstat returns no information about batch job #323
Comments
Bump this in priority.
Another user has encountered this on Polaris. The proposed solution discussed so far is to parse the message that qstat returns for these jobs. It looks like this:
However, the solution I gave to the user was to hack site/service/scheduler.py, which could also serve as a fix. The proposed change would be to replace balsam/balsam/site/service/scheduler.py Line 157 in bfbf833
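The message-parsing approach mentioned above could be sketched roughly as follows. This is only an illustration: it assumes the error text has the form seen in the traceback in this issue (`qstat: Unknown Job Id <id>`), and the helper name `parse_unknown_job_id` is hypothetical, not part of Balsam.

```python
import re

# Pattern for the PBS message seen in this issue's traceback.
# Assumption: other PBS versions or sites may word this differently.
_UNKNOWN_JOB_RE = re.compile(r"qstat: Unknown Job Id\s+(\S+)")


def parse_unknown_job_id(qstat_output):
    """Return the job id that qstat reports as unknown, or None if absent.

    Hypothetical helper for illustration; not an existing Balsam function.
    """
    match = _UNKNOWN_JOB_RE.search(qstat_output)
    return match.group(1) if match else None


# Sample text shaped like the output quoted in this issue.
sample = (
    "qstat: Unknown Job Id 412635.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov "
    '{ "timestamp":1676262680 }'
)
unknown_id = parse_unknown_job_id(sample)
print(unknown_id)
```

A caller could treat a non-`None` result as "PBS no longer has a record of this job" and handle it separately from other qstat failures.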
PR #345 fixes part of this issue. When Balsam queries PBS with qstat it will check if This will not handle the situation where a Balsam site has been inactive for longer than two weeks and could not get information on the finished batch job before PBS purged the record. In this case further development is needed, and this PR will erroneously change the batch job's state to submit_failed. However, a user can fix the state of the batch job by hand. It's unclear how common this second case is, but it should be addressed.
This is an issue seen on Polaris with PBS Pro. Jobs submitted to the prod queue are routed to the small, medium, and large queues. If something about that routing fails, the job disappears from PBS's history. However, the original qsub command succeeded, so Balsam assumes the batch job is queued and tries to look it up with qstat, but qstat fails. This causes an uncaught exception that crashes the site. Sample error below.
2023-02-13 04:31:20.411 | 167662 | ERROR | balsam:120] Uncaught Exception <class 'balsam.platform.scheduler.scheduler.SchedulerNonZeroReturnCode'>: qstat: Unknown Job Id 412635.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
{
"timestamp":1676262680,
"pbs_version":"2022.1.1.20220926110806",
"pbs_server":"polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov"
}
Traceback (most recent call last):
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/util/process.py", line 17, in run
    self._run()
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/site/service/service_base.py", line 23, in _run
    self.run_cycle()
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/site/service/scheduler.py", line 154, in run_cycle
    job_log = self.scheduler.parse_logs(job.scheduler_id, job.status_info.get("submit_script", None))
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/platform/scheduler/scheduler.py", line 163, in parse_logs
    log_data = cls._parse_logs(scheduler_id, job_script_path)
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/platform/scheduler/pbs_sched.py", line 300, in _parse_logs
    stdout = scheduler_subproc(args)
  File "/home/jjlow/test/env/lib/python3.8/site-packages/balsam/platform/scheduler/scheduler.py", line 37, in scheduler_subproc
    raise SchedulerNonZeroReturnCode(p.stdout)
balsam.platform.scheduler.scheduler.SchedulerNonZeroReturnCode: qstat: Unknown Job Id 412635.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
{
"timestamp":1676262680,
"pbs_version":"2022.1.1.20220926110806",
"pbs_server":"polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov"
}
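The traceback shows the exception escaping from `parse_logs` inside `run_cycle`. One way to keep the site alive might be to catch it at that call, as in the rough sketch below. It is not the actual Balsam code or the change made in any PR: the exception class is a local stand-in for `balsam.platform.scheduler.scheduler.SchedulerNonZeroReturnCode`, and `parse_logs_or_none` and `fake_parse_logs` are hypothetical names used only for illustration.

```python
class SchedulerNonZeroReturnCode(Exception):
    """Local stand-in for Balsam's exception of the same name."""


def parse_logs_or_none(parse_logs, scheduler_id, submit_script=None):
    """Call parse_logs; return None if PBS no longer knows the job.

    Hypothetical wrapper. Any other scheduler failure is re-raised unchanged.
    """
    try:
        return parse_logs(scheduler_id, submit_script)
    except SchedulerNonZeroReturnCode as exc:
        # Assumption: the message text matches the one quoted in this issue.
        if "Unknown Job Id" in str(exc):
            return None
        raise


def fake_parse_logs(scheduler_id, submit_script):
    """Simulates qstat failing for a job that PBS has dropped."""
    raise SchedulerNonZeroReturnCode(f"qstat: Unknown Job Id {scheduler_id}")


result = parse_logs_or_none(fake_parse_logs, "412635.polaris-pbs-01")
print(result)
```

With a wrapper like this, the calling code could decide what state to assign when the result is `None`, rather than crashing the site on the uncaught exception.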