There are cases where jobs configured with the AutoRelease feature try to copy back their logs each time they run, but the second log copy fails because ifdh cannot overwrite the existing file.
gfal-copy error: 17 (File exists) - Destination https://[redacted]/fermigrid/jobsub/jobs/2024_03_12/6f20c05e-8023-4248-966f-0233d5a3c089/fife_wrap2024_03_12_1822276f20c05e-8023-4248-966f-0233d5a3c089cluster.67623825.0.err exists and overwrite is not set
Kibana reports the job with exit code 0, but the stdout log shows:
executable was killed: exiting 1
Wed Mar 13 07:20:24 UTC 2024 fife_wrap COMPLETED with exit status 1
This mismatch is confusing.
This happens because the log is from the first time the job ran, while the exit state in Kibana is possibly from the second time the job ran.
As discussed at the Jobsub weekly meeting, we could use the NumJobStarts classAd, or something similar, as a suffix for the log filename to disentangle the logs from each time the job is restarted. That way all of them could be copied back and possibly made available to users.
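A minimal sketch of the suffix idea, in Python, purely for illustration. The function names and the `.startN` naming scheme are hypothetical and are not taken from jobsub or fife_wrap. It assumes the job ad is available as plain `Attribute = value` text.

```python
import os
import re


def read_num_job_starts(job_ad_text: str, default: int = 0) -> int:
    """Return NumJobStarts from 'Attr = value' job-ad text, or default if absent."""
    for line in job_ad_text.splitlines():
        # ClassAd attribute names are case-insensitive.
        match = re.match(r"\s*NumJobStarts\s*=\s*(\d+)\s*$", line, re.IGNORECASE)
        if match:
            return int(match.group(1))
    return default


def suffixed_log_name(filename: str, num_job_starts: int) -> str:
    """Insert a per-start suffix before the final extension (hypothetical scheme)."""
    stem, ext = os.path.splitext(filename)
    return f"{stem}.start{num_job_starts}{ext}"
```

With a distinct destination name per start, the second copy would no longer collide with the file written by the first run. On the worker node the job ad text could be read from the file named by the `_CONDOR_JOB_AD` environment variable. Whether the value seen there counts the current start or only the previous ones should be checked before relying on it.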
An example job is
[email protected]
The job was part of POMS4_SUBMISSION_ID:1712364.
Fifebatch Events details show the job got held and released.
IFDH logs for the job show that the log copy back failed the second time, with the gfal-copy error quoted above.