jobdesc check in job module would be useful, possibly implement job creation API? #1443
Comments
Does that format for walltime work? For the timeout handler installed in wrexecd it expects walltime in seconds, not HH:MM:SS. At least I thought it did.... |
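For illustration, the mismatch described above (a handler that expects seconds receiving an HH:MM:SS string) could be caught by normalizing the value before it is stored. This is a minimal sketch, not flux-core code: the function name `walltime_to_seconds` and the accepted formats (plain seconds, MM:SS, HH:MM:SS) are assumptions made for this example.

```python
def walltime_to_seconds(value):
    """Normalize a walltime to integer seconds (hypothetical helper).

    Accepts an int, a digit string such as "3600", or a colon-separated
    string such as "MM:SS" or "HH:MM:SS". Raises ValueError otherwise.
    """
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly
        raise ValueError("walltime must be a number of seconds or HH:MM:SS")
    if isinstance(value, int):
        seconds = value
    elif isinstance(value, str):
        parts = value.strip().split(":")
        if not 1 <= len(parts) <= 3:
            raise ValueError(f"unrecognized walltime format: {value!r}")
        if not all(p.isascii() and p.isdigit() for p in parts):
            raise ValueError(f"unrecognized walltime format: {value!r}")
        # Fold the fields left to right: each step shifts by a factor of 60
        seconds = 0
        for p in parts:
            seconds = seconds * 60 + int(p)
    else:
        raise ValueError("walltime must be a number of seconds or HH:MM:SS")
    if seconds <= 0:
        raise ValueError("walltime must be positive")
    return seconds
```

With this sketch, `walltime_to_seconds("08:00:00")` and `walltime_to_seconds(28800)` give the same result, while a malformed value such as `"8h"` is rejected at submit time instead of being silently ignored later.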
Now that might explain some things. The other jobs probably haven't needed to be killed, and I bet the walltime was set by the workflow manager, and it was never validated on the way through the system. Would it be reasonable to have a check for that kind of thing on submit do you think? |
Yeah, it might make sense in the near term for the job module to do at least some verification of the contents of the submit/create RPC payload. Right now it writes whatever you have in the payload directly to the kvs, and doesn't even examine the contents. 😜 |
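A rough sketch of what "some verification" of the payload might look like before anything is written to the kvs. Everything here is an assumption for illustration: the function name `validate_jobdesc`, the field names `nnodes` and `ntasks`, and the rule that these fields must be positive integers when present. Only `walltime` is taken from this thread; the real job module's payload may differ.

```python
# Hypothetical set of fields that must be positive integers when present.
INT_FIELDS = ("walltime", "nnodes", "ntasks")


def validate_jobdesc(payload):
    """Return a list of error strings; an empty list means the payload passed."""
    if not isinstance(payload, dict):
        return ["payload must be a JSON object"]
    errors = []
    for key in INT_FIELDS:
        if key not in payload:
            continue  # this sketch treats every field as optional
        value = payload[key]
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            errors.append(f"{key} must be a positive integer, got {value!r}")
    return errors
```

Unknown keys pass through untouched in this sketch, so extra metadata in the payload would still reach the kvs; only the fields the execution side interprets are checked. A caller would reject the submit request when the returned list is non-empty.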
I will admit to shamelessly taking advantage of that for metadata…
That said, I’m a bit surprised we didn’t hit more issues with the
Jget_int calls on walltime for example. It’s a testament to well
checked code on that end that the jobs ran correctly for sure.
…On 10 Apr 2018, at 11:17, Mark Grondona wrote:
Yeah, it might make sense in the near term for the job module to do at
least some verification of the contents of the submit/create RPC
payload. Right now it writes whatever you have in the payload directly
to the kvs, and doesn't even examine the contents.
😜
|
Well the purpose was for rapid prototyping and so maybe it ended up helping somewhat. Unfortunately lack of any kind of reasonable schema and validation has caused some pain when trying to use this for real work as well! Sorry about that. |
Switching the title over to something closer to the actual issue here. The question is probably whether this should be called a dupe of #268? |
I don't think this is a duplicate of #268. IMO, this issue should address minimum required fixes for the current job module, which is known deficient and didn't even try to address any part of #268. #268 will only be resolved when we've completely replaced the current job module with a true job ingest system. We should perhaps use what we learn here to feed in to #268. |
The wreck exec system is worthless, remove it along with associated commands, tests, and support code. Since libjsc doesn't work without wreck, it is removed as well. Fixes flux-framework#1984 Closes flux-framework#1947 Closes flux-framework#1618 Closes flux-framework#1595 Closes flux-framework#1593 Closes flux-framework#1534 Closes flux-framework#1468 Closes flux-framework#1443 Closes flux-framework#1438 Closes flux-framework#1419 Closes flux-framework#1410 Closes flux-framework#1407 Closes flux-framework#1393 Closes flux-framework#915 Closes flux-framework#894 Closes flux-framework#866 Closes flux-framework#833 Closes flux-framework#774 Closes flux-framework#772 Closes flux-framework#335 Closes flux-framework#249
closed by #1988 |
Two out of 4000 jobs so far have managed to run well past their walltime without being killed. Unfortunately this instance is not in verbose mode, so all I can provide is the kvs output for one of them:
The job is currently at about 11 hours. This particularly surprises me because I ran 25000 jobs through over the weekend and never saw this, and now it has happened twice. The only difference that immediately comes to mind is that these jobs use an 8-hour walltime, where the earlier ones used one hour.