jobdesc check in job module would be useful, possibly implement job creation API? #1443

Closed
trws opened this issue Apr 10, 2018 · 8 comments

Comments

@trws
Member

trws commented Apr 10, 2018

Two out of 4000 jobs so far have managed to run well past their walltime without being killed. This instance is not, unfortunately, in verbose mode, so all I know to get is the kvs output for one of them:

lwj.0.6.880.0.procdesc = {"command": "/p/gscratch1/splash/splash_semiproduction/041018/wo...
lwj.0.6.880.0.stdin -> lwj.0.6.880.input.files.stdin
lwj.0.6.880.cmdline = ["/p/gscratch1/splash/splash_semiproduction/041018/workspace/run_cr...
lwj.0.6.880.create-time = 1523337807.685208
lwj.0.6.880.cwd = /p/gscratch1/splash/splash_semiproduction/041018/workspace
lwj.0.6.880.environ = {"BASH_ENV": "/usr/share/lmod/lmod/init/bash", "BASH_FUNC_ml()": "(...
lwj.0.6.880.ncores = 24
lwj.0.6.880.nnodes = 1
lwj.0.6.880.ntasks = 1
lwj.0.6.880.options = {"stdio-delay-commit": 1}
lwj.0.6.880.opts = {"cores-per-task": 24, "nnodes": 1, "ntasks": 1, "tasks-per-node": 1}
lwj.0.6.880.pmi.PMI_process_mapping = (vector,(0,1,1))
lwj.0.6.880.rank.130.cores = 24
lwj.0.6.880.rdl = { "type": "cluster", "path": "\/cluster", "basename": "cluster", "name"...
lwj.0.6.880.running-time = 1523337821.050703
lwj.0.6.880.starting-time = 1523337820.911810
lwj.0.6.880.state = running
lwj.0.6.880.walltime = 08:00:00

The job is currently at about 11 hours. This particularly surprises me because I ran 25000 jobs through over the weekend and never saw this, and now it has happened twice. The only difference that immediately comes to mind is that these jobs use an 8-hour walltime, where the earlier ones used one hour.

@grondo
Contributor

grondo commented Apr 10, 2018

lwj.0.6.880.walltime = 08:00:00

Does that format for walltime work? The timeout handler installed in wrexecd expects walltime in seconds, not HH:MM:SS. At least I thought it did...
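
To make the mismatch concrete, here is a minimal Python sketch of normalizing a walltime value to seconds. It is not wrexecd's code (the real handler is not shown in this thread); the function name `walltime_to_seconds` and the set of accepted formats are assumptions made for illustration.

```python
def walltime_to_seconds(value):
    """Return a walltime as a positive integer number of seconds.

    Hypothetical helper: accepts an int, a numeric string such as "28800",
    or a colon-separated "HH:MM:SS" / "MM:SS" string.
    Raises ValueError for anything else.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        raise ValueError(f"invalid walltime: {value!r}")

    if isinstance(value, int):
        seconds = value
    elif isinstance(value, str):
        parts = value.strip().split(":")
        if not 1 <= len(parts) <= 3 or not all(p.isdigit() for p in parts):
            raise ValueError(f"invalid walltime: {value!r}")
        # Fold the fields left to right: hours -> minutes -> seconds.
        seconds = 0
        for part in parts:
            seconds = seconds * 60 + int(part)
    else:
        raise ValueError(f"invalid walltime: {value!r}")

    if seconds <= 0:
        raise ValueError(f"walltime must be positive: {value!r}")
    return seconds
```

With a helper like this, the value from the kvs dump above, `08:00:00`, would normalize to 28800 seconds rather than being passed through in a form the timeout handler may not understand.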

@trws
Member Author

trws commented Apr 10, 2018

Now that might explain some things. The other jobs probably haven't needed to be killed, and I bet the walltime was set by the workflow manager and never validated on the way through the system. Would it be reasonable to have a check for that kind of thing on submit, do you think?

@grondo
Contributor

grondo commented Apr 10, 2018

Yeah, it might make sense in the near term for the job module to do at least some verification of the contents of the submit/create RPC payload. Right now it writes whatever you have in the payload directly to the kvs, and doesn't even examine the contents. 😜
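
As a rough illustration of the kind of check being discussed, the sketch below validates a submit payload before it is stored. It is written in Python for brevity and is not the job module's implementation. The function name `validate_jobdesc` is hypothetical, the keys are taken from the kvs dump earlier in this thread, and the rules (a required `cmdline`, positive integer counts, walltime in seconds) are assumptions.

```python
def _is_positive_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def validate_jobdesc(payload):
    """Return a list of problems found in a submit payload (empty if none).

    Hypothetical rules, assumed for illustration:
      - "cmdline" is required and must be a non-empty list of strings
      - "nnodes", "ntasks", "ncores" are optional positive integers
      - "walltime" is optional and must be a positive integer (seconds)
    """
    if not isinstance(payload, dict):
        return ["payload: expected a JSON object"]

    errors = []

    cmdline = payload.get("cmdline")
    if (not isinstance(cmdline, list) or not cmdline
            or not all(isinstance(arg, str) for arg in cmdline)):
        errors.append("cmdline: expected a non-empty list of strings")

    for key in ("nnodes", "ntasks", "ncores", "walltime"):
        if key in payload and not _is_positive_int(payload[key]):
            errors.append(
                f"{key}: expected a positive integer, got {payload[key]!r}"
            )

    return errors
```

Under these assumed rules, a payload carrying `"walltime": "08:00:00"` would be rejected at submit time with a message naming the offending key, instead of being written to the kvs unexamined.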

@trws
Member Author

trws commented Apr 10, 2018 via email

@grondo
Contributor

grondo commented Apr 10, 2018

Well, the purpose was rapid prototyping, so maybe it ended up helping somewhat. Unfortunately, the lack of any kind of reasonable schema and validation has also caused some pain when trying to use this for real work! Sorry about that.

@trws trws changed the title from "rare walltime killer failure" to "jobdesc check in job module would be useful, possibly implement job creation API?" Apr 10, 2018
@trws
Member Author

trws commented Apr 10, 2018

Switching the title over to something closer to the actual issue here. The question is probably whether this should be called a dupe of #268?

@grondo
Contributor

grondo commented Apr 10, 2018

I don't think this is a duplicate of #268.

IMO, this issue should address the minimum required fixes for the current job module, which is known to be deficient and didn't even try to address any part of #268.

#268 will only be resolved when we've completely replaced the current job module with a true job ingest system. We should perhaps use what we learn here to feed into #268.

grondo added a commit to grondo/flux-core that referenced this issue Feb 5, 2019
The wreck exec system is worthless, remove it along with associated
commands, tests, and support code.

Since libjsc doesn't work without wreck, it is removed as well.

Fixes flux-framework#1984

Closes flux-framework#1947
Closes flux-framework#1618
Closes flux-framework#1595
Closes flux-framework#1593
Closes flux-framework#1534
Closes flux-framework#1468
Closes flux-framework#1443
Closes flux-framework#1438
Closes flux-framework#1419
Closes flux-framework#1410
Closes flux-framework#1407
Closes flux-framework#1393
Closes flux-framework#915
Closes flux-framework#894
Closes flux-framework#866
Closes flux-framework#833
Closes flux-framework#774
Closes flux-framework#772
Closes flux-framework#335
Closes flux-framework#249
grondo added a commit to grondo/flux-core that referenced this issue Feb 9, 2019
@grondo
Contributor

grondo commented Feb 13, 2019

closed by #1988

@grondo grondo closed this as completed Feb 13, 2019