flux-wreckrun hangs with many tasks #772

Closed
grondo opened this issue Aug 16, 2016 · 12 comments

grondo commented Aug 16, 2016

Running verbose, flux-wreckrun should print a job state update for each lwj.state transition (reserved -> starting -> running -> complete); however, some of these are missing in larger runs:

A working 512 task case:

$ time flux wreckrun -v -n512 /bin/true
wreckrun: 0.005s: Sending LWJ request for 512 tasks (cmdline "/bin/true")
wreckrun: 0.012s: Registered jobid 1
wreckrun: Allocating 512 tasks across 2304 available nodes..
wreckrun: tasks per node: node[0-511]: 1
wreckrun: 0.109s: Sending run event
wreckrun: 0.302s: State = reserved
wreckrun: 5.084s: State = starting
wreckrun: 5.637s: State = running
wreckrun: 6.300s: State = complete
wreckrun: tasks [0-511]: exited with exit code 0
wreckrun: All tasks completed successfully.

vs a suspicious run of the same size:

$ time flux wreckrun -v -n512 /bin/true
wreckrun: 0.004s: Sending LWJ request for 512 tasks (cmdline "/bin/true")
wreckrun: 0.009s: Registered jobid 2
wreckrun: Allocating 512 tasks across 2304 available nodes..
wreckrun: tasks per node: node[0-511]: 1
wreckrun: 0.109s: Sending run event
wreckrun: 2.697s: State = complete
wreckrun: tasks [0-511]: exited with exit code 0
wreckrun: All tasks completed successfully.

grondo commented Aug 16, 2016

If the complete state is missed, then the job will hang:

$ time flux wreckrun -v -n2304 /bin/true
wreckrun: 0.005s: Sending LWJ request for 2304 tasks (cmdline "/bin/true")
wreckrun: 0.011s: Registered jobid 4
wreckrun: Allocating 2304 tasks across 2304 available nodes..
wreckrun: tasks per node: node[0-2303]: 1
wreckrun: 0.613s: Sending run event
/* hang */

the job is complete however:

$ flux kvs get lwj.4.state
complete


garlick commented Aug 16, 2016

To see if it's a general problem with KVS watch versus a problem downstream, you might try running something like this on the side:

$ flux kvs watch lwj.6.state

and see if it transitions through the states that you expect.


grondo commented Aug 16, 2016

Good idea! I'm still verifying that the lwj kvs state is sane.


grondo commented Aug 16, 2016

The separate flux kvs watch worked fine (this was with 36,834 tasks):

$ flux kvs watch lwj.8.state
NULL
"reserved"
"starting"
"running"
"complete"

Perhaps what is happening is that the single reactor loop in flux-wreckrun is congested with the kz callbacks; not sure yet.


garlick commented Aug 16, 2016

Well, those messages ought to be queued until somebody reads them, no matter how busy the reactor is in the meantime. I'm not sure what situation would lead to them being dropped...

@SteVwonder

To me, it doesn't seem like the events are being dropped. If events were being dropped, I would expect to see the first state, some missing states, and then the final state (e.g., State = reserved and then State = complete). What we see is just the last state.

What it seems like to me is that the kvs watch is not fully registering until the job has already progressed into the running or complete state. If that is the case, maybe going through the JSC would fix it?
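The race hypothesized above can be modeled with a toy key-value store. This is an illustrative sketch only, not the flux API: `ToyKVS`, its methods, and the key names are invented for this example. The point is that a watcher registered before the transitions sees every state, while a watcher registered afterward gets only the current (final) value, matching the "suspicious" runs above.

```python
# Toy model of the suspected registration race. The "immediate callback
# with the current value" behavior mimics the documented kvs_watch
# semantics; everything else here is hypothetical.

class ToyKVS:
    def __init__(self):
        self.values = {}
        self.watchers = {}  # key -> list of callbacks

    def put(self, key, value):
        # Store the new value and notify every registered watcher.
        self.values[key] = value
        for cb in self.watchers.get(key, []):
            cb(value)

    def watch(self, key, cb):
        # Register the watcher, then call it once with the current value.
        self.watchers.setdefault(key, []).append(cb)
        cb(self.values.get(key))

kvs = ToyKVS()

# Case 1: watch registered *before* the run event -- all states observed.
seen = []
kvs.watch("lwj.1.state", seen.append)
for state in ["reserved", "starting", "running", "complete"]:
    kvs.put("lwj.1.state", state)
print(seen)   # [None, 'reserved', 'starting', 'running', 'complete']

# Case 2: watch registered *after* the job already finished -- only the
# final state is delivered.
seen2 = []
for state in ["reserved", "starting", "running", "complete"]:
    kvs.put("lwj.2.state", state)
kvs.watch("lwj.2.state", seen2.append)
print(seen2)  # ['complete']
```

Note that even the late watcher still receives "complete", so under this model the tool would notice completion rather than hang; a true hang would require the watch never being registered (or being registered on a key that no longer updates).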


garlick commented Aug 16, 2016

@grondo should confirm but you may have nailed it! The kvswatcher on the state seems to be instantiated after the program is launched in flux-wreckrun. Maybe this can be trivially fixed by watching the state before sending the wrexec.run event.

@SteVwonder

One thing is still puzzling me. If it is the case that the KVS watch is only being created after the job has completed, shouldn't the kvs cb function be called immediately with "complete" as the current value? From the kvs.h comments:

/* kvs_watch* is like kvs_get* except the registered callback is called
 * to set the value.  It will be called immediately to set the initial
 * value and again each time the value changes.

In one of @grondo's examples above, no states are registered. So maybe in that specific case, wreckrun hangs before even setting the kvs watch? Or maybe the kvs watch is only instantiated after the job archival has happened? If the kvs watch is set on the lwj-active directory, then this would surely cause problems. If it is set on the lwj directory, then I don't see why it would be a problem.


grondo commented Aug 17, 2016

Sorry guys, I didn't mean to leave this vague issue for you to speculate on! I just wanted to capture what I was seeing before heading out for first-week-of-school shenanigans.

@garlick, I didn't mean to imply that the reactor was dropping the state changes, but more generally that flux-wreckrun or the lua bindings are losing them; sorry for the confusion.

I actually think @garlick was onto something when he asked whether the lua bindings use kvs_watch_once vs kvs_watch, and I think they may. That is especially bad when the reactor is busy, because the callback may sit for a while without being handled, and that is when state transitions could be missed.

Regardless, moving the kvs_watch call up in flux-wreckrun is definitely required, but it is probably also just a band-aid over the underlying problem in the flux lua bindings.
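The distinction drawn here between queued notifications and "fetch the latest value when serviced" can be sketched in a few lines. This is a schematic model, not the flux API: it contrasts a kvs_watch_once-style pattern (read the current value whenever the busy reactor finally gets around to it, losing intermediate transitions) with a kvs_watch-style pattern (one queued notification per change, losing nothing).

```python
# Schematic contrast of two watch styles under a congested reactor.
# All names are illustrative; this is not the flux API.
from collections import deque

current = {}      # latest value per key (watch_once-style view)
queued = deque()  # one notification per change (watch-style view)

def put(key, value):
    current[key] = value          # overwrite: history is lost here
    queued.append((key, value))   # queue: every transition preserved

# The job races through its states before the watcher is serviced.
for state in ["reserved", "starting", "running", "complete"]:
    put("lwj.1.state", state)

# Busy reactor finally services the watcher:
latest_only = current["lwj.1.state"]  # three transitions were overwritten
all_changes = [v for k, v in queued if k == "lwj.1.state"]
print(latest_only)  # complete
print(all_changes)  # ['reserved', 'starting', 'running', 'complete']
```

Under this model, a slow consumer using the overwrite-style view sees only "complete", while the queued view preserves every transition regardless of how late it is drained.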


grondo commented Aug 17, 2016

I actually think @garlick was onto something when he asked if lua bindings use kvs_watch_once vs kvs_watch

Eh, never mind. I forgot about the kvswatcher abstraction in the bindings, which uses kvs_watch() directly (I think; it's been a while since I've poked around in here).

grondo changed the title from "flux-wreckrun loses lwj.state callbacks" to "flux-wreckrun hangs with many tasks" on Aug 17, 2016

grondo commented Aug 17, 2016

I can verify that moving the run event until after all watches are set up does solve the problem in cases similar to the first one above (some events missed), as @SteVwonder guessed. Now I'm remembering this was done on purpose, on the theory that the run event should be issued as soon as possible, since the frontend tool (flux-wreckrun) is not required for a functional run of a job. Also, as @SteVwonder suggested, even if the kvs_watch isn't set until after the job is complete, the tool should still get that final state and note that the job is done.

Even with the run event issued after reactor setup, I'm still seeing cases where wreckrun appears to hang (no states are returned). I've updated the issue description to reflect the new focus of this report.

grondo added a commit to grondo/flux-core that referenced this issue Feb 5, 2019
The wreck exec system is worthless, remove it along with associated
commands, tests, and support code.

Since libjsc doesn't work without wreck, it is removed as well.

Fixes flux-framework#1984
Closes flux-framework#1947
Closes flux-framework#1618
Closes flux-framework#1595
Closes flux-framework#1593
Closes flux-framework#1468
Closes flux-framework#1438
Closes flux-framework#1419
Closes flux-framework#1410
Closes flux-framework#915
Closes flux-framework#894
Closes flux-framework#866
Closes flux-framework#833
Closes flux-framework#774
Closes flux-framework#772
Closes flux-framework#335
Closes flux-framework#249
grondo added a commit to grondo/flux-core that referenced this issue Feb 5, 2019
grondo added a commit to grondo/flux-core that referenced this issue Feb 5, 2019
grondo added a commit to grondo/flux-core that referenced this issue Feb 9, 2019

grondo commented Feb 13, 2019

Closed by #1988.

@grondo grondo closed this as completed Feb 13, 2019
chu11 pushed a commit to chu11/flux-core that referenced this issue Feb 13, 2019