flux-wreckrun hangs with many tasks #772
Comments
If the
the job is complete however:
|
To see if it's a general problem with KVS watch versus a problem downstream, you might try running something like this on the side
and see if it transitions through the states that you expect. |
Good idea! I'm still verifying that lwj kvs state is sane |
The separate
Perhaps what is happening is that the single reactor loop in flux-wreckrun is congested with the kz callbacks.. not sure yet. |
Well those messages ought to be queued until somebody reads them no matter how busy the reactor is in the mean time. I'm not sure what situation would lead to them being dropped... |
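To make the queuing argument concrete, here is a minimal sketch in plain Python. It is not Flux code: `queue.Queue` stands in for whatever buffers messages ahead of the reactor, and `deliver_then_drain` is a hypothetical helper name. It only shows that messages posted while the consumer is busy are still there, in order, when the consumer gets around to reading them.

```python
import queue

STATES = ["reserved", "starting", "running", "complete"]

def deliver_then_drain(states):
    """Enqueue every state message before the consumer reads any of them."""
    inbox = queue.Queue()        # unbounded FIFO standing in for a message queue
    for state in states:         # sender posts while the consumer is "busy"
        inbox.put(state)
    seen = []
    while not inbox.empty():     # the consumer finally reads its backlog
        seen.append(inbox.get())
    return seen

received = deliver_then_drain(STATES)
print(received)
```

If the real message path behaves like this, a congested reactor would delay the state updates but not lose them, which points away from dropped events as the cause.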
To me, it doesn't seem like the events are being dropped. If events were being dropped, I would expect to see the first state, some missing states, and then the final state (e.g., What it seems like to me is that the kvs watch is not fully registering until the job has already progressed into the |
@grondo should confirm but you may have nailed it! The kvswatcher on the state seems to be instantiated after the program is launched in |
One thing is still puzzling me. If it is the case that the KVS watch is only being created after the job has completed, shouldn't the kvs cb function be called immediately with "complete" as the current value? From the kvs.h comments:
In one of @grondo's examples above, no states are registered. So maybe in that specific case, wreckrun hangs before even setting the kvs watch? Or maybe the kvs watch is only instantiated after the job archival has happened? If the kvs watch is set on the lwj-active directory, then this would surely cause problems. If it is set on the lwj directory, then I don't see why it would be a problem. |
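The ordering hypothesis discussed above can be sketched with a toy key-value store. This is plain Python, not the Flux KVS API: `ToyKVS`, `run_job`, `watch_then_run`, and `run_then_watch` are hypothetical names, and the sketch assumes that registering a watch fires the callback once with the current value, which is the behavior being asked about here rather than something confirmed in this thread.

```python
STATES = ["reserved", "starting", "running", "complete"]

class ToyKVS:
    """Hypothetical in-memory store used only for illustration."""

    def __init__(self):
        self._data = {}
        self._watchers = {}

    def put(self, key, value):
        # Store the value and notify every watcher already registered.
        self._data[key] = value
        for callback in self._watchers.get(key, []):
            callback(value)

    def watch(self, key, callback):
        # Assumption: a new watch is called once with the current value.
        self._watchers.setdefault(key, []).append(callback)
        if key in self._data:
            callback(self._data[key])

def run_job(kvs, key):
    # Stand-in for the job walking through its state transitions.
    for state in STATES:
        kvs.put(key, state)

def watch_then_run():
    kvs, seen = ToyKVS(), []
    kvs.watch("lwj.state", seen.append)   # watch registered first
    run_job(kvs, "lwj.state")
    return seen

def run_then_watch():
    kvs, seen = ToyKVS(), []
    run_job(kvs, "lwj.state")             # job finishes before the watch exists
    kvs.watch("lwj.state", seen.append)
    return seen

print(watch_then_run())
print(run_then_watch())
```

Under that assumption, the early watcher sees all four states, while the late watcher sees only `complete`. A late watch therefore explains missing intermediate states, but it does not by itself explain a run in which no state is reported at all.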
Sorry guys, I didn't mean to leave this vague issue for you to speculate on! I just wanted to capture what I was seeing before heading out for first week of school shenanigans. @garlick, I didn't mean to imply that the reactor was dropping the state changes, but more generally flux-wreckrun or lua bindings is losing them, sorry for the confusion. I actually think @garlick was onto something when he asked if lua bindings use Regardless, moving the |
Eh, nevermind. I forgot about the |
I can verify that moving the run event after all watches are set up does solve the problem similar to the first case above (some events missed) as @SteVwonder guessed. Now I'm remembering this was done on purpose on the theory that the run event should be issued as soon as possible since the frontend tool ( Even with the run event issued after reactor setup, I'm still seeing cases where wreckrun appears to hang (no states are returned). I've updated the issue description to reflect the new focus of this report. |
The wreck exec system is worthless, remove it along with associated commands, tests, and support code. Since libjsc doesn't work without wreck, it is removed as well. Fixes flux-framework#1984 Closes flux-framework#1947 Closes flux-framework#1618 Closes flux-framework#1595 Closes flux-framework#1593 Closes flux-framework#1468 Closes flux-framework#1438 Closes flux-framework#1419 Closes flux-framework#1410 Closes flux-framework#915 Closes flux-framework#894 Closes flux-framework#866 Closes flux-framework#833 Closes flux-framework#774 Closes flux-framework#772 Closes flux-framework#335 Closes flux-framework#249
The wreck exec system is worthless, remove it along with associated commands, tests, and support code. Since libjsc doesn't work without wreck, it is removed as well. Fixes flux-framework#1984 Closes flux-framework#1947 Closes flux-framework#1618 Closes flux-framework#1595 Closes flux-framework#1593 Closes flux-framework#1534 Closes flux-framework#1468 Closes flux-framework#1443 Closes flux-framework#1438 Closes flux-framework#1419 Closes flux-framework#1410 Closes flux-framework#1407 Closes flux-framework#1393 Closes flux-framework#915 Closes flux-framework#894 Closes flux-framework#866 Closes flux-framework#833 Closes flux-framework#774 Closes flux-framework#772 Closes flux-framework#335 Closes flux-framework#249
closed by #1988 |
Running verbose, flux-wreckrun should print job state update for each lwj.state transition
reserved->starting->running->complete
-- however, some of these are missing in larger runs:
A working 512 task case:
vs a suspicious run of the same size: