kvs unresponsive after series of wreck jobs #774

Closed
grondo opened this issue Aug 17, 2016 · 3 comments

Comments

@grondo
Contributor

grondo commented Aug 17, 2016

I ran the following in a flux session of size 512:

$ for i in 1 2 4 8 12 16 24 32; do flux wreckrun -n $((512*${i})) -d hostname; sleep 60; done
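
For reference, the task counts this loop requests can be listed without launching anything. This is only a sketch of the arithmetic in the command above; the function name `task_counts` is made up for illustration, and 512 is the session size stated in the report.

```shell
#!/bin/sh
# List tasks-per-node and total task count for each loop iteration.
# task_counts is a hypothetical helper; it does not call flux at all.
task_counts() {
    nodes="$1"
    for i in 1 2 4 8 12 16 24 32; do
        # Same arithmetic as the -n argument in the original loop.
        echo "tasks_per_node=$i ntasks=$((nodes * i))"
    done
}

task_counts 512
```

The last iteration requests 16384 tasks, which matches the NTASKS value reported later in the thread.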

After completion, at least one thread of flux-broker was 100% busy, and the kvs appeared unresponsive:

$ flux ping cmb
cmb.ping pad=0 seq=0 time=0.286 ms (B3BEB!3F477!0)
cmb.ping pad=0 seq=1 time=0.225 ms (B3BEB!3F477!0)
cmb.ping pad=0 seq=2 time=0.224 ms (B3BEB!3F477!0)
^C
$ flux ping kvs
[no output]

dmesg shows a stream like the following:

2016-08-17T16:56:49.590631Z kvs.debug[0]: coalesced 4 commits
2016-08-17T16:56:50.171770Z broker.debug[0]: content purge: 122 entries
2016-08-17T16:56:52.171282Z broker.debug[0]: content purge: 81 entries
2016-08-17T16:56:54.175620Z broker.debug[0]: content purge: 76 entries
2016-08-17T16:56:56.173005Z broker.debug[0]: content purge: 72 entries
2016-08-17T16:56:58.172827Z broker.debug[0]: content purge: 195 entries
2016-08-17T16:56:59.324853Z kvs.debug[0]: coalesced 3 commits
2016-08-17T16:57:00.171512Z broker.debug[0]: content purge: 93 entries
2016-08-17T16:57:01.451946Z kvs.debug[0]: coalesced 3 commits
2016-08-17T16:57:02.171795Z broker.debug[0]: content purge: 118 entries
2016-08-17T16:57:02.561214Z kvs.debug[0]: coalesced 3 commits
2016-08-17T16:57:04.173284Z broker.debug[0]: content purge: 103 entries
2016-08-17T16:57:06.172983Z broker.debug[0]: content purge: 166 entries
2016-08-17T16:57:08.173314Z broker.debug[0]: content purge: 77 entries
2016-08-17T16:57:10.173335Z broker.debug[0]: content purge: 28 entries
2016-08-17T16:57:12.173768Z broker.debug[0]: content purge: 22 entries
2016-08-17T16:57:14.179167Z broker.debug[0]: content purge: 77 entries
2016-08-17T16:57:15.125174Z kvs.debug[0]: coalesced 5 commits
2016-08-17T16:57:16.173448Z broker.debug[0]: content purge: 33 entries
2016-08-17T16:57:18.173835Z broker.debug[0]: content purge: 27 entries
2016-08-17T16:57:18.556761Z kvs.debug[0]: coalesced 3 commits
2016-08-17T16:57:20.174665Z broker.debug[0]: content purge: 44 entries
2016-08-17T16:57:21.367483Z kvs.debug[0]: coalesced 2 commits
2016-08-17T16:57:22.174047Z broker.debug[0]: content purge: 32 entries
2016-08-17T16:57:22.610293Z kvs.debug[0]: coalesced 3 commits
2016-08-17T16:57:24.173526Z broker.debug[0]: content purge: 23 entries
2016-08-17T16:57:26.173902Z broker.debug[0]: content purge: 30 entries
2016-08-17T16:57:28.173772Z broker.debug[0]: content purge: 28 entries
2016-08-17T16:57:29.572551Z kvs.debug[0]: coalesced 2 commits
2016-08-17T16:57:30.174461Z broker.debug[0]: content purge: 39 entries
2016-08-17T16:57:32.175015Z broker.debug[0]: content purge: 33 entries
2016-08-17T16:57:32.858737Z kvs.debug[0]: coalesced 2 commits
2016-08-17T16:57:33.978762Z kvs.debug[0]: coalesced 2 commits
2016-08-17T16:57:34.173933Z broker.debug[0]: content purge: 35 entries
2016-08-17T16:57:35.465255Z kvs.debug[0]: coalesced 2 commits
2016-08-17T16:57:36.174394Z broker.debug[0]: content purge: 44 entries
2016-08-17T16:57:36.575777Z kvs.debug[0]: coalesced 2 commits
2016-08-17T16:57:38.173937Z broker.debug[0]: content purge: 35 entries
2016-08-17T16:57:38.254017Z kvs.debug[0]: coalesced 3 commits
2016-08-17T16:57:39.480687Z kvs.debug[0]: coalesced 2 commits
2016-08-17T16:57:40.176222Z broker.debug[0]: content purge: 41 entries
2016-08-17T16:57:40.535676Z kvs.debug[0]: coalesced 3 commits
2016-08-17T16:57:40.717308Z kvs.debug[0]: coalesced 2 commits
2016-08-17T16:57:41.014433Z kvs.debug[0]: coalesced 4 commits
2016-08-17T16:57:41.203749Z kvs.debug[0]: coalesced 2 commits
2016-08-17T16:57:41.408167Z kvs.debug[0]: coalesced 2 commits
2016-08-17T16:57:42.175996Z broker.debug[0]: content purge: 59 entries
2016-08-17T16:57:42.893490Z kvs.debug[0]: coalesced 2 commits
2016-08-17T16:57:43.630937Z kvs.debug[0]: coalesced 2 commits
2016-08-17T16:57:44.174295Z broker.debug[0]: content purge: 45 entries
2016-08-17T16:57:45.333923Z kvs.debug[0]: coalesced 2 commits
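
A log like this is easier to compare between runs when it is reduced to totals. The following is a minimal sketch that assumes the exact line format shown above ("coalesced N commits" and "content purge: N entries") and reads the log from stdin; `summarize_log` is a hypothetical helper name, not a flux command.

```shell
#!/bin/sh
# Summarize kvs coalesce and content purge activity from a log on stdin.
# Assumes the count is the second-to-last field, as in the lines above.
summarize_log() {
    awk '
        / coalesced [0-9]+ commits/      { c_events++; commits += $(NF-1) }
        / content purge: [0-9]+ entries/ { p_events++; purged  += $(NF-1) }
        END {
            printf "coalesce events: %d (commits: %d)\n", c_events, commits
            printf "purge events: %d (entries: %d)\n", p_events, purged
        }
    '
}
```

If the log was saved to a file, it could be run as `summarize_log < broker.log`, where `broker.log` is a placeholder file name.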

I spot-checked some of the other brokers in the system, and they were all busy doing something, perhaps still processing data from the jobs. Unfortunately, commands like flux comms-stats kvs were also hanging.

Not sure if this is reproducible. My job was killed due to its time limit while I was investigating.

@grondo
Contributor Author

grondo commented Aug 18, 2016

This does seem to be reproducible. I was able to grab another 512 nodes on jade and successfully ran up to 16 tasks of hostname per node, but at either 24 or 32 the kvs got into the state above again. This is with default commit-per-line for job kzio -- are we just flooding the kvs with too many requests?

It appears to be making some sort of progress, so I'll wait and see how long it takes to finish.

@grondo
Contributor Author

grondo commented Aug 18, 2016

One problem may be that the 24-task-per-node test was still impacting the kvs when the 32-task-per-node job was launched (flux-wreckrun --detach doesn't "wait" for previous jobs to finish, after all).

Running with -o stdio-delay-commit seems to cause no problem:

    ID       NTASKS     STARTING      RUNNING     COMPLETE        TOTAL
     9        16384       0.312s       1.455s       3.901s       5.356s
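
One way to test the overlap theory would be to rerun the last two steps without `-d`, so each job finishes before the next is launched. The sketch below is a dry run that only prints the commands. It rests on two assumptions: that wreckrun without `--detach` returns only after the job completes, and that `-o stdio-delay-commit` can be combined with `-n` as shown. `run_series` and the `FLUX` variable are illustrative names.

```shell
#!/bin/sh
# Dry run by default: prints each command instead of executing it.
# Set FLUX="flux" inside a flux session to actually launch the jobs.
FLUX="${FLUX:-echo flux}"
NODES=512   # session size from the original report

run_series() {
    for i in "$@"; do
        # No -d here, so (by assumption) each job completes before the next starts.
        $FLUX wreckrun -n $((NODES * i)) -o stdio-delay-commit hostname
    done
}

run_series 24 32
```

Dropping `-o stdio-delay-commit` from the same loop would separate the effect of job overlap from the effect of commit-per-line output.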

grondo added a commit to grondo/flux-core that referenced this issue Feb 5, 2019
The wreck exec system is worthless, remove it along with associated
commands, tests, and support code.

Since libjsc doesn't work without wreck, it is removed as well.

Fixes flux-framework#1984
Closes flux-framework#1947
Closes flux-framework#1618
Closes flux-framework#1595
Closes flux-framework#1593
Closes flux-framework#1468
Closes flux-framework#1438
Closes flux-framework#1419
Closes flux-framework#1410
Closes flux-framework#915
Closes flux-framework#894
Closes flux-framework#866
Closes flux-framework#833
Closes flux-framework#774
Closes flux-framework#772
Closes flux-framework#335
Closes flux-framework#249
grondo added a commit to grondo/flux-core that referenced this issue Feb 5, 2019
The wreck exec system is worthless, remove it along with associated
commands, tests, and support code.

Since libjsc doesn't work without wreck, it is removed as well.

Fixes flux-framework#1984

Closes flux-framework#1947
Closes flux-framework#1618
Closes flux-framework#1595
Closes flux-framework#1593
Closes flux-framework#1534
Closes flux-framework#1468
Closes flux-framework#1443
Closes flux-framework#1438
Closes flux-framework#1419
Closes flux-framework#1410
Closes flux-framework#1407
Closes flux-framework#1393
Closes flux-framework#915
Closes flux-framework#894
Closes flux-framework#866
Closes flux-framework#833
Closes flux-framework#774
Closes flux-framework#772
Closes flux-framework#335
Closes flux-framework#249
grondo added a commit to grondo/flux-core that referenced this issue Feb 9, 2019
@grondo
Contributor Author

grondo commented Feb 13, 2019

closed by #1988

@grondo grondo closed this as completed Feb 13, 2019
chu11 pushed a commit to chu11/flux-core that referenced this issue Feb 13, 2019