fluxion can't restart with queues enabled #1035
I'll see if I can make a standalone reproducer a bit later.
@grondo, do you happen to have a testcase or reproducer that will produce this issue, or the original one? I'm looking over the code, and I think we must have an issue with the order of init or config, because the queue map is definitely populated in the normal load path, so something must change on restart that sched doesn't expect.
Yes, let me see if I can reproduce this in a test instance.
Well, I couldn't reproduce the same issue, but this script causes fluxion to segfault (backtrace below):

```bash
#!/bin/bash
flux module remove sched-fluxion-qmanager
flux module remove sched-fluxion-resource
flux module remove resource
flux kvs get resource.R \
| flux R set-property batch:0-1 debug:2-3 \
| flux kvs put -r resource.R=-
flux kvs get resource.R | jq
flux config load <<EOF
[queues.debug]
requires = ["debug"]
[queues.batch]
requires = ["batch"]
[sched-fluxion-resource]
match-format = "rv1"
EOF
flux config get | jq
flux module load resource noverify
flux module load sched-fluxion-resource
flux module load sched-fluxion-qmanager
flux queue start --all
flux queue status
flux resource list
flux dmesg | grep version
flux submit --wait-event=start --queue=debug sleep inf
flux submit --wait-event=start --queue=debug sleep inf
flux module unload sched-fluxion-qmanager
flux module reload sched-fluxion-resource
flux module load sched-fluxion-qmanager
flux jobs -a
#flux job status -vv $(flux job last 2)
```

Run with `flux start -s 4 ./issue#1035.sh`. Output:

```
{
"version": 1,
"execution": {
"R_lite": [
{
"rank": "0-3",
"children": {
"core": "0-47"
}
}
],
"starttime": 0.0,
"expiration": 0.0,
"nodelist": [
"corona[82,82,82,82]"
],
"properties": {
"batch": "0-1",
"debug": "2-3"
}
}
}
{
"queues": {
"debug": {
"requires": [
"debug"
]
},
"batch": {
"requires": [
"batch"
]
}
},
"sched-fluxion-resource": {
"match-format": "rv1"
}
}
batch: Scheduling is started
debug: Scheduling is started
batch: Job submission is enabled
batch: Scheduling is started
debug: Job submission is enabled
debug: Scheduling is started
STATE QUEUE NNODES NCORES NODELIST
free batch 2 96 corona[82,82]
free debug 2 96 corona[82,82]
allocated 0 0
down 0 0
2023-06-13T21:30:21.314232Z sched-fluxion-resource.info[0]: version 0.27.0
2023-06-13T21:30:21.387592Z sched-fluxion-qmanager.info[0]: version 0.27.0
2023-06-13T21:30:24.388589Z sched-fluxion-resource.info[0]: version 0.27.0
2023-06-13T21:30:24.455246Z sched-fluxion-qmanager.info[0]: version 0.27.0
f3EczD3d
f3LrjBom
flux-start: 0 (pid 3726199) Segmentation fault
flux-jobs: ERROR: Unable to connect to Flux: Connection refused
```

```
(gdb) bt
#0 planner_avail_resources_at (ctx=<optimized out>, at=<optimized out>)
at planner.cpp:527
#1 0x000015552d99c15f in Flux::resource_model::detail::dfu_impl_t::upd_by_outedges (this=0x155508005de0, subsystem="containment", jobmeta=...,
u=<optimized out>, e=...) at traversers/dfu_impl_update.cpp:123
#2 0x000015552d99eb61 in Flux::resource_model::detail::dfu_impl_t::upd_dfv (
this=<optimized out>, u=<optimized out>, writers=...,
needs=<optimized out>, excl=<optimized out>, jobmeta=...,
full=<optimized out>, to_parent=..., emit_shadow=<optimized out>)
at /usr/include/boost/graph/detail/edge.hpp:41
#3 0x000015552d99ef01 in Flux::resource_model::detail::dfu_impl_t::upd_dfv (
this=<optimized out>, u=<optimized out>, writers=...,
needs=<optimized out>, excl=<optimized out>, jobmeta=...,
full=<optimized out>, to_parent=..., emit_shadow=<optimized out>)
at traversers/dfu_impl_update.cpp:328
#4 0x000015552d9a033a in Flux::resource_model::detail::dfu_impl_t::update (
this=<optimized out>, root=0,
writers=std::shared_ptr<Flux::resource_model::match_writers_t> (use count 1, weak count 0) = {...}, str=..., reader=..., jobmeta=...)
at traversers/dfu_impl_update.cpp:627
#5 0x000015552d98abb0 in Flux::resource_model::dfu_traverser_t::run (
this=0x155508005de0,
str="{\"graph\": {\"nodes\": [{\"id\": \"196\", \"metadata\": {\"type\": \"core\", \"basename\": \"core\", \"name\": \"core47\", \"id\": 47, \"uniq_id\": 196, \"rank\": 3, \"exclusive\": true, \"unit\": \"\", \"size\": 1, \"paths\": {\"containm"...,
writers=std::shared_ptr<Flux::resource_model::match_writers_t> (use count 1, weak count 0) = {...},
reader=std::shared_ptr<Flux::resource_model::resource_reader_base_t> (use count 1, weak count 0) = {...}, id=85714796544, at=1686691577,
duration=3153600000) at traversers/dfu.cpp:357
#6 0x000015552d96c766 in run (duration=3153600000, at=1686691577,
jgf="{\"graph\": {\"nodes\": [{\"id\": \"196\", \"metadata\": {\"type\": \"core\", \"basename\": \"core\", \"name\": \"core47\", \"id\": 47, \"uniq_id\": 196, \"rank\": 3, \"exclusive\": true, \"unit\": \"\", \"size\": 1, \"paths\": {\"containm"..., jobid=85714796544,
ctx=std::shared_ptr<resource_ctx_t> (use count 2, weak count 0) = {...})
at resource_match.cpp:1679
#7 run_update (o=..., ov=<synthetic pointer>: <optimized out>,
at=<synthetic pointer>: <optimized out>, R=<optimized out>,
jobid=85714796544,
ctx=std::shared_ptr<resource_ctx_t> (use count 2, weak count 0) = {...})
at resource_match.cpp:1765
#8 update_request_cb (h=0x1555080019e0, w=<optimized out>,
msg=0x1555080df910, arg=<optimized out>) at resource_match.cpp:1835
#9 0x00001555550b2077 in call_handler (mh=0x15550800e3b0,
msg=msg@entry=0x1555080df910) at msg_handler.c:345
#10 0x00001555550b26ab in dispatch_message (type=1, msg=0x1555080df910,
d=0x155508010180) at msg_handler.c:381
#11 handle_cb (r=0x155508009f80, hw=<optimized out>, revents=<optimized out>,
arg=0x155508010180) at msg_handler.c:482
#12 0x00001555550e5a03 in ev_invoke_pending (loop=0x15550800f860) at ev.c:3770
#13 0x00001555550e9aa8 in ev_run (flags=0, loop=0x15550800f860) at ev.c:4190
#14 ev_run (loop=0x15550800f860, flags=0) at ev.c:4021
#15 0x00001555550b112f in flux_reactor_run (r=0x155508009f80,
flags=flags@entry=0) at reactor.c:128
#16 0x000015552d97034e in mod_main (h=0x1555080019e0, argc=<optimized out>,
argv=0x155508012d90) at resource_match.cpp:2670
#17 0x00000000004131d1 in module_thread (arg=0x700cf0) at module.c:183
#18 0x0000155554e791ca in start_thread () from /lib64/libpthread.so.0
#19 0x00001555537ece73 in clone () from /lib64/libc.so.6
```
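The frames above (`update_request_cb` → `run_update` → `dfu_traverser_t::run` → `planner_avail_resources_at`) show the crash happening while `sched-fluxion-resource` processes an update for one of the still-running jobs after the reload. Below is the trigger distilled from the script above; the comments are my reading of the backtrace and script, not a confirmed root-cause analysis:

```bash
# Two jobs are still running in the debug queue, with match-format = "rv1".
flux module unload sched-fluxion-qmanager
# Reloading the resource module discards its in-memory graph and planner state.
flux module reload sched-fluxion-resource
# When qmanager is loaded again, R for the running jobs is re-sent to
# sched-fluxion-resource (the update request seen in frame #8), and the
# update traversal segfaults in planner_avail_resources_at().
flux module load sched-fluxion-qmanager
```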
Problem: There is no reproducer for issue flux-framework#1035 (fluxion can't restart with queues enabled). Add a new test driver for issue reproducers, t5100-issues-test-driver.t, then add a reproducer script for flux-framework#1035 to the t/issues subdirectory.
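A minimal sketch of what such a driver could look like, assuming the sharness conventions used elsewhere in the flux-sched t/ directory; the instance size, the sharness.sh path, and the glob over t/issues are assumptions, not the committed implementation:

```bash
#!/bin/bash
#
# Run every standalone reproducer script in t/issues under its own
# small test instance; a nonzero exit from a script fails its test.

test_description='run issue reproducer scripts'

. $(dirname $0)/sharness.sh

for script in ${SHARNESS_TEST_SRCDIR}/issues/*.sh; do
    testname=$(basename $script)
    test_expect_success "$testname" "flux start -s 4 $script"
done

test_done
```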
Problem: If the Fluxion modules are reloaded and queues are enabled, running jobs are terminated with the following error (from #991):
This is using `match-format = "rv1"`.
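For reference, the triggering combination is queue definitions together with the rv1 match format; this is the same configuration the reproducer script above loads (the queue names and property requirements are just the ones used there):

```bash
flux config load <<EOF
[queues.debug]
requires = ["debug"]

[queues.batch]
requires = ["batch"]

[sched-fluxion-resource]
match-format = "rv1"
EOF
```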