fluxion can't restart with queues enabled #1035

grondo · 2023-06-08T21:49:15Z

Problem: If the Fluxion modules are reloaded while queues are enabled, running jobs are terminated with the following error (from #991):

[  +9.794686] job-manager[0]: scheduler: hello
[  +9.797168] sched-fluxion-qmanager[0]: jobmanager_hello_cb: ENOENT: map::at: No such file or directory
[  +9.797231] sched-fluxion-qmanager[0]: raising fatal exception on running job id=63504667937603584

This is using match-format = "rv1".

grondo · 2023-06-08T21:58:22Z

I'll see if I can make a standalone reproducer a bit later.

trws · 2023-06-13T20:29:50Z

@grondo, do you happen to have a test case or reproducer that triggers this issue, or the original one? Looking over the code, I think we must have a problem with the order of init or config, because the queue map is definitely populated in the normal path and load order. Something must change on restart that sched doesn't expect.
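
For context on the error text: in libstdc++, std::map::at throws std::out_of_range with the message "map::at" when the key is missing, and "No such file or directory" is the standard description of ENOENT. That would be consistent with a lookup of a queue name that is not in the map when the hello callback runs. Below is a minimal sketch of that failure mode, not the actual qmanager code: queue_state_t and handle_hello are hypothetical names made up for illustration, and translating the exception to ENOENT is an assumption based only on the log line.

```cpp
#include <cerrno>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for per-queue scheduler state; not Fluxion's real type.
struct queue_state_t {
    int running_jobs = 0;
};

// Hypothetical hello handler: records a running job against its queue.
// Returns 0 on success, or -1 with errno set to ENOENT when the queue name
// is not present in the map (assumed translation, inferred from the log).
int handle_hello (std::map<std::string, queue_state_t> &queues,
                  const std::string &queue_name)
{
    try {
        // at() throws std::out_of_range if queue_name is not a key.
        queues.at (queue_name).running_jobs++;
        return 0;
    } catch (const std::out_of_range &) {
        errno = ENOENT;
        return -1;
    }
}
```

With a map that holds only "batch", calling handle_hello for a job in "debug" fails with ENOENT, while the same call for "batch" succeeds. If the queue map were populated later than the hello handshake on restart, every running job in a queue would take the failing path.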

grondo · 2023-06-13T20:36:34Z

Yes, let me see if I can reproduce this in a test instance.

grondo · 2023-06-13T21:31:35Z

Well, I couldn't reproduce the same issue, but this script causes fluxion to segfault in planner.cpp. It could be related?

#!/bin/bash
flux module remove sched-fluxion-qmanager
flux module remove sched-fluxion-resource
flux module remove resource
flux kvs get resource.R \
	| flux R set-property batch:0-1 debug:2-3 \
	| flux kvs put -r resource.R=-
flux kvs get resource.R | jq
flux config load <<EOF
[queues.debug]
requires = ["debug"]

[queues.batch]
requires = ["batch"]

[sched-fluxion-resource]
match-format = "rv1"
EOF
flux config get | jq
flux module load resource noverify
flux module load sched-fluxion-resource
flux module load sched-fluxion-qmanager
flux queue start --all
flux queue status
flux resource list
flux dmesg | grep version

flux submit --wait-event=start --queue=debug sleep inf
flux submit --wait-event=start --queue=debug sleep inf

flux module unload sched-fluxion-qmanager
flux module reload sched-fluxion-resource
flux module load sched-fluxion-qmanager
flux jobs -a
#flux job status -vv $(flux job last 2)

Run with flux start -s4 issue#1035.sh:

$ flux start  -s 4 ./issue#1035.sh 
{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0-3",
        "children": {
          "core": "0-47"
        }
      }
    ],
    "starttime": 0.0,
    "expiration": 0.0,
    "nodelist": [
      "corona[82,82,82,82]"
    ],
    "properties": {
      "batch": "0-1",
      "debug": "2-3"
    }
  }
}
{
  "queues": {
    "debug": {
      "requires": [
        "debug"
      ]
    },
    "batch": {
      "requires": [
        "batch"
      ]
    }
  },
  "sched-fluxion-resource": {
    "match-format": "rv1"
  }
}
batch: Scheduling is started
debug: Scheduling is started
batch: Job submission is enabled
batch: Scheduling is started
debug: Job submission is enabled
debug: Scheduling is started
     STATE QUEUE      NNODES   NCORES NODELIST
      free batch           2       96 corona[82,82]
      free debug           2       96 corona[82,82]
 allocated                 0        0 
      down                 0        0 
2023-06-13T21:30:21.314232Z sched-fluxion-resource.info[0]: version 0.27.0
2023-06-13T21:30:21.387592Z sched-fluxion-qmanager.info[0]: version 0.27.0
2023-06-13T21:30:24.388589Z sched-fluxion-resource.info[0]: version 0.27.0
2023-06-13T21:30:24.455246Z sched-fluxion-qmanager.info[0]: version 0.27.0
f3EczD3d
f3LrjBom
flux-start: 0 (pid 3726199) Segmentation fault
flux-jobs: ERROR: Unable to connect to Flux: Connection refused

(gdb) bt
#0  planner_avail_resources_at (ctx=<optimized out>, at=<optimized out>)
    at planner.cpp:527
#1  0x000015552d99c15f in Flux::resource_model::detail::dfu_impl_t::upd_by_outedges (this=0x155508005de0, subsystem="containment", jobmeta=..., 
    u=<optimized out>, e=...) at traversers/dfu_impl_update.cpp:123
#2  0x000015552d99eb61 in Flux::resource_model::detail::dfu_impl_t::upd_dfv (
    this=<optimized out>, u=<optimized out>, writers=..., 
    needs=<optimized out>, excl=<optimized out>, jobmeta=..., 
    full=<optimized out>, to_parent=..., emit_shadow=<optimized out>)
    at /usr/include/boost/graph/detail/edge.hpp:41
#3  0x000015552d99ef01 in Flux::resource_model::detail::dfu_impl_t::upd_dfv (
    this=<optimized out>, u=<optimized out>, writers=..., 
    needs=<optimized out>, excl=<optimized out>, jobmeta=..., 
    full=<optimized out>, to_parent=..., emit_shadow=<optimized out>)
    at traversers/dfu_impl_update.cpp:328
#4  0x000015552d9a033a in Flux::resource_model::detail::dfu_impl_t::update (
    this=<optimized out>, root=0, 
    writers=std::shared_ptr<Flux::resource_model::match_writers_t> (use count 1, weak count 0) = {...}, str=..., reader=..., jobmeta=...)
    at traversers/dfu_impl_update.cpp:627
#5  0x000015552d98abb0 in Flux::resource_model::dfu_traverser_t::run (
    this=0x155508005de0, 
    str="{\"graph\": {\"nodes\": [{\"id\": \"196\", \"metadata\": {\"type\": \"core\", \"basename\": \"core\", \"name\": \"core47\", \"id\": 47, \"uniq_id\": 196, \"rank\": 3, \"exclusive\": true, \"unit\": \"\", \"size\": 1, \"paths\": {\"containm"..., 
    writers=std::shared_ptr<Flux::resource_model::match_writers_t> (use count 1, weak count 0) = {...}, 
    reader=std::shared_ptr<Flux::resource_model::resource_reader_base_t> (use count 1, weak count 0) = {...}, id=85714796544, at=1686691577, 
    duration=3153600000) at traversers/dfu.cpp:357
#6  0x000015552d96c766 in run (duration=3153600000, at=1686691577, 
    jgf="{\"graph\": {\"nodes\": [{\"id\": \"196\", \"metadata\": {\"type\": \"core\", \"basename\": \"core\", \"name\": \"core47\", \"id\": 47, \"uniq_id\": 196, \"rank\": 3, \"exclusive\": true, \"unit\": \"\", \"size\": 1, \"paths\": {\"containm"..., jobid=85714796544, 
    ctx=std::shared_ptr<resource_ctx_t> (use count 2, weak count 0) = {...})
    at resource_match.cpp:1679
#7  run_update (o=..., ov=<synthetic pointer>: <optimized out>, 
    at=<synthetic pointer>: <optimized out>, R=<optimized out>, 
    jobid=85714796544, 
    ctx=std::shared_ptr<resource_ctx_t> (use count 2, weak count 0) = {...})
    at resource_match.cpp:1765
#8  update_request_cb (h=0x1555080019e0, w=<optimized out>, 
    msg=0x1555080df910, arg=<optimized out>) at resource_match.cpp:1835
#9  0x00001555550b2077 in call_handler (mh=0x15550800e3b0, 
    msg=msg@entry=0x1555080df910) at msg_handler.c:345
#10 0x00001555550b26ab in dispatch_message (type=1, msg=0x1555080df910, 
    d=0x155508010180) at msg_handler.c:381
#11 handle_cb (r=0x155508009f80, hw=<optimized out>, revents=<optimized out>, 
    arg=0x155508010180) at msg_handler.c:482
#12 0x00001555550e5a03 in ev_invoke_pending (loop=0x15550800f860) at ev.c:3770
#13 0x00001555550e9aa8 in ev_run (flags=0, loop=0x15550800f860) at ev.c:4190
#14 ev_run (loop=0x15550800f860, flags=0) at ev.c:4021
#15 0x00001555550b112f in flux_reactor_run (r=0x155508009f80, 
    flags=flags@entry=0) at reactor.c:128
#16 0x000015552d97034e in mod_main (h=0x1555080019e0, argc=<optimized out>, 
    argv=0x155508012d90) at resource_match.cpp:2670
#17 0x00000000004131d1 in module_thread (arg=0x700cf0) at module.c:183
#18 0x0000155554e791ca in start_thread () from /lib64/libpthread.so.0
#19 0x00001555537ece73 in clone () from /lib64/libc.so.6

Problem: There is no reproducer for issue flux-framework#1035: fluxion can't restart with queues enabled.

Add a new test driver for issue reproducers: t5100-issues-test-driver.t. Then add a reproducer script for flux-framework#1035 to the t/issues subdirectory.

trws mentioned this issue Jun 20, 2023

planner: ensure result in planner_avail_resources_at #1038

Merged

mergify bot closed this as completed in #1038 Jun 23, 2023
