fluxion can't restart with queues enabled #1035

Closed
grondo opened this issue Jun 8, 2023 · 4 comments · Fixed by #1038

Comments

Contributor

grondo commented Jun 8, 2023

Problem: If the Fluxion modules are reloaded, and queues are enabled, running jobs will be terminated with the following error (from #991):

[  +9.794686] job-manager[0]: scheduler: hello
[  +9.797168] sched-fluxion-qmanager[0]: jobmanager_hello_cb: ENOENT: map::at: No such file or directory
[  +9.797231] sched-fluxion-qmanager[0]: raising fatal exception on running job id=63504667937603584

This occurs with match-format = "rv1".

Contributor Author

grondo commented Jun 8, 2023

I'll see if I can make a standalone reproducer a bit later.

Member

trws commented Jun 13, 2023

@grondo, do you happen to have a test case or reproducer that triggers this issue, or the original one? Looking over the code, I think we must have a problem with the order of initialization or configuration: the queue map is definitely populated in the normal path and load order, so something must change on restart that sched doesn't expect.
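
For readers unfamiliar with the `map::at` text in the log above, here is a minimal sketch of how a lookup in an unpopulated queue map could surface as an ENOENT-style error. This is not Fluxion's code: `queue_policy` and `lookup_queue` are hypothetical names invented for illustration, and the translation of the exception into `errno = ENOENT` is an assumption based only on the shape of the log line. The only facts relied on are that `std::map::at` throws `std::out_of_range` for a missing key and that `ENOENT` is the standard "no such entry" error code.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for per-queue scheduler state; not a Fluxion type.
struct queue_policy {
    int pending = 0;
};

// Look up a queue by name (hypothetical helper).
// Returns 0 if the queue exists. Returns -1 with errno set to ENOENT and
// errmsg filled in if the name is missing from the map.
int lookup_queue (const std::map<std::string, queue_policy> &queues,
                  const std::string &name,
                  std::string &errmsg)
{
    // An empty name is treated as a caller bug in this sketch.
    assert (!name.empty ());
    try {
        // std::map::at throws std::out_of_range when the key is absent.
        (void)queues.at (name);
        return 0;
    } catch (const std::out_of_range &e) {
        // The what() text is implementation-defined (libstdc++ uses "map::at").
        errmsg = std::string (e.what ()) + ": " + std::strerror (ENOENT);
        errno = ENOENT;
        return -1;
    }
}
```

Calling `lookup_queue (queues, "debug", errmsg)` on a map that only contains `"batch"` returns -1 with `errno == ENOENT`. If a restart path ran the hello callback before the queue map was filled in, every running job's queue lookup would fail this way, which would be consistent with the ordering hypothesis above.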

Contributor Author

grondo commented Jun 13, 2023

Yes, let me see if I can reproduce this in a test instance.

Contributor Author

grondo commented Jun 13, 2023

Well, I couldn't reproduce the same issue, but this script causes fluxion to segfault in planner.cpp. It could be related?

#!/bin/bash
# Unload the scheduler and resource modules so resource.R can be rewritten.
flux module remove sched-fluxion-qmanager
flux module remove sched-fluxion-resource
flux module remove resource
# Tag ranks 0-1 with the "batch" property and ranks 2-3 with "debug".
flux kvs get resource.R \
	| flux R set-property batch:0-1 debug:2-3 \
	| flux kvs put -r resource.R=-
flux kvs get resource.R | jq
# Define one queue per property and select the rv1 match format.
flux config load <<EOF
[queues.debug]
requires = ["debug"]

[queues.batch]
requires = ["batch"]

[sched-fluxion-resource]
match-format = "rv1"
EOF
flux config get | jq
# Reload the modules with the new resource set and configuration.
flux module load resource noverify
flux module load sched-fluxion-resource
flux module load sched-fluxion-qmanager
flux queue start --all
flux queue status
flux resource list
flux dmesg | grep version

# Start two jobs in the debug queue and wait until each is running.
flux submit --wait-event=start --queue=debug sleep inf
flux submit --wait-event=start --queue=debug sleep inf

# Reload Fluxion while the jobs are still running.
flux module unload sched-fluxion-qmanager
flux module reload sched-fluxion-resource
flux module load sched-fluxion-qmanager
flux jobs -a
#flux job status -vv $(flux job last 2)

Run with flux start -s4 issue#1035.sh:

$ flux start  -s 4 ./issue#1035.sh 
{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0-3",
        "children": {
          "core": "0-47"
        }
      }
    ],
    "starttime": 0.0,
    "expiration": 0.0,
    "nodelist": [
      "corona[82,82,82,82]"
    ],
    "properties": {
      "batch": "0-1",
      "debug": "2-3"
    }
  }
}
{
  "queues": {
    "debug": {
      "requires": [
        "debug"
      ]
    },
    "batch": {
      "requires": [
        "batch"
      ]
    }
  },
  "sched-fluxion-resource": {
    "match-format": "rv1"
  }
}
batch: Scheduling is started
debug: Scheduling is started
batch: Job submission is enabled
batch: Scheduling is started
debug: Job submission is enabled
debug: Scheduling is started
     STATE QUEUE      NNODES   NCORES NODELIST
      free batch           2       96 corona[82,82]
      free debug           2       96 corona[82,82]
 allocated                 0        0 
      down                 0        0 
2023-06-13T21:30:21.314232Z sched-fluxion-resource.info[0]: version 0.27.0
2023-06-13T21:30:21.387592Z sched-fluxion-qmanager.info[0]: version 0.27.0
2023-06-13T21:30:24.388589Z sched-fluxion-resource.info[0]: version 0.27.0
2023-06-13T21:30:24.455246Z sched-fluxion-qmanager.info[0]: version 0.27.0
f3EczD3d
f3LrjBom
flux-start: 0 (pid 3726199) Segmentation fault
flux-jobs: ERROR: Unable to connect to Flux: Connection refused
(gdb) bt
#0  planner_avail_resources_at (ctx=<optimized out>, at=<optimized out>)
    at planner.cpp:527
#1  0x000015552d99c15f in Flux::resource_model::detail::dfu_impl_t::upd_by_outedges (this=0x155508005de0, subsystem="containment", jobmeta=..., 
    u=<optimized out>, e=...) at traversers/dfu_impl_update.cpp:123
#2  0x000015552d99eb61 in Flux::resource_model::detail::dfu_impl_t::upd_dfv (
    this=<optimized out>, u=<optimized out>, writers=..., 
    needs=<optimized out>, excl=<optimized out>, jobmeta=..., 
    full=<optimized out>, to_parent=..., emit_shadow=<optimized out>)
    at /usr/include/boost/graph/detail/edge.hpp:41
#3  0x000015552d99ef01 in Flux::resource_model::detail::dfu_impl_t::upd_dfv (
    this=<optimized out>, u=<optimized out>, writers=..., 
    needs=<optimized out>, excl=<optimized out>, jobmeta=..., 
    full=<optimized out>, to_parent=..., emit_shadow=<optimized out>)
    at traversers/dfu_impl_update.cpp:328
#4  0x000015552d9a033a in Flux::resource_model::detail::dfu_impl_t::update (
    this=<optimized out>, root=0, 
    writers=std::shared_ptr<Flux::resource_model::match_writers_t> (use count 1, weak count 0) = {...}, str=..., reader=..., jobmeta=...)
    at traversers/dfu_impl_update.cpp:627
#5  0x000015552d98abb0 in Flux::resource_model::dfu_traverser_t::run (
    this=0x155508005de0, 
    str="{\"graph\": {\"nodes\": [{\"id\": \"196\", \"metadata\": {\"type\": \"core\", \"basename\": \"core\", \"name\": \"core47\", \"id\": 47, \"uniq_id\": 196, \"rank\": 3, \"exclusive\": true, \"unit\": \"\", \"size\": 1, \"paths\": {\"containm"..., 
    writers=std::shared_ptr<Flux::resource_model::match_writers_t> (use count 1, weak count 0) = {...}, 
    reader=std::shared_ptr<Flux::resource_model::resource_reader_base_t> (use count 1, weak count 0) = {...}, id=85714796544, at=1686691577, 
    duration=3153600000) at traversers/dfu.cpp:357
#6  0x000015552d96c766 in run (duration=3153600000, at=1686691577, 
    jgf="{\"graph\": {\"nodes\": [{\"id\": \"196\", \"metadata\": {\"type\": \"core\", \"basename\": \"core\", \"name\": \"core47\", \"id\": 47, \"uniq_id\": 196, \"rank\": 3, \"exclusive\": true, \"unit\": \"\", \"size\": 1, \"paths\": {\"containm"..., jobid=85714796544, 
    ctx=std::shared_ptr<resource_ctx_t> (use count 2, weak count 0) = {...})
    at resource_match.cpp:1679
#7  run_update (o=..., ov=<synthetic pointer>: <optimized out>, 
    at=<synthetic pointer>: <optimized out>, R=<optimized out>, 
    jobid=85714796544, 
    ctx=std::shared_ptr<resource_ctx_t> (use count 2, weak count 0) = {...})
    at resource_match.cpp:1765
#8  update_request_cb (h=0x1555080019e0, w=<optimized out>, 
    msg=0x1555080df910, arg=<optimized out>) at resource_match.cpp:1835
#9  0x00001555550b2077 in call_handler (mh=0x15550800e3b0, 
    msg=msg@entry=0x1555080df910) at msg_handler.c:345
#10 0x00001555550b26ab in dispatch_message (type=1, msg=0x1555080df910, 
    d=0x155508010180) at msg_handler.c:381
#11 handle_cb (r=0x155508009f80, hw=<optimized out>, revents=<optimized out>, 
    arg=0x155508010180) at msg_handler.c:482
#12 0x00001555550e5a03 in ev_invoke_pending (loop=0x15550800f860) at ev.c:3770
#13 0x00001555550e9aa8 in ev_run (flags=0, loop=0x15550800f860) at ev.c:4190
#14 ev_run (loop=0x15550800f860, flags=0) at ev.c:4021
#15 0x00001555550b112f in flux_reactor_run (r=0x155508009f80, 
    flags=flags@entry=0) at reactor.c:128
#16 0x000015552d97034e in mod_main (h=0x1555080019e0, argc=<optimized out>, 
    argv=0x155508012d90) at resource_match.cpp:2670
#17 0x00000000004131d1 in module_thread (arg=0x700cf0) at module.c:183
#18 0x0000155554e791ca in start_thread () from /lib64/libpthread.so.0
#19 0x00001555537ece73 in clone () from /lib64/libc.so.6

grondo added a commit to grondo/flux-sched that referenced this issue Jun 21, 2023
Problem: There is no reproducer for issue flux-framework#1035: fluxion can't restart
with queues enabled.

Add a new test driver for issue reproducers:

 t5100-issues-test-driver.t

Then add a reproducer script for flux-framework#1035 to the t/issues subdirectory.
@mergify mergify bot closed this as completed in #1038 Jun 23, 2023
vsoch pushed a commit to researchapps/flux-sched that referenced this issue Jul 7, 2023