vbp: Rework probe state management #4115

Merged: 4 commits merged into varnishcache:master from vbp_rework on Jul 15, 2024

Conversation

@nigoroll (Member) commented Jun 3, 2024

I will be away until July, but I would appreciate reviews and collection of feedback in the meantime. Also, in the unlikely event that everyone is absolutely happy with it, I would not oppose a merge.


Every time I looked at the probe code in the past, my mind ended up twisted and confused. A probe could change the "enabled" state (tracking the temperature) or be removed at any time (unless the mtx is held), yet the code did not seem to reflect this.

We un-twist my mind by implementing probe states:

  • cold: reflects cold backend temperature
  • scheduled: probe is on binheap, waiting for its time to come
  • running: a task has been started to run the probe
  • cooling: running probe disabled
  • deleted: running probe removed (while cooling only)
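
A minimal sketch, purely for illustration, of one way the states above could be modelled in C; the struct layout and the VBP_STATE macro are assumptions of this sketch, not the actual cache_backend_probe.c definitions (the diffs further down only show that a state carries a printable name, vt->state->name):

/* Hypothetical representation of the probe states listed above. */
struct vbp_state {
	const char	*name;
};

#define VBP_STATE(n) \
	static const struct vbp_state vbp_state_ ## n[1] = {{ #n }}

VBP_STATE(cold);	/* reflects cold backend temperature */
VBP_STATE(scheduled);	/* on the binheap, waiting for its due time */
VBP_STATE(running);	/* a task has been started to run the probe */
VBP_STATE(cooling);	/* probe disabled while its task is running */
VBP_STATE(deleted);	/* probe removed while cooling */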

With this new scheme, we can now have (I think) a clean state diagram (see dot file):

(state diagram: doc/graphviz/cache_backend_probe.dot)

  • a probe begins in the cold state
  • from cold, it can either get removed or scheduled via VBP_Control(..., 1)
  • from scheduled, it can go back to cold (via VBP_Control(..., 0)) or be picked up by vbp_thread() to change to running (aka task started)
  • once the task finishes, it normally goes back to scheduled, but in the meantime it could have changed to cooling or further on to deleted, so vbp_task_complete() handles these cases and either transitions to cold or deletes the probe
  • if the task cannot be scheduled, the same handling happens
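
As a rough illustration only, the control-side transitions described above could look roughly like this, reusing the struct and state constants from the earlier sketch; the function and heap helper names are assumptions of this sketch, not the actual VBP_Control() code:

#include <assert.h>

struct vbp_target {			/* minimal stand-in for the real struct */
	const struct vbp_state	*state;
};

/* hypothetical helpers standing in for the binheap operations */
static void probe_heap_insert(struct vbp_target *vt) { (void)vt; }
static void probe_heap_delete(struct vbp_target *vt) { (void)vt; }

static void
probe_control(struct vbp_target *vt, int enable)
{
	/* the caller is assumed to hold the probe mtx */
	if (enable) {
		/* cold -> scheduled */
		assert(vt->state == vbp_state_cold);
		vt->state = vbp_state_scheduled;
		probe_heap_insert(vt);
	} else if (vt->state == vbp_state_scheduled) {
		/* scheduled -> cold */
		probe_heap_delete(vt);
		vt->state = vbp_state_cold;
	} else {
		/* running -> cooling: task completion finishes the transition */
		assert(vt->state == vbp_state_running);
		vt->state = vbp_state_cooling;
	}
}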

We now also remove running probes from the binheap to reduce complexity. I am not entirely sure whether there was a good reason for keeping running probes on the binheap; if there was, we might want to reconsider this change. But it is not obvious to me how deleting and reinserting a probe now, only to delete and reinsert it again later, would be better than simply deleting it now and reinserting it later.

Written to fix #4108 for good

@dridi (Member) left a comment

After a first review this looks good to me.

What I really appreciate is the first commit acting as an assessment of the current informal state machine, with a diff showing the root cause of #4108 in the third commit.

I have one nitpick and a cosmetic suggestion.

(resolved review comments on bin/varnishd/cache/cache_backend_probe.c and doc/graphviz/cache_backend_probe.dot)
@nigoroll (Member, Author) commented Jul 8, 2024

@dridi I took your suggestions for the next force-push, thank you!

(resolved review comments on bin/varnishd/cache/cache_backend_probe.c)
@nigoroll (Member, Author) commented Jul 8, 2024

another dridification round done

@nigoroll (Member, Author) commented Jul 9, 2024

During extended bugwash, @bsdphk suggested adding a cluster subgraph to the graphviz file to mark the possible states while a probe task is running, but I could not turn it into an improvement in clarity:

diff --git a/doc/graphviz/cache_backend_probe.dot b/doc/graphviz/cache_backend_probe.dot
index 9f289b162..6b93e2b6b 100644
--- a/doc/graphviz/cache_backend_probe.dot
+++ b/doc/graphviz/cache_backend_probe.dot
@@ -3,9 +3,11 @@
 digraph cache_backend_probe {
        ALLOC
        scheduled
-       running
+       subgraph cluster_vbptask {
+               running
+               cooling # going cold while task runs
+       }
        cold
-       cooling # going cold while task runs
        deleted # from cooling, removed while task runs
        FREE
 

[rendered graph of the state diagram with the cluster subgraph]

I did add a commit with a rename he suggested.

@nigoroll (Member, Author) commented:

ftr, this was approved by today's bugwash

@nigoroll (Member, Author) commented:

FTR: The first dridification round had introduced a regression. The change was:

@@ -457,20 +457,16 @@ vbp_task_complete(struct vbp_target *vt)
 
        assert(vt->heap_idx == VBH_NOIDX);
 
-       if (vt->state == vbp_state_scheduled) {
-               WRONG("vbp_state_scheduled");
-       } else if (vt->state == vbp_state_running) {
+       if (vt->state == vbp_state_running) {
                vt->state = vbp_state_scheduled;
                vt->due = VTIM_real() + vt->interval;
                vbp_heap_insert(vt);
                vt = NULL;
-       } else if (vt->state == vbp_state_cold) {
-               WRONG("vbp_state_cold");
        } else if (vt->state == vbp_state_cooling) {
                vt->state = vbp_state_cold;
                vt = NULL;
        } else {
-               assert(vt->state == vbp_state_deleted);
+               WRONG(vt->state->name);
        }
        return (vt);
 }

Here, I failed to keep accepting vbp_state_deleted as valid.
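
The fix presumably restores an explicit branch for that state ahead of the WRONG() catch-all; a sketch of how the tail of vbp_task_complete() would then read (the actual follow-up commit may differ):

	} else if (vt->state == vbp_state_cooling) {
		vt->state = vbp_state_cold;
		vt = NULL;
	} else if (vt->state == vbp_state_deleted) {
		/* removed while cooling: return vt so the caller frees it */
	} else {
		WRONG(vt->state->name);
	}
	return (vt);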

Every time I looked at the probe code, my mind ended up twisted and confused. A
probe could change the "enabled" state (tracking the temperature) and be removed
at any time (unless the mtx is held), yet the code did not seem to reflect this.

We un-twist my mind by completing the transition to probe states and adding a
chain of two states for the case that a probe is controlled/deleted while its
task is running:

cooling: running probe disabled
deleted: running probe removed (while cooling only)

With this new scheme, we can now have (I think) a clean state diagram (see dot
file):

- a probe begins in the cold state
- from cold, it can either get removed or scheduled via VBP_Control()
- from scheduled, it can go back to cold (via VBP_Control()) or
  be picked up by vbp_thread() to change to running (aka task started)
- once the task finishes, it normally goes back to scheduled,
  but in the meantime it could have changed to cooling or deleted,
  so vbp_task_complete() handles these cases and either transitions to cold
  or deletes the probe
- if the task cannot be scheduled, the same handling happens

We now also remove running probes from the binheap to remove complexity.

Fixes varnishcache#4108 for good
This background thread does not run the actual probes, but schedules tasks which
do (vbp_task).

Rename suggested by phk.
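
Purely to make that thread/task split concrete, here is a toy dispatcher in plain pthreads; every name in it is made up for this illustration and it is not the Varnish implementation, which pops due probes off a binheap and starts a task for each:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static char backend_name[] = "backend1";

/* stands in for vbp_task(): runs one probe and returns */
static void *
probe_task(void *priv)
{
	printf("probing %s\n", (char *)priv);
	return (NULL);
}

/* stands in for vbp_thread(): only schedules work, never probes itself */
static void *
probe_dispatcher(void *priv)
{
	pthread_t task;
	int i;

	(void)priv;
	for (i = 0; i < 3; i++) {
		sleep(1);	/* the real code waits for the next due time */
		pthread_create(&task, NULL, probe_task, backend_name);
		pthread_detach(task);
	}
	return (NULL);
}

int
main(void)
{
	pthread_t disp;

	pthread_create(&disp, NULL, probe_dispatcher, NULL);
	pthread_join(disp, NULL);
	sleep(1);	/* give the last detached task time to finish */
	return (0);
}
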
@nigoroll merged commit fc2445b into varnishcache:master on Jul 15, 2024
1 of 11 checks passed
@nigoroll deleted the vbp_rework branch on July 15, 2024 18:28
@delthas (Contributor) commented Sep 25, 2024

FWIW, I'm (rarely) getting the following on reload (trying to create a MWE, but it's difficult):

Wrong turn at cache/cache_backend_probe.c:710:
running
Backtrace:
  0x560448ae1412: varnishd(VBT_format+0x72) [0x560448ae1412]
  0x560448a5f3e1: varnishd(pan_ic+545) [0x560448a5f3e1]
  0x560448ae0965: varnishd(VAS_Fail+0x45) [0x560448ae0965]
  0x560448a34b25: varnishd(VBP_Control+583) [0x560448a34b25]
  0x560448a6edab: varnishd(vcl_BackendEvent+107) [0x560448a6edab]
  0x560448a7034d: varnishd(vcl_set_state+589) [0x560448a7034d]
  0x560448a713b9: varnishd(vcl_cli_load+315) [0x560448a713b9]
  0x560448ae31b5: varnishd(cls_exec+629) [0x560448ae31b5]
  0x560448ae37a3: varnishd(VCLS_Poll+0x333) [0x560448ae37a3]
  0x560448a3aec4: varnishd(CLI_Run+0x64) [0x560448a3aec4]
  0x560448a5a3aa: varnishd(child_main+0x22a) [0x560448a5a3aa]
  0x560448aa4cfd: varnishd(mgt_launch_child+1229) [0x560448aa4cfd]
  0x560448aa5cab: varnishd(MCH_Start_Child+0xb) [0x560448aa5cab]
  0x560448a2ce26: varnishd(main+0x2716) [0x560448a2ce26]
  0x7f3d07682b45: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f3d07682b45]
  0x560448a2cf48: varnishd(_start+41) [0x560448a2cf48]

Also got this once (same backtrace, but the state this time is scheduled):

Wrong turn at cache/cache_backend_probe.c:710:
scheduled
Backtrace:
0x55f4660e1412: /usr/sbin/varnishd(VBT_format+0x72) [0x55f4660e1412]
0x55f46605f3e1: /usr/sbin/varnishd(pan_ic+545) [0x55f46605f3e1]
0x55f4660e0965: /usr/sbin/varnishd(VAS_Fail+0x45) [0x55f4660e0965]
0x55f466034b25: /usr/sbin/varnishd(VBP_Control+583) [0x55f466034b25]
0x55f46606edab: /usr/sbin/varnishd(vcl_BackendEvent+107) [0x55f46606edab]
0x55f46607034d: /usr/sbin/varnishd(vcl_set_state+589) [0x55f46607034d]
0x55f4660713b9: /usr/sbin/varnishd(vcl_cli_load+315) [0x55f4660713b9]
0x55f4660e31b5: /usr/sbin/varnishd(cls_exec+629) [0x55f4660e31b5]
0x55f4660e37a3: /usr/sbin/varnishd(VCLS_Poll+0x333) [0x55f4660e37a3]
0x55f46603aec4: /usr/sbin/varnishd(CLI_Run+0x64) [0x55f46603aec4]
0x55f46605a3aa: /usr/sbin/varnishd(child_main+0x22a) [0x55f46605a3aa]
0x55f4660a4cfd: /usr/sbin/varnishd(mgt_launch_child+1229) [0x55f4660a4cfd]
0x55f4660a5cab: /usr/sbin/varnishd(MCH_Start_Child+0xb) [0x55f4660a5cab]
0x55f46602ce26: /usr/sbin/varnishd(main+0x2716) [0x55f46602ce26]
0x7fc59f7f0b45: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fc59f7f0b45]
0x55f46602cf48: /usr/sbin/varnishd(+0x2cf48) [0x55f46602cf48]

(That is, the WRONG(vt->state->name) panic in VBP_Control() fires with the state being running or scheduled, on enable.)

Edit: finally managed to track it down to a MWE. See #4199

Successfully merging this pull request may close these issues.

Assert error in VBP_Control(): Condition(vt->heap_idx == VBH_NOIDX) not true.