
Pods crashing and disconnections during resync #579

Open
Ulrar opened this issue Dec 13, 2023 · 8 comments

Comments

@Ulrar

Ulrar commented Dec 13, 2023

Hi,

Apologies if this doesn't belong here; I hesitated between opening this here and on the LINSTOR side, but I'm not sure where the problem lies. Since the pods keep crashing, I'm trying here first. I'm dumping as many details as I can, no idea what's relevant.

I have a 3-node cluster (Intel NUC 5th, 7th and 8th gen) running the latest stable Talos (1.5.5) / DRBD (9.2.4), and using this operator (https://github.com/piraeusdatastore/piraeus-operator//config/default?ref=v2, deployed with Argo CD). It works fine when all nodes are up and synced.

But as soon as a sync is needed for any reason (for example a node reboot), the connections start breaking and the pods start crashing over and over again. I've also seen weird behaviors like piraeusdatastore/piraeus#162 which I suspect are related.

In linstor resource l I see a lot of BrokenPipe, Unconnected, Connecting and eventually StandAlone when it happens. Looking at the pods, the controller and the gencert pods keep crashing and restarting, with the controller-manager and CSI pods sometimes joining in, although subjectively not as often. I've seen nodes completely crash and reboot a couple of times (not the same one every time).
I also get kubectl hanging for a few seconds and then reporting "leader changed" a lot, so getting information out during the issue is tough.

I've found that if I wait until a good number of volumes go StandAlone, or if I manage to exec into the pods and run drbdadm disconnect .., I can then sync them one by one fairly reliably. Maybe that's just luck, though.
I have also played around with placementCount: 2 (plus a TieBreaker, when LINSTOR feels like creating one, which is not always) is much easier to get to a synced state than a placementCount of 3. It's almost impossible to get all 3 nodes synced, but once they are, it's as stable as with 2.
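
For context, a minimal sketch of the kind of StorageClass I mean; the pool name is a placeholder, and the parameters are the standard LINSTOR CSI ones as far as I understand them:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: piraeus-2-replicas
provisioner: linstor.csi.linbit.com
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
parameters:
  linstor.csi.linbit.com/storagePool: pool1      # placeholder pool name
  linstor.csi.linbit.com/placementCount: "2"     # 2 replicas; LINSTOR may add a TieBreaker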

I have kube-prometheus running, so I tried going through the metrics, but I don't see anything weird. That said, the nodes quickly lose quorum and prevent scheduling, so the metrics may not be complete.

Looking at dmesg (as reported by the Talos API anyway) I don't see anything obvious, but I don't know what I'm looking for.
There are a lot of "Rejecting concurrent remote state change 1129332041 because of state change 137936095" messages during the issue, but I presume that's to be expected given the behavior.

They're all connected to one switch (a UDM-Pro's front ports) with a gigabit link each.
I did run iperf3 between all three nodes to make sure, and I get a good stable ~900 Mbit/s between them, with no DRBD disconnections while testing.
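
Nothing fancy for that test, just plain iperf3 between each pair of nodes; the IP here is a placeholder:

iperf3 -s                      # on one node
iperf3 -c 192.168.1.10 -t 30   # on each of the other nodes in turn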

In case it matters, all nodes have a physical interface plus a VLAN interface on top of it. The nodeIP's valid subnet is set on the physical interface; the VLAN is only there for one pod and MetalLB. I don't believe DRBD should be using it, but it should work regardless, there are no restrictions on either.

Any ideas or suggestions for what I could tweak, or how to figure out what's wrong?
Since syncing one at a time seems to work, I wonder if limiting the sync rate might help?

Thank you

@WanzenBug
Member

My first idea was just the resync saturating the network. I believe the default c-max-rate (i.e. max resync speed) is around 100MB/s, so if more than 9 resources need to sync a lot of data at the same time you might be running into the network limits. That could then also affect the Kubernetes API, which would then lead to the "leader changed" messages.

But then you talk about node crashes, which would indicate a deeper underlying issue. There have been a couple of fixes to DRBD since 9.2.4, so maybe the best thing to try would be to upgrade DRBD. It would also be interesting to see the kernel log from a crashed node. Usually you get some kind of message on the console about what caused the kernel to panic.
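
On Talos you should be able to pull the kernel log with something like this (a sketch using the standard talosctl tooling; the node IP is a placeholder):

talosctl -n 192.168.1.10 dmesg             # dump the current kernel ring buffer
talosctl -n 192.168.1.10 dmesg --follow    # stream it while reproducing the issue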

@Ulrar
Author

Ulrar commented Dec 14, 2023

Hi,

Unfortunately that's the latest DRBD version available; it's tied to the Talos version, and I believe they don't release minor updates unless there's a good reason. I can always ask, you never know I suppose.

Is there a way for me to change c-max-rate? Is that an option I could set on the storage class, maybe?
I have 5 volumes (times three nodes).

I'll see if I can extract some logs next time I see a node crash; I could have missed it, since getting logs out of those isn't that straightforward.

Thanks

@WanzenBug
Member

Two settings you can play with:

apiVersion: piraeus.io/v1
kind: LinstorCluster
metadata:
  name: linstorcluster
spec:
  properties:
  - name: DrbdOptions/PeerDevice/c-max-rate
    value: "102400" # I believe the unit is kilobits/s here
  - name: DrbdOptions/auto-resync-after-disable
    value: "false" # This will set "resync-after" options on all resource, so only one resource begins syncing at a time

@Ulrar
Author

Ulrar commented Dec 16, 2023

Hi,

auto-resync-after-disable seems like exactly what I'd need, assuming the issue is indeed what we think here. I tried setting it on the LinstorCluster resource as above, but that didn't seem to do anything on the resources when looking at their definition.

Also, kubectl linstor controller drbd-options returns nothing, and looking at -h I don't see auto-resync-after-disable as an option. Is there anything I need to do after changing the LinstorCluster resource to actually apply it to existing resources?

Thanks !

EDIT: I've tried causing a disconnect (moved the ethernet cable of a node from one port to another) and the problem isn't solved. As usual I struggled to log in to one of the pods and run a disconnect against all the resources to then sync them one by one, fighting with the operator which keeps trying to reconnect all of them.
Although this time, at one point, the connect command told me the sync-after was referring to an unknown resource. That went away fairly quickly, and I still can't see sync-after configured anywhere, so presumably I'm not looking in the correct place, but I guess that setting didn't help, if it's applied at all.

I'll try c-max-rate next, but I doubt the actual syncing is the issue; it doesn't really spend much time doing that anyway. The issue seems to be more about how many resources are trying to connect at the same time. Or maybe how many comparisons it's running? Hard to say.

@WanzenBug
Member

auto-resync-after-disable seems like exactly what I'd need, assuming the issue is indeed what we think here. I tried setting it on the LinstorCluster resource as above, but that didn't seem to do anything on the resources when looking at their definition.

Have you checked directly with drbdadm dump ... from the satellite containers?

I believe linstor controller drbd-options only allows setting options, but does not display them. You can run linstor controller list-properties directly to check the various DrbdOptions/..... properties.
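
For example, something like this (the satellite pod name is a placeholder, I'm assuming the default piraeus-datastore namespace, and the resource name is one of yours):

# dump the generated DRBD config for one resource and look for resync-after
kubectl exec -n piraeus-datastore <satellite-pod> -- drbdadm dump pvc-56924ed3-7815-4655-9536-6b64792182ca | grep resync-after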

@Ulrar
Author

Ulrar commented Dec 18, 2023

Never mind, I was looking at the first device in the chain, so of course it didn't have anything, but the other volumes do have the resync-after. My bad.

So that's unfortunate; not quite sure what else to do at this point. I guess I'll eat the downtime and upgrade to the new Talos 1.6 with the latest DRBD, who knows.

EDIT:

pvc-56924ed3-7815-4655-9536-6b64792182ca role:Secondary
  disk:Inconsistent
  talos-00r-fu9 connection:Connecting
  talos-ozt-z3h role:Primary
    replication:SyncTarget peer-disk:UpToDate done:10.69

[...]

pvc-fbdf5c3c-2d49-49b8-ac10-f8e1212c7788 role:Secondary
  disk:Inconsistent
  talos-00r-fu9 connection:Connecting
  talos-ozt-z3h role:Secondary
    replication:SyncTarget peer-disk:UpToDate done:7.79

It doesn't look like resync-after actually does anything; I have two volumes clearly syncing at the same time there. The last one is the first in the chain, so that's fine, but the top one has a resync-after that was pointing at a still-Inconsistent resource and it went ahead anyway. I guess this only works if the connection is stable enough to get to a sync state; Unconnected / BrokenPipe and those states probably don't count? Sadly, as soon as anything is syncing, all of my resources go into those states all the time.

At least I did find a reliable-ish way of getting back online: I manually disconnect all the resources but one, then let that one go. Then I connect the next one, wait until it finishes, and so on, one by one. When I connect a new one, the previous ones often get disconnected for a second or two, but since they were already synced they just come back right away on the next connection.
It's really something to do with the act of syncing, or maybe with calculating the bitmap, that breaks the connections, I think. I have one big volume that takes a while to sync, and during that time there are no problems at all, so I think the issue isn't actually the link getting saturated; it's something breaking all the connections when a resource tries to come up or calculates a bitmap or something.
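
Roughly what I end up doing from a satellite container, as a sketch (the resource names are placeholders, and since the operator keeps reconnecting things I have to repeat the disconnects in practice):

# manual recovery sketch: disconnect everything, then resync one resource at a time
RESOURCES="pvc-aaaa pvc-bbbb pvc-cccc"   # placeholder resource names

# drop all connections first so nothing fights over the link
for res in $RESOURCES; do
  drbdadm disconnect "$res"
done

# reconnect and wait for each resource to finish before starting the next one
for res in $RESOURCES; do
  drbdadm connect "$res"
  # wait until the local disk is back to UpToDate
  until drbdadm status "$res" | grep -q 'disk:UpToDate'; do
    sleep 5
  done
done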

@Ulrar
Author

Ulrar commented Dec 18, 2023

Aha, I think I caught it!

talos-ozt-z3h: kern: warning: [2023-12-18T10:52:04.592382883Z]: ------------[ cut here ]------------
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:04.594771883Z]: drbd pvc-fbdf5c3c-2d49-49b8-ac10-f8e1212c7788 talos-00r-fu9: meta connection shut down by peer.
talos-ozt-z3h: kern:    info: [2023-12-18T10:52:04.647636883Z]: NETDEV WATCHDOG: enp0s25 (e1000e): transmit queue 0 timed out
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:04.765105883Z]: drbd pvc-1a9e5a5e-fdba-4b8e-ae9f-1a7acd048184 talos-00r-fu9: meta connection shut down by peer.
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:04.845584883Z]: WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x205/0x210
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:04.963069883Z]: drbd pvc-86499a05-3ba9-4722-9bb1-69ae47406263 talos-00r-fu9: meta connection shut down by peer.
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.061011883Z]: Modules linked in: drbd_transport_tcp(O) drbd(O) e1000e
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.061019883Z]: CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O       6.1.67-talos #1
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.061056883Z]: drbd pvc-e57930e5-6772-41e4-8c98-99105b77970a talos-fdm-9ig: meta connection shut down by peer.
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.177711883Z]: Hardware name: Intel NUC5i5MYBE/NUC5i5MYBE, BIOS MYBDWi5v.86A.0059.2022.0706.2207 07/06/2022
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.181877883Z]: drbd pvc-c7bdfa9e-e3c2-4dd3-ac9c-b7b2e847d30b talos-fdm-9ig: meta connection shut down by peer.
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.252795883Z]: RIP: 0010:dev_watchdog+0x205/0x210
talos-ozt-z3h: kern:  notice: [2023-12-18T10:52:05.351639883Z]: drbd pvc-e57930e5-6772-41e4-8c98-99105b77970a tcp:talos-00r-fu9: Closing unexpected connection from 10.244.2.2
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.351699883Z]: drbd pvc-86499a05-3ba9-4722-9bb1-69ae47406263 talos-fdm-9ig: meta connection shut down by peer.
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.464227883Z]: Code: 00 e9 42 ff ff ff 48 89 df c6 05 dd 7d fd 01 01 e8 10 46 fa ff 44 89 e9 48 89 de 48 c7 c7 b8 ed cc 8b 48 89 c2 e8 6b 19 de fe <0f> 0b e9 24 ff ff ff 0f 1f 40 00 0f 1f 44 00 00 55 53 48 89 fb 48
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.464231883Z]: RSP: 0018:ffff9e6dc0100e70 EFLAGS: 00010282
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.464234883Z]: RAX: 0000000000000000 RBX: ffff906413c00000 RCX: 0000000000000000
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.464235883Z]: RDX: 0000000000000104 RSI: ffffffff8bb76d4f RDI: 00000000ffffffff
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.464236883Z]: RBP: ffff906413c00488 R08: 3031652820353273 R09: 30706e65203a474f
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.464237883Z]: R10: 3a474f4448435441 R11: 572056454454454e R12: ffff906413c003dc
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.464238883Z]: R13: 0000000000000000 R14: ffffffff8abaa400 R15: ffff906413c00488
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.464239883Z]: FS:  0000000000000000(0000) GS:ffff90654dc80000(0000) knlGS:0000000000000000
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.464241883Z]: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.464242883Z]: CR2: 000000c001fa9010 CR3: 00000001224a8002 CR4: 00000000003706e0
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.464243883Z]: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.464244883Z]: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.464245883Z]: Call Trace:
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.464247883Z]:  <IRQ>
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.464248883Z]:  ? dev_watchdog+0x205/0x210
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.464252883Z]:  ? __warn+0x79/0xc0
talos-ozt-z3h: kern:    info: [2023-12-18T10:52:05.868767883Z]: drbd pvc-1a9e5a5e-fdba-4b8e-ae9f-1a7acd048184 talos-fdm-9ig: Handshake to peer 3 successful: Agreed network protocol version 122
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.879700883Z]:  ? dev_watchdog+0x205/0x210
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.879708883Z]:  ? report_bug+0xe6/0x170
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.879712883Z]:  ? __irq_work_queue_local+0x39/0x80
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:05.879717883Z]:  ? handle_bug+0x3c/0x80
talos-ozt-z3h: kern:    info: [2023-12-18T10:52:05.996311883Z]: drbd pvc-1a9e5a5e-fdba-4b8e-ae9f-1a7acd048184 talos-fdm-9ig: Feature flags enabled on protocol level: 0x7f TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES RESYNC_DAGTAG
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:06.221030883Z]:  ? exc_invalid_op+0x13/0x60
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:06.221039883Z]:  ? asm_exc_invalid_op+0x16/0x20
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:06.221044883Z]:  ? pfifo_fast_reset+0x140/0x140
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:06.221050883Z]:  ? dev_watchdog+0x205/0x210
talos-ozt-z3h: kern:    info: [2023-12-18T10:52:06.221122883Z]: drbd pvc-e57930e5-6772-41e4-8c98-99105b77970a talos-00r-fu9: Handshake to peer 0 successful: Agreed network protocol version 122
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:06.283642883Z]:  ? pfifo_fast_reset+0x140/0x140
talos-ozt-z3h: kern:    info: [2023-12-18T10:52:06.324761883Z]: drbd pvc-c7bdfa9e-e3c2-4dd3-ac9c-b7b2e847d30b talos-00r-fu9: Handshake to peer 0 successful: Agreed network protocol version 122
talos-ozt-z3h: kern:    info: [2023-12-18T10:52:06.324766883Z]: drbd pvc-c7bdfa9e-e3c2-4dd3-ac9c-b7b2e847d30b talos-00r-fu9: Feature flags enabled on protocol level: 0x7f TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES RESYNC_DAGTAG
talos-ozt-z3h: kern:    info: [2023-12-18T10:52:06.369023883Z]: drbd pvc-e57930e5-6772-41e4-8c98-99105b77970a talos-00r-fu9: Feature flags enabled on protocol level: 0x7f TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES RESYNC_DAGTAG
talos-ozt-z3h: kern:    info: [2023-12-18T10:52:06.388716883Z]: drbd pvc-1a9e5a5e-fdba-4b8e-ae9f-1a7acd048184: Preparing cluster-wide state change 254920882 (2->3 499/146)
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:06.454404883Z]:  call_timer_fn+0x27/0x130
talos-ozt-z3h: kern:    info: [2023-12-18T10:52:06.818035883Z]: drbd pvc-c7bdfa9e-e3c2-4dd3-ac9c-b7b2e847d30b talos-00r-fu9: Preparing remote state change 43952240
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:06.876126883Z]:  __run_timers+0x221/0x2b0
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:06.876138883Z]:  run_timer_softirq+0x3b/0x80
talos-ozt-z3h: kern:    info: [2023-12-18T10:52:07.388671883Z]: drbd pvc-e57930e5-6772-41e4-8c98-99105b77970a talos-00r-fu9: Preparing remote state change 400215562
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:07.419838883Z]:  __do_softirq+0xf0/0x2fe
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:07.419848883Z]:  __irq_exit_rcu+0xab/0x110
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:09.297355883Z]:  sysvec_apic_timer_interrupt+0x9e/0xc0
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:09.354668883Z]:  </IRQ>
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:09.354670883Z]:  <TASK>
talos-ozt-z3h: kern: warning: [2023-12-18T10:52:09.354672883Z]:  asm_sysvec_apic_timer_interrupt+0x16/0x20

This happened during the resync after upgrading Talos to 1.6.0 and DRBD to the latest 9.2.6. It then of course went down and rebooted.

@WanzenBug
Member

I suggest you open an issue over at https://github.com/LINBIT/drbd, ideally with the full kernel log.
