Pods crashing and disconnections during resync #579
First idea was just the resync saturating the network; I believe the default resync settings should already keep that in check. But then you talk about node crashes, which would indicate a deeper underlying issue.

There have been a couple of fixes to DRBD since 9.2.4, so maybe the best thing to try would be to upgrade DRBD.

It would also be interesting to see the kernel log from a crashed node. Usually you get some kind of message on the console about what caused the kernel to panic.
Hi,

Unfortunately that's the latest DRBD version available; it's tied to the Talos version and I believe they don't release minor updates unless there's a good reason. I can always ask, you never know I suppose.

Is there a way for me to change the c-max-rate? Is that an option I could set on the storage class, maybe?

I'll see if I can extract some logs next time I see a node crash, I could have missed it since getting logs out of those isn't that straightforward.

Thanks
Two settings you can play with:

```yaml
apiVersion: piraeus.io/v1
kind: LinstorCluster
metadata:
  name: linstorcluster
spec:
  properties:
    - name: DrbdOptions/PeerDevice/c-max-rate
      value: "102400" # I believe the unit is kilobits/s here
    - name: DrbdOptions/auto-resync-after-disable
      value: "false" # This will set "resync-after" options on all resources, so only one resource begins syncing at a time
```
Hi,

auto-resync-after-disable seems like exactly what I'd need, assuming the issue is indeed what we think here. I tried setting it on the LinstorCluster resource as above, but that didn't seem to do anything on the resources when looking at their definition. Also, looking at the resources in linstor, I don't see any resync-after option being applied.

Thanks!

EDIT: I've tried causing a disconnect (moved the ethernet cable of a node from one port to another) and it's not solved. As usual I struggled to log in to one of the pods and run a disconnect against all the resources to then sync them one by one, fighting with the operator which tries to re-connect all of them all the time.

I'll try c-max-rate next, but I doubt the actual syncing is the issue; it doesn't really spend much time doing that anyway. The issue seems to be more how many resources are trying to connect at the same time. Or maybe how many comparisons it's running? Hard to say.
Have you checked directly with drbdsetup show on the node? I believe the resync-after option should show up there if it was applied.
Nevermind, I was looking at the first device in the chain, so of course it didn't have anything, but the other volumes do have the resync-after, my bad. So that's unfortunate, not quite sure what else to do at this point. I guess I'll eat the downtime and upgrade to the new Talos 1.6 with the latest drbd, who knows.

EDIT: It doesn't look like resync-after actually does anything, I have two volumes clearly going at the same time there. The last one is the first in the chain so that's fine, but that top one has a resync-after pointing at an inconsistent resource and went ahead anyway. I guess this only works if the connection is stable enough to get to a sync state; unconnected / broken pipe and those states probably don't count? Sadly, as soon as anything is syncing, all of my resources go into these states all the time.

At least I did find a reliable-ish way of getting back online: I manually go and disconnect all the resources but one, then let that one go. Then I connect the next one, wait until it finishes, and so on, one by one. When I connect a new one the previous ones often get disconnected for a second or two, but since they were synced already they just come back right away on the next connection.
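For reference, the manual recovery boils down to roughly this (a sketch run from a node with the DRBD tools; the pvc-* resource names are placeholders, and as mentioned the operator tends to fight the disconnects):

```shell
# Disconnect every resource except the one we want to let sync first
# (pvc-bbbb, pvc-cccc, pvc-dddd are placeholder resource names)
for res in pvc-bbbb pvc-cccc pvc-dddd; do
  drbdadm disconnect "$res"
done

# Watch the remaining resource until it reports UpToDate
watch drbdadm status pvc-aaaa

# Then bring the others back one at a time, waiting for each to finish syncing
drbdadm connect pvc-bbbb
watch drbdadm status pvc-bbbb
```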
Aha, I think I caught it!

This happened during the resync after upgrading Talos to 1.6.0 and DRBD to the latest 9.2.6. It then of course went down and rebooted.
I suggest you open an issue over at https://github.com/LINBIT/drbd, ideally with the full kernel log.
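In case it helps with that, a rough sketch of pulling the kernel log off a Talos node with talosctl (the node address is a placeholder):

```shell
# Dump the kernel ring buffer from the affected node and save it to a file;
# 10.0.0.11 is a placeholder node address.
talosctl -n 10.0.0.11 dmesg > node-dmesg.log

# Or keep following it while reproducing the resync, to catch the crash as it happens
talosctl -n 10.0.0.11 dmesg --follow
```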
Hi,
Apologies if this doesn't belong here, I hesitated between this and the linstor side but I'm not sure where the problem lies. Since the pods keep crashing, I'm trying here first. I'm dumping as many details as I can, no idea what's relevant.
I have a 3-node (Intel NUC 5th, 7th and 8th gen) cluster running the latest stable Talos (1.5.5) / DRBD (9.2.4), and using this operator (https://github.com/piraeusdatastore/piraeus-operator//config/default?ref=v2, deployed using argocd). It works fine when all nodes are up and synced.
But as soon as a sync is needed for any reason (for example a node reboot), the connections start breaking and the pods start crashing over and over again. I've also seen weird behaviors like piraeusdatastore/piraeus#162 which I suspect are related.
In `linstor resource l` I see a lot of BrokenPipe, Unconnected, Connecting and eventually StandAlone when it happens. When looking at the pods, the controller and the gencert pods keep crashing and re-starting, with sometimes the controller-manager and the csi pods joining in, although subjectively not as much. I've seen nodes completely crash and reboot a couple of times (not the same one every time). I also get `kubectl` hanging for a few seconds then saying `leader changed` a lot, so getting information out during the issue is tough.

I've found that if I wait until a good number of volumes go StandAlone, or if I manage to exec into the pods to run `drbdadm disconnect ..`, I can then sync them one by one fairly reliably. Maybe that's just luck, though.

I have also played around with the placementCount, and 2 (+ a TieBreaker, when linstor feels like creating one, which is not always) seems to be much easier to get to a synced state than a placementCount of 3. It's almost impossible to get all 3 nodes synced, but once they are it's as stable as when using 2.
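For reference, the placementCount I'm playing with is the one on the StorageClass, roughly like this (the class and storage pool names here are just examples):

```yaml
# Example StorageClass with 2 replicas; the names and the storagePool
# parameter are illustrative, only placementCount is the setting in question.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-2-replicas
provisioner: linstor.csi.linbit.com
allowVolumeExpansion: true
parameters:
  linstor.csi.linbit.com/placementCount: "2"
  linstor.csi.linbit.com/storagePool: "pool1"
```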
I have kube-prometheus running, so I tried going through the metrics but I don't see anything weird; that said, the nodes quickly lose quorum and prevent scheduling, so the metrics may not be complete.
Looking at the dmesg (as reported by the talos api anyway) I don't see anything obvious, but I don't know what I'm looking for.
A lot of `Rejecting concurrent remote state change 1129332041 because of state change 137936095` messages during the issue, but I presume that's to be expected given the behavior.

They're all connected to one switch (a UDM-Pro's front ports) with a gigabit link each.
I did run iperf3 between all three nodes to make sure and I do get a good stable ~900 Mb/s between them, and no DRBD disconnections when testing it.
In case that matters, all nodes have a physical interface plus a vlan interface over it. The `nodeIP`'s valid subnets are set on the physical interface; the vlan is just there for one pod and metallb. I don't believe drbd should be using it, but it should work regardless, there are no restrictions on either.

Any ideas or suggestions of what I could tweak, or how to figure out what's wrong?
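If it's useful, here is how I've been checking which addresses LINSTOR registered and which path DRBD actually replicates over (node and resource names are placeholders):

```shell
# Show the network interfaces LINSTOR knows about for a node;
# "nuc8" is a placeholder node name.
linstor node interface list nuc8

# The peer addresses in the generated DRBD config confirm which interface is used;
# "pvc-xxxx" is a placeholder resource name.
drbdsetup show pvc-xxxx | grep address
```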
Since syncing one at a time seems to work, I wonder if limiting the sync rate might help?
Thank you