Why do endpoint tcp connections close? #501
Comments
I left "ip mptcp monitor" running on the client that was initiating the connection to the remote server and waited until I found an rsync that had dropped a subflow:
In this case it dropped the 10.11.21.251 -> 10.29.21.251:873 subflow sometime in the 6 hours since it started. The backup subflow is still alive but it's not actually transferring any data yet. Weirdly, the mptcp connection still reports that there are 2 subflows? Now if we look at all the output from mptcp monitor, this is all I captured:
So there is the initial connection but no further reporting on any subflow disconnects. Maybe I also need to capture data from mptcp monitor on the remote server? If I look at the other end and the rsyncd process running on the server, it does at least show that there is only one active subflow:
But also the active "backup" subflow reported on the client is not present on the server... I'll try running mptcp monitor on the server and catch one of these going bad. I know it was mentioned before that I shouldn't really be running "signal" subflows on both client and server, but we wanted these two hosts to be able to act as both server & client and independently initiate rsync transfers to each other. I'm not sure if this behaviour is a consequence of that? A quick refresher of the config on each server/client pair (an illustrative sketch also follows below):
One thing is certain: all mptcp rsyncs are always initiated with fully working subflows, and then a subset of those will drop one or more subflows after many hours. Of the subflows dropped, sometimes it's the "idle" backup subflow, and after that it's almost always the "first" signal endpoint rather than the second.
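For reference, the endpoint layout on each host is roughly along these lines. This is an illustrative sketch rather than a verbatim dump: only the 10.17.21.251/ens224 signal endpoint is spelled out elsewhere in the thread, the ens192 backup and ens256 second-signal roles come from the issue description, and the other addresses, IDs and the limits values are placeholders.

```
# Illustrative per-host MPTCP endpoint layout (addresses and IDs marked
# "placeholder" are not taken from the thread).
ip mptcp limits set subflow 2 add_addr_accepted 2            # placeholder limits

# Two "signal" endpoints, one per ISP link, carrying the bulk transfer subflows
ip mptcp endpoint add 10.17.21.251 id 2 signal dev ens224
ip mptcp endpoint add 192.0.2.1    id 3 signal dev ens256    # placeholder address

# A "backup" endpoint that should only carry data if the signal subflows fail
ip mptcp endpoint add 198.51.100.1 id 1 signal backup dev ens192   # placeholder address/id
```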
Okay, this is not the connection example above, but if I run mptcp monitor on the "server", I do see things like this from time to time:
Now this connection looks like this on the server:
And like this on the client that started the transfer:
Again, there was nothing useful for this subflow drop captured in the mptcp monitor output on the client. I see there were some patches recently to add a "reset reason" for mptcp, so maybe this kind of scenario is a common issue and is proving hard to debug?
AFAICS there are at least 2 separate issues running here:
Finally, the in-kernel path manager does not try to re-instate closed subflows. This last item is just the PM behaviour; it can be changed. A few questions:
I have just rebooted the client & server (upgrading EL 8.7 -> 8.10) and so here is the nstat output for each after almost 2 hours of uptime (and subflows have already disconnected):
The client has many connections to other servers active too, hence the numbers don't match up. Also worth noting that we are using BBR and large TCP buffers (128M+). So far the reset_reason is always 1, but I have also increased the logging coverage now so I'll know more soon. I also suspect that some of these subflows are ending without any output in mptcp monitor on either the client or server... again, I need a bit more time to verify this. It doesn't help that the subflow silently fails on the client, then I have to go to the server, figure out the same transfer and then search for its token in the server's mptcp monitor output. I need to figure out a better way to automate this (see the sketch below). I also notice this error=111 quite a bit on the server too:
It always seems to be the first subflow but then resolves on a retry? Probably not related to the subflow disconnects we then see later, but interesting that it's also always the first subflow. I am more than happy to try patches and help debug further! The fact that the first subflow is more likely to disconnect than the second might be due to the first subflow/ISP pair having a lower bandwidth and being more likely to frequently max out, but that is just conjecture at this point.
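For what it's worth, the kind of logging loop I have in mind is roughly this. It's only a sketch: the log directory and the 60-second interval are arbitrary, and it just wraps the same ip mptcp monitor and nstat commands used above so events can later be matched by token and timestamp against the other host.

```
#!/bin/bash
# Capture timestamped mptcp monitor events plus periodic MPTCP MIB counters,
# so a subflow drop seen on one host can be matched against the other host.
LOGDIR=/var/log/mptcp-debug
mkdir -p "$LOGDIR"

# 1) every mptcp monitor event, prefixed with an ISO timestamp
stdbuf -oL ip mptcp monitor | while read -r line; do
    echo "$(date -Is) $line"
done >> "$LOGDIR/mptcp-monitor.log" &

# 2) a snapshot of the MPTCP counters once a minute
while true; do
    { echo "=== $(date -Is)"; nstat -a | grep -i mptcp; } >> "$LOGDIR/nstat.log"
    sleep 60
done
```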
Okay, here's an example where I definitely captured all logs from the moment of connection (when all subflows started out correctly), and the current running state where the second subflow endpoint has disconnected. First on the client:
And then at the other end on the server:
So a tcp reset is definitely reported in this case. I even found a (rare) example where both signal subflows had gone and it was only running on the "backup" endpoint (from the server):
I'm still waiting to find a case where the subflow was disconnected but there was no corresponding "reset_reason=1". Maybe I imagined it...
Okay, here's an example where we don't see a reset_reason but subflows are still missing... First on the client we see normal-looking connection steps in the mptcp monitor output, but it looks like only the backup subflow still persists:
And then the other end on the server has some rather odd-looking connections and ports in the mptcp monitor output:
In this case, it seems like the 2 subflow endpoints never connected properly from the beginning and so it just ran on the backup subflow. So here we didn't have to wait for subflows to end after a period of optimal activity (e.g. reset_reason); they were gone from the beginning. This is potentially a different issue that just looks similar from afar?
So here is an example of a client/server connection that still has all its subflows active and correct. Client:
And then on the server:
Even then, I don't fully understand where the SF_CLOSED entries are coming from. Is this a consequence of both server & client having signal endpoints, with the server or client announcing subflows that are then not used and closed?
I also tried removing an endpoint and re-adding it to see if that helped trigger mptcp connections with missing subflows to reconnect - all it did was stop all remaining active transfers using that endpoint. Only new connections started using both signal endpoints again. I think if we had a configurable way to retry closed subflows, that would at least help work around whatever our underlying reason is for the high rate of subflow failures (over multi-hour timescales anyway). I expect it's something you would want for bulk transfers that egress on normally stable connections, but not for things like 4G or wifi connections that might go through prolonged periods of instability? If there is any other useful debugging that you can suggest, then just let me know.
Hello, I'm not sure if I followed everything, but when you remove/add endpoints, can you try to use different IDs from the previous ones?
So I tried with different IDs but still couldn't get the existing mptcpize rsync transfers to pick up and use the new endpoints. Again, a reminder of what a client/server host looks like:
And then just change IPs for the reverse (serverB) configuration at the other end.
So first I tried removing and re-adding the endpoint on the client (that initiated the rsync):
Any transfers that were still using the id 2 endpoint promptly stopped. Any that had long before lost that endpoint subflow remained the same and did not try to start a new one. Then I tried the same on the server, removing id 2 and re-adding as id 4. This didn't seem to change anything either. Then for good measure, I dropped id 4 on the client again and re-added it as id 5 - still no change. New connections used the new endpoints just fine but existing transfers did not. Again, I don't know if this is because we are using "signal" on both server and client (because they can both initiate transfers between each other).
@daire-byrne: likely there is no subflow [re-]creation attempt when you delete and re-add the endpoint because the mptcp connection has already reached the 'signal_max' limit. More on this later[*]. AFAICS most (all?) subflow creation failures are for subflows created by the server towards the client after receiving an endpoint announce from the client. That is expected, as the client is not listening to accept incoming connections. I think you can address the latter issue by explicitly adding a port number on the signal endpoint definition, something like: ip mptcp endpoint add 10.17.21.251 id 2 signal dev ens224 12345 (see the sketch below). [*] The mptcp connection hits the 'signal_max' limit as:
I think [**] is a bug and should be fixable, if I could find some time... (the latter part could be difficult)
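Spelled out with iproute2's explicit port keyword, the suggested endpoint definition would presumably look something like this (a sketch only; the id, address, device and port number are just the example values from the comment above):

```
# Re-create the signal endpoint with an explicit port, so the peer can
# actively connect back to the announced address:port after receiving the
# ADD_ADDR (the PM sets up an in-kernel listener for it).
ip mptcp endpoint delete id 2
ip mptcp endpoint add 10.17.21.251 id 2 signal dev ens224 port 12345

# The same would apply to the second signal endpoint on the other ISP link.
```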
Okay, well I was curious about setting the port on the signal endpoints, but I get a couple of slightly odd extra endpoints:
I'm not really sure where the id 4 & 5 endpoints are coming from or if that is expected? The transfers themselves seem to be proceeding fine as expected. Anyway, it sounds like the subflow creation errors on first connect are a cosmetic issue. So the two main issues I have are the dropped subflow connections after some period of time, and the fact that I have no way to remediate them while they are active. It looks like you have a possible patch for re-adding the endpoints to recreate subflows? If I could get that to work, I might just end up cycling through endpoints every hour or so to keep both subflows transferring data.
Such an issue should be addressed by this series: https://lore.kernel.org/mptcp/[email protected]/T/#t With that, deleting and re-adding a signal endpoint should lead to re-creation of the related subflow.
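With that series applied, the delete/re-add workaround discussed above would presumably be along these lines on the affected host (a sketch; the id numbers, address and device are the example values used earlier in the thread):

```
# Remove the endpoint whose subflow was lost and re-add the same address
# under a new id; with the series above, the PM should then attempt to
# re-create the corresponding subflow for existing MPTCP connections.
ip mptcp endpoint delete id 2
ip mptcp endpoint add 10.17.21.251 id 4 signal dev ens224

# Check the result
ip mptcp endpoint show
ss -neimMtp    # per-connection subflow state, as used earlier in the thread
```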
rawflags 0x10 means those are 'implicit' endpoints: created by the PM to allow manipulation of subflows created in response to an incoming ADD_ADDR, when the source IP does not map to any known endpoint. With the configuration specified above, implicit endpoints are created because the PM takes the endpoint port number into account as well when identifying it. If you upgrade your iproute2 to version v6.5.0 or later, you should get a more explicit output for the endpoint dump.
Okay, thanks for the explanation. I gave the patch a quick try and can confirm that removing an endpoint id on the "server" and re-adding it with a different id does now (re)create a new tcp subflow. Of course, I hadn't completely thought everything through, and now my problem is that because I have two servers that can also be clients of each other, if I remove + add endpoints to fix clients connecting in, I also break the client connections running on that server which are connecting out (if that makes sense). I saw a recent patch from @matttbe around having both "signal" and "subflow" defined on each endpoint but I'm not sure if that helps with my bi-directional setup (I will test it). But if I make the client explicitly use two "subflow" endpoints and del/add them with different ids, it does not re-add the subflow. So it seems like it's only possible to have the server side re-establish subflows, but on the clients there is no way to do it? I guess mptcp is not really meant for the bi-directional case... I might have to give up on the idea and use a server + client host pair in each site instead.
@daire-byrne: I think we should investigate why the subflows close.
Yes, of course, always happy to test and debug. I have been trying to reproduce the issue of closing/reset subflows with synthetic tests like long-running iperfs, but have not managed it yet. For some reason, our mix of production rsync transfers seems to reproduce it within a couple of hours. Out of 10 rsync processes, 2-3 will have lost at least one subflow. The one test I have been avoiding, but probably need to do, is to not use mptcp at all and see if any normal tcp rsync transfers fail. I have double-checked the underlying networking between hosts and cannot find any other potential causes for the tcp subflow resets.
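For the record, the synthetic attempts were along these lines (a sketch; the peer address is just the server address quoted earlier, and the stream count and duration are arbitrary):

```
# Long-running bulk transfer over MPTCP via mptcpize, trying to mimic the
# production rsync pattern (several parallel multi-hour streams).

# on the server
mptcpize run iperf3 -s

# on the client: 4 parallel streams for 12 hours
mptcpize run iperf3 -c 10.29.21.251 -P 4 -t 43200
```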
Looking at the MIB stats you shared above, there are a bunch of MPTcpExtDSSNoMatchTCP events on the client side. That happens when a subflow observes an invalid MPTCP data sequence mapping and, for MPJ subflows, triggers a reset with reset reason == 1. MPTcpExtDSSNoMatchTCP should only be caused by a buggy implementation, so apparently we still have some corner cases to fix in DSS handling.
I've been checking some of the reset_reason=1 host pairs just now and I don't always see the MPTcpExtDSSNoMatchTCP counts in nstat -a increase. But I will keep a close eye on this now and see if I can see a pattern. I think I still see reset_reason=1 examples without a corresponding MPTcpExtDSSNoMatchTCP.
So after looking at this for the last 24 hours, the MPTcpExtDSSNoMatchTCP count increased by ~25 on the client and there were plenty of reset_reason=1 on the servers. I don't yet have a good script to definitively correlate the two things (I need to correlate across 7 server/client pairs), but I'm working on it (a sketch follows below). It would be good to say definitively that every reset_reason=1 on the servers matches an MPTcpExtDSSNoMatchTCP on the client. Of course, even if I can say that all these disconnections are due to DSS, I'm not sure what the next steps would be. I'm pretty confident that in at least one of the failing paths there are no middleboxes that could be interfering with the option. In terms of the hardware, the hosts are all VMware VMs using vmxnet3.
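The correlation I'm aiming for is roughly this (a rough sketch, assuming the monitor logs contain the token= and reset_reason= fields shown in the earlier examples, and reusing the placeholder log paths from the logging loop above):

```
# On each server: collect the tokens of subflows that were reset with
# reset_reason=1, then copy the list over to the matching client.
grep 'reset_reason=1' /var/log/mptcp-debug/mptcp-monitor.log |
    grep -o 'token=[0-9a-f]*' | sort -u > /tmp/reset-tokens.txt

# On the client: find the same connections in its monitor log, and check
# whether MPTcpExtDSSNoMatchTCP moved around the same timestamps.
grep -F -f /tmp/reset-tokens.txt /var/log/mptcp-debug/mptcp-monitor.log
grep MPTcpExtDSSNoMatchTCP /var/log/mptcp-debug/nstat.log
```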
Well, I am now definitely seeing the DSS/reset_reason=1 instances lining up. One example:
Slightly odd that both signal subflows died at the same time, which left the transfer running on the backup subflow. Usually one or the other of the signal subflows dies on its own, or at least separated in time. This is a more common example (that also incremented the MPTcpExtDSSNoMatchTCP counter):
We just lost a single subflow in this case. So what is the best way to debug these DSS failures?
Likely not the best method, but the attached patch |
Thanks @pabeni - patch applied and this is what I see on the client when the (id 2) subflow dies:
I do not see any new debug messages on the "server" end. I also never saw any suspicious dmesg entries on either server or client prior to this extra debug logging. Hope that helps...
Weirdly, I also observed another DSS event (on the client) corresponding to a dropped subflow, but this time there was no debug message in dmesg on either the client or server. I will keep monitoring.
I'm unsure about the timing of the relevant events, but the kernel message is rate-limited; if there are a bunch of almost concurrent similar events, the following ones will not be logged. If you have more "Bad mapping" logs, could you please share them? In the above one, the current mapping covers "very recent" past subflow data (e.g. the mapping ends with the last copied bytes).
A bunch of sanity checks in the mptcp code emit only a single splat (WARN_ON_ONCE()). If the host has a long runtime, the only relevant/emitted message could be quite far back in time. I think we can hit this:
and in turn that could cause the issue you observe. @daire-byrne: would you mind including the additional attached patch?
Caught another on the client:
Here I show the output of the nstat -a that I am looping every minute (with a timestamp), and then the corresponding dmesg output. In this case subflow id 2 dropped:
I will apply the additional patch and leave it running overnight.
A bunch of drops over a one-hour period this morning.
These all correspond to the eight DSS occurrences showing in "nstat -a", but none of them seem to be hitting the new debug lines...
A fix was subsequently merged upstream and backported to the stable kernels, referencing this issue:

commit 68cc924729ffcfe90d0383177192030a9aeb2ee4 upstream.

When a subflow receives and discards duplicate data, the mptcp stack assumes that the consumed offset inside the current skb is zero. With multiple subflows receiving data simultaneously, such an assertion does not hold true. As a result the subflow-level copied_seq will be incorrectly increased, and later on the same subflow will observe a bad mapping, leading to a subflow reset.

Address the issue by taking into account the skb consumed offset in mptcp_subflow_discard_data().

Fixes: 04e4cd4f7ca4 ("mptcp: cleanup mptcp_subflow_discard_data()")
Cc: [email protected]
Link: multipath-tcp/mptcp_net-next#501
Signed-off-by: Paolo Abeni <[email protected]>
Reviewed-by: Mat Martineau <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Hi,
You may remember me from such great issues as #430 and a few others. We are doing large rsync transfers using MPTCP over two ISP connections between our WAN offices.
Our setup is pretty much still as described in #430 (comment)
We have just started noticing that some long-running transfers are dropping one of the TCP subflow endpoints (why, as yet unknown), and I'm also wondering if there is a way to "repair" such an mptcp connection and get it to reconnect the lost subflow endpoints?
So here is an example of a transfer that starts out and is still fully connected (ss -neimMtp):
And then, by comparison, here's an rsync transfer that started out like the one above, but then something happened to drop some endpoints:
Namely the "backup" endpoint on ens192 and the second signal (bulk transfer) endpoint on ens256 have disconnected. The transfer continues on the remaining ens224 endpoint but at much reduced performance.
Now, I have also seen some cases where the two signal bulk transfers are still there, but the backup endpoint is gone. In the case of the disconnected backup connection, is it possible that it might drop out because no data traverses it and the keepalive timeout kills it? Whatever the blip might have been, all routing and connectivity seem to be working again by the time I get to it.
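For the keepalive theory, these are the knobs I would check first (a sketch; the values are whatever each system has set, and keepalive only matters if SO_KEEPALIVE is actually enabled on the subflow sockets):

```
# System-wide TCP keepalive settings that would apply to an idle backup
# subflow if SO_KEEPALIVE is enabled on it; note a keepalive only resets a
# connection whose probes go unanswered, not one that is merely idle.
sysctl net.ipv4.tcp_keepalive_time \
       net.ipv4.tcp_keepalive_intvl \
       net.ipv4.tcp_keepalive_probes
```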
One thing that does seem to happen is that all the running connections between two hosts suffer this problem at the same time, so a blip in connectivity, or maybe TCP memory exhaustion (we use large buffers) on one host, could be the underlying root cause?
I'm not sure how best to debug the drops, so any tips around that would be great. I'm also wondering if it's possible to "repair" such connections and force them to re-add the endpoint connections? My quick searching did not throw up anything useful.
We are currently running v6.9.3.
Many thanks.