benches: Benchmark with constrained connections #102

Merged: 5 commits merged into libp2p:develop on Feb 3, 2021

Conversation

mxinden (Member) commented Feb 2, 2021

Previously the benchmarks used an unbounded channel to simulate a
network connection. While this is helpful for measuring the CPU impact of
Yamux-specific changes, it makes it difficult to estimate how changes might
affect network throughput in different environments.

The crate constrained-connection can help simulate different
environments (DSL, HSDPA, GBit LAN, ...) by providing a channel that
enforces a specified bandwidth and delay. That said, it does not
simulate the OS network stack, routers, buffers, packet loss, ... Thus,
as one would expect, it can be used to detect trends, but not to, e.g.,
determine bandwidth expectations.
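
The figure that matters for the deadlock discussed below is the bandwidth-delay product (BDP) of such a link, i.e. how many bytes can be in flight before the receiver has to drain them. A rough back-of-the-envelope helper (made-up numbers, not the profiles actually shipped by constrained-connection):

// Illustration only: the numbers are invented, not the crate's actual profiles.
fn bdp_bytes(bandwidth_bits_per_sec: u64, one_way_delay_ms: u64) -> u64 {
    bandwidth_bits_per_sec / 8 * one_way_delay_ms / 1000
}

fn main() {
    // E.g. an ADSL-like 2 Mbit/s link with a 25 ms one-way delay:
    println!("BDP ≈ {} bytes", bdp_bytes(2_000_000, 25)); // ≈ 6250 bytes
}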

This commit makes the benchmarks run on top of a subset of the
connection types provided by constrained-connection. In addition it
retains the simulation on top of an unconstrained connection.
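
Roughly, the benchmark matrix looks like the following Criterion sketch. It is simplified, the message size and the measured body are placeholders, and it is not the literal benches/concurrent.rs:

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};

// Placeholder message size; the real benchmarks pick their own.
const MSG_SIZE: usize = 4096;

fn concurrent(c: &mut Criterion) {
    let mut group = c.benchmark_group("concurrent");
    // A subset of the connection types, plus the unconstrained baseline.
    for link in ["mobile", "adsl2+", "gbit-lan", "unconstrained"] {
        for nstreams in [1usize, 10, 100] {
            for nmessages in [1usize, 10, 100] {
                let bytes = (nstreams * nmessages * MSG_SIZE) as u64;
                group.throughput(Throughput::Bytes(bytes));
                let id = BenchmarkId::from_parameter(format!(
                    "{}/#streams{}/#messages{}",
                    link, nstreams, nmessages
                ));
                group.bench_function(id, |b| {
                    b.iter(|| {
                        // Elided: set up client and server on the constrained
                        // link and drive the transfer to completion.
                    })
                });
            }
        }
    }
    group.finish();
}

criterion_group!(benches, concurrent);
criterion_main!(benches);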

When simulating with low bandwidth-delay-product connections, Yamux can
deadlock in the roundtrip scenario. Imagine a client sending a large
payload, exceeding the bandwidth-delay product of the connection, to a
server. The server will try to echo the payload back to the client. Once
both client and server have exhausted the buffer provided by the
bandwidth-delay product of the connection, neither can make progress, as
the other won't read from the connection. This commit therefore rewrites
the benchmarks so that only the client sends to the server; the server
no longer echoes the payload back to the client.
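
In essence, the rewritten benchmark uses a send-only pattern like the following sketch. It is generic over AsyncRead/AsyncWrite and a simplification of mine, not the benchmark code itself:

use futures::prelude::*;

// The sender writes its messages and closes the stream; the receiver only
// drains bytes and never writes back, so neither side ends up waiting for
// the other one to read.
async fn send_only<S>(mut stream: S, msg: &[u8], n: usize) -> std::io::Result<()>
where
    S: AsyncWrite + Unpin,
{
    for _ in 0..n {
        stream.write_all(msg).await?;
    }
    stream.close().await
}

async fn drain<S>(stream: S) -> std::io::Result<u64>
where
    S: AsyncRead + Unpin,
{
    // Discard everything the sender wrote and return the number of bytes read.
    futures::io::copy(stream, &mut futures::io::sink()).await
}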


I hope these benchmarks will be helpful when measuring the impact of libp2p/rust-libp2p#1849 and #100.

As an example, below you can find the critcmp output of a run with WindowUpdateMode::OnReceive and one with WindowUpdateMode::OnRead. While the comparison of the two modes does not show much of a difference, the throughput measurements for single vs. multiple streams do.

You can find the bandwidth and the delay for each connection type (mobile, adsl2+, ...) here.

$ critcmp on-read on-receive

group                                                on-read                                on-receive
-----                                                -------                                ----------
concurrent/adsl2+/#streams1/#messages1               1.00     25.4±0.08ms   157.3 KB/sec    1.01     25.6±0.07ms   156.1 KB/sec
concurrent/adsl2+/#streams1/#messages10              1.00     25.5±0.07ms  1566.0 KB/sec    1.00     25.6±0.15ms  1561.5 KB/sec
concurrent/adsl2+/#streams1/#messages100             1.00    227.7±0.15ms  1756.6 KB/sec    1.01    228.9±0.86ms  1747.1 KB/sec
concurrent/adsl2+/#streams10/#messages1              1.00     25.6±0.09ms  1560.9 KB/sec    1.01     25.9±0.06ms  1544.2 KB/sec
concurrent/adsl2+/#streams10/#messages10             1.00    177.1±0.16ms     2.2 MB/sec    1.01    178.4±0.30ms     2.2 MB/sec
concurrent/adsl2+/#streams10/#messages100            1.00   1671.1±7.84ms     2.3 MB/sec    1.00   1676.6±4.08ms     2.3 MB/sec
concurrent/adsl2+/#streams100/#messages1             1.00    177.5±0.26ms     2.2 MB/sec    1.00    177.9±0.50ms     2.2 MB/sec
concurrent/adsl2+/#streams100/#messages10            1.00   1668.3±0.49ms     2.3 MB/sec    1.00   1669.5±0.89ms     2.3 MB/sec
concurrent/adsl2+/#streams100/#messages100           1.00      16.6±0.01s     2.3 MB/sec    1.00      16.7±0.04s     2.3 MB/sec
concurrent/gbit-lan/#streams1/#messages1             1.02    126.8±2.30µs    30.8 MB/sec    1.00    124.2±1.04µs    31.4 MB/sec
concurrent/gbit-lan/#streams1/#messages10            1.04   857.2±77.05µs    45.6 MB/sec    1.00   827.7±16.55µs    47.2 MB/sec
concurrent/gbit-lan/#streams1/#messages100           1.01      7.9±0.29ms    49.5 MB/sec    1.00      7.8±0.07ms    50.0 MB/sec
concurrent/gbit-lan/#streams10/#messages1            1.02    831.7±6.08µs    47.0 MB/sec    1.00    815.3±4.09µs    47.9 MB/sec
concurrent/gbit-lan/#streams10/#messages10           1.01      7.8±0.12ms    50.1 MB/sec    1.00      7.7±0.07ms    50.4 MB/sec
concurrent/gbit-lan/#streams10/#messages100          1.00     75.3±0.79ms    51.9 MB/sec    1.00     75.2±0.50ms    51.9 MB/sec
concurrent/gbit-lan/#streams100/#messages1           1.00      7.9±0.16ms    49.7 MB/sec    1.01      7.9±0.18ms    49.2 MB/sec
concurrent/gbit-lan/#streams100/#messages10          1.01     77.8±2.00ms    50.2 MB/sec    1.00     77.1±0.49ms    50.7 MB/sec
concurrent/gbit-lan/#streams100/#messages100         1.00   773.2±15.14ms    50.5 MB/sec    1.00    771.9±4.28ms    50.6 MB/sec
concurrent/mobile/#streams1/#messages1               1.00     50.5±0.30ms    79.3 KB/sec    1.00     50.3±0.07ms    79.5 KB/sec
concurrent/mobile/#streams1/#messages10              1.00    100.6±0.09ms   397.4 KB/sec    1.00    100.7±0.10ms   397.4 KB/sec
concurrent/mobile/#streams1/#messages100             1.00    653.4±0.32ms   612.1 KB/sec    1.00    653.0±0.46ms   612.6 KB/sec
concurrent/mobile/#streams10/#messages1              1.00    100.7±0.08ms   397.2 KB/sec    1.00    100.7±0.04ms   397.2 KB/sec
concurrent/mobile/#streams10/#messages10             1.00    553.4±0.32ms   722.9 KB/sec    1.00    554.3±1.31ms   721.6 KB/sec
concurrent/mobile/#streams10/#messages100            1.00       5.5±0.00s   723.1 KB/sec    1.00       5.5±0.00s   723.3 KB/sec
concurrent/mobile/#streams100/#messages1             1.00    554.0±0.34ms   722.1 KB/sec    1.00    553.7±0.35ms   722.5 KB/sec
concurrent/mobile/#streams100/#messages10            1.00       5.5±0.00s   723.1 KB/sec    1.00       5.5±0.01s   722.9 KB/sec
concurrent/mobile/#streams100/#messages100           1.00      55.1±0.02s   725.7 KB/sec    1.00      55.2±0.08s   725.2 KB/sec
concurrent/unconstrained/#streams1/#messages1        1.00     38.6±1.15µs   101.2 MB/sec    1.00     38.7±1.36µs   100.9 MB/sec
concurrent/unconstrained/#streams1/#messages10       1.00    131.2±2.80µs   297.7 MB/sec    1.03    134.8±8.17µs   289.7 MB/sec
concurrent/unconstrained/#streams1/#messages100      1.00  1067.1±67.74µs   366.1 MB/sec    1.06  1132.5±104.27µs   344.9 MB/sec
concurrent/unconstrained/#streams10/#messages1       1.01   254.2±18.23µs   153.7 MB/sec    1.00    251.6±5.78µs   155.3 MB/sec
concurrent/unconstrained/#streams10/#messages10      1.08  1491.8±76.80µs   261.9 MB/sec    1.00  1380.7±47.10µs   282.9 MB/sec
concurrent/unconstrained/#streams10/#messages100     1.00     13.0±0.58ms   300.8 MB/sec    1.03     13.4±0.95ms   292.5 MB/sec
concurrent/unconstrained/#streams100/#messages1      1.35      2.6±0.27ms   148.1 MB/sec    1.00  1947.3±51.61µs   200.6 MB/sec
concurrent/unconstrained/#streams100/#messages10     1.09     14.6±0.47ms   268.2 MB/sec    1.00     13.3±0.21ms   292.6 MB/sec
concurrent/unconstrained/#streams100/#messages100    1.10    141.9±7.09ms   275.2 MB/sec    1.00    128.4±3.47ms   304.1 MB/sec

romanb (Contributor) commented Feb 2, 2021

When simulating with low bandwidth-delay-product connections, Yamux can
deadlock in the roundtrip scenario. Imagine a client sending a large
payload, exceeding the bandwidth-delay product of the connection, to a
server. The server will try to echo the payload back to the client. Once
both client and server have exhausted the buffer provided by the
bandwidth-delay product of the connection, neither can make progress, as
the other won't read from the connection.

By "payload" do you mean a single message? This only happens with WindowUpdateMode::OnRead? With OnReceive I would expect a stream reset, at least if a single message is larger than the max_buffer_size. If there is a deadlock when the capacity of the "virtual link" is exhausted, then that should only be because of a fault in the client-server "echo" protocol used in the benchmarks, as a peer in a protocol should not stop reading because it cannot finish writing (a particular message), or if so, considering it a protocol error and dropping the stream. On first impulse I would prefer to have an echo protocol in the benchmarks that cannot deadlock, be that due to conscious sizing of individual messages according to the bandwidth or by increasing receive buffers as necessary. But if it makes the benchmarks simpler and a benchmark with roundtrips does not seem to add much value over a unidirectional data transfer, I won't object. It would just be good to be very clear about why roundtrip benchmarks with varying virtual link capacities deadlock and why we don't want or can write them such that they don't. If there is an as-of-yet undiscovered opportunity for deadlock other than what is already known and also documented for WindowUpdateMode::OnRead, it would be good to look into and / or document.

mxinden (Member, Author) commented Feb 2, 2021

By "payload" do you mean a single message?

Yes.

If there is a deadlock when the capacity of the "virtual link" is exhausted, then that should only be because of a fault in the client-server "echo" protocol used in the benchmarks: a peer in a protocol should not stop reading because it cannot finish writing (a particular message), or, if it does, it should consider that a protocol error and drop the stream.

As far as I can tell, that is not the case. I have attached a unit test below which reproduces a variant of the deadlock. It uses the default WindowUpdateMode, OnReceive. The server reads a single message on each incoming stream and echoes it back to the client. The client opens num_streams streams, writes and reads the message concurrently on each, and then closes the stream. With a message size of 1024 bytes and a link capacity of 10 * 1024 bytes, this unit test deadlocks with > 12 concurrent streams on my machine.

If there is an as-yet undiscovered opportunity for deadlock other than what is already known and documented for WindowUpdateMode::OnRead, it would be good to look into and/or document it.

Debugging the issue, I am afraid this is inherent to the design of connection.rs. When asked to send a new frame on a stream, on_stream_command seems to block the next_stream loop until the message is fully sent. It will not allow reading while blocked on writing. Thus, when the capacity of the link is saturated, the client blocks writing its next message, the server blocks echoing its next message back to the client, and both are deadlocked.
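
To make the dynamic concrete, here is a self-contained toy using plain bounded channels instead of Yamux (an analogy of my own, not connection.rs itself): each side first writes a payload larger than the link capacity and only then starts reading, so neither ever drains the other's writes and both stall in send.

use futures::channel::mpsc;
use futures::{executor::LocalPool, task::SpawnExt, SinkExt, StreamExt};

fn main() {
    let mut pool = LocalPool::new();
    let spawner = pool.spawner();

    // Two tiny unidirectional "links", one per direction.
    let (mut a_tx, mut b_rx) = mpsc::channel::<u8>(8);
    let (mut b_tx, mut a_rx) = mpsc::channel::<u8>(8);

    // Peer A: write the whole "payload" first, only read afterwards.
    spawner
        .spawn(async move {
            for i in 0..1000u32 {
                a_tx.send(i as u8).await.unwrap(); // blocks once the link is full
            }
            while a_rx.next().await.is_some() {}
        })
        .unwrap();

    // Peer B: the same pattern in the other direction.
    spawner
        .spawn(async move {
            for i in 0..1000u32 {
                b_tx.send(i as u8).await.unwrap(); // blocks once the link is full
            }
            while b_rx.next().await.is_some() {}
        })
        .unwrap();

    // Both tasks end up parked in `send`, each waiting for the other to read.
    pool.run_until_stalled();
    println!("stalled: both peers are blocked writing");
}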

Off the top of my head I would deem the probability of this edge case (a deadlock due to a low-capacity link) to be low, especially since it requires a mutual dependency between server and client, each blocking the other. I would expect the OS networking stack to add many more bytes to the capacity of the link through its various buffers. That said, I am happy to look into ways to continue reading while blocked on writing a single frame, given that the issue might cause deadlocks in other scenarios as well.

Please take the above with a grain of salt. After all, I am relatively new to this Yamux implementation, and I might well be missing something.

@romanb First off, thank you for the elaborate comment. Does the above make sense to you? I am happy to extend my (already lengthy) reply.

Unit test to reproduce deadlock.
diff --git a/src/tests.rs b/src/tests.rs
index fc661a6..cbccbb1 100644
--- a/src/tests.rs
+++ b/src/tests.rs
@@ -16,6 +16,14 @@ use rand::Rng;
 use std::{fmt::Debug, io, net::{Ipv4Addr, SocketAddr, SocketAddrV4}};
 use tokio::{net::{TcpStream, TcpListener}, runtime::Runtime, task};
 use tokio_util::compat::{Compat, Tokio02AsyncReadCompatExt};
+use futures::channel::mpsc::{unbounded, UnboundedSender, UnboundedReceiver};
+use std::sync::{Arc, Mutex};
+use std::task::{Context, Poll, Waker};
+use std::pin::Pin;
+use futures::io::AsyncReadExt;
+use futures::future::join;
+use futures::task::Spawn;
+use futures::executor::block_on;
 
 #[test]
 fn prop_config_send_recv_single() {
@@ -298,3 +306,190 @@ where
     Ok(result)
 }
 
+#[test]
+fn deadlock_low_capacity_link() {
+    let mut pool = futures::executor::LocalPool::new();
+    // Message is 10x smaller than link capacity.
+    let msg = vec![1u8; 1024];
+    // On my machine stalls with > 12.
+    let num_streams = 13;
+    let (tx, rx) = unbounded();
+
+    let (server_endpoint, client_endpoint) = bounded_capacity::Endpoint::new(10 * 1024);
+
+    // Create and spawn a server that echoes every message back to the client.
+    let server = Connection::new(server_endpoint, Config::default(), Mode::Server);
+    pool.spawner().spawn_obj( async move {
+        crate::into_stream(server)
+            .try_for_each_concurrent(None, |mut stream| async move {
+                {
+                    let (mut r, mut w) = futures::io::AsyncReadExt::split(&mut stream);
+                    futures::io::copy(&mut r, &mut w).await?;
+                }
+                stream.close().await?;
+                Ok(())
+            })
+            .await
+            .expect("server works")
+    }.boxed().into()).unwrap();
+
+    // Create and spawn a client.
+    let client = Connection::new(client_endpoint, Config::default(), Mode::Client);
+    let ctrl = client.control();
+    pool.spawner()
+        .spawn_obj(crate::into_stream(client).for_each(|r| async { r.unwrap(); }).boxed().into())
+        .unwrap();
+
+    // Create `num_streams` streams, sending and receiving `msg` on each.
+    for _ in 0..num_streams {
+        let msg = msg.clone();
+        let mut ctrl = ctrl.clone();
+        let tx = tx.clone();
+        pool.spawner().spawn_obj(async move {
+            let stream = ctrl.open_stream().await.unwrap();
+            let (mut reader, mut writer) = AsyncReadExt::split(stream);
+            let mut b = vec![0; msg.len()];
+            join(
+                writer.write_all(msg.as_ref()).map_err(|e| panic!(e)),
+                reader.read_exact(&mut b[..]).map_err(|e| panic!(e)),
+            ).await;
+            let mut stream = reader.reunite(writer).unwrap();
+            stream.close().await.unwrap();
+            tx.unbounded_send(b.len()).unwrap();
+        }.boxed().into()).unwrap();
+    };
+
+    // Run all tasks until none make any more progress.
+    pool.run_until_stalled();
+    drop(pool);
+    drop(tx);
+
+    // Expect each of the `num_streams` tasks to finish, reporting the number of bytes they sent and
+    // received.
+    let n = block_on(rx.fold(0, |acc, n| future::ready(acc + n)));
+    assert_eq!(n, num_streams * msg.len());
+}
+
+mod bounded_capacity {
+    use super::*;
+    use futures::ready;
+    use std::io::{Error, ErrorKind, Result};
+
+    pub struct Endpoint {
+        sender: UnboundedSender<Vec<u8>>,
+        receiver: UnboundedReceiver<Vec<u8>>,
+        next_item: Option<Vec<u8>>,
+
+        shared_send: Arc<Mutex<Shared>>,
+        shared_receive: Arc<Mutex<Shared>>,
+
+        capacity: usize,
+    }
+
+    #[derive(Default)]
+    struct Shared {
+        size: usize,
+        waker_write: Option<Waker>,
+    }
+
+    impl Endpoint {
+        pub fn new(capacity: usize) -> (Endpoint, Endpoint) {
+            let (a_to_b_sender, a_to_b_receiver) = unbounded();
+            let (b_to_a_sender, b_to_a_receiver) = unbounded();
+
+            let a_to_b_shared = Arc::new(Mutex::new(Default::default()));
+            let b_to_a_shared = Arc::new(Mutex::new(Default::default()));
+
+            let a = Endpoint {
+                sender: a_to_b_sender,
+                receiver: b_to_a_receiver,
+                next_item: None,
+                shared_send: a_to_b_shared.clone(),
+                shared_receive: b_to_a_shared.clone(),
+                capacity,
+            };
+
+            let b = Endpoint {
+                sender: b_to_a_sender,
+                receiver: a_to_b_receiver,
+                next_item: None,
+                shared_send: b_to_a_shared,
+                shared_receive: a_to_b_shared,
+                capacity,
+            };
+
+            (a, b)
+
+        }
+    }
+
+    impl AsyncRead for Endpoint {
+        fn poll_read(
+            mut self: Pin<&mut Self>,
+            cx: &mut Context<'_>,
+            buf: &mut [u8],
+        ) -> Poll<Result<usize>> {
+            let item = match self.next_item.as_mut() {
+                Some(item) => item,
+                None => match ready!(self.receiver.poll_next_unpin(cx)) {
+                    Some(item) => {
+                        self.next_item = Some(item);
+                        self.next_item.as_mut().unwrap()
+                    }
+                    None => {
+                        return Poll::Ready(Ok(0));
+                    }
+                },
+            };
+
+            let n = std::cmp::min(buf.len(), item.len());
+
+            buf[0..n].copy_from_slice(&item[0..n]);
+
+            if n < item.len() {
+                *item = item.split_off(n);
+            } else {
+                self.next_item.take().unwrap();
+            }
+
+            let mut shared = self.shared_receive.lock().unwrap();
+            if let Some(waker) = shared.waker_write.take() {
+                waker.wake();
+            }
+
+            debug_assert!(shared.size >= n);
+            shared.size -= n;
+
+            Poll::Ready(Ok(n))
+        }
+    }
+
+    impl AsyncWrite for Endpoint {
+        fn poll_write(self: Pin<&mut Self>, cx: &mut Context<'_>, buf: &[u8]) -> Poll<Result<usize>> {
+            let mut shared = self.shared_send.lock().unwrap();
+            let n = std::cmp::min(self.capacity - shared.size, buf.len());
+            if n == 0 {
+                shared.waker_write = Some(cx.waker().clone());
+                return Poll::Pending;
+            }
+
+            self.sender
+                .unbounded_send(buf[0..n].to_vec())
+                .map_err(|e| Error::new(ErrorKind::ConnectionAborted, e))?;
+
+            shared.size += n;
+
+            Poll::Ready(Ok(n))
+        }
+
+        fn poll_flush(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Result<()>> {
+            ready!(self.sender.poll_flush_unpin(cx)).unwrap();
+            Poll::Ready(Ok(()))
+        }
+
+        fn poll_close(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Result<()>> {
+            ready!(self.sender.poll_close_unpin(cx)).unwrap();
+            Poll::Ready(Ok(()))
+        }
+    }
+}

romanb (Contributor) commented Feb 3, 2021

Debugging the issue, I am afraid this is inherent to the design of connection.rs. When asked to send a new frame on a stream, on_stream_command seems to block the next_stream loop until the message is fully sent. It will not allow reading while blocked on writing.

I think you're right, and it can indeed be problematic. I will take a closer look; thanks for the test case. With that being clarified, I don't think it needs to block your changes to the benchmarks here, which generally look good to me.

(A review comment on benches/concurrent.rs was marked as resolved.)
mxinden (Member, Author) commented Feb 3, 2021

With that being clarified, I don't think it needs to block your changes to the benchmarks here, which generally look good to me.

Sounds good to me.

In case we do end up resolving the deadlock, I will partially revert these changes, benchmarking a round-trip instead of a one-way data transfer.

mxinden merged commit dada223 into libp2p:develop on Feb 3, 2021
romanb (Contributor) commented Feb 3, 2021

Unit test to reproduce deadlock.

A small follow-up to that test: I could not see a deadlock as the reason for it failing with larger numbers of substreams. Rather, it fails simply because it uses LocalPool::run_until_stalled(), which I think completes as soon as all futures in the pool are Pending, but that does not mean they cannot make progress by being woken (and they indeed may still be woken). A slightly modified version of your test is here, with the only important difference being the use of run_until() for the tasks in question (there is a bit of noise due to upgrading tokio for the tests).

Nevertheless, regarding deadlocks due to reads waiting for writes to finish on both ends, I think I did manage to trigger a stall with tests/concurrent.rs, which also triggered it in an earlier version of yamux that was then explicitly patched. I don't remember whether that consideration was dropped on purpose in a later rewrite, whether it was an oversight, or whether the current code is simply less susceptible to this problem than the old one, since tests/concurrent.rs always seems to pass, at least with the current settings of 1000 streams and 100 KiB messages, the same configuration for which it stalled occasionally until #51.

So long story short, this still seems to need further investigation and better reproduction.

romanb (Contributor) commented Feb 4, 2021

See #104 for a follow-up.

mxinden (Member, Author) commented Feb 4, 2021

A small follow-up to that test: I could not see a deadlock as the reason for it failing with larger numbers of substreams. Rather, it fails simply because it uses LocalPool::run_until_stalled(), which I think completes as soon as all futures in the pool are Pending, but that does not mean they cannot make progress by being woken (and they indeed may still be woken).

I don't follow. Isn't the fact that all futures in the pool are pending a proof of a deadlock? I am not aware of any external entities (e.g. timers) being used which could wake either of them up again.

romanb (Contributor) commented Feb 4, 2021

I don't follow. Isn't the fact that all futures in the pool are pending a proof of a deadlock?

As far as I can tell, no. You can observe the behaviour by running the modified test I linked to earlier, replacing run_until with run_until_stalled. In particular, what I have seen is that the reason it starts to trigger with 13 streams is that this is about the number that results in the first wake() of the waker_write. So there is a task scheduled to be polled again after it already returned Poll::Pending. It seems that run_until_stalled() can return while all futures are Pending even if some of them have already been woken to be polled again.
