This repository has been archived by the owner on Apr 3, 2021. It is now read-only.

Wasted layer of buffering and inefficiency #149

Closed
fortuna opened this issue Nov 25, 2020 · 7 comments


fortuna commented Nov 25, 2020

tcpRecvFn only reads one packet at a time from the pbuf (p.tot_len), passing it to Receive, which blocks on the application call to conn.Read. It gets a buffer from the pool if the packet spans multiple pbuf entries, allocating a new buffer if the packet is larger than 2KiB.

Problem 1) The buffer is held for the duration of the Read, which in practice means all the time. That buffering layer is actually unnecessary.

Problem 2) Only one packet is read, even though there may be a lot more data available. That's a missed opportunity for higher efficiency.
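
For concreteness, here is a rough Go paraphrase of the push flow described above; it is illustrative only and not the repository's actual code. One packet is flattened into an intermediate buffer (pooled, or freshly allocated above 2KiB), and the pipe Write then blocks until the application's conn.Read drains it. The names receiveOnePacket, pbufSegments, and bufPool are assumptions, and the single-pbuf case that skips the copy is omitted.

```go
package sketch

import (
	"io"
	"sync"
)

// bufPool mimics the 2KiB pool mentioned above; the size comes from the
// description, not from the actual source.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 2048) },
}

// receiveOnePacket stands in for the push path: pbufSegments is one
// packet's pbuf chain and totLen its total length (p.tot_len).
func receiveOnePacket(pbufSegments [][]byte, totLen int, pipeWriter io.Writer) error {
	var buf []byte
	if totLen > 2048 {
		buf = make([]byte, totLen) // larger packets get a fresh allocation
	} else {
		full := bufPool.Get().([]byte)
		defer bufPool.Put(full)
		buf = full[:totLen]
	}
	// Flatten the pbuf chain into the intermediate buffer.
	n := 0
	for _, seg := range pbufSegments {
		n += copy(buf[n:], seg)
	}
	// This Write blocks until the application-side conn.Read consumes the
	// data, so the intermediate buffer is held for the whole Read.
	_, err := pipeWriter.Write(buf[:n])
	return err
}
```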

You can fix the problem by switching the data flow from push to pull and writing directly to the application-provided buffer (see the sketch after this list):

  • Replace the tcpConn.sndPipeWriter and tcpConn.sndPipeReader with a readCh chan []byte
  • In tcpRecvFn, wait for a read buffer: readBuf := <-conn.readCh
  • On conn.Read(buf), pass the buffer to the channel: conn.readCh <- buf
  • Copy as much data as possible from the pbuf into the readBuf
  • Notify the reader (done channel?) and call tcp_recved.
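
A minimal, self-contained Go sketch of that pull flow follows. The names readReq, readCh, and deliver are hypothetical, and the real lwIP wiring (tcpRecvFn, tcp_recved, pbuf iteration, close and error handling) is omitted; the point is that data is copied exactly once, straight into the buffer the application passed to Read.

```go
package main

import (
	"fmt"
	"io"
)

// readReq carries the application's buffer into the receive path and
// reports how many bytes were copied into it.
type readReq struct {
	buf  []byte
	done chan int
}

type tcpConn struct {
	readCh chan readReq
}

// Read hands the caller's buffer to the receive path and blocks until
// data has been copied directly into it (pull instead of push).
func (c *tcpConn) Read(buf []byte) (int, error) {
	req := readReq{buf: buf, done: make(chan int)}
	c.readCh <- req
	n := <-req.done
	if n == 0 {
		return 0, io.EOF // treat a zero-byte delivery as close in this sketch
	}
	return n, nil
}

// deliver stands in for tcpRecvFn: it waits for an outstanding Read and
// copies as much incoming data as fits into the caller's buffer. The real
// callback would pass the returned count to tcp_recved.
func (c *tcpConn) deliver(data []byte) int {
	req := <-c.readCh
	n := copy(req.buf, data)
	req.done <- n
	return n
}

func main() {
	conn := &tcpConn{readCh: make(chan readReq)}
	go conn.deliver([]byte("payload copied straight from a pbuf"))

	buf := make([]byte, 16*1024)
	n, _ := conn.Read(buf)
	fmt.Printf("read %d bytes: %q\n", n, buf[:n])
}
```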

There are likely some concurrency nuances around handling closing and errors; the implementation of io.Pipe offers some insights.

/cc: @bemasc @alalamav


fortuna commented Nov 25, 2020

If you don't want tcpRecvFn to access private members of TCPConn, then you could implement a local Reader with the logic above, and pass that Reader to the newTCPConn constructor, to be used in place of tcpConn.sndPipeReader. You could then get rid of tcpConn.Receive.
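
A hedged sketch of what such a standalone Reader could look like; pullReader and readReq are made-up names, and the actual newTCPConn signature is not shown. An instance of this type would be handed to the constructor where tcpConn.sndPipeReader is used today.

```go
package sketch

import "io"

// readReq is the same buffer-handoff idea as in the earlier sketch.
type readReq struct {
	buf  []byte
	done chan int
}

// pullReader exposes the pull logic as a plain io.Reader, so the lwIP
// callback code never needs to touch TCPConn's private fields.
type pullReader struct {
	readCh chan readReq
}

var _ io.Reader = (*pullReader)(nil)

func (r *pullReader) Read(buf []byte) (int, error) {
	req := readReq{buf: buf, done: make(chan int)}
	r.readCh <- req
	n := <-req.done
	if n == 0 {
		return 0, io.EOF
	}
	return n, nil
}
```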

eycorsican (Owner) commented

I think it's possible to avoid this buffer allocation and the copy by receiving pbufs in a loop, one by one, in this manner. Dechaining a pbuf from a chain seems quite easy; lwIP provides a pbuf_dechain function.


fortuna commented Nov 26, 2020

That would help remove the extra 2KiB per connection. However, each Receive call still blocks on a Read call, so we would be limiting the Reads to one pbuf, when we could have the application read as much as it can in one go if we switch from push to pull.

One thing I don't understand is how the stream or packet is split into pbufs. Is there a max pbuf size? Is it a configuration option?


fortuna commented Nov 26, 2020

To give you some more context about the problem we are having: the Outline client uses go-tun2socks to send traffic to a Shadowsocks server. Shadowsocks puts data in encrypted frames. We can accumulate as much as 16KiB in a frame, but because we don't know whether the next Read will block, we are forced to emit a frame for every Read.

Ideally, I'd give tun2socks my 16KiB buffer and it would fill it with as much available data as possible, possibly from multiple packets.
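
To illustrate the desired application-side behavior, here is a small Go sketch; the io.Reader and io.Writer stand-ins and the relayToFrames name are mine, not the actual Outline or tun2socks API. Each Read fills a 16KiB buffer, and each downstream Write would correspond to one Shadowsocks frame.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// relayToFrames drains src into 16KiB reads so that every frame written
// to dst carries as much data as a single Read could gather.
func relayToFrames(src io.Reader, dst io.Writer) error {
	buf := make([]byte, 16*1024)
	for {
		n, err := src.Read(buf) // ideally filled from several pbufs at once
		if n > 0 {
			if _, werr := dst.Write(buf[:n]); werr != nil {
				return werr
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	// Stand-ins: a 40KiB source and an in-memory sink instead of the real
	// tun2socks connection and Shadowsocks cipher stream.
	src := bytes.NewReader(bytes.Repeat([]byte("x"), 40*1024))
	var dst bytes.Buffer
	if err := relayToFrames(src, &dst); err != nil {
		fmt.Println("relay error:", err)
	}
	fmt.Println("relayed", dst.Len(), "bytes")
}
```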

The Shadowsocks overhead is not a lot: 34 bytes per frame. But assuming a 1500-byte MTU minus TCP/IP overhead, leaving about 1420 bytes of payload, we are talking about sending roughly 2.5% more bytes for every byte of data. Not bad, but that's the minimum overhead, and it can become an issue if we have small pbufs. For example, that's 25% if we are reading 1.4KiB pbufs.

eycorsican (Owner) commented

This PR would solve the blocking issue, and if tcpRecvFn is called fast enough (cgo is slow), the application would get as much data as possible. But a buffer is introduced. It's not clear to me how we could solve both problems at the same time.

> One thing I don't understand is how the stream or packet is split into pbufs. Is there a max pbuf size? Is it a configuration option?

FWIW, it depends on how the pbuf is allocated and on the input IP packet size; the tcpRecvFn callback function is mainly driven by packet input.

eycorsican (Owner) commented

I see three issues here:

  1. Extra intermediate buffering
  2. Application Read blocks lwIP
  3. Application Reads almost always return small data chunks

I implemented your idea above (if I'm understanding it right) in 425fb2e, and it should solve 1 and 2. Without intermediate buffering, I don't see any chance to solve 3.

github-actions (bot) commented

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days
