Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: TCP connection erroneously duplicates message on Windows #58764

Open
jackc opened this issue Feb 27, 2023 · 4 comments
Open

net: TCP connection erroneously duplicates message on Windows #58764

jackc opened this issue Feb 27, 2023 · 4 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows
Milestone

Comments

@jackc
Copy link

jackc commented Feb 27, 2023

What version of Go are you using (go version)?

$ go version
go version go1.20.1 windows/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
PS C:\Users\jack\dev\golang-windows-dupreads> go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\jack\AppData\Local\go-build
set GOENV=C:\Users\jack\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Users\jack\go\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:\Users\jack\go
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\Program Files\Go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\Program Files\Go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.20.1
set GCCGO=gccgo
set GOAMD64=v1
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=0
set GOMOD=C:\Users\jack\dev\golang-windows-dupreads\go.mod
set GOWORK=
set CGO_CFLAGS=-O2 -g
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-O2 -g
set CGO_FFLAGS=-O2 -g
set CGO_LDFLAGS=-O2 -g
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -fno-caret-diagnostics -Qunused-arguments -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=C:\Users\jack\AppData\Local\Temp\go-build129281977=/tmp/go-build -gno-record-gcc-switches
PS C:\Users\jack\dev\golang-windows-dupreads>

What did you do?

  1. Create a TCP server on localhost.
  2. Connect to that server.
  3. Wrap the client conn in a wrapper that records all Write data and the server conn in a wrapper that records all Read data.
  4. Read and write a bunch of data in a very particular way (details below).
  5. Compare what was written to what was read.

What did you expect to see?

There should be no different in the bytes that the client sent and the bytes the server received.

What did you see instead?

At some point after ~2.7 MB a chunk of data is received twice.

Details

This problem was originally encountered in the tests for https://github.com/jackc/pgx. pgx has need of non-blocking IO (see #36973 and #15735). It also needs this supported when using TLS. Since the desired functionality doesn't presently exist in Go it has an internal package nbconn that wraps up a lot of complicated code in a way that presents a net.Conn that supports non-blocking IO even over TLS.

One of the tests consistently encounters a tls: bad record MAC on Windows but not on Linux or Mac. Given the complicated code it would be obvious to assume that the problem is in that nbconn package. However, after many hours of debugging I believe I have ruled that out. I have a reproducible test case that creates a net.Conn for a client and server and records everything written and read for the client to server stream. The TLS connection and the nbconn connection are on top of / use those recording connections. The recorded connections observe the repeated data (which obviously will corrupt the TLS connection).

Given that the repeated data is observed below the TLS connection I don't believe it is plausible that the bug exists in either pgx or the tls package. It seems likelier that it is a problem in the net package, the Go runtime, or Windows itself.

I have a test case at https://github.com/jackc/golang-windows-dupreads. It always fails on Windows but succeeds on other platforms. I've reduced the test as much as possible, but unfortunately I am unable to reproduce it without the particular pattern of reads and writes produced by nbconn and tls.

Here is the core of the test. It creates a server on localhost and establishes a connection to it. It wraps both connections in recorders. Then it performs some activity. Lastly, it checks to see if the bytes written and the bytes read match:

func TestRepeatedMessage(t *testing.T) {
	// Make a connected pair of TCP connections.
	tcpClientConn, tcpServerConn, err := makeTCPConns()
	if err != nil {
		t.Errorf("makeTCPConns failed: %v", err)
	}

	// Wrap the connections in recording connections.
	recordedClientConn, recordedServerConn := makeRecordedConns(tcpClientConn, tcpServerConn)

	// Establish a TLS connection from the client to the server.
	clientConn, serverConn, err := makeTSLConnection(recordedClientConn, recordedServerConn)
	if err != nil {
		t.Errorf("makeTLSConnection failed: %v", err)
	}
	defer clientConn.Close()
	defer serverConn.Close()

	// This message size should be large enough to block a Write until the other side Reads some of it. Both the server
	// and the client try to Write a large message without reading.
	const messageSize = 4 * 1024 * 1024

	// The server writes a large message before reading.
	errChan := make(chan error, 1)
	go func() {
		remoteWriteBuf := make([]byte, messageSize)
		_, err := serverConn.Write(remoteWriteBuf)
		if err != nil {
			errChan <- err
			return
		}

		readBuf := make([]byte, messageSize)
		_, err = io.ReadFull(serverConn, readBuf)
		errChan <- err
	}()

	// This Write will always succeed due to the nbconn wrapper buffering all writes. This is necessary to prevent the
	// TLS connection from being broken by a temporarily failing write.
	writeBuf := make([]byte, messageSize)
	n, err := clientConn.Write(writeBuf)
	if err != nil {
		t.Errorf("clientConn.Write failed: %v", err)
	}
	if messageSize != n {
		t.Errorf("clientConn.Write wrote wrong number of bytes: %d instead of %d", n, messageSize)
	}

	readBuf := make([]byte, messageSize)
	// This Read is what will actually flush the previous Write. Since the server goroutine above also writes a large
	// message before reading this means that both sides are writing enough bytes to block until the other side reads.
	// This will deadlock unless the internal non-blocking and buffering logic is working.
	_, err = io.ReadFull(clientConn, readBuf)
	if err != nil {
		t.Errorf("io.ReadFull(clientConn, readBuf) failed: %v", err)
	}

	err = <-errChan
	if err != nil {
		t.Errorf("serverConn goroutine encountered an error: %v", err)
	}

	// Now compare what was written by the client connection and what was read by the server connection.
	recordedServerConn.readLock.Lock()
	defer recordedServerConn.readLock.Unlock()
	readLog := recordedServerConn.readLog

	recordedClientConn.writeLock.Lock()
	defer recordedClientConn.writeLock.Unlock()
	writeLog := recordedClientConn.writeLog

	if readLog.Len() != writeLog.Len() {
		t.Errorf("readLog.Len(): %d, writeLog.Len(): %d\n", readLog.Len(), writeLog.Len())
	}

	for i := 0; i < readLog.Len(); i++ {
		if readLog.Bytes()[i] != writeLog.Bytes()[i] {
			t.Errorf("mismatch at: %d", i)
			t.Logf("readLog.Bytes()[i-20 : i+20]  %v", readLog.Bytes()[i-20:i+20])
			t.Logf("writeLog.Bytes()[i-20 : i+20] %v", writeLog.Bytes()[i-20:i+20])
			firstOccuranceOfMismatchDataIndex := bytes.Index(writeLog.Bytes(), readLog.Bytes()[i:i+100])
			if firstOccuranceOfMismatchDataIndex < i {
				t.Logf("mismatch data appears earlier in stream at %d which is %d bytes earlier", firstOccuranceOfMismatchDataIndex, i-firstOccuranceOfMismatchDataIndex)
			}
			break
		}
	}
}

Run with go test

PS C:\Users\jack\dev\golang-windows-dupreads> go test
--- FAIL: TestRepeatedMessage (0.14s)
    nbconn_test.go:229: serverConn goroutine encountered an error: local error: tls: bad record MAC
    nbconn_test.go:242: readLog.Len(): 2799033, writeLog.Len(): 4200393
    nbconn_test.go:247: mismatch at: 2766292
    nbconn_test.go:248: readLog.Bytes()[i-20 : i+20]  [33 53 47 198 118 85 105 211 192 151 29 220 243 125 77 23 3 3 64 17 75 110 182 199 210 112 201 152 118 93 91 4 251 122 27 61 92 41 94 214]
    nbconn_test.go:249: writeLog.Bytes()[i-20 : i+20] [33 53 47 198 118 85 105 211 192 151 29 220 243 125 77 23 3 3 64 17 94 72 252 16 226 122 111 94 77 211 208 12 105 22 3 27 13 109 105 190]
    nbconn_test.go:252: mismatch data appears earlier in stream at 2749886 which is 16406 bytes earlier
FAIL
exit status 1
FAIL    dupreads        0.350s

Observations:

  1. The mismatch / repeated message is always 16406 bytes earlier. This appears to be the size of the Writes emitted by the tls package.
  2. The error never occurs before 2.7MB.
  3. While going back and forth between Windows and Mac while testing this issue I observed that Windows seems to have a larger TCP buffer or window. i.e. You can write more bytes to a connection without the other side reading before the write blocks. This leads this test to generate a different pattern of reads and writes on the different platforms.
@ianlancetaylor
Copy link
Member

Can you show us a complete stand-alone program or test that demonstrates the problem?

While anything is possible it's quite unlikely that there is a problem of this sort in the net package.

@ianlancetaylor ianlancetaylor added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Feb 27, 2023
@jackc
Copy link
Author

jackc commented Feb 27, 2023

@ianlancetaylor ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Feb 27, 2023
@ianlancetaylor ianlancetaylor added this to the Go1.21 milestone Feb 27, 2023
@ianlancetaylor
Copy link
Member

CC @golang/windows

@ydnar
Copy link

ydnar commented Mar 1, 2023

We started seeing sporadic tls: bad record MAC in production on Linux since the release of Go 1.19.6, related to opening long-lived TLS connections to EPP servers. It seems like it’s occurring in the TLS handshake.

The issue seems to be confined to a single remote operator (GoDaddy), and there was no (public) changes on their end between our deployment on Go 1.19.5 and Go 1.19.6.

This was the only crypto/tls CL in Go 1.19.6 (a 1.20 backport), but there’s no immediately obvious connection to the increased alertBadRecordMAC errors: 00b256e

FWIW the tls.Conns are dialed via an SSH proxy (needed for IP allow-listing).

@gopherbot gopherbot modified the milestones: Go1.21, Go1.22 Aug 8, 2023
@gopherbot gopherbot modified the milestones: Go1.22, Go1.23 Feb 6, 2024
@gopherbot gopherbot modified the milestones: Go1.23, Go1.24 Aug 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Windows
Projects
None yet
Development

No branches or pull requests

4 participants