
net/http/httputil: concurrent map writes in ReverseProxy on subsequent RoundTripper usages #65123

Closed
mariash opened this issue Jan 16, 2024 · 18 comments
Assignees
Labels
NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@mariash

mariash commented Jan 16, 2024

Go version

go version go1.21.6 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/pivotal/.cache/go-build'
GOENV='/home/pivotal/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/pivotal/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/pivotal/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.6'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build743232752=/tmp/go-build -gno-record-gcc-switches'

What did you do?

https://go.dev/play/p/kxnqMuLCo-2

What did you see happen?

Sometimes getting an error:

fatal error: concurrent map writes

goroutine 53 [running]:
net/http/httputil.(*ReverseProxy).ServeHTTP.func2(0xc000222420?, 0xc000214000?)
	/usr/local/go-faketime/src/net/http/httputil/reverseproxy.go:464 +0x6b
net/http.(*persistConn).readResponse(0xc000230000, {{}, 0xc000214000, {0xc000142300}, 0xc00021c420, 0x0, 0xc00021c360, 0xc00021c3c0}, 0xc000110180)
	/usr/local/go-faketime/src/net/http/transport.go:2338 +0x1ff
net/http.(*persistConn).readLoop(0xc000230000)
	/usr/local/go-faketime/src/net/http/transport.go:2133 +0x3b0
created by net/http.(*Transport).dialConn in goroutine 49
	/usr/local/go-faketime/src/net/http/transport.go:1776 +0x169f

goroutine 1 [select]:
net/http.(*persistConn).roundTrip(0xc00019a120, 0xc0001a2000)
	/usr/local/go-faketime/src/net/http/transport.go:2652 +0x979
net/http.(*Transport).roundTrip(0x8e13c0, 0xc00018a000)
	/usr/local/go-faketime/src/net/http/transport.go:604 +0x79a
net/http.(*Transport).RoundTrip(0xc00007ca08?, 0x740520?)
	/usr/local/go-faketime/src/net/http/roundtrip.go:17 +0x13
net/http.send(0xc00018a000, {0x740520, 0x8e13c0}, {0xc00007cc01?, 0x414da8?, 0x0?})
	/usr/local/go-faketime/src/net/http/client.go:260 +0x606
net/http.(*Client).send(0x8e6aa0, 0xc00018a000, {0x416328?, 0x42151c?, 0x0?})
	/usr/local/go-faketime/src/net/http/client.go:181 +0x98
net/http.(*Client).do(0x8e6aa0, 0xc00018a000)
	/usr/local/go-faketime/src/net/http/client.go:724 +0x912
net/http.(*Client).Do(...)
	/usr/local/go-faketime/src/net/http/client.go:590
main.main()
	/tmp/sandbox3108800449/prog.go:65 +0x45f

goroutine 6 [IO wait]:
internal/poll.runtime_pollWait(0x7ea93a63ae38, 0x72)
	/usr/local/go-faketime/src/runtime/netpoll.go:343 +0x85
internal/poll.(*pollDesc).wait(0xc000110000?, 0x4?, 0x0)
	/usr/local/go-faketime/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go-faketime/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc000110000)
	/usr/local/go-faketime/src/internal/poll/fd_unix.go:611 +0x2ac
net.(*netFD).accept(0xc000110000)
	/usr/local/go-faketime/src/net/fd_unix.go:172 +0x29
net.(*TCPListener).accept(0xc000128000)
	/usr/local/go-faketime/src/net/tcpsock_posix.go:152 +0x1e
net.(*TCPListener).Accept(0xc000128000)
	/usr/local/go-faketime/src/net/tcpsock.go:315 +0x30
net/http.(*Server).Serve(0xc000100000, {0x742080, 0xc000128000})
	/usr/local/go-faketime/src/net/http/server.go:3056 +0x364
net/http.(*Server).ListenAndServe(0xc000100000)
	/usr/local/go-faketime/src/net/http/server.go:2985 +0x71
net/http.ListenAndServe({0x6c51dd, 0x5}, {0x73fcc0?, 0xc00001c240})
	/usr/local/go-faketime/src/net/http/server.go:3239 +0x8f
created by main.main in goroutine 1
	/tmp/sandbox3108800449/prog.go:22 +0xb6

goroutine 38 [IO wait]:
internal/poll.runtime_pollWait(0x7ea93a63ab50, 0x72)
	/usr/local/go-faketime/src/runtime/netpoll.go:343 +0x85
internal/poll.(*pollDesc).wait(0xc0000ba000?, 0xc000102491?, 0x0)
	/usr/local/go-faketime/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go-faketime/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0000ba000, {0xc000102491, 0x1, 0x1})
	/usr/local/go-faketime/src/internal/poll/fd_unix.go:164 +0x27a
net.(*netFD).Read(0xc0000ba000, {0xc000102491?, 0xc000040780?, 0x40595b?})
	/usr/local/go-faketime/src/net/fd_posix.go:55 +0x25
net.(*conn).Read(0xc000042040, {0xc000102491?, 0xc000180000?, 0xc000076118?})
	/usr/local/go-faketime/src/net/net.go:179 +0x45
net/http.(*connReader).backgroundRead(0xc000102480)
	/usr/local/go-faketime/src/net/http/server.go:683 +0x37
created by net/http.(*connReader).startBackgroundRead in goroutine 9
	/usr/local/go-faketime/src/net/http/server.go:679 +0xba

goroutine 18 [IO wait]:
internal/poll.runtime_pollWait(0x7ea93a63ad40, 0x72)
	/usr/local/go-faketime/src/runtime/netpoll.go:343 +0x85
internal/poll.(*pollDesc).wait(0xc000110080?, 0x16?, 0x0)
	/usr/local/go-faketime/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go-faketime/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc000110080)
	/usr/local/go-faketime/src/internal/poll/fd_unix.go:611 +0x2ac
net.(*netFD).accept(0xc000110080)
	/usr/local/go-faketime/src/net/fd_unix.go:172 +0x29
net.(*TCPListener).accept(0xc000128060)
	/usr/local/go-faketime/src/net/tcpsock_posix.go:152 +0x1e
net.(*TCPListener).Accept(0xc000128060)
	/usr/local/go-faketime/src/net/tcpsock.go:315 +0x30
net/http.(*Server).Serve(0xc0001000f0, {0x742080, 0xc000128060})
	/usr/local/go-faketime/src/net/http/server.go:3056 +0x364
net/http.(*Server).ListenAndServe(0xc0001000f0)
	/usr/local/go-faketime/src/net/http/server.go:2985 +0x71
net/http.ListenAndServe({0x6c51e2, 0x5}, {0x0?, 0x0})
	/usr/local/go-faketime/src/net/http/server.go:3239 +0x8f
created by main.main in goroutine 1
	/tmp/sandbox3108800449/prog.go:56 +0x2de

goroutine 19 [IO wait]:
internal/poll.runtime_pollWait(0x7ea93a63ac48, 0x72)
	/usr/local/go-faketime/src/runtime/netpoll.go:343 +0x85
internal/poll.(*pollDesc).wait(0xc0001a4200?, 0xc00012e000?, 0x0)
	/usr/local/go-faketime/src/internal/poll/fd_poll_runtime.go:84 +0x27
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go-faketime/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0001a4200, {0xc00012e000, 0x1000, 0x1000})
	/usr/local/go-faketime/src/internal/poll/fd_unix.go:164 +0x27a
net.(*netFD).Read(0xc0001a4200, {0xc00012e000?, 0x40cd65?, 0xc0001023c0?})
	/usr/local/go-faketime/src/net/fd_posix.go:55 +0x25
net.(*conn).Read(0xc000108030, {0xc00012e000?, 0xc000044800?, 0x10?})
	/usr/local/go-faketime/src/net/net.go:179 +0x45
net/http.(*persistConn).Read(0xc00019a120, {0xc00012e000?, 0xc00007cc90?, 0x7ea8f3cc02a8?})
	/usr/local/go-faketime/src/net/http/transport.go:1954 +0x4a
bufio.(*Reader).fill(0xc0001140c0)
	/usr/local/go-faketime/src/bufio/bufio.go:113 +0x103
bufio.(*Reader).ReadSlice(0xc0001140c0, 0xd5?)
	/usr/local/go-faketime/src/bufio/bufio.go:379 +0x29
bufio.(*Reader).ReadLine(0xc0001140c0)
	/usr/local/go-faketime/src/bufio/bufio.go:408 +0x25
net/textproto.(*Reader).readLineSlice(0xc00007cc90)
	/usr/local/go-faketime/src/net/textproto/reader.go:56 +0x99
net/textproto.(*Reader).ReadLine(...)
	/usr/local/go-faketime/src/net/textproto/reader.go:39
net/http.ReadResponse(0xc0001140c0, 0xc00018a100)
	/usr/local/go-faketime/src/net/http/response.go:161 +0x97
net/http.(*persistConn).readResponse(0xc00019a120, {{}, 0xc00018a100, {0xc00018a000}, 0xc0001340c0, 0x1, 0xc000134000, 0xc000134060}, 0x0)
	/usr/local/go-faketime/src/net/http/transport.go:2311 +0xce
net/http.(*persistConn).readLoop(0xc00019a120)
	/usr/local/go-faketime/src/net/http/transport.go:2133 +0x3b0
created by net/http.(*Transport).dialConn in goroutine 33
	/usr/local/go-faketime/src/net/http/transport.go:1776 +0x169f

goroutine 20 [select]:
net/http.(*persistConn).writeLoop(0xc00019a120)
	/usr/local/go-faketime/src/net/http/transport.go:2421 +0xe5
created by net/http.(*Transport).dialConn in goroutine 33
	/usr/local/go-faketime/src/net/http/transport.go:1777 +0x16f1

goroutine 7 [runnable]:
net/textproto.MIMEHeader.Set(...)
	/usr/local/go-faketime/src/net/textproto/header.go:22
net/http.Header.Set(...)
	/usr/local/go-faketime/src/net/http/header.go:40
main.main.func4({0x741f00, 0xc000150000}, 0xc000110180?, {0xa?, 0xc00019963a?})
	/tmp/sandbox3108800449/prog.go:50 +0xc5
net/http/httputil.(*ReverseProxy).ServeHTTP(0xc00010c140, {0x741f00, 0xc000150000}, 0xc000142100)
	/usr/local/go-faketime/src/net/http/httputil/reverseproxy.go:475 +0x10c3
main.main.main.func2.func6({0x741f00, 0xc000150000}, 0x6252d9?)
	/tmp/sandbox3108800449/prog.go:33 +0xe5
net/http.HandlerFunc.ServeHTTP(0x10?, {0x741f00?, 0xc000150000?}, 0x7ea93a637b98?)
	/usr/local/go-faketime/src/net/http/server.go:2136 +0x29
net/http.(*ServeMux).ServeHTTP(0x40d0e5?, {0x741f00, 0xc000150000}, 0xc000142000)
	/usr/local/go-faketime/src/net/http/server.go:2514 +0x142
net/http.serverHandler.ServeHTTP({0x741038?}, {0x741f00?, 0xc000150000?}, 0x1?)
	/usr/local/go-faketime/src/net/http/server.go:2938 +0x8e
net/http.(*conn).serve(0xc0000bc000, {0x742490, 0xc000102360})
	/usr/local/go-faketime/src/net/http/server.go:2009 +0x5f4
created by net/http.(*Server).Serve in goroutine 18
	/usr/local/go-faketime/src/net/http/server.go:3086 +0x5cb

goroutine 27 [chan receive]:
main.main.main.func2.func6.1()
	/tmp/sandbox3108800449/prog.go:30 +0x19
net/http.(*persistConn).readResponse(0xc000330000, {{}, 0xc00032c000, {0xc000142300}, 0xc000134720, 0x0, 0xc000134660, 0xc0001346c0}, 0xc000110180)
	/usr/local/go-faketime/src/net/http/transport.go:2319 +0x144
net/http.(*persistConn).readLoop(0xc000330000)
	/usr/local/go-faketime/src/net/http/transport.go:2133 +0x3b0
created by net/http.(*Transport).dialConn in goroutine 65
	/usr/local/go-faketime/src/net/http/transport.go:1776 +0x169f

What did you expect to see?

Expected no errors.

This example has a custom implementation of ErrorHandler that writes to the response headers several times to better expose the data race. ReverseProxy's defaultErrorHandler writes to the response headers once, and there are also some header updates happening afterwards.

This seems to be an issue when RoundTrip retries requests. The first RoundTrip call blocks until the trace hooks have run, but subsequent RoundTrip calls do not.
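For context, here is a minimal hypothetical sketch of the kind of setup the playground link demonstrates: a ReverseProxy whose ErrorHandler writes headers repeatedly, behind a RoundTripper that retries on the same Transport. The names (retryTransport), ports, and loop count are illustrative, not the actual reproducer.

package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"strconv"
)

// retryTransport retries a failed request once on the same underlying
// Transport. The retry re-enters the persistConn readLoop, which may still
// be delivering trace callbacks from the first attempt.
type retryTransport struct {
	inner http.RoundTripper
}

func (t *retryTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	res, err := t.inner.RoundTrip(req)
	if err != nil {
		res, err = t.inner.RoundTrip(req)
	}
	return res, err
}

func main() {
	// Assumed backend that replies with 1xx informational responses.
	backend, _ := url.Parse("http://localhost:8081")
	proxy := httputil.NewSingleHostReverseProxy(backend)
	proxy.Transport = &retryTransport{inner: http.DefaultTransport}
	// The ErrorHandler writes many headers to make the race easier to observe.
	proxy.ErrorHandler = func(rw http.ResponseWriter, req *http.Request, err error) {
		for i := 0; i < 100; i++ {
			rw.Header().Set("X-Attempt-"+strconv.Itoa(i), "1")
		}
		rw.WriteHeader(http.StatusBadGateway)
	}
	http.ListenAndServe(":8080", proxy)
}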

@mauri870 mauri870 added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 16, 2024
@mauri870
Member

cc @neild

@mauri870
Member

@mariash Can you try running it with -race to see if there are any race conditions in the code? Thanks

@mariash
Author

mariash commented Jan 16, 2024

@mauri870 the error I am getting with -race is:

WARNING: DATA RACE
Write at 0x00c0001d2060 by goroutine 27:
  net/http.(*response).Header()
      /usr/local/go/src/net/http/server.go:1094 +0x11d
  net/http/httputil.(*ReverseProxy).ServeHTTP.func2()
      /usr/local/go/src/net/http/httputil/reverseproxy.go:459 +0x54
  net/http.(*persistConn).readResponse()
      /usr/local/go/src/net/http/transport.go:2338 +0x38f
  net/http.(*persistConn).readLoop()
      /usr/local/go/src/net/http/transport.go:2133 +0x59c
  net/http.(*Transport).dialConn.func5()
      /usr/local/go/src/net/http/transport.go:1776 +0x33

Previous write at 0x00c0001d2060 by goroutine 16:
  net/http.(*response).Header()
      /usr/local/go/src/net/http/server.go:1094 +0x11d
  main.main.func4()
      /home/pivotal/workspace/reverseproxy-example-or/server.go:50 +0x6d
  net/http/httputil.(*ReverseProxy).ServeHTTP()
      /usr/local/go/src/net/http/httputil/reverseproxy.go:475 +0x185d
  main.main.main.func2.func6()
      /home/pivotal/workspace/reverseproxy-example-or/server.go:33 +0x17b
  net/http.HandlerFunc.ServeHTTP()
      /usr/local/go/src/net/http/server.go:2136 +0x47
  net/http.(*ServeMux).ServeHTTP()
      /usr/local/go/src/net/http/server.go:2514 +0xbc
  net/http.serverHandler.ServeHTTP()
      /usr/local/go/src/net/http/server.go:2938 +0x2a1
  net/http.(*conn).serve()
      /usr/local/go/src/net/http/server.go:2009 +0xc24
  net/http.(*Server).Serve.func3()
      /usr/local/go/src/net/http/server.go:3086 +0x4f

Goroutine 27 (running) created at:
  net/http.(*Transport).dialConn()
      /usr/local/go/src/net/http/transport.go:1776 +0x25d9
  net/http.(*Transport).dialConnFor()
      /usr/local/go/src/net/http/transport.go:1467 +0x129
  net/http.(*Transport).queueForDial.func1()
      /usr/local/go/src/net/http/transport.go:1436 +0x44

Goroutine 16 (running) created at:
  net/http.(*Server).Serve()
      /usr/local/go/src/net/http/server.go:3086 +0x86c
  net/http.(*Server).ListenAndServe()
      /usr/local/go/src/net/http/server.go:2985 +0xbc
  net/http.ListenAndServe()
      /usr/local/go/src/net/http/server.go:3239 +0x124
  main.main.func8()
      /home/pivotal/workspace/reverseproxy-example-or/server.go:56 +0x39

@mariash
Author

mariash commented Jan 17, 2024

My understanding is that the first readResponse happens in the same goroutine, while subsequent readResponse calls happen in a separate goroutine, which eventually calls Got1xxResponse. ReverseProxy configures Got1xxResponse to write to the header map, and then the original goroutine modifies the headers as well.
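To make the mechanism concrete, here is a small standalone demonstration (not the ReverseProxy code itself) of why this is fatal: http.Header is a plain map[string][]string, so unsynchronized writes from the readLoop-side hook and the handler goroutine are a data race.

package main

import (
	"net/http"
	"sync"
)

func main() {
	h := http.Header{}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // stands in for the Got1xxResponse hook on the readLoop goroutine
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			h.Set("X-From-Hook", "1")
		}
	}()
	go func() { // stands in for the handler / ErrorHandler goroutine
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			h.Set("X-From-Handler", "1")
		}
	}()
	wg.Wait() // run with -race (or rely on unlucky timing) to see the failure
}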

@panjf2000
Member

panjf2000 commented Jan 17, 2024

My understanding is that the first readResponse happens in the same goroutine, while subsequent readResponse calls happen in a separate goroutine, which eventually calls Got1xxResponse. ReverseProxy configures Got1xxResponse to write to the header map, and then the original goroutine modifies the headers as well.

That's pretty much it. Nice catch! Would you like to send a CL to fix it? Contributions are always welcome! If it's not convenient for you, I can take it on.
@mariash

@panjf2000 panjf2000 added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jan 17, 2024
@panjf2000 panjf2000 added this to the Backlog milestone Jan 17, 2024
@gopherbot
Contributor

Change https://go.dev/cl/556037 mentions this issue: net/http/httputil: fix the contention of response header for ReverseProxy

@panjf2000 panjf2000 self-assigned this Jan 17, 2024
@panjf2000
Member

Now that I've taken a deeper look at the code, CL 556037 needs more work; I'm still working on it.

@mariash
Author

mariash commented Jan 17, 2024

Hi @panjf2000 thank you for looking into this.

Looking at CL 556037:
1. Access to the got1xxResponses variable might not be thread-safe.
2. It can be read before the Got1xxResponse hook is called, which would result in inconsistent behavior.

I feel there are a couple of ways to fix the issue. Ideally, ResponseWriter would be thread-safe, but it returns a header map directly, and changing this seems to be a complex change.

The second option would be for Transport to wait for the read response before returning from RoundTrip, e.g. making sure readLoop is done, which seems like what was fixed for writeLoop. This would be great too, since it would allow anyone to safely use the Got100Continue and Got1xxResponse hooks.

And the last option is to make ReverseProxy itself thread-safe, probably by adding a lock around access to the header map. But this would mean that custom hooks or an ErrorHandler can't use the same ResponseWriter safely.
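As a rough illustration of that last option (purely hypothetical, not what the standard library does), a mutex-guarded ResponseWriter wrapper could look like the sketch below. Note that Header() still hands out the underlying map, which is exactly why independent hooks or ErrorHandlers sharing the raw writer remain unsafe.

package proxy // hypothetical package, for illustration only

import (
	"net/http"
	"sync"
)

// lockedResponseWriter serializes calls to the underlying ResponseWriter.
type lockedResponseWriter struct {
	mu sync.Mutex
	rw http.ResponseWriter
}

func (l *lockedResponseWriter) Header() http.Header {
	l.mu.Lock()
	defer l.mu.Unlock()
	// The returned map can still be mutated by callers without the lock,
	// which is the limitation described above.
	return l.rw.Header()
}

func (l *lockedResponseWriter) Write(p []byte) (int, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.rw.Write(p)
}

func (l *lockedResponseWriter) WriteHeader(code int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.rw.WriteHeader(code)
}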

@panjf2000
Member

panjf2000 commented Jan 18, 2024

@mariash Thank you for your analysis and suggestion.

Option 2 does sound like a better idea; I'll try that in CL 556037 and add some tests for it.

Looking at CL 556037:
1. Access to the got1xxResponses variable might not be thread-safe.
2. It can be read before the Got1xxResponse hook is called, which would result in inconsistent behavior.

One more thing though: I don't quite get your point about this. Given that got1xxResponses would only be accessed by Got1xxResponse in the readLoop goroutine during RoundTrip, how can it not be thread-safe, or be read before Got1xxResponse?

@panjf2000 panjf2000 modified the milestones: Backlog, Go1.23 Jan 18, 2024
@panjf2000
Member

panjf2000 commented Jan 18, 2024

After a deeper look at your code, it seems that the root cause of the data race is the RetryTransport you implemented; there is no contention if it is removed and the ReverseProxy uses the default transport. Besides, according to the RoundTripper documentation, you shouldn't reuse the request body, because it will be closed after RoundTrip(). Why did you need the RetryTransport anyway?

This issue now seems a little like a false alarm to me.

@mariash

@mariash
Author

mariash commented Jan 18, 2024

@panjf2000 thank you again for your response! To answer your questions:

One more thing though: I don't quite get your point about this. Given that got1xxResponses would only be accessed by Got1xxResponse in the readLoop goroutine during RoundTrip, how can it not be thread-safe, or be read before Got1xxResponse?

got1xxResponses is written in the readLoop goroutine, which is started when RoundTrip is called. Since the RoundTrip method does not wait for readLoop to finish before returning, got1xxResponses is read on line 481 of your change, which causes two goroutines to write and read it at the same time.

After a deeper look at your code, it seems that the root cause of the data race is the RetryTransport you implemented; there is no contention if it is removed and the ReverseProxy uses the default transport. Besides, according to the RoundTripper documentation, you shouldn't reuse the request body, because it will be closed after RoundTrip(). Why did you need the RetryTransport anyway?

Here the doc states:

Transports should be reused instead of created as needed. Transports are safe for concurrent use by multiple goroutines.

My understanding is that you can call RoundTrip on the same transport multiple times. This is why I titled this issue "on subsequent RoundTripper usages".

Anyway, since the issue is that readLoop is still running, you don't need to reuse the RoundTripper to get the error. Here I have an example that uses the default Transport and causes RoundTrip to return earlier than readLoop by setting ResponseHeaderTimeout. It is harder to hit the data race in this case; you need to run the code with the -race flag, which you can't do in the Go playground. After running it a couple of times I get the same data race:

WARNING: DATA RACE
Write at 0x00c000290120 by goroutine 14:
  runtime.mapassign_faststr()
      /usr/local/go/src/runtime/map_faststr.go:203 +0x0
  net/textproto.MIMEHeader.Set()
      /usr/local/go/src/net/textproto/header.go:22 +0xde
  net/http.Header.Set()
      /usr/local/go/src/net/http/header.go:40 +0x67
  main.main.func3()
      /home/pivotal/workspace/reverseproxy-example/server.go:44 +0x51
  net/http/httputil.(*ReverseProxy).ServeHTTP()
      /usr/local/go/src/net/http/httputil/reverseproxy.go:477 +0x185d
  main.main.main.func2.func5()
      /home/pivotal/workspace/reverseproxy-example/server.go:29 +0x10b
  net/http.HandlerFunc.ServeHTTP()
      /usr/local/go/src/net/http/server.go:2136 +0x47
  net/http.(*ServeMux).ServeHTTP()
      /usr/local/go/src/net/http/server.go:2514 +0xbc
  net/http.serverHandler.ServeHTTP()
      /usr/local/go/src/net/http/server.go:2938 +0x2a1
  net/http.(*conn).serve()
      /usr/local/go/src/net/http/server.go:2009 +0xc24
  net/http.(*Server).Serve.func3()
      /usr/local/go/src/net/http/server.go:3086 +0x4f

Previous read at 0x00c000290120 by goroutine 21:
  runtime.mapiternext()
      /usr/local/go/src/runtime/map.go:867 +0x0
  net/http/httputil.(*ReverseProxy).ServeHTTP.func2()
      /usr/local/go/src/net/http/httputil/reverseproxy.go:465 +0x104
  net/http.(*persistConn).readResponse()
      /usr/local/go/src/net/http/transport.go:2387 +0x4b0
  net/http.(*persistConn).readLoop()
      /usr/local/go/src/net/http/transport.go:2165 +0x6a5
  net/http.(*Transport).dialConn.func5()
      /usr/local/go/src/net/http/transport.go:1793 +0x33

So the problem is that transport.go leaves the readLoop goroutine running after RoundTrip has returned in some error cases:

  1. When RoundTrip is called, the readLoop goroutine is started, which calls hooks such as Got100Continue and Got1xxResponse as part of readResponse.
  2. RoundTrip does not wait for readLoop to finish. There was a fix that makes it wait for writeLoop, but not for readLoop. readLoop waits for RoundTrip to finish (via the callerGone channel), but not the other way around.
  3. As a result, readLoop can still be running after RoundTrip has returned.
  4. In the case of ReverseProxy, that readLoop modifies the headers in Got1xxResponse after RoundTrip has finished.

In the majority of cases readLoop executes quickly enough. But we sometimes see these concurrent map writes, and they result in fatal "concurrent map writes" errors that cannot be recovered from, which crashes the whole server instead of failing just one connection.

I experimented with transport.go and had a change similar to this one: waiting for readResponse to finish if it is still running. It seems to fix the race condition for ReverseProxy.
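For reference, here is a hypothetical sketch of the ResponseHeaderTimeout reproducer described above (ports, timings, and handler details are illustrative, not the original code): the backend sends a 1xx response and then stalls for about as long as ResponseHeaderTimeout, so RoundTrip returns on the timeout branch while the readLoop goroutine may still be delivering Got1xxResponse, racing with the ErrorHandler's header writes.

package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// Backend: send a 103 (triggers Got1xxResponse on the client side), then
	// stall for roughly the proxy's ResponseHeaderTimeout before the final status.
	// Writing 1xx statuses via WriteHeader requires Go 1.19+.
	go http.ListenAndServe(":8081", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusEarlyHints)
		time.Sleep(100 * time.Millisecond)
		w.WriteHeader(http.StatusOK)
	}))

	target, _ := url.Parse("http://localhost:8081")
	proxy := httputil.NewSingleHostReverseProxy(target)
	proxy.Transport = &http.Transport{ResponseHeaderTimeout: 100 * time.Millisecond}
	proxy.ErrorHandler = func(rw http.ResponseWriter, req *http.Request, err error) {
		rw.Header().Set("X-Proxy-Error", err.Error()) // may race with the 1xx hook
		rw.WriteHeader(http.StatusBadGateway)
	}
	go http.ListenAndServe(":8080", proxy)

	// Hammer the proxy; build and run with -race to surface the report.
	for i := 0; i < 100; i++ {
		http.Get("http://localhost:8080/")
	}
}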

@panjf2000
Member

panjf2000 commented Jan 19, 2024

got1xxResponses is written in the readLoop goroutine, which is started when RoundTrip is called. Since the RoundTrip method does not wait for readLoop to finish before returning, got1xxResponses is read on line 481 of your change, which causes two goroutines to write and read it at the same time.

If I understand the code correctly, even though the default RoundTrip returns without waiting for readLoop to exit, it does wait for persistConn.readResponse, in which Got1xxResponse is called to modify the headers. After RoundTrip returns, it's safe to update headers in the current goroutine while the readLoop goroutine is running, because it will not go into persistConn.readResponse again; readLoop will be blocked at:

rc := <-pc.reqch

unless RoundTrip is called again, which is what you did in your previous reproducer with RetryTransport. In that case, we don't even need CL 556037 to introduce got1xxResponses; the current implementation is already goroutine-safe.

My understanding is that you can call RoundTrip on the same transport multiple times. This is why I titled this issue "on subsequent RoundTripper usages".

You certainly can reuse RoundTrip; it's designed that way. What I was saying is that you shouldn't reuse it with the same request, because the request body will be closed after RoundTrip returns. As the doc states:

// RoundTrip should not modify the request, except for
// consuming and closing the Request's Body. RoundTrip may
// read fields of the request in a separate goroutine. Callers
// should not mutate or reuse the request until the Response's
// Body has been closed.

When you did so, it resumed the readLoop and it went into persistConn.readResponse again:

go/src/net/http/transport.go, lines 2658 to 2665 at 66d34c7:

pc.reqch <- requestAndChan{
	req:        req.Request,
	cancelKey:  req.cancelKey,
	ch:         resc,
	addedGzip:  requestedGzip,
	continueCh: continueCh,
	callerGone: gone,
}

As for your new reproducer, it does cause a data race, but in a completely different way: you prevent RoundTrip from waiting for persistConn.readResponse by setting http.Transport.ResponseHeaderTimeout to a non-zero value and stalling the backend server for the same amount of time before it responds, which causes RoundTrip to fall into the case <-respHeaderTimer branch instead of case re := <-resc and to return early before persistConn.readResponse is done:

go/src/net/http/transport.go, lines 2699 to 2715 at 66d34c7:

case <-respHeaderTimer:
	if debugRoundTrip {
		req.logf("timeout waiting for response headers.")
	}
	pc.close(errTimeout)
	return nil, errTimeout
case re := <-resc:
	if (re.res == nil) == (re.err == nil) {
		panic(fmt.Sprintf("internal error: exactly one of res or err should be set; nil=%v", re.res == nil))
	}
	if debugRoundTrip {
		req.logf("resc recv: %p, %T/%#v", re.res, re.err, re.err)
	}
	if re.err != nil {
		return nil, pc.mapRoundTripError(req, startBytesWritten, re.err)
	}
	return re.res, nil

I have to say that is one tricky case, and I'm not sure whether it should be counted as an issue and fixed, because http.Transport.ResponseHeaderTimeout seems to have been created for exactly this kind of scenario (returning without waiting for persistConn.readResponse), so the data race here may be expected. I'm going to need more opinions from @neild and @bcmills.

I experimented with transport.go and had a change similar to this one: waiting for readResponse to finish if it is still running. It seems to fix the race condition for ReverseProxy.

As I said above, we need to determine whether or not this "issue" should be fixed. If we decide it eventually needs a fix and you're interested in contributing one, please let me know and I'd be happy to close my CL and review yours; new contributors are always welcome.

Thanks again for all the effort on this! @mariash


mariash added a commit to cloudfoundry/gorouter that referenced this issue Jan 22, 2024
See issue: golang/go#65123

ReverseProxy configures Got1xxResponse trace hook. We configure
ReverseProxy with our ProxyRoundTripper. ProxyRoundTripper eventually
calls http.Transport.

http.Transport runs readLoop for each connection in a separate
goroutine. When RoundTrip is called, readLoop will run the Got1xxResponse
hook.

If there are no errors during request handling, RoundTrip waits for
readLoop to finish. If there is an error, though, RoundTrip exits early
and does not wait for readLoop to finish. This results in the readLoop
goroutine running in parallel, and we get a data race in our ErrorHandler,
which modifies response headers at the same time as the Got1xxResponse hook.

This error results in concurrent map writes rather than a panic, so it is
not caught by the panic handler, making Gorouter fail and drop all
connections.

This code can be removed once the ReverseProxy issue is fixed.
dsabeti pushed a commit to cloudfoundry/gorouter that referenced this issue Jan 23, 2024
@gopherbot
Contributor

Change https://go.dev/cl/567216 mentions this issue: net/http/httputil: avoid ReverseProxy data race on 1xx response and error

@neild
Contributor

neild commented Feb 26, 2024

Thanks for all the nice analysis!

I think this is ultimately a fairly straightforward ReverseProxy bug. (Although replicating it is certainly not straightforward!) ReverseProxy uses a httptrace.ClientTrace.Got1xxResponse hook to process 1xx responses. ClientTrace hooks are documented as not being synchronized with RoundTrip; they can execute in a separate goroutine and may be called after RoundTrip has returned. ReverseProxy doesn't account for that possibility.

https://go.dev/cl/567216 adds some synchronization to ReverseProxy to handle hooks executing after the request has completed. (Which should only be possible in the case of a request failing, for one reason or another.)

@AlexanderYastrebov
Contributor

https://go.dev/cl/567216 adds some synchronization to ReverseProxy to handle hooks executing after the request has completed. (Which should only be possible in the case of a request failing, for one reason or another.)

I am wondering whether the lock should also cover p.getErrorHandler() below:

res, err := transport.RoundTrip(outreq)
roundTripMutex.Lock()
roundTripDone = true
roundTripMutex.Unlock()
if err != nil {
	p.getErrorHandler()(rw, outreq, err)
	return
}

defaultErrorHandler writes a 502 status:
func (p *ReverseProxy) defaultErrorHandler(rw http.ResponseWriter, req *http.Request, err error) {
	p.logf("http: proxy error: %v", err)
	rw.WriteHeader(http.StatusBadGateway)
}

func (p *ReverseProxy) getErrorHandler() func(http.ResponseWriter, *http.Request, error) {
	if p.ErrorHandler != nil {
		return p.ErrorHandler
	}
	return p.defaultErrorHandler
}

which should fail if the Got1xxResponse hook managed to run rw.WriteHeader(code):
roundTripMutex.Lock()
defer roundTripMutex.Unlock()
if roundTripDone {
	// If RoundTrip has returned, don't try to further modify
	// the ResponseWriter's header map.
	return nil
}
h := rw.Header()
copyHeader(h, http.Header(header))
rw.WriteHeader(code)

@mariash
Author

mariash commented Mar 15, 2024

@neild thank you for your fix. We added a similar workaround in our code, so now we can remove it.

@neild
Contributor

neild commented Mar 18, 2024

@AlexanderYastrebov

I am wondering whether the lock should also cover p.getErrorHandler() below:

I don't think so, but perhaps I'm missing something?

ResponseWriter allows writing any number of 1xx statuses followed by at most one 2xx-5xx status, so it's fine if the Got1xxResponse hook has called WriteHeader prior to the error handler executing. The mutex and roundTripDone guard prevent a 1xx header from being written concurrently with an error.
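As a small illustration of that rule (assuming Go 1.19+ support for writing informational responses via WriteHeader), a handler can emit several 1xx statuses and then a single final status:

package main

import "net/http"

func handler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Link", "</style.css>; rel=preload; as=style")
	w.WriteHeader(http.StatusEarlyHints) // interim 103; may be repeated
	w.WriteHeader(http.StatusEarlyHints)
	w.WriteHeader(http.StatusOK) // at most one final 2xx-5xx status
}

func main() { http.ListenAndServe(":8080", http.HandlerFunc(handler)) }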
