net/http/httputil: concurrent map writes in ReverseProxy on subsequent RoundTripper usages #65123
Comments
cc @neild
@mariash Can you try running it with -race to see if there are any race conditions in the code? Thanks
@mauri870 the error I am getting with
My understanding is that the first ReadResponse happens in the same goroutine, while subsequent ReadResponse calls happen in a separate goroutine, which eventually calls Got1xxResponse. ReverseProxy configures Got1xxResponse to write to the Header map, and then the original goroutine modifies the headers as well.
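To make the analysis above concrete, here is a simplified, hedged sketch of how the two writers end up sharing one map. This is not the actual net/http/httputil source; the package name, the function proxyLike, and the X-Error header are illustrative. The Got1xxResponse hook runs on the Transport's readLoop goroutine, while the error path runs on the original goroutine.

```go
package sketch

import (
	"net/http"
	"net/http/httptrace"
	"net/textproto"
)

// proxyLike sketches, in simplified form, the shape of ReverseProxy's 1xx
// handling: a Got1xxResponse hook that copies headers into rw.Header().
func proxyLike(rw http.ResponseWriter, req *http.Request, rt http.RoundTripper) {
	trace := &httptrace.ClientTrace{
		Got1xxResponse: func(code int, header textproto.MIMEHeader) error {
			h := rw.Header()
			for k, vv := range header {
				h[k] = vv // write #1: runs on the Transport's readLoop goroutine
			}
			return nil
		},
	}
	outreq := req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := rt.RoundTrip(outreq)
	if err != nil {
		// write #2: original goroutine. If RoundTrip returned before the
		// readLoop finished, both goroutines can touch the same map.
		rw.Header().Set("X-Error", err.Error())
		rw.WriteHeader(http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	// ... copy resp headers and body to rw as a real proxy would.
}
```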
That's pretty much it. Nice catch! Would you like to send a CL to fix it? Contributions are always welcome! If it's not convenient for you, I can take it on.
Change https://go.dev/cl/556037 mentions this issue:
Now that I've taken a deeper look at the code, CL 556037 needs more work; still working on it.
Hi @panjf2000, thank you for looking into this. Looking at CL 556037: I feel like there could be a couple of ways to fix the issue. Ideally, ResponseWriter would be thread-safe, but it returns the Header map directly and changing this seems to be a complex change. A second option would be for Transport to wait for the response to be read before returning from RoundTrip, e.g. making sure readLoop is done, which seems like what was fixed for writeLoop. This would be great too, since it would allow anyone to safely use the Got100Continue and Got1xxResponse hooks. The last option is to make ReverseProxy itself thread-safe, probably by adding a lock around accesses to the Header map, but this would mean that custom hooks or an ErrorHandler can't use the same ResponseWriter safely.
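For reference, a minimal sketch of what that third option could look like, with illustrative names only (lockedHooks, X-Proxy-Error); this is not the actual ReverseProxy code or a proposed patch. The proxy's own 1xx hook and its error path share one mutex so they never mutate the ResponseWriter's header map concurrently. The drawback noted above remains: any user-supplied hook or ErrorHandler touching the same ResponseWriter outside this lock is still unsynchronized.

```go
package sketch

import (
	"net/http"
	"net/http/httptrace"
	"net/textproto"
	"sync"
)

// lockedHooks returns a ClientTrace and an ErrorHandler that serialize their
// writes to rw's header map through one shared mutex (option 3 above).
func lockedHooks(rw http.ResponseWriter) (*httptrace.ClientTrace, func(http.ResponseWriter, *http.Request, error)) {
	var mu sync.Mutex

	trace := &httptrace.ClientTrace{
		Got1xxResponse: func(code int, header textproto.MIMEHeader) error {
			mu.Lock()
			defer mu.Unlock()
			h := rw.Header()
			for k, vv := range header {
				h[k] = vv
			}
			rw.WriteHeader(code) // forward the informational response downstream
			return nil
		},
	}

	errorHandler := func(w http.ResponseWriter, req *http.Request, err error) {
		mu.Lock()
		defer mu.Unlock()
		w.Header().Set("X-Proxy-Error", err.Error())
		w.WriteHeader(http.StatusBadGateway)
	}

	return trace, errorHandler
}
```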
@mariash Thank you for your analysis and suggestion. Option 2 does sound like a better idea, I'll try that in CL 556037 and add some tests for it.
One more thing though, I don't quite get your point about this, given that |
After a deeper look at your code, it seems that the root cause of the data race is the … This issue now seems a little like a false alarm to me.
@panjf2000 thank you again for your response! To answer your questions:
got1xxResponses is written in the readLoop goroutine, which is started when RoundTrip is called. Since the RoundTrip method does not wait for readLoop to finish before returning, got1xxResponses gets read on line 481 of your change, which causes two goroutines to write/read it at the same time.
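A stripped-down illustration of that pattern (this is not the CL's code, just a generic example with made-up names): a goroutine started inside a call keeps writing to shared state after the call has returned, so the caller's read races with it and go run -race reports it.

```go
package main

import (
	"fmt"
	"time"
)

// startWork returns before the goroutine it spawned is done writing,
// analogous to RoundTrip returning while readLoop is still running.
func startWork() *[]int {
	results := &[]int{}
	go func() {
		*results = append(*results, 1) // write on the spawned goroutine
	}()
	return results
}

func main() {
	r := startWork()
	time.Sleep(time.Millisecond) // Sleep does not synchronize; the race remains
	fmt.Println(len(*r))         // read on the caller's goroutine: data race
}
```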
Here the doc states:
My understanding is that you can call RoundTrip on the same transport multiple times. This is why I titled this issue "on subsequent RoundTripper usages". Anyway, since the issue is that readLoop is still running, you don't need to reuse the RoundTripper to get the error. Here I have an example that uses the default Transport and causes RoundTrip to return earlier than the readLoop finishes by setting ResponseHeaderTimeout. It is harder to get the data race in this case. You need to run this code with
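For illustration, here is a rough sketch of that kind of reproducer. This is not the linked playground program: the backend delay, timeout values, and header names are made up, and the timings usually need tuning (and a run under the race detector) before the race actually shows up.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// Backend: stall past the proxy's ResponseHeaderTimeout, then emit an
	// informational 103 so the readLoop fires Got1xxResponse late.
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(20 * time.Millisecond)
		w.WriteHeader(http.StatusEarlyHints) // 1xx informational response
		w.WriteHeader(http.StatusOK)
	}))
	defer backend.Close()

	target, err := url.Parse(backend.URL)
	if err != nil {
		panic(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	proxy.Transport = &http.Transport{ResponseHeaderTimeout: 15 * time.Millisecond}
	proxy.ErrorHandler = func(rw http.ResponseWriter, req *http.Request, err error) {
		// Runs on the original goroutine after RoundTrip's timeout error,
		// potentially while the readLoop is still delivering the 1xx.
		for i := 0; i < 1000; i++ {
			rw.Header().Set("X-Error", err.Error())
		}
		rw.WriteHeader(http.StatusBadGateway)
	}

	front := httptest.NewServer(proxy)
	defer front.Close()

	for i := 0; i < 200; i++ {
		if resp, err := http.Get(front.URL); err == nil {
			resp.Body.Close()
		}
	}
}
```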
So the problem is that transport.go leaves the readLoop goroutine running after RoundTrip has returned in some error cases:
In the majority of cases readLoop executes fast enough. But we are sometimes seeing these concurrent map writes, which result in fatal concurrent-map-write errors that cannot be recovered from and crash the whole server instead of failing just one connection. I experimented with transport.go and had a change similar to this one, waiting for readResponse to finish if it is running. It seems to fix the race condition for ReverseProxy.
If I understand the code correctly, even though the default … (see line 2150 in 66d34c7) … unless RoundTrip is called again, which is what you did in your previous reproducer with RetryTransport. In that case, we don't even need CL 556037 to introduce got1xxResponses; the current implementation is already goroutine-safe.
You certainly can reuse
When you did so, it resumed the readLoop and it went into the code at lines 2658 to 2665 in 66d34c7.
As for your new reproducer, it does cause a data race, but in a completely different way: you manage to prevent … (lines 2699 to 2715 in 66d34c7).
I have got to say that is one tricky case, and I'm not sure whether it should be counted as an issue and be fixed, because http.Transport.ResponseHeaderTimeout seems to have been created for this kind of scenario (returning without waiting for persistConn.readResponse) originally (or intentionally), and the data race here is expected. I'm going to need more opinions from @neild and @bcmills.
As I said above, we need to determine whether or not this "issue" should be fixed. If we reckon that it needs a fix eventually and you're interested in making a contribution for it, please let me know and I'd be happy to close my CL and review yours; new contributors are always welcome. Thanks again for all the effort on this! @mariash
See issue: golang/go#65123. ReverseProxy configures the Got1xxResponse trace hook. We configure ReverseProxy with our ProxyRoundTripper, which eventually calls http.Transport. http.Transport runs readLoop for each connection in a separate goroutine. When RoundTrip is called, readLoop will run the Got1xxResponse hook. If there are no errors during request handling, RoundTrip waits for readLoop to finish; if there is an error, though, RoundTrip exits early and does not wait. This leaves the readLoop goroutine running in parallel, and we get a data race in our ErrorHandler, which modifies response headers at the same time as the Got1xxResponse hook. This results in concurrent map writes rather than a panic, which is not caught by the panic handler, making Gorouter fail and drop all connections. This code can be removed once the ReverseProxy issue is fixed.
Change https://go.dev/cl/567216 mentions this issue:
Thanks for all the nice analysis! I think this is ultimately a fairly straightforward … https://go.dev/cl/567216 adds some synchronization to …
I am wondering whether the lock should also cover go/src/net/http/httputil/reverseproxy.go lines 481 to 488 (in a46ecdc). defaultErrorHandler writes a 502 status (go/src/net/http/httputil/reverseproxy.go lines 306 to 316 in a46ecdc), which should fail if the Got1xxResponse hook managed to run rw.WriteHeader(code) (go/src/net/http/httputil/reverseproxy.go lines 463 to 472 in a46ecdc).
@neild thank you for your fix. We added a similar workaround in our code, so now we can remove it.
I don't think so, but perhaps I'm missing something?
Go version
go version go1.21.6 linux/amd64
Output of go env in your module/workspace:
What did you do?
https://go.dev/play/p/kxnqMuLCo-2
What did you see happen?
Sometimes getting an error:
What did you expect to see?
Expected no errors.
This example has a custom implementation of ErrorHandler that writes to the response headers a bunch of times to better expose the data race. ReverseProxy's defaultErrorHandler writes to the response headers once, and there are also some header updates happening afterwards.
This seems to be an issue when RoundTrip retries requests. The first RoundTrip call does block until trace contexts are run, but not the subsequent RoundTrip calls.
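The shape of such a retrying RoundTripper, sketched with illustrative names (the original reproducer's RetryTransport is in the playground link above; the retry policy here is made up): calling RoundTrip a second time on the same Transport can overlap with the previous attempt's readLoop.

```go
package sketch

import "net/http"

// RetryTransport retries a failed round trip once. This is only a sketch:
// it assumes a body-less request (e.g. GET); real code must handle request
// bodies, idempotency, and backoff.
type RetryTransport struct {
	Base http.RoundTripper
}

func (t *RetryTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.Base.RoundTrip(req)
	if err == nil {
		return resp, nil
	}
	// The first attempt's readLoop may still be running (and may still fire
	// Got1xxResponse) while this second attempt proceeds.
	return t.Base.RoundTrip(req)
}
```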