Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create memhttp package to debug flaky testcases #594

Merged
merged 23 commits into from
Nov 1, 2023
Merged

Conversation

emcfarlane
Copy link
Contributor

@emcfarlane emcfarlane commented Sep 18, 2023

Creates two internal packages: memhttp and memhttptest based on https://github.com/akshayjshah/memhttp . We use this for testing connect's services.

Changeshttptest.Server to use an in memory network. This makes testing more robust to ephemeral port issues under load. Two flaky test cases were now easily reproducible and fixed. Both occurred from races between starting the network request in a go routine and checking for a http2 stream and erroring if not. This can error on Send or Receive depending on how fast the request can complete.

On duplexHTTPCall I removed SetError to avoid overwriting the original response error. Now BlockUntilResponseReady returns an error from the response initialization.

See: golang/go#14200

Fixes:

  • TestServer/http1/*/*/cumsum_cancel_before_send: send can error on write message.
  • TestBidiRequiresHTTP2: send can io.EOF error on write, or succeed.
  • TestBidiOverHTTP1: same as above.
  • TestClientPeer/grpcweb: stream request sends an invalidate payload invoking server error, races with interceptor.

Two places where errors raced:

Changes httptest.Server to use an in memory network. This makes testing
more robust to ephemeral port issues under load. Two flaky test cases
were now easily reproducible and fixed. Both occured from races between
starting the network request in a go routine and checking for a http2
stream and erroring if not. This can error on Send or Receive depending
on how fast the request can complete.

See: golang/go#14200
@emcfarlane
Copy link
Contributor Author

Running many iterations locally to find more, need to fix another one:

--- FAIL: TestClientPeer (0.00s)
    --- FAIL: TestClientPeer/grpcweb (0.01s)
        client_ext_test.go:111:
            assertion:  assert.Nil
            got:        unknown: io: read/write on closed pipe

Copy link
Member

@jhump jhump left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good so far to me. Are you planning to fix that other flake you found in this PR? Or do you want this PR approved now so you can merge it and fix the other issue in a subsequent PR?

connect_ext_test.go Outdated Show resolved Hide resolved
@emcfarlane
Copy link
Contributor Author

@jhump had to step out but hopefully can debug it now. Not sure what is calling close twice.

@emcfarlane
Copy link
Contributor Author

I left it as a TODO. I think it's related to SetError but not sure how it rises. I have other branches to rejig the duplex_http_call which should make this easier to debug. Will fix later.

@emcfarlane
Copy link
Contributor Author

Benchmarking increase in allocs, making the net.Pipe is ~8 and theres some more locking. Maybe better to use local network for bench.

- BenchmarkConnect/unary-8                    5433           1981280 ns/op         5651212 B/op        229 allocs/op
+ BenchmarkConnect/unary-8                    5440           1980366 ns/op         5729820 B/op        237 allocs/op

@emcfarlane emcfarlane requested a review from jhump September 18, 2023 16:32
@emcfarlane emcfarlane self-assigned this Sep 18, 2023
jhump
jhump previously approved these changes Sep 18, 2023
Copy link
Member

@jhump jhump left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better to use local network for bench.

Yeah, I think so, too. The benchmark seems less useful if it's intentionally different from real world conditions 🤷

@emcfarlane emcfarlane changed the title Debug flaky testcases with in memory network Create memhttp package to debug flaky testcases Oct 10, 2023
internal/memhttp/memhttptest/http.go Outdated Show resolved Hide resolved
.golangci.yml Outdated Show resolved Hide resolved
internal/memhttp/listener.go Outdated Show resolved Hide resolved
internal/memhttp/listener.go Outdated Show resolved Hide resolved
internal/memhttp/listener.go Outdated Show resolved Hide resolved
internal/memhttp/memhttp.go Outdated Show resolved Hide resolved
internal/memhttp/memhttp.go Outdated Show resolved Hide resolved
internal/memhttp/memhttp.go Outdated Show resolved Hide resolved
compression_test.go Outdated Show resolved Hide resolved
.golangci.yml Outdated Show resolved Hide resolved
internal/memhttp/option.go Outdated Show resolved Hide resolved
internal/memhttp/option.go Outdated Show resolved Hide resolved
@jhump jhump dismissed their stale review October 10, 2023 22:54

Direction has changed a little since original review

@emcfarlane emcfarlane requested a review from jhump October 11, 2023 12:15
@emcfarlane
Copy link
Contributor Author

emcfarlane commented Oct 13, 2023

=== CONT  TestBidiRequiresHTTP2
    connect_ext_test.go:786:
        assertion:      assert.Equal
        got:    unknown
        want:   unimplemented

Test is still flaky 😞, need to relook into this.

Two errors:

Removes SetError in favour of reporting errors on BlockUntilResponseReady.
ensureRequestMade removes sync.Once as the contract of ClientStream state
calling Send with CloseRequest is not safe to call concurrently.
@emcfarlane emcfarlane requested a review from jhump October 22, 2023 19:58
@emcfarlane
Copy link
Contributor Author

emcfarlane commented Oct 22, 2023

@jhump behaviour in duplexHTTPCall has changed in the last 2 commits to address the flaky tests. I've removed SetError to avoid subsequent errors overwriting the original cause. Now BlockUntilResponseReady reports the error and Send is closed with io.EOF. ensureMakeRequest is no longer concurrent safe which is consistent with the interface contract for ClientStream that states Send or CloseRequest will race. Please re-review!

duplex_http_call.go Outdated Show resolved Hide resolved
duplex_http_call.go Outdated Show resolved Hide resolved
connect_ext_test.go Show resolved Hide resolved
connect_ext_test.go Outdated Show resolved Hide resolved
duplex_http_call.go Outdated Show resolved Hide resolved
duplex_http_call.go Outdated Show resolved Hide resolved
protocol_connect.go Show resolved Hide resolved
@@ -106,10 +106,8 @@ func TestClientPeer(t *testing.T) {
err = clientStream.Send(&pingv1.SumRequest{})
assert.Nil(t, err)
// server streaming
serverStream, err := client.CountUp(ctx, connect.NewRequest(&pingv1.CountUpRequest{}))
serverStream, err := client.CountUp(ctx, connect.NewRequest(&pingv1.CountUpRequest{Number: 1}))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change relevant? Why would the request data make this flaky or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It causes the server to error with invalid_argument: number must be positive: got 0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. But was that not true before? Why wasn't this a problem before this PR? (I'm trying to understand why change it now.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is checking in the interceptor that the peer address and peer protocol is set. It's testing each request style by sending an empty but valid request, however server stream is the only one that will cause the server handler to error. This can now report the server error instead of success. I don't think this is intended behaviour of the test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not setting Number will cause the following only for grpcweb:

--- FAIL: TestClientPeer (0.00s)
    --- FAIL: TestClientPeer/grpcweb (0.07s)
        client_ext_test.go:112:
            assertion:  assert.Nil
            got:        unknown: http2: response body closed

The grpcweb error response is a header only response, need to investigate why this error is returned on a no body.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now flaky with:

            client_ext_test.go:113:
                assertion:      assert.Nil
                got:    unknown: io: read/write on closed pipe

Discard is reading from a closed response which is causing an io.ErrClosedPipe. This is probably due to the new memhttp. Need to investigate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now flaky with:

By "now", do you mean after the last commit you pushed, which removed calls to response.Body.Close()?

Also, in that latest commit, who is the caller that is now responsible for calling Close? Is this done when StreamClientConn.CloseResponse() is called? Is there a chance some code path is failing to call that? (In particular, I'm suspicious why this would only happen for gRPC-Web and not others.)

Discard is reading from a closed response which is causing an io.ErrClosedPipe.

It seems like discard should be resilient against io.ErrClosedPipe and replace it with io.EOF when observed, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's due to the change in closing the pipe writer on Reader instead of Writer. net.Pipe will close both halves but the error is given on the opposite end. So closing the Reader made the Writer return nice io.EOF errors but the Reader to receive io.ErrClosedPipe. HTTP2 libraries set the request error to the response here

I'll need to write more testing around closing, and create an issue for testing with different latencies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HTTP2 sets the error on the response then here

Copy link
Contributor Author

@emcfarlane emcfarlane Oct 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discard should never see io.ErrClosedPipe, however I've added context error checks to avoid issues with cancelation errors bubbling up to CloseResponse() errors. These are tested in TestServer tests.

Working and passing with -count=100 locally, can't see any more flaky test issues. Could you please re-review.

duplex_http_call.go Show resolved Hide resolved
duplex_http_call.go Outdated Show resolved Hide resolved
@emcfarlane emcfarlane requested a review from jhump October 23, 2023 16:50
@emcfarlane emcfarlane mentioned this pull request Oct 23, 2023
6 tasks
if _, err := discard(d.response.Body); err != nil {
if _, err := discard(d.response.Body); err != nil &&
!errors.Is(err, context.Canceled) &&
!errors.Is(err, context.DeadlineExceeded) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added these error checks to make CloseResponse behaviour more consistent. Otherwise Close does race with context cancel on errors causing either nil or a context error. Maybe we can revisit the discard behaviour in another PR.

connect_ext_test.go Show resolved Hide resolved
@@ -257,15 +263,15 @@ func (d *duplexHTTPCall) makeRequest() {
err = NewError(CodeUnavailable, err)
}
d.responseErr = err
d.requestBodyReader.CloseWithError(io.EOF)
d.requestBodyWriter.Close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this changed? Since it was using CloseWithError and not simply Close, this would not be the source of io.ErrPipeClosed issues. In fact, I believe this will cause attempts to send on the broken stream to now return io.ErrPipeClosed instead of io.EOF.

In order to control the error received by the write half (senders), you must close the read half (which is what it was doing before), and vice versa.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Close is just calling CloseWithError(nil) which then gets set to io.EOF: https://cs.opensource.google/go/go/+/refs/tags/go1.21.3:src/io/pipe.go;l=100
It's always the opposite side of the pipe that gets the error with the current side getting: io.ErrClosedPipe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SetError called close on read:

_ = d.requestBodyReader.Close()

Was my mistake switching to Writer, it's now fixed in the latest PR.

Copy link
Member

@jhump jhump Oct 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the current side getting: io.ErrClosedPipe
SetError called close on read:

Exactly. By closing the writer here, we will induce io.ErrClosedPipe in calls to send (which is what used the writer). The old behavior, closing the reader, is better since it results in EOF errors instead.

Was my mistake switching to Writer, it's now fixed in the latest PR.

I don't follow this. I'm looking at refreshed diff and it's still closing the writer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I confused myself and switched the ordering. My previous comments don't align.

Closing the write side makes Write on the pipe error with io.ErrClosedPipe. Connect checks for this when sending:

if err != nil && errors.Is(err, io.ErrClosedPipe) {

However, on the read side we need to return io.EOF to avoid HTTP2 setting the io.ErrClosedPipe as the error for the response body: #594 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CloseWrite behaviour was different to SetError:

return d.requestBodyWriter.Close()

_ = d.requestBodyReader.Close()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thank you for explaining!

@akshayjshah akshayjshah merged commit 41e8e3b into main Nov 1, 2023
7 checks passed
@akshayjshah akshayjshah deleted the ed/memtest branch November 1, 2023 19:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants