Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do function panics when HTTPClient returns a nil response #714

Conversation

agatamigalska-newstore
Copy link
Contributor

@agatamigalska-newstore agatamigalska-newstore commented Nov 2, 2018

This PR covers the scenario when res, err = s.HTTPClient.Do(req) results in res that is nil (https://github.com/stripe/stripe-go/blob/master/stripe.go#L341 ). This can happen for example in case of timeout.

  • a test was added to verify no panic in case of timeout
  • in the Do function an additional check was added to prevent panic

Closes #713


[EDIT] Initial motivation:

Integration tests in the codebase I'm working on are randomly failing with the following stack trace:

error_cause=runtime error: invalid memory address or nil pointer dereference 
stack_trace�=[
  {
  "function": "runtime.call32",
  "location": "/usr/local/go/src/runtime/asm_amd64.s:573"
  },
  {
  "function": "runtime.gopanic",
  "location": "/usr/local/go/src/runtime/panic.go:503"
  },
  {
  "function": "runtime.panicmem",
  "location": "/usr/local/go/src/runtime/panic.go:63"
  },
  {
  "function": "runtime.sigpanic",
  "location": "/usr/local/go/src/runtime/signal_unix.go:388"
  },
  {
  "function": "github.com/newstore/newstore/vendor/github.com/stripe/stripe-go.(*BackendImplementation).Do",
  "location": "/go/src/github.com/newstore/newstore/vendor/github.com/stripe/stripe-go/stripe.go:342"
  },
  {
  "function": "github.com/newstore/newstore/vendor/github.com/stripe/stripe-go.(*BackendImplementation).CallRaw",
  "location": "/go/src/github.com/newstore/newstore/vendor/github.com/stripe/stripe-go/stripe.go:224"
  },
  {
  "function": "github.com/newstore/newstore/vendor/github.com/stripe/stripe-go.(*BackendImplementation).Call",
  "location": "/go/src/github.com/newstore/newstore/vendor/github.com/stripe/stripe-go/stripe.go:186"
  },
  {
  "function": "github.com/newstore/newstore/vendor/github.com/stripe/stripe-go/customer.Client.Get",
  "location": "/go/src/github.com/newstore/newstore/vendor/github.com/stripe/stripe-go/customer/client.go:38"
  },
  ...

In this PR I'm trying to capture the situation in which this panic can happen in client.go:38. The best hypothesis was that res is nil, and so the attempt to access res.Body causes a panic.

The goal in the test was to simulate the scenario where HttpClient.Do() function returns nil response.

The timeout scenario for the test was chosen simply because, given the infrastructure used for testing stripe-go, it was the easiest to simulate.

@remi-stripe
Copy link
Contributor

@agatamigalska-newstore Thanks a lot for opening a PR so quickly and adding a test. This one is quite specific to some Go internals so I'll defer to @brandur-stripe who can look next week and he'll follow up with feedback!

Copy link
Contributor

@brandur-stripe brandur-stripe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a crack at the fix! I left a few comments below, but a couple general ones as well:

  1. Do we need to handle this condition outside of the retry loop as well? There's very similar code to what you fixed on https://github.com/stripe/stripe-go/blob/master/stripe.go#L354 which might also be prone to the same problem (say # of retries is set to 0).
  2. In case this condition does occur, it'd be nice to produce some information logging about it. On the lines following the ones you changed you'll notice we log this:
		if s.LogLevel > 0 {
			s.Logger.Printf("Request failed with: %s (error: %v)\n", string(resBody), err)
		}

I believe that under your implementation, this will just produce:

Request failed with: (error: )

Which isn't very good. We should set err with errors.New to a message like "Received nil response which may indicate a timeout).

Lastly, do you understand why exactly this problem is happening? Go's HTTP client is supposed to return an error when a request times out. I wrote a simple test program to demonstrate this:

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Timeout: 1 * time.Millisecond,
	}

	resp, err := client.Get("http://example.com")
	fmt.Printf("resp = %+v\n", resp)
	fmt.Printf("err = %+v\n", err)
}

Result:

$ go run main.go
resp = <nil>
err = Get http://example.com: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

(And relatedly, I sort of suspect that the test you added may not be testing for the case you're hoping for given that it's probably just handling the normal error that gets returned in the normal case.)

stripe.go Outdated
err error
)
// Checking res instead of err because s.HTTPClient.Do(req) has one special case for Go 1 compatibility
// when both the response and an error are returned
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we clarify this comment a bit? It should read something like "when both the response and error are returned as nil" I think?

Also, what exactly do you mean by "Go 1 compatibility"? All Go versions right now are Go 1 right? Do you mean Go 1.10 or something like that?

Copy link
Contributor Author

@agatamigalska-newstore agatamigalska-newstore Nov 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment was based on the following statement made in spripe-go's dependency go-1.10/src/net/http/client.go (lines 602 - 605) :

// Special case for Go 1 compatibility: return both the response
// and an error if the CheckRedirect function failed.
// See https://golang.org/issue/3795
// The resp.Body has already been closed.

Current github repo don't have this comment anymore. Anyway, since I understood this library belongs to Go's core, I assumed that must make sense and didn't give it more thought. I still don't know what exactly they meant, so I updated my comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I just took a look at that issue and it looks like they were talking about Go "1.0" (which was somewhat ambiguously just called "Go 1").

stripe_test.go Outdated
@@ -118,6 +119,71 @@ func TestDo_Retry(t *testing.T) {
assert.Equal(t, 2, requestNum)
}

// Tests client retries when HTTP returns timeout.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we either elaborate on this comment or remove it? e.g. You could add something like "Timeouts can result in an edge whereby both a request's result and error come back as nil, which can fail the retry infrastructure if it's not handling it."

stripe_test.go Outdated
// To run only this test:
//
// go test . -run TestDo_RetryOnTimeout -test.v
//
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I think this "run only this test" comment was copied from TestDo_Retry above. Let's just remove this part. I included it in TestDo_Retry because it gives you a hint at how to better understand what the test is doing, but in general there's sufficient instruction for this type of thing in the README.

stripe_test.go Outdated
// previously had regressions in this behavior as we switched to HTTP/2
// and `Request` became non-reusable, so we want to check it with every
// request.
assert.Equal(t, "bar", r.Form.Get("foo"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's strip out this assert and just rely on the one in TestDo_Retry above (it and the comment above it were copied from there).

stripe_test.go Outdated
defer func() {
var recovery = recover()
assert.Nil(t, recovery)
}()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's sufficient to just let the test framework catch the panic for us and fail the run if one occurs. Let's strip out this defer.

stripe_test.go Outdated
// request.
assert.Equal(t, "bar", r.Form.Get("foo"))
time.Sleep(timeout)
requestNum++
Copy link
Contributor

@brandur-stripe brandur-stripe Nov 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this is a little subtle, but there's a race here which is causing the Travis build to fail on this test. Check there for details, but you should also be able to reproduce it locally by using go test with the -race flag (which is what we do in Travis):

go test . -run TestDo_RetryOnTimeout -test.v -race

I believe what's happening is that you have multiple server Goroutines that are active in this handler concurrently. Because the client is timing out early, the first request is still active on the server when the second starts processing. The fix is pretty easy though: make sure to use a mutex around any variable accesses like this one that could be touched from multiple Goroutines.

@agatamigalska-newstore
Copy link
Contributor Author

agatamigalska-newstore commented Nov 5, 2018

@brandur-stripe Please see my update to the PR description, I added the initial motivation for this fix.

For your general remarks:

  1. I don't think it is possible. The cases we have are:
|-----|-----|----------------------------|
|  res|  err|               what happens |
|-----|-----|----------------------------|
|  nil| !nil| returns before reading res |
| !nil|  nil|                   all good |
| !nil| !nil| returns before reading res |
|-----|-----|----------------------------|
  1. Actually it prints out
Request failed with:  (error: Post http://127.0.0.1:38643/v1/hello: net/http: request canceled (Client.Timeout exceeded while awaiting headers))

I think it looks quite fine, doesn't it?

  1. I think there is some misunderstanding about err being nil or not nil. In the situation I'm facing here err is not nil.
    Though, I admit that this comment:
// Checking res instead of err because s.HTTPClient.Do(req) has one special case when both the 
// response and an error are returned
		if res != nil {

is probably confusing due to my misconception about Go principles. I thought it is a practice to return nil response if err is not nil, and not nil response if err is nil (like an XOR approach).
So I was expecting HttpClient.Do() to always return either not nil, nil or nil, not nil as response, error. After checking the source code I learned there is one case when it returns not nil, not nil. So I thought it was unusual and so I made this comment.

@agatamigalska-newstore
Copy link
Contributor Author

The build for Go 1.8 has failed due to

The command "go get -u golang.org/x/net/http2" failed 3 times.

Any ideas?

@agatamigalska-newstore
Copy link
Contributor Author

@brandur-stripe Please review, thanks.

@brandur-stripe
Copy link
Contributor

@agatamigalska-newstore Ah! Okay, I misunderstood the premise of the PR a bit — re-reading the code now, I can see that we are not correctly handling the err case during a retry, which is obviously incorrect.

Which you mind changing your implementation a bit so that we very explicitly handle the case on err and print accordingly? Here's the diff I came up with if you just want to apply it directly:

diff --git a/stripe.go b/stripe.go
index e356177..dc6fcf1 100644
--- a/stripe.go
+++ b/stripe.go
@@ -351,17 +351,23 @@ func (s *BackendImplementation) Do(req *http.Request, body *bytes.Buffer, v inte
 			break
 		}
 
-		resBody, err := ioutil.ReadAll(res.Body)
-		res.Body.Close()
 		if err != nil {
 			if s.LogLevel > 0 {
-				s.Logger.Printf("Cannot read response: %v\n", err)
+				s.Logger.Printf("Request failed with error: %v\n", err)
+			}
+		} else {
+			resBody, err := ioutil.ReadAll(res.Body)
+			res.Body.Close()
+			if err != nil {
+				if s.LogLevel > 0 {
+					s.Logger.Printf("Cannot read response: %v\n", err)
+				}
+				return err
 			}
-			return err
-		}
 
-		if s.LogLevel > 0 {
-			s.Logger.Printf("Request failed with: %s (error: %v)\n", string(resBody), err)
+			if s.LogLevel > 0 {
+				s.Logger.Printf("Request failed with: %s (error: %v)\n", string(resBody), err)
+			}
 		}
 
 		sleepDuration := s.sleepTime(retry)

Regarding your test, I gave you a little bit of bad advice. Go actually has an atomic counter built in, so would you mind if we switched to just using that. Here's an example of using it, but your implementation should look something like this:

var counter uint64

...

atomic.AddUint64(&counter, 1)

...

atomic.LoadUint64(&counter)

Thanks!

stripe_test.go Outdated
var body = bytes.NewBufferString("foo=bar")
var response testServerResponse

// make sure that there was no panic
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we strip this comment? I think it's somewhat self-explanatory that we don't want a panic.

stripe_test.go Outdated
// timeout should not prevent retry
assert.Equal(t, 2, counter.Get())
var recovery = recover()
assert.Nil(t, recovery)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And can we take out the recover() and subsequent assert? If there's a panic, the test run will fail automatically.

@brandur-stripe
Copy link
Contributor

The build for Go 1.8 has failed due to

The command "go get -u golang.org/x/net/http2" failed 3 times.

Any ideas?

@agatamigalska-newstore Unfortunately an upstream change in Go's HTTP/2 package ended up breaking it for 1.8, and that took our build down with it. I dropped 1.8 from the build matrix in #716 and that's now merged, so you should be able to rebase on that to get things passing again.

@agatamigalska-newstore agatamigalska-newstore force-pushed the prevent_panic_on_http_timeout branch from 2515dfe to 062c822 Compare November 7, 2018 11:13
@agatamigalska-newstore
Copy link
Contributor Author

Thanks for your comments @brandur-stripe. The code is ready for the next round :)

stripe.go Outdated
return err
}
if s.LogLevel > 0 {
s.Logger.Printf("Request failed with: %s (error: %v)\n", string(resBody), err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agatamigalska-newstore Sorry, I know this is code I gave you, but reading it again, I now realize that on this last log line err will never be non-nil because we're now handling that case above. It would also be kind of nice if we logged the response status code because that's one of the factors that the library will use to determine the retry.

Would you mind changing it to:

s.Logger.Printf("Request failed with body: %s (status: %v)\n", string(resBody), res.StatusCode)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And thanks for making all those other changes!

After this last one, this looks good to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated as requested, this time I also read it instead of just copy-pasting ;)

stripe.go Outdated
if s.LogLevel > 0 {
s.Logger.Printf("Cannot read response: %v\n", err)
}
return err
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brandur-stripe This return statement was here before I started messing with this code, and I'm wondering - why not let it retry? It seems like the only case when (apart from shouldRetry() returning false) the retrial is not permitted, and I wonder why?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that it wasn't originally retried because it's kind of an "error inside of an error" condition, and also probably somewhat unlikely to occur (usually once you have the response you'll get back a response body).

If you want, you can start retrying it as well since you're right in that it'd be strictly more correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the return statement and amended logged messages. Please have a look @brandur-stripe

@brandur-stripe
Copy link
Contributor

Excellent! Thanks for all the work here.

@brandur-stripe brandur-stripe merged commit 0e4f60c into stripe:master Nov 12, 2018
@brandur-stripe
Copy link
Contributor

Released as 53.0.1.

nadaismail-stripe pushed a commit that referenced this pull request Oct 18, 2024
Bumps [sorbet](https://github.com/sorbet/sorbet) from 0.5.10346 to 0.5.10382.
- [Release notes](https://github.com/sorbet/sorbet/releases)
- [Commits](https://github.com/sorbet/sorbet/commits)

---
updated-dependencies:
- dependency-name: sorbet
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants