Add Stripe client telemetry to request headers #766
Conversation
Hey @jameshageman-stripe, mostly looks great. Thanks!

We try to have this sort of option configurable at the backend level though, so that you could turn it on without changing global state. Could you try tweaking this to copy the model established by, say, Logger: there's a global option, but it's also in BackendConfig, so that a user could request it with GetBackendWithConfig without touching global state?
ptal @jameshageman-stripe
stripe.go (outdated):

```go
@@ -158,6 +169,7 @@ type BackendImplementation struct {
	//
	// See also SetNetworkRetriesSleep.
	networkRetriesSleep bool
	lastRequestMetrics *RequestMetrics
```
Because networkRetriesSleep above has its own comment, could you just add a newline above this one, just to show that it's not part of the same group?
stripe.go (outdated):

```go
@@ -140,6 +144,13 @@ type BackendConfig struct {
	URL string
}

// RequestMetrics contains the payload sent in the `X-Stripe-Client-Telemetry`
// header when stripe.EnableTelemetry = true.
type RequestMetrics struct {
```
Let's call this requestMetrics instead so that it's not exported outside the package (whether the first letter is lower case or upper case determines this).
I initially did make RequestMetrics private; however, I wanted the tests to ensure that I could unmarshal the sent metrics back into this struct. Do you think it's ok to omit that check? Otherwise, I could duplicate the requestMetrics definition in stripe_test.go for the sake of unmarshaling.
Oops, sorry I missed this.

I actually forgot that stripe_test.go is in its own stripe_test package. IMO, we should probably just put it in stripe. All the tests in subpackages are in the same namespace as their accompanying file (customer/, charge/, form/, etc.), and even most of the *_test.go files in the top-level package are as well.
stripe.go (outdated):

```go
@@ -292,8 +304,15 @@ func (s *BackendImplementation) Do(req *http.Request, body *bytes.Buffer, v inte
	s.Logger.Printf("Requesting %v %v%v\n", req.Method, req.URL.Host, req.URL.Path)
}

if EnableTelemetry && s.lastRequestMetrics != nil {
	metricsJSON, _ := json.Marshal(s.lastRequestMetrics)
```
This one is unlikely to happen, but let's not swallow errors, just in case we introduce something in the future that becomes difficult to debug because we're throwing some message away. Observe err and handle it if it's non-nil, like elsewhere.
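A sketch of the non-swallowing version, reusing the payload assembly from the diff above and the logger call that appears later in the diffs; exact wording is an assumption:

```go
metricsJSON, err := json.Marshal(s.lastRequestMetrics)
if err != nil {
	// Surface the failure instead of silently dropping the telemetry.
	s.Logger.Printf("Unable to encode client telemetry: %s", err)
} else {
	payload := fmt.Sprintf(`{"last_request_metrics":%s}`, metricsJSON)
	req.Header.Set("X-Stripe-Client-Telemetry", payload)
}
```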
stripe.go (outdated):

```go
@@ -292,8 +304,15 @@ func (s *BackendImplementation) Do(req *http.Request, body *bytes.Buffer, v inte
	s.Logger.Printf("Requesting %v %v%v\n", req.Method, req.URL.Host, req.URL.Path)
}

if EnableTelemetry && s.lastRequestMetrics != nil {
	metricsJSON, _ := json.Marshal(s.lastRequestMetrics)
	payload := fmt.Sprintf(`{"last_request_metrics":%s}`, metricsJSON)
```
Just for cleanliness, could you create a wrapper struct for this one instead of manually assembling the JSON string (maybe requestTelemetry)?
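A sketch of the suggested wrapper; the JSON tag matches the manually assembled string above, and the type name is the one floated in the comment:

```go
// requestTelemetry wraps the metrics so the JSON is produced by the
// encoder rather than assembled by hand.
type requestTelemetry struct {
	LastRequestMetrics *requestMetrics `json:"last_request_metrics"`
}

// In Do, the fmt.Sprintf call would then become something like:
payload, err := json.Marshal(requestTelemetry{LastRequestMetrics: s.lastRequestMetrics})
```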
@brandur-stripe updated! Left one unresolved comment above.
Thanks for the updates! And oops: unfortunately I overlooked something major on the last review, which is that as implemented, this is extremely unsafe from a concurrency perspective (which is fairly important in Go). As multiple Goroutines make requests, they'll all be vying to write to lastRequestMetrics. It's not super obvious to me how to solve this one well (maybe a bounded queue?), but we'll need to rethink the approach.
That's a good point! Just using a mutex around the value would cause metrics to be discarded if multiple calls finished at the same time. It doesn't seem like a huge deal if metrics are missed, but it would be nice to handle concurrency in some capacity. Perhaps we could use a buffered channel to keep a finite number of request metrics around (say, 16?), but only perform non-blocking writes to the channel and discard the metrics when the channel is full. Likewise, new requests can pull a requestMetrics value off the channel with a non-blocking read. While we're at it, I'd like to try to add a test that detects the existing data race.
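The read side of that idea might look like this (a sketch; the channel field name prevRequestMetrics is hypothetical at this point in the thread):

```go
// Attach telemetry from a previous request if any is buffered, using a
// non-blocking receive so requests never wait on telemetry.
select {
case metrics := <-s.prevRequestMetrics:
	if metricsJSON, err := json.Marshal(&metrics); err == nil {
		req.Header.Set("X-Stripe-Client-Telemetry",
			fmt.Sprintf(`{"last_request_metrics":%s}`, metricsJSON))
	}
default:
	// Channel empty: skip telemetry for this request.
}
```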
Yep, that sounds good to me. One minor tweak that might be worth considering: a circular buffer would probably be slightly better here than a simple buffered channel, because in the case where it's full we'd favor more recent data points instead of older data points (which seems appropriate). That said, probably not worth adding a dependency for.
@brandur I think we could accomplish something similar in pure Go without a dependency. On write, if the buffer is full, we could spin up a goroutine to push an element into the buffer and simultaneously pop the oldest element off. It introduces an ephemeral goroutine, but it would allow more recent metrics to have priority. Something like this:

```go
// after a request is made in BackendImplementation#Do()
metrics := requestMetrics{
	RequestID:         reqID,
	RequestDurationMS: requestDurationMS,
}
select {
case s.prevRequestMetrics <- metrics:
	// Non-blocking insert succeeded, buffer was not full.
default:
	// Buffer was full, so pop off the oldest value and insert a newer one.
	// This is done in a goroutine to prevent blocking the client.
	go func() {
		<-s.prevRequestMetrics
		s.prevRequestMetrics <- metrics
	}()
}
```

I'm not sure how I feel about spawning a whole new goroutine for this, so I'd like to know your thoughts.
Nice idea! I think though that you'd still have a potential race that might cause the new Goroutines to live longer than you'd want them to. These two lines are not guaranteed to run atomically:

```go
<-s.prevRequestMetrics
s.prevRequestMetrics <- metrics
```

So after the ephemeral Goroutine pulls a value out of the channel, another Goroutine could sneak in and push its own value in, which would leave the ephemeral Goroutine blocked on its own send to s.prevRequestMetrics.

My general feeling is that it's a little heavy-handed to spin up ephemeral Goroutines for this anyway. I was just Googling and found this article from Pivotal on implementing a channel-based ring buffer. It's actually very similar to what you suggested, except there's just a single Goroutine in charge of managing the ring, which avoids potential races. Honestly not sure if this is a good idea or not, but it seems pretty workable.
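As a reference point, here's a minimal sketch of that single-manager-goroutine ring idea (all names hypothetical; this is neither the Pivotal code nor this PR's):

```go
// newRingChannel returns an input and an output channel. One goroutine
// moves values from in to out; when out is full, the oldest buffered value
// is dropped so newer metrics take priority. Closing in stops the goroutine.
func newRingChannel(size int) (chan<- requestMetrics, <-chan requestMetrics) {
	in := make(chan requestMetrics)
	out := make(chan requestMetrics, size)
	go func() {
		defer close(out)
		for m := range in {
			for {
				select {
				case out <- m:
				default:
					// out is full: drop the oldest value (non-blocking, in
					// case a reader drained it concurrently) and retry.
					select {
					case <-out:
					default:
					}
					continue
				}
				break
			}
		}
	}()
	return in, out
}
```

Because only the manager goroutine ever writes to out, and it never performs a blocking receive, the race described above can't occur.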
Cool! Between the Pivotal channel-based ring buffer and using container/ring with mutexes, I'd opt for the channel solution. I'll try it out in this PR and we can decide if we like it better than non-circular buffers.
Force-pushed from 81b1481 to 0752b9d.
Force-pushed from 0752b9d to 7f09c4a.
The circular buffer solution (added in 7f09c4a) was passing the normal tests, but it was failing unexpectedly on the coverage tests. It would fail on the assertion at line 252 in 7f09c4a.

I think this indicates that the telemetry pushed into the buffer was not read back out by the subsequent request. However, this seems like it would be an unpredictable race condition based on the specific goroutine scheduling, so I'm confused why it happened consistently on coverage tests but never on normal tests (in CI or on my laptop). I'll try adding some extra logging to investigate.
Force-pushed from aea3fbe to 7f09c4a.
My guess was correct: there was a synchronization issue. My initial implementation performed the pop-and-push in an ephemeral goroutine, so writes weren't synchronized with subsequent reads. I've added synchronous writes by having the client itself wait on:

```go
<-s.prevRequestMetrics
s.prevRequestMetrics <- metrics
```

I think I'd prefer to go back to a single channel solution and add a mutex.
@brandur Alright, I think this is ready for re-review. I've made several changes since you last commented.
Force-pushed from 07c4467 to 793abff.
stripe.go (outdated):

```go
select {
case s.requestMetricsBuffer <- r:
default:
	<-s.requestMetricsBuffer
```
@jameshageman-stripe I like how clever this is to avoid taking a lock when reading the buffer, but I think as-is there is the (remote but possible) possibility of a deadlock:

goroutine 1:
- locks requestMetricsMutex
- attempts to write to requestMetricsBuffer, but it is full, and falls through to the default case on line 982
- is descheduled

goroutines 2..n:
- read from requestMetricsBuffer until it is empty (because the user's application has issued a burst of API requests, or something)

goroutine 1:
- is rescheduled
- attempts to read from <-s.requestMetricsBuffer, which will block

Goroutine 1 is now blocked waiting for a metric to appear on the channel, but nobody else will be able to write to the channel as requestMetricsMutex is held (all incoming requests will now also deadlock, waiting on that mutex).

I think this can be addressed by changing:

```go
<-s.requestMetricsBuffer
```

to

```go
select {
case <-s.requestMetricsBuffer:
default:
}
```

so that we don't do a blocking channel read with the mutex held (along with a big fat comment). But maybe it's worth considering whether this is too clever, and whether we should unconditionally grab the mutex for all reads and writes of s.requestMetricsBuffer.

Unless I'm reading this wrong and there is nothing to worry about, which is possible!
@bobby-stripe @dcarney-stripe and I had some further discussion in person that I thought I'd share. The circular buffer implementation has grown in complexity enough to warrant extracting it out of the backend implementation.

With this in mind, I think it's worth reconsidering using a standard FIFO channel instead of a circular buffer, as the benefit (slightly younger request metrics on average) may not outweigh the added complexity.
Yeah, fair enough. I think the circular buffer would be a slightly better implementation, but things change given that it's not that easy. I totally buy the complexity argument, and even with thorough review, there's still a much higher chance of introducing a bug with our own buffer implementation than with a standard channel. I'm +1 on going back to a FIFO channel if you guys are. Thanks for looking into it at least!
Reverted back to a simple FIFO channel.
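For clarity, the FIFO shape this reverts to is just a plain buffered channel with non-blocking operations on both ends; a sketch of the write side (names follow the diffs below):

```go
// After a response: buffer the metrics for a later request. If the buffer
// is full, drop the new metrics instead of blocking or evicting old ones.
select {
case s.requestMetricsBuffer <- metrics:
default:
}
```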
Awesome! Left a couple minor comments, but looking great. Thanks for the updates!
```go
@@ -779,6 +835,9 @@ const minNetworkRetriesDelay = 500 * time.Millisecond

const uploadsURL = "https://uploads.stripe.com"

// The number of requestMetric objects to buffer for client telemetry.
const telemetryBufferSize = 16
```
Can you move this up a couple lines? (The constant names are ordered alphabetically so let's try to keep them that way.)
Also, would you mind amending the comment to mention that additional objects will be dropped if the buffer is full? It'll make it extra clear what the lib does when the buffer is full (as opposed to other behaviors like growing the buffer, or sending metrics synchronously, etc.)
(I know you mention that at the site where the channel is actually written to, but I think it'd be helpful to have here as well).
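The amended comment might read something like this (wording is a sketch, not the final text):

```go
// The number of requestMetric objects to buffer for client telemetry.
// If the buffer is full, additional objects are dropped, as opposed to
// growing the buffer or sending metrics synchronously.
const telemetryBufferSize = 16
```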
```go
}

// If the metrics buffer is full, discard the new metrics. Otherwise, add
// them to the buffer.
```
Thanks for the comment!
stripe.go (outdated):

```go
// to Stripe in subsequent requests via the `X-Stripe-Client-Telemetry` header.
//
// Defaults to false.
EnableTelemetry bool
```
Likewise, can you put EnableTelemetry into the right place alphabetically?
stripe.go (outdated):

```go
@@ -338,8 +371,14 @@ func (s *BackendImplementation) Do(req *http.Request, body *bytes.Buffer, v inte
	}
}

// `requestStart` is used solely for client telemetry and, unlike `start`,
// does not account for the time spent building the request body.
requestStart := time.Now()
```
Can we just use start? All that's happening between the two is some structs being initialized locally. In practice it's likely to have a negligible effect on timing, and I don't think it's worth complicating the code over.
stripe.go (outdated):

```go
	s.Logger.Printf("Unable to encode client telemetry: %s", err)
}
default:
	// no metrics available, ignore.
```
Nit: might be worth amending this comment to point out that this default case needs to be here to enable a non-blocking receive on the channel. Wouldn't want some over-zealous refactorer to remove it, thinking that it was an innocuous block.
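Something like this, perhaps (comment wording assumed):

```go
default:
	// No metrics available, ignore. This default case must stay: it is
	// what makes the channel receive above non-blocking.
```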
```go
	},
).(*stripe.BackendImplementation)

for i := 0; i < 2; i++ {
```
It's initially unclear why this loop executes twice. Can you add a quick comment that explains that it's because the telemetry comes from the previous completed request/response?
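For example (a sketch of the suggested comment):

```go
// Make two requests: telemetry describes the previous completed
// request/response, so the header only appears on the second request.
for i := 0; i < 2; i++ {
```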
```go
message := "Hello, client."
requestNum := 0

testServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
```
Great test. 👍 I love Go's httptest package!
```go
requestNum++

telemetryStr := r.Header.Get("X-Stripe-Client-Telemetry")
switch requestNum {
```
👍
Force-pushed from 6eef65d to 02de4bb.
Force-pushed from 02de4bb to 60be451.
@jameshageman-stripe This is looking great! Most of my comments were small little nitpicks about comments. I think the general concept of using the simple buffered channel is the way to go, and led to some very clean code. We can always revisit this if we decide to do something more sophisticated.
@dcarney @brandur-stripe Thanks for your feedback! I think I've addressed all of the comments, and I just added another assertion that checks that we actually measure request duration. Let me know if you see anything else you'd like to discuss :)
Awesome — LGTM. Thanks for all the hard work here, James!
Released as 55.12.0.
Follows stripe/stripe-ruby#696, stripe/stripe-php#549, and stripe/stripe-python#518 in adding telemetry metadata to request headers.
The telemetry is disabled by default, and can be enabled per-client like so:
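(A sketch of the per-client configuration, assuming the BackendConfig field and the GetBackendWithConfig flow discussed in the review:)

```go
backend := stripe.GetBackendWithConfig(stripe.APIBackend, &stripe.BackendConfig{
	EnableTelemetry: true, // enables telemetry for this backend only
})
sc := client.New("sk_test_...", &stripe.Backends{API: backend})
```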
or globally by setting stripe.EnableTelemetry = true.
cc @dcarney-stripe @bobby-stripe