
Add Stripe client telemetry to request headers #766

Merged
merged 18 commits into stripe:master on Jan 17, 2019

Conversation

@jameshageman-stripe (Contributor) commented Jan 11, 2019

Follows stripe/stripe-ruby#696, stripe/stripe-php#549, and stripe/stripe-python#518 in adding telemetry metadata to request headers.

The telemetry is disabled by default, and can be enabled per-client like so:

backend := stripe.GetBackendWithConfig(
	stripe.APIBackend,
	&stripe.BackendConfig{
		URL: "https://api.stripe.com/v1",
		EnableTelemetry: true,
	},
)

or globally by setting

stripe.EnableTelemetry = true

cc @dcarney-stripe @bobby-stripe

@brandur-stripe (Contributor) left a comment

Hey @jameshageman-stripe, mostly looks great. Thanks!

We try to have this sort of option configurable at the backend level, though, so that you can turn it on without changing global state. Could you tweak this to copy the model established by, say, Logger: there's a global option, but it's also in BackendConfig, so a user can request it with GetBackendWithConfig without touching global state?

ptal @jameshageman-stripe
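
For illustration, the Logger-style setup being asked for could look roughly like this. The telemetryEnabled helper and its resolution logic are hypothetical sketches, not code from the PR:

// BackendConfig (excerpt): a per-backend flag alongside the global
// stripe.EnableTelemetry, mirroring how Logger can be configured either
// globally or per backend.
type BackendConfig struct {
	// ... the existing fields (URL, etc.)

	// EnableTelemetry enables telemetry for this backend only.
	EnableTelemetry bool
}

// telemetryEnabled is a hypothetical helper: telemetry is on if either
// the global flag or the backend's own configuration requests it.
func telemetryEnabled(cfg *BackendConfig) bool {
	return EnableTelemetry || cfg.EnableTelemetry
}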

stripe.go Outdated
@@ -158,6 +169,7 @@ type BackendImplementation struct {
//
// See also SetNetworkRetriesSleep.
networkRetriesSleep bool
lastRequestMetrics *RequestMetrics
Contributor

Because networkRetriesSleep above has its own comment, could you just add a newline above this one just to show that it's not part of the same group?

stripe.go Outdated
@@ -140,6 +144,13 @@ type BackendConfig struct {
URL string
}

// RequestMetrics contains the payload sent in the `X-Stripe-Client-Telemetry`
// header when stripe.EnableTelemetry = true.
type RequestMetrics struct {
Contributor

Let's call this requestMetrics instead so that it's not exported outside the package (whether the first letter is lower case or upper case determines this).

Contributor Author

I initially did make RequestMetrics private; however, I wanted the tests to ensure that I could unmarshal the sent metrics back into this struct. Do you think it's OK to omit that check? Otherwise, I could duplicate the requestMetrics definition in stripe_test.go for the sake of unmarshaling.

@brandur-stripe (Contributor) commented Jan 14, 2019

Oops, sorry I missed this.

I actually forgot that stripe_test.go is in its own stripe_test package. IMO, we should probably just put it in stripe. All the tests in subpackages are just in the same namespace as their accompanying file (customer/, charge/, form/, etc.) and even most of the *_test.go in the top-level package are as well.

stripe.go Outdated
@@ -292,8 +304,15 @@ func (s *BackendImplementation) Do(req *http.Request, body *bytes.Buffer, v inte
s.Logger.Printf("Requesting %v %v%v\n", req.Method, req.URL.Host, req.URL.Path)
}

if EnableTelemetry && s.lastRequestMetrics != nil {
metricsJSON, _ := json.Marshal(s.lastRequestMetrics)
Contributor

This one is unlikely to happen, but let's not swallow errors, just in case we introduce something in the future that becomes difficult to debug because we're throwing a message away. Observe err and handle it if it's non-nil, like elsewhere.

stripe.go Outdated
@@ -292,8 +304,15 @@ func (s *BackendImplementation) Do(req *http.Request, body *bytes.Buffer, v inte
s.Logger.Printf("Requesting %v %v%v\n", req.Method, req.URL.Host, req.URL.Path)
}

if EnableTelemetry && s.lastRequestMetrics != nil {
metricsJSON, _ := json.Marshal(s.lastRequestMetrics)
payload := fmt.Sprintf(`{"last_request_metrics":%s}`, metricsJSON)
Contributor

Just for cleanliness, could you create a wrapper struct for this one instead of manually assembling the JSON string (maybe requestTelemetry)?
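For example, a wrapper along these lines (a sketch only; the json tag mirrors the hand-assembled `last_request_metrics` key, the struct uses the lowercased requestMetrics name from the earlier comment, and the error handling follows the previous review note):

// requestTelemetry wraps the metrics of the previous request and is
// marshaled into the `X-Stripe-Client-Telemetry` header.
type requestTelemetry struct {
	LastRequestMetrics requestMetrics `json:"last_request_metrics"`
}

// ... inside Do(), instead of assembling the JSON string by hand:
telemetry := requestTelemetry{LastRequestMetrics: *s.lastRequestMetrics}
metricsJSON, err := json.Marshal(telemetry)
if err == nil {
	req.Header.Set("X-Stripe-Client-Telemetry", string(metricsJSON))
} else {
	s.Logger.Printf("Unable to encode client telemetry: %s", err)
}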

@jameshageman-stripe (Contributor Author)

@brandur-stripe updated! Left one unresolved comment re: RequestMetrics visibility.

@jameshageman-stripe (Contributor Author)

cc @akropp-stripe

@brandur-stripe (Contributor)

Thanks for the updates!

And oops — unfortunately I overlooked something major on the last review, which is that, as implemented, this is extremely unsafe from a concurrency perspective (which is fairly important for Go). As multiple Goroutines make requests, they'll all be vying to write lastRequestMetrics. Similarly, as new requests are being issued, what's in lastRequestMetrics at any given moment is anybody's guess. Wrapping accesses in a mutex isn't enough, because even with that, the values being read and written will be highly volatile.

It's not super obvious to me how to solve this one well (maybe a bounded queue?), but we'll need to rethink the approach.

@jameshageman-stripe (Contributor Author)

That's a good point! Just using a mutex around the value would cause metrics to be discarded if multiple calls finished at the same time. It doesn't seem like a huge deal if metrics are missed, but it would be nice to handle concurrency in some capacity.

Perhaps we could use a buffered channel to keep a finite amount of request metrics around (say, 16?), but only perform non-blocking writes to the channel and discard the metrics when the channel is full. Likewise, new requests can pull a requestMetrics struct off the channel if it is not empty. This would allow concurrently produced metrics to be kept around, without introducing any blocking calls and unbounded memory allocations. Let me know how that sounds to you.
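As a rough sketch of that pattern (names are placeholders; requestMetricsBuffer would be a buffered channel on the backend):

// Producer side, after a response: record the metrics unless the buffer
// is already full, in which case they are silently dropped so the
// client never blocks.
select {
case s.requestMetricsBuffer <- metrics:
default:
	// Buffer full; discard these metrics.
}

// Consumer side, before the next request: attach buffered metrics if
// any are available, without blocking when the buffer is empty.
select {
case m := <-s.requestMetricsBuffer:
	// m holds a previous request's metrics; marshal it into the
	// X-Stripe-Client-Telemetry header here.
	_ = m
default:
	// No metrics buffered; send the request without telemetry.
}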

While we're at it, I'd like to try to add a test that detects the existing data race.

@brandur-stripe (Contributor) commented Jan 14, 2019

Perhaps we could use a buffered channel to keep a finite amount of request metrics around (say, 16?), but only perform non-blocking writes to the channel and discard the metrics when the channel is full. Likewise, new requests can pull a requestMetrics struct off the channel if it is not empty. This would allow concurrently produced metrics to be kept around, without introducing any blocking calls and unbounded memory allocations. Let me know how that sounds to you.

Yep, that sounds good to me.

One minor tweak that might be worth considering: a circular buffer would probably be slightly better here than a simple buffered channel because in the case where it's full, we'd favor more recent data points instead of older data points (which seems appropriate). That said, probably not worth adding a dependency for.

@jameshageman-stripe (Contributor Author)

@brandur I think we could accomplish something similar in pure Go without a dependency. On write, if the buffer is full, we could spin up a goroutine to push an element into the buffer and simultaneously pop the oldest element off. It introduces an ephemeral goroutine, but it would allow more recent metrics to have priority.

Something like this:

// after a request is made in BackendImplementation#Do()

metrics := requestMetrics{
	RequestID:         reqID,
	RequestDurationMS: requestDurationMS,
}

select {
case s.prevRequestMetrics <- metrics:
	// Non-blocking insert succeeded, buffer was not full.
default:
	// Buffer was full, so pop off the oldest value and insert a newer one.
	// This is done in a goroutine to prevent blocking the client.
	go func() {
		<-s.prevRequestMetrics
		s.prevRequestMetrics <- metrics
	}()
}

I'm not sure how I feel about spawning a whole new goroutine for this, so I'd like to know your thoughts.

@brandur-stripe (Contributor) commented Jan 14, 2019

I'm not sure how I feel about spawning a whole new goroutine for this, so I'd like to know your thoughts.

Nice idea!

I think though that you'd still have a potential race that might cause the new Goroutines to live longer than you'd want them to. These two lines are not guaranteed to run atomically:

<-s.prevRequestMetrics
s.prevRequestMetrics <- metrics

So after the ephemeral Goroutine pulls a value out of the channel, another Goroutine could sneak in and push its own value in, which would leave the ephemeral Goroutine waiting on s.prevRequestMetrics <- metrics.

My general feeling is that it's a little heavy-handed spinning up ephemeral Goroutines for this anyway.

I was just Googling and found this article from Pivotal on implementing a channel-based ring buffer. It's actually very similar to what you suggested, except there's just a single Goroutine in charge of managing the ring, which avoids potential races. Honestly not sure if this is a good idea or not, but it seems pretty workable.
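In rough form, that single-goroutine ring buffer looks something like the following. This is a sketch adapted from the idea, not the article's or the PR's exact code; the inner non-blocking receive is an added guard in case a reader drains outputChannel between the two operations:

// ringBuffer favors recent values: one goroutine owns the transfer from
// inputChannel to outputChannel and drops the oldest buffered value
// when outputChannel is full.
type ringBuffer struct {
	inputChannel  chan requestMetrics
	outputChannel chan requestMetrics
}

func newRingBuffer(size int) *ringBuffer {
	rb := &ringBuffer{
		inputChannel:  make(chan requestMetrics),
		outputChannel: make(chan requestMetrics, size),
	}
	go rb.run()
	return rb
}

func (rb *ringBuffer) run() {
	for v := range rb.inputChannel {
		select {
		case rb.outputChannel <- v:
			// There was room in the output buffer.
		default:
			// Output buffer full: drop the oldest value. The receive is
			// non-blocking in case a concurrent reader already drained
			// the channel; since run() is the only writer, the send
			// below then always has room.
			select {
			case <-rb.outputChannel:
			default:
			}
			rb.outputChannel <- v
		}
	}
	close(rb.outputChannel)
}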

@jameshageman-stripe (Contributor Author)

Cool! Between the Pivotal channel-based ring buffer and using container/ring with mutexes, I'd opt for the channel solution.

I'll try it out in this PR and we can decide if we like it better than non-circular buffers.

@jameshageman-stripe (Contributor Author)

The circular buffer solution (added in 7f09c4a) was passing the normal tests, but it was failing unexpectedly on the coverage tests. It would fail on this assertion:

assert.True(t, len(telemetryStr) > 0, "telemetryStr should not be empty")

I think this indicates that the telemetry pushed into the inputChannel by the first request had not yet been passed to the outputChannel by the ringBuffer's run() goroutine.

However, this seems like it would be an unpredictable race condition based on the specific goroutine scheduling. Therefore I'm confused why it happened consistently on coverage tests but never on normal tests (in CI or on my laptop).

I'll try adding some extra logging to investigate.

@jameshageman-stripe (Contributor Author)

My guess was correct: there was a synchronization issue. My initial implementation of ringBuffer did asynchronous writes, so the first request's metrics hadn't necessarily been placed on the outputChannel by the time the second request was kicked off (it all depended on Go scheduling/time slicing).

I've added synchronous writes by having the client wait on a done channel, but I feel like this solution has become more complicated than simply wrapping the following in a mutex guard:

<-s.prevRequestMetrics
s.prevRequestMetrics <- metrics

I think I'd prefer to go back to a single-channel solution and add a mutex.

@jameshageman-stripe (Contributor Author)

@brandur Alright, I think this is ready for re-review. Since you last commented, the following has changed:

  • I implemented a circular buffer of requestMetrics using a single channel and a mutex.
  • EnableTelemetry can be passed into GetBackendWithConfig via the BackendConfig struct, which allows individual clients to have telemetry enabled.
  • I (re-)added a global stripe.EnableTelemetry flag that enables telemetry headers for all clients instead of a single Backend.
  • I did not change the package in stripe_test.go from stripe_test to stripe, because that change would end up adding a large diff to an already large PR. Instead, I duplicated the requestMetrics and requestTelemetry struct definitions in the one test that used them.

stripe.go Outdated
select {
case s.requestMetricsBuffer <- r:
default:
<-s.requestMetricsBuffer
@bobby-stripe commented Jan 16, 2019

@jameshageman-stripe I like how clever this is to avoid taking a lock when reading the buffer, but I think as-is there is a (remote but real) possibility of a deadlock:

goroutine 1:
locks requestMetricsMutex
attempt to write to requestMetricsBuffer, but it is full, fall through to the default case on line 982
goroutine 1 descheduled

goroutine 2..n:
read from requestMetricsBuffer until it is empty (because the user's application has issued a burst of API requests, or something)

goroutine 1:
rescheduled
attempt to read from <-s.requestMetricsBuffer, which will block

goroutine 1 is now blocked waiting for a metric to appear on the channel, but nobody else will be able to write to the channel as requestMetricsMutex is held (all incoming requests will now also deadlock, waiting on that mutex).

I think this can be addressed by changing:

<-s.requestMetricsBuffer

to

select {
case _ = <-s.requestMetricsBuffer:
default:
}

So that we don't do a blocking channel read with the mutex held (along with a big fat comment). That said, maybe it's worth considering whether this is too clever and we should unconditionally grab the mutex for all reads and writes of s.requestMetricsBuffer.

Unless I'm reading this wrong and there is nothing to worry about, which is possible!
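Put together, the guarded write path being suggested would look roughly like this (a sketch; the mutex and channel names follow the discussion above):

s.requestMetricsMutex.Lock()
select {
case s.requestMetricsBuffer <- r:
	// There was room in the buffer.
default:
	// Buffer full: drop the oldest metrics to make room. The receive is
	// also non-blocking, because a concurrent reader may have drained
	// the channel since the outer select, and a blocking receive here
	// would deadlock while the mutex is held.
	select {
	case <-s.requestMetricsBuffer:
	default:
	}
	s.requestMetricsBuffer <- r
}
s.requestMetricsMutex.Unlock()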

@jameshageman-stripe (Contributor Author)

@bobby-stripe @dcarney-stripe and I had some further discussion in person that I thought I'd share.

The circular buffer implementation has grown complex enough to warrant its extraction from stripe.go, due to the required coordination between multiple channels, goroutines, and/or mutexes, depending on the implementation. The complexity of implementing a safe, concurrent circular buffer led us to revisit the reasoning for using a circular buffer over a regular FIFO channel. Some points:

  • With a circular buffer, the oldest request metrics get dropped when the buffer is full
  • With a regular channel, the newest request metrics get dropped when the buffer is full
  • The buffer only fills up when there are concurrently executing requests
  • When we send a requestMetrics struct, it is already for a past request. The gap between recording a request metric and sending it could be on the order of minutes if the client is not sending many requests.
  • We do not expect or require delivery of metrics for every request
  • Using a circular buffer would reduce the average time between a request's metrics being recorded and sent, but it does not stop clients from potentially sending requestMetrics that are many minutes old.

With these points in mind, I think it's worth reconsidering using a standard FIFO channel instead of a circular buffer, as the benefit (slightly younger request metrics on average) may not outweigh the added complexity.

@brandur-stripe (Contributor)

With these points in mind, I think it's worth reconsidering using a standard FIFO channel instead of a circular buffer, as the benefit (slightly younger request metrics on average) may not outweigh the added complexity.

Yeah, fair enough.

I think the circular buffer would be a slightly better implementation, but things change given that it's not that easy. I totally buy the complexity argument, and even with thorough review, there's still a much higher chance of introducing a bug with our own buffer implementation than with just using a standard channel.

I'm +1 going back to FIFO channel if you guys are. Thanks for looking into it at least!

@jameshageman-stripe (Contributor Author)

Reverted back to a simple FIFO channel.

r? @brandur-stripe

@brandur-stripe (Contributor) left a comment

Awesome! Left a couple minor comments, but looking great. Thanks for the updates!

@@ -779,6 +835,9 @@ const minNetworkRetriesDelay = 500 * time.Millisecond

const uploadsURL = "https://uploads.stripe.com"

// The number of requestMetric objects to buffer for client telemetry.
const telemetryBufferSize = 16
@brandur-stripe (Contributor) commented Jan 17, 2019

Can you move this up a couple lines? (The constant names are ordered alphabetically so let's try to keep them that way.)


Also, would you mind amending the comment to mention that additional objects will be dropped if the buffer is full? It'll make it extra clear what the lib does when the buffer is full (as opposed to other behaviors like growing the buffer, or sending metrics synchronously, etc.)

(I know you mention that at the site where the channel is actually written to, but I think it'd be helpful to have here as well).

}

// If the metrics buffer is full, discard the new metrics. Otherwise, add
// them to the buffer.
Contributor

Thanks for the comment!

stripe.go Outdated
// to Stripe in subsequent requests via the `X-Stripe-Client-Telemetry` header.
//
// Defaults to false.
EnableTelemetry bool
Contributor

Likewise, can you put EnableTelemetry into the right place alphabetically?

stripe.go Outdated
@@ -338,8 +371,14 @@ func (s *BackendImplementation) Do(req *http.Request, body *bytes.Buffer, v inte
}
}

// `requestStart` is used solely for client telemetry and, unlike `start`,
// does not account for the time spent building the request body.
requestStart := time.Now()
Contributor

Can we just use start? All that's happening between the two is some structs being initialized locally. In practice it's likely to have a negligible effect on timing, and I don't think it's worth complicating the code over.

stripe.go Outdated
s.Logger.Printf("Unable to encode client telemetry: %s", err)
}
default:
// no metrics available, ignore.

nit: might be worth amending this comment to point out that this default case needs to be here to enable a non-blocking receive on the channel. Wouldn't want some over-zealous refactorer to remove it, thinking that it was an innocuous block.

},
).(*stripe.BackendImplementation)

for i := 0; i < 2; i++ {

It's initially unclear why this loop executes twice. Can you add a quick comment that explains that it's because the telemetry comes from the previous completed request/response?

message := "Hello, client."
requestNum := 0

testServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {

Great test. 👍 I love Go's httptest package!

requestNum++

telemetryStr := r.Header.Get("X-Stripe-Client-Telemetry")
switch requestNum {

👍
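
For reference, the shape of the server-side assertions being discussed is roughly the following. This is illustrative only: the JSON struct tags and response body are assumptions, and driving the two requests through a telemetry-enabled backend (GetBackendWithConfig with EnableTelemetry: true and URL: testServer.URL) is left to the actual test:

requestNum := 0
testServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	requestNum++
	telemetryStr := r.Header.Get("X-Stripe-Client-Telemetry")
	switch requestNum {
	case 1:
		// Nothing has been recorded yet, so the first request carries no telemetry.
		assert.Equal(t, "", telemetryStr)
	case 2:
		// The second request reports metrics recorded for the first one.
		assert.True(t, len(telemetryStr) > 0, "telemetryStr should not be empty")
		var telemetry struct {
			LastRequestMetrics struct {
				RequestID         string `json:"request_id"`
				RequestDurationMS int    `json:"request_duration_ms"`
			} `json:"last_request_metrics"`
		}
		assert.NoError(t, json.Unmarshal([]byte(telemetryStr), &telemetry))
	}
	w.Write([]byte(`{}`))
}))
defer testServer.Close()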

@dcarney commented Jan 17, 2019

@jameshageman-stripe This is looking great! Most of my comments were small little nitpicks about comments.

I think the general concept of using the simple buffered channel is the way to go, and led to some very clean code. We can always revisit this if we decide to do something more sophisticated.

@jameshageman-stripe (Contributor Author)

@dcarney @brandur-stripe Thanks for your feedback! I think I've addressed all of the comments, and I just added another assertion that checks that we actually measure request duration.

Let me know if you see anything else you'd like to discuss :)

@brandur-stripe (Contributor)

Awesome — LGTM. Thanks for all the hard work here James!

@brandur-stripe brandur-stripe merged commit d543f9d into stripe:master Jan 17, 2019
@brandur-stripe (Contributor)

Released as 55.12.0.
