
Add Stripe client telemetry to request headers #766

Merged
merged 18 commits into stripe:master on Jan 17, 2019

Conversation

@jameshageman-stripe (Contributor) commented Jan 11, 2019

Follows stripe/stripe-ruby#696, stripe/stripe-php#549, and stripe/stripe-python#518 in adding telemetry metadata to request headers.

The telemetry is disabled by default, and can be enabled per-client like so:

backend := stripe.GetBackendWithConfig(
	stripe.APIBackend,
	&stripe.BackendConfig{
		URL: "https://api.stripe.com/v1",
		EnableTelemetry: true,
	},
)

or globally by setting

stripe.EnableTelemetry = true

cc @dcarney-stripe @bobby-stripe

@brandur-stripe (Contributor) left a comment

Hey @jameshageman-stripe, mostly looks great. Thanks!

We try to have this sort of option configurable at the backend level, though, so that you can turn it on without changing global state. Could you tweak this to copy the model established by, say, Logger: there's a global option, but it's also in BackendConfig, so a user can request it with GetBackendWithConfig without touching global state?

ptal @jameshageman-stripe
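
For illustration, the Logger-style setup being asked for could look roughly like this. The telemetryEnabled helper and its resolution logic are hypothetical sketches, not code from the PR:

// BackendConfig (excerpt): a per-backend flag alongside the global
// stripe.EnableTelemetry, mirroring how Logger can be configured either
// globally or per backend.
type BackendConfig struct {
	// ... the existing fields (URL, etc.)

	// EnableTelemetry enables telemetry for this backend only.
	EnableTelemetry bool
}

// telemetryEnabled is a hypothetical helper: telemetry is on if either
// the global flag or the backend's own configuration requests it.
func telemetryEnabled(cfg *BackendConfig) bool {
	return EnableTelemetry || cfg.EnableTelemetry
}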

stripe.go Outdated
@@ -158,6 +169,7 @@ type BackendImplementation struct {
//
// See also SetNetworkRetriesSleep.
networkRetriesSleep bool
lastRequestMetrics *RequestMetrics
Contributor

Because networkRetriesSleep above has its own comment, could you just add a newline above this one just to show that it's not part of the same group?

stripe.go Outdated
@@ -140,6 +144,13 @@ type BackendConfig struct {
URL string
}

// RequestMetrics contains the payload sent in the `X-Stripe-Client-Telemetry`
// header when stripe.EnableTelemetry = true.
type RequestMetrics struct {
Contributor

Let's call this requestMetrics instead so that it's not exported outside the package (whether the first letter is lower case or upper case determines this).

Contributor Author

I initially did make RequestMetrics private; however, I wanted the tests to ensure that I could unmarshal the sent metrics back into this struct. Do you think it's OK to omit that check? Otherwise, I could duplicate the requestMetrics definition in stripe_test.go for the sake of unmarshaling.

@brandur-stripe (Contributor) commented Jan 14, 2019

Oops, sorry I missed this.

I actually forgot that stripe_test.go is in its own stripe_test package. IMO, we should probably just put it in stripe. All the tests in subpackages are just in the same namespace as their accompanying file (customer/, charge/, form/, etc.) and even most of the *_test.go in the top-level package are as well.

stripe.go Outdated
@@ -292,8 +304,15 @@ func (s *BackendImplementation) Do(req *http.Request, body *bytes.Buffer, v inte
s.Logger.Printf("Requesting %v %v%v\n", req.Method, req.URL.Host, req.URL.Path)
}

if EnableTelemetry && s.lastRequestMetrics != nil {
metricsJSON, _ := json.Marshal(s.lastRequestMetrics)
Contributor

This one is unlikely to happen, but let's not swallow errors, just in case we introduce something in the future that becomes difficult to debug because we're throwing a message away. Observe err and handle it if it's non-nil, like elsewhere.

stripe.go Outdated
@@ -292,8 +304,15 @@ func (s *BackendImplementation) Do(req *http.Request, body *bytes.Buffer, v inte
s.Logger.Printf("Requesting %v %v%v\n", req.Method, req.URL.Host, req.URL.Path)
}

if EnableTelemetry && s.lastRequestMetrics != nil {
metricsJSON, _ := json.Marshal(s.lastRequestMetrics)
payload := fmt.Sprintf(`{"last_request_metrics":%s}`, metricsJSON)
Contributor

Just for cleanliness, could you create a wrapper struct for this one instead of manually assembling the JSON string (maybe requestTelemetry)?
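For example, a wrapper along these lines (a sketch only; the json tag mirrors the hand-assembled `last_request_metrics` key, the struct uses the lowercased requestMetrics name from the earlier comment, and the error handling follows the previous review note):

// requestTelemetry wraps the metrics of the previous request and is
// marshaled into the `X-Stripe-Client-Telemetry` header.
type requestTelemetry struct {
	LastRequestMetrics requestMetrics `json:"last_request_metrics"`
}

// ... inside Do(), instead of assembling the JSON string by hand:
telemetry := requestTelemetry{LastRequestMetrics: *s.lastRequestMetrics}
metricsJSON, err := json.Marshal(telemetry)
if err == nil {
	req.Header.Set("X-Stripe-Client-Telemetry", string(metricsJSON))
} else {
	s.Logger.Printf("Unable to encode client telemetry: %s", err)
}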

@jameshageman-stripe (Contributor Author)

@brandur-stripe updated! Left one unresolved comment re: RequestMetrics visibility.

@jameshageman-stripe (Contributor Author)

cc @akropp-stripe

@brandur-stripe (Contributor)

Thanks for the updates!

And oops — unfortunately I overlooked something major on the last review, which is that, as implemented, this is extremely unsafe from a concurrency perspective (which is fairly important for Go). As multiple Goroutines make requests, they'll all be vying to write lastRequestMetrics. Similarly, as new requests are being issued, what's in lastRequestMetrics at any given moment is anybody's guess. Wrapping accesses in a mutex isn't enough, because even with that, the values being read and written will be highly volatile.

It's not super obvious to me how to solve this one well (maybe a bounded queue?), but we'll need to rethink the approach.

@jameshageman-stripe (Contributor Author)

That's a good point! Just using a mutex around the value would cause metrics to be discarded if multiple calls finished at the same time. It doesn't seem like a huge deal if metrics are missed, but it would be nice to handle concurrency in some capacity.

Perhaps we could use a buffered channel to keep a finite amount of request metrics around (say, 16?), but only perform non-blocking writes to the channel and discard the metrics when the channel is full. Likewise, new requests can pull a requestMetrics struct off the channel if it is not empty. This would allow concurrently produced metrics to be kept around, without introducing any blocking calls and unbounded memory allocations. Let me know how that sounds to you.
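As a rough sketch of that pattern (names are placeholders; requestMetricsBuffer would be a buffered channel on the backend):

// Producer side, after a response: record the metrics unless the buffer
// is already full, in which case they are silently dropped so the
// client never blocks.
select {
case s.requestMetricsBuffer <- metrics:
default:
	// Buffer full; discard these metrics.
}

// Consumer side, before the next request: attach buffered metrics if
// any are available, without blocking when the buffer is empty.
select {
case m := <-s.requestMetricsBuffer:
	// m holds a previous request's metrics; marshal it into the
	// X-Stripe-Client-Telemetry header here.
	_ = m
default:
	// No metrics buffered; send the request without telemetry.
}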

While we're at it, I'd like to try to add a test that detects the existing data race.

@brandur-stripe (Contributor) commented Jan 14, 2019

Perhaps we could use a buffered channel to keep a finite amount of request metrics around (say, 16?), but only perform non-blocking writes to the channel and discard the metrics when the channel is full. Likewise, new requests can pull a requestMetrics struct off the channel if it is not empty. This would allow concurrently produced metrics to be kept around, without introducing any blocking calls and unbounded memory allocations. Let me know how that sounds to you.

Yep, that sounds good to me.

One minor tweak that might be worth considering: a circular buffer would probably be slightly better here than a simple buffered channel because in the case where it's full, we'd favor more recent data points instead of older data points (which seems appropriate). That said, probably not worth adding a dependency for.

@jameshageman-stripe (Contributor Author)

@brandur I think we could accomplish something similar in pure Go without a dependency. On write, if the buffer is full, we could spin up a goroutine to push an element into the buffer and simultaneously pop the oldest element off. It introduces an ephemeral goroutine, but it would allow more recent metrics to have priority.

Something like this:

// after a request is made in BackendImplementation#Do()

metrics := requestMetrics{
	RequestID:         reqID,
	RequestDurationMS: requestDurationMS,
}

select {
case s.prevRequestMetrics <- metrics:
	// Non-blocking insert succeeded, buffer was not full.
default:
	// Buffer was full, so pop off the oldest value and insert a newer one.
	// This is done in a goroutine to prevent blocking the client.
	go func() {
		<-s.prevRequestMetrics
		s.prevRequestMetrics <- metrics
	}()
}

I'm not sure how I feel about spawning a whole new goroutine for this, so I'd like to know your thoughts.

@brandur-stripe (Contributor) commented Jan 14, 2019

I'm not sure how I feel about spawning a whole new goroutine for this, so I'd like to know your thoughts.

Nice idea!

I think though that you'd still have a potential race that might cause the new Goroutines to live longer than you'd want them to. These two lines are not guaranteed to run atomically:

<-s.prevRequestMetrics
s.prevRequestMetrics <- metrics

So after the ephemeral Goroutine pulls a value out of the channel, another Goroutine could sneak in and push its own value in, which would leave the ephemeral Goroutine waiting on s.prevRequestMetrics <- metrics.

My general feeling is that it's a little heavy-handed spinning up ephemeral Goroutines for this anyway.

I was just Googling and found this article from Pivotal on implementing a channel-based ring buffer. It's actually very similar to what you suggested, except there's just a single Goroutine in charge of managing the ring, which avoids potential races. Honestly not sure if this is a good idea or not, but it seems pretty workable.
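In rough form, that single-goroutine ring buffer looks something like the following. This is a sketch adapted from the idea, not the article's or the PR's exact code; the inner non-blocking receive is an added guard in case a reader drains outputChannel between the two operations:

// ringBuffer favors recent values: one goroutine owns the transfer from
// inputChannel to outputChannel and drops the oldest buffered value
// when outputChannel is full.
type ringBuffer struct {
	inputChannel  chan requestMetrics
	outputChannel chan requestMetrics
}

func newRingBuffer(size int) *ringBuffer {
	rb := &ringBuffer{
		inputChannel:  make(chan requestMetrics),
		outputChannel: make(chan requestMetrics, size),
	}
	go rb.run()
	return rb
}

func (rb *ringBuffer) run() {
	for v := range rb.inputChannel {
		select {
		case rb.outputChannel <- v:
			// There was room in the output buffer.
		default:
			// Output buffer full: drop the oldest value. The receive is
			// non-blocking in case a concurrent reader already drained
			// the channel; since run() is the only writer, the send
			// below then always has room.
			select {
			case <-rb.outputChannel:
			default:
			}
			rb.outputChannel <- v
		}
	}
	close(rb.outputChannel)
}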

@jameshageman-stripe (Contributor Author)

Cool! Between the Pivotal channel-based ring buffer and using container/ring with mutexes, I'd opt for the channel solution.

I'll try it out in this PR and we can decide if we like it better than non-circular buffers.

@jameshageman-stripe (Contributor Author)

The circular buffer solution (added in 7f09c4a) was passing the normal tests, but it was failing unexpectedly on the coverage tests. It would fail on this assertion:

assert.True(t, len(telemetryStr) > 0, "telemetryStr should not be empty")

I think this indicates that the telemetry pushed into the inputChannel by the first request had not yet been passed to the outputChannel by the ringBuffer's run() goroutine.

However, this seems like it would be an unpredictable race condition based on the specific goroutine scheduling. Therefore I'm confused why it happened consistently on coverage tests but never on normal tests (in CI or on my laptop).

I'll try adding some extra logging to investigate.

@jameshageman-stripe (Contributor Author)

My guess was correct: there was a synchronization issue. My initial implementation of ringBuffer did asynchronous writes, so the first request's metrics hadn't necessarily been placed on the outputChannel by the time the second request was kicked off (it all depended on Go scheduling/time slicing).

I've added synchronous writes by having the client wait on a done channel, but I feel like this solution has become more complicated than simply wrapping the following in a mutex guard:

<-s.prevRequestMetrics
s.prevRequestMetrics <- metrics

I think I'd prefer to go back to a single-channel solution and add a mutex.

@jameshageman-stripe (Contributor Author)

@brandur Alright, I think this is ready for re-review. Since you last commented, the following has changed:

  • I implemented a circular buffer of requestMetrics using a single channel and a mutex.
  • EnableTelemetry can be passed into GetBackendWithConfig via the BackendConfig struct, which allows individual clients to have telemetry enabled.
  • I (re-)added a global stripe.EnableTelemetry flag that enables telemetry headers for all clients instead of a single Backend.
  • I did not change the package in stripe_test.go from stripe_test to stripe, because that change would end up adding a large diff to an already large PR. Instead, I duplicated the requestMetrics and requestTelemetry struct definitions in the one test that used them.

stripe.go Outdated
select {
case s.requestMetricsBuffer <- r:
default:
<-s.requestMetricsBuffer
@bobby-stripe commented Jan 16, 2019

@jameshageman-stripe I like how clever this is to avoid taking a lock when reading the buffer, but I think as-is there is a (remote but real) possibility of a deadlock:

goroutine 1:
locks requestMetricsMutex
attempt to write to requestMetricsBuffer, but it is full, fall through to the default case on line 982
goroutine 1 descheduled

goroutine 2..n:
read from requestMetricsBuffer until it is empty (because the user's application has issued a burst of API requests, or something)

goroutine 1:
rescheduled
attempt to read from <-s.requestMetricsBuffer, which will block

goroutine 1 is now blocked waiting for a metric to appear on the channel, but nobody else will be able to write to the channel as requestMetricsMutex is held (all incoming requests will now also deadlock, waiting on that mutex).

I think this can be addressed by changing:

<-s.requestMetricsBuffer

to

select {
case _ = <-s.requestMetricsBuffer:
default:
}

So that we don't do a blocking channel read with the mutex held (along with a big fat comment). That said, maybe it's worth considering whether this is too clever and we should unconditionally grab the mutex for all reads and writes of s.requestMetricsBuffer.

Unless I'm reading this wrong and there is nothing to worry about, which is possible!
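Put together, the guarded write path being suggested would look roughly like this (a sketch; the mutex and channel names follow the discussion above):

s.requestMetricsMutex.Lock()
select {
case s.requestMetricsBuffer <- r:
	// There was room in the buffer.
default:
	// Buffer full: drop the oldest metrics to make room. The receive is
	// also non-blocking, because a concurrent reader may have drained
	// the channel since the outer select, and a blocking receive here
	// would deadlock while the mutex is held.
	select {
	case <-s.requestMetricsBuffer:
	default:
	}
	s.requestMetricsBuffer <- r
}
s.requestMetricsMutex.Unlock()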

@jameshageman-stripe (Contributor Author)

@bobby-stripe @dcarney-stripe and I had some further discussion in person that I thought I'd share.

The circular buffer implementation has grown complex enough to warrant its extraction from stripe.go, due to the required coordination between multiple channels, goroutines, and/or mutexes, depending on the implementation. The complexity of implementing a safe, concurrent circular buffer led us to revisit the reasoning for using a circular buffer over a regular FIFO channel. Some points:

  • With a circular buffer, the oldest request metrics get dropped when the buffer is full
  • With a regular channel, the newest request metrics get dropped when the buffer is full
  • The buffer only fills up when there are concurrently executing requests
  • When we send a requestMetrics struct, it is already for a past request. The gap between recording a request metric and sending it could be on the order of minutes if the client is not sending many requests.
  • We do not expect or require delivery of metrics for every request
  • Using a circular buffer would reduce the average time between a request's metrics being recorded and sent, but it does not stop clients from potentially sending requestMetrics that are many minutes old.

With these points in mind, I think it's worth reconsidering using a standard FIFO channel instead of a circular buffer, as the benefit (slightly younger request metrics on average) may not outweigh the added complexity.

@brandur-stripe (Contributor)

With these points in mind, I think it's worth reconsidering using a standard FIFO channel instead of a circular buffer, as the benefit (slightly younger request metrics on average) may not outweigh the added complexity.

Yeah, fair enough.

I think the circular buffer would be a slightly better implementation, but things change given that it's not that easy. I totally buy the complexity argument, and even with thorough review, there's still a much higher chance of introducing a bug with our own buffer implementation than with just using a standard channel.

I'm +1 going back to FIFO channel if you guys are. Thanks for looking into it at least!

@jameshageman-stripe (Contributor Author)

Reverted back to a simple FIFO channel.

r? @brandur-stripe

@brandur-stripe (Contributor) left a comment

Awesome! Left a couple minor comments, but looking great. Thanks for the updates!

@@ -779,6 +835,9 @@ const minNetworkRetriesDelay = 500 * time.Millisecond

const uploadsURL = "https://uploads.stripe.com"

// The number of requestMetric objects to buffer for client telemetry.
const telemetryBufferSize = 16
@brandur-stripe (Contributor) commented Jan 17, 2019

Can you move this up a couple lines? (The constant names are ordered alphabetically so let's try to keep them that way.)


Also, would you mind amending the comment to mention that additional objects will be dropped if the buffer is full? It'll make it extra clear what the lib does when the buffer is full (as opposed to other behaviors like growing the buffer, or sending metrics synchronously, etc.)

(I know you mention that at the site where the channel is actually written to, but I think it'd be helpful to have here as well).

}

// If the metrics buffer is full, discard the new metrics. Otherwise, add
// them to the buffer.
Contributor

Thanks for the comment!

stripe.go Outdated
// to Stripe in subsequent requests via the `X-Stripe-Client-Telemetry` header.
//
// Defaults to false.
EnableTelemetry bool
Contributor

Likewise, can you put EnableTelemetry into the right place alphabetically?

stripe.go Outdated
@@ -338,8 +371,14 @@ func (s *BackendImplementation) Do(req *http.Request, body *bytes.Buffer, v inte
}
}

// `requestStart` is used solely for client telemetry and, unlike `start`,
// does not account for the time spent building the request body.
requestStart := time.Now()
Contributor

Can we just use start? All that's happening between the two is some structs being initialized locally. In practice it's likely to have a negligible effect on timing, and I don't think it's worth complicating the code over.

stripe.go Outdated
s.Logger.Printf("Unable to encode client telemetry: %s", err)
}
default:
// no metrics available, ignore.

nit: might be worth amending this comment to point out that this default case needs to be here to enable a non-blocking receive on the channel. Wouldn't want some over-zealous refactorer to remove it, thinking that it was an innocuous block.

},
).(*stripe.BackendImplementation)

for i := 0; i < 2; i++ {

It's initially unclear why this loop executes twice. Can you add a quick comment that explains that it's because the telemetry comes from the previous completed request/response?

message := "Hello, client."
requestNum := 0

testServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {

Great test. 👍 I love Go's httptest package!

requestNum++

telemetryStr := r.Header.Get("X-Stripe-Client-Telemetry")
switch requestNum {

👍
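
For reference, the shape of the server-side assertions being discussed is roughly the following. This is illustrative only: the JSON struct tags and response body are assumptions, and driving the two requests through a telemetry-enabled backend (GetBackendWithConfig with EnableTelemetry: true and URL: testServer.URL) is left to the actual test:

requestNum := 0
testServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	requestNum++
	telemetryStr := r.Header.Get("X-Stripe-Client-Telemetry")
	switch requestNum {
	case 1:
		// Nothing has been recorded yet, so the first request carries no telemetry.
		assert.Equal(t, "", telemetryStr)
	case 2:
		// The second request reports metrics recorded for the first one.
		assert.True(t, len(telemetryStr) > 0, "telemetryStr should not be empty")
		var telemetry struct {
			LastRequestMetrics struct {
				RequestID         string `json:"request_id"`
				RequestDurationMS int    `json:"request_duration_ms"`
			} `json:"last_request_metrics"`
		}
		assert.NoError(t, json.Unmarshal([]byte(telemetryStr), &telemetry))
	}
	w.Write([]byte(`{}`))
}))
defer testServer.Close()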

@dcarney commented Jan 17, 2019

@jameshageman-stripe This is looking great! Most of my comments were small little nitpicks about comments.

I think the general concept of using the simple buffered channel is the way to go, and led to some very clean code. We can always revisit this if we decide to do something more sophisticated.

@jameshageman-stripe (Contributor Author)

@dcarney @brandur-stripe Thanks for your feedback! I think I've addressed all of the comments, and I just added another assertion that checks that we actually measure request duration.

Let me know if you see anything else you'd like to discuss :)

@brandur-stripe (Contributor)

Awesome — LGTM. Thanks for all the hard work here James!

@brandur-stripe brandur-stripe merged commit d543f9d into stripe:master Jan 17, 2019
@brandur-stripe (Contributor)

Released as 55.12.0.
