Rewrite metrics ingestion path and integrate with just the central veneur. #578

clin-stripe · 2018-11-08T02:22:20Z

To make this a bit easier to review, check out the commits tab. Things are roughly broken out step-by-step.

Review guide

Sorry this is about 2000 lines. Most of it is tests, although I get that that doesn't make it easier to review. FWIW, I asked @asf-stripe to look at this halfway through and got the thumbs up to move forward.

If you're not sure how to get started, I suggest going in this order:

Review the main business logic: Start with aggingestor.go, go to aggworker.go, go to sinkflusher.go.
Review the mixed histogram implementation if you're familiar with this: go to mixedhisto.go
Review the test framework and maybe skim the test cases (this is most of the lines).

Summary

This PR does the following:

Point the gRPC import server at the new code.
Build out a metricsingester that works for the way central veneurs currently work: metrics aggregation is sharded across several workers and a flush function runs every so often to collect metrics and send them to sinks.

In other words, this is a non-breaking change. But it will allow local veneurs to forward all metrics to the central veneur.

Architecture

The new metrics ingester does everything the old one does but a bit cleaned up.

AggregatingIngestor is the entry point (aggingestor.go). Through the Ingest and Merge methods, the AggregatingIngestor shards metrics across a series of workers. At a set interval, it flushes the aggregated metrics out of the workers and feeds them into a flusher (see below).
AggregatingWorkers (aggworker.go) take in metrics and feed them into samplers, doing all the work of metrics aggregation. They work as separate goroutines to facilitate parallelism.
sinkflusher.go contains the code to flush metrics to sinks. Given a set of samplers, the sinkflusher generates raw metrics and sends them to the sinks.
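
For readers new to the layout, here is a minimal sketch of the shard-then-flush shape described above. It is not the code in this PR: the Metric type, the worker and ingester names, and the FNV hash on the metric name are all assumptions made for illustration, and each worker is guarded by a mutex here instead of running as its own goroutine, purely to keep the example short.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// Metric is a hypothetical, simplified stand-in for the ingester's metric type.
type Metric struct {
	Name  string
	Value float64
}

// worker aggregates counter totals for the metric names that hash to it.
type worker struct {
	mu     sync.Mutex
	totals map[string]float64
}

func (w *worker) ingest(m Metric) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.totals[m.Name] += m.Value
}

// flush hands back the aggregated state and resets the worker.
func (w *worker) flush() map[string]float64 {
	w.mu.Lock()
	defer w.mu.Unlock()
	out := w.totals
	w.totals = make(map[string]float64)
	return out
}

// ingester shards metrics across a fixed set of workers.
type ingester struct{ workers []*worker }

func newIngester(n int) *ingester {
	ing := &ingester{}
	for i := 0; i < n; i++ {
		ing.workers = append(ing.workers, &worker{totals: make(map[string]float64)})
	}
	return ing
}

// Ingest routes a metric to a worker by hashing its name, so the same
// metric always lands on the same worker.
func (ing *ingester) Ingest(m Metric) {
	h := fnv.New32a()
	h.Write([]byte(m.Name))
	w := ing.workers[h.Sum32()%uint32(len(ing.workers))]
	w.ingest(m)
}

// Flush drains every worker and merges the results for the flusher.
func (ing *ingester) Flush() map[string]float64 {
	merged := make(map[string]float64)
	for _, w := range ing.workers {
		for name, v := range w.flush() {
			merged[name] += v
		}
	}
	return merged
}

func main() {
	ing := newIngester(4)
	ing.Ingest(Metric{Name: "requests", Value: 1})
	ing.Ingest(Metric{Name: "requests", Value: 2})
	ing.Ingest(Metric{Name: "errors", Value: 1})
	fmt.Println(ing.Flush())
}
```

In the real ingester the flush would be driven by a ticker at the configured interval; the sketch calls Flush by hand.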

Testing

I wrote a testing framework that lets us express tests as table tests. So while there are only a few top-level tests, we actually cover about 20 different test cases in TestE2EFlushingIngester, TestMixedHistoMerge, and TestMixedHistoSample.

We could use additional coverage at the ingestor level so that we're not only testing import-path cases or depending on the import path for testing. This will come in a later PR (unless y'all think we should have it now).

We also have the benefit of the old e2e test coverage because those tests should be using the new path as well.

Other things to note

Mixed histogram stuff

I wrote a new sampler, MixedHistogramSampler, to capture the mixed histo behavior of "percentiles are aggregated globally, min max count are emitted with host tags". There's a bunch of testing around this.
In order for things not to break as we roll out, it's necessary for us to not change the behavior of MixedScope histograms. Therefore, the behavior for the central veneur is to not emit mixed min/max/counts unless a metric is imported with a new scope, MixedGlobal.
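
To make the mixed behavior concrete, here is a rough sketch of a sampler that keeps one global pool of samples for percentiles and separate min/max/count per host. The type and method names are hypothetical, and the exact sort-based nearest-rank percentile is only a stand-in for whatever sketch structure the real sampler uses.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// hostStats holds the aggregates that would be emitted with a host tag.
type hostStats struct {
	Min, Max float64
	Count    int
}

// mixedHisto is a hypothetical, simplified mixed histogram.
type mixedHisto struct {
	values []float64             // every sample, regardless of host
	hosts  map[string]*hostStats // min/max/count tracked per host
}

func newMixedHisto() *mixedHisto {
	return &mixedHisto{hosts: make(map[string]*hostStats)}
}

// Sample records a value in the global pool and in the host's stats.
func (h *mixedHisto) Sample(host string, v float64) {
	h.values = append(h.values, v)
	s, ok := h.hosts[host]
	if !ok {
		s = &hostStats{Min: math.Inf(1), Max: math.Inf(-1)}
		h.hosts[host] = s
	}
	s.Min = math.Min(s.Min, v)
	s.Max = math.Max(s.Max, v)
	s.Count++
}

// Percentile returns the nearest-rank percentile p (0 < p <= 100)
// computed over samples from all hosts.
func (h *mixedHisto) Percentile(p float64) float64 {
	if len(h.values) == 0 {
		return math.NaN()
	}
	sorted := append([]float64(nil), h.values...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	h := newMixedHisto()
	h.Sample("host-a", 1)
	h.Sample("host-a", 3)
	h.Sample("host-b", 10)
	fmt.Println("global p50:", h.Percentile(50))
	for host, s := range h.hosts {
		fmt.Println(host, s.Min, s.Max, s.Count)
	}
}
```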

Design doc

https://docs.google.com/document/d/1JXHLj0VI1nIiRcNLu3MXvy_qt2V38yvjj6ejW5oa4dA/edit#

r? @aditya-stripe @ChimeraCoder

Motivation

(stripe internal)
https://docs.google.com/document/d/1C1IBh7AbWrHRkAJBbUHM0jx0XKg2lo3ISmffUrzaWT4/edit#

clin-stripe · 2018-11-14T18:04:54Z

@asf-stripe feedback: make sure to preserve the e2e local -> central forwarding test.

stripe-ci · 2018-11-20T02:46:58Z

Gerald Rule: Copy Observability on Veneur, Unilog, Falconer pull requests

cc @stripe/observability
cc @stripe/observability-stripe

aubrey-stripe · 2018-11-20T19:09:32Z

@clin-stripe did you intend to assign me to this PR? What can I do to help here?

clin-stripe · 2018-11-20T20:00:36Z

@aubrey-stripe You mentioned at the falafel that you'd be interested in reviewing so I thought I'd give you a heads up!

asf-stripe

OK, I've skimmed this, left comments in places that seemed to need comments, and have a few wider-angle questions (besides the failing tests, what's up with that?):

I see your work is pulling the Hostname property out from the metrics' tags; that's cool, but I am wondering what we're going to do with the other tags that we currently use to identify metrics from particular subsets of hosts (e.g. our host_type and host_az tag); or let's say I'm confused how the Hostname becomes more important than the others, and how that will relate to what we're going to do when we switch the default on Mixed scope Histograms (I would love for us to not report host_az and host_type on the global histograms, since those are always going to be lies, and at least the AZ tag adds cardinality that we really don't need).
There's a comment I left in there about perf-sensitivity; I'd like to make sure we're not accidentally regressing, so do you mind adding benchmarks / running existing ones to compare how local veneur perf will be affected?

Other than that, I think this has the right shape, let's polish it!

asf-stripe · 2018-11-21T17:10:10Z

importsrv/server_test.go

+// TestE2EFlushingIngester tests the integration of the import endpoint with
+// the flushing ingester.
+func TestE2EFlushingIngester(t *testing.T) {
+	for _, tc := range testE2EFlushingCases {


yay table-driven tests!

asf-stripe · 2018-11-21T17:14:44Z

importsrv/server_test.go

+//			}
+//		})
+//	}
+//}


What are these for? If they're intended to be useful, please uncomment, otherwise delete (:

asf-stripe · 2018-11-21T17:16:04Z

importsrv/testtools_test.go

+// PROTOBUF
+//
+
+func pbmetrics(ms ...*metricpb.Metric) []*metricpb.Metric {


heh, that's a neat trick for avoiding the []*metricpb.Metric{ ... } syntax grossness.
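
For anyone who hasn't seen the trick: a variadic function that returns its own argument list gives you a slice without spelling out the slice literal. A minimal sketch, using a hypothetical Metric type in place of metricpb.Metric:

```go
package main

import "fmt"

// Metric is a hypothetical stand-in for metricpb.Metric.
type Metric struct{ Name string }

// metrics returns its variadic arguments as a slice, so call sites can
// write metrics(a, b) instead of []*Metric{a, b}.
func metrics(ms ...*Metric) []*Metric {
	return ms
}

func main() {
	got := metrics(&Metric{Name: "a"}, &Metric{Name: "b"})
	for _, m := range got {
		fmt.Println(m.Name)
	}
}
```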

asf-stripe · 2018-11-21T17:17:51Z

metricingester/aggingestor.go

+	}
+}
+
+func OptLogger(logger *logrus.Logger) ingesterOption {


Believe this is unused, but should be used in server.go, like the OptTraceClient function below.

I'm not sure what happens when you use a nil logger either... does it work, or does it die?

asf-stripe · 2018-11-21T17:20:11Z

metricingester/aggingestor.go

+	return ing
+}
+
+// TODO(clin): This needs to take ctx.


Agreed, would it expand scope too much to put it in?

clin-stripe · 2018-11-21

I'll introduce this in the next PR, which adds the ingestor interface for local veneur. That interface will need a context since we'll be making RPCs in it.

asf-stripe · 2018-11-21T17:21:51Z

metricingester/aggworker.go

+type aggWorker struct {
+	samplers samplerEnvelope
+
+	inC    chan Metric


If I navigated this code correctly, this is a fairly perf-critical bit; we recently (last year? @cory-stripe would know) spent a bunch of time making most channels take pointers to avoid copying. I'm not necessarily saying this should be a pointer channel, but I think I'd like to see a benchmark to ensure we're not regressing.

clin-stripe · 2018-11-21

Will write a benchmark.

clin-stripe · 2018-11-22

@asf-stripe I wrote a benchmark comparing a pointer Metric to a non-pointer Metric in the aggWorker.Ingest() path, but there was no performance difference. On my machine, both clocked in at about 680ns/op.

cory-stripe · 2018-11-26

Yeah, it wasn't always obvious when Go preferred a pointer to avoid copying. We tried it as a blanket thing and didn't see much difference in some cases.
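
For reference, a comparison along these lines can be sketched as below. This is not the benchmark added in the PR: the Metric struct is a hypothetical, smaller stand-in, and the loop only measures a buffered channel send and receive, not the full aggWorker.Ingest() path. Numbers will vary by machine and struct size.

```go
package main

import (
	"fmt"
	"testing"
)

// Metric is a hypothetical stand-in; the real type has different fields.
type Metric struct {
	Name     string
	Tags     []string
	Value    float64
	Hostname string
}

// benchValue sends the struct by value, copying it on every send.
func benchValue(b *testing.B) {
	ch := make(chan Metric, 1)
	m := Metric{Name: "a.b.c", Tags: []string{"foo:bar"}, Value: 1}
	for i := 0; i < b.N; i++ {
		ch <- m
		<-ch
	}
}

// benchPointer sends only a pointer to the struct.
func benchPointer(b *testing.B) {
	ch := make(chan *Metric, 1)
	m := &Metric{Name: "a.b.c", Tags: []string{"foo:bar"}, Value: 1}
	for i := 0; i < b.N; i++ {
		ch <- m
		<-ch
	}
}

func main() {
	// testing.Benchmark lets the comparison run outside of `go test`.
	fmt.Println("value:  ", testing.Benchmark(benchValue))
	fmt.Println("pointer:", testing.Benchmark(benchPointer))
}
```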

asf-stripe · 2018-11-21T17:24:02Z

metricingester/obs.go

+	if span, ok := opentracing.SpanFromContext(ctx).(*trace.Span); ok {
+		return log.
+			WithField("trace_id", span.TraceID).
+			WithField("span_id", span.SpanID)


This looks interesting, what is it for / how does it look when logging? If it doesn't print hex numbers yet, you might want to make it do that - that's how LS presents them at least, so should make it easier to cross-link between tools.
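
If the IDs end up needing hex formatting, something like the following would do it. The sketch assumes the trace and span IDs are int64 values; converting to uint64 first keeps negative IDs from being printed with a minus sign. The hexID helper is hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
)

// hexID renders a 64-bit ID as lowercase hex. IDs are assumed to be
// int64; the uint64 conversion avoids a leading "-" for negative values.
func hexID(id int64) string {
	return strconv.FormatUint(uint64(id), 16)
}

func main() {
	fmt.Println("trace_id:", hexID(255))
	fmt.Println("span_id: ", hexID(-1))
}
```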

asf-stripe · 2018-11-21T17:25:00Z

metricingester/sinkflusher.go

+
+	tags := map[string]string{"part": "post"}
+	for _, sinkInstance := range s.sinks {
+		// TODO(clin): Add back ms once we finalize the ms client pull request.


What's ms?

clin-stripe · 2018-11-21

It's supposed to say "metrics", which are back in. Will remove the comment.

asf-stripe · 2018-11-21T17:28:19Z

sinks/channel/channel.go

+	"github.com/stripe/veneur/trace"
+)
+
+type ChannelMetricSink struct {


Neat! I think that dedups a bunch of code from our tests (:
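
The general idea of a channel-backed sink for tests can be sketched as follows. The Metric type, the channelSink name, and the single-argument Flush are simplifications made for illustration and do not match the real sink interface.

```go
package main

import "fmt"

// Metric is a hypothetical, simplified flushed metric.
type Metric struct {
	Name  string
	Value float64
}

// channelSink forwards every flushed batch to a channel so a test can
// receive it and assert on the contents.
type channelSink struct{ out chan []Metric }

func newChannelSink(buffer int) *channelSink {
	return &channelSink{out: make(chan []Metric, buffer)}
}

// Flush hands the batch to whoever is reading from the channel.
func (s *channelSink) Flush(ms []Metric) {
	s.out <- ms
}

func main() {
	sink := newChannelSink(1)
	sink.Flush([]Metric{{Name: "requests", Value: 3}})
	got := <-sink.out
	fmt.Println(got[0].Name, got[0].Value)
}
```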

clin-stripe · 2018-11-22T19:00:13Z

Missed this comment when we went over this together.

> I see your work is pulling the Hostname property out from the metrics' tags; that's cool, but I am wondering what we're going to do with the other tags that we currently use to identify metrics from particular subsets of hosts (e.g. our host_type and host_az tag); or let's say I'm confused how the Hostname becomes more important than the others, and how that will relate to what we're going to do when we switch the default on Mixed scope Histograms (I would love for us to not report host_az and host_type on the global histograms, since those are always going to be lies, and at least the AZ tag adds cardinality that we really don't need).

Hostname is special only because it's used to make grouping decisions. In the special case of MixedHistogram, we need to make sure that hostname is not part of the grouping decision. That is, two mixedhistograms with the same name and tags need to be merged together regardless of their hostname. host_set and host_type, by contrast, are tags that we currently group histograms by.

Just to clarify, we're not planning on dropping mixedhistogram behavior in this change. It would be super nice to drop it of course ... but I think we need to do some work to migrate use cases for this first. Same with dropping host_az.
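
To illustrate the grouping point, here is a sketch of a grouping key that includes the hostname for ordinary metrics but leaves it out for mixed histograms. The Metric fields, the Mixed flag, and the key format are all hypothetical; the real ingester may key its samplers differently.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Metric is a hypothetical stand-in with only the fields grouping needs.
type Metric struct {
	Name     string
	Hostname string
	Tags     []string
	Mixed    bool // true for mixed histograms
}

// groupKey builds the key used to decide which samples merge together.
// Tags are sorted so tag order does not affect grouping. The hostname
// is left out for mixed histograms so they merge across hosts.
func groupKey(m Metric) string {
	tags := append([]string(nil), m.Tags...)
	sort.Strings(tags)
	key := m.Name + "|" + strings.Join(tags, ",")
	if !m.Mixed {
		key += "|" + m.Hostname
	}
	return key
}

func main() {
	a := Metric{Name: "latency", Hostname: "host-a", Tags: []string{"host_type:api"}, Mixed: true}
	b := Metric{Name: "latency", Hostname: "host-b", Tags: []string{"host_type:api"}, Mixed: true}
	fmt.Println(groupKey(a) == groupKey(b))
}
```

Tags such as host_type still take part in the key, which matches the point above that histograms are currently grouped by them.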

clin88 · 2019-12-10T01:46:29Z

y no merge y'all

clin88 · 2019-12-10T01:46:35Z

:'(

stripe-ci assigned aditya-stripe Nov 8, 2018

Refactor gRPC import to call the new ingester interface.

ec1ab2e

clin-stripe force-pushed the clin/central-veneur-take-local branch from 7b14fb2 to b6d88cc Compare November 8, 2018 02:26

clin-stripe added 3 commits November 10, 2018 11:47

Implement ingester and worker.

0df42d1

Implement flusher.

f255e56

Move channel sink into a different package so it can be exported.

89a7ac9

clin-stripe force-pushed the clin/central-veneur-take-local branch from 1678923 to 3833aae Compare November 10, 2018 21:37

asf-stripe self-assigned this Nov 12, 2018

Add a testing framework and an end to end test. Add wiring until test compiles.

bec6df3

clin-stripe added 5 commits November 15, 2018 11:29

Add a mixedhistogram sampler and tests.

481c946

Add supporting for mixed histogram merging + tests.

3af1f3e

Integrate the mixed histogram sampler.

9fd6d5b

Integrate sets and thread hostname through everything.

3251872

Add support for a mixedglobal scope to facilitate migrating leaf veneurs.

8d6d344

clin-stripe force-pushed the clin/central-veneur-take-local branch from 3833aae to 8d6d344 Compare November 20, 2018 00:08

clin-stripe added 4 commits November 19, 2018 16:26

Allow host tag override in signalfx.

a20128c

Add back instrumentation.

591a5f9

Minor cleanups: add arbitrary sinks to server initialization and fix imports.

5b7bcb8

Fix race condition due to unthreadsafe trace client.

5225813

clin-stripe changed the title ~~WIP Migrate metrics ingestion into new package and have global ingester call it.~~ Rewrite metrics ingestion path and integrate with just the central veneur. Nov 20, 2018

clin-stripe requested a review from asf-stripe November 20, 2018 02:47

clin-stripe assigned aubrey-stripe Nov 20, 2018

asf-stripe suggested changes Nov 21, 2018

stripe-ci unassigned asf-stripe, aditya-stripe and aubrey-stripe Nov 21, 2018

stripe-ci assigned clin-stripe Nov 21, 2018

clin-stripe added 3 commits November 22, 2018 11:01

Remove unnecessary hostname passing to mixedhisto.

7571656

Code review feedback.

a9ee9de

Add benchmark.

b5404ab

clin-stripe force-pushed the clin/central-veneur-take-local branch from 30a0f36 to b5404ab Compare November 23, 2018 22:15
