-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite metrics ingestion path and integrate with just the central veneur. #578
base: master
Are you sure you want to change the base?
Conversation
7b14fb2
to
b6d88cc
Compare
1678923
to
3833aae
Compare
@asf-stripe feedback make sure to preserve e2e local -> central forwarding test. |
3833aae
to
8d6d344
Compare
Gerald Rule: Copy Observability on Veneur, Unilog, Falconer pull requestscc @stripe/observability |
@clin-stripe did you intend to assign me to this PR? What can I do to help here? |
@aubrey-stripe You mentioned at the falalfel that you'd be interested in reviewing so I thought I'd give you a heads up! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've skimmed this, left comments in places that seemed to need comments, and have a few wider-angle questions (besides the failing tests, what's up with that?):
-
I see your work is pulling the
Hostname
property out from the metrics' tags; that's cool, but I am wondering what we're going to do with the other tags that we currently use to identify metrics from particular subsets of hosts (e.g. ourhost_type
andhost_az
tag); or let's say I'm confused how theHostname
becomes more important than the others, and how that will relate to what we're going to do when we switch the default on Mixed scope Histograms (I would love for us to not reporthost_az
andhost_type
on the global histograms, since those are always going to be lies, and at least the AZ tag adds cardinality that we really don't need). -
There's a comment I left in there about perf-sensitivity; I'd like to make sure we're not accidentally regressing, so do you mind adding benchmarks / running existing ones to compare how local veneur perf will be affected?
Other than that, I think this has the right shape, let's polish it!
// TestE2EFlushingIngester tests the integration of the import endpoint with | ||
// the flushing ingester. | ||
func TestE2EFlushingIngester(t *testing.T) { | ||
for _, tc := range testE2EFlushingCases { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yay table-driven tests!
importsrv/server_test.go
Outdated
// } | ||
// }) | ||
// } | ||
//} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are these for? If they're intended to be useful, please uncomment, otherwise delete (:
// PROTOBUF | ||
// | ||
|
||
func pbmetrics(ms ...*metricpb.Metric) []*metricpb.Metric { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
heh, that's a neat trick for avoiding the []*metricpb.Metric{ ... }
syntax grossness.
} | ||
} | ||
|
||
func OptLogger(logger *logrus.Logger) ingesterOption { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Believe this is unused, but should be used in server.go, like the OptTraceClient
function below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what happens when you use a nil logger
either... does it work, or does it die?
return ing | ||
} | ||
|
||
// TODO(clin): This needs to take ctx. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, would it expand scope too much to put it in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll introduce this in the next PR for adding the ingestor interface for local veneur. This will need context since we'll be making RPCs in that interface.
type aggWorker struct { | ||
samplers samplerEnvelope | ||
|
||
inC chan Metric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I navigated this code correctly, this is a fairly perf-critical bit; we recently (last year? @cory-stripe would know) spent a bunch of time making most channels take pointers to avoid copying. I'm not necessarily saying this should be a pointer channel, but I think I'd like to see a benchmark to ensure we're not regressing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will write benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@asf-stripe I wrote a benchmark and compared a pointer Metric to non-pointer metric in the aggWorker.Ingest() path, but there was no performance difference. On my machine, both clocked in at about 680ns/op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it wasn't always obvious when go preferred a pointer to avoid copying. We tried it as a blanket thing and didn't see much difference in some cases.
metricingester/obs.go
Outdated
if span, ok := opentracing.SpanFromContext(ctx).(*trace.Span); ok { | ||
return log. | ||
WithField("trace_id", span.TraceID). | ||
WithField("span_id", span.SpanID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks interesting, what is it for / how does it look when logging? If it doesn't print hex numbers yet, you might want to make it do that - that's how LS presents them at least, so should make it easier to cross-link between tools.
metricingester/sinkflusher.go
Outdated
|
||
tags := map[string]string{"part": "post"} | ||
for _, sinkInstance := range s.sinks { | ||
// TODO(clin): Add back ms once we finalize the ms client pull request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's ms
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's supposed to say metrics which are back in, will remove comment
"github.com/stripe/veneur/trace" | ||
) | ||
|
||
type ChannelMetricSink struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat! I think that dedups a bunch of code from our tests (:
Missed this comment when we went over this together.
Hostname is special only because it's used to make grouping decisions. In the special case of MixedHistogram, we need to make sure that hostname is not part of the grouping decision. That is, two mixedhistograms with the same name and tags need to be merged together regardless of their hostname. host_set and host_type, by contrast, are tags that we currently group histograms by. Just to clarify, we're not planning on dropping mixedhistogram behavior in this change. It would be super nice to drop it of course ... but I think we need to do some work to migrate use cases for this first. Same with dropping host_az. |
30a0f36
to
b5404ab
Compare
y no merge y'all |
:'( |
To make this a bit easier to review, check out the commits tab. Things are roughly broken out step-by-step.
Review guide
Sorry this is like 2000 lines. Most of it is testing, altho I get that that doesn't make it easier to review. FWIW, I asked @asf-stripe to look at this halfway through and got the thumbs up to move forward.
If you're not sure how to get started, I suggest going in this order:
Summary
This PR does the following:
In other words, this is a non-breaking change. But it will allow local veneurs to forward all metrics to the central veneur.
Architecture
The new metrics ingester does everything the old one does but a bit cleaned up.
AggregatingIngestor is the entry point (aggingestor.go). Through the Ingest and Merge methods, the AggregatingIngestor shards metrics across a series of workers. At a set interval, it flushes the aggregated metrics out of the workers and feeds them into a flusher (see below).
AggregatingWorkers (aggworker.go) take in metrics and feed them into samplers, doing all the work of metrics aggregation. They work as separate goroutines to facilitate parallelism.
sinkflusher.go contains the code to flush metrics to sinks. Given a set of samplers, the sinkflusher generates raw metrics and sinks them to sinks.
Testing
I wrote a testing framework that allows us to express tests as table tests. So while there's only a few top level tests, we actually capture about 20 different test cases in TestE2EFlushingIngester, TestMixedHistoMerge, and TestMixedHistoSample.
We could use additional coverage at the ingestor level so that we're not just testing import-path cases or depend on the import path for testing. This will come in a later PR (unless y'all think we should have it now).
We also have the benefit of the old e2e test coverage because those tests should be using the new path as well.
Other things to note
Mixed histogram stuff
I wrote a new sampler, MixedHistogramSampler, to capture the mixed histo behavior of "percentiles are aggregated globally, min max count are emitted with host tags". There's a bunch of testing around this.
In order for things not to break as we roll out, it's necessary for us to not change the behavior of MixedScope histograms. Therefore, the behavior for the central veneur is to not emit mixed min/max/counts unless a metric is imported with a new scope, MixedGlobal.
Design doc
https://docs.google.com/document/d/1JXHLj0VI1nIiRcNLu3MXvy_qt2V38yvjj6ejW5oa4dA/edit#
r? @aditya-stripe @ChimeraCoder
Motivation
(stripe internal)
https://docs.google.com/document/d/1C1IBh7AbWrHRkAJBbUHM0jx0XKg2lo3ISmffUrzaWT4/edit#