Open census Span to View converter #1239
base: master
Conversation
nice, thanks. will take a look, need something like this. slightly relevant maybe - am generally slightly concerned about 'metrics overload' from this (a lot of these won't be all that useful), curious if you have any thoughts about being able to turn certain 'views' (as opencensus calls them) on/off on a running server.
Metric overload is a worry for sure. It's easy to make them, but sometimes hard to track down where they come from. I would say that if this creates a metric that is useless, then the span was going to be useless anyway. The idea here is to answer the "where is all the time spent" question, in particular when we start load testing and want to know the bottleneck. The other thing we could think about is using span annotations, to create a sort of debug, info, leveling system. Then we could selectively enable/disable per environment. If you fire this branch up you can see how many are going to get made after running a few commands.
We have added open census span calls in interesting parts of the code, but without creating a jaeger or zipkin cluster they are not going anywhere. Here we add a SpanConverter that converts our Spans into OpenCensus distribution Views. It contains an in-memory map of the Measures for these Views, recording values for the length of time reported to be spent in a Span. This list has a hard limit of 100, as there is a risk of a metric explosion should an element of the system include some ID value in its Span names. Gin does this and those are discarded, but there might be others that we'll have to deal with in the future when discovered.
Force-pushed from e1c3065 to e5b9545
Distribution buckets should be configurable. eg:
m, ok := c.measures[sig]
c.viewsMu.Unlock()

if !ok {
concurrent access without lock. Outside of locks this could register/insert more than once.
Yeah it looked wrong to me too, when I lifted it from the main prom exporter.
In actual fact there is no risk here, only that a view gets created twice, but OC will take care of that with its deduplication logic.
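The pattern under discussion, check outside the lock, then re-check after taking it, can be sketched like this (stdlib only; `measureCache` and `getMeasure` are illustrative stand-ins for the converter's measure map, not the PR's code):

```go
package main

import (
	"fmt"
	"sync"
)

// measureCache shows the re-check-under-lock pattern: without the second
// check, two goroutines can both miss on the read and both create the entry.
type measureCache struct {
	mu       sync.RWMutex
	measures map[string]string // stand-in for map[string]*stats.Float64Measure
}

func (c *measureCache) getMeasure(sig string) string {
	c.mu.RLock()
	m, ok := c.measures[sig]
	c.mu.RUnlock()
	if ok {
		return m
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	// Re-check: another goroutine may have inserted the measure between
	// the RUnlock and the Lock above.
	if m, ok := c.measures[sig]; ok {
		return m
	}
	m = sig + "_span_time"
	c.measures[sig] = m
	return m
}

func main() {
	c := &measureCache{measures: map[string]string{}}
	fmt.Println(c.getMeasure("agent_submit"))
}
```

As noted in the reply, duplicate creation is harmless here because OC deduplicates view registration, so the double creation only wastes a little work.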
I think this should be optional similar to other views (api views, agent views, etc.) Maybe add a new registration function with distribution buckets? If we are concerned about metrics overload, perhaps the registration function could take an optional list of span names to be excluded/included?
Thanks for the pointer to our dist config, will hook that up. I thought about a whitelist, but that would get painful quick imo. The size limit should stop it going nuts. Will take a look at optionality. It could have its own top level flag, like the prom one.
@mantree Sounds good. The registration function could even have an optional span filter function. Then we can remove url, sanitize, 100 limit, 100 char limit and leave those up to the main.go/components.
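The optional span filter suggested above could look something like this (a sketch; `SpanFilter` and `allowPrefixes` are hypothetical names, not part of this PR):

```go
package main

import (
	"fmt"
	"strings"
)

// SpanFilter reports whether a span name should be converted into a view.
type SpanFilter func(name string) bool

// allowPrefixes builds a whitelist-style filter from a set of name prefixes,
// leaving the inclusion decision to main.go/components as suggested above.
func allowPrefixes(prefixes ...string) SpanFilter {
	return func(name string) bool {
		for _, p := range prefixes {
			if strings.HasPrefix(name, p) {
				return true
			}
		}
		return false
	}
}

func main() {
	f := allowPrefixes("agent_", "docker_")
	fmt.Println(f("agent_submit"), f("gin_request_12345"))
}
```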
spanTimeNanos := sd.EndTime.Sub(sd.StartTime)
spanTimeMillis := float64(int64(spanTimeNanos / time.Millisecond))

stats.Record(context.Background(), m.M(spanTimeMillis))
we should rip some of the attributes off the span data as well and put them onto the context. this may be slightly painful, but tags are likely important so that we have context for eg function_id, app_id... these seem to map to the fields SpanData.Attributes and tag.Map, however it's not immediately clear from digging around the docs.
Attributes, MessageEvents and Annotations are all empty for our spans.
if !ok {
m = stats.Float64(sig+"_span_time", "The span length in milliseconds", "ms")
v := &view.View{
this seems like the wrong place for view registration. we should maybe/probably stats.Record everything and not limit the stats.Record bit (it's gated by rejectSpan as well as getMeasure and an arbitrary 100 limit imposed). at least, stats.Record and view.Measure are separate processes and should be exclusive. Tolga linked to some pointers for how the other stuff is handled atm, I think initially we're mostly registering views in initialization, we can maybe make this more friendly (ie no reboot required) later but it's relatively nice to work with for the ability to toggle them on/off for eg things that extend fn. then we end up whitelisting things in, which I kind of agree is painful, while at the same time it does make us organize things (like stat/trace names) so it's a net good I think. I think we can remove the limits (100) with that in place.
(also won't require a lock - we shouldn't be locking for a stats library, that is on them!)
Everything has already been recorded, that's already happened to produce the SpanData. This is copying up the recorded values into the Measure.
api/server/spanconverter.go (Outdated)
}

//Gin creates spans for all paths, containing ID values.
//We can safely discard these, as other histograms are being created for them.
space between first char always (sorry, am that guy)... // We
const labelKeySizeLimit = 100

// sanitize returns a string that is truncated to 100 characters if it's too
// long, and replaces non-alphanumeric characters with underscores.
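An implementation matching that comment might look like the following (a sketch of the described behavior, not the PR's exact code; note the byte-based truncation is a simplification that can split a multi-byte rune):

```go
package main

import (
	"fmt"
	"unicode"
)

const labelKeySizeLimit = 100

// sanitize truncates its input to labelKeySizeLimit characters and replaces
// every non-alphanumeric character with an underscore.
func sanitize(s string) string {
	if len(s) > labelKeySizeLimit {
		s = s[:labelKeySizeLimit]
	}
	out := []rune(s)
	for i, r := range out {
		if !unicode.IsLetter(r) && !unicode.IsDigit(r) {
			out[i] = '_'
		}
	}
	return string(out)
}

func main() {
	fmt.Println(sanitize("GET /v1/apps"))
}
```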
this is all stuff we control right? we're the ones making the names up? I don't think we need this (and if we do, we should probably be using tags for some of the stuff in the name?)
trouble is we don't control them all. See the problem with the Gin generated spans.
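A filter like the rejectSpan gate discussed above might be sketched this way (the pattern here is an illustrative guess at ID-bearing path segments; the actual rule used against Gin's span names may differ):

```go
package main

import (
	"fmt"
	"regexp"
)

// idPattern is an illustrative guess: path segments that are all digits or
// long hex strings are treated as IDs.
var idPattern = regexp.MustCompile(`/\d+|/[0-9a-f]{8,}`)

// rejectSpan discards span names that embed ID values, since each unique
// name would otherwise create its own view.
func rejectSpan(name string) bool {
	return idPattern.MatchString(name)
}

func main() {
	fmt.Println(rejectSpan("/v1/apps/myapp/calls/12345"), rejectSpan("agent_submit"))
}
```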
agree with some of Tolga's points, think this is the right direction though - getting things into
Ok, I've looked at pulling the Latency config from main, but it creates a cyclic dependency, so it's not going to happen. I've had a look at passing the value down and through the various layers to try and bring it to the Converter, but that ends up being a total mess, it just doesn't feel worth it. I think to get this closer to the suggestions would really involve replacing the Spans with normal metrics. The idea here is to get some value from what we have, if they prove useful then that would strengthen the case to get Jaeger/Zipkin setup and make proper use of them.
hmm, not sure about motivation here if this is the case, I think they're 2 different goals. we lose all causal context by way of this method, and I think Jaeger would be useful in its own right without this (actually, I know it will be useful, because I've used it to fix bugs and performance tweaks already...). we just get a different set of data with Jaeger, especially since this is averaging down in distribution buckets, tail latency samples are lost -- Jaeger is immensely useful in this respect, even when sampling .01% of traffic. for this, it covers up holes in our metrics for things we're already gathering for traces, this is really useful too; docker latencies, for example, are a great example of when both are useful and we have doubled logic to get these things right now (ie before this) - maybe this is ok and what we should be doing instead, I'm not sure I have the answer, we can discuss (it sounds like you've run into some challenges, trust your judgment on this...)
@@ -743,6 +744,14 @@ func WithPrometheus() Option {
	}
}

func WithSpanConverter() Option { |
can this take an argument for latency buckets? Not the end of the world, but it would be good to keep buckets consistent with service side.
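The suggestion could take the shape of a functional option carrying the buckets down (a sketch only: `Server`, `Option`, and `spanBuckets` are simplified stand-ins for the fn server's types, and the default buckets are invented for illustration):

```go
package main

import "fmt"

// Server and Option are simplified stand-ins for the fn server's types.
type Server struct {
	spanBuckets []float64
}

type Option func(*Server)

// WithSpanConverter accepts latency buckets so they can stay consistent
// with the service-side distributions, as suggested above.
func WithSpanConverter(latencyBuckets []float64) Option {
	return func(s *Server) {
		if len(latencyBuckets) == 0 {
			// Illustrative default buckets, in milliseconds.
			latencyBuckets = []float64{1, 10, 50, 100, 250, 500, 1000}
		}
		s.spanBuckets = latencyBuckets
	}
}

func main() {
	s := &Server{}
	WithSpanConverter(nil)(s)
	fmt.Println(len(s.spanBuckets))
}
```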