Open census Span to View converter #1239

Open · wants to merge 3 commits into base: master

Conversation

tcoupland
Contributor

@tcoupland commented Sep 25, 2018

Span to View converter

We have added open census span calls in interesting parts of the code,
but without creating a jaeger or zipkin cluster they are not going
anywhere.

Here we add a SpanConverter that converts our Spans into OpenCensus
distribution Views.

It contains an in memory map of the Measures for these Views,
recording values for the length of time reported to be spent in a
Span.

This list has a hard limit of 100, as there is a risk of a metric
explosion should an element of the system include some ID value in
its Span names. Gin does this and those are discarded, but there
might be others that we'll have to deal with in the future when discovered.
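
Roughly, the converter is a trace exporter whose ExportSpan records each finished span's duration on a per-span-name measure. A minimal sketch of that shape, assuming the standard OpenCensus Go packages (names, bucket bounds and wiring are illustrative, not the exact diff):

package main

import (
	"context"
	"sync"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/trace"
)

// spanConverter turns finished spans into recorded values on per-span-name
// distribution Views.
type spanConverter struct {
	mu       sync.Mutex
	measures map[string]*stats.Float64Measure
}

// ExportSpan implements trace.Exporter: it records the span's duration, in
// milliseconds, against a Measure derived from the span name.
func (c *spanConverter) ExportSpan(sd *trace.SpanData) {
	m := c.getMeasure(sd.Name)
	if m == nil {
		return // rejected span name, or over the hard limit
	}
	ms := float64(sd.EndTime.Sub(sd.StartTime)) / float64(time.Millisecond)
	stats.Record(context.Background(), m.M(ms))
}

// getMeasure lazily creates a Measure and registers its View, capping the map
// at 100 entries to guard against a metric explosion from ID-bearing names.
// NOTE: this sketch holds the lock across check-and-register; the diff under
// review releases it earlier, which comes up in the comments below.
func (c *spanConverter) getMeasure(name string) *stats.Float64Measure {
	c.mu.Lock()
	defer c.mu.Unlock()
	if m, ok := c.measures[name]; ok {
		return m
	}
	if len(c.measures) >= 100 {
		return nil
	}
	m := stats.Float64(name+"_span_time", "The span length in milliseconds", "ms")
	if err := view.Register(&view.View{
		Name:        name + "_span_time",
		Description: "The span length in milliseconds",
		Measure:     m,
		Aggregation: view.Distribution(1, 10, 100, 1000, 10000), // illustrative bounds
	}); err != nil {
		return nil
	}
	c.measures[name] = m
	return m
}

func main() {
	// Wiring it in is a single call; spans recorded elsewhere then feed the Views.
	trace.RegisterExporter(&spanConverter{measures: map[string]*stats.Float64Measure{}})
}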

@rdallman
Contributor

nice, thanks. will take a look, need something like this. slightly relevant maybe - am generally slightly concerned about 'metrics overload' from this (a lot of these won't be all that useful), curious if you have any thoughts about being able to turn certain 'views' (as opencensus calls them) on/off on a running server.

@tcoupland
Contributor Author

Metric overload is a worry for sure. It's easy to make them, but sometimes hard to track down where they come from.

I would say that if this creates a metric that is useless, then the span was going to be useless anyway. The idea here is to answer the "where is all the time spent" question, in particular when we start load testing and want to know where the bottleneck is.

The other thing we could think about is using span annotations to create a sort of debug/info leveling system. Then we could selectively enable/disable per environment.

If you fire this branch up you can see how many are going to get made after running a few commands.

@tcoupland changed the title from "Prometheus exporter for open census spans" to "Open census Span to View converter" on Sep 26, 2018
@skinowski
Member

Distribution buckets should be configurable. eg:
https://github.com/fnproject/fn/blob/master/cmd/fnserver/main.go#L30
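
For illustration, that could be as small as building the aggregation from caller-supplied boundaries (a sketch; the package and helper names are made up):

package spanconverter

import "go.opencensus.io/stats/view"

// spanTimeAggregation builds the distribution aggregation for span-time views
// from caller-supplied bucket boundaries (for example, the latency buckets
// already configured in cmd/fnserver/main.go) instead of hard-coding them.
func spanTimeAggregation(latencyBuckets []float64) *view.Aggregation {
	return view.Distribution(latencyBuckets...)
}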

m, ok := c.measures[sig]
c.viewsMu.Unlock()

if !ok {
Member

concurrent access without lock. Outside of locks this could register/insert more than once.

Contributor Author

Yeah, it looked wrong to me too when I lifted it from the main prom exporter.

In actual fact there is no risk here, only that a view gets created twice, but OC will take care of that with its deduplication logic.

@skinowski
Member

skinowski commented Sep 26, 2018

I think this should be optional, similar to other views (api views, agent views, etc.). Maybe add a new registration function with distribution buckets? If we are concerned about metrics overload, perhaps the registration function could take an optional list of span names to be excluded/included?

@tcoupland
Contributor Author

Thanks for the pointer to our dist config, will hook that up.

I thought about a whitelist, but that would get painful quickly imo. The size limit should stop it going nuts.

Will take a look at optionality. It could have its own top-level flag, like the prom one.

@skinowski
Member

@mantree Sounds good. The registration function could even have an optional span filter function. Then we can remove url, sanitize, the 100 limit and the 100 char limit, and leave those up to main.go/components.
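
Sketching that suggested shape (the constructor, field and parameter names here are made up for illustration, not taken from the PR):

package spanconverter

import (
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// SpanConverter, as sketched here, carries an injected aggregation and a
// caller-supplied filter, so the url/sanitize/limit policy lives with the
// caller (main.go or an extending component) rather than in the converter.
type SpanConverter struct {
	measures map[string]*stats.Float64Measure
	agg      *view.Aggregation
	include  func(spanName string) bool
}

// NewSpanConverter is the suggested registration entry point: distribution
// buckets plus an optional span-name filter.
func NewSpanConverter(latencyBuckets []float64, include func(spanName string) bool) *SpanConverter {
	if include == nil {
		include = func(string) bool { return true } // no filter: accept every span name
	}
	return &SpanConverter{
		measures: make(map[string]*stats.Float64Measure),
		agg:      view.Distribution(latencyBuckets...),
		include:  include,
	}
}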

spanTimeNanos := sd.EndTime.Sub(sd.StartTime)
spanTimeMillis := float64(int64(spanTimeNanos / time.Millisecond))

stats.Record(context.Background(), m.M(spanTimeMillis))
Contributor

we should rip some of the attributes off the span data as well and put them onto the context. this may be slightly painful, but tags are likely important so that we have context for eg function_id, app_id... these seem to map to the fields SpanData.Attributes and tag.Map, however it's not immediately clear from digging around the docs.
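
For illustration, that mapping might look something like this sketch, assuming string-valued attributes and made-up attribute/key names:

package spanconverter

import (
	"context"

	"go.opencensus.io/tag"
	"go.opencensus.io/trace"
)

// tagsFromSpan copies selected span attributes onto the context as OpenCensus
// tags, so the recorded measurement carries dimensions such as app_id/fn_id.
// The attribute names and keys here are illustrative.
func tagsFromSpan(ctx context.Context, sd *trace.SpanData, keys map[string]tag.Key) context.Context {
	var mutators []tag.Mutator
	for attr, key := range keys {
		if v, ok := sd.Attributes[attr].(string); ok {
			mutators = append(mutators, tag.Upsert(key, v))
		}
	}
	tagged, err := tag.New(ctx, mutators...)
	if err != nil {
		return ctx // fall back to the untagged context
	}
	return tagged
}

The span-time view would also need those same keys in its TagKeys for the dimensions to survive aggregation.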

Contributor Author

Attributes, MessageEvents and Annotations are all empty for our spans.


if !ok {
	m = stats.Float64(sig+"_span_time", "The span length in milliseconds", "ms")
	v := &view.View{
Contributor

this seems like the wrong place for view registration. we should maybe/probably stats.Record everything and not limit the stats.Record bit (it's gated by rejectSpan as well as getMeasure and an arbitrary 100 limit imposed). at least, stats.Record and view.Measure are separate processes and should be exclusive. Tolga linked to some pointers for how the other stuff is handled atm, I think initially we're mostly registering views in initialization, we can maybe make this more friendly (ie no reboot required) later but it's relatively nice to work with for the ability to toggle them on/off for eg things that extend fn. then we end up whitelisting things in, which I kind of agree is painful, while at the same time it does make us organize things (like stat/trace names) so it's a net good I think. I think we can remove the limits (100) with that in place.

Contributor

(also won't require a lock - we shouldn't be locking for a stats library, that is on them!)
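
A rough sketch of that register-at-init arrangement (span names, buckets and function names are illustrative, and this is one reading of the suggestion rather than the PR's code):

package spanconverter

import (
	"context"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/trace"
)

// spanMeasures is filled once at startup for a known set of span names,
// mirroring how the api/agent views are registered today.
var spanMeasures = map[string]*stats.Float64Measure{}

// RegisterSpanTimeViews registers one distribution view per whitelisted span
// name at initialization time, with caller-supplied buckets.
func RegisterSpanTimeViews(spanNames []string, latencyBuckets []float64) error {
	for _, name := range spanNames {
		m := stats.Float64(name+"_span_time", "The span length in milliseconds", "ms")
		if err := view.Register(&view.View{
			Name:        name + "_span_time",
			Description: "The span length in milliseconds",
			Measure:     m,
			Aggregation: view.Distribution(latencyBuckets...),
		}); err != nil {
			return err
		}
		spanMeasures[name] = m
	}
	return nil
}

// exporter then only records; it never registers, so there is no lock or size
// limit on the export path. Hook it in with trace.RegisterExporter(exporter{}).
type exporter struct{}

func (exporter) ExportSpan(sd *trace.SpanData) {
	m, ok := spanMeasures[sd.Name]
	if !ok {
		return // no view registered for this span name
	}
	ms := float64(sd.EndTime.Sub(sd.StartTime)) / float64(time.Millisecond)
	stats.Record(context.Background(), m.M(ms))
}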

Contributor Author

Everything has already been recorded; that's what produced the SpanData. This is copying the recorded values up into the Measure.

}

//Gin creates spans for all paths, containing ID values.
//We can safely discard these, as other histograms are being created for them.
Contributor

space before the first char always (sorry, am that guy)... // We

const labelKeySizeLimit = 100

// sanitize returns a string that is truncated to 100 characters if it's too
// long, and replaces non-alphanumeric characters with underscores.
Contributor

this is all stuff we control right? we're the ones making the names up? I don't think we need this (and if we do, we should probably be using tags for some of the stuff in the name?)

Contributor Author

Trouble is we don't control them all. See the problem with the Gin-generated spans.
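
For reference, a function matching that doc comment might look roughly like this sketch (not necessarily the exact implementation in the diff):

package spanconverter

import "unicode"

const labelKeySizeLimit = 100

// sanitize truncates the name to labelKeySizeLimit characters and replaces
// every non-alphanumeric character with an underscore, so externally
// generated span names (Gin route paths, for example) become safe metric names.
func sanitize(name string) string {
	runes := []rune(name)
	if len(runes) > labelKeySizeLimit {
		runes = runes[:labelKeySizeLimit]
	}
	for i, r := range runes {
		if !unicode.IsLetter(r) && !unicode.IsDigit(r) {
			runes[i] = '_'
		}
	}
	return string(runes)
}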

@rdallman
Contributor

agree with some of Tolga's points, think this is the right direction though - getting things into stats.Record looks pretty easy, which is sweet (dare I show you our opentracing->prometheus exporter from last October?)

@tcoupland
Contributor Author

Ok, I've looked at pulling the Latency config from main, but it creates a cyclic dependency, so it's not going to happen. I've had a look at passing the value down through the various layers to try and bring it to the Converter, but that ends up being a total mess; it just doesn't feel worth it.

I think to get this closer to the suggestions would really involve replacing the Spans with normal metrics. The idea here is to get some value from what we have, if they prove useful then that would strengthen the case to get Jaeger/Zipkin setup and make proper use of them.

@rdallman
Contributor

The idea here is to get some value from what we have, if they prove useful then that would strengthen the case to get Jaeger/Zipkin setup and make proper use of them.

hmm, not sure about the motivation here if this is the case, I think they're 2 different goals. we lose all causal context by way of this method, and I think Jaeger would be useful in its own right without this (actually, I know it will be useful, because I've used it to fix bugs and make performance tweaks already...). we just get a different set of data with Jaeger, especially since this is averaging down into distribution buckets, so tail latency samples are lost -- Jaeger is immensely useful in this respect, even when sampling .01% of traffic.

for this PR, it covers up holes in our metrics for things we're already gathering for traces, which is really useful too; docker latencies, for example, are a great example of when both are useful, and we have doubled logic to get these things right now (ie before this). maybe this is ok and what we should be doing instead, I'm not sure I have the answer, we can discuss (it sounds like you've run into some challenges, trust your judgment on this...)

@@ -743,6 +744,14 @@ func WithPrometheus() Option {
}
}

func WithSpanConverter() Option {
Member

can this take an argument for latency buckets? Not the end of the world, but it would be good to keep buckets consistent with the service side.
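
A hedged sketch of what that could look like, assuming the server's functional-option style (the Option and Server definitions below are stand-ins, not fn's exact types, and the converter constructor is illustrative):

package server

import (
	"context"

	"go.opencensus.io/stats/view"
	"go.opencensus.io/trace"
)

// Option and Server here mirror the server's existing functional-option
// style; the real signatures in fn may differ.
type Option func(ctx context.Context, s *Server) error

type Server struct{}

// spanConverter stands in for the converter added in this PR; here it only
// carries the aggregation built from the caller's buckets.
type spanConverter struct{ agg *view.Aggregation }

func (c *spanConverter) ExportSpan(sd *trace.SpanData) { /* record span time as in the PR */ }

// WithSpanConverter takes the latency bucket boundaries as an argument so the
// span-time views stay consistent with the server-side views.
func WithSpanConverter(latencyBuckets []float64) Option {
	return func(ctx context.Context, s *Server) error {
		trace.RegisterExporter(&spanConverter{agg: view.Distribution(latencyBuckets...)})
		return nil
	}
}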
