-
Notifications
You must be signed in to change notification settings - Fork 525
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add apm-agent-go tracing of API and event publisher #816
Conversation
6b3444a
to
ee7579a
Compare
This is run on my laptop. Variance is high, so hard to draw strong conclusions from this. It does look like enabling tracing has a significant impact on throughput, since the best I see with tracing disabled is ~65rps, with tracing enabled the best is ~55rps.
Run 1
Run 2
Run 3:
Run 1
Run 2
Run 3
Run 1
Run 2
Run 3
|
Sampling at 50% maintains throughput of ~65rps. It might be worth running benchmarks on the dedicated benchmark hardware, but it seems to me that it's safe to put this in since it's disabled by default. |
p.m.RLock() | ||
defer p.m.RUnlock() | ||
if p.stopped { | ||
return errChannelClosed | ||
} | ||
|
||
span, ctx := elasticapm.StartSpan(ctx, "Send", "Publisher") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the difference between elasticapm.StartSpan
and tx.StartSpan
and why you use one over the other one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elasticapm.StartSpan pulls info (transaction and parent span) out of a context.Context, tx.StartSpan does not.
select { | ||
case <-ctx.Done(): | ||
return ctx.Err() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be the case that if the agent fails to instrument a request, for whatever reason, we just balk out and do not process it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. Already in some cases, ctx doesn't contain a transaction or span, namely when the request is coming through the pipe listener.
beater/pub.go
Outdated
func (p *publisher) processPendingReq(req pendingReq) { | ||
var tx *elasticapm.Transaction | ||
maybeStartSpan := func(name, spanType string) *elasticapm.Span { | ||
if tx != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you think about adding nil checks in the methods themselves in the agent so that users don't have to worry whether there is a transaction going on or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that part of the API involves fields on the transaction, which means doing something like
tx.Result = "foo"
will panic if tx is nil. I do agree it's not great as it is, though. The alternative is that we unexpose all fields and replace them with methods. That might be better. On the other hand, forcing the nil check ensures that the instrumented application avoids overhead as much as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be reasonable to do a limited amount of nil checking (maybe on Transaction.StartSpan), but I think it would be more useful to force callers to check the result by returning a second bool result (the "comma-ok" pattern), like:
// tracer.StartTransaction will return (nil, false) if the tracer is inactive or the transaction's name is being filtered
tx, ok := tracer.StartTransaction("name", "type")
if ok {
defer tx.Done(-1)
tx.Context.Set...
}
// tx.StartSpan will return (nil, false) if tx is nil, if the transaction is not being sampled, or if the transaction's max spans limit has been reached.
span, ok := tx.StartSpan("name", "type", parentSpan)
if ok {
defer span.Done(-1)
span.Context.Set...
}
We can also make the Done methods no-ops for nil Transactions and Spans, but I'm a bit wary about that. A developer seeing that might think it's also reasonable to use the object's fields without checking the value for nil or the bool result.
Alternatively, I'm considering adding a new helper func DoSpan(ctx context.Context, name, spanType string, f func(ctx context.Context))
. This would take care of checking if the span is nil, and calling its Done method, so you could do something like this:
ctx := context.Background()
if req.trace {
tx, ok := tracer.StartTransaction("name", "type")
if ok {
defer tx.Done(-1)
ctx = elasticapm.ContextWithTransaction(ctx, tx)
}
}
var events []beat.Event
elasticapm.DoSpan(ctx, "Transform", "Publisher", func() {
events = req.payload.Transform(req.config)
})
DoSpan will be defined something like:
func DoSpan(ctx context.Context, name, spanType string, f func(context.Context)) {
span, ctx := elasticapm.StartSpan(name, spanType)
if span != nil {
defer span.Done(-1)
}
f(ctx)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just created elastic/apm-agent-go#92, which is the other other alternative: never create nil spans. StartTransaction already never returns nil transactions. This simplifies the code here to:
var tx *elasticapm.Transaction
if req.trace {
tx = tracer.StartTransaction("name", "type")
defer tx.Done(-1)
}
span := tx.StartSpan("Transform", "Publisher", nil)
events := req.payload.Transform(req.config)
span.Done(-1)
re performance: i did quite a few tests on my own too, and didn't see any noticeable difference between tracing enabled, disabled and master. good job! i don't know how you are testing, but generally i get higher variance when running both apm-server and elasticsearch locally, as they are 2 heavy processes competing for cpu (plus hey itself). sample results:
|
// the APM server's own execution. | ||
func initTracer(info beat.Info, config *Config, logger *logp.Logger) (*elasticapm.Tracer, net.Listener, error) { | ||
if !config.Tracing.isEnabled() { | ||
os.Setenv("ELASTIC_APM_ACTIVE", "false") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice if everything that is passed via env vars could be passed as well as an argument to NewTracer
, since we are importing the agent already; but it is fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I have an issue for this: elastic/apm-agent-go#88
Thank you for running those tests. Yeah, I was thoughtlessly running ES locally -- I'll set it up on another machine next time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great!
I think would be nice if @simitt and/or @graphaelli can have a quick look
I'm going to leave this here until the Go agent hits beta, and then update with the beta release. There will be one or two more breaking changes between now and then. |
This will be used by the server for tracing itself.
9fb4b31
to
a25f42f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great.
- What do you think of setting the user-agent in for the pipe producer to make the logs clear where the additional message are coming from, currently:
2018-06-18T11:38:24.173-0400 INFO [request] beater/handlers.go:248 handled request {"request_id": "392fd4e8-f5c4-4266-89b5-1f23ec97ec8e", "method": "POST", "URL": "/v1/transactions", "content_length": 881, "remote_address": "pipe", "user-agent": "Go-http-client/1.1", "response_code": 202}
- How about not instrumenting the healthcheck endpoint?
_meta/beat.yml
Outdated
@@ -27,6 +27,11 @@ apm-server: | |||
# Maximum number of new connections to accept simultaneously (0 means unlimited) | |||
# max_connections: 0 | |||
|
|||
# Tracing support for the server's HTTP endpoints and event publisher. | |||
tracing: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be commented out since it's the default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
SGTM
Also SGTM. Will address both before landing. |
Actually, we can't do that at the moment, as there's no way to override the user-agent header. I'll create an issue on the agent repo. I'll land without this, I hope you don't mind. If it's an issue we can update logHandler to check the remote address, but it would be better not to have special casing in there. |
Definitely fine with me to land as-is regarding ua. |
This commit adds optional/configurable support for tracing the APM server's own HTTP endpoints and beats event publisher. Enable tracing by adding the following to apm-server.yml: tracing: enabled: true
We add tracing of the server's API and beats event publisher. Tracing is disabled by default, and can be enabled via apm-server.yml:
This is fairly minimal in terms of what we're capturing. In a future PR, we should look at adding spans for the decode, validate, transform and report processing phases.