
Add apm-agent-go tracing of API and event publisher #816

Merged
merged 4 commits into elastic:master on Jun 19, 2018

Conversation

@axw (Member) commented Apr 9, 2018

We add tracing of the server's API and Beats event publisher. Tracing is disabled by default and can be enabled via apm-server.yml:

apm-server:
    tracing:
        enabled: true

This is fairly minimal in terms of what we're capturing. In a future PR, we should look at adding spans for the decode, validate, transform and report processing phases.
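
To make the shape of the instrumentation concrete, here is a minimal sketch of what wrapping an HTTP handler with the agent calls discussed later in this review can look like (StartTransaction, Done(-1), ContextWithTransaction). The traceHandler name, the package/import lines, and the "request" transaction type are illustrative assumptions, not code from this PR:

package beater

import (
    "net/http"

    elasticapm "github.com/elastic/apm-agent-go" // import path assumed; package name as used throughout this PR
)

// traceHandler starts a transaction per request and makes it available to
// downstream code (e.g. the publisher) via the request context.
func traceHandler(tracer *elasticapm.Tracer, name string, h http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        tx := tracer.StartTransaction(name, "request")
        defer tx.Done(-1)
        ctx := elasticapm.ContextWithTransaction(r.Context(), tx)
        h.ServeHTTP(w, r.WithContext(ctx))
    })
}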

@alvarolobato alvarolobato added this to the 6.4 milestone May 9, 2018
@axw axw force-pushed the trace-apm-server branch 11 times, most recently from 6b3444a to ee7579a on May 14, 2018 at 09:38
@axw axw changed the title from "WIP: trace intake endpoints" to "Add apm-agent-go tracing of API and event publisher" on May 14, 2018

@axw (Member Author) commented May 15, 2018

These results are from my laptop. Variance is high, so it's hard to draw strong conclusions. It does look like enabling tracing has a significant impact on throughput: the best I see with tracing disabled is ~65 rps, while with tracing enabled the best is ~55 rps.

hey-apm -run 30s (@ master / 4d7b820)

Run 1

http://localhost:8200/v1/errors 0         
  [202] 886 responses (100.00%)           
  total 886 responses (28.82 rps)         
                                          
http://localhost:8200/v1/errors 1         
  [202] 297 responses (100.00%)           
  total 297 responses (9.66 rps)          
                                          
http://localhost:8200/v1/transactions 2   
  [202] 2043 responses (100.00%)          
  total 2043 responses (66.46 rps)        
                                          
http://localhost:8200/healthcheck 3       
  [200] 30 responses (100.00%)            
  total 30 responses (0.98 rps)           

Run 2

http://localhost:8200/v1/errors 0       
  [202] 551 responses (100.00%)         
  total 551 responses (17.15 rps)       
                                        
http://localhost:8200/v1/errors 1       
  [202] 186 responses (100.00%)         
  total 186 responses (5.79 rps)        
                                        
http://localhost:8200/v1/transactions 2 
  [202] 1168 responses (91.04%)         
  [503] 115 responses (8.96%)           
  total 1283 responses (39.94 rps)      
                                        
http://localhost:8200/healthcheck 3     
  [200] 30 responses (100.00%)          
  total 30 responses (0.93 rps)         

Run 3

http://localhost:8200/v1/errors 0       
  [202] 384 responses (100.00%)         
  total 384 responses (11.96 rps)       
                                        
http://localhost:8200/v1/errors 1       
  [202] 131 responses (100.00%)         
  total 131 responses (4.08 rps)        
                                        
http://localhost:8200/v1/transactions 2 
  [202] 740 responses (78.81%)          
  [503] 199 responses (21.19%)          
  total 939 responses (29.26 rps)       
                                        
http://localhost:8200/healthcheck 3     
  [200] 30 responses (100.00%)          
  total 30 responses (0.93 rps)         

hey-apm -run 30s (@ trace-apm-server / ee7579a, tracing.enabled:false)

Run 1

http://localhost:8200/v1/errors 0        
  [202] 888 responses (100.00%)          
  total 888 responses (28.90 rps)        
                                         
http://localhost:8200/v1/errors 1        
  [202] 297 responses (100.00%)          
  total 297 responses (9.67 rps)         
                                         
http://localhost:8200/v1/transactions 2  
  [202] 1980 responses (100.00%)         
  total 1980 responses (64.43 rps)       
                                         
http://localhost:8200/healthcheck 3      
  [200] 30 responses (100.00%)           
  total 30 responses (0.98 rps)          

Run 2

http://localhost:8200/v1/errors 0       
  [202] 881 responses (100.00%)         
  total 881 responses (28.70 rps)       
                                        
http://localhost:8200/v1/errors 1       
  [202] 296 responses (100.00%)         
  total 296 responses (9.64 rps)        
                                        
http://localhost:8200/v1/transactions 2 
  [202] 2053 responses (100.00%)        
  total 2053 responses (66.88 rps)      
                                        
http://localhost:8200/healthcheck 3     
  [200] 30 responses (100.00%)          
  total 30 responses (0.98 rps)         

Run 3

http://localhost:8200/v1/errors 0        
  [202] 884 responses (100.00%)          
  total 884 responses (27.50 rps)        
                                         
http://localhost:8200/v1/errors 1        
  [202] 295 responses (100.00%)          
  total 295 responses (9.18 rps)         
                                         
http://localhost:8200/v1/transactions 2  
  [202] 1872 responses (100.00%)         
  total 1872 responses (58.24 rps)       
                                         
http://localhost:8200/healthcheck 3      
  [200] 30 responses (100.00%)           
  total 30 responses (0.93 rps)          

hey-apm -run 30s (@ trace-apm-server / ee7579a, tracing.enabled:true)

Run 1

http://localhost:8200/v1/errors 0       
  [202] 709 responses (99.58%)          
  [503] 3 responses (0.42%)             
  total 712 responses (23.43 rps)       
                                        
http://localhost:8200/v1/errors 1       
  [202] 238 responses (99.58%)          
  [503] 1 responses (0.42%)             
  total 239 responses (7.87 rps)        
                                        
http://localhost:8200/v1/transactions 2 
  [202] 1769 responses (99.72%)         
  [503] 5 responses (0.28%)             
  total 1774 responses (58.39 rps)      
                                        
http://localhost:8200/healthcheck 3     
  [200] 30 responses (100.00%)          
  total 30 responses (0.99 rps)         

Run 2

http://localhost:8200/v1/errors 0       
  [202] 713 responses (99.58%)          
  [503] 3 responses (0.42%)             
  total 716 responses (22.89 rps)       
                                        
http://localhost:8200/v1/errors 1       
  [202] 238 responses (99.58%)          
  [503] 1 responses (0.42%)             
  total 239 responses (7.64 rps)        
                                        
http://localhost:8200/v1/transactions 2 
  [202] 1676 responses (98.53%)         
  [503] 25 responses (1.47%)            
  total 1701 responses (54.38 rps)      
                                        
http://localhost:8200/healthcheck 3     
  [200] 30 responses (100.00%)          
  total 30 responses (0.96 rps)         

Run 3

http://localhost:8200/v1/errors 0       
  [202] 758 responses (99.87%)          
  [503] 1 responses (0.13%)             
  total 759 responses (24.69 rps)       
                                        
http://localhost:8200/v1/errors 1       
  [202] 255 responses (100.00%)         
  total 255 responses (8.29 rps)        
                                        
http://localhost:8200/v1/transactions 2 
  [202] 1654 responses (99.76%)         
  [503] 4 responses (0.24%)             
  total 1658 responses (53.93 rps)      
                                        
http://localhost:8200/healthcheck 3     
  [200] 30 responses (100.00%)          
  total 30 responses (0.98 rps)         

@axw (Member Author) commented May 15, 2018

Sampling at 50% maintains throughput of ~65 rps.

It might be worth running benchmarks on the dedicated benchmark hardware, but it seems safe to merge this since tracing is disabled by default.
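
For reference, a 50% sample rate can be set on the tracer; a minimal sketch, assuming a ratio-sampler constructor like the one in later releases of the Go agent (the exact name and signature at the time of this PR may have differed):

// Assumed API (later apm-agent-go releases): sample roughly half of all
// transactions; unsampled transactions are still reported, but record no spans.
tracer.SetSampler(elasticapm.NewRatioSampler(0.5))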

@axw axw force-pushed the trace-apm-server branch from ee7579a to 7fe0326 on May 15, 2018 at 03:35
p.m.RLock()
defer p.m.RUnlock()
if p.stopped {
return errChannelClosed
}

span, ctx := elasticapm.StartSpan(ctx, "Send", "Publisher")

Contributor:

What is the difference between elasticapm.StartSpan and tx.StartSpan, and why would you use one over the other?

@axw (Member Author):

elasticapm.StartSpan pulls info (the transaction and parent span) out of a context.Context; tx.StartSpan does not.
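
As a side-by-side sketch of the two call styles, using only the signatures that appear elsewhere in this PR (the function names and the doSend helper are illustrative, and the elasticapm/context imports are assumed):

// elasticapm.StartSpan finds the current transaction (and parent span, if any)
// in ctx, and returns a derived ctx carrying the new span.
func sendWithContextSpan(ctx context.Context) {
    span, ctx := elasticapm.StartSpan(ctx, "Send", "Publisher")
    defer span.Done(-1) // with elastic/apm-agent-go#92, span is never nil
    doSend(ctx)
}

// tx.StartSpan operates on a transaction you already hold; the parent span is
// passed explicitly (nil for a top-level span).
func sendWithExplicitSpan(ctx context.Context, tx *elasticapm.Transaction) {
    span := tx.StartSpan("Send", "Publisher", nil)
    defer span.Done(-1)
    doSend(ctx)
}

// doSend stands in for the real publish step.
func doSend(ctx context.Context) {}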

select {
case <-ctx.Done():
return ctx.Err()

Contributor:

Could it be the case that if the agent fails to instrument a request, for whatever reason, we just bail out and don't process it?

@axw (Member Author):

Nope. There are already cases where ctx doesn't contain a transaction or span, namely when the request comes through the pipe listener.

beater/pub.go Outdated
func (p *publisher) processPendingReq(req pendingReq) {
var tx *elasticapm.Transaction
maybeStartSpan := func(name, spanType string) *elasticapm.Span {
if tx != nil {

Contributor:

What do you think about adding nil checks in the agent's methods themselves, so that users don't have to worry about whether there is a transaction in progress?

@axw (Member Author):

The problem is that part of the API involves fields on the transaction, which means doing something like

tx.Result = "foo"

will panic if tx is nil. I do agree it's not great as it is, though. The alternative is that we unexpose all fields and replace them with methods. That might be better. On the other hand, forcing the nil check ensures that the instrumented application avoids overhead as much as possible.

@axw (Member Author):

It might be reasonable to do a limited amount of nil checking (maybe on Transaction.StartSpan), but I think it would be more useful to force callers to check the result by returning a second bool result (the "comma-ok" pattern), like:

// tracer.StartTransaction will return (nil, false) if the tracer is inactive or the transaction's name is being filtered
tx, ok := tracer.StartTransaction("name", "type")
if ok {
    defer tx.Done(-1)
    tx.Context.Set...
}

// tx.StartSpan will return (nil, false) if tx is nil, if the transaction is not being sampled, or if the transaction's max spans limit has been reached.
span, ok := tx.StartSpan("name", "type", parentSpan)
if ok {
    defer span.Done(-1)
    span.Context.Set...
}

We can also make the Done methods no-ops for nil Transactions and Spans, but I'm a bit wary about that. A developer seeing that might think it's also reasonable to use the object's fields without checking the value for nil or the bool result.

Alternatively, I'm considering adding a new helper func DoSpan(ctx context.Context, name, spanType string, f func(ctx context.Context)). This would take care of checking if the span is nil, and calling its Done method, so you could do something like this:

ctx := context.Background()
if req.trace {
    tx, ok := tracer.StartTransaction("name", "type")
    if ok {
        defer tx.Done(-1)
        ctx = elasticapm.ContextWithTransaction(ctx, tx)
    }
}

var events []beat.Event
elasticapm.DoSpan(ctx, "Transform", "Publisher", func(ctx context.Context) {
    events = req.payload.Transform(req.config)
})

DoSpan will be defined something like:

func DoSpan(ctx context.Context, name, spanType string, f func(context.Context)) {
    span, ctx := elasticapm.StartSpan(ctx, name, spanType)
    if span != nil {
        defer span.Done(-1)
    }
    f(ctx)
}

@axw (Member Author):

I've just created elastic/apm-agent-go#92, which is yet another alternative: never create nil spans. StartTransaction already never returns a nil transaction. This simplifies the code here to:

var tx *elasticapm.Transaction
if req.trace {
    tx = tracer.StartTransaction("name", "type")
    defer tx.Done(-1)
}

span := tx.StartSpan("Transform", "Publisher", nil)
events := req.payload.Transform(req.config)
span.Done(-1)

@jalvz (Contributor) commented May 17, 2018

Re performance: I did quite a few tests on my own too, and didn't see any noticeable difference between tracing enabled, tracing disabled, and master. Good job!

I don't know how you are testing, but I generally get higher variance when running both apm-server and Elasticsearch locally, as they are two heavy processes competing for CPU (plus hey itself).
Running apm-server locally with Elasticsearch on Cloud generally gives more repeatable results.

sample results:

duration 1m0s  events 10  spans 10  frames 10  concurrency 5
pushed  accepted  throughput  latency  index  max rss  effic  branch  flags
5.2Mb   4.4Mb     467.5dps    201ms    85.6%  406.6Mb  0.656  master
5.2Mb   4.6Mb     482.0dps    196ms    85.9%  421.3Mb  0.650  trace-  apm-server.tracing.enabled=false
5.2Mb   4.3Mb     469.5dps    210ms    89.6%  394.8Mb  0.648  trace-  apm-server.tracing.enabled=true

duration 1m0s  events 10  spans 10  frames 10  concurrency 1
pushed  accepted  throughput  latency  index  max rss  effic  branch  flags
4.6Mb   4.6Mb     488.8dps    194ms    86.0%  396.1Mb  0.700  master
4.7Mb   4.7Mb     493.7dps    192ms    86.3%  407.5Mb  0.685  trace-  apm-server.tracing.enabled=false
4.5Mb   4.5Mb     497.8dps    198ms    89.6%  392.8Mb  0.690  trace-  apm-server.tracing.enabled=true

// the APM server's own execution.
func initTracer(info beat.Info, config *Config, logger *logp.Logger) (*elasticapm.Tracer, net.Listener, error) {
if !config.Tracing.isEnabled() {
os.Setenv("ELASTIC_APM_ACTIVE", "false")

Contributor:

It would be nice if everything that is passed via env vars could also be passed as an argument to NewTracer, since we are importing the agent already; but it's fine.

@axw (Member Author):

Agreed, I have an issue for this: elastic/apm-agent-go#88
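
For illustration, the flow under discussion looks roughly like this. The os.Setenv line is from the hunk above; the NewTracer call and its arguments are an assumption about the surrounding initTracer code, not taken from the diff:

// Agent configuration currently travels via environment variables, which must
// be set before the tracer is constructed.
os.Setenv("ELASTIC_APM_ACTIVE", "false")
tracer, err := elasticapm.NewTracer(info.Beat, info.Version) // assumed call and arguments
if err != nil {
    return nil, nil, err
}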

@axw (Member Author) commented May 18, 2018

Re performance: I did quite a few tests on my own too, and didn't see any noticeable difference between tracing enabled, tracing disabled, and master. Good job!

Thank you for running those tests. Yeah, I was thoughtlessly running ES locally -- I'll set it up on another machine next time.

@axw axw force-pushed the trace-apm-server branch from 7fe0326 to c961d66 on May 18, 2018 at 02:22

@jalvz (Contributor) left a comment:

This is great!
I think it would be nice if @simitt and/or @graphaelli could have a quick look.

@axw axw requested review from graphaelli and simitt May 24, 2018 00:45

@axw (Member Author) commented May 29, 2018

I'm going to leave this here until the Go agent hits beta, and then update with the beta release. There will be one or two more breaking changes between now and then.

This will be used by the server for tracing itself.
@axw axw force-pushed the trace-apm-server branch 2 times, most recently from 9fb4b31 to a25f42f on June 18, 2018 at 05:19

@graphaelli (Member) left a comment:

Looks great.

  1. What do you think of setting the user-agent for the pipe producer, to make it clear in the logs where the additional messages are coming from? Currently:
2018-06-18T11:38:24.173-0400	INFO	[request]	beater/handlers.go:248	handled request	{"request_id": "392fd4e8-f5c4-4266-89b5-1f23ec97ec8e", "method": "POST", "URL": "/v1/transactions", "content_length": 881, "remote_address": "pipe", "user-agent": "Go-http-client/1.1", "response_code": 202}
  2. How about not instrumenting the healthcheck endpoint? (A sketch follows below.)
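
A hypothetical sketch of point 2, reusing the illustrative traceHandler wrapper from earlier in this thread (the handler names are made up, not the PR's):

mux := http.NewServeMux()
// Intake endpoints are wrapped, so each request records a transaction...
mux.Handle("/v1/transactions", traceHandler(tracer, "POST /v1/transactions", transactionsHandler))
mux.Handle("/v1/errors", traceHandler(tracer, "POST /v1/errors", errorsHandler))
// ...while the healthcheck is registered unwrapped and never starts one.
mux.Handle("/healthcheck", healthcheckHandler)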

_meta/beat.yml Outdated
@@ -27,6 +27,11 @@ apm-server:
# Maximum number of new connections to accept simultaneously (0 means unlimited)
# max_connections: 0

# Tracing support for the server's HTTP endpoints and event publisher.
tracing:

Member:

should be commented out since it's the default

@axw (Member Author):

fixed

@axw (Member Author) commented Jun 19, 2018

  1. What do you think of setting the user-agent for the pipe producer, to make it clear in the logs where the additional messages are coming from?

SGTM

  2. How about not instrumenting the healthcheck endpoint?

Also SGTM.

Will address both before landing.

@axw (Member Author) commented Jun 19, 2018

What do you think of setting the user-agent for the pipe producer, to make it clear in the logs where the additional messages are coming from?

Actually, we can't do that at the moment, as there's no way to override the user-agent header. I'll create an issue on the agent repo and land without this; I hope you don't mind. If it's an issue, we can update logHandler to check the remote address, but it would be better not to have special-casing in there.

@graphaelli (Member):

Definitely fine with me to land as-is regarding ua.

@axw axw force-pushed the trace-apm-server branch from a25f42f to 94a3b81 on June 19, 2018 at 02:28
axw added 3 commits June 19, 2018 11:17
This commit adds optional/configurable support for
tracing the APM server's own HTTP endpoints and
beats event publisher. Enable tracing by adding
the following to apm-server.yml:

    tracing:
      enabled: true
@axw axw force-pushed the trace-apm-server branch from 94a3b81 to c8dd591 on June 19, 2018 at 03:17
@axw axw merged commit 0b122a0 into elastic:master Jun 19, 2018
@axw axw deleted the trace-apm-server branch June 19, 2018 03:35
@axw axw removed the in progress label Jun 19, 2018