
Add apm-agent-go tracing of API and event publisher #816

Merged
merged 4 commits into elastic:master on Jun 19, 2018

Conversation

@axw (Member) commented Apr 9, 2018

We add tracing of the server's API and Beats event publisher. Tracing is disabled by default and can be enabled via apm-server.yml:

apm-server:
    tracing:
        enabled: true

This is fairly minimal in terms of what we're capturing. In a future PR, we should look at adding spans for the decode, validate, transform and report processing phases.
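
To make the shape of the instrumentation concrete, here is a minimal sketch of what wrapping an HTTP handler with the agent calls discussed later in this review can look like (StartTransaction, Done(-1), ContextWithTransaction). The traceHandler name, the package/import lines, and the "request" transaction type are illustrative assumptions, not code from this PR:

package beater

import (
    "net/http"

    elasticapm "github.com/elastic/apm-agent-go" // import path assumed; package name as used throughout this PR
)

// traceHandler starts a transaction per request and makes it available to
// downstream code (e.g. the publisher) via the request context.
func traceHandler(tracer *elasticapm.Tracer, name string, h http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        tx := tracer.StartTransaction(name, "request")
        defer tx.Done(-1)
        ctx := elasticapm.ContextWithTransaction(r.Context(), tx)
        h.ServeHTTP(w, r.WithContext(ctx))
    })
}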

@alvarolobato alvarolobato added this to the 6.4 milestone May 9, 2018
@axw axw force-pushed the trace-apm-server branch 11 times, most recently from 6b3444a to ee7579a on May 14, 2018 at 09:38
@axw axw changed the title from "WIP: trace intake endpoints" to "Add apm-agent-go tracing of API and event publisher" on May 14, 2018

@axw (Member Author) commented May 15, 2018

These results are from my laptop. Variance is high, so it's hard to draw strong conclusions. It does look like enabling tracing has a significant impact on throughput: the best I see with tracing disabled is ~65 rps, while with tracing enabled the best is ~55 rps.

hey-apm -run 30s (@ master / 4d7b820)

Run 1

http://localhost:8200/v1/errors 0         
  [202] 886 responses (100.00%)           
  total 886 responses (28.82 rps)         
                                          
http://localhost:8200/v1/errors 1         
  [202] 297 responses (100.00%)           
  total 297 responses (9.66 rps)          
                                          
http://localhost:8200/v1/transactions 2   
  [202] 2043 responses (100.00%)          
  total 2043 responses (66.46 rps)        
                                          
http://localhost:8200/healthcheck 3       
  [200] 30 responses (100.00%)            
  total 30 responses (0.98 rps)           

Run 2

http://localhost:8200/v1/errors 0       
  [202] 551 responses (100.00%)         
  total 551 responses (17.15 rps)       
                                        
http://localhost:8200/v1/errors 1       
  [202] 186 responses (100.00%)         
  total 186 responses (5.79 rps)        
                                        
http://localhost:8200/v1/transactions 2 
  [202] 1168 responses (91.04%)         
  [503] 115 responses (8.96%)           
  total 1283 responses (39.94 rps)      
                                        
http://localhost:8200/healthcheck 3     
  [200] 30 responses (100.00%)          
  total 30 responses (0.93 rps)         

Run 3

http://localhost:8200/v1/errors 0       
  [202] 384 responses (100.00%)         
  total 384 responses (11.96 rps)       
                                        
http://localhost:8200/v1/errors 1       
  [202] 131 responses (100.00%)         
  total 131 responses (4.08 rps)        
                                        
http://localhost:8200/v1/transactions 2 
  [202] 740 responses (78.81%)          
  [503] 199 responses (21.19%)          
  total 939 responses (29.26 rps)       
                                        
http://localhost:8200/healthcheck 3     
  [200] 30 responses (100.00%)          
  total 30 responses (0.93 rps)         

hey-apm -run 30s (@ trace-apm-server / ee7579a, tracing.enabled:false)

Run 1

http://localhost:8200/v1/errors 0        
  [202] 888 responses (100.00%)          
  total 888 responses (28.90 rps)        
                                         
http://localhost:8200/v1/errors 1        
  [202] 297 responses (100.00%)          
  total 297 responses (9.67 rps)         
                                         
http://localhost:8200/v1/transactions 2  
  [202] 1980 responses (100.00%)         
  total 1980 responses (64.43 rps)       
                                         
http://localhost:8200/healthcheck 3      
  [200] 30 responses (100.00%)           
  total 30 responses (0.98 rps)          

Run 2

http://localhost:8200/v1/errors 0       
  [202] 881 responses (100.00%)         
  total 881 responses (28.70 rps)       
                                        
http://localhost:8200/v1/errors 1       
  [202] 296 responses (100.00%)         
  total 296 responses (9.64 rps)        
                                        
http://localhost:8200/v1/transactions 2 
  [202] 2053 responses (100.00%)        
  total 2053 responses (66.88 rps)      
                                        
http://localhost:8200/healthcheck 3     
  [200] 30 responses (100.00%)          
  total 30 responses (0.98 rps)         

Run 3

http://localhost:8200/v1/errors 0        
  [202] 884 responses (100.00%)          
  total 884 responses (27.50 rps)        
                                         
http://localhost:8200/v1/errors 1        
  [202] 295 responses (100.00%)          
  total 295 responses (9.18 rps)         
                                         
http://localhost:8200/v1/transactions 2  
  [202] 1872 responses (100.00%)         
  total 1872 responses (58.24 rps)       
                                         
http://localhost:8200/healthcheck 3      
  [200] 30 responses (100.00%)           
  total 30 responses (0.93 rps)          

hey-apm -run 30s (@ trace-apm-server / ee7579a, tracing.enabled:true)

Run 1

http://localhost:8200/v1/errors 0       
  [202] 709 responses (99.58%)          
  [503] 3 responses (0.42%)             
  total 712 responses (23.43 rps)       
                                        
http://localhost:8200/v1/errors 1       
  [202] 238 responses (99.58%)          
  [503] 1 responses (0.42%)             
  total 239 responses (7.87 rps)        
                                        
http://localhost:8200/v1/transactions 2 
  [202] 1769 responses (99.72%)         
  [503] 5 responses (0.28%)             
  total 1774 responses (58.39 rps)      
                                        
http://localhost:8200/healthcheck 3     
  [200] 30 responses (100.00%)          
  total 30 responses (0.99 rps)         

Run 2

http://localhost:8200/v1/errors 0       
  [202] 713 responses (99.58%)          
  [503] 3 responses (0.42%)             
  total 716 responses (22.89 rps)       
                                        
http://localhost:8200/v1/errors 1       
  [202] 238 responses (99.58%)          
  [503] 1 responses (0.42%)             
  total 239 responses (7.64 rps)        
                                        
http://localhost:8200/v1/transactions 2 
  [202] 1676 responses (98.53%)         
  [503] 25 responses (1.47%)            
  total 1701 responses (54.38 rps)      
                                        
http://localhost:8200/healthcheck 3     
  [200] 30 responses (100.00%)          
  total 30 responses (0.96 rps)         

Run 3

http://localhost:8200/v1/errors 0       
  [202] 758 responses (99.87%)          
  [503] 1 responses (0.13%)             
  total 759 responses (24.69 rps)       
                                        
http://localhost:8200/v1/errors 1       
  [202] 255 responses (100.00%)         
  total 255 responses (8.29 rps)        
                                        
http://localhost:8200/v1/transactions 2 
  [202] 1654 responses (99.76%)         
  [503] 4 responses (0.24%)             
  total 1658 responses (53.93 rps)      
                                        
http://localhost:8200/healthcheck 3     
  [200] 30 responses (100.00%)          
  total 30 responses (0.98 rps)         

@axw (Member Author) commented May 15, 2018

Sampling at 50% maintains throughput of ~65 rps.

It might be worth running benchmarks on the dedicated benchmark hardware, but it seems safe to merge this since tracing is disabled by default.
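
For reference, a 50% sample rate can be set on the tracer; a minimal sketch, assuming a ratio-sampler constructor like the one in later releases of the Go agent (the exact name and signature at the time of this PR may have differed):

// Assumed API (later apm-agent-go releases): sample roughly half of all
// transactions; unsampled transactions are still reported, but record no spans.
tracer.SetSampler(elasticapm.NewRatioSampler(0.5))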

@axw axw force-pushed the trace-apm-server branch from ee7579a to 7fe0326 on May 15, 2018 at 03:35
p.m.RLock()
defer p.m.RUnlock()
if p.stopped {
return errChannelClosed
}

span, ctx := elasticapm.StartSpan(ctx, "Send", "Publisher")

Contributor:

What is the difference between elasticapm.StartSpan and tx.StartSpan, and why would you use one over the other?

@axw (Member Author):

elasticapm.StartSpan pulls info (the transaction and parent span) out of a context.Context; tx.StartSpan does not.
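
As a side-by-side sketch of the two call styles, using only the signatures that appear elsewhere in this PR (the function names and the doSend helper are illustrative, and the elasticapm/context imports are assumed):

// elasticapm.StartSpan finds the current transaction (and parent span, if any)
// in ctx, and returns a derived ctx carrying the new span.
func sendWithContextSpan(ctx context.Context) {
    span, ctx := elasticapm.StartSpan(ctx, "Send", "Publisher")
    defer span.Done(-1) // with elastic/apm-agent-go#92, span is never nil
    doSend(ctx)
}

// tx.StartSpan operates on a transaction you already hold; the parent span is
// passed explicitly (nil for a top-level span).
func sendWithExplicitSpan(ctx context.Context, tx *elasticapm.Transaction) {
    span := tx.StartSpan("Send", "Publisher", nil)
    defer span.Done(-1)
    doSend(ctx)
}

// doSend stands in for the real publish step.
func doSend(ctx context.Context) {}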

select {
case <-ctx.Done():
return ctx.Err()

Contributor:

Could it be the case that if the agent fails to instrument a request, for whatever reason, we just bail out and don't process it?

@axw (Member Author):

Nope. There are already cases where ctx doesn't contain a transaction or span, namely when the request comes through the pipe listener.

beater/pub.go Outdated
func (p *publisher) processPendingReq(req pendingReq) {
var tx *elasticapm.Transaction
maybeStartSpan := func(name, spanType string) *elasticapm.Span {
if tx != nil {

Contributor:

What do you think about adding nil checks in the agent's methods themselves, so that users don't have to worry about whether there is a transaction in progress?

@axw (Member Author):

The problem is that part of the API involves fields on the transaction, which means doing something like

tx.Result = "foo"

will panic if tx is nil. I do agree it's not great as it is, though. The alternative is that we unexpose all fields and replace them with methods. That might be better. On the other hand, forcing the nil check ensures that the instrumented application avoids overhead as much as possible.

@axw (Member Author):

It might be reasonable to do a limited amount of nil checking (maybe on Transaction.StartSpan), but I think it would be more useful to force callers to check the result by returning a second bool result (the "comma-ok" pattern), like:

// tracer.StartTransaction will return (nil, false) if the tracer is inactive or the transaction's name is being filtered
tx, ok := tracer.StartTransaction("name", "type")
if ok {
    defer tx.Done(-1)
    tx.Context.Set...
}

// tx.StartSpan will return (nil, false) if tx is nil, if the transaction is not being sampled, or if the transaction's max spans limit has been reached.
span, ok := tx.StartSpan("name", "type", parentSpan)
if ok {
    defer span.Done(-1)
    span.Context.Set...
}

We can also make the Done methods no-ops for nil Transactions and Spans, but I'm a bit wary about that. A developer seeing that might think it's also reasonable to use the object's fields without checking the value for nil or the bool result.

Alternatively, I'm considering adding a new helper func DoSpan(ctx context.Context, name, spanType string, f func(ctx context.Context)). This would take care of checking if the span is nil, and calling its Done method, so you could do something like this:

ctx := context.Background()
if req.trace {
    tx, ok := tracer.StartTransaction("name", "type")
    if ok {
        defer tx.Done(-1)
        ctx = elasticapm.ContextWithTransaction(ctx, tx)
    }
}

var events []beat.Event
elasticapm.DoSpan(ctx, "Transform", "Publisher", func(ctx context.Context) {
    events = req.payload.Transform(req.config)
})

DoSpan will be defined something like:

func DoSpan(ctx context.Context, name, spanType string, f func(context.Context)) {
    span, ctx := elasticapm.StartSpan(ctx, name, spanType)
    if span != nil {
        defer span.Done(-1)
    }
    f(ctx)
}

@axw (Member Author):

I've just created elastic/apm-agent-go#92, which is yet another alternative: never create nil spans. StartTransaction already never returns a nil transaction. This simplifies the code here to:

var tx *elasticapm.Transaction
if req.trace {
    tx = tracer.StartTransaction("name", "type")
    defer tx.Done(-1)
}

span := tx.StartSpan("Transform", "Publisher", nil)
events := req.payload.Transform(req.config)
span.Done(-1)

@jalvz (Contributor) commented May 17, 2018

Re performance: I did quite a few tests on my own too, and didn't see any noticeable difference between tracing enabled, tracing disabled, and master. Good job!

I don't know how you are testing, but I generally get higher variance when running both apm-server and Elasticsearch locally, as they are two heavy processes competing for CPU (plus hey itself).
Running apm-server locally with Elasticsearch on Cloud generally gives more repeatable results.

sample results:

duration 1m0s  events 10  spans 10  frames 10  concurrency 5
pushed  accepted  throughput  latency  index  max rss  effic  branch  flags
5.2Mb   4.4Mb     467.5dps    201ms    85.6%  406.6Mb  0.656  master
5.2Mb   4.6Mb     482.0dps    196ms    85.9%  421.3Mb  0.650  trace-  apm-server.tracing.enabled=false
5.2Mb   4.3Mb     469.5dps    210ms    89.6%  394.8Mb  0.648  trace-  apm-server.tracing.enabled=true

duration 1m0s  events 10  spans 10  frames 10  concurrency 1
pushed  accepted  throughput  latency  index  max rss  effic  branch  flags
4.6Mb   4.6Mb     488.8dps    194ms    86.0%  396.1Mb  0.700  master
4.7Mb   4.7Mb     493.7dps    192ms    86.3%  407.5Mb  0.685  trace-  apm-server.tracing.enabled=false
4.5Mb   4.5Mb     497.8dps    198ms    89.6%  392.8Mb  0.690  trace-  apm-server.tracing.enabled=true

// the APM server's own execution.
func initTracer(info beat.Info, config *Config, logger *logp.Logger) (*elasticapm.Tracer, net.Listener, error) {
if !config.Tracing.isEnabled() {
os.Setenv("ELASTIC_APM_ACTIVE", "false")

Contributor:

It would be nice if everything that is passed via env vars could also be passed as an argument to NewTracer, since we are importing the agent already; but it's fine.

@axw (Member Author):

Agreed, I have an issue for this: elastic/apm-agent-go#88
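
For illustration, the flow under discussion looks roughly like this. The os.Setenv line is from the hunk above; the NewTracer call and its arguments are an assumption about the surrounding initTracer code, not taken from the diff:

// Agent configuration currently travels via environment variables, which must
// be set before the tracer is constructed.
os.Setenv("ELASTIC_APM_ACTIVE", "false")
tracer, err := elasticapm.NewTracer(info.Beat, info.Version) // assumed call and arguments
if err != nil {
    return nil, nil, err
}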

@axw (Member Author) commented May 18, 2018

Re performance: I did quite a few tests on my own too, and didn't see any noticeable difference between tracing enabled, tracing disabled, and master. Good job!

Thank you for running those tests. Yeah, I was thoughtlessly running ES locally -- I'll set it up on another machine next time.

@axw axw force-pushed the trace-apm-server branch from 7fe0326 to c961d66 on May 18, 2018 at 02:22

@jalvz (Contributor) left a comment:

This is great!
I think it would be nice if @simitt and/or @graphaelli could have a quick look.

@axw axw requested review from graphaelli and simitt May 24, 2018 00:45

@axw (Member Author) commented May 29, 2018

I'm going to leave this here until the Go agent hits beta, and then update with the beta release. There will be one or two more breaking changes between now and then.

This will be used by the server for tracing itself.
@axw axw force-pushed the trace-apm-server branch 2 times, most recently from 9fb4b31 to a25f42f on June 18, 2018 at 05:19

@graphaelli (Member) left a comment:

Looks great.

  1. What do you think of setting the user-agent for the pipe producer, to make it clear in the logs where the additional messages are coming from? Currently:
2018-06-18T11:38:24.173-0400	INFO	[request]	beater/handlers.go:248	handled request	{"request_id": "392fd4e8-f5c4-4266-89b5-1f23ec97ec8e", "method": "POST", "URL": "/v1/transactions", "content_length": 881, "remote_address": "pipe", "user-agent": "Go-http-client/1.1", "response_code": 202}
  2. How about not instrumenting the healthcheck endpoint? (A sketch follows below.)
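
A hypothetical sketch of point 2, reusing the illustrative traceHandler wrapper from earlier in this thread (the handler names are made up, not the PR's):

mux := http.NewServeMux()
// Intake endpoints are wrapped, so each request records a transaction...
mux.Handle("/v1/transactions", traceHandler(tracer, "POST /v1/transactions", transactionsHandler))
mux.Handle("/v1/errors", traceHandler(tracer, "POST /v1/errors", errorsHandler))
// ...while the healthcheck is registered unwrapped and never starts one.
mux.Handle("/healthcheck", healthcheckHandler)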

_meta/beat.yml Outdated
@@ -27,6 +27,11 @@ apm-server:
# Maximum number of new connections to accept simultaneously (0 means unlimited)
# max_connections: 0

# Tracing support for the server's HTTP endpoints and event publisher.
tracing:

Member:

should be commented out since it's the default

@axw (Member Author):

fixed

@axw (Member Author) commented Jun 19, 2018

  1. What do you think of setting the user-agent for the pipe producer, to make it clear in the logs where the additional messages are coming from?

SGTM

  2. How about not instrumenting the healthcheck endpoint?

Also SGTM.

Will address both before landing.

@axw (Member Author) commented Jun 19, 2018

What do you think of setting the user-agent for the pipe producer, to make it clear in the logs where the additional messages are coming from?

Actually, we can't do that at the moment, as there's no way to override the user-agent header. I'll create an issue on the agent repo and land without this; I hope you don't mind. If it's an issue, we can update logHandler to check the remote address, but it would be better not to have special-casing in there.

@graphaelli (Member):

Definitely fine with me to land as-is regarding ua.

@axw axw force-pushed the trace-apm-server branch from a25f42f to 94a3b81 on June 19, 2018 at 02:28
axw added 3 commits June 19, 2018 11:17
This commit adds optional/configurable support for
tracing the APM server's own HTTP endpoints and
beats event publisher. Enable tracing by adding
the following to apm-server.yml:

    tracing:
      enabled: true
@axw axw force-pushed the trace-apm-server branch from 94a3b81 to c8dd591 on June 19, 2018 at 03:17
@axw axw merged commit 0b122a0 into elastic:master Jun 19, 2018
@axw axw deleted the trace-apm-server branch June 19, 2018 03:35
@axw axw removed the in progress label Jun 19, 2018