Throughput of resque jobs reduced when datadog is enabled #511

Closed
NobodysNightmare opened this issue Aug 15, 2018 · 8 comments

NobodysNightmare commented Aug 15, 2018

Hey there,

we have been using Datadog in production for a few weeks now. Lately we noticed that one of our Resque-based processes took several hours instead of finishing within 15 minutes, as we were used to.

Sadly, I think we can pinpoint the problem to ddtrace.

Context

Our process involves executing ~2,500 Resque jobs. They are all enqueued at the start of the process, and as soon as all of them have finished, we are done.

They are executed on two different application servers, where each app server has 5 workers available.
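For reference, the shape of the batch is roughly like the sketch below; the class and queue names are made up for illustration, not our actual code.

```ruby
require 'resque'

# Hypothetical job class standing in for our real jobs.
class ReportChunkJob
  @queue = :report_chunks

  def self.perform(chunk_id)
    # ... the actual work for one chunk ...
  end
end

# All ~2,500 jobs are enqueued up front and then drained by
# 2 app servers x 5 Resque workers each.
2500.times { |i| Resque.enqueue(ReportChunkJob, i) }
```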

After introducing Datadog, we noticed that the process took much longer than it used to. Because we did not immediately link the introduction of Datadog to the slowdown, we let it run slowly a few times while trying to fix the issue by improving the performance of our own jobs.

We run in this environment:

ruby 2.3.1
ddtrace 0.13.1

Diagnosis

For today's run, we disabled ddtrace (`c.tracer enabled: false`) on one app server (A) while keeping it enabled on the other app server (B).
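The relevant part of our initializer on server A looked roughly like this. The job class is the hypothetical one from the sketch above, and the exact Resque integration options may differ between ddtrace versions.

```ruby
# config/initializers/datadog.rb on server A (tracer disabled for this run).
require 'ddtrace'

Datadog.configure do |c|
  # Server B keeps the default (enabled: true).
  c.tracer enabled: false

  # The 0.x Resque integration is pointed at the job classes to instrument.
  c.use :resque, workers: [ReportChunkJob]
end
```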

In general, we noticed that the process was much faster again; we finished within 5 minutes in total. Thanks to a second monitoring source (a gem of our own that writes to a logfile, shipped to an ELK stack), we could also see how many jobs were finished per server:

| Server | Jobs processed |
| --- | --- |
| A (ddtrace disabled) | 2,108 (~85%) |
| B (ddtrace enabled) | 363 (~15%) |

This means the server without ddtrace processed nearly 6 times as many jobs.

Our logging also provides execution times per job. I am not sure how much I can trust these numbers, since they are based on before and after hooks on the Resque job; theoretically, ddtrace could have its own hooks that kick in before or after our logging mechanism and thus make the total runtime of the worker higher than what we measure. Still, we can see a difference here:

[screenshot: per-job execution times from our logging, by server]

What's visible is that the Datadog-enabled server has some reportedly very long-running executions, while the other server has no exceptionally high execution times at all.
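For context, the timing hooks behind these numbers work roughly like the following sketch; it is a simplified illustration, not our actual gem, and the module name and log format are made up.

```ruby
require 'resque'

# Resque treats class methods named before_perform_* / after_perform_* as
# hooks around each job, so hooks registered by ddtrace may run outside
# this measured window.
module JobTiming
  def before_perform_timing(*_args)
    @job_started_at = Time.now
  end

  def after_perform_timing(*_args)
    duration = Time.now - @job_started_at
    Resque.logger.info("job=#{name} duration_s=#{duration.round(3)}")
  end
end

# Usage: extend the job class so the hooks become class methods, e.g.
#   class ReportChunkJob
#     extend JobTiming
#   end
```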

For a few of the job executions I have traces available in Datadog; the number of spans for these traces is between 50 and 100, and all of them fall into the bucket of less than 1 second of execution time. I wouldn't be surprised to also see some jobs with higher span counts (low thousands), but those should be the exception.

This is the latency distribution as per datadog:

[screenshot: latency distribution as reported by Datadog]

I also tried zooming that graph out and comparing it to our numbers. Especially at the higher runtimes, the data does not align very well:

[screenshot: Datadog latency distribution, zoomed out]

vs

[screenshot: execution times from our own logging]

You can see that Datadog shows no visible results above 2.5 seconds (even though the p99 indicates that something has to follow), while our own reporting sees at least a few executions in that range.

Sadly, we don't have any insight into which part of the tracing is so slow. I know that tracing comes at a cost, but I would not expect it to be this high.

delner (Contributor) commented Aug 15, 2018

Hey @NobodysNightmare, thanks for the thorough report. These details are really helpful in framing the scope of the issue and will definitely be of use in getting to the bottom of this.

Most of your jobs look very short and might be affected by a change we've introduced in 0.14.0 under #474. As a possible super-quick (but maybe too optimistic) fix, I'd try that out, if you can, to see if it resolves the issue.

If that doesn't resolve things in a satisfactory way, it would be helpful to see how 0.14.0 affects this by updating what you have above with new distributions or other diagnostics, and going from there. If we reach that point, @pawelchcki might be a great resource.

delner self-assigned this on Aug 15, 2018
delner added the bug, integrations, and community labels on Aug 15, 2018
pawelchcki (Contributor) commented:

Before #474, we were using the SyncWriter class to perform writes before the worker exits.

This worked well enough with Resque workers that do not use threads; however, it turned out that with threads this writer can severely degrade performance, which is probably what your observations show. The data is probably also misaligned because the added latency comes from SyncWriter, which runs after the span is closed. Based on this, it seems very likely that your issue will be solved by upgrading to 0.14.0.

We were quite cautious when rolling out #474 to ensure that short-running jobs (<1 s) would still be able to quickly flush all data before shutting down.
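Conceptually, the difference is something like the sketch below; this is an illustration of the general idea, not ddtrace's actual internals, and `ship_to_agent` merely stands in for the HTTP call to the trace agent.

```ruby
# Synchronous flavour: the flush runs on the worker thread after the span
# is closed, so it delays the next job even though per-job timing hooks
# never see it.
class SyncishWriter
  def initialize(transport)
    @transport = transport
  end

  def write(trace)
    @transport.ship_to_agent(trace) # blocks until the agent responds
  end
end

# Asynchronous flavour: traces are queued and shipped from a background
# thread, so the worker can start its next job immediately.
class AsyncishWriter
  def initialize(transport)
    @transport = transport
    @queue = Queue.new
    @thread = Thread.new { loop { @transport.ship_to_agent(@queue.pop) } }
  end

  def write(trace)
    @queue << trace # returns immediately
  end
end
```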

NobodysNightmare (Author) commented:

Okidoki, I will give 0.14 a try (you released that minutes after my issue, didn't you? :) )...
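For anyone following along, the upgrade should amount to bumping the gem and running `bundle update ddtrace`; the version constraint below is just the obvious one.

```ruby
# Gemfile
gem 'ddtrace', '~> 0.14'
```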

delner (Contributor) commented Aug 16, 2018

Maybe... 😉

NobodysNightmare (Author) commented:

Preliminary data from another service that had throughput issues indicates that 0.14.1 has significantly better performance than 0.13.1.

I will give you an updated comparison for the process detailed above, but I expect much better results there too :)

delner (Contributor) commented Aug 24, 2018

Good to hear, @NobodysNightmare! Keep us posted with your findings!

NobodysNightmare (Author) commented:

Our last run of the process had a higher throughput with datadog than without it.

This either means that you added magical unicorns and Datadog now actually makes our application faster, or that any differences are within the margins of measurement accuracy (which are probably rather wide for the given process).

I consider this issue fixed :)

delner (Contributor) commented Sep 10, 2018

@NobodysNightmare I'll be sure to update our patch notes to explain our implementation of magical unicorns.

But I'm glad you're happy with it now; thanks for working through this one with us! This kind of feedback is really helpful.
