Throughput of resque jobs reduced when datadog is enabled #511

Closed
NobodysNightmare opened this issue Aug 15, 2018 · 8 comments

NobodysNightmare commented Aug 15, 2018

Hey there,

we have been using Datadog in production for a few weeks now. Lately we noticed that one of our Resque-based processes took several hours instead of finishing within 15 minutes, as we were used to.

Sadly, I think we can pinpoint the problem to ddtrace.

Context

Our process involves executing ~2,500 Resque jobs. They are all enqueued at the start of the process, and as soon as all of them have finished, we are done.

They are executed on two different application servers, where each app server has 5 workers available.
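For reference, the shape of the batch is roughly like the sketch below; the class and queue names are made up for illustration, not our actual code.

```ruby
require 'resque'

# Hypothetical job class standing in for our real jobs.
class ReportChunkJob
  @queue = :report_chunks

  def self.perform(chunk_id)
    # ... the actual work for one chunk ...
  end
end

# All ~2,500 jobs are enqueued up front and then drained by
# 2 app servers x 5 Resque workers each.
2500.times { |i| Resque.enqueue(ReportChunkJob, i) }
```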

After introducing Datadog, we noticed that the process took much longer than it used to. Because we did not immediately link the introduction of Datadog to the slowdown, we let it run slowly a few times while trying to fix the issue by improving the performance of our own jobs.

We run in this environment:

ruby 2.3.1
ddtrace 0.13.1

Diagnosis

For today's run, we disabled ddtrace (`c.tracer enabled: false`) on one app server (A) while keeping it enabled on the other app server (B).
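The relevant part of our initializer on server A looked roughly like this. The job class is the hypothetical one from the sketch above, and the exact Resque integration options may differ between ddtrace versions.

```ruby
# config/initializers/datadog.rb on server A (tracer disabled for this run).
require 'ddtrace'

Datadog.configure do |c|
  # Server B keeps the default (enabled: true).
  c.tracer enabled: false

  # The 0.x Resque integration is pointed at the job classes to instrument.
  c.use :resque, workers: [ReportChunkJob]
end
```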

In general, we noticed that the process was much faster again; we finished within 5 minutes in total. Thanks to a second monitoring source (a gem of our own that writes to a logfile, shipped to an ELK stack), we could also see how many jobs were finished per server:

| Server | Jobs processed |
| --- | --- |
| A (ddtrace disabled) | 2,108 (~85%) |
| B (ddtrace enabled) | 363 (~15%) |

This means the server without ddtrace processed nearly 6 times as many jobs.

Our logging also provides execution times per job. I am not sure how much I can trust these numbers, since they are based on before and after hooks on the Resque job; theoretically, ddtrace could have its own hooks that kick in before or after our logging mechanism and thus make the total runtime of the worker higher than what we measure. Still, we can see a difference here:

[screenshot: per-job execution times from our logging, by server]

What's visible is that the Datadog-enabled server has some reportedly very long-running executions, while the other server has no exceptionally high execution times at all.
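For context, the timing hooks behind these numbers work roughly like the following sketch; it is a simplified illustration, not our actual gem, and the module name and log format are made up.

```ruby
require 'resque'

# Resque treats class methods named before_perform_* / after_perform_* as
# hooks around each job, so hooks registered by ddtrace may run outside
# this measured window.
module JobTiming
  def before_perform_timing(*_args)
    @job_started_at = Time.now
  end

  def after_perform_timing(*_args)
    duration = Time.now - @job_started_at
    Resque.logger.info("job=#{name} duration_s=#{duration.round(3)}")
  end
end

# Usage: extend the job class so the hooks become class methods, e.g.
#   class ReportChunkJob
#     extend JobTiming
#   end
```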

For a few of the job executions I have traces available in Datadog; the number of spans for these traces is between 50 and 100, and all of them fall into the bucket of less than 1 second of execution time. I wouldn't be surprised to also see some jobs with higher span counts (low thousands), but those should be the exception.

This is the latency distribution as per datadog:

[screenshot: latency distribution as reported by Datadog]

I also tried zooming that graph out and comparing it to our numbers. Especially at the higher runtimes, the data does not align very well:

[screenshot: Datadog latency distribution, zoomed out]

vs

[screenshot: execution times from our own logging]

You can see that Datadog shows no visible results above 2.5 seconds (even though the p99 indicates that something has to follow), while our own reporting sees at least a few executions in that range.

Sadly, we don't have any insight into which part of the tracing is so slow. I know that tracing comes at a cost, but I would not expect it to be this high.

delner (Contributor) commented Aug 15, 2018

Hey @NobodysNightmare, thanks for the thorough report. These details are really helpful in framing the scope of the issue and will definitely be of use in getting to the bottom of this.

Most of your jobs look very short and might be affected by a change we've introduced in 0.14.0 under #474. As a possible super-quick (but maybe too optimistic) fix, I'd try that out, if you can, to see if it resolves the issue.

If that doesn't resolve things in a satisfactory way, it would be helpful to see how 0.14.0 affects this by updating what you have above with new distributions or other diagnostics, and going from there. If we reach that point, @pawelchcki might be a great resource.

delner self-assigned this on Aug 15, 2018
delner added the bug, integrations, and community labels on Aug 15, 2018
pawelchcki (Contributor) commented:

Before #474, we were using the SyncWriter class to perform writes before the worker exits.

This worked well enough with Resque workers that do not use threads; however, it turned out that with threads this writer can severely degrade performance, which is probably what your observations show. The data is probably also misaligned because the added latency comes from SyncWriter, which runs after the span is closed. Based on this, it seems very likely that your issue will be solved by upgrading to 0.14.0.

We were quite cautious when rolling out #474 to ensure that short-running jobs (<1 s) would still be able to quickly flush all data before shutting down.
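Conceptually, the difference is something like the sketch below; this is an illustration of the general idea, not ddtrace's actual internals, and `ship_to_agent` merely stands in for the HTTP call to the trace agent.

```ruby
# Synchronous flavour: the flush runs on the worker thread after the span
# is closed, so it delays the next job even though per-job timing hooks
# never see it.
class SyncishWriter
  def initialize(transport)
    @transport = transport
  end

  def write(trace)
    @transport.ship_to_agent(trace) # blocks until the agent responds
  end
end

# Asynchronous flavour: traces are queued and shipped from a background
# thread, so the worker can start its next job immediately.
class AsyncishWriter
  def initialize(transport)
    @transport = transport
    @queue = Queue.new
    @thread = Thread.new { loop { @transport.ship_to_agent(@queue.pop) } }
  end

  def write(trace)
    @queue << trace # returns immediately
  end
end
```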

NobodysNightmare (Author) commented:

Okidoki, I will give 0.14 a try (you released that minutes after my issue, didn't you? :) )...
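For anyone following along, the upgrade should amount to bumping the gem and running `bundle update ddtrace`; the version constraint below is just the obvious one.

```ruby
# Gemfile
gem 'ddtrace', '~> 0.14'
```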

delner (Contributor) commented Aug 16, 2018

Maybe... 😉

NobodysNightmare (Author) commented:

Preliminary data from another service that had throughput issues indicates that 0.14.1 has significantly better performance than 0.13.1.

I will give you an updated comparison for the process detailed above, but I expect much better results there too :)

delner (Contributor) commented Aug 24, 2018

Good to hear, @NobodysNightmare! Keep us posted with your findings!

NobodysNightmare (Author) commented:

Our last run of the process had a higher throughput with datadog than without it.

This either means that you added magical unicorns and Datadog now actually makes our application faster, or that any differences are within the margins of measurement accuracy (which are probably rather wide for the given process).

I consider this issue fixed :)

delner (Contributor) commented Sep 10, 2018

@NobodysNightmare I'll be sure to update our patch notes to explain our implementation of magical unicorns.

But I'm glad you're happy with it now; thanks for working through this one with us! This kind of feedback is really helpful.
