Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix graphql.resolve span durations #4387

Merged
merged 5 commits into from
Jun 7, 2024
Merged

Fix graphql.resolve span durations #4387

merged 5 commits into from
Jun 7, 2024

Conversation

crysmags
Copy link
Collaborator

@crysmags crysmags commented Jun 7, 2024

What does this PR do?

This pull request closes #3925

Motivation

From PR #3925:
What does this PR do?
Fix a bug in the graphql plugin that causes all graphql.resolve spans to have incorrect durations
Motivation
When looking at APM traces for graphql requests, we noticed that the graphql.resolve spans all seem to extend until the end of the parent graphql.execute span, rather than when that field resolver actually completes. This makes it hard to identify which specific field resolvers are performance bottlenecks of a graphql query.
Example 1 - note that all graphql.resolve spans extend to roughly the same end time, even for simple fields that should be trivially fast to resolve:

2024-06-07_10-53-43

Example 2 - on another, much more complex graphql query, it's even more apparent that something is wrong and causing all spans to extend until the entire graphql.execute span finishes; I would expect most of the purple bars to end earlier, and only one or a few long-tail field resolver spans would extend till the end.
Screenshot 2024-01-03 at 11 58 34 PM

Additional Notes
I believe this was a regression introduced in PR #3177, which refactored TracingPlugin subclasses to move the common span.finish() calls into the TracingPlugin base class.

However, the change to the graphql plugin's resolve operation in that PR stopped passing the finishTime argument from the finish method through to span.finish(); the base TracingPlugin.finish method does not use/accept any arguments and calls span.finish() with no args. Due to this change, span.finish() is being called without a finish time argument, and thus falls back to using the current time as the finish time.

Why does this result in all resolve spans finishing at the end of the execute span? If I'm understanding the code correctly, it looks like the graphql plugin defers finishing the resolve operations until the entire execute operation has completed - here. Although this code is still publishing the field's recorded finishTime to the finish channel, it is no longer being passed through to the span (as discussed above), so all the resolve spans are marked as finished at the current time, after the overall execute has completed.

Testing
The existing graphql plugin tests didn't appear to explicitly validate the span durations (other than a basic greater-than-zero check), so I made some guesses at how best to validate span durations within the existing test framework in order to add a unit test for this fix.

The new test introduces some new graphql field resolvers that intentionally take a while (via fake timers) to resolve:

2 async fields use setTimeout with different values to validate the duration recorded of concurrent graphql resolver promises
a 3rd new field directly ticks the clock as a synchronous resolver function to validate the duration recorded for a synchronous resolver
New test - fails as expected when run without the fix in resolve.js -- the faster async resolver's span is incorrectly recorded as taking 1234 milliseconds instead of the expected 100:
Screenshot 2024-01-05 at 10 38 37 PM

New test - passes with the changes to resolver.js:
Screenshot 2024-01-05 at 10 38 53 PM

Plugin Checklist

Additional Notes

@crysmags crysmags requested review from a team as code owners June 7, 2024 00:49
Copy link

github-actions bot commented Jun 7, 2024

Overall package size

Self size: 6.62 MB
Deduped: 61.88 MB
No deduping: 62.16 MB

Dependency sizes

name version self size total size
@datadog/native-appsec 8.0.1 15.59 MB 15.6 MB
@datadog/native-iast-taint-tracking 2.1.0 14.91 MB 14.92 MB
@datadog/pprof 5.3.0 9.85 MB 10.22 MB
protobufjs 7.2.5 2.77 MB 6.56 MB
@datadog/native-iast-rewriter 2.3.1 2.15 MB 2.24 MB
@opentelemetry/core 1.14.0 872.87 kB 1.47 MB
@datadog/native-metrics 2.0.0 898.77 kB 1.3 MB
@opentelemetry/api 1.8.0 1.21 MB 1.21 MB
import-in-the-middle 1.7.4 70.19 kB 739.86 kB
msgpack-lite 0.1.26 201.16 kB 281.59 kB
opentracing 0.14.7 194.81 kB 194.81 kB
semver 7.5.4 93.4 kB 123.8 kB
pprof-format 2.1.0 111.69 kB 111.69 kB
@datadog/sketches-js 2.1.0 109.9 kB 109.9 kB
lodash.sortby 4.7.0 75.76 kB 75.76 kB
lru-cache 7.14.0 74.95 kB 74.95 kB
ignore 5.2.4 51.22 kB 51.22 kB
int64-buffer 0.1.10 49.18 kB 49.18 kB
shell-quote 1.8.1 44.96 kB 44.96 kB
istanbul-lib-coverage 3.2.0 29.34 kB 29.34 kB
tlhunter-sorted-set 0.1.0 24.94 kB 24.94 kB
limiter 1.1.5 23.17 kB 23.17 kB
dc-polyfill 0.1.4 23.1 kB 23.1 kB
retry 0.13.1 18.85 kB 18.85 kB
jest-docblock 29.7.0 8.99 kB 12.76 kB
crypto-randomuuid 1.0.0 11.18 kB 11.18 kB
path-to-regexp 0.1.7 6.78 kB 6.78 kB
koalas 1.0.2 6.47 kB 6.47 kB
module-details-from-path 1.0.3 4.47 kB 4.47 kB

🤖 This report was automatically generated by heaviest-objects-in-the-universe

@pr-commenter
Copy link

pr-commenter bot commented Jun 7, 2024

Benchmarks

Benchmark execution time: 2024-06-07 14:21:37

Comparing candidate commit ff62ffd in PR branch crysmags/pr-3925 with baseline commit 6c43686 in branch master.

Found 1 performance improvements and 0 performance regressions! Performance is the same for 257 metrics, 8 unstable metrics.

scenario:plugin-graphql-with-depth-off-18

  • 🟩 max_rss_usage [-139.114MB; -74.622MB] or [-14.707%; -7.889%]

@crysmags crysmags force-pushed the crysmags/pr-3925 branch from 029aa81 to ff62ffd Compare June 7, 2024 14:08
Copy link

codecov bot commented Jun 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.42%. Comparing base (02dc5ce) to head (ff62ffd).
Report is 1 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #4387       +/-   ##
===========================================
+ Coverage   69.19%   80.42%   +11.23%     
===========================================
  Files           1        3        +2     
  Lines         198      373      +175     
  Branches       33       33               
===========================================
+ Hits          137      300      +163     
- Misses         61       73       +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@crysmags crysmags changed the title Crysmags/pr 3925 Fix graphql.resolve span durations Jun 7, 2024
@crysmags crysmags merged commit 929cc16 into master Jun 7, 2024
133 checks passed
@crysmags crysmags deleted the crysmags/pr-3925 branch June 7, 2024 15:14
ida613 pushed a commit that referenced this pull request Jun 7, 2024
* Fix graphql.resolve span duration

---------

Co-authored-by: Scott Bezek <[email protected]>
@ida613 ida613 mentioned this pull request Jun 7, 2024
ida613 pushed a commit that referenced this pull request Jun 7, 2024
* Fix graphql.resolve span duration

---------

Co-authored-by: Scott Bezek <[email protected]>
@ida613 ida613 mentioned this pull request Jun 7, 2024
ida613 pushed a commit that referenced this pull request Jun 10, 2024
* Fix graphql.resolve span duration

---------

Co-authored-by: Scott Bezek <[email protected]>
ida613 added a commit that referenced this pull request Jun 10, 2024
ida613 pushed a commit that referenced this pull request Jun 10, 2024
* Fix graphql.resolve span duration

---------

Co-authored-by: Scott Bezek <[email protected]>
@scottvanta
Copy link
Contributor

scottvanta commented Jul 1, 2024

Thanks @crysmags for getting the fix from #3925 merged (I was on paternity leave, so apologies for suddenly going AWOL on that PR)!

I'm a little confused by the changes to the tests though:

I'm still not sure why the mocked time on my PR worked locally but not in CI, but for the sake of test coverage on span.duration, I might recommend considering the approach I originally implemented (before being asked to mock time), which uses some small delays to force/check span.duration values. i.e. essentially revert this commit back to the version on the left: 798670e

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants