Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ddtrace/tracer,profiler: share agent default URL resolution logic #2719

Merged
merged 6 commits into from
Jul 16, 2024

Conversation

nsrip-dd
Copy link
Contributor

@nsrip-dd nsrip-dd commented May 31, 2024

What does this PR do?

Move the agent default URL resolution logic to internal so it can be shared
by the tracer and profiler.

This is motivated by the profiler failing to connect to the default trace agent
Unix domain socket in some situations. That gets picked up as a benefit of
sharing more logic, so this PR adds a test for that functionality.

Motivation

The ddtrace/tracer library has code to resolve the agent URL from the program
environment, including checking for the default Unix domain socket. Right now
the profiler doesn't check for the default Unix domain socket and so will fail
to send profiles to the right place in some circumstances. The tracer and
profiler should ideally share all this logic.

Reviewer's Checklist

  • Changed code has unit tests for its functionality at or near 100% coverage.
  • System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.
  • Add an appropriate team label so this PR gets put in the right place for the release notes.
  • Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild.

@nsrip-dd nsrip-dd force-pushed the nick.ripley/share-agent-url-resolution branch 2 times, most recently from 2ff5739 to 6aef0fd Compare May 31, 2024 14:53
@pr-commenter
Copy link

pr-commenter bot commented May 31, 2024

Benchmarks

Benchmark execution time: 2024-06-07 13:22:10

Comparing candidate commit b8a865b in PR branch nick.ripley/share-agent-url-resolution with baseline commit e0b2c6d in branch main.

Found 1 performance improvements and 0 performance regressions! Performance is the same for 45 metrics, 1 unstable metrics.

scenario:BenchmarkStartSpanConcurrent-24

  • 🟩 execution_time [-605.019ns; -291.181ns] or [-8.324%; -4.006%]

@nsrip-dd nsrip-dd force-pushed the nick.ripley/share-agent-url-resolution branch from 6aef0fd to 0189499 Compare May 31, 2024 15:22
The ddtrace/tracer library has code to resolve the agent URL from the
program environment, including checking for the default Unix domain
socket, which would be good to use from the profiler. Put all this logic
in internal so that the tracer and profiler can both use it.
@nsrip-dd nsrip-dd force-pushed the nick.ripley/share-agent-url-resolution branch from 0189499 to 8b0cf45 Compare May 31, 2024 15:43
@nsrip-dd nsrip-dd marked this pull request as ready for review May 31, 2024 16:09
@nsrip-dd nsrip-dd requested review from a team as code owners May 31, 2024 16:09
Copy link
Member

@darccio darccio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

profiler/options.go Show resolved Hide resolved
Copy link
Member

@felixge felixge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. LGTM for the most parts, but PTAL at my comments.

// - Then, DD_AGENT_HOST and DD_TRACE_AGENT_PORT if either are set
// - Then, the default UDS path given here, if the path exists
// (the path is replacable for testing purposes, otherwise use DefaultTraceAgentUDSPath)
// - Finally, localhost:8126
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ This description is not accurate. The UDS path takes precedence over DD_AGENT_HOST and DD_TRACE_AGENT_PORT. I made a flow chart that shows the full picture:

agent-resolution drawio

My google drive link for this is here (datadog internal): https://drive.google.com/file/d/134CKDYMwWRaphRD5-JjlZq7u71D79cKt/view?usp=sharing

Maybe you can refactor the code itself to make this more clear as well? In particular moving the os.Getenv("DD_AGENT_HOST") and os.Getenv("DD_TRACE_AGENT_PORT") calls below os.Stat(defaultSocketAPM)?

Copy link
Contributor Author

@nsrip-dd nsrip-dd Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a test case which sets those environment variables as well as creating the (test) UDS file, and confirms that the env vars are used: https://github.com/DataDog/dd-trace-go/pull/2719/files#diff-8db259852f0d34c3adcab6352683e75d796c0618397f08656c197403c6667256R51. Is the test inaccurate?

Copy link
Member

@felixge felixge Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, this agent discovery algorithm is a bit cursed (not your fault : p). I didn't carefully study this line:

	if _, err := os.Stat(defaultSocketAPM); host == "" && port == "" && err == nil {

So in my mind I thought it was:

	if _, err := os.Stat(defaultSocketAPM); err == nil {

But given the actual code, I guess the real discovery algorithm implemented by this function is this one?

image

So either I'm still confused, or I think the comment doesn't quite match the reality. In particular:

Then, DD_AGENT_HOST and DD_TRACE_AGENT_PORT if either are set

If this was correct, then DD_AGENT_HOST= DD_TRACE_AGENT_PORT=1234 should produce :1234 and DD_AGENT_HOST=foo DD_TRACE_AGENT_PORT= should produce foo:. But in reality they produce localhost:1234 and foo:8126 respectively. Right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the diagram you've shared matches how it should work according to this internal RFC: https://github.com/DataDog/architecture/blob/master/rfcs/apm/integrations/trace-autodetect-agent-config/rfc.md#explicit-configuration. The doc comment is incomplete. I've updated the doc comment to specify the defaults if only one of DD_AGENT_HOST or DD_TRACE_AGENT_PORT are defined, which are "localhost" and "8126", respectively. I've also added a few more test cases for this. See b2c2809.

internal/agent.go Outdated Show resolved Hide resolved
if url.Scheme == "unix" {
WithUDS(url.Path)(&c)
} else {
c.agentURL = url.String() + "/profiling/v1/input"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discovery Algorithm Change

Am I correctly understanding that the current profiler agent discovery algorithm looked like this:

image

And we're replacing it with this new algorithm that's used by the tracer already?

image

If yes, I think we should include these graphics in the release notes to make it clear what we're changing. We should also advise people who run into issues with the new algorithm to open a GH issue and downgrade until we have a workaround.

Alternatively we could implement a DD_PROFILING_LEGACY_AGENT_DISCOVERY env var that would allow people to switch back to the old algorithm.

But for now I can't imagine there to be a plausible need for this. What are the chances that an agent socket exists and for some reason that's not where they want to send the profiles, but they're okay with using that socket for traces (or traces are not enabled).

Hostname detection

Unrelated to this PR, but while studying the code I noticed that the tracer is using a sophisticated hostname resolution approach introduced in #1712 when not using a UDS for the connection. I think the profiler should do the same ... and I'm kinda annoyed we were not pinged on the PR that introduced this 😞. Maybe you can create a follow-up task to look into this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discovery Algorithm Change

See the other comment for the intended resolution order. I think we should document this change in the release that introduces. We can specify that it is a bug fix (or enhancement?) for the profiler, to check for the default trace agent Unix domain socket if the agent URL is not explicitly configured. I don't think we'll need a fallback option.

Hostname detection

Opened PROF-9903 for this.

profiler/options.go Show resolved Hide resolved
nsrip-dd added 3 commits June 7, 2024 08:35
The comment describing the agent URL resolution did not specify what
happens if only one of DD_AGENT_HOST or DD_TRACE_AGENT_PORT are defined.
Do so, according to the internal RFC for agent URL resolution. Rearrange
the code so that it more clearly matches the priority order. Before this
commit, the check for whether the host or port are defined was not
clearly visible, hidden away in the check for the existence of the
default UDS path.

Add some more tests to check the host/port behavior. Also, the assert
package functions expect the "want" before the "got", so fix that up in
the test cases.
Suggested in review. Make DefaultTraceAgentUDSPath a variable in the
internal package and have the tracer and profiler tests modify that,
rather than keep their own variables. That way, the actual tracer and
profiler implementations don't need to rely on it.
Specifically set up a handler for /profiling/v1/input to test that we
don't use the filesystem path to the Unix domain socket in the HTTP
request path.
Copy link
Contributor

This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale Stuck for more than 1 month label Jun 28, 2024
@nsrip-dd nsrip-dd removed the stale Stuck for more than 1 month label Jun 28, 2024
@darccio
Copy link
Member

darccio commented Jul 4, 2024

@nsrip-dd @felixge What's the current state for this PR? Thanks!

@felixge felixge self-requested a review July 12, 2024 08:11
@@ -63,6 +63,9 @@ const (
defaultAgentPort = "8126"
)

// Path to the default trace agent Unix domain socket, replaced by tests
var defaultSocketAPM = internal.DefaultTraceAgentUDSPath
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be dead code? I don't see anything referencing it. Tests still pass if I remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's dead code, not needed after 9b525fc. Thanks for catching it.

Copy link
Member

@felixge felixge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🙇🏻‍♂️. Sorry for the delay.

@nsrip-dd nsrip-dd enabled auto-merge (squash) July 16, 2024 12:16
@nsrip-dd nsrip-dd merged commit a8bb469 into main Jul 16, 2024
163 of 164 checks passed
@nsrip-dd nsrip-dd deleted the nick.ripley/share-agent-url-resolution branch July 16, 2024 12:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants