Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make the rules sample limiter and default sampling rate configurable #215

Merged
merged 19 commits into from
Mar 10, 2022

Conversation

dgoffredo
Copy link
Contributor

@dgoffredo dgoffredo commented Feb 7, 2022

Tip: Tell github to "hide whitespace" for this diff. I moved some try blocks around, which changed indentation.

These revisions address the issue raised in customer issue APMS-6277 ("Attempting to use configuration 'max_traces_per_second' to limit tracing to 5 per second") and the work item OKR2,KR1 ("DD_TRACE_SAMPLE_RATE implemented and documented").

Brett mentioned that in dd-trace-py, the DD_TRACE_SAMPLE_RATE environment variable results in the equivalent sampling rule being appended to any other rules that the user might have specified. This means that agent-led priority sampling only happens if DD_TRACE_SAMPLE_RATE doesn't have a value. Additionally, the rule-based sampler's limiter can be configured using the environment variable DD_TRACE_RATE_LIMIT.

All of that behavior is implemented here.

@dgoffredo dgoffredo changed the title make the rules sample limiter configurable make the rules sample limiter and default sampling rate configurable Feb 7, 2022
@dgoffredo dgoffredo requested a review from cgilmour February 7, 2022 14:20
Copy link
Contributor

@cgilmour cgilmour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small bunch of comments, but generally looks OK to me.
I'll mark this approved after you have considered them and possibly pushed additional changes.

error << name_for_diagnostic << " has trailing non-floating-point characters: " << text;
return ot::make_unexpected(error.str());
}
if (value < minimum || value > maximum) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the value is "NaN" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NaN is out of range, which is the desired behavior but maybe technically odd. I'll let it fail here rather than having a special case forbidding it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I need to check that it's inside the interval, not that it's not outside the interval.

@@ -85,6 +87,36 @@ std::vector<std::string> tokenize_propagation_style(const std::string &input) {
return result;
}

ot::expected<double, std::string> parseDouble(const std::string &text, double minimum,
double maximum,
const char *name_for_diagnostic) try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd typically make it the caller's responsibility to detail the error, but the "it's a parser" exclusion can be applied.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The callers were always saying "$name_for_diagnostic whatever the error condition was," so I combined them. An alternative might be:

    auto maybe_value = parseDouble(analytics_rate, 0.0, 1.0);
    if (!maybe_value) {
      maybe_value.error().insert(0, "While parsing DD_TRACE_ANALYTICS_SAMPLE_RATE: ");
      return ot::make_unexpected(maybe_value.error());
    }

Yes, I'll change it to that.

}

SECTION("dedicated constructor configures based on desired allowed-per-second") {
const double per_second = 23.887;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

23.97 (24FPS) would have been a cooler choice.

REQUIRE(metrics.find("_dd.agent_psr") != metrics.end());
REQUIRE(metrics.find("_dd.rule_psr") != metrics.end());
REQUIRE(metrics.find("_dd.limit_psr") != metrics.end());
REQUIRE(metrics.find("_dd.agent_psr") == metrics.end());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the rates sent by the agent are never used? Or is there a new test required to handle that case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's any configuration now that will cause the priority sampler to be used. There's always a matching rule.

New test for which case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per our conversation yesterday, now I understand what you mean. Have since pushed changes to correct the behavior. I'll update the PR description, too...

@@ -61,45 +78,52 @@ TEST_CASE("tracer options from environment variables") {
{"DD_TRACE_REPORT_HOSTNAME", "true"},
{"DD_TRACE_ANALYTICS_ENABLED", "true"},
{"DD_TRACE_ANALYTICS_SAMPLE_RATE", "0.5"},
{"DD_TAGS", "host:my-host-name,region:us-east-1,datacenter:us,partition:5"}},
{"DD_TAGS", "host:my-host-name,region:us-east-1,datacenter:us,partition:5"},
{"DD_TRACE_RATE_LIMIT", "200"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there tests for invalid values that a user might set for these variables?

uint64_t trace_tags_propagation_max_length = 512;
// `sampling_max_per_second` is the limit on the number of rule-controlled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it max or limit?

{"DD_TAGS", "host:my-host-name,region:us-east-1,datacenter:us,partition:5"}},
{"DD_TAGS", "host:my-host-name,region:us-east-1,datacenter:us,partition:5"},
{"DD_TRACE_RATE_LIMIT", "200"},
{"DD_TRACE_SAMPLE_RATE", "0.7"}},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also just curious: should we test this for different locales (such as ones that use comma instead of period), or would that just be unit-testingstd::stod?

Copy link
Contributor Author

@dgoffredo dgoffredo Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good question. std::stod does what std::strtod does, which does what strtod does, which changes based on the C locale.

My preference is to not think about it, and to hope that the tests never have to run in a different locale. I do wonder about people pulling our code and running the tests, though. Hm.

One option is to interpolate always from a double serialized using to_string.

Let's leave it like this for now.

@dgoffredo
Copy link
Contributor Author

Thanks for taking a look. I pushed some commits addressing your points, and commented on others.

@dgoffredo dgoffredo force-pushed the david.goffredo/configurable-rule-sampling-limit branch 2 times, most recently from 665ea8e to 184d485 Compare February 9, 2022 14:44
@@ -10,7 +10,7 @@ namespace opentracing {

ot::expected<TracerOptions> optionsFromConfig(const char *configuration,
std::string &error_message) {
TracerOptions options{"localhost", 8126, "", "web", "", 1.0};
TracerOptions options;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line almost destroyed me. Hours lost...

@dgoffredo dgoffredo force-pushed the david.goffredo/configurable-rule-sampling-limit branch from 5484578 to 9e03984 Compare February 11, 2022 19:03
@dgoffredo
Copy link
Contributor Author

This is ready for another round of review. See from the commit fix: fallback rule ... onward.

@dgoffredo dgoffredo force-pushed the david.goffredo/configurable-rule-sampling-limit branch from 510c208 to d443d66 Compare March 8, 2022 22:30
@dgoffredo dgoffredo force-pushed the david.goffredo/configurable-rule-sampling-limit branch from d443d66 to 3b2bdcb Compare March 9, 2022 13:02
Copy link
Contributor

@cgilmour cgilmour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few comments for additional consideration.
Thanks!

// traces that may be sampled per second. This includes traces that match
// the implicit "catch-all" rule appended to `sampling_rules`. This option
// is also configurable as the environment variable DD_TRACE_RATE_LIMIT.
double sampling_limit_per_second = 100;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be an integer instead? It's a quantity, and I'm not certain values of 0.2 or 1.5 make sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt it's a use case anyone would want, but I did verify that values less than one have the expected behavior: 0.5 means at most one every two seconds. Might have even added a unit test.

It does no harm and allows for that kind of thing, so I'll keep it.

// by the environment variable DD_PROPAGATION_STYLE_INJECT.
//
// Extra keys will be ignored.
// Accepts configuration as a JSON object. See `optionsFromConfig` in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense - this comment wasn't being maintained when changes in the actual implementation occurred.
Does that part (.cpp) now need similar comments about what each item is for and the expected values, or just "read the code" for that detail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose the latter because I didn't feel like writing the documentation myself, and saw that it might fall out of sync anyway :)

I couldn't find a place in the public documentation, either, that describes everything supported here. One thing to consider is making this a public contract, maintenance and all, and then linking to it from the public docs.

Not today, though.

@@ -85,6 +87,30 @@ std::vector<std::string> tokenize_propagation_style(const std::string &input) {
return result;
}

ot::expected<double, std::string> parseDouble(const std::string &text, double minimum,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for amusement, what happens if minimum or maximum are NaN, -Inf, +Inf etc?
(Don't change any code as a result of this question)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const TracerOptions &input) {
using namespace std::string_literals;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong opinion on this, but for people unfamiliar with the syntax, it's surprising to see the trailing s on the literals.
There's other parts of the code that could be modified to use string literals, but the choice at the time was to be explicit instead of subtle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say the trailing s is both explicit and subtle.

We could adopt a house style that says "avoid use of special string literals, because they are easily confused with built-in literals and may surprise the reader."

I'd prefer to use the literals instead.

@@ -85,4 +85,5 @@ RUN go get github.com/jakm/msgpack-cli
# Copy these last so that edits don't need as many image rebuild steps
COPY ./test/integration/nginx/nginx_integration_test.sh ./
COPY ./test/integration/nginx/expected_*.json ./
# CMD [ "bash", "-x", "./nginx_integration_test.sh"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably left over from some hackering.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, will fix.

"_dd.limit_psr": 1,
"_dd.rule_psr": 1,
"_sampling_priority_v1": 2
"_dd.agent_psr": 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tags might need double-checking for valid combinations of limit_psr, rule_psr and agent_psr being present.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From standup with Caleb:

In this example, those limit/rule metrics are not present. Wasn't there something about the omission of these meaning that the service is labeled "partially configured"? Double-check that to make sure that we don't introduce a case where a tracer with rules that don't match ends up being labeled "partially configured" even though it is "configured."

@dgoffredo
Copy link
Contributor Author

"CodeQL / Analyze (cpp)" takes its time. I didn't consider that when I reviewed the addition. Might be a good idea to get CircleCI to run that in parallel with the integration test and coverage report.

@dgoffredo dgoffredo merged commit 2b017b7 into master Mar 10, 2022
@dgoffredo dgoffredo deleted the david.goffredo/configurable-rule-sampling-limit branch March 10, 2022 17:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants