Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/span limits #816

Closed

Conversation

anshulkumar19
Copy link
Contributor

Fixes #365

Changes

The changes made in this PR addresses the feature of imposing span limits parameters as defined in spec.

For significant contributions please make sure you have completed the following items:

  • CHANGELOG.md updated for non-trivial changes
  • Unit tests have been added
  • Changes in public API reviewed

@anshulkumar19 anshulkumar19 requested a review from a team June 2, 2021 17:38
@codecov
Copy link

codecov bot commented Jun 2, 2021

Codecov Report

Merging #816 (3f14a59) into main (93dd39b) will increase coverage by 0.65%.
The diff coverage is 82.61%.

❗ Current head 3f14a59 differs from pull request most recent head 41c05dc. Consider uploading reports for the commit 41c05dc to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##             main     #816      +/-   ##
==========================================
+ Coverage   95.52%   96.16%   +0.65%     
==========================================
  Files         156      153       -3     
  Lines        6618     6458     -160     
==========================================
- Hits         6321     6210     -111     
+ Misses        297      248      -49     
Impacted Files Coverage Δ
sdk/src/trace/span.h 100.00% <ø> (ø)
...include/opentelemetry/sdk/common/attribute_utils.h 91.43% <50.00%> (-5.79%) ⬇️
sdk/src/trace/tracer_provider.cc 72.23% <50.00%> (ø)
sdk/include/opentelemetry/sdk/trace/recordable.h 100.00% <100.00%> (ø)
sdk/include/opentelemetry/sdk/trace/span_data.h 100.00% <100.00%> (ø)
sdk/src/trace/span.cc 91.57% <100.00%> (+0.11%) ⬆️
sdk/src/trace/tracer.cc 87.10% <100.00%> (ø)
sdk/src/trace/tracer_context.cc 84.00% <100.00%> (+2.19%) ⬆️
sdk/test/resource/resource_test.cc 92.86% <0.00%> (-4.08%) ⬇️
... and 13 more

@@ -114,11 +114,21 @@ class AttributeMap
});
}

const std::size_t GetAttributeMapSize() const noexcept
Copy link
Contributor

@maxgolov maxgolov Jun 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this. Please check my PR here, as it'd be nicer if we simply expose all properties of real std::map-alike container here:

class AttributeMap : public std::unordered_map<std::string, OwnedAttributeValue>

You can simply use size() instead of adding non-standard getter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will pick your changes from attribute_utils.h and modify accordingly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. That works. I'll reset my status to unblock your review.

Copy link
Contributor

@maxgolov maxgolov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be much more elegant if that's just a usual std::map container. I am doing it in the other PR I mentioned in the comment. #771

}

private:
SpanLimits span_limit_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should add SpanLimits as a data member here. It will get allocated every time a new recordable is created ( i.e, every time a new span is created ). And SpanLimits values are not going to change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not intended change, maybe didn't sync correctly.

if (attribute_map_.GetAttributeMapSize() < GetSpanLimits().AttributeCountLimit
|| attribute_map_.KeyExists(key)) {
attribute_map_.SetAttribute(key, value);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is a good idea to drop the attributes silently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.
We can either log every time when attribute is dropped, but that would be excessive logging and is not recommended in the spec.
Second way, we can store the information of how many attributes are added in the span and so the number of attributes dropped. This should be enough information, any opinion on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this won't really work... Because I'd see that class being used not necessarily only for Span Attributes. But for Resource Attributes as well. Do resource attributes have the same limit? I mean - can we move this check elsewhere.. Can we keep it a generic converter-helper class to transform attributes from API AttributeValue to OwnedAttributeValue, and just keep it as generic as possible, without binding it to Span specifics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The limit is specific to Span attributes, links, and events. SDK doesn't own/copy any of these properties internally, so it is not possible to impose this limit within sdk. All these properties are propagated to exporters, and it is up to the exporters to impose the limits. The exporters may choose not to transform these properties to OwnedAttributeValue and instead transform/serialize to (say) JSON format for performance reasons.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lalitb - I think we should have that trigger (limit enforcement) somewhere when we transform into OwnedAttributeValue. I don't have concrete proposals. Maybe we should merge the foundation class for some enforcement, proposed here in this PR. Then find out a good way to bind to that class.


} // namespace trace
} // namespace sdk

Copy link
Member

@lalitb lalitb Jun 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be a class with static methods to return the limits?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are having a global tracer provider and span limits are associated with that only, so that should be better as it will simplify arguments passing. Will make the changes in next commit.

@lalitb
Copy link
Member

lalitb commented Jun 3, 2021

@anshulkumar19 - While I am thinking about the approach to follow for supporting the span limits, the problem I see is that our implementation doesn't copies/owns any of the span properties inside api/sdk, and these properties (links, events, attributes) are propagated directly to exporters. SpanData is one sample implementation of Recordable which does owns data but is meant to be used by the in-memory exporter. This means SDK can't limit the number of these elements going to the exporter. The only way to impose the limit would be within the exporter code, and this is not what is expected from the specification.
Let me know your thoughts on this, and if you see any feasible approach for implementation. As this is an optional feature, we should go ahead with implementation only if it doesn't involve major changes in the way we propagate span properties to exporters.

@maxgolov maxgolov self-requested a review June 3, 2021 23:20
return lock;
}

int AttributeCountLimit = 128;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for naïve question :

  1. are we anticipating Span limits for different exporters (Jaeger, Zipkin, OTLP, or some other custom exporter, say FluentD-MsgPack) to be the same, and following these exact defaults?

  2. If the answer to Q1 is No, then can we have these limits make somehow tunable, and be part of some form of Exporter configuration? This is becoming a bit tricky though. In case of multi-processor, in case if we want to route our telemetry to different exporter, and if every exporter has its own limits..

Apologies if it's a dumb question.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spec mentions these limits set to tracer provider and we have one global tracer provider, so in the current scenario all of these limits should be same.
Moreover, according to my understanding since these limits are there for the safety from erroneous code adding absurd number of attributes to a span so it should be related to span related limits rather than depending upon any exporters. So, once these limits are defined they should be followed for each kind of span (sdk/exporters).

@maxgolov
Copy link
Contributor

maxgolov commented Jun 3, 2021

Reading the spec - these are all soft limits - the recommended limit of 1000 elements. SDKs MAY provide a way to change this limit..., so it is recommended and it MAY change. Question here, so let's say my concrete exporter for a given telemetry flow / system imposes a different limit, other than 1000. Because it MAY, based on spec.

And I fork my data via multi-processor to two different exporters at once. For example:

  • console exporter.
  • my (custom) MsgPack exporter.
  • OTLP exporter.

It seems like then the issue is to configure the soft limits and intercept these externally , beyond the multi-processor , somewhere when Span / Span Events reach concrete exporter. That exporter then should know what limit to impose. In that case if the limit is exceeded, there has to be a mechanism to notify the caller about the failure of a concrete exporter.. Exporter 1 fails (limit exceeded), but exporters 2 and 3 let span+events through. Because their allowed limits MAY BE higher than those of Exporter 1. I don't see limit in fluentd / FluentD forward protocol / MsgPack spec, for example..

Then that failure notification for only one of the exporters - has to be bubbling-up the status code back to the caller itself. This notification would most likely be asynchronous. Because if we are "forking" events to different destinations in a batching thread in the background, that is the only moment we'd actually know which concrete exporter gonna drop and which one gonna let it go? We can't immediately return a failure to the caller.. Unless we iterate for each set of limits - for each exporter, and find beforehand what exporter lets it get thru and what not. Also the other interesting thing, as we merge resource attributes with span attributes, aren't we potentially exceeding the combined limit... Say, I set resource attributes to include some 100 fields, and my physical channel / protocol allows 1000 fields (columns) total, and I add 901 field in my telemetry data - how does the combined sum limit get calc'd.. Appears like in most cases it's something that only lower-level would know, not the API.

I'd like to avoid imposing those soft limit in the API. And find a good way to solve this limiting elsewhere. I think your limiter class, in general, looking Okay. We can merge that. But I'm not sure how to plug it in best.

Taking into account that:

  • there has to be a way to configure the soft-limit(s) differently.
  • soft-limits for different exporter / systems could be different, esp in case of multi-processor.

I'd like to avoid a seemingly easy solution to limit our future ability to operate in more complex, real-world configurations.

I think we should also consider some async callback / error notification. Because the limiting spec says we should to report back the failure(s) in case if soft limit is exceeded. Don't see that in the PR here yet. Because we just don't have that feature in SDK. Maybe we can start with ability to notify user of error(s)? Some form of internal async error callback. Otherwise limiting and not telling back the caller that we dropped something is very confusing.. Probably nearly as bad as accepting the data, exceeding the limit later on somewhere during HTTP POST or gRPC message post, and dropping it. Same result - data loss, users unhappy.

@maxgolov
Copy link
Contributor

maxgolov commented Jun 4, 2021

Regarding my last point on async error callback - notification. I'd like to propose this being added as a generic facility into OpenTelemetry C++ SDK, following this pattern we have here:

https://github.com/microsoft/cpp_client_telemetry/blob/master/lib/include/public/DebugEvents.hpp

https://github.com/microsoft/cpp_client_telemetry/blob/18fb1a1cf1a73077a32140d482719271153eff5a/lib/include/public/DebugEvents.hpp#L137

https://github.com/microsoft/cpp_client_telemetry/blob/18fb1a1cf1a73077a32140d482719271153eff5a/lib/include/public/DebugEvents.hpp#L173

It's a pub-sub model, when caller may receive an "interrupt" error notification - DebugEvent- sent from SDK worker thread(s) in case if something went bad. Caller listens for those DebugEvent notifications via registering DebugEventListener . That way the caller gets their OnDebugEvent(x) invoked, and depending on contents x - e.g. SPAN_LIMIT_EXCEEDED - may decide what to do. As long as they handle those errors wisely, i.e. can't call recursively call SDK from interrupt / error handler thread. The answer then is - we don't print anything, we don't log anything, we let the caller decide what they want to do with those errors. They may ignore these notifications. Or they may transform these into additional layer of telemetry on SDK errors, if they wish to do so. Using whatever choice, either logging these locally or (async) re-dispatching these errors back into local exporter, e.g. stdout.

@lalitb
Copy link
Member

lalitb commented Jun 4, 2021

Regarding my last point on async error callback - notification. I'd like to propose this being added as a generic facility into OpenTelemetry C++ SDK, following this pattern we have here:

https://github.com/microsoft/cpp_client_telemetry/blob/master/lib/include/public/DebugEvents.hpp

https://github.com/microsoft/cpp_client_telemetry/blob/18fb1a1cf1a73077a32140d482719271153eff5a/lib/include/public/DebugEvents.hpp#L137

https://github.com/microsoft/cpp_client_telemetry/blob/18fb1a1cf1a73077a32140d482719271153eff5a/lib/include/public/DebugEvents.hpp#L173

It's a pub-sub model, when caller may receive an "interrupt" error notification - DebugEvent- sent from SDK worker thread(s) in case if something went bad. Caller listens for those DebugEvent notifications via registering DebugEventListener . That way the caller gets their OnDebugEvent(x) invoked, and depending on contents x - e.g. SPAN_LIMIT_EXCEEDED - may decide what to do. As long as they handle those errors wisely, i.e. can't call recursively call SDK from interrupt / error handler thread. The answer then is - we don't print anything, we don't log anything, we let the caller decide what they want to do with those errors. They may ignore these notifications. Or they may transform these into additional layer of telemetry on SDK errors, if they wish to do so. Using whatever choice, either logging these locally or (async) re-dispatching these errors back into local exporter, e.g. stdout.

@maxgolov - Thanks for bringing the error handling scenario, as it has been discussed in past too. Request you to please add your recommendations here #408 . Specs also provide guidance on supporting custom error handlers here.

@maxgolov
Copy link
Contributor

maxgolov commented Jun 4, 2021

@lalitb - guidance is rather soft, it's not a requirement but rather a recommendation. The link you are showing says : These are examples of how end users might register custom error handlers. Examples are for illustration purposes only. Language library authors are free to deviate from these provided that their design matches the requirements outlined above.

I can send in a PR with a proposal for the DebugEvent emitter / DebugEventListener pattern implementation. It allows both - local async subscriptions for SDK errors. Say at Exporter (or maybe at TracerProvider level). Or a subscription to single global debug/error event handler for global lifecycle / state management errors that could occur outside of specific TracerProvider or Exporter.

@lalitb
Copy link
Member

lalitb commented Jun 4, 2021

I can send in a PR with a proposal for the DebugEvent emitter / DebugEventListener pattern implementation. It allows both - local async subscriptions for SDK errors. Say at Exporter (or maybe at TracerProvider level). Or a subscription to single global debug/error event handler for global lifecycle / state management errors that could occur outside of specific TracerProvider or Exporter.

We can discuss the proposal within #408 before implementation. In general, we can start as per specs guidelines and provide the functionality to let applications define custom error handlers ( which can be async/sync as per their requirements), and otel sdk providing global methods to set and get these error handlers. This will also allow the applications to plugin the otel logging api as a custom error handler going ahead. I don't think we should be designing a new error handling framework for this purpose.

@anshulkumar19
Copy link
Contributor Author

@anshulkumar19 - While I am thinking about the approach to follow for supporting the span limits, the problem I see is that our implementation doesn't copies/owns any of the span properties inside api/sdk, and these properties (links, events, attributes) are propagated directly to exporters. SpanData is one sample implementation of Recordable which does owns data but is meant to be used by the in-memory exporter. This means SDK can't limit the number of these elements going to the exporter. The only way to impose the limit would be within the exporter code, and this is not what is expected from the specification.
Let me know your thoughts on this, and if you see any feasible approach for implementation. As this is an optional feature, we should go ahead with implementation only if it doesn't involve major changes in the way we propagate span properties to exporters.

I was about to add the similar constraints in exporter side "SetAttributes" functions so that these limits would be imposed there. Since the spec talks imposing these limits only to the sdk side, so that becomes tricky.
I can see the "sdk_exporters" are defined in the structure of spec (https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/trace). Is it suggesting that these sdk constraints can be implied on exporters as well, because eventually exporters are attached to the spans/traces via sdk?

@lalitb
Copy link
Member

lalitb commented Jun 4, 2021

I was about to add the similar constraints in exporter side "SetAttributes" functions so that these limits would be imposed there. Since the spec talks imposing these limits only to the sdk side, so that becomes tricky.
I can see the "sdk_exporters" are defined in the structure of spec (https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/trace). Is it suggesting that these sdk constraints can be implied on exporters as well, because eventually exporters are attached to the spans/traces via sdk?

With the current otel-cpp design, the only place to filter these span properties would be at the exporter level. And if that is the case, a better approach would be to pass these limit configuration during exporter startup along with other exporter options. Eg in Zipkin, it would be here:

struct ZipkinExporterOptions
{
// The endpoint to export to. By default the OpenTelemetry Collector's default endpoint.
std::string endpoint = GetDefaultZipkinEndpoint();
TransportFormat format = TransportFormat::kJson;
std::string service_name = "default-service";
std::string ipv4;
std::string ipv6;
};

Each exporter would have different memory requirements, and thus different configuration values. This is something not defined by specs, and let exporters decide how they want to implement it.

But for now, I don't think we should implement this optional requirement at the SDK level when there is nothing to filter there at SDK.

@pyohannes, @maxgolov - any thoughts?

@maxgolov
Copy link
Contributor

maxgolov commented Jun 4, 2021

With the current otel-cpp design, the only place to filter these span properties would be at the exporter level..

Yes, I agree with that.

@anshulkumar19
Copy link
Contributor Author

@lalitb @maxgolov As per suggestions for not implementing this optional feature at SDK level, closing the PR for now.

@lalitb
Copy link
Member

lalitb commented Jun 7, 2021

Thanks for your work @anshulkumar19 - as @maxgolov suggested this may be utilized in the future if we plan to have an upper bound for OwnedAttributeValue for a specific use-case. We can revisit this PR at that time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding span collection limits
3 participants