[exporter/loadbalancing] Add top level sending_queue, retry_on_failure and timeout settings #36094
Conversation
Looks like it will need `make fmt` to be run to fix up a couple of lines.
This looks great; the approach is sound. Just a couple of talking points below. Thank you for contributing this!
// If the top-level queue is enabled, the per-endpoint queue must be disabled.
// This helps us avoid unexpected issues with mixing two levels of exporter queues.
if cfg.QueueSettings.Enabled {
	oCfg.QueueConfig.Enabled = false
}
Looking at `ConsumeTraces()` (and it would be similar for metrics and logs), this is the code for when it exports:
for exp, td := range exporterSegregatedTraces {
	start := time.Now()
	err := exp.ConsumeTraces(ctx, td)
	exp.consumeWG.Done()
	errs = multierr.Append(errs, err)
	...
}
Now that the `exp` sub-exporter has no queue, won't `exp.ConsumeTraces()` block until exporting is finished? This would, I assume, make export times much slower because it has to export the routed data in series. This could take quite a long time, especially if there are hundreds of backends to export to, likely causing a timeout.

What might be necessary is an `errgroup` or similar which can execute these now-blocking calls to the exporters concurrently. I would be interested to hear your thoughts, or maybe @jpkrohling's or @MovieStoreGuy's thoughts on this.
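For illustration, a rough sketch of that idea, assuming the `exporterSegregatedTraces` map holds `consumer.Traces` values (the component's real map key is its own wrapper type) and with an arbitrary concurrency limit:

```go
package loadbalancingexporter // sketch only; types approximated

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/ptrace"
	"golang.org/x/sync/errgroup"
)

// consumeAllParallel fans the now-blocking ConsumeTraces calls out with an
// errgroup so sub-exporters without a queue still export in parallel rather
// than in series across hundreds of backends.
func consumeAllParallel(ctx context.Context, exporterSegregatedTraces map[consumer.Traces]ptrace.Traces) error {
	g, gctx := errgroup.WithContext(ctx)
	g.SetLimit(8) // bound concurrency; the right limit would need benchmarking

	for exp, td := range exporterSegregatedTraces {
		exp, td := exp, td // capture loop variables for the closures (pre-Go 1.22)
		g.Go(func() error {
			return exp.ConsumeTraces(gctx, td)
		})
	}

	// Unlike the multierr loop above, errgroup returns only the first error.
	return g.Wait()
}
```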
Switching to an export queue also means you get the queue workers, which allow parallel processing and provide a layer of latency hiding.

Including a new "queue" worker here could be hazardous depending on how it is done: if you create a goroutine for each split batch, you can end up with a high amount of goroutine scheduling, which can cause a significant performance hit. And if you instead do an async send on channels, you reintroduce the original risk this change solves, since the queue moves from the current queue structure to a buffered channel that needs to be shared again.

For this iteration, I would suggest leaving it out of this change and keeping it as a backlog issue to "investigate and address if required".
Missed that part.
Then yes, disabling the sub-exporter queue in code might be too risky for all users.
I have removed those lines.
# (Optional) One or more lines of additional information to render under the primary note.
# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext: OTLP sub-exporter queue will be automatically disabled if loadbalancing exporter queue is enabled to avoid data loss
Now that you've consolidated all the exporter queues into one, their overall size is smaller. This could lead to dropped data in some deployments, since you are going from `n*queueSize` to `1*queueSize`. I don't think this is necessarily a breaking change, but I think it should be mentioned in the changelog so users can increase the queue size if they run into issues.
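For example, assuming the exporterhelper default `queue_size` of 1000 (worth verifying against the collector version in use) and 10 resolved backends, the previous layout could buffer up to 10 × 1000 = 10,000 batches across the sub-exporter queues, while the single consolidated queue holds only 1,000 unless `sending_queue::queue_size` is raised accordingly.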
I've updated the changelog to reflect this.
@@ -349,35 +344,6 @@ func TestConsumeTracesUnexpectedExporterType(t *testing.T) {
	assert.EqualError(t, res, fmt.Sprintf("unable to export traces, unexpected exporter type: expected exporter.Traces but got %T", newNopMockExporter()))
}

func TestBuildExporterConfig(t *testing.T) {
How difficult would it be to add a test that verifies data is not lost if an exporter is taken out of the ring? Could be useful since I don't see much verification of this behaviour
This sounds more like an E2E test than a unit test, because in the current implementation I couldn't find a way to control sub-exporters from within the test routine. So I assume the complexity of such a test is too high and it might require a big refactoring of the whole component.
done
Fix go.mod
I really like this solution! I think it only needs more explicit documentation and better care for current users explicitly setting the resiliency options.
@@ -90,6 +90,7 @@ Refer to [config.yaml](./testdata/config.yaml) for detailed examples on using th
* `traceID`: Routes spans based on their `traceID`. Invalid for metrics.
* `metric`: Routes metrics based on their metric name. Invalid for spans.
* `streamID`: Routes metrics based on their datapoint streamID. That's the unique hash of all it's attributes, plus the attributes and identifying information of its resource, scope, and metric data
* loadbalancing exporter supports set of standard [queuing, batching, retry and timeout settings](https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md)
This needs more documentation: what happens when a failure occurs and then the ring is changed? Will it be directed to a new backend (yes!)? This expected behavior should be explicitly documented to our users.
@@ -32,20 +40,94 @@ func createDefaultConfig() component.Config {
	otlpDefaultCfg.Endpoint = "placeholder:4317"

	return &Config{
		TimeoutSettings: exporterhelper.NewDefaultTimeoutConfig(),
This behavior should be clearly documented. I think it would even make sense to log a warning in case users try to use that option.
That said, there ARE users relying on this feature at the moment. What should we do about them? I think we should copy their current values to the load-balancer level, which would give them roughly the same desired behavior they have today.
So, here's how it could work:
- if the OTLP Exporter options do not include the resiliency options, use our new defaults
- if there are, copy the ones from the OTLP section into the top-level config, and set the OTLP one to the default values, writing a log message stating that this is happening
- if there are options at both levels, the ones at the load-balancing level win, as that's the newest one and we can assume it's the latest intent from the user (but log a WARN in this case, it's serious enough)
I don't think we need a feature flag or deprecation plan for this.
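To make those precedence rules concrete, here is a minimal, self-contained sketch; the `resiliency` struct and `reconcile` function are purely illustrative stand-ins, not the component's actual types or API:

```go
package main

import "fmt"

// resiliency is a toy stand-in for the queue/retry/timeout settings under
// discussion; it only exists to illustrate the suggested precedence rules.
type resiliency struct {
	QueueEnabled   bool
	RetryEnabled   bool
	TimeoutSeconds int
}

// reconcile applies the suggested rules: if only the OTLP section is customized,
// promote it to the top level (and log it); if both levels are customized, the
// loadbalancing level wins (and a WARN is emitted); otherwise keep the top level.
func reconcile(topLevel, otlp, defaults resiliency) (top, sub resiliency, logMsg string) {
	switch {
	case otlp == defaults:
		// OTLP section untouched: keep whatever is configured at the top level.
		return topLevel, defaults, ""
	case topLevel == defaults:
		// Only the OTLP section is customized: copy it up, reset the OTLP section.
		return otlp, defaults, "INFO: moved resiliency settings from the otlp section to the loadbalancing exporter"
	default:
		// Customized at both levels: the loadbalancing-level settings win.
		return topLevel, defaults, "WARN: resiliency settings found at both levels; using the loadbalancing-level values"
	}
}

func main() {
	defaults := resiliency{QueueEnabled: true, RetryEnabled: true, TimeoutSeconds: 5}
	// A user who only ever configured the otlp section keeps their behavior,
	// just lifted to the loadbalancing level.
	top, sub, msg := reconcile(defaults, resiliency{QueueEnabled: true, RetryEnabled: false, TimeoutSeconds: 30}, defaults)
	fmt.Println(top, sub, msg)
}
```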
Description
Problem statement
The `loadbalancing` exporter is actually a wrapper that creates and manages a set of actual `otlp` exporters.
Those `otlp` exporters technically share the same configuration parameters that are defined on the `loadbalancing` exporter level, including the `sending_queue` configuration. The only difference is the `endpoint` parameter, which is substituted by the `loadbalancing` exporter itself.
This means that `sending_queue`, `retry_on_failure` and `timeout` settings can be defined only on the `otlp` sub-exporters, while the top-level `loadbalancing` exporter is missing all those settings.
This configuration approach produces several issues that have already been reported by users:
- Persistent queue cannot be used with the `loadbalancing` exporter (see [loadbalancing-exporter] enabling exporterhelper persistent queue breaks loadbalancer exporting #16826). This happens because the `otlp` sub-exporters share the same configuration, including the configuration of the queue, i.e. they all use the same `storage` instance at the same time, which is not possible at the moment.
- Data can be lost despite the `sending_queue` configuration (see [exporter/loadbalancing] Data loss occurring when change in downstream DNS records or pods #35378). This happens because the queue is defined at the level of the `otlp` sub-exporters, and if such an exporter cannot flush data from its queue (for example, the endpoint is not available anymore), there is no option other than discarding the data from the queue, i.e. there is no higher-level queue and persistent storage where data can be returned in case of a permanent failure.

There might be other potential issues, already tracked, that relate to the current configuration approach.
Proposed solution
The easiest way to solve the issues above is to use the standard approach for queue, retry and timeout configuration via `exporterhelper` (a rough sketch of that wiring follows the list below).
This will bring queue, retry and timeout functionality to the top level of the `loadbalancing` exporter, instead of the `otlp` sub-exporters.
Related to the mentioned issues, it will bring:
- a higher-level queue and persistent storage where data can be returned from the `otlp` sub-exporters (not directly, of course)
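For context, the standard `exporterhelper` wiring generally looks like the sketch below: the component keeps its routing/consume logic, and the factory wraps it with the top-level queue, retry and timeout options. The constructor and option names (`NewTraces`, `WithQueue`, `WithRetry`, `WithTimeout`) are taken from recent collector versions and have shifted across releases; `newTracesExporter` and the `RetryConfig` field are assumptions rather than this PR's exact code:

```go
package loadbalancingexporter // illustrative sketch, not the PR's exact code

import (
	"context"

	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/exporter"
	"go.opentelemetry.io/collector/exporter/exporterhelper"
)

// createTracesExporter shows the usual exporterhelper wiring: the routing logic
// stays inside the component, while queue, retry and timeout are applied at the
// top level via exporterhelper options.
func createTracesExporter(ctx context.Context, set exporter.Settings, cfg component.Config) (exporter.Traces, error) {
	c := cfg.(*Config)
	lb, err := newTracesExporter(set, c) // assumed constructor of the existing component
	if err != nil {
		return nil, err
	}

	return exporterhelper.NewTraces(ctx, set, cfg,
		lb.ConsumeTraces,
		exporterhelper.WithStart(lb.Start),
		exporterhelper.WithShutdown(lb.Shutdown),
		// Top-level resiliency settings now live on the loadbalancing exporter itself.
		exporterhelper.WithTimeout(c.TimeoutSettings),
		exporterhelper.WithRetry(c.RetryConfig),   // field name assumed
		exporterhelper.WithQueue(c.QueueSettings), // field name assumed
	)
}
```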
Noticeable changes
- The top-level `loadbalancing` exporter now uses `exporterhelper` with all the functionality it supports.
- `sending_queue` will be automatically disabled on the `otlp` exporters when it is already present on the top-level `loadbalancing` exporter. This change is done to prevent data loss on the `otlp` exporters, because the queue there doesn't provide the expected result. It will also prevent potential misconfiguration on the user's side and, as a result, irrelevant reported issues.
- The `exporter` attribute for metrics generated from the `otlp` sub-exporters now includes the endpoint, for better visibility and to segregate them from the top-level `loadbalancing` exporter: it was `"exporter": "loadbalancing"`, now it is `"exporter": "loadbalancing/127.0.0.1:4317"`.
- The `otlp` sub-exporters now include an additional `endpoint` attribute with the endpoint value, for the same reasons as for metrics.
Link to tracking issue
Fixes #35378
Fixes #16826
Testing
The proposed changes were heavily tested on a large K8s environment with a set of different scale-up/scale-down events using the persistent queue configuration. No data loss was detected; everything works as expected.
Documentation
`README.md` was updated to reflect the newly available configuration parameters. The sample `config.yaml` was updated as well.