Define data flow directly between components #9077
Also see #9078 as a working proof of concept.
Thank you for putting this together. This is interesting. A couple of ideas.
```yaml
receivers:
  otlp:
    endpoint: ...
    logs_consumers: [redact]
connectors:
  redact:
    ...
    logs_consumers: [routing]
  routing:
    ...
    logs_consumers: [label, batch]
  label:
    ...
    logs_consumers: [hot]
  batch:
    logs_consumers: [cold, count]
  count:
    metrics_consumers: [hot]
exporters:
  hot:
    ...
  cold:
    ...
```

However, it might be hard to follow the data flow in big configuration files.
Not sure I understand this, but I think it should be fairly intuitive in practice. To use an example: let's say we are using the count connector, which can ingest any data type and generates metrics. However, to match your scenario, let's suppose it also generates logs which describe the counts. If we want to send traces and logs to the connector, and have it generate both metrics and logs, it would look like this:

```yaml
service:
  traces:
    otlp: [ count ]  # I want to send spans to the connector
  logs:
    otlp: [ count ]  # I want to send logs to the connector
    count: [ foo ]   # I want the connector to generate logs
  metrics:
    count: [ foo ]   # I want the connector to generate metrics
```

Is there redundancy here? I would argue no, since each edge has a clear and distinct effect on the behavior. In any case, it's no different than what we have now, except that it doesn't require the user to also reason about pipelines:

```yaml
service:
  pipelines:
    traces:
      receivers: [ otlp ]
      exporters: [ count ]
    logs/in:
      receivers: [ otlp ]
      exporters: [ count ]
    logs/out:
      receivers: [ count ]
      exporters: [ foo ]
    metrics:
      receivers: [ count ]
      exporters: [ foo ]
```
Can you elaborate on this? Is it just that there isn't a contextual term which directly indicates that we're defining data flow? If so, then we could easily have
I'm not strictly opposed to this but there is something to be said for being able to define a component in isolation from other components. The implementation for this may be a bit different too, so I can't say with as much confidence that it can be supported on top of the current pipelines system.
Yes,
I've updated the proposal to place edge definitions under
I'm a big fan of this idea in general. In particular, I really like the idea of being able to specify. As noted in the meeting, thinking about things this way makes it a lot easier to imagine combining separate file configurations.
Vector specifies pipelines in a similar-ish way to this; it may be interesting to compare with them (example configuration). It looks like they have each component specify where they receive data from instead of where they are sending data to.
A common complaint today for users is how complicated it is to configure the collector, so I appreciate the effort to make this simpler. The proposed configuration for
Hopefully my comments are helpful and don't detract from the goal here. 👍
There wouldn't be any change in this regard. When you put a processor (or receiver or exporter) into multiple pipelines with different data types, an instance is created per data type. This would stay the same. The only change is that processors would no longer also be instanced per pipeline. Connectors follow a similar pattern to other components, but since they sit in multiple pipelines, there is an instance per ordered pair of data types.
I have received a lot of preliminary feedback, mostly ranging from neutral to positive. Part of this proposal is that this is added behind a feature gate so that users can try it and provide more meaningful feedback. I agree we should make sure to get it right, so we should be open to changing the design based on feedback.
I think this becomes more of a concern if we get to the point where we are thinking about advancing the feature gate. In a scenario where this change is well received and we advance the feature gate, we would need to be very clear about this being an alternative. I think demos and tools will naturally orient towards the style which users prefer. More discussion on this point would definitely be wise. For now, I'm primarily asking the questions: Is this a better user experience? And if so, is it too late to make such an improvement?
I suspect this is mostly a matter of priorities but such tooling is very nontrivial and it's not clear to me that the project would want to commit to maintaining this. It's likely better to have solutions managed by other parties, but of course we should be open to working with them. For example, I recall it being mentioned in the past that an extension could embed such a tool.
That's fair. Do we have a defined commitment to a feature that's behind a feature gate? How locked in are we to supporting a feature in the long run that's behind a feature gate? If it's a big commitment (even behind a feature gate) I'd think we want to be pretty confident in the value. Otherwise, if it's something we can remove without trouble it's a lot easier to introduce it and get feedback later 👍 Most of my UX concerns are based on this.
Totally understand. I don't want to stop progress just because it's not the totally perfect end goal. Relying on outside products that aren't in the same CNCF/OpenTelemetry domain isn't a viable long-term supported option; I just think from a usability perspective that's really the best option. The UX of modifying data pipelines/dataflows directly in YAML has a relatively low ceiling, at least in my opinion.
We have language describing the level of commitment for feature gates here, but basically we have plenty of room to remove it as long as we don't advance the gate past
Good to know. In that case, I don't have major concerns with proceeding at this point. It'd be good to get more feedback from users and figure out documentation and usability concerns during
Is your feature request related to a problem? Please describe.
Connectors provide a variety of capabilities within the collector:
Most importantly, these capabilities respect the collector's pipeline model, which has been used since inception. Prior to connectors, the above problems required novel exceptions to the pipeline model.
While I firmly believe connectors are the correct technical solution for the capabilities above, they clearly introduce a corresponding increase in configuration complexity in order to respect the pipeline model.
Describe the solution you'd like
It should be possible to define data flow directly between components without explicitly managing pipelines. (e.g. Component A sends data to Component B. Component B sends data to Components D and E).
My proposed solution would preserve the pipeline model exactly as it is and build an optional simplified abstraction on top of it. Users could continue using the pre-existing pipelines syntax, or opt into this simplified syntax.
Using an example to illustrate the difference in configuration, consider the following set of connected pipelines:
The current way to configure this would be:
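As a sketch only, using component names assumed from the surrounding discussion (an `otlp` receiver; `redact`, `label`, and `batch` processors; `routing` and `count` connectors; `hot` and `cold` exporters), the pipelines-based configuration might look like:

```yaml
service:
  pipelines:
    logs/in:
      receivers: [otlp]
      processors: [redact]
      exporters: [routing]
    logs/hot:
      receivers: [routing]
      processors: [label]
      exporters: [hot]
    logs/cold:
      receivers: [routing]
      processors: [batch]
      exporters: [cold, count]
    metrics:
      receivers: [count]
      exporters: [hot]
```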
The proposed alternative would look like this:
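A sketch of the proposed edge-style syntax, assuming the same component names used elsewhere in this thread (`otlp`, `redact`, `routing`, `label`, `batch`, `count`, `hot`, `cold`):

```yaml
service:
  logs:
    otlp: [redact]
    redact: [routing]
    routing: [label, batch]
    label: [hot]
    batch: [cold, count]
  metrics:
    count: [hot]
```

Each key names a component from which data flows; the list names the components to which it flows.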
How this can be accomplished
Initially, I propose adding a feature gate, which when enabled results in the following:
automatically wrapped into an equivalent connector type. e.g. a metrics processor becomes a metrics-to-metrics connector. We may enforce that all processors and connectors be declared under a single section (likely
processors
). Or, we can just automatically mergeprocessors
andconnectors
for compatibility with the current paradigm.service::pipelines
may not be specified. Instead, users must specifyservice::traces
,service::metrics
, andservice::logs
. These new sections can be thought of roughly as sets of edges in a component graph.from: [ to1, to2, ... ]
where the key is a component from which data flows and the values are the components to which data flows. In this style, exporters may never be used as keys, and receivers may never be used as values.logs/routing
is a pipeline withreceivers: [routing]
andexporters: [label, batch]
.The net effect of the above is that we autogenerate a set of pipelines, while allowing the user to reason only about the components and the data types which flow between them. The autogenerated pipelines for the above example look like this:
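A sketch of what the autogenerated pipelines might look like, assuming each pipeline is keyed by the component from which it carries data (the actual ID scheme is an implementation detail):

```yaml
service:
  pipelines:
    logs/otlp:
      receivers: [otlp]
      exporters: [redact]
    logs/redact:
      receivers: [redact]
      exporters: [routing]
    logs/routing:
      receivers: [routing]
      exporters: [label, batch]
    logs/label:
      receivers: [label]
      exporters: [hot]
    logs/batch:
      receivers: [batch]
      exporters: [cold, count]
    metrics/count:
      receivers: [count]
      exporters: [hot]
```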
What is actually different between the two?
In short, very very little. The proposed configuration is intended to be as close to syntactic sugar as possible. However, a few nuances should be noted.
Processor Instancing
In the current model, we treat processors a bit differently than other components. Specifically, processors are instanced per pipeline. This has the effect that a single processor configuration may be used in separate pipelines, while still ensuring that data flowing through each processor instance is kept separate. This is a useful property of processors, but I think it is not widely recognized, and therefore may not be widely used.
In the proposed model, where each processor is wrapped into a connector, instances are not unique to a pipeline. Rather, they are shared by pipelines. This is an artifact of the way receivers and exporters work, which was inherited by connectors. The downside then is that if a user wishes to have separate instances of a processor, they must actually specify multiple configurations, even if they are exactly the same. This can be mitigated by using yaml anchors, for example, or if we adopt templated processors in the future. The upside here is that all components now share the same instancing model.
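For example, YAML anchors can keep two deliberately separate instances in sync; the `batch` processor settings below are only illustrative:

```yaml
processors:
  batch/1: &batch_cfg
    send_batch_size: 1000
    timeout: 5s
  batch/2: *batch_cfg  # separate instance, identical settings via the anchor
```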
Processor ID Collisions w/ Receivers & Exporters
One minor constraint placed on the naming of connectors is that there cannot be ambiguity between connector IDs and either receiver or exporter IDs. e.g. If we had a connector and receiver both called `nop/1`, we would not know which is being used in a pipeline configuration. This is automatically detected and rejected at run time, though in practice it is very unlikely to be a problem, as it would first be necessary to share a common "type".

In the proposed model, since processors are automatically wrapped into connectors, they would adopt this limitation as well.
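A sketch of the collision, using hypothetical component IDs:

```yaml
receivers:
  nop/1:
connectors:
  nop/1:  # same ID as the receiver above
service:
  pipelines:
    logs:
      receivers: [nop/1]  # ambiguous reference: receiver or connector? rejected at startup
      exporters: [debug]
```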
Routing Connectors
Routing connectors are those which send data conditionally to downstream pipelines. Currently, these require that pipeline IDs are specified for specific outputs. Fortunately, the data type for a pipeline ID is `component.ID`, so we should be able to accept specific component IDs instead, when using this syntax.

Other thoughts and implications
Data Types
Users must still reason about data types. It is possible that this could lead to some confusion, but I suspect it will be easy to adapt to. Cases which require only a single data type should remain very easy. For those which require multiple types, it is only necessary to understand which data types may be sent to the component, and correspondingly which data types may flow out of it. e.g. I can only send traces to `spanmetrics` and I expect to get metrics out of it.

Component Class Placement
The relationship between component classes is naturally handled by the automatic conversion to pipelines. Simple rules can be made clear. e.g. in the proposed `from: [ to1, to2, ... ]` style, receivers may only be used as keys, and exporters only as values.

Graph Considerations
Cycles and fanouts are automatically handled within the existing pipeline system, so no special handling must be introduced.
Processor / Connector Type Collisions
Some processors and connectors share the same "type". e.g. `spanmetrics`. For this reason, if we merge the `processors` and `connectors` sections in configuration, we could introduce ambiguity. Instead, we may wish to just allow both, and merge them at run time. We would still have to reject ambiguous IDs. e.g. we cannot define a processor called `spanmetrics/foo` and also a connector called the same.

Alternate Edge Definition Models

There are various other ways in which we could define edges. Again, as long as each edge has a clear direction and a data type, we can easily autogenerate corresponding pipelines. Other examples:
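For illustration only (the field names here are invented), edges could instead be declared consumer-side, with each component listing where it receives data from, as in the earlier `logs_consumers` suggestion and Vector's model, or as an explicit edge list:

```yaml
# Consumer-side style: each key lists the components it receives data FROM
service:
  logs:
    redact: [otlp]
    routing: [redact]
---
# Explicit edge list: one entry per directed edge
service:
  logs:
    edges:
      - from: redact
        to: routing
```

Either form carries the same information (a direction and a data type per edge), so the same pipeline autogeneration would apply.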