Define data flow directly between components #9077

Open
djaglowski opened this issue Dec 11, 2023 · 13 comments
Labels: discussion-needed (Community discussion needed)

Comments

@djaglowski (Member) commented Dec 11, 2023

Is your feature request related to a problem? Please describe.

Connectors provide a variety of capabilities within the collector:

  • Sequencing of pipelines, including merging and replicating data streams. (e.g. process data, replicate, process more in differing ways)
  • Conditional data flow (e.g. routing, failover)
  • Deriving new data streams from old (e.g. span metrics)
  • Correlated data processing (e.g. sample by resource)

Most importantly, these capabilities respect the collector's pipeline model, which has been used since inception. Prior to connectors, the above problems required novel exceptions to the pipeline model.

While I firmly believe connectors are the correct technical solution for the capabilities above, they clearly introduce a corresponding increase in configuration complexity in order to respect the pipeline model.

Describe the solution you'd like

It should be possible to define data flow directly between components without explicitly managing pipelines. (e.g. Component A sends data to Component B. Component B sends data to Components C and D.)

My proposed solution would preserve the pipeline model exactly as it is and build an optional simplified abstraction on top of it. Users could continue using the pre-existing pipelines syntax, or opt into this simplified syntax.

Using an example to illustrate the difference in configuration, consider the following set of connected pipelines:

[Diagram: the set of connected pipelines produced by the configuration below]

The current way to configure this would be:

service:
  pipelines:
    logs/in:
      receivers: [otlp]
      processors: [redact]
      exporters: [routing]
    logs/cold:
      receivers: [routing]
      processors: [batch]
      exporters: [count, cold]
    logs/hot:
      receivers: [routing]
      processors: [label]
      exporters: [hot]
    metrics/hot:
      receivers: [count]
      exporters: [hot]

The proposed alternative would look like this:

service:
  dataflow:
    logs:
      otlp:    [redact]
      redact:  [routing]
      routing: [label, batch]
      label:   [hot]
      batch:   [count, cold]
    metrics:
      count:   [hot]

How this can be accomplished

Initially, I propose adding a feature gate, which when enabled results in the following:

  1. Processors and connectors are merged into a single class of components. Specifically, each processor type would be automatically wrapped into an equivalent connector type, e.g. a metrics processor becomes a metrics-to-metrics connector. We may enforce that all processors and connectors be declared under a single section (likely processors). Or, we can just automatically merge processors and connectors for compatibility with the current paradigm.
  2. service::pipelines may not be specified. Instead, users define edges under service::dataflow, grouped into traces, metrics, and logs sections. These new sections can be thought of roughly as sets of edges in a component graph.
  3. The exact syntax for defining edges should be debated, but importantly, every edge must have both a direction and a data type. As a concrete proposal, I suggest from: [ to1, to2, ... ] where the key is a component from which data flows and the values are the components to which data flows. In this style, exporters may never be used as keys, and receivers may never be used as values.
  4. The set of edges are automatically interpreted into a set of connected pipelines. The data type of the pipeline comes from the type sections in which the edge is listed. The name of the pipeline is autogenerated. In the specific edge format proposed above, the name could be that of the component from which data flows. e.g. logs/routing is a pipeline with receivers: [routing] and exporters: [label, batch].

The net effect of the above is that we autogenerate a set of pipelines, while allowing the user to reason only about the components and the data types which flow between them. The autogenerated pipelines for the above example look like this:

[Diagram: the autogenerated pipelines for the example above]
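For reference, here is a sketch of those autogenerated pipelines in today's syntax, applying the naming rule from point 4 (each pipeline is named after the component from which data flows; the exact generated names are an assumption):

service:
  pipelines:
    logs/otlp:
      receivers: [otlp]
      exporters: [redact]
    logs/redact:
      receivers: [redact]
      exporters: [routing]
    logs/routing:
      receivers: [routing]
      exporters: [label, batch]
    logs/label:
      receivers: [label]
      exporters: [hot]
    logs/batch:
      receivers: [batch]
      exporters: [count, cold]
    metrics/count:
      receivers: [count]
      exporters: [hot]

Note that redact, label, and batch appear in receiver and exporter positions here because, under the feature gate, processors are wrapped into connectors.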

What is actually different between the two?

In short, very little. The proposed configuration is intended to be as close to syntactic sugar as possible. However, a few nuances should be noted.

Processor Instancing

In the current model, we treat processors a bit differently than other components. Specifically, processors are instanced per pipeline. This has the effect that a single processor configuration may be used in separate pipelines, while still ensuring that data flowing through each processor instance is kept separate. This is a useful property of processors, but I think it is not widely recognized, and therefore may not be widely used.

In the proposed model, where each processor is wrapped into a connector, instances are not unique to a pipeline. Rather, they are shared by pipelines. This is an artifact of the way receivers and exporters work, which was inherited by connectors. The downside then is that if a user wishes to have separate instances of a processor, they must actually specify multiple configurations, even if they are exactly the same. This can be mitigated by using yaml anchors, for example, or if we adopt templated processors in the future. The upside here is that all components now share the same instancing model.
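As a rough sketch of the YAML-anchor mitigation, assuming a user wants two separate but identically configured instances of the batch processor (the names and settings here are hypothetical):

processors:
  batch/hot: &batch_settings  # anchor the shared settings once
    send_batch_size: 8192
    timeout: 5s
  batch/cold: *batch_settings # alias reuses the settings for a second, separate instance

Each ID can then be wired independently in the dataflow section, yielding two distinct instances that share one set of settings.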

Processor ID Collisions w/ Receivers & Exporters

One minor constraint placed on the naming of connectors is that there cannot be ambiguity between connector IDs and either receiver or exporter IDs. e.g. If we had a connector and receiver both called nop/1, we would not know which is being used in a pipeline configuration. This is automatically detected and rejected at run time, though in practice it is very unlikely to be a problem, as it would first be necessary to share a common "type".

In the proposed model, since processors are automatically wrapped into connectors, they would adopt this limitation as well.
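To make the constraint concrete, a config along these lines would be rejected at run time because the ID nop/1 is ambiguous (the foo exporter is hypothetical):

receivers:
  nop/1:
connectors:
  nop/1:
service:
  dataflow:
    logs:
      nop/1: [foo] # ambiguous: is this the receiver or the connector?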

Routing Connectors

Routing connectors are those which send data conditionally to downstream pipelines. Currently, these require that pipeline IDs are specified for specific outputs. Fortunately, the data type for a pipeline ID is component.ID, so we should be able to accept specific component IDs instead when using this syntax.
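For instance, based on the routing connector's existing table syntax (the routing condition here is hypothetical), the change would amount to listing component IDs where pipeline IDs are listed today:

connectors:
  routing:
    table:
      - statement: route() where attributes["tier"] == "cold"
        pipelines: [batch] # a component ID rather than a pipeline ID, under the proposed syntax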

Other thoughts and implications

Data Types

Users must still reason about data types. It is possible that this could lead to some confusion, but I suspect it will be easy to adapt to. Cases which require only a single data type should remain very easy. For those which require multiple types, it is only necessary to understand which data types may be sent to the component, and correspondingly which data types may flow out of it. e.g. I can only send traces to spanmetrics and I expect to get metrics out of it.
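Continuing the spanmetrics example, the edges a user would write look like this (the otlp and prometheus components are assumptions for illustration):

service:
  dataflow:
    traces:
      otlp: [spanmetrics]       # spanmetrics only accepts traces
    metrics:
      spanmetrics: [prometheus] # and only emits metrics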

Component Class Placement

The relationship between component classes is naturally handled by the automatic conversion to pipelines. Simple rules can be made clear. e.g. in the proposed from: [ to1, to2, ... ] style, receivers may only be used as keys, and exporters only as values.

Graph Considerations

Cycles and fanouts are already handled within the existing pipeline system, so no special handling needs to be introduced.

Processor / Connector Type Collisions

Some processors and connectors share the same "type". e.g. spanmetrics. For this reason, if we merge the processors and connectors sections in configuration, we could introduce ambiguity. Instead, we may wish to just allow both, and merge them at run time. We would still have to reject ambiguous IDs. e.g. Cannot define a processor called spanmetrics/foo and also a connector called the same.
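For example, under merged (or auto-merged) sections, the following definitions would collide and be rejected (the name suffix is hypothetical):

processors:
  spanmetrics/foo:
    ...
connectors:
  spanmetrics/foo: # rejected: ambiguous with the processor of the same ID
    ...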

Alternate Edge Definition Models

There are various other ways in which we could define edges. Again, as long as each edge has a clear direction and a data type, we can easily autogenerate corresponding pipelines. Other examples:

# proposed model - from 1 : to N
service:
  dataflow:
    logs:
      from: [ to1, to2, ...]

# inverse of proposed model - to 1 : from N
service:
  dataflow:
    logs:
      to: [ from1, from2, ... ]

# matrix model - from N : to M
service:
  dataflow:
    logs:
      - from: [ from1, from2, ... ]
        to: [ to1, to2, ... ]

# singular edges - from 1 : to 1
service:
  dataflow:
    logs:
      from1: to1
      from2: to2

# mixed model
service:
  dataflow:
    logs:
      from:
        from1: [ to1, to2, ... ]
      to:
        to1: [ from1, from2, ... ]
@djaglowski (Member Author) commented:

Also see #9078 as a working proof of concept.

@djaglowski added the discussion-needed label Dec 11, 2023
@dmitryax (Member) commented Dec 12, 2023

Thank you for putting this together. This is interesting. A couple of ideas.

  1. It's a bit unclear how the edges between different types are resolved. Let's say I have connectors supporting multiple types on both input and output. If I connect them together with the suggested configuration between particular types, I'd have to define several configurations per type explicitly to avoid ambiguity, right?

  2. Having this under service::[logs|metrics|trace] is not very intuitive compared to service::pipelines. Maybe we can consider having metrics_consumers, traces_consumers and logs_consumers fields right in the configuration interface of all the components except for exporters where applicable? This would address (1) as well I believe. Something like:

receivers:
  otlp:
    endpoint: ...
    logs_consumers: [redact]
connectors:
  redact:
    ...
    logs_consumers: [routing]
  routing:
    ...
    logs_consumers: [label, batch]
  label:
    ...
    logs_consumers: [hot]
  batch:
    logs_consumers: [cold, count]
  count:
    metrics_consumers: [hot]
    
exporters:
  hot:
    ...
  cold:
    ...

However, it might be hard to follow the data flow in big configuration files.

@djaglowski (Member Author) commented:

It's a bit unclear how the edges between different types are resolved. Let's say I have connectors supporting multiple types on both input and output. If I connect them together with the suggested configuration between particular types, I'd have to define several configurations per type explicitly to avoid ambiguity, right?

Not sure I understand this, but I think it should be fairly intuitive in practice. To use an example:

Let's say we are using the count connector, which can ingest any data type and generates metrics. However, to match your scenario, let's suppose it also generates logs which describe the counts.

If we want to send traces and logs to the connector, and have it generate both metrics and logs, it would look like this:

service:
  traces:
    otlp: [ count ] # I want to send spans to the connector
  logs:
    otlp: [ count ] # I want to send logs to the connector
    count: [ foo ] # I want the connector to generate logs
  metrics:
    count: [ foo ] # I want the connector to generate metrics

Is there redundancy here? I would argue no, since each edge has a clear and distinct effect on the behavior. In any case, it's no different than what we have now, except that it doesn't require the user to also reason about pipelines.

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      exporters: [ count ]
    logs/in:
      receivers: [ otlp ]
      exporters: [ count ]
    logs/out:
      receivers: [ count ]
      exporters: [ foo ]
    metrics:
      receivers: [ count ]
      exporters: [ foo ]

@djaglowski (Member Author) commented:

Having this under service::[logs|metrics|trace] is not very intuitive compared to service::pipelines.

Can you elaborate on this? Is it just that there isn't a contextual term which directly indicates that we're defining data flow? If so, then we could easily have service::dataflow or something equivalent.

Maybe we can consider having metrics_consumers, traces_consumers and logs_consumers fields right in the configuration interface of all the components except for exporters where applicable?

I'm not strictly opposed to this but there is something to be said for being able to define a component in isolation from other components. The implementation for this may be a bit different too, so I can't say with as much confidence that it can be supported on top of the current pipelines system.

@dmitryax (Member) commented:

Is it just that there isn't a contextual term which directly indicates that we're defining data flow? If so, then we could easily have service::dataflow or something equivalent.

Yes, service::dataflow sounds better

@djaglowski (Member Author) commented:

I've updated the proposal to place edge definitions under service::dataflow

@kentquirk (Member) commented:

I'm a big fan of this idea in general. In particular, I really like the idea of being able to specify from and to, because when people think about receivers they generally want to specify where data goes to, while for exporters they think about where data comes from.

As noted in the meeting, thinking about things this way makes it a lot easier to imagine combining separate file configurations.

@mx-psi (Member) commented Dec 14, 2023

Vector specifies pipelines in a similar-ish way to this, so it may be interesting to compare with them (example configuration). It looks like they have each component specify where they receive data from instead of where they send data.

@crobert-1 (Member) commented Dec 15, 2023

A common complaint from users today is how complicated it is to configure the collector, so I appreciate the effort to make this simpler. The proposed configuration for routing, in particular, is much clearer than today's.

  1. I'd like to try to understand the processor instances point a bit better. Today, we already have a single processor instance getting data from multiple sources, as you can have multiple receivers in a single pipeline. Are there any concerns about having a single processor instance work on multiple data types? Is there work required for this to work? Any new concern around locking or modifying data in parallel?
  2. Have you been able to ask very many users for thoughts on this solution? In my mind a popular vote or user survey would determine the real value of this, since it's mostly just UX. For some use-cases this seems more clear to me, and then in others less clear. I'd want to make sure that if we're introducing a whole new pipeline alternative that it's something that really is helpful, and not just an alternative configuration that adds more complexity, if that makes sense.
  3. Do you have thoughts on how to communicate this new configuration functionality to users? One concern I have is that if we add support for this we'll have to really think about how we present basic demos and configuration options to new users. Being presented with two ways to do the same thing can be a bit overwhelming for someone just trying to get started. We also have tons of example config files floating around, so this wouldn't be found by very many people unless they're searching, or we communicate it effectively. Just some thoughts, hopefully not too far off track here.
  4. The clearest way for me to understand how data is flowing is to look at the image you've included. Is there a reason we can't just support a GUI that can visualize the data pipeline, then spit out the proper pipeline configuration? Something like otelbin.io, but the other way around? Or even just direct users to put configurations in otelbin.io more often to visually validate it's set up the way they hope? We could even propose an enhancement there to support the other direction of visualization if we wanted.

Hopefully my comments here are helpful and don't detract from the goal here. 👍

@djaglowski (Member Author) commented:

I'd like to try to understand the processor instances point a bit better. Today, we already have a single processor instance getting data from multiple sources, as you can have multiple receivers in a single pipeline. Are there any concerns about having a single processor instance work on multiple data types? Is there work required for this to work? Any new concern around locking or modifying data in parallel?

There wouldn't be any change in this regard. When you put a processor (or receiver or exporter) into multiple pipelines with different data types, an instance is created per data type. This would stay the same. The only change is that processors would no longer also be instanced per pipeline. Connectors follow a similar pattern to other components, but since they sit in multiple pipelines, there is an instance per ordered pair of data types.
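To illustrate with today's syntax, here is a sketch where the count connector sits in three pipelines but results in only two instances, one per ordered pair of data types (the foo exporter is hypothetical):

service:
  pipelines:
    traces/in:
      receivers: [otlp]
      exporters: [count] # instance 1: traces -> metrics
    logs/in:
      receivers: [otlp]
      exporters: [count] # instance 2: logs -> metrics
    metrics/out:
      receivers: [count] # both instances emit into this pipeline
      exporters: [foo]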

Have you been able to ask very many users for thoughts on this solution? In my mind a popular vote or user survey would determine the real value of this, since it's mostly just UX. For some use-cases this seems more clear to me, and then in others less clear. I'd want to make sure that if we're introducing a whole new pipeline alternative that it's something that really is helpful, and not just an alternative configuration that adds more complexity, if that makes sense.

I have received a lot of preliminary feedback, mostly ranging from neutral to positive. Part of this proposal is that this is added behind a feature gate so that users can try it and provide more meaningful feedback. I agree we should make sure to get it right, so we should be open to changing the design based on feedback.

Do you have thoughts on how to communicate this new configuration functionality to users? One concern I have is that if we add support for this we'll have to really think about how we present basic demos and configuration options to new users. Being presented with two ways to do the same thing can be a bit overwhelming for someone just trying to get started. We also have tons of example config files floating around, so this wouldn't be found by very many people unless they're searching, or we communicate it effectively. Just some thoughts, hopefully not too far off track here.

I think this becomes more of a concern if we get to the point where we are thinking about advancing the feature gate. In a scenario where this change is well received and we advance the feature gate, we would need to be very clear about this being an alternative. I think demos and tools will naturally orient towards the style which users prefer. More discussion on this point would definitely be wise. For now, I'm primarily asking the questions: Is this a better user experience? If so, is it too late to make such an improvement?

The clearest way for me to understand how data is flowing is to look at the the image you've included. Is there a reason we can't just support a GUI that can visualize the data pipeline, then spit out the proper pipeline configuration? Something like otelbin.io, but the other way around? Or even just direct users to put configurations in otelbin.io more often to visually validate it's set up the way they hope? We could even propose an enhancement there to support the other direction of visualization if we wanted.

I suspect this is mostly a matter of priorities but such tooling is very nontrivial and it's not clear to me that the project would want to commit to maintaining this. It's likely better to have solutions managed by other parties, but of course we should be open to working with them. For example, I recall it being mentioned in the past that an extension could embed such a tool.

@crobert-1 (Member) commented:

Part of this proposal is that this is added behind a feature gate so that users can try it and provide more meaningful feedback.

That's fair. Do we have a defined commitment to a feature that's behind a feature gate? How locked in are we to supporting a feature in the long run that's behind a feature gate? If it's a big commitment (even behind a feature gate) I'd think we want to be pretty confident in the value. Otherwise, if it's something we can remove without trouble it's a lot easier to introduce it and get feedback later 👍 Most of my UX concerns are based on this.

I suspect this is mostly a matter of priorities but such tooling is very nontrivial and it's not clear to me that the project would want to commit to maintaining this. It's likely better to have solutions managed by other parties, but of course we should be open to working with them. For example, I recall it being mentioned in the past that an extension could embed such a tool.

Totally understand. I don't want to stop progress just because it's not the perfect end goal. Relying on outside products that aren't in the same CNCF/OpenTelemetry domain isn't a viable long-term option; I just think that, from a usability perspective, visual tooling is really the best option. The UX of modifying data pipelines/dataflows directly in YAML has a relatively low ceiling, at least in my opinion.

@djaglowski (Member Author) commented:

Do we have a defined commitment to a feature that's behind a feature gate? How locked in are we to supporting a feature in the long run that's behind a feature gate? If it's a big commitment (even behind a feature gate) I'd think we want to be pretty confident in the value. Otherwise, if it's something we can remove without trouble it's a lot easier to introduce it and get feedback later 👍 Most of my UX concerns are based on this.

We have language describing the level of commitment for feature gates here, but basically we have plenty of room to remove the feature as long as we don't advance the gate past alpha, and even a beta gate leaves some room as well.

@crobert-1 (Member) commented:

We have language describing the level of commitment for feature gates here, but basically we have plenty of room to remove the feature as long as we don't advance the gate past alpha, and even a beta gate leaves some room as well.

Good to know. In that case, I don't have major concerns with proceeding at this point. It'd be good to get more feedback from users and figure out documentation and usability concerns during alpha.
