Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Change Proposal] Add support for synthetic source #340

Closed
jsoriano opened this issue May 20, 2022 · 16 comments · Fixed by #419
Closed

[Change Proposal] Add support for synthetic source #340

jsoriano opened this issue May 20, 2022 · 16 comments · Fixed by #419
Assignees
Labels
discuss Issue needs discussion enhancement New feature or request Team:Ecosystem Label for the Packages Ecosystem team

Comments

@jsoriano
Copy link
Member

jsoriano commented May 20, 2022

Elasticsearch is going to introduce a new feature called synthetic source. It allows to ingest documents without keeping the source, and rebuild it if needed from the indexed data. Not storing the source produces significative disk space savings.

We may decide in the future to enable synthetic source everywhere, but in the meantime we should have a way to enable it per data stream.

Slightly Blocked by: elastic/kibana#132818

@jsoriano jsoriano added the discuss Issue needs discussion label May 20, 2022
@jlind23 jlind23 added the Team:Ecosystem Label for the Packages Ecosystem team label May 20, 2022
@ruflin
Copy link
Member

ruflin commented May 23, 2022

++ having this available in the package spec. Ideally, it would also be possible to turn it on / off in the Fleet UI somewhere.

@jlind23 jlind23 added the enhancement New feature or request label May 24, 2022
@joshdover
Copy link
Contributor

joshdover commented Jun 15, 2022

This is a 3rd use case for expanding our proposal in elastic/kibana#132818 to allow enabling this on an integration or data stream level for testing before we allow packages to enable this by default.

@andresrc
Copy link

My preferred approach as of today would be:

Initial Phase - enabling integrations

  • We would have a boolean configuration option at the data stream level in order to enable synthetic source.
    • The default value would initially be false so that we keep the current behaviour by default.
    • If set to true Fleet would add the appropriate info to the mapping (and roll over if needed).
    • We would need to know if previous versions of Fleet would just ignore the option, to understand if we need to bump kibana.version when adding this option or not.
  • Integrations would be able to gradually opt-in just by releasing a new version of the packages

Future phase

  • We might have options in Fleet to dynamically change this setting. Granularity would need to be defined.
  • The default might change at some point.

@jsoriano
Copy link
Member Author

@andresrc the current idea we had as of elastic/kibana#132818 is the following:

  • Nothing is added to the spec by now.
  • The advanced opt-in flag is added in Fleet, so users (and developers) can optionally enable synthetic source on any data stream.
  • Depending on feedback/results:
    • If no problems or limitations are found, we remove the opt-in flag, and we enable synthetic source everywhere.
    • If it depends on the use case, we add the flag to the spec, with a default to be determined.

This is functionally similar to what you are suggesting, but we wouldn't need to add a new kind of configuration, and this feature would be available in all data streams, without needing changes on integrations.

@andresrc
Copy link

Thanks @jsoriano , that approach looks better, yes.

The questions is: Would the flag be global or per-data stream? I think it should be per-data stream as folks might be relying on _source on some custom queries / visualizations etc., and if the synthetic one is not exactly the same this would be a breaking change. So we should be able to control this per-data stream.

In the future, we could add the setting the package so that they can set the default values (e.g, we should use it for every new data stream, as there is no risk of breaking change)

@jsoriano
Copy link
Member Author

Would the flag be global or per-data stream? I think it should be per-data stream as folks might be relying on _source on some custom queries / visualizations etc., and if the synthetic one is not exactly the same this would be a breaking change. So we should be able to control this per-data stream.

It seems that the current plan is to implement this flag at the package level (see elastic/kibana#132818 (comment)). Do you think this would be enough granularity?

In the future, we could add the setting the package so that they can set the default values (e.g, we should use it for every new data stream, as there is no risk of breaking change)

Yep, if we see any risk of breaking changes, we should add a setting to ensure backwards compatibility. But as the feature is being designed now, it should be backwards compatibility, except for some known cases we may not have in integrations.

@andresrc
Copy link

In the future, we could add the setting the package so that they can set the default values (e.g, we should use it for every new data stream, as there is no risk of breaking change)

Better than global for sure. It will depend on wether we find there's a different impact on logs and metrics, but it might be a reasonable compromise.

@joshdover
Copy link
Contributor

I think it should be per-data stream as folks might be relying on _source on some custom queries / visualizations etc., and if the synthetic one is not exactly the same this would be a breaking change.

Won't this be a breaking change for all data streams then? We don't know how users are using this in their own custom visualizations or external systems. If we flip this on by default on a package upgrade, we can/will break things for some users.

I think we have to make this an opt-in feature for existing packages / data streams and consider making it the default in the next major version. Now that said, I think we could get away with making it a breaking change in the package rather than in the Stack. The problem is we don't have a UI flow for helping users through package major version upgrades.

@jsoriano
Copy link
Member Author

It'd be good to confirm how many users of our Fleet-based solutions are using _source as is now. If we find that this feature is beneficial for the majority, it would be a pity that we have to delay its general introduction for these corner cases. For users of _source, it may also happen that the synthetic source also works for them.

The opt-in flag is a good tool to help users and developers test the results for the cases where _source is used.

When we decide to enable it, we may also decide to do it by default in new policies only, letting it disabled on existing policies and upgrades, so users need to explicitly enable it in potentially breaking cases.

I wouldn't leave the decision to package developers (that is making this a flag in the package). What uses cases would we cover with this? This would open the gate to packages that make long-term use of _source, something that we may want discourage if we see important benefits on not storing it. And this doesn't avoid producing breaking changes, as users of specific packages would find the same issues if they use _source.

@jsoriano
Copy link
Member Author

@tommyers-elastic has found that one use we do of _source in integrations is in ML jobs.

@alvarezmelissa87 do you think that these jobs would work with synthetic source?

@ruflin
Copy link
Member

ruflin commented Aug 26, 2022

I expect with elastic/kibana#132818 to soon have a way for us to enable synthetic source per data stream. Experiments on existing data streams have shown very promising results. Because of this, I'm convinced we should start the discussion on the package-spec flag.

For now synthetics sources works well for metrics. My ideal scenario would be:

  • Existing data streams stay as is until the flag is overwritten in the manifest
  • Any new metrics data stream that is added, to packages has source turned off by default but it could be overwritten if needed

I wonder how we could make the above possible with default but stay compatible. Different package-spec versions?

@jsoriano
Copy link
Member Author

I'm convinced we should start the discussion on the package-spec flag.

Is the use of synthetic source a decision to make by the package developer? If it is, we need a flag. If it isn't, then I don't think we should put anything in the spec.

I think we need to wait for some more results before making a decision here. My general impression at the moment is that we are going to be able to enable it in all metrics data streams by default, specially given that efforts are being done to overcome current limitations as the issues with ignore_above. See also #340 (comment).

When we take the step of enabling synthetic source by default I am ok though with keeping the setting in the fleet UI, allowing users to keep the source if they want it for some reason, and keeping installed data streams as they are to avoid breaking current uses of the source. This decision doesn't affect the behavior of the package, so I think it is ok if users can take it.
But we can also take the product decision of enabling it in all metrics data streams and don't leave a escape hatch, in any case this would not be a decision to make by each package developer.

(As a counter-example of who should make the decision, we have TSDB, where we decided to add a flag in the spec, because we thought that the decision of using TSDB is a decision to make by the package developer and not by the user, as this will also affect how the data can be queried)

@ruflin
Copy link
Member

ruflin commented Aug 26, 2022

Is the use of synthetic source a decision to make by the package developer? If it is, we need a flag. If it isn't, then I don't think we should put anything in the spec.

Yes, a package dev can decide to turn it on / off per data stream. It might be possible to overwrite this in Fleet at least short term, long term this might not even be needed.

@jsoriano
Copy link
Member Author

Is the use of synthetic source a decision to make by the package developer? If it is, we need a flag. If it isn't, then I don't think we should put anything in the spec.

Yes, a package dev can decide to turn it on / off per data stream. It might be possible to overwrite this in Fleet at least short term, long term this might not even be needed.

But why would a package developer decide if synthetic source should be used or not? Is there any use case for this?

@ruflin
Copy link
Member

ruflin commented Aug 26, 2022

The simplified version is, metrics-* have synthetics source, logs-* use _source. But there will be exceptions to this and that is where the package dev needs to have the option to overwrite our default.

@jsoriano
Copy link
Member Author

The simplified version is, metrics-* have synthetics source, logs-* use _source. But there will be exceptions to this and that is where the package dev needs to have the option to overwrite our default.

Ok, for the exceptions it can make sense to have a toggle for this. Let's then add something similar to what we added for TSDB in #357.

cc @kpollich this will affect the plans for elastic/kibana#132818, as Fleet will need to read this new setting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss Issue needs discussion enhancement New feature or request Team:Ecosystem Label for the Packages Ecosystem team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants