[RFC] Data Source Categorization Fields #958

Closed
wants to merge 4 commits into from

Conversation

Contributor

@jamiehynds jamiehynds commented Aug 27, 2020

  • Have you signed the contributor license agreement? ✅
  • Have you followed the [contributor guidelines](https://github.com/elastic/ecs/blob/master/CONTRIBUTING.md)? ✅
  • For proposing substantial changes or additions to the schema, have you reviewed the RFC process? ✅
  • If submitting code/script changes, have you verified all tests pass locally using make test? N/A
  • If submitting schema/fields updates, have you generated new artifacts by running make and committed those changes? N/A
  • Is your pull request against master? Unless there is a good reason otherwise, we prefer pull requests against master and will backport as needed. ✅
  • Have you added an entry to the CHANGELOG.next.md? N/A

Preview of the RFC

@jamiehynds jamiehynds added the RFC label Aug 27, 2020
@ebeahan ebeahan changed the title [RFC] 0000 Data Source Categorization Fields [RFC] Data Source Categorization Fields Aug 27, 2020
ebeahan previously approved these changes Aug 27, 2020
Member

@ebeahan ebeahan left a comment

LGTM 👍

<!-- Leave this ID at 0000. The ECS team will assign a unique, contiguous RFC number upon merging the initial stage of this RFC. -->

- Stage: **0 (strawperson)** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->
- Date: **August 26 2020** <!-- The ECS team sets this date at merge time. This is the date of the latest stage advancement. -->
Member

Will update right before we merge to reflect current date.

- Web server

## Usage
Categorization fields in ECS can govern how we categorize these data sources, but only a limited set of event.category values are supported by the schema today. The event categorisation fields are catered to individual events, but don't categorise the data source. Expanding the values we support allows us to align the user experience from ECS, Ingest Manager and the Elastic Website (elastic.co/integrations). Some additional context here: #845 (comment).
Member

Suggested change
Categorization fields in ECS can govern how we categorize these data sources, but only a limited set of event.category values are supported by the schema today. The event categorisation fields are catered to individual events, but don't categorise the data source. Expanding the values we support allows us to align the user experience from ECS, Ingest Manager and the Elastic Website (elastic.co/integrations). Some additional context here: #845 (comment).
Categorization fields in ECS can govern how we categorize these data sources, but only a limited set of event.category values are supported by the schema today. The event categorisation fields are catered to individual events, but don't categorise the data source. Expanding the values we support allows us to align the user experience from ECS, Ingest Manager and the Elastic Website (elastic.co/integrations). Some additional context here: [#845 (comment)](https://github.com/elastic/ecs/pull/845#issuecomment-651414817).

Looks like the Markdown link got lost in the copy/paste.

- productivity
- proxy
- queue/message queue
- security
Contributor

How wide/all-encompassing are these fields intended to be? It looks like a mixture of pretty narrow as well as pretty wide categories. For example, would all firewall, audit, edr, ids/ips, threat intelligence, and vulnerability scanner categories also be marked security?

Similar thoughts with things like proxy, application, and cloud.

Contributor Author

Good point. We included some generic categories to allow for searching/correlation across these categories, e.g. show me events across all my security data sources, cloud sources, etc. It could also open up the possibility of subcategories, e.g. AWS being cloud, but within AWS, CloudTrail could fall under security.
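
As a rough illustration of the point above, here is a minimal Python sketch of a data source carrying both a broad and a narrow category, so that a query for "security" or "cloud" picks up every matching source. The `DataSource` type, the source names, and the `sources_in` helper are all hypothetical and are not part of ECS or of this RFC.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataSource:
    """Hypothetical record: one data source tagged with one or more categories."""
    name: str
    categories: Tuple[str, ...]


# Illustrative sources only; the names and category assignments are assumptions.
SOURCES = [
    DataSource("aws_cloudtrail", ("cloud", "security")),
    DataSource("generic_firewall", ("firewall", "security")),
    DataSource("generic_proxy", ("proxy",)),
]


def sources_in(category: str, sources: List[DataSource] = SOURCES) -> List[str]:
    """Return the names of all sources tagged with the given category."""
    return [s.name for s in sources if category in s.categories]
```

With this shape, `sources_in("security")` covers both the CloudTrail and the firewall source, while `sources_in("cloud")` covers only CloudTrail.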


The fieldset we use to describe the data source is up for discussion, data_stream.category is a possibility. Here are proposed allowed values:

- apm
Contributor

Small thing. I suggest we standardize on the capitalization and naming. For example we have an event.category of "iam" but a proposed data_stream.category of "Identity and access management". Also we have an example of "ids" for observer.type and a proposed data_stream.category of "IDS".

Contributor

+1

Stage 0: Provide a high level summary of the premise of these changes. Briefly describe the nature, purpose, and impact of the changes. ~2-5 sentences.
-->

Elastic currently supports ingestion of data from 180+ sources, and growing. However, we do not have a coherent way to categorise these sources. This has resulted in a disconnect in how we categorize these sources from the Elastic website, in-product experiences and ECS.
Contributor

I'm wondering if the allowed values for data_stream.category and observer.type should be the same?

Contributor

Good idea to bring this up.

I'm not sure I would go this direction. I think we should establish a list of allowed values, and make sure sources and pipelines populate based on this predictable list. Otherwise we could get all sorts of arbitrary differences in capitalizations and ways of writing things.
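
A minimal Python sketch of what "populate based on this predictable list" could look like in an ingest pipeline: values are normalized to lowercase and rejected if they are not in the list. The allowed set below is only an illustrative subset of the categories mentioned in this thread, and `normalize_category` is a hypothetical helper, not an ECS or Elastic API.

```python
# Illustrative subset of the proposed values; the final list is still under discussion.
ALLOWED_CATEGORIES = frozenset({"apm", "cloud", "firewall", "proxy", "security"})


def normalize_category(raw: str) -> str:
    """Return the canonical lowercase category, or raise ValueError if it is not allowed."""
    candidate = raw.strip().lower()
    if candidate not in ALLOWED_CATEGORIES:
        raise ValueError(f"unsupported data source category: {raw!r}")
    return candidate
```

Normalizing before the lookup means "Security" and "security" collapse to one value, while anything outside the list fails loudly instead of being indexed as a new variant.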

@jamiehynds
Contributor Author

@mostlyjason would you mind reviewing the list of proposed categories and suggesting any additional categories on the o11y side? Thanks!

@jamiehynds
Contributor Author

@cosiomoises @paulewing Would you mind taking a look at the proposed security data source categories? These categories may eventually be used to suggest relevant detection rules based on enabled integrations. Would be great to get your thoughts.

@mostlyjason

@jamiehynds how will these categories be used? Also, how do they relate to our existing categories at https://www.elastic.co/integrations? I see several new ones added and several missing. Is this intended to be a replacement?

@jamiehynds
Contributor Author

@mostlyjason The intent is to provide alignment across the entire user experience from the Elastic web site (integration page), to Elastic in-product experiences (e.g. ingest manager), to index patterns, to ECS. ECS can govern that alignment via these proposed fields.

It also opens up the possibility of aligning detection rules to enabled data sources, e.g. if a user has added a firewall data source, we can suggest appropriate detection rules that relate to firewalls. Maybe there's a similar use case for alerts on the o11y side?

These categories are intended to replace the existing integration categories. We haven't included existing categories such as AWS, Azure and Kubernetes as ECS doesn't use vendor names in the schema.
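
The detection-rule idea in this comment could be sketched roughly as a lookup from enabled data source categories to candidate rules. This is a minimal Python sketch under stated assumptions: the rule names and the `suggest_rules` helper are hypothetical placeholders, not actual Elastic detection rules or APIs.

```python
from typing import Dict, Iterable, List

# Hypothetical mapping from data source category to placeholder rule names.
RULES_BY_CATEGORY: Dict[str, List[str]] = {
    "firewall": ["example rule: unusual outbound port"],
    "proxy": ["example rule: request to rare domain"],
}


def suggest_rules(enabled_categories: Iterable[str]) -> List[str]:
    """Collect candidate rules for every enabled category, in sorted category order."""
    suggestions: List[str] = []
    for category in sorted(set(enabled_categories)):
        # Categories without any mapped rules contribute nothing.
        suggestions.extend(RULES_BY_CATEGORY.get(category, []))
    return suggestions
```

A lookup like this only works if every integration reports its category from the same controlled list, which is the alignment the comment argues for.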

@@ -0,0 +1,75 @@
# 0000: Data Source Categorization Fields
Contributor

In Ingest Management we went through many iterations on naming, and "data source" also has some history to it. I'm wondering what exactly we categorise here. Is it the data itself, which is in data_streams? Do we categorize the data_streams? Or do we categorize the source the data is coming from?

Contributor Author

Hi @ruflin - the intent here is to categorize the source the data is coming from.

@ruflin ruflin mentioned this pull request Sep 30, 2020
@ebeahan
Member

ebeahan commented Oct 19, 2020

Thanks everyone for the great feedback and discussion.

With this being a stage 0 candidate, the only criterion required for advancement is agreement that the premise has utility and could be an appropriate addition to ECS. Unless there are objections, I propose we capture the shared feedback and concerns in the proposal doc and begin refining and addressing concerns in the subsequent stages.

I've captured this summary of feedback and concerns:

@jamiehynds - is there anyone else's feedback we may need to capture at this stage?

@jamiehynds
Contributor Author

Thanks @ebeahan. Before we proceed, I'd like to get some insight from @ruflin on previous discussions around categorising of data sources and whether we should include data_stream categorisations here too.

@ruflin
Contributor

ruflin commented Oct 21, 2020

Two questions from my side:

Contributor

@webmat webmat left a comment

Thanks everyone for the feedback so far.

Thanks as well @ebeahan, I agree we should capture these concerns in the RFC document itself. As we reach a conclusion on some, we can document conclusions in the RFC. The RFC document should stand on its own -- including the concerns & resolutions -- without needing to refer to the PRs themselves too much.

I think the criteria for stage 0 were met a long time ago (this is appropriate in ECS).

With all of the questions in the air at the moment, I suggest we retarget this PR to stage 1. This way we can get closure in this PR, rather than carrying over the discussion to the next PR.

# 0000: Data Source Categorization Fields
<!-- Leave this ID at 0000. The ECS team will assign a unique, contiguous RFC number upon merging the initial stage of this RFC. -->

- Stage: **0 (strawperson)** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->
Contributor

I suggest we retarget to stage 1, since there's been so much discussion already.

Member

++


Elastic currently supports ingestion of data from 180+ sources, and growing. However, we do not have a coherent way to categorise these sources. This has resulted in a disconnect in how we categorize these sources from the Elastic website, in-product experiences and ECS.

The fieldset we use to describe the data source is up for discussion, data_stream.category is a possibility. Here are proposed allowed values:
Contributor

I might have been the one suggesting data_stream.category as a possibility, a while ago.

But as the data_stream RFC is progressing, I no longer think this is the right approach.

I think the data_stream fields should be only dedicated to the indexing strategy itself, such as "how the index name is created".

I agree that a way of categorizing data sources is needed, but I think we should have this be another field, one that would also make sense in the 7.x monolithic indices. Having an out of place data_stream.category field there would not be appropriate.


## References

* https://github.com/elastic/ecs/issues/901
* https://github.com/elastic/ecs/pull/845
Contributor

Could you add the link provided by @ruflin to the references, please?

Thanks for providing it, Nic 👍

However, let's make sure the link stands the test of time, and link via the latest tag rather than master:

Suggested change
* https://github.com/elastic/ecs/pull/845
* https://github.com/elastic/ecs/pull/845
* https://github.com/elastic/package-registry/blob/v0.12.1/util/package.go#L27

@ebeahan
Member

ebeahan commented Mar 29, 2021

Discussed with @jamiehynds out-of-band, and we're not moving forward with this effort at this time.

@ebeahan ebeahan closed this Mar 29, 2021
8 participants