Merge custom and core multi_fields array #982

Merged
merged 8 commits into from
Jan 6, 2021

Conversation

Contributor

@jonathan-buttner jonathan-buttner commented Sep 29, 2020

We'd like to introduce custom multi_fields definitions in the endpoint package's custom schema. An example of this is here:
https://github.com/elastic/endpoint-package/pull/79/files#diff-7f0ee89a2e91f4b29aa03f75b80a16acR22-R26

---
- name: file
  title: File
  group: 2
  short: Fields describing files.
  description: >
    Custom file
  fields:
    - name: path
      multi_fields:
        - name: caseless
          type: keyword
          normalizer: lowercase

Currently, the ECS scripts do not merge the multi_fields array; instead they use the custom schema's definition after merging the included files. Because the custom schema's definition overwrites the core schema's definition, the custom schema must repeat any core multi_fields entries in its own definition, otherwise they are inadvertently removed. The above example results in the path.text field being removed: https://github.com/elastic/ecs/blob/master/schemas/file.yml#L62-L64

This PR adds functionality to merge the custom multi_fields array with the core one. The approach I took was to convert each list into a map so we can deduplicate. Each list entry is itself a map, and its name field becomes the key. If the included custom schema defines a multi_field entry with the same name as a core entry, the custom entry overrides the core one.
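A minimal sketch of the name-keyed merge described above, not the code from this PR. The function name `merge_multi_fields` and the sample entries are assumptions made for illustration; the sketch relies on Python dicts preserving insertion order (3.7+).

```python
def merge_multi_fields(core, custom):
    """Merge two multi_fields lists, keyed on each entry's `name`.

    Hypothetical helper: entries from `custom` replace entries from
    `core` that share the same name; all other entries are kept.
    """
    merged = {}
    for entry in core:
        merged[entry["name"]] = entry
    for entry in custom:
        # A custom entry with an existing name overwrites the core entry.
        merged[entry["name"]] = entry
    return list(merged.values())


# Illustrative data only, modelled on the file.path example above.
core = [{"name": "text", "type": "text", "norms": False}]
custom = [{"name": "caseless", "type": "keyword", "normalizer": "lowercase"}]
result = merge_multi_fields(core, custom)
```

With this shape, the core `text` entry survives even though the custom schema only mentions `caseless`.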

@jonathan-buttner jonathan-buttner marked this pull request as ready for review September 29, 2020 13:49
Member

@ebeahan ebeahan left a comment

Thanks @jonathan-buttner! Sorry for taking a bit for an initial review.

The use-case makes good sense, and I think this will be a good addition to the tooling. After testing out the changes, I did have a couple of notes.

scripts/schema/loader.py (outdated)
def dedup_and_merge_lists(list_a, list_b):
    list_a_set = array_of_dicts_to_set(list_a)
    list_b_set = array_of_dicts_to_set(list_b)
    return set_of_sets_to_array(list_a_set | list_b_set)
Member

Minor issue I stumbled across while testing this out. Not sure it would be a blocker to merging, but worth noting the behavior.

The union will remove exact duplicate items:

> list_a_set
{frozenset({('name', 'text'), ('type', 'text')})}

> list_b_set
{frozenset({('name', 'text'), ('type', 'text')}), frozenset({('type', 'keyword'), ('normalizer', 'lowercase'), ('name', 'caseless')})}

> list_a_set | list_b_set
{frozenset({('name', 'text'), ('type', 'text')}), frozenset({('type', 'keyword'), ('normalizer', 'lowercase'), ('name', 'caseless')})}

But if the sets are not exact duplicates, it could lead to duplicate field names:

> list_a_set
{frozenset({('type', 'text'), ('name', 'text')})}

> list_b_set
{frozenset({('normalizer', 'lowercase'), ('type', 'keyword'), ('name', 'caseless')}), frozenset({('type', 'keyword'), ('name', 'text')})}

> list_a_set | list_b_set
{frozenset({('normalizer', 'lowercase'), ('type', 'keyword'), ('name', 'caseless')}), frozenset({('type', 'text'), ('name', 'text')}), frozenset({('type', 'keyword'), ('name', 'text')})}

schema include file:

---
  - name: file
    title: File
    group: 2
    short: Fields describing files.
    description: >
      Custom file
    fields:
      - name: path
        multi_fields:
          - name: caseless
            type: keyword
            normalizer: lowercase
          - name: text  
            type: keyword <= I imagine this would only happen by accident 😃

Resulting intermediate state:

        multi_fields:
        - flat_name: file.path.caseless
          ignore_above: 1024
          name: caseless
          normalizer: lowercase
          type: keyword
        - flat_name: file.path.text
          ignore_above: 1024
          name: text
          type: keyword
        - flat_name: file.path.text
          name: text
          norms: false
          type: text
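The behavior can be reproduced in isolation with a rough sketch like the one below. The bodies of `array_of_dicts_to_set` and `set_of_sets_to_array` are assumptions written for this illustration (only their names and the union in `dedup_and_merge_lists` appear in the diff), and the output is sorted only to make it deterministic.

```python
def array_of_dicts_to_set(entries):
    # Assumed implementation: each dict becomes a frozenset of its items.
    return {frozenset(entry.items()) for entry in entries}


def set_of_sets_to_array(sets):
    # Assumed implementation: turn each frozenset back into a dict.
    # Sorting is added here only so the result order is stable.
    return sorted((dict(s) for s in sets), key=lambda d: (d["name"], d["type"]))


def dedup_and_merge_lists(list_a, list_b):
    list_a_set = array_of_dicts_to_set(list_a)
    list_b_set = array_of_dicts_to_set(list_b)
    return set_of_sets_to_array(list_a_set | list_b_set)


core = [{"name": "text", "type": "text"}]
custom = [
    {"name": "caseless", "type": "keyword", "normalizer": "lowercase"},
    {"name": "text", "type": "keyword"},  # same name, different type
]
merged = dedup_and_merge_lists(core, custom)
names = [entry["name"] for entry in merged]
```

Because the union compares whole entries, the two `text` definitions are treated as distinct and both end up in `merged`.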

Contributor Author

Oh, good catch. What do we think the expected behavior should be in this scenario? I could add a check that no two fields in the resulting set share a name, and throw an error if they do. Or maybe just have core override?
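As a rough illustration of the first option (rejecting duplicate names), a check could look something like this. `assert_unique_names` is a hypothetical helper, not something that exists in the ECS tooling.

```python
from collections import Counter


def assert_unique_names(multi_fields):
    """Raise ValueError if two multi_fields entries share a name.

    Hypothetical helper for illustration only.
    """
    counts = Counter(entry["name"] for entry in multi_fields)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError("Duplicate multi_fields names: " + ", ".join(duplicates))
```

It would be called on the merged list, so an accidental second `text` entry fails loudly instead of producing two `file.path.text` definitions.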

Contributor

IMO we should dedupe on name and take the most recent definition in the case of dupes (this would allow for overrides).

Member

@webmat do you have any thoughts? I recall back in #864, logic was removed from the tooling to allow --include supplied custom fields to be more permissive:

This means the tooling must now accept included files as they are, with all of the power this entails.

Perhaps we simply make sure to note that users need to be aware of introducing such duplicate fields?

Contributor

@webmat webmat Nov 2, 2020

I agree with @madirey. We should keep it simple and only ensure we have unique multi-field names.

The --include option is meant to override, so the ideal behaviour is for a custom multi-field definition to replace or be merged with an entry of the same name. I'm on the fence on whether to merge/replace an entry of the same name, though. Happy to be convinced either way.

But to take a concrete example, let's say someone has tuned a normalizer that works well for user agent strings. I want them to be able to replace the default user_agent.original.text multi-field with such a custom definition:

        multi_fields:
        - name: text
          norms: false
          type: text
          normalizer: ua_normalizer 

I think I have a preference for merging the pre-existing multi-field definitions of the same name, as this is more in line with how everything else is handled with custom fields. It also has the bonus of allowing a terser custom definition:

        - name: text
          normalizer: ua_normalizer 
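To make the merge-versus-replace distinction concrete, here is a small sketch of the merging variant. It is only an illustration under assumptions: `merge_multi_fields_by_name` is a hypothetical name, and the sample data mirrors the terse definition above.

```python
def merge_multi_fields_by_name(core, custom):
    """Merge same-named multi_fields entries attribute by attribute.

    Hypothetical helper: custom attributes are layered on top of the
    core entry with the same name; unmatched entries are kept as-is.
    """
    merged = {entry["name"]: dict(entry) for entry in core}
    for entry in custom:
        # Start from the core entry (if any) and overlay custom attributes.
        merged.setdefault(entry["name"], {}).update(entry)
    return list(merged.values())


core = [{"name": "text", "type": "text", "norms": False}]
custom = [{"name": "text", "normalizer": "ua_normalizer"}]
merged = merge_multi_fields_by_name(core, custom)
```

Here the terse custom entry keeps `type` and `norms` from the core definition and only adds the normalizer; a replace-only strategy would drop them.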

Contributor

madirey commented Oct 12, 2020

Thanks for doing this!

Contributor

@webmat webmat left a comment

Thanks for submitting this, that's a good addition!

Side note: you're using this to add a .caseless multi-field, but with the case_sensitive query param coming in 7.10, are you sure you need this multi-field?

In any case, this is a good addition that will make adjustments to multi-fields much smoother.

@ebeahan added the "ready" label (Issues we'd like to address in the future.) Nov 3, 2020
@ebeahan removed the "ready" label Nov 17, 2020
Member

ebeahan commented Dec 1, 2020

@jonathan-buttner Is this still a need, or are you pursuing using the new case_sensitive option instead?

Contributor Author

@jonathan-buttner Is this still a need, or are you pursuing using the new case_sensitive option instead?

Sorry, completely dropped the ball on this one. I've been trying to get some features done for 7.11. I think it'd be nice to still have this. I probably won't get to it until after feature freeze for 7.11 though, if that's ok. I don't think it's super important, but it would be nice to have.

Contributor Author

@ebeahan I think this PR is in a better spot now haha. I updated the description as well, but I took the approach of deduping based on the name field in the multi_field list of dictionaries.

Contributor

@webmat webmat left a comment

Thanks for adjusting and overriding based on the multi-field name 👍

I have comments on how the tests are put together.

scripts/tests/unit/test_schema_loader.py (three outdated, resolved threads)
Contributor

@webmat webmat left a comment

LGTM

Member

@ebeahan ebeahan left a comment

Other than the one nit for the changelog, looks good! 👍

CHANGELOG.next.md (resolved thread)
webmat pushed a commit to webmat/ecs that referenced this pull request Jan 6, 2021
@jonathan-buttner jonathan-buttner deleted the merge-multi-fields branch January 6, 2021 17:16
Contributor Author

Thanks Mat and Eric!

webmat pushed a commit that referenced this pull request Jan 6, 2021
webmat pushed a commit that referenced this pull request Jan 6, 2021
@ebeahan ebeahan removed the review label Jan 6, 2021
dseeley added a commit to dseeley/ecs that referenced this pull request May 5, 2021
* bumping version for 1.x release branch (elastic#921)

* [1.x] add related.hosts (elastic#913) (elastic#924)

* [1.x][DOCS] Fixes SIEM links (elastic#936)

* [1.x] Consolidate field-details doc template (elastic#897) (elastic#946)

* Add http.[request|response].mime_type (elastic#944) (elastic#949)

* [1.x] Cut 1.6 Changelog (elastic#933) (elastic#952) (elastic#953)

Co-authored-by: Mathieu Martin <[email protected]>

* [1.x] Add threat.technique.subtechnique (elastic#951) (elastic#956)

Co-authored-by: Ross Wolf <[email protected]>

* [1.x] Nest as for foreign reuse (elastic#960) (elastic#962)

* [1.x] Remove `expected_event_types` from protocol (elastic#964) (elastic#965)

* [1.x] Expand definitions of source and destination field sets (elastic#967) (elastic#973)

* [1.x] Introduce `--strict` flag (elastic#937) (elastic#975)

* [1.x] Add example value composite type checking (elastic#966) (elastic#976)

* Add example value composite type checking (elastic#966)
* generate csv artifact

* [1.x] Add event category configuration (elastic#963) (elastic#977)

* [1.x] Add normalizer multi-field capability (elastic#971) (elastic#978)

Co-authored-by: Eric Beahan <[email protected]>

Co-authored-by: Madison Caldwell <[email protected]>

* [1.x] Add mapping network event guidance doc (elastic#969) (elastic#983)

* [1.x] Removing unneeded link under `Additional Information` (elastic#984) (elastic#985)

* [1.x] Add discrete attribute to field details page headers (elastic#989) (elastic#990)

* [1.x] Uniformity across domain name breakdown fields (elastic#981) (elastic#994)

Co-authored-by: Mathieu Martin <[email protected]>

* Add --oss flag to the ECS generator script (elastic#991) (elastic#995)

* Add network directions ingress and egress (elastic#945) (elastic#997)

* Mention ECS Mapper in the main documentation (elastic#987) (elastic#1000)

Co-authored-by: Dan Roscigno <[email protected]>

* [1.x] Introduce experimental artifacts (elastic#993) (elastic#1001)

Co-authored-by: Mathieu Martin <[email protected]>

* Bump version to 1.8.0-dev in branch 1.x (elastic#1011)

* Cut 1.7 changelog (elastic#1010) (elastic#1012)

* [1.x] Clarify that file extension should exclude the dot. (elastic#1016) (elastic#1020)

* [1.x] Add usage docs section (elastic#988) (elastic#1024)

Co-authored-by: Mathieu Martin <[email protected]>

* [1.x] feat: include alias path when generating template (elastic#877) (elastic#1035)

Co-authored-by: Richard Gomez <[email protected]>

* [1.x] Add support for `scaling_factor` in the generator (elastic#1042) (elastic#1055)

Co-authored-by: Mathieu Martin <[email protected]>

* [1.x] Add fallback for constant_keyword (elastic#1046) (elastic#1056)

Co-authored-by: Mathieu Martin <[email protected]>

* [1.x] Add wildcard type support to go code generator (elastic#1050) (elastic#1057)

* add wildcard type support

* also add version and constant_keyword

* changelog

* [1.x] New default make task that generates main and experimental artifacts. (elastic#1041) (elastic#1060)

Also changing the order of the 'generate' task: it now starts with the new generator, then runs the legacy scripts.

* [1.x] Change the index pattern in the sample template. (elastic#1048) (elastic#1068)

* [1.x] Prepare link to Logs docs changing with the 7.10 release in "getting-started" (elastic#1073) (elastic#1079)

Co-authored-by: EamonnTP <[email protected]>

* [1.x] Prepare link to Logs docs changing with the 7.10 release in "products-solutions" page (elastic#1074) (elastic#1083)

Co-authored-by: EamonnTP <[email protected]>

* [1.x] Add event.category session. (elastic#1049) (elastic#1093)

Co-authored-by: Mathieu Martin <[email protected]>

* [1.x] Add event.category registry (elastic#1040) (elastic#1094)

Co-authored-by: Mathieu Martin <[email protected]>

* [1.x] Add --ref support for experimental artifacts (elastic#1063) (elastic#1101)

Co-authored-by: Mathieu Martin <[email protected]>

* [1.x] Remove experimental event.original definition (elastic#1053) (elastic#1104)

* [1.x] Add missing `process.thread.name` to experimental definitions (elastic#1103) (elastic#1106)

* [1.x] Remove index parameter for wildcard fields (elastic#1115) (elastic#1119)

* [1.x] Add dns.answer object into experimental schema (elastic#1118) (elastic#1121)

* [1.x] Clarify x509 definition guidance for network events with only one cert (elastic#1114) (elastic#1123)

* [1.x] Indicate when artifacts include experimental changes (elastic#1117) (elastic#1125)

* [1.x] Add os.type field, with list of allowed values (elastic#1111) (elastic#1130)

* [1.x] Add support for constant_keyword's 'value' parameter (elastic#1112) (elastic#1132)

* [1.x] Beta label support (elastic#1051) (elastic#1133)

Co-authored-by: Mathieu Martin <[email protected]>

* [1.x] Backport elastic#1134 and elastic#1135 (elastic#1136)

* Remove temporary ifeval in "getting started" page, add link to Metrics docs (elastic#1134)
* Remove temporary ifeval from products page, add link to Metrics (elastic#1135)

* Two small documentation backports (elastic#1149)

* Remove an incorrect `event.type` from the 'converting' page (elastic#1146)
* Mention Logstash support for ECS in the 'products' page (elastic#1147)

* [1.x] Reinforce the exclusion of the leading dot from url.extension (elastic#1151) (elastic#1152)

* [1.x] Make all fields linkable directly via an HTML ID (elastic#1148) (elastic#1154)

* [1.x] Tracing fields should be at the root (elastic#1165)

* Add notice to the tracing field set, about not nesting field names. (elastic#1162)
* Tracing fields should be at top level in Beats artifact (elastic#1164)

* [1.x] Usage of brackets for a URL containing IPv6 address (elastic#1131) (elastic#1168)

* [1.x] 6.x index template data type fallback (elastic#1171) (elastic#1172)

* [1.x] Apply RFC 0007 stage 3 changes - multi-user (elastic#1066) (elastic#1175)

Conflict: deleted file rfcs/text/0007-multiple-users.md as RFCs are not backported to version branches.

* [1.x] Handle `error.stack_trace` case for ES 6.x template (elastic#1176) (elastic#1177)

* [1.x] Add composable index templates artifacts (elastic#1156) (elastic#1179)

* [1.x] Move _meta section back inside mappings, in legacy templates. (elastic#1186) (elastic#1187)

Backports the following commits to 1.x:

* Move _meta section back inside mappings, in legacy templates. (elastic#1186) 

This fixes an issue introduced by elastic#1156, discovered in elastic#1180. Composable templates support `_meta` at the template's root, but legacy templates don't. So we're just putting it back inside the mappings for legacy templates.

This also fixes missing updates to the component template, after the introduction of wildcard in elastic#1098.

* [1.x] Apply the RFC 0005 stage 2 (host metrics) changes in the experimental artifacts (elastic#1159) (elastic#1184)

Co-authored-by: Mathieu Martin <[email protected]>

* [1.x] Stage 3 changes for wildcard RFC 0001 (elastic#1098) (elastic#1183)

* [1.x] Conditional handling in es_template.template_settings (elastic#1191) (elastic#1192)

* [1.x] Artifacts docs page (elastic#1189) (elastic#1195)

* [1.x] Remove beta warning label from categorization fields docs (elastic#1067) (elastic#1196)

* [1.x] Correct wording of `event.reference` description (elastic#1181) (elastic#1197)

* Bump version to 1.9.0-dev in branch 1.x (elastic#1198)

* [1.x] Cut 1.8 FF changelog.next.md elastic#1199 (elastic#1201)

* Merge custom and core multi_fields arrays (elastic#982) (elastic#1213)

Co-authored-by: Jonathan Buttner <[email protected]>

* [1.x] Stage 2 changes for RFC 0009 - data_stream fields (elastic#1215) (elastic#1222)

* [1.x] add http.request.id (elastic#1208) (elastic#1223)

Co-authored-by: Eric Beahan <[email protected]>
Co-authored-by: Gil Raphaelli <[email protected]>

* [1.x] add cloud.service.name (elastic#1204) (elastic#1224)

* add cloud.platform

* expand cloud.platform description

* move to cloud.service.name

Co-authored-by: Gil Raphaelli <[email protected]>

* [1.x] Add ssdeep hash (elastic#1169) (elastic#1227)

Co-authored-by: Andrew Stucki <[email protected]>

* [CI] Switch to GitHub actions (elastic#1236) (elastic#1245)

Co-authored-by: Eric Beahan <[email protected]>

Co-authored-by: Andrew Stucki <[email protected]>

* Revert wildcard adoption back to experimental stage (elastic#1235) (elastic#1243)

* Add scaled_float type to go generator (elastic#1250) (elastic#1251)

* add scaled_float

* changelog

* Add categorization fields usage docs (elastic#1242) (elastic#1257)

* add time_zone, postal_code, and continent_code (elastic#1229) (elastic#1258)

* Specify MAC address format (elastic#456) (elastic#1260)

Co-authored-by: Robin Schneider <[email protected]>

* finalize 1.8.0 changelog (elastic#1262) (elastic#1265)

* Add additional host fields (elastic#1248) (elastic#1267)

Co-authored-by: kaiyan-sheng <[email protected]>

* Stage 1 changes for RFC 0014 - extend pe fields (elastic#1256) (elastic#1270)

* Add 2 fields to code_signature (elastic#1269) (elastic#1272)

Co-authored-by: Yamin Tian <[email protected]>

* Stage 3 changes for RFC 0007 - remove beta attribute (elastic#1271) (elastic#1273)

* Stage 1 experimental changes for RFC 0008 - threat.indicator fields (elastic#1268) (elastic#1274)

* Stage 1 changes for RFC 0015 - add elf fieldset (elastic#1261) (elastic#1275)

* Cut 1.9 FF CHANGELOG.next.md (elastic#1277)

* lock go version in actions (elastic#1283) (elastic#1290)

* Bump jinja2 from 2.11.2 to 2.11.3 in /scripts (elastic#1310) (elastic#1320)

* Bump jinja2 from 2.11.2 to 2.11.3 in /scripts

* Bump pyyaml from 5.3b1 to 5.4 in /scripts (elastic#1318) (elastic#1325)

Co-authored-by: Eric Beahan <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Adjust terminology - change whitelist to allowlist (elastic#1315) (elastic#1331)

Co-authored-by: Dominic Page <[email protected]>

* Remove -dev label from 1.9 version (elastic#1329)

* remove -dev label from 1.9 version

* generate artifacts

* removing rules artifacts

* Cut 1.9 changelog (elastic#1328)

* move 1.9 changes to changelog

* add 1.9 release changes

4 participants