Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EPM] Add mapping field types to index template generation #59894

Closed
wants to merge 11 commits into from

Conversation

skh
Copy link
Contributor

@skh skh commented Mar 11, 2020

❗️DO NOT MERGE❗️ until #59376 is merged. Change this PR's base to master once it is.

Summary

Mostly implements #55865

  • mostly follows the logic in https://github.com/elastic/beats/blob/master/libbeat/template/processor.go how to handle various data types
  • does not, however, care about backward compatibility with anything earlier than 7.6, which simplifies the logic quite a bit
  • flattens all fields (instead of expanding all fields, as beats does) after discussion with @ruflin
  • expands (un-flattens) and deduplicates all fields
  • silently removes alias fields with a path pointing to a non-existing field
  • still missing:
    • handling of object data type
    • integration tests
    • all other settings in the generated index templates need to be reviewed
  • there is some code duplication now between fields/field.ts and kibana/index_pattern/install.ts which I kept because it still might diverge

How to test this?

  • On a fresh elasticsearch setup (i.e. with system & base package not yet installed) open the IngestManager UI. The initial package installation, including index template generation, should succeed without error.
  • Installation of all other available packages should likewise succeed without error.
  • Inspect the mappings in the generated index templates in Kibana's dev console (GET _cat/templates, then GET /_template/$TEMPLATE_NAME). The relevant templates are logs-$PACKAGE.$DATASET and metrics-$PACKAGE.$DATASET, so e.g. metrics-system.process
  • Inspect server/services/epm/elasticsearch/template/__snapshots__/template.test.ts.snap, especially the snapshot for the system.yml test
  • server/services/epm/fields/__snapshots__/field.test.ts.snap might also be interesting, as it shows the intermediate data structure after processFields(), but before the transformation into the mappings structure.

@skh skh changed the title 55865 index template types [EPM] Add mapping field types to index template generation Mar 11, 2020
@skh skh added the Feature:EPM Fleet team's Elastic Package Manager (aka Integrations) project label Mar 11, 2020
@elasticmachine
Copy link
Contributor

Pinging @elastic/ingest-management (Feature:EPM)

@skh skh marked this pull request as ready for review March 12, 2020 16:35
@skh skh requested a review from neptunian March 12, 2020 16:54
@kibanamachine
Copy link
Contributor

💚 Build Succeeded

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@neptunian
Copy link
Contributor

neptunian commented Mar 13, 2020

@skh Personally, I'm not sure I see how useful it is to have a large snapshot with no unit tests. If I'm using the snapshot to understand the mappings logic, it would be difficult. It also seems like a snapshot that big is probably very redundant in the use cases so we just have long tests and bigger files unnecessarily, unless they are all unique use cases? I think we should at least create unit tests similar to these so we know what we are testing for https://github.com/elastic/beats/blob/d9a4c9c240a9820fab15002592e5bb6db318543b/libbeat/template/processor_test.go . I did something similar with the index patterns https://github.com/elastic/kibana/blob/master/x-pack/plugins/ingest_manager/server/services/epm/kibana/index_pattern/install.test.ts. Also, a snapshot test or a unit test for expandFields, dedupFields, validateAliasFields and getField seems like it would be pretty straightforward.

@skh
Copy link
Contributor Author

skh commented Mar 13, 2020

@neptunian Good point, I can add more specific tests.

The reason I added such a large test file and snapshot was that this contains real data from the system package, which was very useful during development. It will, however, grow stale over time and lose usefulness so I'm fine with replacing it with something else.

@skh skh changed the base branch from feature-ingest to master March 13, 2020 14:22
@skh skh changed the base branch from master to feature-ingest March 13, 2020 14:24
@skh
Copy link
Contributor Author

skh commented Mar 13, 2020

Changing the base to master looked messy so I changed it back. Please continue reviewing, I will NOT merge this into feature-ingest, but I might need to create a new branch from master and cherry-pick my changes.

};

export function processFields(fields: Fields): Fields {
expandFields(fields);
Copy link
Contributor

@neptunian neptunian Mar 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make expandFields more functional by not mutating the fields array like this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a left-over of the old processFields implementation. Would it be acceptable to change it in a follow-up PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* These can result from expandFields when the input contains dotted field
* names that share parts of their hierarchy.
*/
function dedupFields(fields: Fields): Fields {
Copy link
Contributor

@neptunian neptunian Mar 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there are two fields that have the same name but are not a group type, it throws an error. Is that desirable? I thought this was possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two fields could be acceptable if the are exactly the same. I can amend that.

If they differ in any way (different type, or for some types, different in other properties) that would result in two different mappings for the same field present. Even if ES accepts that, I don't think it is desirable.

Copy link
Contributor

@neptunian neptunian Mar 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. If two fields are have the same name and are not group types with fields, I don't think it should throw an error, but continue. If they have the same name, but different types, we should perhaps continue as well? I'm not sure if we should throw an error and stop the install. @ruflin what do you think?

Currently there is an error when trying to install nginx which has a field with the same name but different types with one or both types not being a group type :

UnhandledPromiseRejectionWarning: Error: Can't merge fields {"name":"answers","level":"extended","type":"object","object_type":"keyword","description":"An array containing an object for each answer section returned by the server.\nThe main keys that should be present in these objects are defined by ECS. Records that have more information may contain more keys than what ECS defines.\nNot all DNS data sources give all details about DNS answers. At minimum, answer objects must contain the data key. If more information is available, map as much of it to ECS as possible, and add any additional fields to the answer objects as custom fields."} and {"name":"answers","type":"group","fields":[{"name":"class","level":"extended","type":"keyword","ignore_above":1024,"description":"The class of DNS data contained in this resource record.","example":"IN"}]}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends. If we create the Kibana index pattern, I think we should ignore some errors as the data might come from different packages and there might be some conflicts (there shouldn't be). If we generate the template for a package, all should fit perfectly. In the best case we detect it already when building the package that something is off.

What should be possible is that the same field shows up twice and one has more details. For example foo is keyword and later also as ignore_above: 1024. But in general we should try to keep things as simple as possible as this kind of inheritance is a problem for index templates today.

My suggestion: Lets ignore errors for now as part of this PR to keep it moving forward, we need all the other changes in this PR. And then we follow up on a discussion on how we handle conflicts exactly.

function dedupFields(fields: Fields): Fields {
const dedupedFields: Fields = [];
fields.forEach(field => {
const found = dedupedFields.find(f => {
Copy link
Contributor

@neptunian neptunian Mar 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more efficient way to dedup instead of having the fields array search itself for each field. Could you create an object that checks if a key exists instead or something like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an issue with the size of the input this function gets? It is at most the content of any one fields.yml file from a dataset in a package.

If it is an issue, can we change it in a follow-up PR? I can open an issue so we don't forget.

Copy link
Contributor

@neptunian neptunian Mar 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not the input length (fields), its that its searching another list (dedupedFields) for each field to check if its duplicated. Since this is only a single package it's not a big deal, but having another PR would be ideal where we use an object and set each field name as a property of the object and then for each field check if it's been set (instead of looping through the dupes array).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but dedupedFields only can grow to the length of fields, so if I'm not mistaken this is still O(n^2), which is not terrible with the data we have in the packages.

I'll happily open another issue to fix this soon, but I don't think it should be a blocker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neptunian
Copy link
Contributor

@neptunian Good point, I can add more specific tests.

The reason I added such a large test file and snapshot was that this contains real data from the system package, which was very useful during development. It will, however, grow stale over time and lose usefulness so I'm fine with replacing it with something else.

I'm okay with keeping it, so long as we have other unit tests. It's just that it doesn't necessarily prove that everything is correct as every possible use case is probably not met in the system package? I guess I see it as more of a backup test unless the snapshot illustrated distinct readable cases.

@skh
Copy link
Contributor Author

skh commented Mar 16, 2020

Obsoleted by #60266

@jen-huang jen-huang added the Team:Fleet Team label for Observability Data Collection Fleet team label Mar 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:EPM Fleet team's Elastic Package Manager (aka Integrations) project Team:Fleet Team label for Observability Data Collection Fleet team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants