Implement basic field assertion (types, presence) #143

Merged
merged 42 commits into from
Oct 27, 2020

Contributor

@mtojek mtojek commented Oct 19, 2020

This PR introduces basic field type assertions for pipeline and system tests. When a pipeline or system test is executed, the generated results are validated against the fields schema (the fields YAML files). All inconsistencies are reported.

Issue: #90

Action items (TODOs):

  • Add AWS sample data stream with fields
  • Load JSON unit test (with ES document) from JSON file
  • Validate JSON results in pipeline tests
  • Validate JSON results in system tests
  • Validate document body
  • Validate field type
  • Fix all edge cases in pipeline tests
  • Fix all edge cases in system tests

@mtojek mtojek self-assigned this Oct 19, 2020
@ycombinator
Contributor

ycombinator commented Oct 19, 2020

I assume you will choose a different test package than aws when it comes to validating JSON results in system tests?

@elasticmachine
Collaborator

elasticmachine commented Oct 19, 2020

💚 Build Succeeded

Build stats

  • Build Cause: [Pull request #143 updated]

  • Start Time: 2020-10-27T14:31:04.686+0000

  • Duration: 14 min 26 sec

Test stats 🧪

Test Results
Failed 0
Passed 9
Skipped 0
Total 9

@mtojek mtojek changed the title from "Implement field assertion" to "Implement basic field assertion (types, presence)" Oct 19, 2020
@mtojek
Contributor Author

mtojek commented Oct 19, 2020

I assume you will choose a different test package than aws when it comes to validating JSON results in system tests?

Yes, I will try to use the only one present, apache, but it would be nice to spend some time populating the repo (I've already created an issue for that in integrations).

I've also spotted that we need to validate sample events. See: https://github.com/elastic/integrations/blob/master/packages/aws/data_stream/sns/sample_event.json . The file is a raw dump from Elasticsearch, but we're interested in the _source content only.
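
A rough sketch of pulling the `_source` object out of such a raw dump before validating it. The helper name `extractSource` and the sample payload are hypothetical and only illustrate the idea; this is not the code used in the PR.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// extractSource returns the _source object of a raw Elasticsearch hit.
// Hypothetical helper: it fails if the input has no _source object.
func extractSource(raw []byte) (map[string]interface{}, error) {
	var hit struct {
		Source map[string]interface{} `json:"_source"`
	}
	if err := json.Unmarshal(raw, &hit); err != nil {
		return nil, fmt.Errorf("decoding sample event: %w", err)
	}
	if hit.Source == nil {
		return nil, errors.New("sample event has no _source object")
	}
	return hit.Source, nil
}

func main() {
	// Made-up payload shaped like a search hit; index and ID are placeholders.
	raw := []byte(`{"_index":"example-index","_id":"1","_source":{"message":"hello"}}`)
	src, err := extractSource(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(src["message"])
}
```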

@mtojek
Contributor Author

mtojek commented Oct 20, 2020

Not sure what I should do with entries like:

[2020-10-20T16:31:02.897Z] Error: error running package system tests: could not check for expected data in data stream: one or more errors found in documents stored in logs-apache.access-ep data stream: [0] field "agent.hostname" is not defined

The field agent.hostname is not defined anywhere as it's a common one. Should I accept it even though there is no definition for it in fields.yml?

@ycombinator
Contributor

ycombinator commented Oct 20, 2020

The field agent.hostname is not defined anywhere as it's a common one. Should I accept it even though there is no definition for it in fields.yml?

I assume this field is somehow getting defined in the index template's mapping. Do you know how/where that's happening?

@mtojek
Contributor Author

mtojek commented Oct 21, 2020

I assume this field is somehow getting defined in the index template's mapping. Do you know how/where that's happening?

No idea. Maybe @ruflin knows some more details? I suspect that there are more hidden definitions.

@mtojek
Contributor Author

mtojek commented Oct 21, 2020

The system test runner reports test results under the "Tests" tab:
https://beats-ci.elastic.co/blue/organizations/jenkins/Beats%2Felastic-package/detail/PR-143/12/tests

Moving on...

@mtojek
Contributor Author

mtojek commented Oct 21, 2020

I need to fix the geo_point field:

[0] field "source.geo.location.lon" is not defined
[1] field "source.geo.location.lon" is not defined
[2] field "source.geo.location.lon" is not defined
[3] field "source.geo.location.lon" is not defined
[4] field "source.geo.location.lon" is not defined
[5] field "source.geo.location.lon" is not defined
[6] field "source.geo.location.lat" is not defined
[7] field "source.geo.location.lat" is not defined
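
One possible way to handle this, sketched under the assumption that the parent field (here `source.geo.location`) is declared as `geo_point` and that only the `lat`/`lon` object form needs to be accepted. The `isDefined` helper and the `defs` map are hypothetical; the thread does not show how the PR actually resolved it.

```go
package main

import (
	"fmt"
	"strings"
)

// isDefined reports whether key is covered by defs (field name -> declared type).
// A key ending in ".lat" or ".lon" is also accepted when its parent field is
// declared as geo_point.
func isDefined(defs map[string]string, key string) bool {
	if _, ok := defs[key]; ok {
		return true
	}
	i := strings.LastIndex(key, ".")
	if i < 0 {
		return false
	}
	parent, leaf := key[:i], key[i+1:]
	return defs[parent] == "geo_point" && (leaf == "lat" || leaf == "lon")
}

func main() {
	defs := map[string]string{"source.geo.location": "geo_point"}
	for _, key := range []string{
		"source.geo.location.lon",
		"source.geo.location.lat",
		"source.geo.city_name",
	} {
		fmt.Printf("%s defined: %v\n", key, isDefined(defs, key))
	}
}
```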

@mtojek mtojek requested a review from ycombinator October 22, 2020 14:25
@mtojek
Contributor Author

mtojek commented Oct 26, 2020

Hm, it looks like the CI failed for a couple of fields: https://beats-ci.elastic.co/blue/organizations/jenkins/Beats%2Felastic-package/detail/PR-143/28/tests

Investigating. Not sure why the CI hasn't reported these issues earlier. Current status:

2020/10/26 18:49:10 DEBUG deleting data in data stream...
--- Test results for package: apache - START ---
╭─────────┬─────────────┬───────────┬───────────┬────────────────────────────────────────────────────────────────────────────────────────────┬───────────────╮
│ PACKAGE │ DATA STREAM │ TEST TYPE │ TEST NAME │ RESULT                                                                                     │  TIME ELAPSED │
├─────────┼─────────────┼───────────┼───────────┼────────────────────────────────────────────────────────────────────────────────────────────┼───────────────┤
│ apache  │ access      │ system    │           │ FAIL: one or more errors found in documents stored in logs-apache.access-ep data stream    │ 24.752844956s │
│ apache  │ error       │ system    │           │ FAIL: one or more errors found in documents stored in logs-apache.error-ep data stream     │  22.48721029s │
│ apache  │ status      │ system    │           │ FAIL: one or more errors found in documents stored in metrics-apache.status-ep data stream │ 27.469190637s │
╰─────────┴─────────────┴───────────┴───────────┴────────────────────────────────────────────────────────────────────────────────────────────┴───────────────╯

FAILURE DETAILS:

apache/access :
[0] field "host.containerized" is not defined
[1] field "host.hostname" is not defined
[2] field "host.name" is not defined
[3] field "event.dataset" is not defined
[4] field "input.type" is not defined
[5] field "input.type" is not defined
[6] field "input.type" is not defined
[7] field "input.type" is not defined
[8] field "host.containerized" is not defined
[9] field "input.type" is not defined
apache/error :
[0] field "input.type" is not defined
[1] field "event.timezone" is not defined
apache/status :
[0] field "host.containerized" is not defined
--- Test results for package: apache - END   ---

EDIT:

I noticed one issue with the current implementation: it returns only the first issue found for the document. I think it would be better to return all validation problems, as we do for the spec.

EDIT2:

I'm confused now. I added a couple of fields and it seems that there're new weird ones coming :) (log.input?). Maybe we shouldn't fail on undefined fields and just check the type correctness? I see now a case in which a field change pushed to filebeat/metricbeat (new field added) fails all system tests for integrations. WDYT, @ycombinator ?

@ycombinator
Contributor

I added a couple of fields and it seems that there're new weird ones coming :) (log.input?). Maybe we shouldn't fail on undefined fields and just check the type correctness? I see now a case in which a field change pushed to filebeat/metricbeat (new field added) fails all system tests for integrations.

I imagine such new common fields should not be added frequently so I would actually prefer to be strict, at least initially and see how much of a PITA it becomes for us. If we suddenly see all tests failing for integrations, it's a pretty good indication that it's not something wrong with one or two integrations but something more widespread and common. Looking at the failure details it should become pretty clear what that common new field is and then we can add it to our skip list.

Also, depending on how we solve #147, we might not have to worry about this issue of new common fields being added every so often.

@mtojek
Contributor Author

mtojek commented Oct 27, 2020

Ok, I will adjust the implementation to report all problems and then add missing fields. I'm only worried about future situations in which we have to adjust all integrations (time consuming), but maybe it won't be so hard.

@ycombinator
Contributor

ycombinator commented Oct 27, 2020

Ok, I will adjust the implementation to report all problems and then add missing fields. I'm only worried about future situations in which we have to adjust all integrations (time consuming), but maybe it won't be so hard.

Yes, this is a fair point. I think we'll have to see how it goes. If it starts to become a hassle, I'll be +1 to not failing on undefined fields. But I'm also hoping we can find an elegant solution to #147, maybe something that involves being able to detect these common fields from some other source than the package field definition files.

@mtojek
Contributor Author

mtojek commented Oct 27, 2020

I modified the code to get all validation errors at once, but it went up to 300 errors :) See: https://beats-ci.elastic.co/blue/organizations/jenkins/Beats%2Felastic-package/detail/PR-143/33/tests

I will adjust the code to report only unique records.

@mtojek
Contributor Author

mtojek commented Oct 27, 2020

I managed to clean up all reported fields: https://beats-ci.elastic.co/blue/organizations/jenkins/Beats%2Felastic-package/detail/PR-143/35/tests

Field families that are not related to the specific integration, but appear in documents: cloud.*, host.*, event.*.

@mtojek
Contributor Author

mtojek commented Oct 27, 2020

@ycombinator I fixed all issues and addressed your comments. Please take a look when you're free. Keep in mind the number of additional fields I had to put here.

@mtojek mtojek requested a review from ycombinator October 27, 2020 12:42
isFieldFamilyMatching("event", key) || // too many common fields
isFieldFamilyMatching("host", key) || // too many common fields
isFieldFamilyMatching("metricset", key) || // field is deprecated
isFieldFamilyMatching("event.module", key) // field is deprecated
Contributor

This line is now redundant with isFieldFamilyMatching("event", key) but I think we should keep it anyway because even if we were to somehow get rid of isFieldFamilyMatching("event", key) in the future, we would want this line to be here, correct?

Contributor Author

Right, maybe it's better to keep it here even though it's redundant.
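
To illustrate the redundancy being discussed: the body of `isFieldFamilyMatching` is not shown in this conversation, so the version below is an assumption (exact match, or prefix match on the family plus a dot), and `isSkipped` is a hypothetical wrapper around the four quoted lines.

```go
package main

import (
	"fmt"
	"strings"
)

// isFieldFamilyMatching is an assumed implementation: a key belongs to a
// family if it equals the family name or is nested under it.
func isFieldFamilyMatching(family, key string) bool {
	return key == family || strings.HasPrefix(key, family+".")
}

// isSkipped is a hypothetical wrapper around the skip list quoted above.
func isSkipped(key string) bool {
	return isFieldFamilyMatching("event", key) || // too many common fields
		isFieldFamilyMatching("host", key) || // too many common fields
		isFieldFamilyMatching("metricset", key) || // field is deprecated
		isFieldFamilyMatching("event.module", key) // already covered by "event"
}

func main() {
	for _, key := range []string{"event.module", "host.name", "eventful", "apache.access.ssl.protocol"} {
		fmt.Printf("%s skipped: %v\n", key, isSkipped(key))
	}
}
```

Under this assumption, every key matched by the `event.module` line is also matched by the `event` line, which is why the reviewer calls it redundant; keeping it only matters if the broader `event` entry is removed later.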

Contributor

@ycombinator ycombinator left a comment

LGTM. Nice job on this feature!

@mtojek mtojek merged commit 86e1f26 into elastic:master Oct 27, 2020