
Metadata overwrites event data #7959

Closed · jsargiot opened this issue Aug 13, 2018 · 7 comments

@jsargiot commented Aug 13, 2018

Prepare a log file with:

{ "fieldA": "A1", "fieldB": "B1", "host": "somehostname" }

Filebeat input configuration:

- type: log
  enabled: true
  paths:
    - /var/log/filebeat-test/bug.log
  fields_under_root: true
  json:
    keys_under_root: true
    overwrite_keys: true
    add_error_key: true

Resulting output:

{
  "@timestamp": "2018-08-13T19:38:44.658Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "doc",
    "version": "6.3.2"
  },
  "prospector": {
    "type": "log"
  },
  "offset": 0,
  "fieldA": "A1",
  "fieldB": "B1",
  "host": {                        # <-- should have "host": "somehostname",
    "name": "master1"
  },
  "source": "/var/log/filebeat-test/bug.log",
  "input": {
    "type": "log"
  },
  "beat": {
    "name": "master1",
    "hostname": "master1",
    "version": "6.3.2"
  }
}

The problem here is that host was overwritten by host.name, introduced in PR #7051, even though the documentation states:

overwrite_keys
If keys_under_root and this setting are enabled, then the values from the decoded JSON object overwrite the fields that Filebeat normally adds (type, source, offset, etc.) in case of conflicts.

https://www.elastic.co/guide/en/beats/filebeat/master/filebeat-input-log.html#filebeat-input-log-config-json

I've been digging around the code a little and the issue seems to be in the order in which these actions are performed: https://github.com/tsg/beats/blob/6.3/libbeat/publisher/pipeline/processor.go#L32-L41

Beat and host metadata are added at the end, right before the global processors. Maybe the metadata should be added before the client processors, or fields that are already in the event should not be overwritten when it is added. The purpose was to prevent any processor from removing beats data (#5149 (comment)), but it has the undesired side effect of overwriting event data.
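
To make the ordering concrete, here is a simplified sketch of the pipeline construction described above (illustrative names only, not the actual libbeat code):

package main

import "fmt"

type event map[string]interface{}
type processor func(event)

// Simplified ordering of the 6.3 publisher pipeline: the per-input steps
// (including JSON decoding with overwrite_keys) run first, and the beat
// and host metadata is merged in afterwards, so the metadata wins conflicts.
func pipeline(clientProcessors []processor, addMetadata processor, globalProcessors []processor) []processor {
	steps := append([]processor{}, clientProcessors...) // input fields, JSON decoding
	steps = append(steps, addMetadata)                  // beat.* and host.* merged here
	return append(steps, globalProcessors...)           // processors from filebeat.yml
}

func main() {
	evt := event{"host": "somehostname"} // from the decoded JSON line
	addMeta := processor(func(e event) {
		e["host"] = map[string]string{"name": "master1"} // clobbers the event's host
	})
	for _, p := range pipeline(nil, addMeta, nil) {
		p(evt)
	}
	fmt.Println(evt["host"]) // prints map[name:master1], not "somehostname"
}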

This was also discussed here: https://discuss.elastic.co/t/logstash-errors-after-upgrading-to-filebeat-6-3-0/135984/19

Btw, if we agree on one of the two alternatives (moving the addition of the beat fields a few steps earlier, or avoiding the overwrite), I can work on a PR.

@ruflin (Contributor) commented Aug 14, 2018

There are a few fields in Beats that are "special", as you found out. The main problem with your data is that it would conflict with our template: the template defines host.name, so host cannot be a keyword and an object at the same time.

I would say it is ok if we allowed overwriting host.name, but changing the type would, I think, have too many negative side effects.
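
For anyone unfamiliar with that conflict, a minimal sketch (values taken from the report above) of the two shapes of host that a single index mapping cannot hold at once:

package main

import "fmt"

func main() {
	// Shape 1: host as a plain string, which the template would map as a keyword.
	fromLogLine := map[string]interface{}{"host": "somehostname"}

	// Shape 2: host as an object, which the template maps via host.name.
	fromBeatMetadata := map[string]interface{}{
		"host": map[string]interface{}{"name": "master1"},
	}

	// Elasticsearch maps a field once per index, so documents using the
	// other shape are rejected as mapping conflicts.
	fmt.Println(fromLogLine, fromBeatMetadata)
}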

@jsargiot (Author)

The thing here is that I'm sending the output to Kafka, so the template doesn't really come into play in my case... so I could afford to have this mismatch.

What do you think about a flag builtin_overwrites_event with a default value of true to mimic the current behaviour? If set to false, this flag prevents any "special" field X.Y.Z from overwriting the event fields X, Y, or Z when they exist as plain values: host.name won't replace host or host.name if either exists as a value, but it would write the value if host exists as a map and host.name doesn't. We could add a warning in the docs about the mismatch with the template.

If I'm not mistaken, the new flag should alter the behaviour of the default branch in https://github.com/elastic/beats/blob/6.4/libbeat/common/mapstr.go#L74 and https://github.com/elastic/beats/blob/6.4/libbeat/common/mapstr.go#L93.
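
Roughly, the merge would look something like this (an illustrative sketch of the proposed semantics, not the actual mapstr.go code):

package main

import "fmt"

// deepUpdate merges src into dst. With overwrite=false (the proposed
// builtin_overwrites_event: false behaviour), an existing plain value in
// dst is never replaced; merging only descends into existing maps.
func deepUpdate(dst, src map[string]interface{}, overwrite bool) {
	for k, v := range src {
		old, exists := dst[k]
		if sub, ok := v.(map[string]interface{}); ok && exists {
			if oldSub, ok := old.(map[string]interface{}); ok {
				deepUpdate(oldSub, sub, overwrite) // descend into an existing map
				continue
			}
		}
		if exists && !overwrite {
			continue // keep the event's value, e.g. host: "somehostname"
		}
		dst[k] = v
	}
}

func main() {
	evt := map[string]interface{}{"host": "somehostname"}
	meta := map[string]interface{}{"host": map[string]interface{}{"name": "master1"}}
	deepUpdate(evt, meta, false)
	fmt.Println(evt) // map[host:somehostname] -- host.name did not clobber host
}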

If you're ok with it, I can prepare a PR for that change and we can discuss it there. What do you think, @ruflin?

@ruflin (Contributor) commented Aug 15, 2018

What about putting your host field under a prefix like your-prefix.host and renaming it later? Or putting everything under json, not using fields_under_root, and doing the modifications later? Do you pull the data from Kafka with Logstash?

@jsargiot (Author)

What about putting your host field under a prefix like your-prefix.host and renaming it later?

This is a workaround I tested, basically (for those interested):

        - type: log
          enabled: true
          paths:
            - /var/log/nginx/json-access.log
          fields:
            beat.type: nginx_access
            beat._transform: json
          fields_under_root: true

I don't parse the JSON in the input; instead I flag the events with beat._transform: json.

Then, in filebeat.yml, I add a global processor that decodes the JSON when beat._transform equals json:

processors:
  - decode_json_fields:
      fields:
        - message
      target: ""
      overwrite_keys: true
      when:
        equals:
          beat._transform: json
  - drop_fields:
      fields:
        - beat._transform
        - message
      when:
        has_fields:
          - beat._transform

This works because global processors are applied last, so decode_json_fields with overwrite_keys: true actually overwrites any data already on the event, which ensures that the info from the event is kept. I haven't tested the performance of this approach though (especially for events that do not have beat._transform: json).

Do you pull the data from Kafka with Logstash?

Logstash is one of several systems that pull the data; that's why I'm so reluctant to change the events. I'll stay on 6.2.4 for now, until I can update each of the components to use the json field or a prefixed structure.

@ruflin (Contributor) commented Aug 17, 2018

Thanks for sharing the workaround. Are there any cases where it won't work?

@jsargiot (Author)

Are there any cases where it won't work?

For my use case it works perfectly; it just pollutes the config a little and makes it a bit more "complicated", but it does the job.

Also, performance doesn't seem to be affected, as the JSON decoding and the dropping of message happen in a "deferred" way, and the only overhead is evaluating the conditions.

I think it's ok to close this issue, thanks @ruflin for your help!

@ruflin (Contributor) commented Aug 20, 2018

Let's close this issue for now; we can still reopen it if other issues pop up. It's also good to have the issue here for other people stumbling over the same problem.

ruflin closed this as completed Aug 20, 2018