Skip to content
This repository has been archived by the owner on Oct 2, 2020. It is now read-only.

improve undefined field handling #16

Merged
merged 1 commit into from
Feb 22, 2019
Merged

Conversation

richm
Copy link
Member

@richm richm commented Feb 8, 2019

@ewolinetz @jcantrill @nhosoi @josefkarasek
I think I have figured out a way to keep merge_json_log on by default, without running into the problem where fields can have different values. Basically, convert unknown fields to their JSON string representation.

@lukas-vlcek
Copy link
Member

Hold on, I will provide example which will demonstrate this is not enough...

@richm
Copy link
Member Author

richm commented Feb 12, 2019

Hold on, I will provide example which will demonstrate this is not enough...

?

@lukas-vlcek
Copy link
Member

lukas-vlcek commented Feb 18, 2019

@richm The main problem is with dots in field names. They are interpreted as path separator. This can still lead to issues in certain (but non-rare) cases. Let me give you a simple example using OOTB ES 5.6.14:

$ curl -X POST -H "Content-type: application/json" \
  localhost:9200/test/one -d '{"foo.bar":"Hello"}'

# First document successfully indexed and mapping is created dynamically (see below)
{  
   "_index":"test",
   "_type":"one",
   "_id":"AWkAlzEzn0xV1HWswgES",
   "_version":1,
   "result":"created",
   "_shards":{  
      "total":2,
      "successful":1,
      "failed":0
   },
   "created":true
}

# Now, try index another document
$ curl -X POST -H "Content-type: application/json" \
  localhost:9200/test/one -d '{"foo":"Bad apple"}'

{  
   "error":{  
      "root_cause":[  
         {  
            "type":"mapper_parsing_exception",
            "reason":"object mapping for [foo] tried to parse field [foo] as object, but found a concrete value"
         }
      ],
      "type":"mapper_parsing_exception",
      "reason":"object mapping for [foo] tried to parse field [foo] as object, but found a concrete value"
   },
   "status":400
}

So the issue can happen when the following occurs in specific order:

  1. The document with dot(s) in a field name comes first, like {"foo.bar": _some value_}.
  2. Later comes another document having field equal to some prefix of the dotted field name, like {"foo": _non object_}.

Step no.1 resulted in mapping:

$ curl -X GET localhost:9200/test/_mapping?pretty
{
  "test" : {
    "mappings" : {
      "one" : {
        "properties" : {
          "foo" : {
            "properties" : {
              "bar" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
}}}}}}}}}}

In step no.2 we are trying to index non object value to field "foo" which is not possible. It is already mapped as an object.

When we switch the order in which the documents are indexed we still run into similar issue:

$ curl -X POST -H "Content-type: application/json" \
  localhost:9200/test2/one -d '{"foo":"Bad apple"}'

# First document successfully indexed
{  
   "_index":"test2",
   "_type":"one",
   "_id":"AWkAuLFen0xV1HWswgEW",
   "_version":1,
   "result":"created",
   "_shards":{  
      "total":2,
      "successful":1,
      "failed":0
   },
   "created":true
}

$ curl -X POST -H "Content-type: POST -H "Content-type: application/json" \
  localhost:9200/test2/one -d '{"foo.bar":"Hello"}'

# We run into similar issue...
{  
   "error":{  
      "root_cause":[  
         {  
            "type":"mapper_parsing_exception",
            "reason":"Could not dynamically add mapping for field [foo.bar]. Existing mapping for [foo] must be of type object but found [text]."
         }
      ],
      "type":"mapper_parsing_exception",
      "reason":"Could not dynamically add mapping for field [foo.bar]. Existing mapping for [foo] must be of type object but found [text]."
   },
   "status":400
}

I can not think of simple solution to this. We can either accept the fact that custom documents will be rejected (and this can be "nondeterministics" depending on which document will make it first...) or we can try to inspect incoming document in collector and check each field name for dots and compare it with existing mappings and decide what to do with conflicts... but this will not scale and still will suffer from near-real time aspect of ES (or better from network latency between collector ES client and ES node, ie the collector may not have up-to-date mapping info).

It seems that one possible approach is by using Elasticsearch ingest node, which provide several processors including dot expander and if any of it fails, then there can be defined failure pipeline handler to index such documents into different index for example.

@richm
Copy link
Member Author

richm commented Feb 18, 2019

ok - so we can't generally support dots in field names - so this solution will solve 80%+ of the cases we currently have - not aware of any cases we have seen in which the customer has dots in the field names.

@lukas-vlcek
Copy link
Member

lukas-vlcek commented Feb 18, 2019

I have seen such cases. For example here: https://bugzilla.redhat.com/show_bug.cgi?id=1666141#c5
The mapping in ES 2.x was like:

...
          "exception": {            # <=== custom mapping, field is string
            "type": "string",
            "fields": {
              "raw": {
                "type": "string",
                "index": "not_analyzed",
                "ignore_above": 256
              }
            }
          },
          "exception.class": {      # <=== custom mapping, field is not object.. this is issue
            "type": "string",
            "fields": {
              "raw": {
                "type": "string",
                "index": "not_analyzed",
                "ignore_above": 256
              }
            }
          },
...

So user is having both the fields: exception and exception.class, (plus more examples in the BZ).
This was causing heavy pain when migrating from ES 2.x to ES 5.x.

@richm
Copy link
Member Author

richm commented Feb 18, 2019

So user is having both the fields: exception and exception.class, (plus more examples in the BZ).
This was causing heavy pain when migrating from ES 2.x to ES 5.x.

Is that because ES 2 allowed you to have fields with a dot in the name, but ES 5 did not?

What if we

  • convert unknown field values to string
  • convert . in field names to _ (which is what the original fluentd elasticsearch pipeline did)

If the user really wants exception.class to be converted to {"exception":{"class", "value"}} in Elasticsearch, the user will either need to

  • change the application to log "{\"exception\":{\"class\", \"value\"}}"
  • write a custom fluentd record modifier to convert "exception.class":"value" to be {"exception":{"class", "value"}}, and convert "exception":"string value" to be something like {"exception":{"msg", "string value"}} - that is, map "exception":"string value" to "exception.msg":"string value" - the user will have to "make up" a name for the "exception" field to be mapped to its object representation

With respect to the problem of what to do with a string valued field named "exception" when we've already added "exception" as an object field - how would a custom Elasticsearch ingester handle this? Would it be able to look up internally that there is already an "exception" object field, and
Even with a custom Elasticsearch ingester, that code would still need to be able to convert "exception":"string value" to be something like {"exception":{"msg", "string value"}}

@lukas-vlcek
Copy link
Member

@richm

Is that because ES 2 allowed you to have fields with a dot in the name, but ES 5 did not?

It is because ES 2 and ES 5 interpret dots in field names differently (with short period where in some ES 2 versions the dots in field names were forbidden). In ES 2 it is just part of the field name, whereas in ES 5 it is a path separator. And this means that in some cases you can not directly migrate indices from ES 2 to ES 5 - see the second example of conflicting fields in official docs (we were hitting this too).

On general I would recommend you read ticket #15951 which goes through this topic in more detail.

With respect to the problem of what to do with a string valued field named "exception" when we've already added "exception" as an object field - how would a custom Elasticsearch ingester handle this?

I do not have experience with ingest pipelines yet, but I think it will simply fail indexing the document, meaning the handling failures docs is relevant here. (I think it can just store the document into extra index which can be processed later using ad-hoc logic, similarly to orphan index...).

Would it be able to look up internally that there is already an "exception" object field, and
Even with a custom Elasticsearch ingester, that code would still need to be able to convert "exception":"string value" to be something like {"exception":{"msg", "string value"}}

I am not sure if ingest processors have access to index mapping. Probably not. Still, if you think about this, there isn't any nice automated solution to this problem. Think about this, if you already have exception object in mappings then you can not index exception as string. So you end-up rejecting all documents having exception strings, however, if you first encounter exception as a string and this makes it into index mappings, then you end-up rejecting all documents having exception as object. This is not deterministic, this depends solely on your input data.

@richm
Copy link
Member Author

richm commented Feb 19, 2019

No matter what we do, we will have a problem with upgrading indices from ES 2 to ES 5 where there are dots in the field name, unless we implement some sort of custom reindexer, which is outside the scope of this PR.
As this PR stands now, it gets us most of the cases where fields can have multiple data types.
I will add an option to convert . in field names to _ (or some other configurable character) to avoid the "exception" vs. "exception.class" problem, which will solve even more cases that fail currently with MERGE_JSON_LOG.
Finally, if the user really wants to have "exception.class", then they can add exception to the list of fields we just pass through, in extra_keep_fields https://github.com/ViaQ/fluent-plugin-viaq_data_model#configuration - AND - the user must implement some sort of custom filter logic in fluentd to handle the "exception" vs. "exception.class" issue (e.g. by moving "exception" to be "exception":{"msg":...} )

With that being said - do you think it is still worth pursuing this PR, in order to be able to support MERGE_JSON_LOG for those customers who will continue to rely on it?

@lukas-vlcek
Copy link
Member

I think it is worth pushing forward.

@richm richm force-pushed the undefined_to_string branch from 2f261e5 to 43ef109 Compare February 20, 2019 00:11
@richm richm changed the title undefined_to_string - convert unknown fields to string value improve undefined field handling Feb 20, 2019
@richm
Copy link
Member Author

richm commented Feb 20, 2019

@richm richm force-pushed the undefined_to_string branch from 43ef109 to 332babf Compare February 20, 2019 00:34
if @undefined_max_num_fields > -1 && undefined_keys.length > @undefined_max_num_fields
undefined = {}
undefined_keys.each{|k|undefined[k] = record.delete(k)}
record[@undefined_name] = JSON.dump(undefined)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to check @use_undefined for this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no - using undefined_max_num_fields implies that you want to use undefined_name - otherwise, there is no other place to put that value

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the README

@richm richm force-pushed the undefined_to_string branch from 332babf to c840413 Compare February 20, 2019 01:51
@nhosoi
Copy link

nhosoi commented Feb 20, 2019

/lgtm

Copy link
Member

@jcantrill jcantrill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few nits othwise LGTM

lib/fluent/plugin/filter_viaq_data_model.rb Outdated Show resolved Hide resolved
lib/fluent/plugin/filter_viaq_data_model.rb Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@ewolinetz
Copy link

minor readme rephrasing, otherwise /lgtm

added support for the following new config parameters:

- undefined_to_string - convert all undefined fields to their
  JSON string representation e.g. convert `4` to `"4"`
- undefined_dot_replace_char - if an undefined field name
  contains the `.` character, replace it with `_` e.g.
  replace field "a.b.c" with "a_b_c"
- undefined_max_num_fields - if the number of undefined
  fields is greater than this number, convert all of the
  undefined fields to a JSON string blob and store it in
  the field `"undefined"`
@richm richm force-pushed the undefined_to_string branch from c840413 to 8b5ef11 Compare February 20, 2019 20:46
@ewolinetz
Copy link

/lgtm

@richm richm merged commit d48417e into ViaQ:master Feb 22, 2019
@richm richm deleted the undefined_to_string branch February 22, 2019 20:42
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants