Hold on, I will provide an example which will demonstrate that this is not enough...
@richm The main problem is with dots in field names: they are interpreted as path separators. This can still lead to issues in certain (but not rare) cases. Let me give you a simple example using OOTB ES 5.6.14:

$ curl -X POST -H "Content-type: application/json" \
localhost:9200/test/one -d '{"foo.bar":"Hello"}'
# First document successfully indexed and mapping is created dynamically (see below)
{
"_index":"test",
"_type":"one",
"_id":"AWkAlzEzn0xV1HWswgES",
"_version":1,
"result":"created",
"_shards":{
"total":2,
"successful":1,
"failed":0
},
"created":true
}
# Now, try to index another document
$ curl -X POST -H "Content-type: application/json" \
localhost:9200/test/one -d '{"foo":"Bad apple"}'
{
"error":{
"root_cause":[
{
"type":"mapper_parsing_exception",
"reason":"object mapping for [foo] tried to parse field [foo] as object, but found a concrete value"
}
],
"type":"mapper_parsing_exception",
"reason":"object mapping for [foo] tried to parse field [foo] as object, but found a concrete value"
},
"status":400
}

So the issue can happen when the following occurs in a specific order:

1. Index a document containing the dotted field name foo.bar (the mapping is created dynamically).
2. Index a document containing a concrete (non-object) value for the field foo.

Step no. 1 resulted in this mapping:

$ curl -X GET localhost:9200/test/_mapping?pretty
{
  "test" : {
    "mappings" : {
      "one" : {
        "properties" : {
          "foo" : {
            "properties" : {
              "bar" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

In step no. 2 we are trying to index a non-object value into the field foo.

When we switch the order in which the documents are indexed, we still run into a similar issue:

$ curl -X POST -H "Content-type: application/json" \
localhost:9200/test2/one -d '{"foo":"Bad apple"}'
# First document successfully indexed
{
"_index":"test2",
"_type":"one",
"_id":"AWkAuLFen0xV1HWswgEW",
"_version":1,
"result":"created",
"_shards":{
"total":2,
"successful":1,
"failed":0
},
"created":true
}
$ curl -X POST -H "Content-type: POST -H "Content-type: application/json" \
localhost:9200/test2/one -d '{"foo.bar":"Hello"}'
# We run into similar issue...
{
"error":{
"root_cause":[
{
"type":"mapper_parsing_exception",
"reason":"Could not dynamically add mapping for field [foo.bar]. Existing mapping for [foo] must be of type object but found [text]."
}
],
"type":"mapper_parsing_exception",
"reason":"Could not dynamically add mapping for field [foo.bar]. Existing mapping for [foo] must be of type object but found [text]."
},
"status":400
}

I cannot think of a simple solution to this. We can either accept the fact that custom documents will be rejected (and this can be "nondeterministic", depending on which document makes it in first...), or we can try to inspect each incoming document in the collector, check every field name for dots, compare it against the existing mappings, and decide what to do with conflicts... but that will not scale, and it will still suffer from the near-real-time aspect of ES (or rather from the network latency between the collector's ES client and the ES node, i.e. the collector may not have up-to-date mapping info). It seems that one possible approach is to use an Elasticsearch ingest node, which provides several processors, including the dot expander; if any of them fails, a failure pipeline handler can be defined to, for example, index such documents into a different index.
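For illustration, here is a minimal sketch of that ingest-node approach against ES 5.x. The pipeline id (expand-dots) and the hard-coded field name are assumptions made for this example; in ES 5 the dot_expander processor must be given a concrete field to expand:

$ curl -X PUT -H "Content-type: application/json" \
localhost:9200/_ingest/pipeline/expand-dots -d '{
  "description": "expand dotted field names into objects",
  "processors": [
    { "dot_expander": { "field": "foo.bar" } }
  ]
}'
# Index through the pipeline so foo.bar is expanded before mapping:
$ curl -X POST -H "Content-type: application/json" \
"localhost:9200/test/one?pipeline=expand-dots" -d '{"foo.bar":"Hello"}'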
OK - so we can't generally support dots in field names, but this solution will solve 80%+ of the cases we currently have. I'm not aware of any cases we have seen in which the customer has dots in the field names.
I have seen such cases. For example here: https://bugzilla.redhat.com/show_bug.cgi?id=1666141#c5

...
"exception": { # <=== custom mapping, field is string
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed",
"ignore_above": 256
}
}
},
"exception.class": { # <=== custom mapping, field is not object.. this is issue
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed",
"ignore_above": 256
}
}
},
...

So the user has both fields, exception (a string) and exception.class, side by side.
Is that because ES 2 allowed you to have fields with a dot in the name, but ES 5 did not? What if we
If the user really wants
With respect to the problem of what to do with a string-valued field named
It is because ES 2 and ES 5 interpret dots in field names differently (with a short period where, in some ES 2 versions, dots in field names were forbidden). In ES 2 a dot is just part of the field name, whereas in ES 5 it is a path separator. This means that in some cases you cannot directly migrate indices from ES 2 to ES 5 - see the second example of conflicting fields in the official docs (we were hitting this too). In general, I would recommend you read ticket #15951, which goes through this topic in more detail.
I do not have experience with ingest pipelines yet, but I think it will simply fail to index the document, which means the handling-failures docs are relevant here. (I think it could just store the document in an extra index which can be processed later using ad-hoc logic, similar to the orphan index...).
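A minimal sketch of that failure handler, adapted from the handling-failures docs (the failed-{{ _index }} naming is just an assumption): adding this block to the pipeline definition above redirects any document that fails a processor into a separate index instead of rejecting it:

"on_failure": [
  { "set": { "field": "_index", "value": "failed-{{ _index }}" } }
]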
I am not sure if ingest processors have access to the index mapping. Probably not. Still, if you think about it, there isn't any nice automated solution to this problem. Think about it: if you already have
No matter what we do, we will have a problem upgrading indices from ES 2 to ES 5 where there are dots in the field names, unless we implement some sort of custom reindexer, which is outside the scope of this PR. That being said - do you think it is still worth pursuing this PR, in order to support MERGE_JSON_LOG for those customers who will continue to rely on it?
I think it is worth pushing forward.
Force-pushed 2f261e5 to 43ef109
Force-pushed 43ef109 to 332babf
# Collapse all undefined fields into a single JSON string blob when there are
# more of them than undefined_max_num_fields allows (-1 disables the check)
if @undefined_max_num_fields > -1 && undefined_keys.length > @undefined_max_num_fields
  undefined = {}
  undefined_keys.each { |k| undefined[k] = record.delete(k) }
  record[@undefined_name] = JSON.dump(undefined)
No need to check @use_undefined for this case?
No - using undefined_max_num_fields implies that you want to use undefined_name - otherwise, there is no other place to put that value.
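For context, this is how those parameters might look in a fluentd filter configuration; the plugin type name and the values here are assumptions for illustration:

<filter **>
  @type viaq_data_model          # assumed plugin type name
  undefined_name undefined       # where collapsed fields are stored
  undefined_max_num_fields 500   # -1 disables the limit
</filter>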
I updated the README
Force-pushed 332babf to c840413
/lgtm
A few nits, otherwise LGTM.
Minor README rephrasing, otherwise /lgtm.
added support for the following new config parameters:
- undefined_to_string - convert all undefined fields to their JSON string representation, e.g. convert `4` to `"4"`
- undefined_dot_replace_char - if an undefined field name contains the `.` character, replace it with `_`, e.g. replace field "a.b.c" with "a_b_c"
- undefined_max_num_fields - if the number of undefined fields is greater than this number, convert all of the undefined fields to a JSON string blob and store it in the field `"undefined"`
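To make the combined behavior concrete, here is a hypothetical Ruby sketch of how the three parameters interact; the constant names and the fixed processing order are illustrative assumptions, not the plugin's actual code:

require 'json'

# Illustrative stand-ins for the real config parameters
DOT_REPLACE_CHAR = '_'       # undefined_dot_replace_char
MAX_NUM_FIELDS   = 2         # undefined_max_num_fields
UNDEFINED_NAME   = 'undefined'

def process_undefined(fields)
  out = {}
  fields.each do |k, v|
    # undefined_dot_replace_char: stop ES 5 from treating dots as path separators
    key = k.gsub('.', DOT_REPLACE_CHAR)
    # undefined_to_string: `4` becomes `"4"`, so a value can never conflict
    # with an existing string mapping for the same field
    out[key] = v.is_a?(String) ? v : v.to_json
  end
  # undefined_max_num_fields: too many undefined fields -> one JSON string blob
  out = { UNDEFINED_NAME => JSON.dump(out) } if out.size > MAX_NUM_FIELDS
  out
end

p process_undefined('a.b.c' => 4, 'foo' => 'bar', 'baz' => true)
# => {"undefined"=>"{\"a_b_c\":\"4\",\"foo\":\"bar\",\"baz\":\"true\"}"}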
Force-pushed c840413 to 8b5ef11
/lgtm
@ewolinetz @jcantrill @nhosoi @josefkarasek
I think I have figured out a way to keep merge_json_log on by default, without running into the problem where the same field can arrive with values of conflicting types. Basically: convert unknown fields to their JSON string representation.