Hold on, I will provide an example which will demonstrate that this is not enough...
@richm The main problem is with dots in field names: they are interpreted as path separators. This can still lead to issues in certain (but not rare) cases. Let me give you a simple example using OOTB ES 5.6.14:

$ curl -X POST -H "Content-type: application/json" \
localhost:9200/test/one -d '{"foo.bar":"Hello"}'
# First document successfully indexed and mapping is created dynamically (see below)
{
"_index":"test",
"_type":"one",
"_id":"AWkAlzEzn0xV1HWswgES",
"_version":1,
"result":"created",
"_shards":{
"total":2,
"successful":1,
"failed":0
},
"created":true
}
# Now, try to index another document
$ curl -X POST -H "Content-type: application/json" \
localhost:9200/test/one -d '{"foo":"Bad apple"}'
{
"error":{
"root_cause":[
{
"type":"mapper_parsing_exception",
"reason":"object mapping for [foo] tried to parse field [foo] as object, but found a concrete value"
}
],
"type":"mapper_parsing_exception",
"reason":"object mapping for [foo] tried to parse field [foo] as object, but found a concrete value"
},
"status":400
}

So the issue can happen when the following occurs in a specific order:

1. Index a document containing the dotted field name foo.bar (the mapping is created dynamically).
2. Index a document containing a concrete (non-object) value for the field foo.

Step no. 1 resulted in this mapping:

$ curl -X GET localhost:9200/test/_mapping?pretty
{
  "test" : {
    "mappings" : {
      "one" : {
        "properties" : {
          "foo" : {
            "properties" : {
              "bar" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

In step no. 2 we are trying to index a non-object value into the field foo.

When we switch the order in which the documents are indexed, we still run into a similar issue:

$ curl -X POST -H "Content-type: application/json" \
localhost:9200/test2/one -d '{"foo":"Bad apple"}'
# First document successfully indexed
{
"_index":"test2",
"_type":"one",
"_id":"AWkAuLFen0xV1HWswgEW",
"_version":1,
"result":"created",
"_shards":{
"total":2,
"successful":1,
"failed":0
},
"created":true
}
$ curl -X POST -H "Content-type: POST -H "Content-type: application/json" \
localhost:9200/test2/one -d '{"foo.bar":"Hello"}'
# We run into similar issue...
{
"error":{
"root_cause":[
{
"type":"mapper_parsing_exception",
"reason":"Could not dynamically add mapping for field [foo.bar]. Existing mapping for [foo] must be of type object but found [text]."
}
],
"type":"mapper_parsing_exception",
"reason":"Could not dynamically add mapping for field [foo.bar]. Existing mapping for [foo] must be of type object but found [text]."
},
"status":400
}

I cannot think of a simple solution to this. We can either accept the fact that custom documents will be rejected (and this can be "nondeterministic", depending on which document makes it in first...), or we can try to inspect each incoming document in the collector, check every field name for dots, compare it against the existing mappings, and decide what to do with conflicts... but that will not scale, and it will still suffer from the near-real-time aspect of ES (or rather from the network latency between the collector's ES client and the ES node, i.e. the collector may not have up-to-date mapping info). It seems that one possible approach is to use an Elasticsearch ingest node, which provides several processors, including the dot expander; if any of them fails, a failure pipeline handler can be defined to, for example, index such documents into a different index.
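For illustration, here is a minimal sketch of that ingest-node approach against ES 5.x. The pipeline id (expand-dots) and the hard-coded field name are assumptions made for this example; in ES 5 the dot_expander processor must be given a concrete field to expand:

$ curl -X PUT -H "Content-type: application/json" \
localhost:9200/_ingest/pipeline/expand-dots -d '{
  "description": "expand dotted field names into objects",
  "processors": [
    { "dot_expander": { "field": "foo.bar" } }
  ]
}'
# Index through the pipeline so foo.bar is expanded before mapping:
$ curl -X POST -H "Content-type: application/json" \
"localhost:9200/test/one?pipeline=expand-dots" -d '{"foo.bar":"Hello"}'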
OK - so we can't generally support dots in field names, but this solution will solve 80%+ of the cases we currently have. I'm not aware of any cases we have seen in which the customer has dots in the field names.
I have seen such cases. For example here: https://bugzilla.redhat.com/show_bug.cgi?id=1666141#c5

...
"exception": { # <=== custom mapping, field is string
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed",
"ignore_above": 256
}
}
},
"exception.class": { # <=== custom mapping, field is not object.. this is issue
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed",
"ignore_above": 256
}
}
},
...

So the user has both fields, exception (a string) and exception.class, side by side.
Is that because ES 2 allowed you to have fields with a dot in the name, but ES 5 did not? What if we
If the user really wants
With respect to the problem of what to do with a string-valued field named
It is because ES 2 and ES 5 interpret dots in field names differently (with a short period where, in some ES 2 versions, dots in field names were forbidden). In ES 2 a dot is just part of the field name, whereas in ES 5 it is a path separator. This means that in some cases you cannot directly migrate indices from ES 2 to ES 5 - see the second example of conflicting fields in the official docs (we were hitting this too). In general, I would recommend you read ticket #15951, which goes through this topic in more detail.
I do not have experience with ingest pipelines yet, but I think it will simply fail to index the document, which means the handling-failures docs are relevant here. (I think it could just store the document in an extra index which can be processed later using ad-hoc logic, similar to the orphan index...).
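A minimal sketch of that failure handler, adapted from the handling-failures docs (the failed-{{ _index }} naming is just an assumption): adding this block to the pipeline definition above redirects any document that fails a processor into a separate index instead of rejecting it:

"on_failure": [
  { "set": { "field": "_index", "value": "failed-{{ _index }}" } }
]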
I am not sure if ingest processors have access to the index mapping. Probably not. Still, if you think about it, there isn't any nice automated solution to this problem. Think about it: if you already have
No matter what we do, we will have a problem upgrading indices from ES 2 to ES 5 where there are dots in the field names, unless we implement some sort of custom reindexer, which is outside the scope of this PR. That being said - do you think it is still worth pursuing this PR, in order to support MERGE_JSON_LOG for those customers who will continue to rely on it?
I think it is worth pushing forward.
Force-pushed 2f261e5 to 43ef109
Force-pushed 43ef109 to 332babf
# Collapse all undefined fields into a single JSON string blob when there are
# more of them than undefined_max_num_fields allows (-1 disables the check)
if @undefined_max_num_fields > -1 && undefined_keys.length > @undefined_max_num_fields
  undefined = {}
  undefined_keys.each { |k| undefined[k] = record.delete(k) }
  record[@undefined_name] = JSON.dump(undefined)
No need to check @use_undefined for this case?
No - using undefined_max_num_fields implies that you want to use undefined_name - otherwise, there is no other place to put that value.
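For context, this is how those parameters might look in a fluentd filter configuration; the plugin type name and the values here are assumptions for illustration:

<filter **>
  @type viaq_data_model          # assumed plugin type name
  undefined_name undefined       # where collapsed fields are stored
  undefined_max_num_fields 500   # -1 disables the limit
</filter>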
I updated the README
Force-pushed 332babf to c840413
/lgtm
A few nits, otherwise LGTM.
Minor README rephrasing, otherwise /lgtm.
added support for the following new config parameters:
- undefined_to_string - convert all undefined fields to their JSON string representation, e.g. convert `4` to `"4"`
- undefined_dot_replace_char - if an undefined field name contains the `.` character, replace it with `_`, e.g. replace field "a.b.c" with "a_b_c"
- undefined_max_num_fields - if the number of undefined fields is greater than this number, convert all of the undefined fields to a JSON string blob and store it in the field `"undefined"`
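To make the combined behavior concrete, here is a hypothetical Ruby sketch of how the three parameters interact; the constant names and the fixed processing order are illustrative assumptions, not the plugin's actual code:

require 'json'

# Illustrative stand-ins for the real config parameters
DOT_REPLACE_CHAR = '_'       # undefined_dot_replace_char
MAX_NUM_FIELDS   = 2         # undefined_max_num_fields
UNDEFINED_NAME   = 'undefined'

def process_undefined(fields)
  out = {}
  fields.each do |k, v|
    # undefined_dot_replace_char: stop ES 5 from treating dots as path separators
    key = k.gsub('.', DOT_REPLACE_CHAR)
    # undefined_to_string: `4` becomes `"4"`, so a value can never conflict
    # with an existing string mapping for the same field
    out[key] = v.is_a?(String) ? v : v.to_json
  end
  # undefined_max_num_fields: too many undefined fields -> one JSON string blob
  out = { UNDEFINED_NAME => JSON.dump(out) } if out.size > MAX_NUM_FIELDS
  out
end

p process_undefined('a.b.c' => 4, 'foo' => 'bar', 'baz' => true)
# => {"undefined"=>"{\"a_b_c\":\"4\",\"foo\":\"bar\",\"baz\":\"true\"}"}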
Force-pushed c840413 to 8b5ef11
/lgtm
@ewolinetz @jcantrill @nhosoi @josefkarasek
I think I have figured out a way to keep merge_json_log on by default, without running into the problem where the same field can arrive with values of conflicting types. Basically: convert unknown fields to their JSON string representation.