Allowing dots in field names #15951

Closed
clintongormley opened this issue Jan 13, 2016 · 28 comments
Labels
discuss Meta :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch

Comments

@clintongormley
Contributor

As part of the Great Mapping Refactoring (#8870), we had to reject field names containing dots (#12068), eg:

{ 
  "foo.bar": "val1",  
  "foo": {
    "bar": "val2"
  }
}

The behaviour was undefined and resulted in ambiguities when trying to reference fields with the dot notation used in queries and aggregations.

Removing support for dots has caused pain for a number of users and especially as Elasticsearch is being used more and more for the metrics use case (where dotted fields are common), we should consider what we can do to improve this situation. Now that mappings are much stricter (and immutable), it becomes feasible to revisit the question of whether to allow dots to occur in field names.

Replace dots with another character

The first and simplest solution is to simply replace dots in field names with another character (eg _) as is done by the Logstash de_dot filter and which will be supported natively in Elasticsearch by the node ingest de_dot processor.
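To make the renaming idea concrete, here is a minimal Python sketch. The helper name `de_dot` is hypothetical and this is not the actual implementation of the Logstash filter or the ingest processor; it only illustrates replacing dots in every key of a parsed JSON document.

```python
def de_dot(value, separator="_"):
    """Recursively replace dots in dict keys with `separator`.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, dict):
        return {
            key.replace(".", separator): de_dot(child, separator)
            for key, child in value.items()
        }
    if isinstance(value, list):
        return [de_dot(item, separator) for item in value]
    return value


doc = {"foo.bar": "val1", "foo": {"bar": "val2"}}
print(de_dot(doc))
```

Note that renaming can create collisions of its own: in this sketch, if a document contains both `a.b` and `a_b`, whichever key comes last silently wins.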

Treat dots as paths

Another solution would be to treat fields with dots in them as "paths" rather than field names. In other words, these two documents would be equivalent:

{ "foo.bar": "value" }
{ "foo": { "bar": "value" }}
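The equivalence can be sketched in Python as follows. `expand_dots` and `_merge` are hypothetical helpers, not Elasticsearch code, and the sketch assumes only nested objects (it does not descend into arrays, and a later leaf value overwrites an earlier one at the same path).

```python
def _merge(target, source):
    """Merge `source` into `target`, recursing where both sides are objects."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            _merge(target[key], value)
        else:
            target[key] = value


def expand_dots(doc):
    """Return a copy of `doc` with dotted keys expanded into nested objects."""
    result = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            value = expand_dots(value)
        # Wrap the value in one object per path segment, innermost first.
        for part in reversed(key.split(".")):
            value = {part: value}
        _merge(result, value)
    return result


print(expand_dots({"foo.bar": "value"}) == expand_dots({"foo": {"bar": "value"}}))
```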

To use an edge case as an example, the following document:

{
  "foo.bar" : {
    "baz": "val1"
  },
  "foo": {
    "bar.baz": "val2"
  }

}

would result in the following mapping:

{
  "properties": {
    "foo": {
      "type": "object",
      "properties": {
        "bar": {
          "type": "object",
          "properties": {
            "baz": {
              "type": "string"
            }
          }
        }
      }
    }
  }
}

The lucene field would be called foo.bar.baz and would contain the terms ["val1", "val2"]. Stored fields or doc values (for supported datatypes) would both contain ["val1", "val2"].
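As a rough illustration of how both spellings end up in the same field, the following Python sketch collects leaf values by full path. `values_by_path` is a hypothetical helper and says nothing about how the mappers are actually implemented.

```python
from collections import defaultdict


def _collect(value, path, fields):
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            _collect(child, child_path, fields)
    elif isinstance(value, list):
        for item in value:
            _collect(item, path, fields)
    else:
        fields[path].append(value)


def values_by_path(doc):
    """Map every full dotted path to the list of leaf values found under it."""
    fields = defaultdict(list)
    _collect(doc, "", fields)
    return dict(fields)


doc = {
    "foo.bar": {"baz": "val1"},
    "foo": {"bar.baz": "val2"},
}
print(values_by_path(doc))
```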

Issues with this approach

This solution works well for search and aggregations, but leaves us with two incongruities:

_source=

The first occurs when using the _source= parameter to do source filtering on the response. The reason for this is that the _source field is stored as provided - it is not normalized before being stored. For instance:

GET _search?_source=foo.*

would return:

{
  "foo.bar" : {
    "baz": "val1"
  },
  "foo": {
    "bar.baz": "val2"
  }

}

rather than:

{
  "foo": {
    "bar": {
      "baz": [
        "val1",
        "val2"
      ]
    }
  }
}

Update requests

The second occurs during update requests, which use the _source as a map-of-maps. Running an update like:

POST index/type/id/_update
{
  "doc": {
    "foo": {
      "bar": {
        "baz": "val3"
      }
    }
  }
}

could result (depending on how it is implemented) in any of the following:

Version 1:

{
  "foo": {
    "bar": {
      "baz": "val3"
    }
  }
}

Version 2:

{
  "foo": {
    "bar": {
      "baz": [
        "val1",
        "val2",
        "val3"
      ]
    }
  }
}

Version 3:

{
  "foo.bar": {
    "baz": "val1"
  },
  "foo": {
    "bar.baz": "val2",
    "bar": {
      "baz": "val3"
    }
  }
}
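For reference, a plain recursive merge over the raw map-of-maps, treating dotted keys as opaque strings, produces Version 3. The Python sketch below uses hypothetical helpers (`merge_update`, `_merge_into`) and is not meant to describe the real update code path.

```python
import copy


def _merge_into(target, partial):
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            _merge_into(target[key], value)
        else:
            target[key] = copy.deepcopy(value)


def merge_update(source, partial):
    """Merge a partial document into a copy of the stored source, key by key."""
    merged = copy.deepcopy(source)
    _merge_into(merged, partial)
    return merged


source = {
    "foo.bar": {"baz": "val1"},
    "foo": {"bar.baz": "val2"},
}
partial = {"foo": {"bar": {"baz": "val3"}}}
print(merge_update(source, partial))
```

Versions 1 and 2 would both require normalizing the stored source into paths before merging, which this sketch deliberately does not do.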
@clintongormley clintongormley added discuss :Search Foundations/Mapping Index mappings, including merging and defining field types Meta labels Jan 13, 2016
@kimchy
Member

kimchy commented Jan 13, 2016

I spoke to @rjernst, and it might be simpler to disallow inconsistent dots in fields. So the first full path dots we allow, and the second one, we reject (similar to what we do with conflict on types). As an example: first allow {"x.y" : { "z" : "test" } }, then reject something like {"x" : {"y.z" : "test" }}

If we can do this, it might also be simpler to do it in 2.x. It is more constrained than the above solution, and later we can extend it (if we need to) to implement the above.

@rjernst
Member

rjernst commented Jan 13, 2016

@kimchy I don't think that will actually work, because of how we parse values and just append them. I don't think we could distinguish, without some nasty-ish logic, if a field is just appending to an existing field, or another field with the same path (and same goes for inside the mapper service itself with storing the mappers).

@clintongormley There are two things that bug me. First, why do we have _source filtering at all? We already have stored fields which can serve that same purpose (returning a subset of the document on search). The second thing that bugs me is that your example works at all. We allow duplicate values for a field to append instead of error? That is leniency at its best: I don't know of any json parsers that emit arrays as duplicate keys (at least not by default), which means the user is probably serializing themselves, and very likely has a bug in their serialization. I don't think we should support either of those features, but dropping _source filtering would at least remove your concern so we could do the dots-as-paths option?

@nik9000
Member

nik9000 commented Jan 13, 2016

I wonder if we can delay a decision on how to handle updates by not supporting dots in the document merge use case. You can always use scripts to be totally clear there if you need.

@jpountz
Contributor

jpountz commented Jan 13, 2016

I don't think we could distinguish, without some nasty-ish logic, if a field is just appending to an existing field, or another field with the same path

Actually this should already work today. The second document will trigger a dynamic mapping update that will be rejected since the mapping would have two mappers that have the same path: #15243

@clintongormley
Contributor Author

Actually this should already work today. The second document will trigger a dynamic mapping update that will be rejected since the mapping would have two mappers that have the same path: #15243

If we go with treating dots as paths, then this won't work correctly, eg a document containing both forms (eg { foo: { bar.baz: ... }, foo.bar: { baz: ... }}) will be rejected if indexed first, but if indexed after a document containing just one form, it will be accepted.

First, why do we have _source filtering at all?

@rjernst because users want to be able to get back what they put in, and to be able to distinguish between values such as:

  • ""
  • null
  • []
  • "val"
  • ["val"]
  • ["val", null]

You can't do this with stored fields.

The second thing that bugs me is that your example works at all. We allow duplicate values for a field to append instead of error? That is leniency at its best: I don't know of any json parsers that emit arrays as duplicate keys (at least not by default), which means the user is probably serializing themselves,

Where do you see duplicate keys?

{
  "foo.bar" : {
    "baz": "val1"
  },
  "foo": {
    "bar.baz": "val2"
  }
}

The above is perfectly valid JSON - no duplicate keys there. The fact that foo: bar.baz: val and foo.bar: baz: val end up being added to the same lucene field foo.bar.baz is just an artefact of the way we could translate dots to paths.

@clintongormley
Contributor Author

I don't think that will actually work, because of how we parse values and just append them. I don't think we could distinguish, without some nasty-ish logic, if a field is just appending to an existing field, or another field with the same path (and same goes for inside the mapper service itself with storing the mappers).

The only way I can see this working is as follows. Fields with dots are mapped with dots, so { "foo": { "bar.baz": "val" }} would be mapped as:

{
  "properties": {
    "foo": {
      "type": "object",
      "properties": {
        "bar.baz": {
          "type": "string"
        }
      }
    }
  }
}

When adding a new field:

If the field name contains dots (eg `foo.bar.baz`):
    Does the first part of the field (`foo`) exist as a field in the mapping already?
        If yes: conflict
        If no: add field as `foo.bar.baz`
Else (field name does not contain dots, eg `foo`)
    Do any fields exist in the mapping which start with `foo.`?
        If yes: conflict
        If no:  add field as `foo`

This logic would prevent conflicting paths from being added.
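The rules above can be sketched in Python for a single object level. The names `add_field` and `FieldConflict` are hypothetical, and the sketch follows the steps literally, adding no checks beyond the two listed.

```python
class FieldConflict(Exception):
    """Raised when a new field name would clash with an existing path."""


def add_field(fields, name):
    """Add `name` to the set of field names at one object level.

    `fields` is a plain set of names; a real mapping would hold mappers.
    """
    if "." in name:
        first = name.split(".", 1)[0]
        if first in fields:
            raise FieldConflict(f"[{name}] conflicts with existing field [{first}]")
    else:
        if any(existing.startswith(name + ".") for existing in fields):
            raise FieldConflict(f"[{name}] conflicts with an existing dotted field")
    fields.add(name)


fields = set()
add_field(fields, "foo.bar.baz")
try:
    add_field(fields, "foo")
except FieldConflict as exc:
    print("rejected:", exc)
```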

When looking up a field (eg foo.bar.baz) in search/aggs etc:

Does `foo` exist?
Does `foo.bar` exist?
Does `foo.bar.baz` exist?
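One possible reading of these checks, sketched in Python: at each object level, try progressively longer dotted prefixes of the requested path, and descend into any object that matches. `find_field` is a hypothetical helper, and the mapping layout is assumed to be the `properties` structure shown above.

```python
def find_field(properties, path):
    """Resolve `path` against mappings whose field names may contain dots."""
    parts = path.split(".")
    for i in range(1, len(parts) + 1):
        candidate = ".".join(parts[:i])
        mapper = properties.get(candidate)
        if mapper is None:
            continue
        if i == len(parts):
            return mapper
        # Prefix matched an object: look for the rest of the path inside it.
        found = find_field(mapper.get("properties", {}), ".".join(parts[i:]))
        if found is not None:
            return found
    return None


properties = {
    "foo": {
        "type": "object",
        "properties": {"bar.baz": {"type": "string"}},
    }
}
print(find_field(properties, "foo.bar.baz"))
```

The loop over prefixes is what makes the lookup cost grow with the number of segments in the path.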

@clintongormley
Contributor Author

By the way, the decision about dots also affects the node ingest plugin, which treats dots as steps in a path hierarchy, and has no support for escaping. This may not be a problem as long as the de_dot processor runs first but otherwise, it'll suffer from similar issues to those described above.

@rjernst
Member

rjernst commented Jan 18, 2016

@clintongormley The logic you described there for adding new fields and searching is exactly why I don't like that approach. That is much more complicated than what we have today (and I especially don't like that the lookup of a field for search becomes linear on the object level of the field).

I am still convinced that doing dots-as-path is the correct choice. There are really two sides to users' pain here: the first is dynamic mappings, and the second is explicit mappings. In the first case, I believe we can implement it within the document parsing that we have (which is where dynamic mappings are determined), and can be done independently of the second case (which I believe is harder, but still doable).

As for your concerns about _source, my thoughts are as follows:

  1. Source filtering should be viewed as a regex on the full path of a field, so I think returning the source as-is for all fields with a full path that match the regex is correct (so it should return your first example).
  2. Update requests should work as they do today, which I believe merges the new document with the previous version. Therefore your "version 3" should be what happens. I do also think that we shouldn't be so concerned with returning a normalized view, but that can/should be explored in a separate issue.
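The first point can be sketched in Python as below. `filter_source` is a hypothetical helper that matches a wildcard pattern against the full path of each leaf and returns the matching parts of the source exactly as they were provided. It is simplified: arrays are treated as leaves and only leaf paths are matched, which is an assumption of this sketch rather than a description of real source filtering.

```python
from fnmatch import fnmatchcase


def filter_source(doc, pattern, prefix=""):
    """Keep leaves whose full dotted path matches `pattern`, structure as-is."""
    kept = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            child = filter_source(value, pattern, path)
            if child:
                kept[key] = child
        elif fnmatchcase(path, pattern):
            kept[key] = value
    return kept


doc = {
    "foo.bar": {"baz": "val1"},
    "foo": {"bar.baz": "val2"},
    "other": {"x": 1},
}
print(filter_source(doc, "foo.*"))
```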

@jpountz has expressed a concern with this approach and the edge cases it brings, in particular with nested fields. I think in the case, for example, where foo is already a nested field, we will need to reject foo.bar as a field, since it should require that foo is an object field. But I think that is completely doable and testable.

While discussing with @jpountz he also made me realize escaping might be simpler than I originally thought. However, I still think this dots-as-path approach is correct for a couple reasons:

  1. It works with existing 1.x indexes for upgrade.
  2. It does not complicate the api, or have possible confusion for users who pass in foo.bar in their document, but then later find they must query with foo\.bar.

@clintongormley
Contributor Author

As for your concerns about _source, my thoughts are as follows:

Agreed on both counts.

But I think that is completely doable and testable.

Good to hear. As long as the implemented solution is known to deal with the edge cases correctly, I'm happy.

It works with existing 1.x indexes for upgrade.

Note, mappings in 1.x indices have field names like:

"foo": {
    "type": "object",
    "properties": {
        "bar.baz": {....

So that structure would need to be updated on upgrade to:

"foo": {
    "type": "object",
    "properties": {
        "bar": {
            "type": "object",
            "properties": {
                "baz": ...

@GlenRSmith
Contributor

Any update on the likelihood of implementing something around this?

@jpountz
Contributor

jpountz commented Feb 22, 2016

@GlenRSmith I know @rjernst is currently exploring treating dots in field names as sub objects.

@felixbarny
Member

Another use case where I need dots in field names is for tracking request parameters. I currently store them like this:

"params": {
  "param1": "value1",
  "param2": "value2" 
}

I don't really have control over the names of the request parameters so the only option is to de_dot the parameter names. But then I can't use the stored information to reproduce/replay the captured request. Converting the parameters into

"params": [
  {
    "key": "param1",
    "value": "value1"
  }
]

isn't an option either, because I want to do aggregations on specific parameters in Grafana.

Yet another use case of mine is that I store configuration parameters in Elasticsearch where the config keys are field names and contain dots.

So a big +1 from my side.

@ryanmaclean

I hate to pile on, but our use case is identical to @felixbarny's. Perhaps I'm not understanding the need to treat these as sub-objects, but if that could be an option (even if it were the default) it would be much better than the only way to handle fields that contain dots.

@bneff

bneff commented Mar 14, 2016

I also have a use case similar to @felixbarny.

rjernst added a commit to rjernst/elasticsearch that referenced this issue Apr 14, 2016
In 2.0 we began restricting fields to not contain dots in their names.
This change adds back part of dots in fieldnames support. Specifically,
it allows indexing documents that contain dots in the field names, when
the correct corresponding mappers exist. For example, if mappings
contain an object field `foo`, and a subfield `bar`, then indexing a
document with `foo.bar` will work.

see elastic#15951
@felixbarny
Member

Nice! Could you explain/document how this works now?

@bneelima84

Hi, I am using the 5.0.0.4 alpha release and tried to create an index with the mapping below (which has dots in field names):

{
  "mappings": {
    "def": {
      "_all": {
        "enabled": false
      },
      "_source": {
        "enabled": false
      },
      "properties": {
        "Id": { "type": "string", "index": "not_analyzed", "store": true },
        "first.Name": { "type": "string", "index": "not_analyzed" },
        "Last.Name": { "type": "string", "index": "not_analyzed" },
        "Middle.Name": { "type": "string", "index": "not_analyzed" },
        "Qual": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}

But this fails as below:
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "Field name [Middle.Name] cannot contain '.'"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "Failed to parse mapping [def]: Field name [Middle.Name] cannot contain '.'",
    "caused_by": {
      "type": "mapper_parsing_exception",
      "reason": "Field name [Middle.Name] cannot contain '.'"
    }
  },
  "status": 400
}

Am I missing anything?

@rjernst
Member

rjernst commented Jul 15, 2016

The current support for dot in field names is for dynamic mappings and document parsing. When specifying mappings directly, you will still need to split up the fields recursively. I opened #19443 to address this.
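Until then, splitting the fields by hand can be scripted. The Python sketch below uses a hypothetical helper, `split_dotted_properties`, that rewrites a `properties` block with dotted names into nested object fields. It assumes intermediate levels should be plain `object` fields and rejects overlapping definitions instead of merging them.

```python
def split_dotted_properties(properties):
    """Rewrite dotted field names in a mapping as nested object fields."""
    result = {}
    for name, mapper in properties.items():
        mapper = dict(mapper)  # shallow copy so the input is not modified
        if "properties" in mapper:
            mapper["properties"] = split_dotted_properties(mapper["properties"])
        *parents, leaf = name.split(".")
        level = result
        for part in parents:
            parent = level.setdefault(part, {"type": "object", "properties": {}})
            if parent.get("type", "object") != "object":
                raise ValueError(f"[{part}] is not an object field")
            level = parent.setdefault("properties", {})
        if leaf in level:
            raise ValueError(f"field [{name}] overlaps an existing field")
        level[leaf] = mapper
    return result


properties = {
    "Id": {"type": "string", "index": "not_analyzed", "store": True},
    "first.Name": {"type": "string", "index": "not_analyzed"},
}
print(split_dotted_properties(properties))
```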

@cdenneen

cdenneen commented Aug 4, 2016

Can dots in field names be patched in 2.3.x? Otherwise it will require 1.x -> 2.x (re-work to undo all the dots in field names), then 2.x -> 5.x (allow dots back).
Currently this is a real show stopper in upgrading ES past 1.7, since we can't use the upgrade path and 1.x->5.x isn't supported.

@s1monw
Contributor

s1monw commented Aug 4, 2016

@cdenneen we are looking into possible solutions. I will update the issue when we have more to say.

@jpountz
Contributor

jpountz commented Aug 4, 2016

@cdenneen I would like to clarify (it might be clear to you but not necessarily for other readers) that data will need to be reindexed anyway between 1.x and 5.x since elasticsearch only supports one major version back, and the version that matters in that case is the version that was used to create the index. So 5.x will not be able to read any index created in 1.x.

@cdenneen

cdenneen commented Aug 4, 2016

@s1monw thanks

@jpountz yes that's why i was saying upgrade path 1.x->5.x isn't supported but 1.x->2.x and 2.x->5.x is... but in order to do that you'd have to undo the dot fields for the 2.x upgrade and then put them back in 5.x after that upgrade... so unless there is a 1.x->5.x upgrade path I would think there needs to be a 2.x patch to support this to allow the upgrade to work (stepping up the major versions)

@GlenRSmith
Contributor

@cdenneen I think you're missing the point. A 2.x patch wouldn't help you. Indices created in 1.x can't be read in 5.x. Full stop. Not even if you had no conflicts and upgraded to 2.x first.

@cdenneen

cdenneen commented Aug 5, 2016 via email

@jpountz
Contributor

jpountz commented Aug 5, 2016

@cdenneen No. An index that lives in a 2.x cluster but was created with 1.x cannot be upgraded to 5.x.

@clintongormley
Contributor Author

@cdenneen just to clarify, if we get support for dots in fields into 2.4, you'd be able to upgrade to 2.4, reindex to a new index, then upgrade to 5.x

An alternate route would be to create a new 5.x cluster, then use reindex-from-remote to pull the indices you want to take with you into 5.x directly.
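As a rough sketch of the reindex-from-remote route, the Python below builds the request body that would be sent as `POST _reindex` to the new cluster. The host and index names are made up for illustration, and the helper `remote_reindex_body` is hypothetical. As far as I understand the reindex API, the remote host also has to be listed in the `reindex.remote.whitelist` setting of the new cluster.

```python
import json


def remote_reindex_body(remote_host, source_index, dest_index):
    """Build a _reindex body that pulls one index from a remote cluster."""
    return {
        "source": {
            "remote": {"host": remote_host},
            "index": source_index,
        },
        "dest": {"index": dest_index},
    }


# Hypothetical host and index names.
body = remote_reindex_body("http://old-cluster:9200", "logs-2016", "logs-2016")
print(json.dumps(body, indent=2))
```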

@pktxu

pktxu commented Nov 9, 2016

@clintongormley Graylog is affected by this. Is it still being considered for inclusion in a hypothetical 2.4.2 release?

@rjernst
Member

rjernst commented Nov 9, 2016

Support for dots in field names was added in 2.4.0:
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/dots-in-names.html

@aslakhellesoy

@javanna javanna added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jul 16, 2024