Input validation #9059

clintongormley · 2014-12-24T13:12:02Z

In #6736 I started trying to define specs for valid IDs, index names, field names etc, to avoid problems such as conflicts created by having an ID called _mapping.

I think this is the wrong approach - a significant number of users will find that they have used identifiers which are no longer legal. Instead, we should identify where conflicts are possible and implement escaping so that those conflicts can be avoided unambiguously.

Possible conflicts can arise in:

directory or file names (eg index "C:/")
URL paths (eg an ID called _mapping or _create)
query string (eg a routing value containing a comma)
field paths (eg a field containing a .)
scripting (eg a fieldname called _id)

We should provide a simple consistent escaping mechanism with minimal restrictions on allowable characters.

Directory or file names

Today we use the actual index name to create a directory on the filesystem. Instead, the original index name should be stored in the index metadata, and the directory name should be escaped as printable ASCII only (eg percent encoding?).
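A minimal sketch of the percent-encoding idea, assuming Python's standard urllib.parse as the encoder. The function names escape_index_dir and unescape_index_dir are hypothetical, not part of Elasticsearch:

```python
from urllib.parse import quote, unquote


def escape_index_dir(index_name: str) -> str:
    """Percent-encode an index name into a printable-ASCII directory name."""
    # safe="" so that "/" is encoded as well; letters, digits and "_.-~"
    # are left untouched by quote().
    return quote(index_name, safe="")


def unescape_index_dir(dir_name: str) -> str:
    """Recover the original index name from an escaped directory name."""
    return unquote(dir_name)


print(escape_index_dir("c:/"))  # c%3A%2F
```

Note that this encoding leaves . and .. unchanged, so those names would still need to be rejected separately.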

There are more restrictions that need to be applied to filesystem names, most of which are already enforced today:

must be lowercase
must NOT start with an underscore, a + or a -
index names starting with '.' are reserved
. and .. are forbidden as complete names, though dots can occur within a name
\, /, *, ?, ", <, >, | and , (comma) are forbidden
illegal Windows filenames are forbidden: CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9. Also avoid these names followed immediately by an extension; for example, NUL.txt is not recommended.
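The rules in the list above could be checked roughly as follows. This is only a sketch of the proposal, not the validation Elasticsearch actually performs; index_name_problems and the message strings are made up for illustration:

```python
from typing import List

FORBIDDEN_CHARS = set('\\/*?"<>|,')
WINDOWS_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def index_name_problems(name: str) -> List[str]:
    """Return the list of proposed rules that `name` violates (empty if none)."""
    problems = []
    if name != name.lower():
        problems.append("must be lowercase")
    if name[:1] in ("_", "+", "-"):
        problems.append("must not start with _, + or -")
    if name in (".", ".."):
        problems.append(". and .. are forbidden")
    elif name.startswith("."):
        problems.append("names starting with . are reserved")
    bad = sorted(FORBIDDEN_CHARS.intersection(name))
    if bad:
        problems.append("forbidden characters: " + " ".join(bad))
    # Compare the part before the first dot, so that "nul.txt" is caught too.
    if name.split(".", 1)[0].upper() in WINDOWS_RESERVED:
        problems.append("reserved Windows filename")
    return problems
```

Returning every violation, rather than failing on the first, would let an error message tell the user everything that needs fixing in one go.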

The same should apply to snapshot names.

Filename length should be checked after escaping, not before.
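To see why the order matters, here is a small sketch that again assumes percent encoding. The 255 limit is an assumption (a common per-component limit on many filesystems), and fits_on_disk is a hypothetical helper:

```python
from urllib.parse import quote

MAX_FILENAME_LENGTH = 255  # assumption: typical filesystem limit per path component


def fits_on_disk(index_name: str, limit: int = MAX_FILENAME_LENGTH) -> bool:
    """Check the length of the escaped name, not the name the user typed."""
    escaped = quote(index_name, safe="")
    return len(escaped) <= limit


name = "é" * 100
print(len(name), len(quote(name, safe="")))  # 100 600
```

Each "é" becomes the six characters %C3%A9, so a name that looks comfortably short before escaping can exceed the limit afterwards.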

URL paths and query strings

Unicode characters and / are already handled by standard URI encoding. The only special characters are:

a leading underscore _
a comma ,

These can be escaped automatically by prepending \ (so we would need to escape literal \'s as well).
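A sketch of what such automatic escaping might look like on the client side. The helper names escape_url_part and unescape_url_part are hypothetical, and standard URI encoding would still be applied on top of this:

```python
def escape_url_part(value: str) -> str:
    """Backslash-escape literal backslashes, commas and a leading underscore."""
    # Escape literal backslashes first so the ones added below are not doubled.
    escaped = value.replace("\\", "\\\\").replace(",", "\\,")
    if escaped.startswith("_"):
        escaped = "\\" + escaped
    return escaped


def unescape_url_part(value: str) -> str:
    """Reverse escape_url_part: a backslash makes the next character literal."""
    out = []
    chars = iter(value)
    for ch in chars:
        if ch == "\\":
            out.append(next(chars, ""))
        else:
            out.append(ch)
    return "".join(out)


print(escape_url_part("_mapping"))  # \_mapping
print(escape_url_part("a,b"))       # a\,b
```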

The * is used as a wildcard for various URL parameters, eg index and type names. We cannot automatically escape the wildcard (because usually it should be a plain wildcard). Similarly, the user can't escape it as my_literal_\*, because the \ would itself be escaped as: my_literal_\\*, meaning that the wildcard would still function as a wildcard.
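The double-escaping problem can be demonstrated with a short sketch. Both functions are hypothetical stand-ins: auto_escape for the client-side escaping of backslashes, has_live_wildcard for how a server would read the result:

```python
def auto_escape(value: str) -> str:
    """Client-side automatic escaping of literal backslashes only."""
    return value.replace("\\", "\\\\")


def has_live_wildcard(escaped: str) -> bool:
    """True if `escaped` contains a * that is not preceded by an escaping backslash."""
    i = 0
    while i < len(escaped):
        if escaped[i] == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if escaped[i] == "*":
            return True
        i += 1
    return False


user_input = r"my_literal_\*"   # the user tries to escape the star by hand
sent = auto_escape(user_input)  # becomes my_literal_\\*
print(has_live_wildcard(user_input), has_live_wildcard(sent))  # False True
```

After automatic escaping, the backslash pair reads as one literal backslash and the star is a wildcard again.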

The * should not be allowed in any identifiers which support wildcards (eg index, type, node names etc).

Index wildcards can use a leading + or - to include/exclude patterns, but these characters should already be excluded by the index naming rules above.

Field paths and Scripting

A field containing a . can be escaped as foo\.bar (and a field containing \ as foo\\bar).
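As an illustration, a field path with this escaping could be split into its segments like so. The names escape_field_name and split_field_path are hypothetical; this is a sketch of the idea rather than how Elasticsearch parses paths:

```python
from typing import List


def escape_field_name(name: str) -> str:
    """Escape a single field name so it can be embedded in a dotted path."""
    return name.replace("\\", "\\\\").replace(".", "\\.")


def split_field_path(path: str) -> List[str]:
    """Split a path on unescaped dots, removing the escaping backslashes."""
    parts, current = [], []
    chars = iter(path)
    for ch in chars:
        if ch == "\\":
            current.append(next(chars, ""))  # next character is literal
        elif ch == ".":
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


print(split_field_path(r"foo\.bar.baz"))  # ['foo.bar', 'baz']
```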

Fields beginning with _ can clash with metadata fields like _id or _timestamp in queries, aggregations, sorting, and scripting.

There are four options:

Forbid fieldnames beginning with an underscore - this is too stringent and will impact existing data
Reserve existing metadata field names - this makes it difficult to add more
Allow \ escaping to distinguish the metadata _id field from the body \_id field
Add a single namespace for metadata fields, eg _root._id or _doc._id (_meta is already taken unfortunately)
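To make the third option concrete, here is a rough sketch of how a field reference might be resolved. The resolve_field function and the "metadata"/"body" labels are invented for illustration, and the sketch assumes every unescaped leading underscore refers to the metadata namespace:

```python
from typing import Tuple


def resolve_field(reference: str) -> Tuple[str, str]:
    """Map a field reference to (namespace, field name) under option 3."""
    if reference.startswith("\\_"):
        # Escaped underscore: a field in the document body named _something.
        return ("body", reference[1:])
    if reference.startswith("_"):
        # Unescaped leading underscore: treated as a metadata field.
        return ("metadata", reference)
    return ("body", reference)


print(resolve_field("_id"))    # ('metadata', '_id')
print(resolve_field(r"\_id"))  # ('body', '_id')
```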

Also see #5558

rjernst · 2014-12-29T20:37:27Z

Forbid fieldnames beginning with an underscore - this is too stringent and will impact existing data

Why is this too stringent? I can't imagine there are a ton of people who have fields beginning with underscores, but even for those who do, it doesn't seem like this solves the issue. How would these escaped field names show up in results: as unescaped or escaped? If unescaped, then you have the same problem, just on output instead of input. If escaped, then the user still has to change their code to handle the escaped values, so why is it any different than changing to a name not beginning with an underscore?

clintongormley · 2014-12-30T10:56:18Z

Forbid fieldnames beginning with an underscore - this is too stringent and will impact existing data

Why is this too stringent?

We're trying to avoid breaking existing applications, eg see #6736 (comment)

I can't imagine there are a ton of people who have fields beginning with underscores, but even for those who do, it doesn't seem like this solves the issue. How would these escaped field names show up in results: as unescaped or escaped? If unescaped, then you have the same problem, just on output instead of input.

On output, metadata fields are displayed at the same level as the _source field, which is why there is no conflict. It is only when we refer to fields by name (eg in queries, aggs, sorting, scripts) that there is potential for conflict.

That said, I still don't know which of the four options I mentioned is the best solution.

colings86 · 2015-02-20T11:01:07Z

As described in #8042 we should also consider aggregation names too

geekpete · 2015-04-29T07:22:22Z

As per my original ticket: #6589

Index names that are physically unwritable on the filesystem could be caught with a simple write test first: if the name isn't writable, reject the index creation. This would avoid the awful scenario where Elasticsearch attempts an index creation but is unable to allocate primary shards because it cannot even create the directory.

So a simple test of file path validity first to avoid: "java.io.IOException: Invalid file path" and reject if it fails the test.
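A sketch of such a write test, using Python for illustration and a temporary directory as a stand-in for the real data path. The name can_create_directory is hypothetical, and in a real cluster the check would need to hold on every node that might host a shard:

```python
import os
import tempfile


def can_create_directory(name: str) -> bool:
    """Try to create a directory called `name`; report whether it worked."""
    with tempfile.TemporaryDirectory() as base:
        try:
            candidate = os.path.join(base, name)
            # Refuse names that would escape or nest below the base directory.
            if os.path.dirname(candidate) != base:
                return False
            os.mkdir(candidate)
        except (OSError, ValueError):
            # ValueError covers names with an embedded null character.
            return False
        return os.path.isdir(candidate)


print(can_create_directory("logs-2015"))                    # True
print(can_create_directory("whatever\x00\x01whatever"))     # False
```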

When this is not handled, the result can be a cluster in red state until the user/admin can decipher what special encoding to use (in the example of an index name with bad chars like control codes) in order to delete the index and restore the cluster to green state.

An example of an index with bad chars as described is:

whatever\u0000\u0001whatever

which shows up this way under /_aliases?pretty and is only removed using a curl -XDELETE against:

localhost:9200/whatever%00%01whatever

These characters are control codes that need to be properly escaped to be dealt with.

http://www.fileformat.info/info/unicode/char/0001/index.htm
http://www.fileformat.info/info/unicode/char/0000/index.htm
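For reference, the encoded form used in the DELETE above is plain percent encoding of the control characters, which can be reproduced with Python's urllib.parse. The delete_target helper and its default host are only for illustration:

```python
from urllib.parse import quote


def delete_target(index_name: str, host: str = "localhost:9200") -> str:
    """Build the host/path to send a DELETE to for the given index name."""
    return host + "/" + quote(index_name, safe="")


print(delete_target("whatever\u0000\u0001whatever"))
# localhost:9200/whatever%00%01whatever
```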

So my general point is that some really low-level tests should be added first to validate whether a given string is even usable all the way through, failing fast before continuing on to the next step. Then the reason for the failure matters less, because the operation just won't happen if it's not possible on whatever platform/combination. You can lay these sorts of tests down first, before trying to guess all the scenarios where it might fail in order to validate them manually.

rjernst · 2015-05-26T09:25:00Z

@clintongormley I think we should consider doing strict validation as a start, and adding support for crazy names as a follow up. I think it is more important to get validation into 2.0 than work out how to support crazy names.

mikemccand · 2015-09-30T19:23:06Z

@clintongormley I think we should consider doing strict validation as a start, and adding support for crazy names as a follow up. I think it is more important to get validation into 2.0 than work out how to support crazy names.

+1 for simple, strict rules like were proposed in the original #6736 ... this would have prevented the horrible issues like #13858.

jpountz · 2015-09-30T19:53:59Z

+1 for simple strict rules

nik9000 · 2018-03-12T19:33:30Z

@clintongormley is there anything left to do from the original issue? Can we close it so long as we promise to live with its spirit in our hearts?

clintongormley · 2018-03-13T09:38:00Z

@nik9000 this is so out of date that I don't think the thread is really relevant anymore. It is probably still worth writing up rules for what is and is not allowed - these don't exist today, except in code. And there are still places that you can name yourself into trouble, although users seldom get into trouble with these anymore. Let's close.

imotov · 2019-08-01T14:04:12Z

although users seldom get into trouble with these anymore. Let's close.

It seems to still be a problem: it still confuses users, and I cannot find where it is documented at the moment. See https://discuss.elastic.co/t/legal-character-set-for-field-names/190796

geekpete · 2019-08-01T23:08:38Z

Should this be filed as a docs issue now, then? It seems like quite a large scope, though.
Our current tests presumably try invalid characters and words in various places to ensure we don't regress, so maybe we can more easily sift the collection of what's invalid from our tests?
Is this a scenario where use of something like Swagger would allow much easier collection of such definitive lists for (automatic) documentation?

clintongormley added the discuss, >breaking, v2.0.0-beta1 and :Core/Infra/REST API labels Dec 24, 2014

clintongormley mentioned this issue Dec 24, 2014

Define valid index, type, field, id, routing values #6736

Closed

This was referenced Dec 30, 2014

REST API _suggest endpoint mistakenly creates documents with id = _suggest #5442

Closed

Disable _id.path by default #5558

Closed

colings86 mentioned this issue Feb 20, 2015

Aggregations order : Escaping the user defined AGG_NAME #8042

Closed

clintongormley mentioned this issue Apr 7, 2015

Elasticsearch should raise a mapper exception when reserved "mapping fields" are used inside "properties" #10456

Closed

gbrayut mentioned this issue May 19, 2015

Allow to configure a concrete indexRoot to lscount/lsstat bosun-monitor/bosun#974

Closed

rjernst mentioned this issue May 26, 2015

Mapping should throw when field names with dots are specified #11337

Closed

clintongormley mentioned this issue Jun 2, 2015

Incorrect search result using a top level field ES 1.3.4 #11447

Closed

jpountz mentioned this issue Jul 8, 2015

Mappings: Restrict unexpected characters from field names #12094

Closed

clintongormley added v2.0.0 v2.1.0 and removed v2.0.0-beta1 v2.0.0 labels Aug 13, 2015

xuzha mentioned this issue Oct 1, 2015

Forbid index name . and .. #13862

Merged

markwalkom mentioned this issue Oct 7, 2015

Restrict document ID values #14009

Closed

clintongormley mentioned this issue Oct 13, 2015

whitespace allowed in template field name #14033

Closed

clintongormley added v2.2.0 and removed v2.1.0 labels Nov 20, 2015

clintongormley mentioned this issue Nov 21, 2015

colon sign (:) not excluded from filenames #7148

Closed

jpountz mentioned this issue Dec 19, 2015

Restrict index names to ASCII #15553

Closed

spinscale added v2.3.0 and removed v2.2.0 labels Dec 23, 2015

joschi mentioned this issue Jan 20, 2016

Handling "." characters in message field names with ES 2.0 Graylog2/graylog2-server#1667

Merged

russcam mentioned this issue Feb 2, 2016

Type inference on anonymous objects elastic/elasticsearch-net#1761

Closed

fcheslack mentioned this issue Feb 26, 2016

strings not escaped in WriteBulkBytes mattbaird/elastigo#258

Open

clintongormley added v2.4.0 and removed v2.3.0 labels Mar 16, 2016

jpountz mentioned this issue Apr 8, 2016

Terms order aggregation name cannot contain a period #17600

Closed

clintongormley mentioned this issue Aug 4, 2016

indices starting with - (dash) cause problems if used with wildcards #19800

Closed

clintongormley added v2.4.1 and removed v2.4.0 labels Aug 24, 2016

clintongormley added v2.4.2 and removed v2.4.1 labels Sep 23, 2016

clintongormley removed the v2.4.2 label Nov 6, 2016

clintongormley mentioned this issue Nov 23, 2016

Grok Processor does not support non-(a-zA-Z_) field characters for field names #21745

Closed

clintongormley self-assigned this Nov 26, 2016

javanna removed the discuss label May 5, 2017

jsvd mentioned this issue Sep 20, 2017

Add a new type of validation for defining a pipeline_id elastic/logstash#8164

Open

clintongormley closed this as completed Mar 13, 2018
