Input validation #9059
Comments
Why is this too stringent? I can't imagine there are a ton of people who have fields beginning with underscores, but even for those who do, it doesn't seem like this solves the issue. How would these escaped field names show up in results: as unescaped or escaped? If unescaped, then you have the same problem, just on output instead of input. If escaped, then the user still has to change their code to handle the escaped values, so why is it any different from changing to a name not beginning with an underscore? |
We're trying to avoid breaking existing applications, eg see #6736 (comment)
On output, metadata fields are displayed at the same level as the …
That said, I still don't know which of the four options I mentioned is the best solution. |
As described in #8042 we should also consider aggregation names too |
As per my original ticket: #6589 Index names that cannot physically be written to the filesystem could be caught with a simple write test first: check whether the name is writable and, if not, reject the index creation. This would avoid the awful scenario where Elasticsearch attempts to create an index but is unable to allocate primary shards because it cannot even create the directory. So, a simple test of file path validity first, to avoid "java.io.IOException: Invalid file path", rejecting the request if the test fails. When this is not handled, the result can be a cluster in red state until the user/admin can work out what special encoding to use (for example, for an index name with bad chars like control codes) in order to delete the index and restore the cluster to green state. An example of an index with bad chars as described is:
which shows up this way under /_aliases?pretty and is only removed using a curl -XDELETE against:
These characters are control codes that need to be properly escaped to be dealt with. http://www.fileformat.info/info/unicode/char/0001/index.htm So my general point is that some really low-level tests should be added first to validate whether a given string is even usable all the way through; a simple test can show whether the operation would succeed before continuing to the next step. The reason for a failure then matters less: the operation just won't happen if it's not possible on whatever platform/combination. You can lay these sorts of tests down first, before trying to guess all the scenarios where it might fail in order to validate them manually |
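The write test suggested in this comment could look roughly like the sketch below. It is only an illustration in Python, not Elasticsearch code: `can_create_index_dir` and its `data_path` argument are hypothetical names, and the probe simply tries to create and remove the directory instead of guessing which characters a given platform rejects.

```python
import os


def can_create_index_dir(data_path: str, index_name: str) -> bool:
    """Return True if a directory named index_name can be created under data_path.

    Hypothetical helper: it asks the filesystem directly rather than
    maintaining a list of characters that each platform might reject.
    """
    # Reject names that would escape data_path or are not real names at all.
    separators = {"/", os.sep}
    if os.altsep:
        separators.add(os.altsep)
    if index_name in ("", ".", "..") or any(s in index_name for s in separators):
        return False

    candidate = os.path.join(data_path, index_name)
    try:
        os.mkdir(candidate)  # the actual probe
    except (OSError, ValueError):
        # OSError: the platform refused the name (or it already exists);
        # ValueError: the name contains an embedded NUL byte.
        return False
    os.rmdir(candidate)  # clean up the probe directory
    return True
```

Note that the answer is platform dependent by design: a name containing a control code may pass on one filesystem and fail on another, which is the point of probing rather than enumerating rules. A name that already exists is also reported as unusable in this sketch.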
@clintongormley I think we should consider doing strict validation as a start, and adding support for crazy names as a follow up. I think it is more important to get validation into 2.0 than work out how to support crazy names. |
+1 for simple, strict rules like were proposed in the original #6736 ... this would have prevented the horrible issues like #13858. |
+1 for simple strict rules |
@clintongormley is there anything left to do from the original issue? Can we close it so long as we promise to live with its spirit in our hearts? |
@nik9000 this is so out of date that I don't think the thread is really relevant anymore. It is probably still worth writing up rules for what is and is not allowed - these don't exist today, except in code. And there are still places that you can name yourself into trouble, although users seldom get into trouble with these anymore. Let's close. |
It seems to be still a problem, it still confuses users and I cannot find where it is documented at the moment https://discuss.elastic.co/t/legal-character-set-for-field-names/190796 |
Should this be filed as a docs issue now, then? It seems like quite a large scope though. |
In #6736 I started trying to define specs for valid IDs, index names, field names etc, to avoid problems such as conflicts created by having an ID called `_mapping`.
I think this is the wrong approach - a significant number of users will find that they have used identifiers which are no longer legal. Instead, we should identify where conflicts are possible and implement escaping to allow the unambiguous avoidance of conflict.
Possible conflicts can arise in:
- directory or file names
- URL paths and query strings (eg `_mapping` or `_create`)
- field paths and scripting (eg `_id`)
We should provide a simple consistent escaping mechanism with minimal restrictions on allowable characters.
Directory or file names
Today we use the actual index name to create a directory on the filesystem. Instead, the original index name should be stored in the index metadata, and the directory name should be escaped as printable ASCII only (eg percent encoding?).
There are more restrictions that need to be applied to filesystem names, most of which are already enforced today:
- names may not start with a `+` or a `-`
- `\`, `/`, `*`, `?`, `"`, `<`, `>`, `|`, `,` are forbidden
The same should apply to snapshot names.
Filename length should be checked after escaping, not before.
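A minimal sketch of the directory-name idea, assuming percent encoding is the chosen scheme (the issue only floats it as a possibility). The function names and the 255 limit are assumptions for illustration; the limit is a common filesystem maximum, not something the issue specifies. The names `.` and `..` are not handled here and would need a separate rule.

```python
from urllib.parse import quote, unquote

# Assumed limit for illustration; many filesystems cap a file name at 255 bytes.
MAX_FILENAME_LENGTH = 255


def escape_dir_name(index_name: str) -> str:
    """Percent-encode an index name into a printable-ASCII directory name."""
    # safe="" makes quote() escape "/" as well; only ASCII letters, digits
    # and a few unreserved punctuation characters are left as-is.
    escaped = quote(index_name, safe="")
    # The length check runs on the escaped form, because escaping can
    # expand one character into several.
    if len(escaped) > MAX_FILENAME_LENGTH:
        raise ValueError("escaped directory name is too long")
    return escaped


def unescape_dir_name(dir_name: str) -> str:
    """Recover the original index name from an escaped directory name."""
    return unquote(dir_name)
```

For example, a name of 100 `é` characters is well under the limit before escaping but grows to 600 characters afterwards, which is why the check has to come after escaping. In the proposal the original name would live in the index metadata, so `unescape_dir_name` is only a convenience here.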
URL paths and query strings
Unicode characters and `/` are already handled by standard URI encoding. The only special characters are: `_`, `,`
These can be escaped automatically by prepending `\` (so we would need to escape literal `\`'s as well).
The `*` is used as a wildcard for various URL parameters, eg index and type names. We cannot automatically escape the wildcard (because usually it should be a plain wildcard). Similarly, the user can't escape it as `my_literal_\*`, because the `\` would itself be escaped as: `my_literal_\\*`, meaning that the wildcard would still function as a wildcard.
The `*` should not be allowed in any identifiers which support wildcards (eg index, type, node names etc).
Index wildcards can use a leading `+` or `-` to include/exclude patterns, but these characters should already be excluded by the index naming rules above.
Field paths and Scripting
A field containing a `.` can be escaped as `foo\.bar` (and a field containing `\` as `foo\\bar`).
Fields beginning with `_` can clash with metadata fields like `_id` or `_timestamp`, in queries, aggregations, sorting, and scripting.
There are four options, including:
- `\` escaping to distinguish the metadata `_id` field from the body `\_id` field
- a prefixed path such as `_root._id` or `_doc._id` (`_meta` is already taken unfortunately)
Also see #5558
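To make the `foo\.bar` and `foo\\bar` escaping for field paths concrete, here is a small Python sketch. The function names are hypothetical and the behaviour is one possible reading of the proposal: `\` escapes the next character, and an unescaped `.` separates path segments.

```python
from typing import List


def escape_field_name(name: str) -> str:
    """Escape a single field name so it can be embedded in a dotted path."""
    # Escape backslashes first, otherwise the backslashes added for dots
    # would be escaped a second time.
    return name.replace("\\", "\\\\").replace(".", "\\.")


def join_field_path(names: List[str]) -> str:
    """Build a dotted path from raw field names."""
    return ".".join(escape_field_name(n) for n in names)


def split_field_path(path: str) -> List[str]:
    """Split a dotted path on unescaped dots and undo the escaping."""
    parts: List[str] = []
    current: List[str] = []
    chars = iter(path)
    for ch in chars:
        if ch == "\\":
            # A backslash makes the next character literal.
            escaped = next(chars, None)
            if escaped is None:
                raise ValueError("path ends with a dangling backslash")
            current.append(escaped)
        elif ch == ".":
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts
```

With this reading, `foo\.bar` is a single field named `foo.bar`, while `foo.bar` is the field `bar` inside `foo`. The sketch does not address the metadata clash: after unescaping, `\_id` and `_id` are the same string, so distinguishing them would need extra information beyond what is shown here.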