Elasticsearch accepts invalid json with unpredictable behavior #19614
Comments
I could see this becoming a long discussion around whether this is invalid JSON or not, and whether we should return a parse exception or some other error. If the JSON library we use for parsing allows this, then we should improve this on our end rather than being lenient. This reminds me of #19547 too, and is a very common problem with the way we pull-parse JSON. It can easily be solved case by case, but every single parser in our codebase is subject to this, so it would be nice to have a generic solution for it. Not sure if there are alternatives to adding lots of ifs to all our pull parsers; we should evaluate that.
For the record, the option is Jackson's `JsonParser.Feature.STRICT_DUPLICATE_DETECTION`.
According to the JSON spec, this isn't invalid JSON. The spec doesn't mention how duplicate keys should be treated. Many languages will simply overwrite older values with newer values, without generating any warning. This is essentially what Elasticsearch does today, and I'm not sure it is worth a 20-30% penalty to prevent this behaviour.
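As a quick illustration of the overwrite behavior described above, here is a minimal Jackson sketch (class name is mine, not from the thread); with Jackson's default, lenient settings, the last value wins:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class LenientDuplicates {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Duplicate "settings" keys: by default the second value
        // silently overwrites the first, with no warning.
        JsonNode node = mapper.readTree(
            "{\"settings\": {\"number_of_replicas\": 2}, \"settings\": {\"number_of_shards\": 1}}");
        System.out.println(node.get("settings")); // prints {"number_of_shards":1}
    }
}
```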
Yes, strictly speaking (the RFC only says that keys SHOULD be unique), this is valid. I also agree that the performance penalty isn't worth it. It would, however, be nice to document this behavior and perhaps (if it's easy) have an option to turn on strict checking (ideally per request) - it would be useful as a debugging tool and perhaps when running tests.
Allowing duplicate keys adds a lot of confusion: https://discuss.elastic.co/t/using-the-remove-processor-for-ingest-node/56500 Maybe we should enable strict parsing for certain APIs (admin-like APIs?).
Discussed in FixitFriday: let's play with the Jackson feature to reject duplicate keys and make sure that it works and has a reasonable performance hit. If it is not satisfactory, then let's look into whether there are things that we can do at a higher level such as ObjectParser.
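For reference, a minimal sketch of the Jackson feature under discussion (the demo class is mine; the feature itself is Jackson's `JsonParser.Feature.STRICT_DUPLICATE_DETECTION`):

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParseException;
import com.fasterxml.jackson.core.JsonParser;

public class StrictDuplicateDemo {
    public static void main(String[] args) throws Exception {
        JsonFactory factory = new JsonFactory();
        // Reject duplicate field names instead of silently keeping the last one.
        factory.enable(JsonParser.Feature.STRICT_DUPLICATE_DETECTION);
        try (JsonParser parser = factory.createParser(
                "{\"settings\": {\"number_of_replicas\": 2}, \"settings\": {\"number_of_shards\": 1}}")) {
            while (parser.nextToken() != null) {
                // pull-parsing the document triggers the duplicate check
            }
        } catch (JsonParseException e) {
            System.out.println("Rejected: " + e.getMessage()); // reports the duplicate field
        }
    }
}
```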
Macrobenchmark Results

We have run our whole macrobenchmark suite with strict duplicate detection enabled. We see at most a reduction in median indexing throughput of 3% for our macrobenchmark suite (PMC track).

Microbenchmark Results

I also double-checked a few scenarios with a microbenchmark and saw similar results (see https://gist.github.com/danielmitterdorfer/9236796a46f3956447171313a6a0b365). Below are the results of both configurations, showing the average time for one iteration (smaller is better).
For a smaller JSON object (49 bytes), the overhead of the duplication check is 8 ns, or 1.6%. For a large JSON object (6,440 bytes), the overhead of the duplication check is between 1.12 us [1] and 1.3 us [2], or between 5.8% and 6.7%.

[1] best case: duplication check disabled 19.458 us, duplication check enabled 20.578 us (a difference of 1.12 us)
[2] worst case: see the gist for the underlying numbers

Please refer to the gist for more details.
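For readers who want to reproduce this, a microbenchmark in the spirit of the linked gist (not its actual code; class name and payload are made up) could look like this JMH sketch:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class DuplicateCheckBenchmark {
    // Hypothetical payload; the gist measured small (49-byte) and large (6,440-byte) documents.
    private static final String JSON = "{\"user\": \"kimchy\", \"message\": \"trying out Elasticsearch\"}";

    private final JsonFactory lenientFactory = new JsonFactory();
    private final JsonFactory strictFactory = new JsonFactory();

    public DuplicateCheckBenchmark() {
        strictFactory.enable(JsonParser.Feature.STRICT_DUPLICATE_DETECTION);
    }

    @Benchmark
    public int parseLenient() throws Exception {
        return parse(lenientFactory);
    }

    @Benchmark
    public int parseStrict() throws Exception {
        return parse(strictFactory);
    }

    // Pull-parse the whole document; returning the token count keeps JMH
    // from dead-code-eliminating the work.
    private int parse(JsonFactory factory) throws Exception {
        int tokens = 0;
        try (JsonParser parser = factory.createParser(JSON)) {
            while (parser.nextToken() != null) {
                tokens++;
            }
        }
        return tokens;
    }
}
```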
Thanks @danielmitterdorfer. To me that means we should do it. We can have an undocumented escape hatch if we do not feel confident the overhead will be low in all cases.
@jpountz The relevant code is in a …
That would work for me. Or we could handle it like …
Thanks @danielmitterdorfer. I agree with @jpountz, and a first step would be to see if our tests pass (I'm pretty sure we will have to adapt some of them). Also, the same JSON factory is used for both parsing and generating JSON: if we enable this feature then we'll also see if we generate duplicate keys somewhere, which is cool.
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION' by default. This ensures that JSON keys are always unique. While this has a performance impact, benchmarking has indicated that the typical drop in indexing throughput is around 1 - 2%. As a last resort, we allow users to still disable strict duplicate checks by setting `-Des.json.strict_duplicate_detection=false` which is intentionally undocumented. Closes #19614
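A minimal sketch of how such a property-gated default might be wired (class and method names are hypothetical, not Elasticsearch's actual code; the property name and its default come from the commit message above):

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;

public final class JsonFactoryHolder {

    // Strict checking is on unless -Des.json.strict_duplicate_detection=false is set.
    static boolean isStrictDuplicateDetectionEnabled() {
        return Boolean.parseBoolean(
            System.getProperty("es.json.strict_duplicate_detection", "true"));
    }

    public static JsonFactory newJsonFactory() {
        JsonFactory factory = new JsonFactory();
        factory.configure(JsonParser.Feature.STRICT_DUPLICATE_DETECTION,
            isStrictDuplicateDetectionEnabled());
        return factory;
    }

    private JsonFactoryHolder() {}
}
```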
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION' by default for all XContent types (not only JSON). We have also changed the name of the system property to disable this feature from `es.json.strict_duplicate_detection` to the now more appropriate name `es.xcontent.strict_duplicate_detection`. Relates #19614 Relates #22073
When a key is present in a JSON object multiple times, it doesn't raise a parse error and only the last value is used. This should instead raise `json_parse_exception`.

Elasticsearch version: verified on 2.x, 5.0.0-alpha3
Steps to reproduce:
curl -X PUT localhost:9200/i -d '{"settings": {"number_of_replicas": 2}, "settings": {"number_of_shards": 1}}'
curl -X GET localhost:9200/i
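Note that with lenient parsing the second `settings` object silently replaces the first, so the index above ends up with `number_of_shards: 1` while `number_of_replicas: 2` is dropped without any warning. Once strict duplicate detection is enabled (see the commits above), the same PUT request is rejected with a `json_parse_exception` instead.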