
Elasticsearch accepts invalid json with unpredictable behavior #19614

Closed
honzakral opened this issue Jul 27, 2016 · 11 comments
Labels
>bug :Core/Infra/REST API REST infrastructure and utilities

Comments

@honzakral
Contributor

honzakral commented Jul 27, 2016

When a key is present in a JSON object multiple times, Elasticsearch doesn't raise a parse error and only the last value is used. This should instead raise a json_parse_exception.

Elasticsearch version: verified on 2.x, 5.0.0-alpha3

Steps to reproduce:

  1. curl -X PUT localhost:9200/i -d '{"settings": {"number_of_replicas": 2}, "settings": {"number_of_shards": 1}}'
  2. curl -X GET localhost:9200/i
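The last-value-wins behavior described above is easy to see outside Elasticsearch as well. As an illustrative sketch (not Elasticsearch code), Python's json module treats the duplicated "settings" key from the reproduction the same way:

```python
import json

# The exact body from the reproduction step above: "settings" appears twice.
body = '{"settings": {"number_of_replicas": 2}, "settings": {"number_of_shards": 1}}'

# Like Elasticsearch's default behavior, Python's json module silently
# keeps only the last value for a repeated key, with no warning.
parsed = json.loads(body)
print(parsed)  # {'settings': {'number_of_shards': 1}}
```

So the index ends up with number_of_shards set but the number_of_replicas setting is silently dropped.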
@honzakral honzakral added >bug :Core/Infra/REST API REST infrastructure and utilities labels Jul 27, 2016
@javanna
Member

javanna commented Jul 27, 2016

I could see this becoming a long discussion around whether that is invalid JSON or not and whether we should return a parse exception or some other error. Even if the JSON library we use for parsing allows this, we should improve this on our end rather than being lenient.

This reminds me of #19547 too and is a very common problem with the way we pull-parse JSON. It can easily be solved case by case, but every single parser in our codebase is subject to this, so it would be nice to have a generic solution. I'm not sure if there are alternatives to adding lots of ifs to all our pull parsers; we should evaluate that.

@tlrx
Member

tlrx commented Jul 27, 2016

Even if the JSON library we use for parsing allows this, we should improve this on our end rather than being lenient.

For the record, the option is JsonParser.STRICT_DUPLICATE_DETECTION and has the following warning:

Note that enabling this feature will incur performance overhead 
due to having to store and check additional information: 
this typically adds 20-30% to execution time for basic parsing.
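STRICT_DUPLICATE_DETECTION is a feature of Jackson's Java parser. As a language-neutral sketch of what the check does (the hook name and error message below are illustrative, not Jackson's), the same rejection can be expressed with Python's json module and its object_pairs_hook:

```python
import json

def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises on a repeated key, mimicking the
    effect of Jackson's STRICT_DUPLICATE_DETECTION feature."""
    obj = {}
    for key, value in pairs:
        if key in obj:
            raise ValueError(f"duplicate key: {key!r}")
        obj[key] = value
    return obj

body = '{"settings": {"number_of_replicas": 2}, "settings": {"number_of_shards": 1}}'
try:
    json.loads(body, object_pairs_hook=reject_duplicate_keys)
except ValueError as err:
    print(err)  # duplicate key: 'settings'
```

The warning quoted above applies to Jackson: tracking the set of keys seen so far in each object is exactly the "additional information" that costs parsing time.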

@clintongormley
Contributor

According to the JSON spec, this isn't invalid JSON. The spec doesn't mandate how duplicate keys should be treated. Many languages will simply overwrite older values with newer values, without generating any warning. This is essentially what Elasticsearch does today, and I'm not sure it is worth a 20-30% penalty to prevent this behaviour.

@honzakral
Contributor Author

Yes, strictly speaking (the RFC only says the keys SHOULD be unique), this is valid. I also agree that the performance penalty isn't worth it. It would, however, be nice to document this behavior and perhaps (if it's easy) have an option to turn on strict checking (ideally per request) - it would be useful as a debugging tool and perhaps when running tests.

@martijnvg
Member

Allowing duplicate keys adds a lot of confusion: https://discuss.elastic.co/t/using-the-remove-processor-for-ingest-node/56500

Maybe we should enable strict parsing for certain APIs? (admin-like APIs?)

@jpountz
Contributor

jpountz commented Jul 29, 2016

Discussed in FixitFriday: let's play with the Jackson feature to reject duplicate keys and make sure that it works and has a reasonable performance hit. If it is not satisfactory, then let's look into whether there are things that we can do at a higher level such as ObjectParser.

@danielmitterdorfer
Member

danielmitterdorfer commented Dec 1, 2016

Macrobenchmark Results

We have run our whole macrobenchmark suite with JsonParser.STRICT_DUPLICATE_DETECTION == false (baseline) and JsonParser.STRICT_DUPLICATE_DETECTION == true (STRICT_DUPLICATE_DETECTION).

We see at most a reduction in median indexing throughput of 3% for our macrobenchmark suite (PMC track).

Microbenchmark Results

I also double-checked a few scenarios with a microbenchmark and saw similar results (see https://gist.github.com/danielmitterdorfer/9236796a46f3956447171313a6a0b365):

Below are the results of both configurations showing the average time for one iteration (smaller is better).

JsonParser.Feature.STRICT_DUPLICATE_DETECTION: false:

Benchmark                      Mode  Cnt   Score   Error  Units
JsonParserBenchmark.largeJson  avgt   60  19.414 ± 0.044  us/op
JsonParserBenchmark.smallJson  avgt   60   0.479 ± 0.001  us/op

JsonParser.Feature.STRICT_DUPLICATE_DETECTION: true:

Benchmark                      Mode  Cnt   Score   Error  Units
JsonParserBenchmark.largeJson  avgt   60  20.642 ± 0.064  us/op
JsonParserBenchmark.smallJson  avgt   60   0.487 ± 0.001  us/op

For smaller JSON objects (49 bytes) the overhead of the duplicate check is 8 ns, or 1.6%. For a large JSON object (6440 bytes) the overhead of the duplicate check is in the range of 1.12 us [1] to 1.3 us [2], or 5.8% to 6.7%.

[1] best case duplication check enabled: 20.578 us, worst case duplication check disabled: 19.458 us
[2] worst case duplication check enabled: 20.706 us, best case duplication check disabled: 19.370 us

Please refer to the gist for more details.
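The quoted overhead figures can be sanity-checked directly from the benchmark scores above. A small recomputation sketch, assuming "best" and "worst" cases are the reported score minus/plus its error bound:

```python
# Recompute the overhead figures from the microbenchmark scores above.
small_off, small_on = 0.479, 0.487       # us/op, smallJson
large_off, large_on = 19.414, 20.642     # us/op, largeJson
large_off_err, large_on_err = 0.044, 0.064

# Small-object overhead in nanoseconds.
small_overhead_ns = (small_on - small_off) * 1000
print(round(small_overhead_ns))  # 8

# [1] best case enabled vs. worst case disabled (lower bound).
low = (large_on - large_on_err) - (large_off + large_off_err)
# [2] worst case enabled vs. best case disabled (upper bound).
high = (large_on + large_on_err) - (large_off - large_off_err)
print(round(low, 2), round(high, 2))  # 1.12 1.34
```

The upper bound works out to about 1.34 us (roughly 1.3 us as stated), and the relative overheads follow by dividing each bound by the corresponding baseline score.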

@jpountz
Contributor

jpountz commented Dec 1, 2016

Thanks @danielmitterdorfer. To me that means we should do it. We can have an undocumented escape hatch if we do not feel confident the overhead will be low in all cases.

@danielmitterdorfer
Member

We can have an undocumented escape hatch

@jpountz The relevant code is in a static block so we can't use our settings infrastructure. I guess that means we'd use a system property?

@jpountz
Contributor

jpountz commented Dec 1, 2016

That would work for me. Or we could handle it like INDICES_MAX_CLAUSE_COUNT_SETTING I suppose, which is a node setting that sets the static limit on the number of boolean clauses.

@tlrx
Member

tlrx commented Dec 5, 2016

Thanks @danielmitterdorfer.

I agree with @jpountz, and a first step would be to see if our tests pass (I'm pretty sure we will have to adapt some of them). Also, the same JSON factory is used for both parsing and generating JSON: if we enable this feature then we'll also see if we generate duplicate keys somewhere, which is cool.

danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Dec 9, 2016
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default. This ensures that JSON keys are always unique. While this has
a performance impact, benchmarking has indicated that the typical drop in
indexing throughput is around 1 - 2%.

As a last resort, we allow users to still disable strict duplicate checks
by setting `-Des.json.strict_duplicate_detection=false` which is
intentionally undocumented.

Closes elastic#19614
danielmitterdorfer added a commit that referenced this issue Dec 14, 2016
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default. This ensures that JSON keys are always unique. While this has
a performance impact, benchmarking has indicated that the typical drop in
indexing throughput is around 1 - 2%.

As a last resort, we allow users to still disable strict duplicate checks
by setting `-Des.json.strict_duplicate_detection=false` which is
intentionally undocumented.

Closes #19614
danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Dec 16, 2016
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default for all XContent types (not only JSON).

We have also changed the name of the system property to disable this feature
from `es.json.strict_duplicate_detection` to the now more appropriate name
`es.xcontent.strict_duplicate_detection`.

Relates elastic#19614
Relates elastic#22073
danielmitterdorfer added a commit that referenced this issue Dec 19, 2016
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default for all XContent types (not only JSON).

We have also changed the name of the system property to disable this feature
from `es.json.strict_duplicate_detection` to the now more appropriate name
`es.xcontent.strict_duplicate_detection`.

Relates #19614
Relates #22073
@danielmitterdorfer danielmitterdorfer removed their assignment Jan 10, 2017