
Elasticsearch accepts invalid json with unpredictable behavior #19614

Closed
honzakral opened this issue Jul 27, 2016 · 11 comments
Labels
>bug :Core/Infra/REST API REST infrastructure and utilities

Comments

@honzakral
Contributor

honzakral commented Jul 27, 2016

When a key is present in a JSON object multiple times, Elasticsearch doesn't raise a parse error and only the last value is used. This should instead raise a json_parse_exception.

Elasticsearch version: verified on 2.x, 5.0.0-alpha3

Steps to reproduce:

  1. curl -X PUT localhost:9200/i -d '{"settings": {"number_of_replicas": 2}, "settings": {"number_of_shards": 1}}'
  2. curl -X GET localhost:9200/i
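The last-value-wins behavior described above is easy to see outside Elasticsearch as well. As an illustrative sketch (not Elasticsearch code), Python's json module treats the duplicated "settings" key from the reproduction the same way:

```python
import json

# The exact body from the reproduction step above: "settings" appears twice.
body = '{"settings": {"number_of_replicas": 2}, "settings": {"number_of_shards": 1}}'

# Like Elasticsearch's default behavior, Python's json module silently
# keeps only the last value for a repeated key, with no warning.
parsed = json.loads(body)
print(parsed)  # {'settings': {'number_of_shards': 1}}
```

So the index ends up with number_of_shards set but the number_of_replicas setting is silently dropped.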
@honzakral honzakral added >bug :Core/Infra/REST API REST infrastructure and utilities labels Jul 27, 2016
@javanna
Member

javanna commented Jul 27, 2016

I could see this becoming a long discussion around whether that is invalid JSON or not and whether we should return a parse exception or some other error. Even if the JSON library we use for parsing allows this, we should improve this on our end rather than being lenient.

This reminds me of #19547 too and is a very common problem with the way we pull-parse JSON. It can easily be solved case by case, but every single parser in our codebase is subject to this, so it would be nice to have a generic solution. I'm not sure if there are alternatives to adding lots of ifs to all our pull parsers; we should evaluate that.

@tlrx
Member

tlrx commented Jul 27, 2016

Even if the JSON library we use for parsing allows this, we should improve this on our end rather than being lenient.

For the record, the option is JsonParser.STRICT_DUPLICATE_DETECTION and has the following warning:

Note that enabling this feature will incur performance overhead 
due to having to store and check additional information: 
this typically adds 20-30% to execution time for basic parsing.
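STRICT_DUPLICATE_DETECTION is a feature of Jackson's Java parser. As a language-neutral sketch of what the check does (the hook name and error message below are illustrative, not Jackson's), the same rejection can be expressed with Python's json module and its object_pairs_hook:

```python
import json

def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises on a repeated key, mimicking the
    effect of Jackson's STRICT_DUPLICATE_DETECTION feature."""
    obj = {}
    for key, value in pairs:
        if key in obj:
            raise ValueError(f"duplicate key: {key!r}")
        obj[key] = value
    return obj

body = '{"settings": {"number_of_replicas": 2}, "settings": {"number_of_shards": 1}}'
try:
    json.loads(body, object_pairs_hook=reject_duplicate_keys)
except ValueError as err:
    print(err)  # duplicate key: 'settings'
```

The warning quoted above applies to Jackson: tracking the set of keys seen so far in each object is exactly the "additional information" that costs parsing time.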

@clintongormley
Contributor

According to the JSON spec, this isn't invalid JSON. The spec doesn't mandate how duplicate keys should be treated. Many languages will simply overwrite older values with newer values, without generating any warning. This is essentially what Elasticsearch does today, and I'm not sure it is worth a 20-30% penalty to prevent this behaviour.

@honzakral
Contributor Author

Yes, strictly speaking (the RFC only says the keys SHOULD be unique), this is valid. I also agree that the performance penalty isn't worth it. It would, however, be nice to document this behavior and perhaps (if it's easy) have an option to turn on strict checking (ideally per request) - it would be useful as a debugging tool and perhaps when running tests.

@martijnvg
Member

Allowing duplicate keys adds a lot of confusion: https://discuss.elastic.co/t/using-the-remove-processor-for-ingest-node/56500

Maybe we should enable strict parsing for certain APIs? (admin-like APIs?)

@jpountz
Contributor

jpountz commented Jul 29, 2016

Discussed in FixitFriday: let's play with the Jackson feature to reject duplicate keys and make sure that it works and has a reasonable performance hit. If it is not satisfactory, then let's look into whether there are things that we can do at a higher level such as ObjectParser.

@danielmitterdorfer
Member

danielmitterdorfer commented Dec 1, 2016

Macrobenchmark Results

We have run our whole macrobenchmark suite with JsonParser.STRICT_DUPLICATE_DETECTION == false (baseline) and JsonParser.STRICT_DUPLICATE_DETECTION == true (STRICT_DUPLICATE_DETECTION).

We see at most a reduction in median indexing throughput of 3% for our macrobenchmark suite (PMC track).

Microbenchmark Results

I also double-checked a few scenarios with a microbenchmark and saw similar results (see https://gist.github.com/danielmitterdorfer/9236796a46f3956447171313a6a0b365):

Below are the results of both configurations showing the average time for one iteration (smaller is better).

JsonParser.Feature.STRICT_DUPLICATE_DETECTION: false:

Benchmark                      Mode  Cnt   Score   Error  Units
JsonParserBenchmark.largeJson  avgt   60  19.414 ± 0.044  us/op
JsonParserBenchmark.smallJson  avgt   60   0.479 ± 0.001  us/op

JsonParser.Feature.STRICT_DUPLICATE_DETECTION: true:

Benchmark                      Mode  Cnt   Score   Error  Units
JsonParserBenchmark.largeJson  avgt   60  20.642 ± 0.064  us/op
JsonParserBenchmark.smallJson  avgt   60   0.487 ± 0.001  us/op

For smaller JSON objects (49 bytes) the overhead of the duplicate check is 8 ns, or 1.6%. For a large JSON object (6440 bytes) the overhead of the duplicate check is in the range of 1.12 us [1] to 1.3 us [2], or 5.8% to 6.7%.

[1] best case duplication check enabled: 20.578 us, worst case duplication check disabled: 19.458 us
[2] worst case duplication check enabled: 20.706 us, best case duplication check disabled: 19.370 us

Please refer to the gist for more details.
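The quoted overhead figures can be sanity-checked directly from the benchmark scores above. A small recomputation sketch, assuming "best" and "worst" cases are the reported score minus/plus its error bound:

```python
# Recompute the overhead figures from the microbenchmark scores above.
small_off, small_on = 0.479, 0.487       # us/op, smallJson
large_off, large_on = 19.414, 20.642     # us/op, largeJson
large_off_err, large_on_err = 0.044, 0.064

# Small-object overhead in nanoseconds.
small_overhead_ns = (small_on - small_off) * 1000
print(round(small_overhead_ns))  # 8

# [1] best case enabled vs. worst case disabled (lower bound).
low = (large_on - large_on_err) - (large_off + large_off_err)
# [2] worst case enabled vs. best case disabled (upper bound).
high = (large_on + large_on_err) - (large_off - large_off_err)
print(round(low, 2), round(high, 2))  # 1.12 1.34
```

The upper bound works out to about 1.34 us (roughly 1.3 us as stated), and the relative overheads follow by dividing each bound by the corresponding baseline score.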

@jpountz
Contributor

jpountz commented Dec 1, 2016

Thanks @danielmitterdorfer. To me that means we should do it. We can have an undocumented escape hatch if we do not feel confident the overhead will be low in all cases.

@danielmitterdorfer
Member

We can have an undocumented escape hatch

@jpountz The relevant code is in a static block so we can't use our settings infrastructure. I guess that means we'd use a system property?

@jpountz
Contributor

jpountz commented Dec 1, 2016

That would work for me. Or we could handle it like INDICES_MAX_CLAUSE_COUNT_SETTING I suppose, which is a node setting that sets the static limit on the number of boolean clauses.

@tlrx
Member

tlrx commented Dec 5, 2016

Thanks @danielmitterdorfer.

I agree with @jpountz, and a first step would be to see if our tests pass (I'm pretty sure we will have to adapt some of them). Also, the same JSON factory is used for both parsing and generating JSON: if we enable this feature then we'll also see if we generate duplicate keys somewhere, which is cool.

danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Dec 9, 2016
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default. This ensures that JSON keys are always unique. While this has
a performance impact, benchmarking has indicated that the typical drop in
indexing throughput is around 1 - 2%.

As a last resort, we allow users to still disable strict duplicate checks
by setting `-Des.json.strict_duplicate_detection=false` which is
intentionally undocumented.

Closes elastic#19614
danielmitterdorfer added a commit that referenced this issue Dec 14, 2016
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default. This ensures that JSON keys are always unique. While this has
a performance impact, benchmarking has indicated that the typical drop in
indexing throughput is around 1 - 2%.

As a last resort, we allow users to still disable strict duplicate checks
by setting `-Des.json.strict_duplicate_detection=false` which is
intentionally undocumented.

Closes #19614
danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue Dec 16, 2016
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default for all XContent types (not only JSON).

We have also changed the name of the system property to disable this feature
from `es.json.strict_duplicate_detection` to the now more appropriate name
`es.xcontent.strict_duplicate_detection`.

Relates elastic#19614
Relates elastic#22073
danielmitterdorfer added a commit that referenced this issue Dec 19, 2016
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default for all XContent types (not only JSON).

We have also changed the name of the system property to disable this feature
from `es.json.strict_duplicate_detection` to the now more appropriate name
`es.xcontent.strict_duplicate_detection`.

Relates #19614
Relates #22073
@danielmitterdorfer danielmitterdorfer removed their assignment Jan 10, 2017