
Enable strict duplicate checks for JSON content #22073

Conversation

danielmitterdorfer
Member

@danielmitterdorfer danielmitterdorfer commented Dec 9, 2016

With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default. This ensures that JSON keys are always unique. While this has
a performance impact, benchmarking has indicated that the typical drop in
indexing throughput is around 1 - 2%.

As a last resort, we allow users to still disable strict duplicate checks
by setting `-Des.xcontent.strict_duplicate_detection=false` (originally
`-Des.json.strict_duplicate_detection=false`; see the edit note below), which is
intentionally undocumented.

Edit: In a later PR (#28768) the name of this system property was changed from `es.json.strict_duplicate_detection` (valid for Elasticsearch versions from 6.0.0 up to but not including 6.3.0) to `es.xcontent.strict_duplicate_detection` (valid from Elasticsearch 6.3.0 onwards). Note that this escape hatch only exists for the 6.x series. This PR's description has been updated to reflect that change.

Closes #19614
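
For readers following along, here is a minimal standalone sketch of what the Jackson feature does. This is plain Jackson, not Elasticsearch code; the class name and sample JSON are illustrative.

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParseException;
import com.fasterxml.jackson.core.JsonParser;

public class StrictDuplicateDetectionDemo {
    public static void main(String[] args) throws Exception {
        JsonFactory jsonFactory = new JsonFactory();
        // The feature this PR turns on by default in Elasticsearch.
        jsonFactory.configure(JsonParser.Feature.STRICT_DUPLICATE_DETECTION, true);

        String json = "{\"foo\": 1, \"foo\": 2}";
        try (JsonParser parser = jsonFactory.createParser(json)) {
            while (parser.nextToken() != null) {
                // Consuming the second "foo" field name triggers the failure.
            }
        } catch (JsonParseException e) {
            System.out.println(e.getMessage()); // e.g. "Duplicate field 'foo' ..."
        }
    }
}
```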

@danielmitterdorfer
Member Author

danielmitterdorfer commented Dec 9, 2016

Here's a bit more background on the key decisions:

  1. I opted for a system property instead of a setting. The check should stay enabled, and the system property is meant only as a last resort, so I think this is justified. Implementing it as a system property lets us keep all initialization code as-is in a static initializer and benefit from all of its safety guarantees (e.g. the JSON factory is effectively immutable). See the sketch after this list.
  2. I did not remove any of our additional hand-written duplicate checks, because the user could deactivate Jackson's JSON duplicate checks via the system property. If I had removed these checks, we would not check at all in that case, which could cause bugs that are very hard to find.
  3. Elasticsearch now effectively treats duplicate keys like any other JSON syntax error. This means that in some cases we now throw JsonParseException instead of Elasticsearch-specific exceptions (see tests).
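
For illustration, a sketch of how such a last-resort property can be read in a static initializer. The method name `isStrictDuplicateDetectionEnabled` appears later in this PR's diff, but this body is an assumption, not the verbatim Elasticsearch source; it also assumes only the literals `true` and `false` are accepted, as discussed further down.

```java
// Illustrative sketch, not the verbatim Elasticsearch source.
private static boolean isStrictDuplicateDetectionEnabled() {
    // Default to strict checks; the property is an undocumented escape hatch.
    String value = System.getProperty("es.json.strict_duplicate_detection", "true");
    if ("true".equals(value)) {
        return true;
    }
    if ("false".equals(value)) {
        return false;
    }
    // Fail fast on bogus values instead of silently falling back to a default.
    throw new IllegalArgumentException(
            "es.json.strict_duplicate_detection may only be 'true' or 'false' but was [" + value + "]");
}
```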

Regarding the next steps (after the PR is merged): As this is a breaking change I wonder if it makes sense to backport it to 5.x. I also think that we should remove the system property in a future release (7.0?) if it turns out that we can leave strict duplicate checks always enabled. Then we can also remove our hand-written duplicate checks.

@@ -66,7 +66,7 @@ public void testFieldsParsing() throws Exception {
         assertThat(request.getIndexConstraints()[3].getComparison(), equalTo(LTE));
         assertThat(request.getIndexConstraints()[4].getField(), equalTo("field5"));
         assertThat(request.getIndexConstraints()[4].getValue(), equalTo("2"));
-        assertThat(request.getIndexConstraints()[4].getProperty(), equalTo(MAX));
+        assertThat(request.getIndexConstraints()[4].getProperty(), equalTo(MIN));
Member

For those following along at home, this is required because the example data had a duplicate key and @danielmitterdorfer switched it to min in the file.

Member

@nik9000 nik9000 left a comment

I left some comments - mostly around testing. I think the approach is fine but that we shouldn't change the tests, instead skipping them.

Also, if we do that, we should have a test somewhere in the xcontent tests that verifies the duplicate detection behavior. I didn't see one but I wasn't looking super hard.

         e = e.getCause();
-        assertThat(e, instanceOf(IllegalArgumentException.class));
-        assertEquals("Can't repeat param [vegetable]", e.getMessage());
+        assertThat(e, instanceOf(JsonParseException.class));
Member

So this is one problem with doing this as a read-on-startup property. We can no longer test things like this. This test is fairly worthless unless the property is false and we have to perform the duplicate detection manually.

I think it'd be worth leaving a comment about that.

Member Author

> So this is one problem with doing this as a read-on-startup property.

I agree that testing is not as simple as it could be but my priority was that the production code is dead-simple (i.e. no fancy abilities to change the JSON factory at runtime after initial creation). Given the fact that it's likely that we will enable this check without an ability to opt-out at some point in the future, I felt this was the right trade-off.

@@ -386,8 +387,8 @@ public void testTooManyQueriesInObject() throws IOException {
                 "  }\n" +
                 "  }\n" +
                 "}";
-        ParsingException ex = expectThrows(ParsingException.class, () -> parseQuery(query, ParseFieldMatcher.EMPTY));
-        assertEquals("[match] malformed query, expected [END_OBJECT] but found [FIELD_NAME]", ex.getMessage());
+        JsonParseException ex = expectThrows(JsonParseException.class, () -> parseQuery(query, ParseFieldMatcher.EMPTY));
Member

Same here - this test is fairly out of date now.

     */
    public void testMultipleFilterElements() throws IOException {
        String queryString = "{ \"" + ConstantScoreQueryBuilder.NAME + "\" : {\n" +
                "\"filter\" : { \"term\": { \"foo\": \"a\" } },\n" +
                "\"filter\" : { \"term\": { \"foo\": \"x\" } },\n" +
                "} }";
-        ParsingException e = expectThrows(ParsingException.class, () -> parseQuery(queryString));
-        assertThat(e.getMessage(), containsString("accepts only one 'filter' element"));
+        JsonParseException e = expectThrows(JsonParseException.class, () -> parseQuery(queryString));
Member

Same here.

I wonder if it makes more sense to leave these tests as they are and skip them if the value of the duplicate check property is true. That way, if we really want, we can set it to false and run the tests again. And if they have a compile-time reference to the duplicate check property, then when we remove that property the tests will stop compiling, and we can use that as a reminder to remove the corresponding manual duplicate check.

Member Author

Great idea; I love it. I'll revert the tests to their previous state, add the compile time constant and skip the tests as you've suggested.

     }

-    private static void expectParsingException(String json, Matcher<String> messageMatcher) {
-        ParsingException e = expectThrows(ParsingException.class, () -> parseQuery(json));
+    private static <T extends Throwable> void expectException(String json, Class<T> expectedException, Matcher<String> messageMatcher) {
Member

Same here.
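
For reference, the completed generic helper presumably looks roughly like this. The body is cut off in the hunk above, so this is a sketch that assumes the `expectThrows` utility from the test base class:

```java
private static <T extends Throwable> void expectException(String json, Class<T> expectedException,
                                                          Matcher<String> messageMatcher) {
    // expectThrows returns the caught exception so we can assert on its message.
    T e = expectThrows(expectedException, () -> parseQuery(json));
    assertThat(e.getMessage(), messageMatcher);
}
```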

.endObject()
.endObject()
.endObject()
.startObject("properties")
Member

++

==== Duplicate Keys in JSON

In previous versions of Elasticsearch, JSON documents were allowed to contain duplicate keys. Elasticsearch 6.0.0
will enforce that all keys are unique.
Member

s/will enforce/enforces/

@@ -184,6 +184,9 @@ public static SearchRequest createSearchRequest(PercolateRequest percolateReques
             }
         } else if (token.isValue()) {
             if ("size".equals(currentFieldName)) {
+                if (percolateRequest.onlyCount()) {
Member

Is this a leftover or did you move it or something? It feels like it doesn't belong.

Member

I'm not averse to adding it here, I just want to understand it.

Member Author

With strict duplicate checks enabled we would throw an error because `size` is specified twice (once by the user, once added internally by our implementation of TransportPercolateAction). This check gives the user clear feedback about what is wrong, instead of a (technically correct but) misleading error about a duplicate field `size`.
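
In other words, the hunk presumably continues roughly as follows. This is a hypothetical reconstruction; the exact error message and surrounding parsing code are assumptions, not the verbatim source.

```java
} else if (token.isValue()) {
    if ("size".equals(currentFieldName)) {
        // Fail with a specific message before the internally-added "size"
        // would trip the duplicate-key check with a confusing
        // "Duplicate field 'size'" error.
        if (percolateRequest.onlyCount()) {
            throw new IllegalArgumentException("size cannot be specified if onlyCount is set");
        }
        // ... otherwise parse the user-supplied size as before ...
    }
}
```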


Member

Cool.

@@ -102,7 +100,7 @@ public void testDuplicateParts() throws Exception {
                 "  }," +
                 "  \"body\": null" +
                 "  }" +
-                "}", "Found duplicate part [index]");
+                "}", JsonParseException.class, "Duplicate field 'index'");
Member

Same here with regards to the test.

@danielmitterdorfer
Member Author

@nik9000 Thanks for your review. I've pushed a few more commits that should address all your comments.

Member

@nik9000 nik9000 left a comment

I left one comment about using assumeTrue and assumeFalse instead of the skipping you are doing now. If you are ok with making that change, then I'm fine with you merging after you make it. If you don't want to make the change, we should discuss one more time just so I understand why it doesn't work here.

@@ -184,6 +184,9 @@ public static SearchRequest createSearchRequest(PercolateRequest percolateReques
             }
         } else if (token.isValue()) {
             if ("size".equals(currentFieldName)) {
+                if (percolateRequest.onlyCount()) {

static {
jsonFactory = new JsonFactory();
jsonFactory.configure(JsonGenerator.Feature.QUOTE_FIELD_NAMES, true);
jsonFactory.configure(JsonParser.Feature.ALLOW_COMMENTS, true);
jsonFactory.configure(JsonFactory.Feature.FAIL_ON_SYMBOL_HASH_OVERFLOW, false); // this trips on many mappings now...
// Do not automatically close unclosed objects/arrays in com.fasterxml.jackson.core.json.UTF8JsonGenerator#close() method
jsonFactory.configure(JsonGenerator.Feature.AUTO_CLOSE_JSON_CONTENT, false);
try {
jsonFactory.configure(JsonParser.Feature.STRICT_DUPLICATE_DETECTION, isStrictDuplicateDetectionEnabled());
} catch (Exception ex) {
Member

Is this exception still a problem?

Member Author

Yes it is. A user could still pass `-Des.json.strict_duplicate_detection=complete_bogus` and provoke an exception when we try to convert the system property to a boolean. I don't want to be lenient there; I'd rather terminate early in that case.

Member

If they specify a non-boolean I think they'll get false.

Member

I agree that we should only support true and false though.
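
For context, the lenient JDK conversion being alluded to never fails; it maps any unrecognized value to false:

```java
// Boolean.parseBoolean() never throws: anything other than "true"
// (case-insensitive) silently becomes false.
boolean a = Boolean.parseBoolean("true");           // true
boolean b = Boolean.parseBoolean("false");          // false
boolean c = Boolean.parseBoolean("complete_bogus"); // false, no exception
```

That is why the hand-rolled validation discussed above refuses anything but the two literals.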

@@ -48,6 +49,10 @@ public void testSimpleJsonSettings() throws Exception {
}

public void testDuplicateKeysThrowsException() {
if (JsonXContent.isStrictDuplicateDetectionEnabled()) {
Member

What about `assumeTrue("Test only makes sense if json parser doesn't have strict duplicate checks enabled and can be removed if we force it to be enabled", JsonXContent.isStrictDuplicateDetectionEnabled());`?

Member Author

Oh, I didn't know about assumeTrue / assumeFalse and had only noticed the approach that I used here elsewhere in the code base. Btw, I think this should be assumeFalse. But this makes sense and I'll change the tests as you've suggested. Thanks for the pointer.

Member

Cool. I think I did have it backwards.
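
A sketch of what the agreed-upon change might look like in one of the affected tests, assuming JUnit's `Assume.assumeFalse`; the message string is illustrative:

```java
import static org.junit.Assume.assumeFalse;

public void testDuplicateKeysThrowsException() throws Exception {
    // Skip when Jackson's strict detection is on: the hand-written duplicate
    // check under test would never be reached. The compile-time reference to
    // isStrictDuplicateDetectionEnabled() also means this test stops compiling
    // if the escape hatch is ever removed, reminding us to delete the test.
    assumeFalse("Test only makes sense if JSON parser doesn't have strict duplicate checks enabled",
            JsonXContent.isStrictDuplicateDetectionEnabled());
    // ... original assertions on the Elasticsearch-specific exception ...
}
```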

@@ -39,4 +40,15 @@ public void testBigInteger() throws Exception {
JsonGenerator generator = new JsonFactory().createGenerator(os);
doTestBigInteger(generator, os);
}

public void testChecksForDuplicates() throws Exception {
if (JsonXContent.isStrictDuplicateDetectionEnabled() == false) {
Member

assumeFalse here too?

Member Author

Yes, I'll change all affected tests accordingly.

==== Duplicate Keys in JSON

In previous versions of Elasticsearch, JSON documents were allowed to contain duplicate keys. Elasticsearch 6.0.0
enforces that all keys are unique.
Member

It might be worth writing a blog post around how to migrate documents with these duplicate keys. Because they'll blow up if you try to highlight a field or do source filtering. But they shouldn't blow up if you just return the _source without filtering.

Member Author

I'll add it to my backlog and look into it.

@@ -184,6 +184,9 @@ public static SearchRequest createSearchRequest(PercolateRequest percolateReques
             }
         } else if (token.isValue()) {
             if ("size".equals(currentFieldName)) {
+                if (percolateRequest.onlyCount()) {
Member

Cool.


boolean countOnly = randomBoolean();
if (countOnly) {
percolateRequestBuilder.setOnlyCount(countOnly);
} else {
// can only set size if we also keep track of matches (i.e. countOnly == false)
if (randomBoolean()) {
Member

Now these make more sense.

@danielmitterdorfer danielmitterdorfer merged commit 7e50580 into elastic:master Dec 14, 2016
@nik9000
Member

nik9000 commented Dec 14, 2016

@danielmitterdorfer: I think this doesn't change YAML, CBOR, or Smile parsing. Do they need this as well?

danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this pull request Dec 16, 2016
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default for all XContent types (not only JSON).

We have also changed the name of the system property to disable this feature
from `es.json.strict_duplicate_detection` to the now more appropriate name
`es.xcontent.strict_duplicate_detection`.

Relates elastic#19614
Relates elastic#22073
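
As a sketch of what extending this to all XContent types means in Jackson terms: `STRICT_DUPLICATE_DETECTION` is a generic `JsonParser.Feature`, so it can be set on the standard YAML, Smile, and CBOR factories as well. The snippet below uses the stock Jackson dataformat classes for illustration and is not the actual Elasticsearch code.

```java
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
import com.fasterxml.jackson.dataformat.smile.SmileFactory;
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory;

// Every JsonFactory subclass accepts the same parser feature,
// not just the plain JSON factory.
YAMLFactory yamlFactory = new YAMLFactory();
yamlFactory.configure(JsonParser.Feature.STRICT_DUPLICATE_DETECTION, true);

SmileFactory smileFactory = new SmileFactory();
smileFactory.configure(JsonParser.Feature.STRICT_DUPLICATE_DETECTION, true);

CBORFactory cborFactory = new CBORFactory();
cborFactory.configure(JsonParser.Feature.STRICT_DUPLICATE_DETECTION, true);
```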
@danielmitterdorfer
Member Author

@nik9000 You're right. I've opened #22225. Can you please have a look?

danielmitterdorfer added a commit that referenced this pull request Dec 19, 2016
@tlrx tlrx mentioned this pull request Feb 20, 2018
danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this pull request Oct 18, 2018
With this commit we cleanup hand-coded duplicate checks in XContent
parsing. They were necessary previously but since we reconfigured the
underlying parser in elastic#22073 and elastic#22225, these checks are obsolete and
were also ineffective unless an undocumented system property has been
set. As we also remove this escape hatch, we can remove the additional
checks as well.

Closes elastic#22253
danielmitterdorfer added a commit that referenced this pull request Oct 19, 2018
Relates #34588
kcm pushed a commit that referenced this pull request Oct 30, 2018
Labels: >breaking, :Core/Infra/REST API (REST infrastructure and utilities), v6.0.0-alpha1