
Enable strict duplicate checks for JSON content #22073

Conversation

danielmitterdorfer
Member

@danielmitterdorfer danielmitterdorfer commented Dec 9, 2016

With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default. This ensures that JSON keys are always unique. While this has
a performance impact, benchmarking has indicated that the typical drop in
indexing throughput is around 1 - 2%.

As a last resort, we allow users to still disable strict duplicate checks
by setting `-Des.xcontent.strict_duplicate_detection=false` (originally
`-Des.json.strict_duplicate_detection=false`; see the edit note below), which is
intentionally undocumented.

Edit: In a later PR (#28768) the name of this system property was changed from `es.json.strict_duplicate_detection` (valid for Elasticsearch versions from 6.0.0 up to but not including 6.3.0) to `es.xcontent.strict_duplicate_detection` (valid from Elasticsearch 6.3.0 onwards). Note that this escape hatch only exists for the 6.x series. This PR's description has been updated to reflect that change.

Closes #19614
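
For readers following along, here is a minimal standalone sketch of what the Jackson feature does. This is plain Jackson, not Elasticsearch code; the class name and sample JSON are illustrative.

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParseException;
import com.fasterxml.jackson.core.JsonParser;

public class StrictDuplicateDetectionDemo {
    public static void main(String[] args) throws Exception {
        JsonFactory jsonFactory = new JsonFactory();
        // The feature this PR turns on by default in Elasticsearch.
        jsonFactory.configure(JsonParser.Feature.STRICT_DUPLICATE_DETECTION, true);

        String json = "{\"foo\": 1, \"foo\": 2}";
        try (JsonParser parser = jsonFactory.createParser(json)) {
            while (parser.nextToken() != null) {
                // Consuming the second "foo" field name triggers the failure.
            }
        } catch (JsonParseException e) {
            System.out.println(e.getMessage()); // e.g. "Duplicate field 'foo' ..."
        }
    }
}
```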

@danielmitterdorfer
Member Author

danielmitterdorfer commented Dec 9, 2016

Here's a bit more background on the key decisions:

  1. I opted for a system property instead of a setting. The check should stay enabled, and the system property is meant only as a last resort, so I think this is justified. Implementing it as a system property lets us keep all initialization code as-is in a static initializer and benefit from all of its safety guarantees (e.g. the JSON factory is effectively immutable). See the sketch after this list.
  2. I did not remove any of our additional hand-written duplicate checks, because the user could deactivate Jackson's JSON duplicate checks via the system property. If I had removed these checks, we would not check at all in that case, which could cause bugs that are very hard to find.
  3. Elasticsearch now effectively treats duplicate keys like any other JSON syntax error. This means that in some cases we now throw JsonParseException instead of Elasticsearch-specific exceptions (see tests).
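
For illustration, a sketch of how such a last-resort property can be read in a static initializer. The method name `isStrictDuplicateDetectionEnabled` appears later in this PR's diff, but this body is an assumption, not the verbatim Elasticsearch source; it also assumes only the literals `true` and `false` are accepted, as discussed further down.

```java
// Illustrative sketch, not the verbatim Elasticsearch source.
private static boolean isStrictDuplicateDetectionEnabled() {
    // Default to strict checks; the property is an undocumented escape hatch.
    String value = System.getProperty("es.json.strict_duplicate_detection", "true");
    if ("true".equals(value)) {
        return true;
    }
    if ("false".equals(value)) {
        return false;
    }
    // Fail fast on bogus values instead of silently falling back to a default.
    throw new IllegalArgumentException(
            "es.json.strict_duplicate_detection may only be 'true' or 'false' but was [" + value + "]");
}
```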

Regarding the next steps (after the PR is merged): As this is a breaking change I wonder if it makes sense to backport it to 5.x. I also think that we should remove the system property in a future release (7.0?) if it turns out that we can leave strict duplicate checks always enabled. Then we can also remove our hand-written duplicate checks.

@@ -66,7 +66,7 @@ public void testFieldsParsing() throws Exception {
         assertThat(request.getIndexConstraints()[3].getComparison(), equalTo(LTE));
         assertThat(request.getIndexConstraints()[4].getField(), equalTo("field5"));
         assertThat(request.getIndexConstraints()[4].getValue(), equalTo("2"));
-        assertThat(request.getIndexConstraints()[4].getProperty(), equalTo(MAX));
+        assertThat(request.getIndexConstraints()[4].getProperty(), equalTo(MIN));
Member

For those following along at home, this is required because the example data had a duplicate key and @danielmitterdorfer switched it to min in the file.

Member

@nik9000 nik9000 left a comment

I left some comments - mostly around testing. I think the approach is fine but that we shouldn't change the tests, instead skipping them.

Also, if we do that, we should have a test somewhere in the xcontent tests that verifies the duplicate detection behavior. I didn't see one but I wasn't looking super hard.

         e = e.getCause();
-        assertThat(e, instanceOf(IllegalArgumentException.class));
-        assertEquals("Can't repeat param [vegetable]", e.getMessage());
+        assertThat(e, instanceOf(JsonParseException.class));
Member

So this is one problem with doing this as a read-on-startup property. We can no longer test things like this. This test is fairly worthless unless the property is false and we have to perform the duplicate detection manually.

I think it'd be worth leaving a comment about that.

Member Author

> So this is one problem with doing this as a read-on-startup property.

I agree that testing is not as simple as it could be but my priority was that the production code is dead-simple (i.e. no fancy abilities to change the JSON factory at runtime after initial creation). Given the fact that it's likely that we will enable this check without an ability to opt-out at some point in the future, I felt this was the right trade-off.

@@ -386,8 +387,8 @@ public void testTooManyQueriesInObject() throws IOException {
                 "  }\n" +
                 "  }\n" +
                 "}";
-        ParsingException ex = expectThrows(ParsingException.class, () -> parseQuery(query, ParseFieldMatcher.EMPTY));
-        assertEquals("[match] malformed query, expected [END_OBJECT] but found [FIELD_NAME]", ex.getMessage());
+        JsonParseException ex = expectThrows(JsonParseException.class, () -> parseQuery(query, ParseFieldMatcher.EMPTY));
Member

Same here - this test is fairly out of date now.

     */
    public void testMultipleFilterElements() throws IOException {
        String queryString = "{ \"" + ConstantScoreQueryBuilder.NAME + "\" : {\n" +
                "\"filter\" : { \"term\": { \"foo\": \"a\" } },\n" +
                "\"filter\" : { \"term\": { \"foo\": \"x\" } },\n" +
                "} }";
-        ParsingException e = expectThrows(ParsingException.class, () -> parseQuery(queryString));
-        assertThat(e.getMessage(), containsString("accepts only one 'filter' element"));
+        JsonParseException e = expectThrows(JsonParseException.class, () -> parseQuery(queryString));
Member

Same here.

I wonder if it makes more sense to leave these tests as they are and skip them if the value of the duplicate check property is true. That way, if we really want, we can set it to false and run the tests again. And if they have a compile-time reference to the duplicate check property, then when we remove that property the tests will stop compiling, and we can use that as a reminder to remove the corresponding manual duplicate check.

Member Author

Great idea; I love it. I'll revert the tests to their previous state, add the compile time constant and skip the tests as you've suggested.

     }

-    private static void expectParsingException(String json, Matcher<String> messageMatcher) {
-        ParsingException e = expectThrows(ParsingException.class, () -> parseQuery(json));
+    private static <T extends Throwable> void expectException(String json, Class<T> expectedException, Matcher<String> messageMatcher) {
Member

Same here.
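
For reference, the completed generic helper presumably looks roughly like this. The body is cut off in the hunk above, so this is a sketch that assumes the `expectThrows` utility from the test base class:

```java
private static <T extends Throwable> void expectException(String json, Class<T> expectedException,
                                                          Matcher<String> messageMatcher) {
    // expectThrows returns the caught exception so we can assert on its message.
    T e = expectThrows(expectedException, () -> parseQuery(json));
    assertThat(e.getMessage(), messageMatcher);
}
```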

.endObject()
.endObject()
.endObject()
.startObject("properties")
Member

++

==== Duplicate Keys in JSON

In previous versions of Elasticsearch, JSON documents were allowed to contain duplicate keys. Elasticsearch 6.0.0
will enforce that all keys are unique.
Member

s/will enforce/enforces/

@@ -184,6 +184,9 @@ public static SearchRequest createSearchRequest(PercolateRequest percolateReques
             }
         } else if (token.isValue()) {
             if ("size".equals(currentFieldName)) {
+                if (percolateRequest.onlyCount()) {
Member

Is this a leftover or did you move it or something? It feels like it doesn't belong.

Member

I'm not averse to adding it here, I just want to understand it.

Member Author

With strict duplicate checks enabled we would throw an error because `size` is specified twice (once by the user, once added internally by our implementation of TransportPercolateAction). This check gives the user clear feedback about what is wrong, instead of a (technically correct but) misleading error about a duplicate field `size`.
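
In other words, the hunk presumably continues roughly as follows. This is a hypothetical reconstruction; the exact error message and surrounding parsing code are assumptions, not the verbatim source.

```java
} else if (token.isValue()) {
    if ("size".equals(currentFieldName)) {
        // Fail with a specific message before the internally-added "size"
        // would trip the duplicate-key check with a confusing
        // "Duplicate field 'size'" error.
        if (percolateRequest.onlyCount()) {
            throw new IllegalArgumentException("size cannot be specified if onlyCount is set");
        }
        // ... otherwise parse the user-supplied size as before ...
    }
}
```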


Member

Cool.

@@ -102,7 +100,7 @@ public void testDuplicateParts() throws Exception {
                 "  }," +
                 "  \"body\": null" +
                 "  }" +
-                "}", "Found duplicate part [index]");
+                "}", JsonParseException.class, "Duplicate field 'index'");
Member

Same here with regards to the test.

@danielmitterdorfer
Member Author

@nik9000 Thanks for your review. I've pushed a few more commits that should address all your comments.

Member

@nik9000 nik9000 left a comment

I left one comment about using assumeTrue and assumeFalse instead of the skipping you are doing now. If you are ok with making that change, then I'm fine with you merging after you make it. If you don't want to make the change, we should discuss one more time just so I understand why it doesn't work here.

@@ -184,6 +184,9 @@ public static SearchRequest createSearchRequest(PercolateRequest percolateReques
             }
         } else if (token.isValue()) {
             if ("size".equals(currentFieldName)) {
+                if (percolateRequest.onlyCount()) {

static {
jsonFactory = new JsonFactory();
jsonFactory.configure(JsonGenerator.Feature.QUOTE_FIELD_NAMES, true);
jsonFactory.configure(JsonParser.Feature.ALLOW_COMMENTS, true);
jsonFactory.configure(JsonFactory.Feature.FAIL_ON_SYMBOL_HASH_OVERFLOW, false); // this trips on many mappings now...
// Do not automatically close unclosed objects/arrays in com.fasterxml.jackson.core.json.UTF8JsonGenerator#close() method
jsonFactory.configure(JsonGenerator.Feature.AUTO_CLOSE_JSON_CONTENT, false);
try {
jsonFactory.configure(JsonParser.Feature.STRICT_DUPLICATE_DETECTION, isStrictDuplicateDetectionEnabled());
} catch (Exception ex) {
Member

Is this exception still a problem?

Member Author

Yes it is. A user could still pass `-Des.json.strict_duplicate_detection=complete_bogus` and provoke an exception when we try to convert the system property to a boolean. I don't want to be lenient there; I'd rather terminate early in that case.

Member

If they specify a non-boolean I think they'll get false.

Member

I agree that we should only support true and false though.
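
For context, the lenient JDK conversion being alluded to never fails; it maps any unrecognized value to false:

```java
// Boolean.parseBoolean() never throws: anything other than "true"
// (case-insensitive) silently becomes false.
boolean a = Boolean.parseBoolean("true");           // true
boolean b = Boolean.parseBoolean("false");          // false
boolean c = Boolean.parseBoolean("complete_bogus"); // false, no exception
```

That is why the hand-rolled validation discussed above refuses anything but the two literals.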

@@ -48,6 +49,10 @@ public void testSimpleJsonSettings() throws Exception {
}

public void testDuplicateKeysThrowsException() {
if (JsonXContent.isStrictDuplicateDetectionEnabled()) {
Member

What about `assumeTrue("Test only makes sense if json parser doesn't have strict duplicate checks enabled and can be removed if we force it to be enabled", JsonXContent.isStrictDuplicateDetectionEnabled());`?

Member Author

Oh, I didn't know about assumeTrue / assumeFalse and had only noticed the approach that I used here elsewhere in the code base. Btw, I think this should be assumeFalse. But this makes sense and I'll change the tests as you've suggested. Thanks for the pointer.

Member

Cool. I think I did have it backwards.
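
A sketch of what the agreed-upon change might look like in one of the affected tests, assuming JUnit's `Assume.assumeFalse`; the message string is illustrative:

```java
import static org.junit.Assume.assumeFalse;

public void testDuplicateKeysThrowsException() throws Exception {
    // Skip when Jackson's strict detection is on: the hand-written duplicate
    // check under test would never be reached. The compile-time reference to
    // isStrictDuplicateDetectionEnabled() also means this test stops compiling
    // if the escape hatch is ever removed, reminding us to delete the test.
    assumeFalse("Test only makes sense if JSON parser doesn't have strict duplicate checks enabled",
            JsonXContent.isStrictDuplicateDetectionEnabled());
    // ... original assertions on the Elasticsearch-specific exception ...
}
```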

@@ -39,4 +40,15 @@ public void testBigInteger() throws Exception {
JsonGenerator generator = new JsonFactory().createGenerator(os);
doTestBigInteger(generator, os);
}

public void testChecksForDuplicates() throws Exception {
if (JsonXContent.isStrictDuplicateDetectionEnabled() == false) {
Member

assumeFalse here too?

Member Author

Yes, I'll change all affected tests accordingly.

==== Duplicate Keys in JSON

In previous versions of Elasticsearch, JSON documents were allowed to contain duplicate keys. Elasticsearch 6.0.0
enforces that all keys are unique.
Member

It might be worth writing a blog post around how to migrate documents with these duplicate keys. Because they'll blow up if you try to highlight a field or do source filtering. But they shouldn't blow up if you just return the _source without filtering.

Member Author

I'll add it to my backlog and look into it.

@@ -184,6 +184,9 @@ public static SearchRequest createSearchRequest(PercolateRequest percolateReques
             }
         } else if (token.isValue()) {
             if ("size".equals(currentFieldName)) {
+                if (percolateRequest.onlyCount()) {
Member

Cool.


boolean countOnly = randomBoolean();
if (countOnly) {
percolateRequestBuilder.setOnlyCount(countOnly);
} else {
// can only set size if we also keep track of matches (i.e. countOnly == false)
if (randomBoolean()) {
Member

Now these make more sense.

@danielmitterdorfer danielmitterdorfer merged commit 7e50580 into elastic:master Dec 14, 2016
@nik9000
Member

nik9000 commented Dec 14, 2016

@danielmitterdorfer: I think this doesn't change YAML, CBOR, or Smile parsing. Do they need this as well?

danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this pull request Dec 16, 2016
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION'
by default for all XContent types (not only JSON).

We have also changed the name of the system property to disable this feature
from `es.json.strict_duplicate_detection` to the now more appropriate name
`es.xcontent.strict_duplicate_detection`.

Relates elastic#19614
Relates elastic#22073
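
As a sketch of what extending this to all XContent types means in Jackson terms: `STRICT_DUPLICATE_DETECTION` is a generic `JsonParser.Feature`, so it can be set on the standard YAML, Smile, and CBOR factories as well. The snippet below uses the stock Jackson dataformat classes for illustration and is not the actual Elasticsearch code.

```java
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
import com.fasterxml.jackson.dataformat.smile.SmileFactory;
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory;

// Every JsonFactory subclass accepts the same parser feature,
// not just the plain JSON factory.
YAMLFactory yamlFactory = new YAMLFactory();
yamlFactory.configure(JsonParser.Feature.STRICT_DUPLICATE_DETECTION, true);

SmileFactory smileFactory = new SmileFactory();
smileFactory.configure(JsonParser.Feature.STRICT_DUPLICATE_DETECTION, true);

CBORFactory cborFactory = new CBORFactory();
cborFactory.configure(JsonParser.Feature.STRICT_DUPLICATE_DETECTION, true);
```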
@danielmitterdorfer
Member Author

@nik9000 You're right. I've opened #22225. Can you please have a look?

danielmitterdorfer added a commit that referenced this pull request Dec 19, 2016
@tlrx tlrx mentioned this pull request Feb 20, 2018
danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this pull request Oct 18, 2018
With this commit we cleanup hand-coded duplicate checks in XContent
parsing. They were necessary previously but since we reconfigured the
underlying parser in elastic#22073 and elastic#22225, these checks are obsolete and
were also ineffective unless an undocumented system property has been
set. As we also remove this escape hatch, we can remove the additional
checks as well.

Closes elastic#22253
danielmitterdorfer added a commit that referenced this pull request Oct 19, 2018
Relates #34588
kcm pushed a commit that referenced this pull request Oct 30, 2018
Labels: >breaking, :Core/Infra/REST API (REST infrastructure and utilities), v6.0.0-alpha1