Enable strict duplicate checks for JSON content #22073
Conversation
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION' by default. This ensures that JSON keys are always unique. While this has a performance impact, benchmarking has indicated that the typical drop in indexing throughput is around 1 - 2%.

As a last resort, we allow users to still disable strict duplicate checks by setting `-Des.json.strict_duplicate_detection=false`, which is intentionally undocumented.

Closes elastic#19614
Here's a bit more background on the key decisions:
Regarding the next steps (after the PR is merged): As this is a breaking change, I wonder if it makes sense to backport it to 5.x. I also think that we should remove the system property in a future release (7.0?) if it turns out that we can leave strict duplicate checks always enabled. Then we can also remove our hand-written duplicate checks.
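For readers unfamiliar with the Jackson feature, here is a minimal standalone sketch (not Elasticsearch code) of what enabling `STRICT_DUPLICATE_DETECTION` does; the duplicate-key JSON and the printed message are illustrative:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParseException;
import com.fasterxml.jackson.core.JsonParser;

public class StrictDuplicateDemo {
    public static void main(String[] args) throws Exception {
        JsonFactory factory = new JsonFactory();
        // The feature this PR turns on by default:
        factory.configure(JsonParser.Feature.STRICT_DUPLICATE_DETECTION, true);
        try (JsonParser parser = factory.createParser("{\"foo\": 1, \"foo\": 2}")) {
            // Detection happens while streaming through the tokens.
            while (parser.nextToken() != null) {
                // consume all tokens
            }
        } catch (JsonParseException e) {
            System.out.println(e.getMessage()); // e.g. "Duplicate field 'foo' ..."
        }
    }
}
```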
@@ -66,7 +66,7 @@ public void testFieldsParsing() throws Exception {
     assertThat(request.getIndexConstraints()[3].getComparison(), equalTo(LTE));
     assertThat(request.getIndexConstraints()[4].getField(), equalTo("field5"));
     assertThat(request.getIndexConstraints()[4].getValue(), equalTo("2"));
-    assertThat(request.getIndexConstraints()[4].getProperty(), equalTo(MAX));
+    assertThat(request.getIndexConstraints()[4].getProperty(), equalTo(MIN));
For those following along at home, this is required because the example data had a duplicate key and @danielmitterdorfer switched it to `min` in the file.
I left some comments - mostly around testing. I think the approach is fine, but we shouldn't change the tests; instead we should skip them.
Also, if we do that, we should have an xcontent test that checks the duplicate-detection behavior. I didn't see one, but I wasn't looking super hard.
     e = e.getCause();
-    assertThat(e, instanceOf(IllegalArgumentException.class));
-    assertEquals("Can't repeat param [vegetable]", e.getMessage());
+    assertThat(e, instanceOf(JsonParseException.class));
So this is one problem with doing this as a read-on-startup property. We can no longer test things like this. This test is fairly worthless unless the property is `false` and we have to manually perform the duplicate detection.
I think it'd be worth leaving a comment about that.
> So this is one problem with doing this as a read-on-startup property.

I agree that testing is not as simple as it could be, but my priority was that the production code is dead simple (i.e. no fancy ability to change the JSON factory at runtime after initial creation). Given that it's likely we will enable this check without an ability to opt out at some point in the future, I felt this was the right trade-off.
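To make the trade-off concrete, here is a minimal sketch of the read-on-startup pattern being discussed (illustrative, not the actual `JsonXContent` code): the property is read exactly once when the class is initialized, so tests cannot toggle it afterwards. Lenient `Boolean.parseBoolean` is used here for brevity; the stricter parsing variant is discussed further down.

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;

public final class JsonXContentSketch {
    // Read once at class-initialization time; immutable afterwards.
    private static final boolean STRICT_DUPLICATE_DETECTION =
        Boolean.parseBoolean(System.getProperty("es.json.strict_duplicate_detection", "true"));

    private static final JsonFactory FACTORY = new JsonFactory();

    static {
        // The shared factory is configured exactly once, keeping production
        // code dead simple at the cost of runtime (and test-time) flexibility.
        FACTORY.configure(JsonParser.Feature.STRICT_DUPLICATE_DETECTION, STRICT_DUPLICATE_DETECTION);
    }

    public static boolean isStrictDuplicateDetectionEnabled() {
        return STRICT_DUPLICATE_DETECTION;
    }
}
```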
@@ -386,8 +387,8 @@ public void testTooManyQueriesInObject() throws IOException {
         " }\n" +
         " }\n" +
         "}";
-    ParsingException ex = expectThrows(ParsingException.class, () -> parseQuery(query, ParseFieldMatcher.EMPTY));
-    assertEquals("[match] malformed query, expected [END_OBJECT] but found [FIELD_NAME]", ex.getMessage());
+    JsonParseException ex = expectThrows(JsonParseException.class, () -> parseQuery(query, ParseFieldMatcher.EMPTY));
Same here - this test is fairly out of date now.
     */
    public void testMultipleFilterElements() throws IOException {
        String queryString = "{ \"" + ConstantScoreQueryBuilder.NAME + "\" : {\n" +
            "\"filter\" : { \"term\": { \"foo\": \"a\" } },\n" +
            "\"filter\" : { \"term\": { \"foo\": \"x\" } },\n" +
            "} }";
-       ParsingException e = expectThrows(ParsingException.class, () -> parseQuery(queryString));
-       assertThat(e.getMessage(), containsString("accepts only one 'filter' element"));
+       JsonParseException e = expectThrows(JsonParseException.class, () -> parseQuery(queryString));
Same here.
I wonder if it makes more sense to leave these tests as they are and skip them if the value of the duplicate check property is `true`. That way, if we really want, we can set it to `false` and run the tests again. And if they have a compile-time reference to the duplicate check property, then when we remove that property the tests will stop compiling, and we can use that as a reminder to remove that particular manual duplicate check.
Great idea; I love it. I'll revert the tests to their previous state, add the compile-time constant, and skip the tests as you've suggested.
    }

-   private static void expectParsingException(String json, Matcher<String> messageMatcher) {
-       ParsingException e = expectThrows(ParsingException.class, () -> parseQuery(json));
+   private static <T extends Throwable> void expectException(String json, Class<T> expectedException, Matcher<String> messageMatcher) {
Same here.
    .endObject()
    .endObject()
    .endObject()
    .startObject("properties")
++
==== Duplicate Keys in JSON

In previous versions of Elasticsearch, JSON documents were allowed to contain duplicate keys. Elasticsearch 6.0.0
will enforce that all keys are unique.
s/will enforce/enforces/
@@ -184,6 +184,9 @@ public static SearchRequest createSearchRequest(PercolateRequest percolateRequest
        }
    } else if (token.isValue()) {
        if ("size".equals(currentFieldName)) {
+           if (percolateRequest.onlyCount()) {
Is this a leftover or did you move it or something? It feels like it doesn't belong.
I'm not averse to adding it here, I just want to understand it.
With strict duplicate checks enabled we would throw an error because `size` is specified twice (once by the user, once added internally by our implementation of `TransportPercolateAction`). This error message gives the user clear feedback about what is wrong, instead of a (technically correct but) misleading error about a duplicate field `size`.
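In other words, the added line rejects an explicit `size` up front when `onlyCount` is set. A hedged sketch of that guard in isolation (the helper name and exception message are illustrative; only the field semantics come from the quoted hunk):

```java
// Illustrative guard, extracted from context: with strict duplicate detection,
// TransportPercolateAction adding "size" internally on top of a user-supplied
// "size" would otherwise surface as an opaque "Duplicate field 'size'" error.
static void ensureSizeAllowed(boolean onlyCount) {
    if (onlyCount) {
        throw new IllegalArgumentException("size is not allowed for count-only percolate requests");
    }
}
```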
Cool.
@@ -102,7 +100,7 @@ public void testDuplicateParts() throws Exception {
        " }," +
        " \"body\": null" +
        " }" +
-       "}", "Found duplicate part [index]");
+       "}", JsonParseException.class, "Duplicate field 'index'");
Same here with regards to the test.
@nik9000 Thanks for your review. I've pushed a few more commits that should address all your comments.
I left one comment about using `assumeTrue` and `assumeFalse` instead of the skipping you are doing now. If you are ok with making that change then I'm fine with you merging after you make it. If you don't want to make the change we should discuss one more time just so I understand why it doesn't work here.
static {
    jsonFactory = new JsonFactory();
    jsonFactory.configure(JsonGenerator.Feature.QUOTE_FIELD_NAMES, true);
    jsonFactory.configure(JsonParser.Feature.ALLOW_COMMENTS, true);
    jsonFactory.configure(JsonFactory.Feature.FAIL_ON_SYMBOL_HASH_OVERFLOW, false); // this trips on many mappings now...
    // Do not automatically close unclosed objects/arrays in com.fasterxml.jackson.core.json.UTF8JsonGenerator#close() method
    jsonFactory.configure(JsonGenerator.Feature.AUTO_CLOSE_JSON_CONTENT, false);
    try {
        jsonFactory.configure(JsonParser.Feature.STRICT_DUPLICATE_DETECTION, isStrictDuplicateDetectionEnabled());
    } catch (Exception ex) {
Is this exception still a problem?
Yes it is. A user could still pass `-Des.json.strict_duplicate_detection=complete_bogus` and provoke an exception when we try to convert the system property to a boolean. I don't want to be lenient here; we should rather terminate early.
If they specify a non-boolean I think they'll get `false`.
I agree that we should only support `true` and `false` though.
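A sketch of the strict parsing the thread converges on, accepting only the literal strings `true` and `false` and failing fast otherwise (illustrative; the actual implementation may differ):

```java
private static boolean isStrictDuplicateDetectionEnabled() {
    String value = System.getProperty("es.json.strict_duplicate_detection", "true");
    switch (value) {
        case "true":
            return true;
        case "false":
            return false;
        default:
            // Boolean.parseBoolean() would silently turn "complete_bogus" into
            // false; throwing instead surfaces the misconfiguration at startup.
            throw new IllegalArgumentException("es.json.strict_duplicate_detection may only be "
                + "set to [true] or [false] but was [" + value + "]");
    }
}
```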
@@ -48,6 +49,10 @@ public void testSimpleJsonSettings() throws Exception {
    }

+   public void testDuplicateKeysThrowsException() {
+       if (JsonXContent.isStrictDuplicateDetectionEnabled()) {
What about `assumeTrue("Test only makes sense if json parser doesn't have strict duplicate checks enabled and can be removed if we force it to be enabled", JsonXContent.isStrictDuplicateDetectionEnabled());`?
Oh, I didn't know about `assumeTrue` / `assumeFalse` and have only noticed the approach that I've used in the code base. Btw, I think this should be `assumeFalse`. But this makes sense and I'll change the tests as you've suggested. Thanks for the pointer.
Cool. I think I did have it backwards.
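Putting the agreed approach together, a sketch of the two flavors of skipping (JUnit 4's `Assume`; the second test name is hypothetical and both bodies are elided):

```java
import static org.junit.Assume.assumeFalse;
import static org.junit.Assume.assumeTrue;

public void testDuplicateKeysThrowsException() throws Exception {
    // Only meaningful when the Jackson-level check is active.
    assumeTrue("requires strict duplicate detection to be enabled",
            JsonXContent.isStrictDuplicateDetectionEnabled());
    // ... assert that parsing {"foo": 1, "foo": 2} throws JsonParseException ...
}

public void testHandWrittenDuplicateCheck() throws Exception {
    // The hand-written checks are only reachable when the Jackson check is off;
    // the compile-time reference to the flag doubles as a removal reminder.
    assumeFalse("requires strict duplicate detection to be disabled",
            JsonXContent.isStrictDuplicateDetectionEnabled());
    // ... assert that the manual duplicate check reports the duplicate ...
}
```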
@@ -39,4 +40,15 @@ public void testBigInteger() throws Exception {
        JsonGenerator generator = new JsonFactory().createGenerator(os);
        doTestBigInteger(generator, os);
    }

+   public void testChecksForDuplicates() throws Exception {
+       if (JsonXContent.isStrictDuplicateDetectionEnabled() == false) {
`assumeFalse` here too?
Yes, I'll change all affected tests accordingly.
==== Duplicate Keys in JSON

In previous versions of Elasticsearch, JSON documents were allowed to contain duplicate keys. Elasticsearch 6.0.0
enforces that all keys are unique.
It might be worth writing a blog post around how to migrate documents with these duplicate keys. Because they'll blow up if you try to highlight a field or do source filtering. But they shouldn't blow up if you just return the `_source` without filtering.
I'll add it to my backlog and look into it.
@@ -184,6 +184,9 @@ public static SearchRequest createSearchRequest(PercolateRequest percolateRequest
        }
    } else if (token.isValue()) {
        if ("size".equals(currentFieldName)) {
+           if (percolateRequest.onlyCount()) {
Cool.
boolean countOnly = randomBoolean();
if (countOnly) {
    percolateRequestBuilder.setOnlyCount(countOnly);
} else {
    // can only set size if we also keep track of matches (i.e. countOnly == false)
    if (randomBoolean()) {
Now these make more sense.
@danielmitterdorfer: I think this doesn't change yaml, cbor, or smile parsing. Do they need this as well?
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION' by default for all XContent types (not only JSON). We have also changed the name of the system property to disable this feature from `es.json.strict_duplicate_detection` to the now more appropriate name `es.xcontent.strict_duplicate_detection`. Relates #19614 Relates #22073
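For context on what "all XContent types" entails: Jackson's binary and YAML factories all extend `JsonFactory`, so the same parser feature applies across formats. A hedged sketch under that assumption (requires the `jackson-dataformat-smile`/`-yaml`/`-cbor` artifacts; not the actual Elasticsearch code):

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
import com.fasterxml.jackson.dataformat.smile.SmileFactory;
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory;

class StrictXContentFactories {
    static final boolean STRICT = Boolean.parseBoolean(
        System.getProperty("es.xcontent.strict_duplicate_detection", "true"));

    static final JsonFactory JSON = new JsonFactory();
    static final SmileFactory SMILE = new SmileFactory();
    static final YAMLFactory YAML = new YAMLFactory();
    static final CBORFactory CBOR = new CBORFactory();

    static {
        // SmileFactory, YAMLFactory and CBORFactory all subclass JsonFactory,
        // so one feature flag covers every content type uniformly.
        for (JsonFactory f : new JsonFactory[] {JSON, SMILE, YAML, CBOR}) {
            f.configure(JsonParser.Feature.STRICT_DUPLICATE_DETECTION, STRICT);
        }
    }
}
```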
With this commit we cleanup hand-coded duplicate checks in XContent parsing. They were necessary previously but since we reconfigured the underlying parser in #22073 and #22225, these checks are obsolete and were also ineffective unless an undocumented system property has been set. As we also remove this escape hatch, we can remove the additional checks as well. Closes #22253 Relates #34588
With this commit we enable the Jackson feature 'STRICT_DUPLICATE_DETECTION' by default. This ensures that JSON keys are always unique. While this has a performance impact, benchmarking has indicated that the typical drop in indexing throughput is around 1 - 2%.

As a last resort, we allow users to still disable strict duplicate checks by setting `-Des.xcontent.strict_duplicate_detection=false` (originally `-Des.json.strict_duplicate_detection=false`), which is intentionally undocumented.

Edit: In a later PR (#28768) the name of this system property has changed from `es.json.strict_duplicate_detection` (valid for Elasticsearch versions from 6.0.0 up to but not including 6.3.0) to `es.xcontent.strict_duplicate_detection` (valid from Elasticsearch 6.3.0 onwards). Note that this escape hatch only exists for the 6.x series. This PR's description has been updated to reflect that change.

Closes #19614